-
Notifications
You must be signed in to change notification settings - Fork 2.1k
Description
From the way gcoap memos are set up, I suspect that timeouts can under race conditions slip to a new request:
- A response event timeout is set on a memo.
- Very late, a response comes in.
- The timeout fires, and the timeout event gets placed to on the queue.
- The response is processed, and event_timeout_clear gets called. That function's documentation says that it may be a no-op if the event has been queued already. (This is where I found it: I wanted to do the same also on empty ACKs and was astounded by event_timeout_clear not giving clear indication whether the CB was stopped or not in gcoap: Process CON responses #14178)
- After the response was processed by the application code, the memo is set to GCOAP_MEMO_UNUSED.
- Now there is a window of admittedly not much time until the event loop will process the timeout -- a bit more after gcoap: Process CON responses #14178, but a race is a race, and there could also be other events in the queue that delay things.
- In that time, a user thread could send a gcoap message. This either sets up a new timeout event in the old's place, or nulls it. (The violation of documentation happens here already: event_timeout_set demands that the used event must stay valid until the event has been processed.)
- The event gets processed, and _on_resp_timeout gets fired.
- The new request now (immediately after being sent) receives a timeout event, or the event fires with its callback address NULLed.
Steps to reproduce
were not taken yet. I have not observed this, just went through either the code paths or the documentation.
I expect this to be rather rare in practice, and hard to reproduce (even locally, let alone remotely, which is why I'm not giving this the security issue treatment).
Expected results and steps forward
gcoap should be free of such races.
I had briefly hoped that there could be a bipartite scheme of access (we only clear gcoap's timeouts in the message handler, leave the memos as ALMOST_UNUSED, and only set the memos to UNUSED when all timeout events were processed), but it seems that the event queue doesn't work that way.
My preferred design would be to allow event_timeout_clear to say "sorry events have been set in motion" and then cancel whatever would touch that memo. (That would mean that incoming responses to just timing-out requests would be discarded even though they arrived at the host, but it's already a race). That has the downside of being comparatively intrusive (as the events timer inherits its uncommunicative behavior from xtimer).
In the mean time, I will continue #14178 disregarding that race, as however this is resolved can still be applied there.
Still evaluating other possibilities.
Versions
master of May (8a2b089) is where I did the code path checks
[edit: formatting]