Skip to content

Conversation

seeseemelk
Copy link
Contributor

Contribution description

This PR fixes a rare race-condition in xtimer_msg_receive_timeout which can lead to corruption of the timer list and subsequent hard faults.
The race condition is triggered when:

  1. A message is sent to a thread using msg_send.
  2. The xtimer which sents the timeout message expires and executes before xtimer_remove(t) is called. This will cause the message queue for the thread to contain first a real message and second the timeout message. The timer will not be queued anymore, but the timeout message will still be in the queue.
  3. The xtimer_msg_receive_timeout function is called again. This will queue a new xtimer while the timeout message of the previous timer is still in the buffer. _msg_wait will see the old timeout message, think that the current xtimer has already expired and will not remove the timer. When xtimer_msg_receive_timeout returns, the timer will still be queued. However, as it is allocated on the stack it is no longer valid. This causes the timer_list_head to now point to invalid memory. Crashes ensue.

Testing procedure

This bug was found, validated, and fixed using a proprietary application. I have not written a separate example application which exhibits the problem which I could publish.

@benpicco benpicco added Area: sys Area: System Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) labels Apr 22, 2021
@seeseemelk
Copy link
Contributor Author

This issue seems to be related to #13504 .

@benpicco
Copy link
Contributor

I can confirm that with this #13345 (comment) no longer crashes.

@kaspar030
Copy link
Contributor

Description also makes total sense.

@kaspar030 kaspar030 added the CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR label Apr 22, 2021
Copy link
Contributor

@kaspar030 kaspar030 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK.

@MichelRottleuthner do you want to give this a second look?

@kaspar030 kaspar030 added the Process: needs backport Integration Process: The PR is required to be backported to a release or feature branch label Apr 22, 2021
@kaspar030 kaspar030 added this to the Release 2021.04 milestone Apr 22, 2021
@kaspar030
Copy link
Contributor

(ztimer also removes in any case)

@kaspar030 kaspar030 added the Process: needs >1 ACK Integration Process: This PR requires more than one ACK label Apr 22, 2021
@kaspar030
Copy link
Contributor

I now re-read the discussion in #13345 and I'm not sure if there's still a race when the timeout triggers between msg_receive() and xtimer_remove(). IIUC, there'd still be the timer message in the message queue. But as that's copied (and won't cause a wrong pointer into stack), at least a crash is avoided.

So I'd say let's go with this fix already (which was first discussed more than a year ago...).

@MichelRottleuthner
Copy link
Contributor

The unconditional remove makes sense to me.

I now re-read the discussion in #13345 and I'm not sure if there's still a race when the timeout triggers between msg_receive() and xtimer_remove(). IIUC, there'd still be the timer message in the message queue. But as that's copied (and won't cause a wrong pointer into stack), at least a crash is avoided.

Yes, AIUI the unpleasant queuing of the timer message could still happen. But I also agree with your reasoning that a copied message is preferable over a pointer to invalid stack memory...

So I'd say let's go with this fix already (which was first discussed more than a year ago...).

Yep.

@benpicco benpicco added Area: core Area: RIOT kernel. Handle PRs marked with this with care! and removed Area: core Area: RIOT kernel. Handle PRs marked with this with care! labels Apr 22, 2021
@kaspar030 kaspar030 merged commit 2229e95 into RIOT-OS:master Apr 22, 2021
@kaspar030
Copy link
Contributor

Backport provided in #16376

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: sys Area: System CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR Process: needs backport Integration Process: The PR is required to be backported to a release or feature branch Process: needs >1 ACK Integration Process: This PR requires more than one ACK Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sys/xtimer: segmentation fault: in function xtimer_msg_received_timeout
4 participants