-
Notifications
You must be signed in to change notification settings - Fork 2.1k
net_sock_async_event: cancel async event on sock_*_close() #21180
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
net_sock_async_event: cancel async event on sock_*_close() #21180
Conversation
What's missing?
I'm afraid that is inherent to the sock API in RIOT - there is no shared code between different types of sockets. |
I think I have seen this in TCP (only supported by lwIP as of now). There the connection sockets are handed over to the network stack and are reused to accept new connections on, which will zero out the sockets. But since they have been enqueued in a circular list, this brakes the invariant of the list being a circle (e.g. not ending in a NULL pointer), which causes crashes. Thanks for looking into this! |
I updated this, I think now it's covering all the socket implementations I could find. gnrc TCP seems to be the only one not implementing async. I also found a way to (partially) avoid code duplication. |
Please squash! |
15d65d3
to
d0a230e
Compare
c07602e
to
efd8c98
Compare
efd8c98
to
0f25aec
Compare
@benpicco anything else to add here? |
* subsystem from posting socket events until the socket has been | ||
* unregistered, which is only happening when closing the socket and the | ||
* reason we're here in the first place. */ | ||
assume(!queue->waiter || thread_getpid_of(queue->waiter) == thread_getpid()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the SUIT worker thread is triggering this assertion
2025-04-03 17:09:37,454 # suit_worker: started.
2025-04-03 17:09:37,462 # suit_worker: downloading "coap://[fdea:dbee:f::1]/fw/1/m/6ik?hwr=0&fwt=20&fwr=2004003"
2025-04-03 17:09:37,475 # queue 0x12f has a waiter: 0x54d00
2025-04-03 17:09:37,479 # queue 0x12f pid (0) != thread pid (18)
2025-04-03 17:09:37,485 # sys/net/sock/async/event/sock_async_event.c:95 => FAILED ASSERTION.
I added
if (queue->waiter && (thread_getpid_of(queue->waiter) != thread_getpid())) {
if (queue->waiter) {
printf("queue %p has a waiter: %p\n", queue, queue->waiter);
}
if (thread_getpid_of(queue->waiter) != thread_getpid()) {
printf("queue %p pid (%u) != thread pid (%u)\n", queue,
thread_getpid_of(queue->waiter),
thread_getpid());
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like uninitialized memory to me
74 if (queue->waiter && (thread_getpid_of(queue->waiter) != thread_getpid())) {
(gdb) p queue
$1 = (event_queue_t *) 0xaa012f
(gdb) p queue->waiter
$2 = (thread_t *) 0x79
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, there's no way there's a waiter on the queue and its PID is 0, unless the waiter is the main thread. Is this happening in the SUIT worker? Cause at a glance it doesn't use async sock, and the sock_async_ctx_t::queue
is non-NULL only if the socket is async.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turns out sock_udp_create()
leaves most fields of sock_udp_t
uninitialized :/
Fortunately this is an easy fix.
Contribution description
There's nothing preventing an async callback event from any type of socket to be enqueued and called after
sock_*_close()
is called.This PR is just a fix idea. It only fixes it for UDP, but the problem is there for all kind of sockets. The principle should stay the same everywhere, but I don't know how to integrate it best to avoid code duplication.I added a call to
sock_event_close()
for all sockets that implement the async API.Also, this fix is probably racy ifsock_udp_close()
is not called from the same event queue, as the event is canceled after the sock is de-initialized. The sock is, however, at least still in scope.If I understand correctly, the RIOT socket API is not thread-safe. By this assumption, a socket used in async context can only be safely closed from the event thread. This is because there is no way of preventing the networking subsystem from posting a socket event before calling
sock_*_close()
. This makes the matter easy: just callsock_event_close()
(which cancels any socket callback events) at the end ofsock_*_close()
.Testing procedure
Briefly tested this on the application I'm currently developed. This is not a final solution.Only tested this with gnrc's UDP socket implementation.
Issues/PRs references
I'm not sure to what extent this is related to #21093.