Skip to content

net_sock_async_event: cancel async event on sock_*_close() #21180

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Mar 20, 2025

Conversation

derMihai
Copy link
Contributor

@derMihai derMihai commented Jan 31, 2025

Contribution description

There's nothing preventing an async callback event from any type of socket to be enqueued and called after sock_*_close() is called.

This PR is just a fix idea. It only fixes it for UDP, but the problem is there for all kind of sockets. The principle should stay the same everywhere, but I don't know how to integrate it best to avoid code duplication.
I added a call to sock_event_close() for all sockets that implement the async API.

Also, this fix is probably racy if sock_udp_close() is not called from the same event queue, as the event is canceled after the sock is de-initialized. The sock is, however, at least still in scope.

If I understand correctly, the RIOT socket API is not thread-safe. By this assumption, a socket used in async context can only be safely closed from the event thread. This is because there is no way of preventing the networking subsystem from posting a socket event before calling sock_*_close(). This makes the matter easy: just call sock_event_close() (which cancels any socket callback events) at the end of sock_*_close().

Testing procedure

Briefly tested this on the application I'm currently developed. This is not a final solution.
Only tested this with gnrc's UDP socket implementation.

Issues/PRs references

I'm not sure to what extent this is related to #21093.

@github-actions github-actions bot added Area: network Area: Networking Area: sys Area: System labels Jan 31, 2025
@benpicco benpicco requested a review from maribu February 4, 2025 13:52
@benpicco
Copy link
Contributor

benpicco commented Feb 4, 2025

This is not a final solution.

What's missing?

It only fixes it for UDP, but the problem is there for all kind of sockets.

I'm afraid that is inherent to the sock API in RIOT - there is no shared code between different types of sockets.

@maribu
Copy link
Member

maribu commented Feb 4, 2025

There's nothing preventing an async callback event from any type of socket to be enqueued and called after sock_*_close() is called.

I think I have seen this in TCP (only supported by lwIP as of now). There the connection sockets are handed over to the network stack and are reused to accept new connections on, which will zero out the sockets. But since they have been enqueued in a circular list, this brakes the invariant of the list being a circle (e.g. not ending in a NULL pointer), which causes crashes.

Thanks for looking into this!

@github-actions github-actions bot added the Area: pkg Area: External package ports label Feb 11, 2025
@derMihai
Copy link
Contributor Author

This is not a final solution.

What's missing?

It only fixes it for UDP, but the problem is there for all kind of sockets.

I'm afraid that is inherent to the sock API in RIOT - there is no shared code between different types of sockets.

I updated this, I think now it's covering all the socket implementations I could find. gnrc TCP seems to be the only one not implementing async.

I also found a way to (partially) avoid code duplication.

@derMihai derMihai changed the title gnrc/sock/udp: cancel async event on sock_udp_close() net_sock_async_event: cancel async event on sock_*_close() Feb 11, 2025
@benpicco
Copy link
Contributor

Please squash!

@derMihai derMihai force-pushed the mir/sock_udp/cancel_async_mainline branch from 15d65d3 to d0a230e Compare February 12, 2025 14:56
@benpicco benpicco added the CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR label Feb 12, 2025
@riot-ci
Copy link

riot-ci commented Feb 12, 2025

Murdock results

✔️ PASSED

0f25aec sock/async: cancel sock async events when closing the socket

Success Failures Total Runtime
10271 0 10271 10m:10s

Artifacts

@derMihai derMihai force-pushed the mir/sock_udp/cancel_async_mainline branch 2 times, most recently from c07602e to efd8c98 Compare February 12, 2025 15:15
@derMihai derMihai force-pushed the mir/sock_udp/cancel_async_mainline branch from efd8c98 to 0f25aec Compare February 12, 2025 15:42
@derMihai
Copy link
Contributor Author

@benpicco anything else to add here?

@maribu maribu added this pull request to the merge queue Mar 20, 2025
Merged via the queue into RIOT-OS:master with commit 5869c2c Mar 20, 2025
26 checks passed
* subsystem from posting socket events until the socket has been
* unregistered, which is only happening when closing the socket and the
* reason we're here in the first place. */
assume(!queue->waiter || thread_getpid_of(queue->waiter) == thread_getpid());
Copy link
Contributor

@benpicco benpicco Apr 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the SUIT worker thread is triggering this assertion

2025-04-03 17:09:37,454 # suit_worker: started.
2025-04-03 17:09:37,462 # suit_worker: downloading "coap://[fdea:dbee:f::1]/fw/1/m/6ik?hwr=0&fwt=20&fwr=2004003"
2025-04-03 17:09:37,475 # queue 0x12f has a waiter: 0x54d00
2025-04-03 17:09:37,479 # queue 0x12f pid (0) != thread pid (18)
2025-04-03 17:09:37,485 # sys/net/sock/async/event/sock_async_event.c:95 => FAILED ASSERTION.

I added

    if (queue->waiter && (thread_getpid_of(queue->waiter) != thread_getpid())) {

        if (queue->waiter) {
            printf("queue %p has a waiter: %p\n", queue, queue->waiter);
        }

        if (thread_getpid_of(queue->waiter) != thread_getpid()) {
            printf("queue %p pid (%u) != thread pid (%u)\n", queue,
                                                             thread_getpid_of(queue->waiter),
                                                             thread_getpid());
        }
    }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like uninitialized memory to me

74	    if (queue->waiter && (thread_getpid_of(queue->waiter) != thread_getpid())) {
(gdb) p queue
$1 = (event_queue_t *) 0xaa012f
(gdb) p queue->waiter
$2 = (thread_t *) 0x79

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there's no way there's a waiter on the queue and its PID is 0, unless the waiter is the main thread. Is this happening in the SUIT worker? Cause at a glance it doesn't use async sock, and the sock_async_ctx_t::queue is non-NULL only if the socket is async.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out sock_udp_create() leaves most fields of sock_udp_t uninitialized :/

Fortunately this is an easy fix.

@mguetschow mguetschow added this to the Release 2025.04 milestone Apr 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: network Area: Networking Area: pkg Area: External package ports Area: sys Area: System CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants