Skip to content

sys/net/gcoap: Use socket _buf API to recognize truncated requests #16378

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Oct 27, 2021

Conversation

chrysn
Copy link
Member

@chrysn chrysn commented Apr 22, 2021

Contribution description

Gcoap used to behave quite bad with large requests (see #14167); this uses the experimental _buf API to detect overflows and react accordingly.

Testing procedure

  • Have a CoAP server on the host that serves something large, eg. aiocoap's ./server.py.
  • Run the gcoap example.
  • coap get 2a02:b18:c13b:8010::907 5683 /other/block
    • previously have given a timeout, and (when using -c and /other/separate, a very annoyed server because it's server that retries several times in vain);
    • now a part of the response is shown, and a warning about the truncation is shown.
  • From outside, send a large request (TBD: didn't get this to run on gcoap yet): ./aiocoap-client 'coap://[2a02:b18:c13b:8012:3c63:beff:fe85:ca96]/.well-known/core' -m POST --payload ddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd
    • Previously, this would just time out.
    • Now, you get "4.13 Request Entity Too Large" written in friendly, red letters.

To test with lwIP (where sock_udp_recv_buf may actually have scatter-gather behavior, so it's a good test), I liberally copied USEMODULEs from tests/lwip over to examples/gcoap and ripped out anything that looked like gnrc (without any claims of necessity):

+USEMODULE += ipv6_addr
+USEMODULE += lwip_ipv6_autoconfig
+USEMODULE += lwip_netdev
+USEMODULE += lwip_ipv6
+USEMODULE += lwip
+USEMODULE += lwip_udp
+USEMODULE += lwip_tcp
+USEMODULE += sock_async_event
+USEMODULE += sock_ip
+USEMODULE += sock_tcp
+USEMODULE += sock_udp
+USEMODULE += sock_util
+USEMODULE += od
+USEMODULE += sock_aux_local
+USEMODULE += netdev_default

I didn't manage to send requests in for reasons out of the scope of this PR, but getting (coap get 2a02:b18:c13b:8010::907 5683 /other/block) produces the expected truncated result, and tracing through the newly introduced copy loop shows that it is indeed used as a loop and not as an if. When increasing CONFIG_GCOAP_PDU_BUF_SIZE to 1280, we get out of the "buffer too short" domain but into an area where the correctness of the reassembled buffer can be seen. (Squeeze your eyes and you see the pattern of the hex dump is uninterrupted).

(I do see that the blockwise transfer causes trouble of its own, but AFAICT it's unrelated to this issue.)

Issues/PRs references

Closes: #14167

See also #16377 for the worse alternative resolution, and a hint at how often this comes up.

[edit: Testing procedure filled out]

@chrysn chrysn marked this pull request as draft April 22, 2021 21:01
@chrysn chrysn added Area: CoAP Area: Constrained Application Protocol implementations Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR labels Apr 22, 2021
Copy link
Member

@maribu maribu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good to me. Some style suggestions.

@maribu maribu added Reviewed: 1-fundamentals The fundamentals of the PR were reviewed according to the maintainer guidelines Reviewed: 2-code-design The code design of the PR was reviewed according to the maintainer guidelines Reviewed: 4-code-style The adherence to coding conventions by the PR were reviewed according to the maintainer guidelines Reviewed: 5-documentation The documentation details of the PR were reviewed according to the maintainer guidelines labels Apr 23, 2021
@kfessel
Copy link
Contributor

kfessel commented Apr 26, 2021

I did a short test: a large coap put: time.wasm 566 byte

$coap-client -m put -p 5683 -f time.wasm coap://[fe80::xxxx%tapbr0]/WASM
4.13
coap-client -m put -p 5683 -f time.wasm -b 100 coap://[fe80::xxxx%tapbr0]/WASM

(there is nothing written for success unless verbosity 6 the value is 2.04)

from my POV this works as expected, (prior this patch the transfer without blocks just got stuck)

👍

@chrysn chrysn removed the CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR label Apr 28, 2021
@chrysn chrysn force-pushed the gcoap-handle-truncation branch from 541bda9 to 3fcfa27 Compare April 28, 2021 11:32
@chrysn
Copy link
Member Author

chrysn commented Apr 28, 2021

Style issues addressed (most did indeed fit the 100 char limit).

The TBD Size1 I'm leaving out right now as there are open questions as to what should be indicated there; not sending one is violating a SHOULD of 7252 but all still better than not responding at all, and IMO also better than sending the wrong value.

Are you happy enough with the state of things that one of you can ACK this? [edit: provided a test was done on non-GNRC, see below]

@chrysn
Copy link
Member Author

chrysn commented Apr 28, 2021

@kfessel, I'm not quite sure reading your comment what you tested. Did you find an example where gcoap-on-lwIP or similar can be run, or just tested the default GNRC gcoap case?

@kfessel
Copy link
Contributor

kfessel commented Apr 28, 2021

i tested my app (the one i had the issue with that lead me to talk about that in matrix) i tested with native
-> i only tested gnrc

the app takes wasm bytecode with is often more than 100 byte and therefor has that problem but works well with block transfers

this PR is a solution for my case (error response enables me to react and use block transfers no error just silence is bad)

@kfessel
Copy link
Contributor

kfessel commented Apr 28, 2021

The TBD Size1 I'm leaving out right now as there are open questions as to what should be indicated there; not sending one is violating a SHOULD of 7252 but all still better than not responding at all, and IMO also better than sending the wrong value.

couldn't we just send the limit by buffersize and if the app requires a smaller size it creates its own 4.13 error

@chrysn
Copy link
Member Author

chrysn commented Apr 28, 2021

What I think 4.13 with Size1 should is to report the payload size the client could use with this request. That means that the message needs to be parsed to the end of options, and then it needs to be determined how much space is left. Doing this is relatively complex, and may need additional error reporting to reliably see if the message was maybe even truncated in the options (in which case ... I don't know if anything sensible can be reported at all). But I'm not even sure that is the right interpretation, whereas not sending a Size1 is bad but at least debuggable.

@chrysn
Copy link
Member Author

chrysn commented Apr 29, 2021

There are now proper testing descriptions, and I ran them.

A few things worth mentioning happened there, and are expressed in follow-up commits:

  • If the message is truncated before the options end, things will still fail. This is because we run the message through the nanocoap coap_parse, and can't (easily and compatibly, at least) tell whether parsing was "successful enough" that we at least get the header populated. So for the time being, requests truncated already in the options will behave as before. This is now documented in the code, but honestly it's not where I expect the trouble to come from.
  • The gcoap client had a nasty behavior where rather than checking whether memo->state == GCOAP_MEMO_RESP, it checked for all its known error conditions. I'd say this is OK and call it bad judgement on the application developer side (error checking in enums should always be "accept what's known", not "reject specific errors"), but others may call this an incompatible change.
  • Given that we know at response handling time that only the payload was trimmed, the example response handler could be smart and not even complain but just start blockwising (even if the server didn't announce it). I prefer to leave that to later, as it may need meddling with more block-wise code, and isn't essential for getting a gcoap version that doesn't spontaneously break on large messages.

@chrysn chrysn marked this pull request as ready for review April 29, 2021 08:18
@miri64
Copy link
Member

miri64 commented May 3, 2021

@kfessel, I'm not quite sure reading your comment what you tested. Did you find an example where gcoap-on-lwIP or similar can be run, or just tested the default GNRC gcoap case?

Sidenote: There is a branch by @kb2ma somewhere, that should make that possible.

@benpicco
Copy link
Contributor

Thank you I just stumbled upon this as well.
I wrongly assumed the buffer passed to gcoap_req_init() was also used to store the response and wondered why I did not receive anything even though I used a large enough buffer here.

Why can't we use a user supplied buffer here instead of the local _listen_buf?

@kb2ma
Copy link
Member

kb2ma commented May 15, 2021

I saw a message in my email from this PR. It's been a while but my recollection is below.

Sidenote: There is a branch by @kb2ma somewhere, that should make that possible.

See the kb2ma/RIOT/lwip* branches.

Why can't we use a user supplied buffer here instead of the local _listen_buf?

IIRC this is the basic design of gcoap, as a messaging hub running in its own thread and for simplicity. See the wiki page for contrast with nanocoap.

@chrysn
Copy link
Member Author

chrysn commented Aug 24, 2021

I wrongly assumed the buffer passed to gcoap_req_init() was also used to store the response and wondered why I did not receive anything even though I used a large enough buffer here.

Any incoming message needs to go through gcoap's buffers at one point.

Why can't we use a user supplied buffer here instead of the local _listen_buf?

At the time we're reading into that, we don't even know whether what's coming in is a response or a request, let alone which request this belongs to.

@chrysn
Copy link
Member Author

chrysn commented Aug 25, 2021

If a response is received through gcoap it is, as it always has been, copied into gcoap's receive buffer, and freed from the socket. When such a truncated response arrives, the tail is gone for good already.

Sure, we could offer all kinds of "yeah but there's a pointer you can follow up on and need to free" stuff, but the beauty of nanocoap and gcoap is that the user doesn't need to gather data from anywhere, and the beauty of CoAP is that this is not needed in practice.

The point of this is to allow the stack to fail gracefully (allowing the protocol to continue rather than deadlocking it) when overly long messages arrive, which generally means to ask for smaller bites of data. An application will pick its buffer in accordance with the messages it expects, so if someone wants to haul 1KiB messages, they can up their buffer.

@miri64
Copy link
Member

miri64 commented Aug 25, 2021

Mh, so you are basically "misusing" the _buf API for a different than the intended use case, gotcha!

The point of this is to allow the stack to fail gracefully (allowing the protocol to continue rather than deadlocking it) when overly long messages arrive, which generally means to ask for smaller bites of data. An application will pick its buffer in accordance with the messages it expects, so if someone wants to haul 1KiB messages, they can up their buffer.

Mh, but consider this: we only put the header of a packet into the internal buffer and for the rest an application may choose to use:

/* just a sketch! */
static uint8_t payload[SOME_ARRAY_SIZE];
int pos = 0, res;
while ((res = gcoap_get_payload(pkt, &payload[pos], sizeof(payload) - pos) > 0) {
    pos += res;
}

this way, a developer does not need to beat themselves over the head of configuring some module they may or may non know about and can do all the necessary compile-time config in the app. No '"get follow-up or "yeah but there's a pointer you can follow up on and need to free" stuff' required, if we establish that this is the way to get a packet's payload.

@chrysn
Copy link
Member Author

chrysn commented Aug 25, 2021

Yes. I'm only using _buf because without it I we couldn't even see whether truncation happened, and right here this is all I care about.

consider this

Sure, that's a change we can make. But it's something that needs coordination across nanocoap and gcoap (which I start suspecting is part of what's holding RIOT's CoAP ecosystem back). In the end, this can get us rid of the need for a payload-sized buffer for incoming messages at all (though so far our handlers place their responses where the request came in, so...)

I'm happy to work with this as a proposal, but that'll take several changes and iterations of design, while this here is only about getting the CoAP machinery unstuck when large packages arrive, without any need for applications to alter any behavior. (Unless they specifically checked for whether memo->state indicates some particular form of error rather than for success). Let's do both, but please the good before the better.

@miri64
Copy link
Member

miri64 commented Aug 25, 2021

Sure, that's a change we can make. But it's something that needs coordination across nanocoap and gcoap (which I start suspecting is part of what's holding RIOT's CoAP ecosystem back). In the end, this can get us rid of the need for a payload-sized buffer for incoming messages at all (though so far our handlers place their responses where the request came in, so...)

Does it oO? If the coap_pkt_t struct needs to be extended for gcoap's sake, it might make sense to instead provide an extended, gcoap-specific type. But granted, I am not as deep into the gcoap/nanocoap-internals (yet), to provide solid feedback on this. Up until now I understood that nanocoap is only used as a message parser and composer for gcoap, but if there is deeper entanglement, maybe some cleanup is indeed required.

@chrysn
Copy link
Member Author

chrysn commented Aug 25, 2021

nanocoap is used as message parser and composer, but exceeding that, gcoap and nanocoap also share the coap_pkt_t structure as well as the coap_handler_t -- which means that untangling them entails a type change across all user visible CoAP functions.

BTW, Murdock found that I had disregarded the new DTLS code -- that's now "fixed" easily (because if truncation happens to DTLS, tough luck, crypto won't open, and the message can't be acted on sensibly).

@miri64
Copy link
Member

miri64 commented Aug 25, 2021

nanocoap is used as message parser and composer, but exceeding that, gcoap and nanocoap also share the coap_pkt_t structure as well as the coap_handler_t -- which means that untangling them entails a type change across all user visible CoAP functions.

Ok, but so (as far as I can judge) nothing that couldn't be handled with some good old inheritance and type juggling. In the long run, I think it is a good idea to boilerplate the whole nanocoap underbelly away from the user. But for now, I think we can live with just warning the user about truncated payloads.

@chrysn
Copy link
Member Author

chrysn commented Aug 25, 2021

Thanksl. Murdock passes too now. @maribu has done some review already, do these still hold? If so, testing is what is missing. For me with the latest commit, tests on native passed as described above for GNRC; LWIP passed after commenting out sys/net/netif/netif.c line 72 (netif_get_name; that's undefined with this setup both in master and this branch and an unrelated yet-to-be-opened issue).

Please let me know if there are any more issues, and whether I can squash. (I'd keep it as the 3 commits, and just let the fixup autosquash).

@miri64
Copy link
Member

miri64 commented Aug 25, 2021

(netif_get_name; that's undefined with this setup both in master and this branch and an unrelated yet-to-be-opened issue).

See #16741 ;-).

@chrysn
Copy link
Member Author

chrysn commented Sep 2, 2021

May I rebase this?

@miri64
Copy link
Member

miri64 commented Sep 2, 2021

Yes, and squash as well, if you like :-).

@chrysn chrysn force-pushed the gcoap-handle-truncation branch from 6f8baf2 to b9a8652 Compare September 2, 2021 14:52
@chrysn
Copy link
Member Author

chrysn commented Sep 2, 2021

I'm configured to autosquash ;-)

Rebased, and all described tests still work.

@kfessel kfessel self-requested a review September 6, 2021 16:15
@kfessel
Copy link
Contributor

kfessel commented Sep 6, 2021

What I think 4.13 with Size1 should is to report the payload size the client could use with this request. That means that the message needs to be parsed to the end of options, and then it needs to be determined how much space is left. Doing this is relatively complex, and may need additional error reporting to reliably see if the message was maybe even truncated in the options (in which case ... I don't know if anything sensible can be reported at all). But I'm not even sure that is the right interpretation, whereas not sending a Size1 is bad but at least debuggable.

@chrysn
the value is already calculated by the nanocoap coap_parse it is pdu->payload_len of the truncated pkg, if that one is 0 there are too many options

@kfessel
Copy link
Contributor

kfessel commented Sep 6, 2021

I did test my application with this patch, with GNRC and LWIP it works
my application is a gcoap server that take big PUT messages without this patch i get no response, if i do not use blocktransfer, with this patch i get the information -> then I do the transfer again with blocktransfer and it is working.
i will try to do further testing (the example) later

@kfessel
Copy link
Contributor

kfessel commented Oct 15, 2021

@chrysn: Do you want to add SIZE1 reporting with what coap_parse provides in pdu->payload_len? or just have it as it is?

@kfessel
Copy link
Contributor

kfessel commented Oct 25, 2021

I tested:

  • my own server (takes message bigger than the buffer via put and processes them) with GNRC ✔️ and LWIP ✔️ (in both cases i got an appropriate resonse -> resend with small blocks working ✔️ )
  • the gcoap example for the client path with GNRC ✔️ and LWIP ✔️
    (got a warning as described in the PR ✔️ )

@kfessel kfessel added the Reviewed: 3-testing The PR was tested according to the maintainer guidelines label Oct 25, 2021
@chrysn
Copy link
Member Author

chrysn commented Oct 26, 2021

I'd prefer to do any Size1 in a follow-up.

Copy link
Contributor

@kfessel kfessel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

everything is green (review steps 1-5 and ci tests) -> 🚀

@chrysn chrysn merged commit a92cdb5 into RIOT-OS:master Oct 27, 2021
@chrysn chrysn deleted the gcoap-handle-truncation branch October 27, 2021 06:56
@fjmolinas fjmolinas added this to the Release 2022.01 milestone Nov 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: CoAP Area: Constrained Application Protocol implementations Area: examples Area: Example Applications Area: network Area: Networking Area: sys Area: System CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR Reviewed: 1-fundamentals The fundamentals of the PR were reviewed according to the maintainer guidelines Reviewed: 2-code-design The code design of the PR was reviewed according to the maintainer guidelines Reviewed: 3-testing The PR was tested according to the maintainer guidelines Reviewed: 4-code-style The adherence to coding conventions by the PR were reviewed according to the maintainer guidelines Reviewed: 5-documentation The documentation details of the PR were reviewed according to the maintainer guidelines Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Gcoap drops long packages instead of gracefully erring out
8 participants