-
Notifications
You must be signed in to change notification settings - Fork 5.1k
http2: Fix extra frames and memory underflow in MetadataEncoder. #9996
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Previously, if encodeMetadata were called on a connection for which Envoy was already dispatching data, it would generate extra frames, then underflow memory when MetadataEncoder tried to populate their payload. The codec tried to fully generate each METADATA frame before moving onto the next one: first, it submitted a payload-less frame to nghttp2's outbound queue; second, it called sendPendingFrames to pop the frame off of the outbound queue, causing MetadataEncoder to fill in its payload via a callback; and third, it asked the MetadataEncoder whether it had any more frames to fill payloads for, to decide whether to submit the next frame. However, if the connection were in dispatching mode, sendPendingFrames wouldn't have any effect, so the MetadataEncoder would report more frames to fill even once enough frames had been submitted. When the MetadataEncoder tried to pack those frames, it would trigger debug ASSERT failures, or underflows in prod. Fix this by submitting all the METADATA frames to nghttp2 before trying to send any of them. Also harden MetadataEncoder against a few other library failures that were previously only checked with debug ASSERTs. Signed-off-by: Michael Behr <mkbehr@google.com>
Signed-off-by: Michael Behr <mkbehr@google.com>
Awesome! Tracking this down was such great work. Mind checking your format build and pinging back when this is good for review? |
Signed-off-by: Michael Behr <mkbehr@google.com>
Done. I do see a segfault in the coverage CI run from the last version, in BufferIntegrationTest.RouterRequestPopulateContentLengthOnTrailers/IPv4_HttpDownstream_Http2Upstream. I can't replicate it on my machine (possibly because I only have clang working there right now), and I don't see any way to get more info from the CI run. That test doesn't seem to send any metadata, so it shouldn't hit most of the changed code here. The only difference I can see is that if it runs through encodeMetadata despite having no metadata to send, it'll call sendPendingFrames when it otherwise wouldn't. I'd expect that to be safe, but maybe it's triggering some other latent bug. If that's the case, I don't want to paper over the bug by omitting the sendPendingFrames call if there's no metadata. Any ideas on what to do here? |
Maybe the segfault is #9989? |
} | ||
|
||
ASSERT(!metadata_encoder_->hasNextFrame()); | ||
for (uint8_t flags : metadata_encoder.payloadFrameFlags()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
naming nit: shouldn't this be
for (flag : flags) rather than
for (flags : flags)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Each of these is a technically a byte of flags, even though only the one bit is ever set or read. I renamed payloadFrameFlags -> payloadFrameFlagBytes instead.
for (uint8_t flags : metadata_encoder.payloadFrameFlags()) { | ||
submitMetadata(flags); | ||
} | ||
parent_.sendPendingFrames(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it important or a perf optimization that we submit N times and send once, rather than prior submit:send logic?
If so consider commenting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not important, anymore. This order just made more sense to me once I finished mentally untangling the submit operations from the send operations. (And it makes sure the sequence of those operations is always the same regardless of the network, where the bug happened because they were usually but not always interleaved - but the submits don't depend on the sends anymore so that doesn't matter.)
ASSERT(!payload_size_queue_.empty()); | ||
ssize_t MetadataEncoder::packNextFramePayload(uint8_t* buf, const size_t len) { | ||
if (payload_size_queue_.empty()) { | ||
ENVOY_LOG(error, "No payload remaining to pack into a METADATA frame."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we unit test all the (awesome!) new error handling? I'm kicking the coverage run but if you EXPECT_LOG_CONTAINS in your unit tests it'd both regression-proof you cover the error you're aiming for and make it easy for your reviewer to tell :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here are a couple of EXPECT_LOG_CONTAINS tests. I didn't see how to hit them all without somehow mocking out the nghttp2 calls:
- We could fail to deflate the metadata payload if we run out of memory, but I don't see any other way to cause that failure.
- The payload buffer could only be too small if an update to nghttp2 changes the minimum payload length it sends us. (I don't know how safe it is to assume that won't happen.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this error mean that there is some consistency issue between how many METADTA frames we have enqueued into nghttp2 and how many we actually have in the encoder? What are the conditions for this? If this is only possible due to a bug, should this be a RELEASE_ASSERT + fuzzer test instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The consistency issue you mentioned is one possible way this could happen. It could also happen if a bug somewhere in nghttp2 caused it to call this callback more times than we enqueued METADATA frames. The style guide says that bad data from an API call is the kind of thing we should handle gracefully instead of crashing, so I don't think RELEASE_ASSERT is the right tool here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would at least leave a comment explaining why this check is here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general I'm uncomfortable with this type of error logging without either a stat or just a RELEASE_ASSERT or normal ASSERT. Sorry just to clarify again, why do we think that RELEASE_ASSERT is not OK? Isn't this a bug in nghttp2? If we decide we want to handle this I think we need a stat vs. a log that could potentially spam?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Treating nghttp2 as an API, my reading of the style guide was that we should handle gracefully instead of crashing (see above) - but since you also think a RELASE_ASSERT is better, I've just changed it back to that.
(Would have liked to hear this from you a bit earlier in the process, though - this review has been taking a while!)
* and the flags to be set in each frame. This counts only frames that the encoder has not already | ||
* packed; to get the full sequence of frames corresponding to the metadata map vector, call this | ||
* before submitting any frames to nghttp2. | ||
* @return The number of frames in each metadata payload, in sequence. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"The number of frames in each metadata payload"
maybe
"A vector indicating the number of frames in each payload?"
If it were the number of frames I'd expect
[2, 3]
instead of
[0 0 [end] 0 0 0 [end]]
which I think is what we're getting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, that line was for an old version of this function.
Your case there would actually be [0 [end] 0 0 [end]] - is that clear enough from this version?
// Encode response metadata while dispatching request data from the client, so | ||
// that nghttp2 can't fill the metadata frames' payloads until dispatching | ||
// is finished. | ||
TEST_P(Http2CodecImplTest, EncodeMetadataWhileDispatchingTest) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we keep the old test and add the new test, or is this strictly speaking more realistic?
If the order of operations wasn't realistic in the prior unit tests I'm inclined to suggest we add an integration test as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both orders are realistic, depending on (I think) what kind of network event happens to be driving us.
Signed-off-by: Michael Behr <mkbehr@google.com>
…-fix Signed-off-by: Michael Behr <mkbehr@google.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code changes LGTM.
Let's sort out the testing as discussed offline.
Signed-off-by: Michael Behr <mkbehr@google.com>
…-fix Signed-off-by: Michael Behr <mkbehr@google.com>
This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions! |
Ping to keep this un-stale! I think Alyssa is away right now and can get back to this next week. |
ASSERT(!payload_size_queue_.empty()); | ||
ssize_t MetadataEncoder::packNextFramePayload(uint8_t* buf, const size_t len) { | ||
if (payload_size_queue_.empty()) { | ||
ENVOY_LOG(error, "No payload remaining to pack into a METADATA frame."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this error mean that there is some consistency issue between how many METADTA frames we have enqueued into nghttp2 and how many we actually have in the encoder? What are the conditions for this? If this is only possible due to a bug, should this be a RELEASE_ASSERT + fuzzer test instead?
const uint64_t current_payload_size = | ||
std::min(METADATA_MAX_PAYLOAD_SIZE, payload_size_queue_.front()); | ||
|
||
// nghttp2 guarantees len is at least 16KiB. If the check fails, please verify | ||
// NGHTTP2_MAX_PAYLOADLEN is consistent with METADATA_MAX_PAYLOAD_SIZE. | ||
ASSERT(len >= current_payload_size); | ||
if (len < current_payload_size) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be a static_assert somewhere that verifies that macro values are consistent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can't, for two reasons: NGHTTP2_MAX_PAYLOADLEN isn't in the nghttp2 public API, and the static_assert wouldn't guarantee that nghttp2 respected that macro.
Hi Yan, can you have a look? |
…-fix Signed-off-by: Michael Behr <mkbehr@google.com>
Signed-off-by: Michael Behr <mkbehr@google.com>
Signed-off-by: Michael Behr <mkbehr@google.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I'm going to add Yang to this one as she authored the metadata code and might catch something I missed. If she doesn't weigh in I'll merge tomorrow :-)
er, please sort out CI and merge so I can actually merge tomorrow :-P |
…-fix Signed-off-by: Michael Behr <mkbehr@google.com>
Okay, hopefully this fixes the CI. I'm seeing a few failing tests locally, but they look like they're all unrelated flakes. |
Thanks for the awesome find and fix! Submitting metadata frames all together and call sendPendingFrames() once at the end is a great idea! LGTM. Could you please add bug # to the description if there is a bug? Thanks! |
I didn't open a bug for this, just the fix. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One question from me. Thank you!
/wait-any
ASSERT(!payload_size_queue_.empty()); | ||
ssize_t MetadataEncoder::packNextFramePayload(uint8_t* buf, const size_t len) { | ||
if (payload_size_queue_.empty()) { | ||
ENVOY_LOG(error, "No payload remaining to pack into a METADATA frame."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general I'm uncomfortable with this type of error logging without either a stat or just a RELEASE_ASSERT or normal ASSERT. Sorry just to clarify again, why do we think that RELEASE_ASSERT is not OK? Isn't this a bug in nghttp2? If we decide we want to handle this I think we need a stat vs. a log that could potentially spam?
Signed-off-by: Michael Behr <mkbehr@google.com>
…-fix Signed-off-by: Michael Behr <mkbehr@google.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks and sorry for the late review!
Description:
Previously, if encodeMetadata were called on a connection for which Envoy was
already dispatching data, it would generate extra frames, then underflow memory
when MetadataEncoder tried to populate their payload.
The codec tried to fully generate each METADATA frame before moving onto the
next one: first, it submitted a payload-less frame to nghttp2's outbound queue;
second, it called sendPendingFrames to pop the frame off of the outbound queue,
causing MetadataEncoder to fill in its payload via a callback; and third, it
asked the MetadataEncoder whether it had any more frames to fill payloads for,
to decide whether to submit the next frame. However, if the connection were in
dispatching mode, sendPendingFrames wouldn't have any effect, so the
MetadataEncoder would report more frames to fill even once enough frames had
been submitted. When the MetadataEncoder tried to pack those frames, it would
trigger debug ASSERT failures, or underflows in prod.
Fix this by submitting all the METADATA frames to nghttp2 before trying to send
any of them.
Also harden MetadataEncoder against a few other library failures that were
previously only checked with debug ASSERTs.
Risk Level: Medium (changes the control flow in and out of nghttp2 callbacks when METADATA frames are sent)
Testing: bazel test //test/... (two failing tests under //test/extensions/tracers/dynamic_ot seem unrelated)
Docs Changes: n/a
Release Notes: Added.
Signed-off-by: Michael Behr mkbehr@google.com