Skip to content

Conversation

euroelessar
Copy link
Contributor

Description: ClientConnectionImpl::onMessageComplete() moves PendingResponse (and therefore Encoder) and tries to use it after that. As move invalidates the object - accessing it by pointer is not valid anymore.
It looks like clang keeps objects roughly intact so it's uncaught by existing envoy tests. GCC on the other hand does invalidate the memory region, which leads to crash.
Risk Level: low
Testing: n/a (was only able to repro in gcc)
Docs Changes: n/a
Release Notes: n/a

Ruslan Nigmatullin added 2 commits March 27, 2020 18:43
Signed-off-by: Ruslan Nigmatullin <elessar@dropbox.com>
Signed-off-by: Ruslan Nigmatullin <elessar@dropbox.com>
@mattklein123 mattklein123 self-assigned this Mar 28, 2020
@mattklein123
Copy link
Member

Hmm. This is scary. What is the crashing call stack exactly? I don't see encoder would be used after this call. I would like to understand the underlying issue better before we decide how to fix as it would be optimal to avoid the extra heap allocation if we don't need it.

/wait-any

@euroelessar
Copy link
Contributor Author

Sure, here is a traceback of the segmentation fault at e9248e2:

#0  0x00007f2c091d997b in raise () from /usr/drte/v3/lib64/libpthread.so.0
#1  0x00005589d3e4f9b4 in Envoy::SignalAction::sigHandler (sig=11, info=0x7f2c02c3e430, context=0x7f2c02c3e300) at external/envoy/source/common/signal/signal_action.cc:74
#2  <signal handler called>
#3  0x00005589d3939c54 in Envoy::Http::CodecClient::responseDecodeComplete (this=0x5589d7bfbc20, request=...) at external/envoy/source/common/http/codec_client.cc:110
#4  0x00005589d393ab13 in Envoy::Http::CodecClient::ActiveRequest::onPreDecodeComplete (this=0x5589d7c531c0)
    at bazel-out/k8-dbg/bin/external/envoy/source/common/http/_virtual_includes/codec_client_lib/common/http/codec_client.h:201
#5  0x00005589d382fb4b in Envoy::Http::ResponseDecoderWrapper::decodeHeaders(std::unique_ptr<Envoy::Http::ResponseHeaderMap, std::default_delete<Envoy::Http::ResponseHeaderMap> >&&, bool) (this=0x5589d7c531e0,
    headers=<unknown type in /home/elessar/.cache/bazel/_bazel_elessar/4e3ae6a7a5573d050ebb9de1044b0454/execroot/__main__/bazel-out/k8-dbg/bin/cpp/dropbox/envoy/envoy_proxy, CU 0xaf7d74e, DIE 0xaff72b0>, end_stream=true) at bazel-out/k8-dbg/bin/external/envoy/source/common/http/_virtual_includes/codec_wrappers_lib/common/http/codec_wrappers.h:20
#6  0x00005589d3a5de4e in Envoy::Http::Http1::ClientConnectionImpl::onMessageComplete (this=0x5589d7c54900) at external/envoy/source/common/http/http1/codec_impl.cc:980
#7  0x00005589d3a5b27f in Envoy::Http::Http1::ConnectionImpl::onMessageCompleteBase (this=0x5589d7c54908) at external/envoy/source/common/http/http1/codec_impl.cc:632
#8  0x00005589d3a58dfc in Envoy::Http::Http1::ConnectionImpl::<lambda(http_parser*)>::operator()(http_parser *) const (__closure=0x0, parser=0x5589d7c54928)
    at external/envoy/source/common/http/http1/codec_impl.cc:394
#9  0x00005589d3a58e20 in Envoy::Http::Http1::ConnectionImpl::<lambda(http_parser*)>::_FUN(http_parser *) () at external/envoy/source/common/http/http1/codec_impl.cc:396
#10 0x00005589d3d95f44 in http_parser_execute (parser=0x5589d7c54928, settings=0x5589d5f90640 <Envoy::Http::Http1::ConnectionImpl::settings_>,
    data=0x5589d7c78028 "HTTP/1.1 200 OK\r\nDate: Sat, 28 Mar 2020 21:25:55 GMT\r\nContent-Length: 0\r\n\r\n\nAccept-Encoding: gzip\r\n\r\n", len=75)
    at external/com_github_nodejs_http_parser/http_parser.c:1884
#11 0x00005589d3a5a2ec in Envoy::Http::Http1::ConnectionImpl::dispatchSlice (this=0x5589d7c54908,
    slice=0x5589d7c78028 "HTTP/1.1 200 OK\r\nDate: Sat, 28 Mar 2020 21:25:55 GMT\r\nContent-Length: 0\r\n\r\n\nAccept-Encoding: gzip\r\n\r\n", len=75)
    at external/envoy/source/common/http/http1/codec_impl.cc:492
#12 0x00005589d3a5a18b in Envoy::Http::Http1::ConnectionImpl::dispatch (this=0x5589d7c54908, data=...) at external/envoy/source/common/http/http1/codec_impl.cc:477
#13 0x00005589d3939dae in Envoy::Http::CodecClient::onData (this=0x5589d7bfbc20, data=...) at external/envoy/source/common/http/codec_client.cc:125
#14 0x00005589d393a984 in Envoy::Http::CodecClient::CodecReadFilter::onData (this=0x5589d7c02e50, data=...)
    at bazel-out/k8-dbg/bin/external/envoy/source/common/http/_virtual_includes/codec_client_lib/common/http/codec_client.h:174
#15 0x00005589d3526908 in Envoy::Network::FilterManagerImpl::onContinueReading (this=0x5589d7bcd1f8, filter=0x0, buffer_source=...)
    at external/envoy/source/common/network/filter_manager_impl.cc:66
#16 0x00005589d3526a7f in Envoy::Network::FilterManagerImpl::onRead (this=0x5589d7bcd1f8) at external/envoy/source/common/network/filter_manager_impl.cc:76
#17 0x00005589d351b075 in Envoy::Network::ConnectionImpl::onRead (this=0x5589d7bcd180, read_buffer_size=75) at external/envoy/source/common/network/connection_impl.cc:297
#18 0x00005589d351cd05 in Envoy::Network::ConnectionImpl::onReadReady (this=0x5589d7bcd180) at external/envoy/source/common/network/connection_impl.cc:552
#19 0x00005589d351ca66 in Envoy::Network::ConnectionImpl::onFileEvent (this=0x5589d7bcd180, events=3) at external/envoy/source/common/network/connection_impl.cc:526
#20 0x00005589d3518da7 in Envoy::Network::ConnectionImpl::<lambda(uint32_t)>::operator()(uint32_t) const (__closure=0x5589d7c72970, events=3)
    at external/envoy/source/common/network/connection_impl.cc:74
#21 0x00005589d351ef51 in std::_Function_handler<void(unsigned int), Envoy::Network::ConnectionImpl::ConnectionImpl(Envoy::Event::Dispatcher&, Envoy::Network::ConnectionSocketPtr&&, Envoy::Network::TransportSocketPtr&&, Envoy::StreamInfo::StreamInfo&, bool)::<lambda(uint32_t)> >::_M_invoke(const std::_Any_data &, <unknown type in /home/elessar/.cache/bazel/_bazel_elessar/4e3ae6a7a5573d050ebb9de1044b0454/execroot/__main__/bazel-out/k8-dbg/bin/cpp/dropbox/envoy/envoy_proxy, CU 0x8c8536b, DIE 0x8d27958>) (__functor=...,
    __args#0=<unknown type in /home/elessar/.cache/bazel/_bazel_elessar/4e3ae6a7a5573d050ebb9de1044b0454/execroot/__main__/bazel-out/k8-dbg/bin/cpp/dropbox/envoy/envoy_proxy, CU 0x8c8536b, DIE 0x8d27958>) at external/drte_v3_build_sysroot/root/bin/../lib/gcc/x86_64-linux-gnu/8.2.0/../../../../include/c++/8.2.0/bits/std_function.h:297
#22 0x00005589d3513282 in std::function<void (unsigned int)>::operator()(unsigned int) const (this=0x5589d7c72970, __args#0=3)
    at external/drte_v3_build_sysroot/root/bin/../lib/gcc/x86_64-linux-gnu/8.2.0/../../../../include/c++/8.2.0/bits/std_function.h:687
#23 0x00005589d3512deb in Envoy::Event::FileEventImpl::<lambda(int, short int, void*)>::operator()(int, short, void *) const (__closure=0x0, what=38, arg=0x5589d7c728f0)
    at external/envoy/source/common/event/file_event_impl.cc:66
#24 0x00005589d3512e54 in Envoy::Event::FileEventImpl::<lambda(int, short int, void*)>::_FUN(int, short, void *) () at external/envoy/source/common/event/file_event_impl.cc:67
#25 0x00005589d3d7f3d2 in event_persist_closure (base=0x5589d7704840, ev=0x5589d7c728f8) at external/libevent/event.c:1638
#26 0x00005589d3d7f6b2 in event_process_active_single_queue (base=0x5589d7704840, activeq=0x5589d7700ca0, max_to_process=2147483647, endtime=0x0) at external/libevent/event.c:1697
#27 0x00005589d3d7fc40 in event_process_active (base=0x5589d7704840) at external/libevent/event.c:1798
#28 0x00005589d3d804f6 in event_base_loop (base=0x5589d7704840, flags=0) at external/libevent/event.c:2040
#29 0x00005589d3534e51 in Envoy::Event::LibeventScheduler::run (this=0x5589d77f21b0, mode=Envoy::Event::Dispatcher::RunType::Block)
    at external/envoy/source/common/event/libevent_scheduler.cc:47
#30 0x00005589d350d04d in Envoy::Event::DispatcherImpl::run (this=0x5589d77f2160, type=Envoy::Event::Dispatcher::RunType::Block)
    at external/envoy/source/common/event/dispatcher_impl.cc:197
#31 0x00005589d34f3073 in Envoy::Server::WorkerImpl::threadRoutine (this=0x5589d78232d0, guard_dog=...) at external/envoy/source/server/worker_impl.cc:110
#32 0x00005589d34f2958 in Envoy::Server::WorkerImpl::<lambda()>::operator()(void) const (__closure=0x5589d7c440c8) at external/envoy/source/server/worker_impl.cc:75
#33 0x00005589d34f3902 in std::_Function_handler<void(), Envoy::Server::WorkerImpl::start(Envoy::Server::GuardDog&)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...)
    at external/drte_v3_build_sysroot/root/bin/../lib/gcc/x86_64-linux-gnu/8.2.0/../../../../include/c++/8.2.0/bits/std_function.h:297
#34 0x00005589d2c0a198 in std::function<void ()>::operator()() const (this=0x5589d7c440c8)
    at external/drte_v3_build_sysroot/root/bin/../lib/gcc/x86_64-linux-gnu/8.2.0/../../../../include/c++/8.2.0/bits/std_function.h:687
#35 0x00005589d3bced20 in Envoy::Thread::ThreadImplPosix::<lambda(void*)>::operator()(void *) const (__closure=0x0, arg=0x5589d7c440c0)
    at external/envoy/source/common/common/posix/thread_impl.cc:33
#36 0x00005589d3bced44 in Envoy::Thread::ThreadImplPosix::<lambda(void*)>::_FUN(void *) () at external/envoy/source/common/common/posix/thread_impl.cc:35
#37 0x00007f2c091cf05e in start_thread () from /usr/drte/v3/lib64/libpthread.so.0
#38 0x00007f2c08dd866f in clone () from /usr/drte/v3/lib64/libc.so.6

@euroelessar
Copy link
Contributor Author

euroelessar commented Mar 28, 2020

What happens under the hood is that CodecClient::ActiveRequest::encoder_ stores a pointer to Http1::ClientConnectionImpl::PendingResponse::encoder_ (obtained by Http1::ClientConnectionImpl::newStream).
Http1::ClientConnectionImpl::onMessageComplete invalidates encoder by moving pending_response_ and attempts to use the stream right after that by calling response.decoder_->decodeHeaders(...).

If heap allocation is costly in this scenario we can move std::move/reset after response.decoder_->decode* calls.

@euroelessar
Copy link
Contributor Author

The comment makes me wonder if simply moving reset to the end of block does not cause any other side effects:

// After calling decodeData() with end stream set to true, we should no longer be able to reset.
PendingResponse response = std::move(pending_response_.value());
pending_response_.reset();

@mattklein123 any thoughts?

@mattklein123
Copy link
Member

OK this makes sense thanks. I'm really surprised/scared that this works in clang and especially doesn't get caught by ASAN. I think clang probably does not actually clear the memory when it "moves" (copies) the value into the local before resetting the optional (or the clang does not clear the memory in optional reset and gcc does).

and attempts to use the stream right after that by calling response.decoder_->decodeHeaders(...)

I don't think this the issue. The issue is here:

// HTTP/2 can send us a reset after a complete response if the request was not complete. Users
// of CodecClient will deal with the premature response case and we should not handle any
// further reset notification.
request.encoder_->getStream().removeCallbacks(request);

We are using the encoder again after it was moved/reset. This is definitely broken though again it's scary that this seems to work OK in clang.

The comment makes me wonder if simply moving reset to the end of block does not cause any other side effects:

I think it's probably not safe to do this, since the intention of this code (IIRC) is to prevent a recursive reset during the completion callback. I think the simplest way of fixing this for now would be to:

  1. Get rid of the move into the local variable as it is clearly broken.
  2. Add a boolean that indicates that a complete response was received before dispatching any decoder callbacks.
  3. Check the boolean here:
    if (pending_response_.has_value()) {

I think this should be functionaly equivalent to what we have today and then we can potentially improve this in further follow ups. Can you potentially try that? If things continue to break / you are having trouble I would be fine to go back to the heap allocation and I can look at fixing this in a follow up.

cc @alyssawilk @antoniovicente

Ruslan Nigmatullin added 3 commits March 29, 2020 14:54
This reverts commit 3836585.

Signed-off-by: Ruslan Nigmatullin <elessar@dropbox.com>
This reverts commit dec36b5.

Signed-off-by: Ruslan Nigmatullin <elessar@dropbox.com>
Signed-off-by: Ruslan Nigmatullin <elessar@dropbox.com>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks this LGTM for a fix with one small ASSERT comment. I would also like either @alyssawilk or @antoniovicente to take a quick look. Thank you!

/wait

PendingResponse& response = pending_response_.value();
// Encoder is used as part of decode* calls later in this function so pending_response_ can not
// be reset just yet. Preserve the state in pending_response_done_ instead.
pending_response_done_ = true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't thrill me due to extra logic complexity, but I think this is an OK solution for now. Can you do me a favor and put an additional ASSERT(!pending_response_done_); every time we load successfully check that pending_response has a value in the other functions above? I think there shouldn't be too many.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Signed-off-by: Ruslan Nigmatullin <elessar@dropbox.com>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@mattklein123
Copy link
Member

The failures are known flakes so will go ahead and merge.

@antoniovicente
Copy link
Contributor

PR description was not updated when the implementation moved away from heap allocation to extra ASSERTs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants