shouldDrainReadBuffer followed up by readDisable(true) and readDisable(false) strange behaviour #12304

@ppadevski

Description

The behaviour I'm about to describe requires quite a lot of setup to reproduce.

To begin with, I'm using v1.14.4.

My listeners use the following configuration.

listener.mutable_per_connection_buffer_limit_bytes()->set_value(16384);

The default value of 1MB seemed way too big and opened up plenty of opportunities for OOM, so I ended up using 16k.

Then I have a local SSL client that sends data in chunks, e.g. 400 bytes followed by 16500 bytes (e.g. HTTP headers and body).

What can be seen in the trace logs is the following:

2020-07-23T14:56:23.084Z trace [source/common/network/connection_impl.cc:534] [C187] socket event: 3
2020-07-23T14:56:23.084Z trace [source/common/network/connection_impl.cc:622] [C187] write ready
2020-07-23T14:56:23.084Z trace [source/common/network/connection_impl.cc:572] [C187] read ready
2020-07-23T14:56:23.084Z trace [source/extensions/transport_sockets/tls/ssl_socket.cc:87] [C187] ssl read returns: 393
2020-07-23T14:56:23.084Z trace [source/extensions/transport_sockets/tls/ssl_socket.cc:87] [C187] ssl read returns: 15991
2020-07-23T14:56:23.084Z trace [source/extensions/transport_sockets/tls/ssl_socket.cc:167] [C187] ssl read 16384 bytes
2020-07-23T14:56:23.084Z trace [source/common/http/http1/codec_impl.cc:475] [C187] parsing 16384 bytes
...
2020-07-23T14:56:23.085Z debug [source/common/http/conn_manager_impl.cc:2344] [C187][S7433621899365225547] Read-disabling downstream stream due to filter callbacks.
2020-07-23T14:56:23.085Z trace [source/common/network/connection_impl.cc:355] [C187] readDisable: enabled=true disable=true state=0
...
2020-07-23T14:56:23.086Z debug [source/common/http/conn_manager_impl.cc:2368] [C187][S7433621899365225547] Read-enabling downstream stream due to filter callbacks.
2020-07-23T14:56:23.086Z trace [source/common/network/connection_impl.cc:355] [C187] readDisable: enabled=false disable=false state=0
2020-07-23T14:56:23.086Z trace [source/common/network/connection_impl.cc:534] [C187] socket event: 2
2020-07-23T14:56:23.086Z trace [source/common/network/connection_impl.cc:622] [C187] write ready

At that point Envoy stops reading from the downstream and either the client or the server times out. According to the client the entire request has been sent, but according to the server the entire request was not received.

To me it seems like the problem is that whenever we have this situation

2020-07-23T14:56:23.084Z trace [source/extensions/transport_sockets/tls/ssl_socket.cc:87] [C187] ssl read returns: 393
2020-07-23T14:56:23.084Z trace [source/extensions/transport_sockets/tls/ssl_socket.cc:87] [C187] ssl read returns: 15991
2020-07-23T14:56:23.084Z trace [source/extensions/transport_sockets/tls/ssl_socket.cc:167] [C187] ssl read 16384 bytes

only then do we execute the following code:

105 Network::IoResult SslSocket::doRead(Buffer::Instance& read_buffer) {
...
158 if (slices_to_commit > 0) {
159 read_buffer.commit(slices, slices_to_commit);
160 if (callbacks_->shouldDrainReadBuffer()) {
161 callbacks_->setReadBufferReady();
162 keep_reading = false;
163 }
164 }

and in my case I have shouldDrainReadBuffer() == true because

113 bool shouldDrainReadBuffer() override {
114 return read_buffer_limit_ > 0 && read_buffer_.length() >= read_buffer_limit_; // 16k > 0 && 16k == 16k
115 }

and then setReadBufferReady does

121 void setReadBufferReady() override { file_event_->activate(Event::FileReadyType::Read); }

which seems to immediately get eaten by the first readDisable(true)

readDisable: enabled=true disable=true state=0

and when the second readDisable(false) happens

readDisable: enabled=false disable=false state=0

the read event is not activated.
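
Here is a minimal, self-contained sketch of the suspected mechanism. To be clear, this is not Envoy code; FakeFileEvent only models my assumption that re-registering the event mask via setEnabled() drops an artificial activation that hasn't been dispatched yet.

#include <cstdint>
#include <iostream>

enum FileReadyType : uint32_t { Read = 1, Write = 2 };

// Toy stand-in for the file event: tracks which events are enabled and
// which have been artificially activated but not yet dispatched.
struct FakeFileEvent {
  uint32_t enabled_ = Read | Write;
  uint32_t pending_ = 0;

  // setReadBufferReady() -> file_event_->activate(Read)
  void activate(uint32_t events) { pending_ |= events; }

  // Assumption: (re)registering the event mask loses the pending activation.
  void setEnabled(uint32_t events) {
    enabled_ = events;
    pending_ = 0;
  }

  void runEventLoop() {
    // No new socket event will arrive: the bytes were already read into the
    // 16k buffer, so the only way to make progress is the pending activation.
    if (pending_ & enabled_ & Read) {
      std::cout << "read ready -> remaining data gets dispatched\n";
    } else {
      std::cout << "no read event -> connection stalls\n";
    }
  }
};

int main() {
  FakeFileEvent event;
  event.activate(Read);           // shouldDrainReadBuffer() was true -> setReadBufferReady()
  event.setEnabled(Write);        // readDisable(true): pending activation is eaten here
  event.setEnabled(Read | Write); // readDisable(false): nothing re-activates Read
  event.runEventLoop();           // prints "no read event -> connection stalls"
  return 0;
}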

If readDisable(true/false) isn't triggered or the setReadBufferReady() code path is not executed, all is well.
I tried plain TCP and wasn't able to reproduce the issue (it has a similar shouldDrainReadBuffer check).
I wasn't able to reproduce the issue with an external connection either (maybe my network is slow).
The code snippets that I shared may be a few lines off from v1.14.4, as I have a few ERR_clear_error() calls added because I'm using OpenSSL, but looking at the behaviour I would not blame them or OpenSSL.
Unfortunately, I can't share either the client or the server code.

I can reliably reproduce the issue once every 5 minutes or so, so if anyone has a patch in mind I'd be willing to try it out. In the meantime, I'll see if I can come up with something.
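
As a starting point for that "something", here is the shape of one possible fix, again only as a toy-model sketch and not a patch against Envoy (the names just mirror the real ones, and which condition to key the re-activation on is an open question): on readDisable(false), remember that the transport stopped reading early because shouldDrainReadBuffer() was true, and re-activate the read event instead of waiting for a socket event that may never come.

#include <cstdint>
#include <iostream>

enum FileReadyType : uint32_t { Read = 1, Write = 2 };

struct PatchedConnection {
  uint32_t enabled_ = Read | Write;
  uint32_t pending_ = 0;
  // Stands in for "doRead() stopped early because the 16k limit was hit,
  // so the transport/kernel may still be holding unread bytes".
  bool transport_has_unread_data_ = false;

  void activate(uint32_t events) { pending_ |= events; }
  void setEnabled(uint32_t events) { enabled_ = events; pending_ = 0; }

  void readDisable(bool disable) {
    if (disable) {
      setEnabled(Write);
    } else {
      setEnabled(Read | Write);
      // The sketch: replay the read activation ourselves on re-enable.
      if (transport_has_unread_data_) {
        activate(Read);
      }
    }
  }

  void runEventLoop() {
    if (pending_ & enabled_ & Read) {
      std::cout << "read ready -> the remaining data gets dispatched\n";
    } else {
      std::cout << "no read event -> connection stalls\n";
    }
  }
};

int main() {
  PatchedConnection c;
  c.transport_has_unread_data_ = true; // shouldDrainReadBuffer() fired
  c.activate(Read);                    // setReadBufferReady()
  c.readDisable(true);                 // filter pauses reads; activation is lost
  c.readDisable(false);                // filter resumes; sketch re-activates Read
  c.runEventLoop();                    // prints "read ready -> ..."
  return 0;
}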
