[DNM] Respect maxConcurrentStreams for client and server H2 dispatchers #2325

Conversation

@zaharidichev (Member) commented Sep 3, 2019

This PR addresses two problems:

The first commit makes sure that we respect the MAX_CONCURRENT_STREAMS setting for the H2 protocol. This has two sides:

Client side:
From the perspective of an H2 client, it's pretty easy. We make sure that we process the Http2Settings frame that is received from the remote peer right after connection initiation is done. This allows us to establish the maximum number of concurrent streams that we may initiate with the remote. If we issue a request that goes over the threshold advertised by the remote side, we fail the request with a 429 status. Note that this value can be dynamic: the remote peer can at any point send us another settings frame that overrides this threshold. That is accounted for as well.
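
For illustration, here is a minimal, self-contained sketch of the client-side idea (all names are hypothetical stand-ins, not the actual dispatcher code): track the limit advertised in SETTINGS frames and fail any local request that would exceed it.

```scala
import java.util.concurrent.atomic.{AtomicInteger, AtomicLong}

// Hypothetical sketch: respect the peer-advertised MAX_CONCURRENT_STREAMS on the client side.
class ClientStreamLimiter {
  // No limit until the peer advertises one; refreshed on every SETTINGS frame.
  private val maxConcurrentStreams = new AtomicLong(Long.MaxValue)
  private val activeStreams = new AtomicInteger(0)

  def onSettings(maxStreams: Option[Long]): Unit =
    maxStreams.foreach(maxConcurrentStreams.set)

  // The caller fails the request (e.g. with a 429) when this returns false.
  def tryOpenStream(): Boolean = {
    val count = activeStreams.incrementAndGet()
    if (count <= maxConcurrentStreams.get()) true
    else { activeStreams.decrementAndGet(); false }
  }

  def onStreamClosed(): Unit = activeStreams.decrementAndGet()
}
```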

Server side:
Things here are a bit more under our control. Namely, we take into account the maximumConcurrentStreams setting and track it in our Netty4ServerDispatcher. If the remote side tries to send us a headers frame proposing to open another stream that would go over this threshold, we respond with a reset frame carrying a REFUSED_STREAM error code.
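
A similarly hedged sketch of the server-side check (hypothetical types, not the real Netty4ServerDispatcher): a headers frame that would push the connection past the limit we advertised gets refused instead of admitted.

```scala
// Hypothetical sketch: refuse streams beyond the limit the server advertised.
sealed trait HeadersDecision
case object Admit extends HeadersDecision
// Corresponds to sending RST_STREAM with error code REFUSED_STREAM (0x7).
case object Refuse extends HeadersDecision

class ServerStreamLimiter(maxConcurrentStreams: Int) {
  private var activeStreams = Set.empty[Int]

  // Called when a HEADERS frame proposes to open stream `streamId`.
  def onHeaders(streamId: Int): HeadersDecision = synchronized {
    if (activeStreams.size < maxConcurrentStreams) {
      activeStreams += streamId
      Admit
    } else Refuse
  }

  def onStreamClosed(streamId: Int): Unit = synchronized {
    activeStreams -= streamId
  }
}
```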

The second commit is a bit more radical and attempts to redefine the notion of a concurrent stream from the perspective of an H2 server. Ultimately MAX_CONCURRENT_STREAMS has to do with backpressure and the avoidance of resource exhaustion. For that to take effect, we not only need to limit the number of streams being created but also change the definition of what counts as an active stream from the perspective of our H2 server. Until now, the moment we transition our stream state to fully closed, which might happen in the case of sending an EOS frame and receiving an EOS frame, we consider this stream closed and we stop accounting for it. The problem with that is that the number of frames sitting in the recv queue of the stream might be an arbitrary value. We lose track of when these are consumed (drained and released) and hope that this will somehow happen in the future, either through successful consumption of the queue or through failure (e.g. a service exception).

In my mind this is not a perfect strategy, as it exposes us to situations where streams are created at a really fast pace and not consumed quickly enough (the classic fast-producer, slow-consumer problem), which leads to depletion of resources for linkerd. This is part of the reason #2320 is happening and why it cannot be resolved by simply setting the MAX_CONCURRENT_STREAMS setting.
All of that being said, this change makes sure that we complete a stream only when we have consumed and released its last frame (if there are frames at all). Until then the stream is counted towards active streams, although it is in CLOSED state. This allows for exerting e2e backpressure when acting as an H2 server.
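
To make the proposed accounting concrete, here is a minimal sketch (hypothetical names, not the actual dispatcher code) in which a stream only stops counting as active once it is both protocol-CLOSED and its receive queue has been fully drained and released:

```scala
import scala.collection.mutable

// Hypothetical sketch: a stream stays "active" until its last queued frame is released.
final class StreamAccounting {
  private val pendingFrames = mutable.Map.empty[Int, Int] // streamId -> unreleased frames
  private val closed = mutable.Set.empty[Int]             // streams that reached CLOSED

  def onStreamOpened(id: Int): Unit = synchronized { pendingFrames(id) = 0 }

  def onFrameQueued(id: Int): Unit = synchronized {
    pendingFrames(id) = pendingFrames.getOrElse(id, 0) + 1
  }

  // Called when the protocol state machine reaches CLOSED (EOS sent and received).
  def onProtocolClosed(id: Int): Unit = synchronized { closed += id; maybeRetire(id) }

  // Called from the frame's release callback after the consumer is done with it.
  def onFrameReleased(id: Int): Unit = synchronized {
    pendingFrames(id) = pendingFrames.getOrElse(id, 1) - 1
    maybeRetire(id)
  }

  // Only retire the stream once it is CLOSED *and* its queue has been drained.
  private def maybeRetire(id: Int): Unit =
    if (closed(id) && pendingFrames.getOrElse(id, 0) <= 0) {
      pendingFrames.remove(id); closed -= id
    }

  def activeStreamCount: Int = synchronized(pendingFrames.size)
}
```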

I am still trying to come up with a few more tests to make sure this is completely correct, but to me the behaviour sounds logical. Comments are more than welcome :)

Signed-off-by: Zahari Dichev zaharidichev@gmail.com

@adleong (Member) commented Sep 3, 2019

Thanks for digging into this, @zaharidichev.

I would expect MAX_CONCURRENT_STREAMS to be enforced by Netty, actually, so I'm surprised that this needs to be done by us. I created a simple Finagle script to test sending 10 concurrent streams and configured Linkerd with a server limit of 5 concurrent streams. It seems like Finagle/Netty respected this option on the client side, at least, since after 5 requests, it started to error with this message:

Exception in thread "main" com.twitter.finagle.ChannelWriteException: com.twitter.finagle.UnknownChannelException: Maximum active streams violated for this endpoint. at remote address: localhost/127.0.0.1:4140. Remote Info: Not Available from service: client. Remote Info: Upstream Address: Not Available, Upstream Client Id: Not Available, Downstream Address: localhost/127.0.0.1:4140, Downstream Client Id: client, Trace Id: b7c01987fcd0ad88.b7c01987fcd0ad88<:b7c01987fcd0ad88

It's harder to tell if the server is enforcing this limit since I would need a badly-behaved client to violate the limit in order to test this.

@adleong (Member) commented Sep 3, 2019

In terms of backpressure and memory usage, you're totally right that this is much less about the number of open streams and much more about the unread frames. Thankfully, HTTP/2 has a concept of window updates specifically to provide this exact kind of backpressure. When the stream is created, a window size is established which describes the maximum number of bytes the client can send to the server; the client should not send any more until the server indicates that those bytes have been read. This is one of the things that happens when the consumer of a stream calls frame.release().
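
As a rough illustration of that mechanism (hypothetical names; this is not Finagle's actual flow-control code): releasing a consumed frame is what hands its bytes back to the sender's window via a WINDOW_UPDATE.

```scala
// Hypothetical sketch of per-stream flow control tied to frame release.
final class StreamWindow(initialWindow: Int, sendWindowUpdate: Int => Unit) {
  private var window = initialWindow

  // A DATA frame of `size` bytes arrives; reject it if the peer overran the window.
  def onData(size: Int): Boolean = synchronized {
    if (size > window) false
    else { window -= size; true }
  }

  // The consumer released the frame (frame.release()); give the bytes back to the peer.
  def onRelease(size: Int): Unit = synchronized {
    window += size
    sendWindowUpdate(size)
  }
}
```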

So if the client is able to send data faster than the server can consume it, this is a bug in the window management code.

With the default Linkerd settings, there can be a maximum of 1000 streams per connection and the initial window size per stream is 64k. This means that Linkerd should buffer a maximum of about 64M per connection.
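
A back-of-the-envelope check of that number (assuming the defaults quoted above: 1000 streams per connection and the HTTP/2 default 65,535-byte initial window):

```scala
val maxStreamsPerConnection = 1000
val initialWindowBytes = 65535 // default SETTINGS_INITIAL_WINDOW_SIZE
val worstCaseBufferedBytes = maxStreamsPerConnection.toLong * initialWindowBytes
// 65,535,000 bytes, i.e. the "about 64M" per-connection figure above
```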

We should figure out exactly how the old version of the strest-grpc client is able to make Linkerd violate this limit. It might be helpful to reproduce this in miniature by setting Linkerd's limits very low and pointing the old version of strest-grpc at it. If everything were working correctly, Linkerd should quickly reach the stream and/or window limits and stop admitting frames.

@zaharidichev (Member, Author)

@adleong Thanks for the insightful comments. I will set up a little example to illustrate what I mean in a bit more detail.

@zaharidichev (Member, Author) commented Sep 4, 2019

@adleong
That's right, Netty will limit that, but this is purely on the client side. There is nothing on the server side that enforces it. The only thing that's happening is that our linkerd is advertising the MAX_CONCURRENT_STREAMS value through the settings frame, which is read here, set right here and finally used here to throw this error on the client side. So there is nothing on the linkerd H2 server side that enforces this rule.

So really this change has to do with enforcing this on the server in case we face a client that does not comply with the protocol.

Now, a miniature example that reproduces this looks like:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: l5d-config7
data:
  config.yaml: |-
    admin:
      ip: 0.0.0.0
      port: 9997

    routers:
    - protocol: h2
      experimental: true
      label: outgoing
      dtab: /svc/* => /$/inet/strest-server7.test.svc.cluster.local/7777;
      identifier:
        kind: io.l5d.header.path
        segments: 2
      interpreter:
        kind: default
      servers:
      - port: 5757
        maxConcurrentStreamsPerConnection: 1
        ip: 0.0.0.0

    telemetry:
    - kind: io.l5d.prometheus

    usage:
      orgId: integration-test-7
---
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  labels:
    app: l5d7
  name: l5d7
spec:
  template:
    metadata:
      labels:
        app: l5d7
        testrun: test7
    spec:
      volumes:
      - name: l5d-config7
        configMap:
          name: "l5d-config7"

      containers:
      - name: l5d
        image: buoyantio/linkerd:1.7.0-SNAPSHOT
        env:
        - name: JVM_HEAP_MIN
          value: 384M
        - name: JVM_HEAP_MAX
          value: 384M
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        args:
        - /io.buoyant/linkerd/config/config.yaml
        ports:
        - name: outgoing
          containerPort: 5757
          hostPort: 5757
        - name: incoming
          containerPort: 6767
        - name: admin
          containerPort: 9997
        - name: debug
          containerPort: 8849
          hostPort: 8849    
        volumeMounts:
        - name: "l5d-config7"
          mountPath: "/io.buoyant/linkerd/config"
          readOnly: true

      - name: kubectl
        image: buoyantio/kubectl:v1.8.5
        args: ["proxy", "-p", "8001"]
---
apiVersion: v1
kind: Service
metadata:
  name: l5d7
spec:
  selector:
    app: l5d7
  type: LoadBalancer
  ports:
  - name: outgoing
    port: 5757
  - name: incoming
    port: 6767
  - name: admin
    port: 9997
  - name: debug
    port: 8849    
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: strest-server7
spec:
  replicas: 
  template:
    metadata:
      labels:
        app: strest-server7
        testrun: test7
    spec:
      dnsPolicy: ClusterFirst
      containers:
      - name: service
        image: buoyantio/strest-grpc:0.0.5
        command:
        - "/go/bin/strest-grpc"
        - "server"
        - "--address=0.0.0.0:7777"
        - "--metricAddr=0.0.0.0:9997"
        ports:
        - name: grpc
          containerPort: 7777
        - name: strest-server
          containerPort: 9997
---
apiVersion: v1
kind: Service
metadata:
  name: strest-server7
spec:
  selector:
    app: strest-server7
  clusterIP: None
  ports:
  - name: grpc
    port: 7777
---
apiVersion: batch/v1
kind: Job
metadata:
  name: strest-client7
spec:
  template:
    metadata:
      name: strest-client7
      labels:
        testrun: test7
    spec:
      containers:
      - name: strest-client
        image: buoyantio/strest-grpc:0.0.5
        env:
        - name: HOST_IP
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        command:
        - "/bin/sh"
        args:
        - "-c"
        - |
          sleep 60 # wait for pods to start
          /go/bin/strest-grpc client \
            --address $HOST_IP:5757 \
            --connections 25 \
            --streams 1 \
            --interval 30s \
            --streaming \
            --errorRate 1.0 \
            --metricAddr 0.0.0.0:9997 \
            --latencyPercentiles 0=0,100=500
        ports:
        - name: strest-client
          containerPort: 9997
      restartPolicy: OnFailure

If you look closely you will see that not only do we set maxConcurrentStreamsPerConnection to 1 on our H2 server, but the strest client itself does not create more than one stream per connection. If, however, you run this long enough, you will see an OOM of some sort, either heap-based or native-memory-based, depending on how you configured linkerd. Why is this happening? Well, because, as mentioned earlier, the client fails to send an RST frame, causing the stream to transition into a closed state while still preserving the frames in the rcv queue on the linkerd side. Now the client is free to create another stream since the first one failed, and it does so. It does so fast enough to deplete the resources of linkerd. And all of this is because of our definition of an active stream + the fact that the client does not respect the protocol.

In terms of backpressure and memory usage, you're totally right that this is much less about the number of open streams and much more about the unread frames. Thankfully, HTTP/2 has a concept of window updates specifically to provide this exact kind of backpressure. When the stream is created, a window size is established which describes the maximum number of bytes the client can send to the server and the client should not send any more until the server indicates that those bytes have been read. This is one of the things that happens when the consumer of a stream calls frame.release().

So if the client is able to send data faster than the server can consume it, this is a bug in the window management code.

Yes and no. The HTTP/2 spec defines two mechanisms for that: one is the flow control window per stream and the other is per connection. The per-stream one does not work in this case because it is never hit; here it's the number of pending streams that matters, not the number of frames within a single stream. On the other hand, connection-level flow control does not work either, because it is not considered in linkerd at the moment. Namely:

  object FlowControl {

    /**
     * Controls whether connection windows are automatically updated.
     *
     * When enabled, connection-level flow control is effectively
     * disabled (i.e. constrained by SETTINGS_MAX_CONCURRENT_STREAMS *
     * SETTINGS_INITIAL_WINDOW_SIZE).
     */
    case class AutoRefillConnectionWindow(enabled: Boolean)
    implicit object AutoRefillConnectionWindow extends Stack.Param[AutoRefillConnectionWindow] {
      /**
       * By default, the connection window is ignored.
       *
       * This should be changed on resolution of
       * https://github.com/linkerd/linkerd/issues/1044
       */
      val default = AutoRefillConnectionWindow(true)
    }
  }

So here we state that we exercise flow control by imposing the constraint of SETTINGS_MAX_CONCURRENT_STREAMS * SETTINGS_INITIAL_WINDOW_SIZE. Sure, but we are not imposing the max concurrent streams limit in the way we should in order to prevent resource depletion, and hence we get unbounded accumulation of frames.

So really, there are a few things we can do about all this:

  • Forget this ever happened, bump the strest-grpc version to 0.0.7 and blindly rely on the fact that remote peers will respect the protocol. This would not be my first choice.
  • Set AutoRefillConnectionWindow(false). This will enable flow control on the connection level and will cap the connection window at 65535, because we do not deal with configuring that at the moment. This works and prevents the crash problems. But again, as I see it, this is simply an advertisement to the client and we do not take care of enforcing it at all. So if we have a badly behaved client, this will not work.
  • Consider the change that I propose, which puts a limit on the number of active streams per connection that we maintain and redefines an active stream to be one that still has frames in its queue that have not been released.

I am not really sure I can explain this any better, but if something does not make sense or my understanding is entirely wrong, please call me out :) It's up to you which option (if any) we pick. At least we now know the source of the problem with the tests failing.

@adleong (Member) commented Sep 4, 2019

Thanks for the detailed write-up! I agree with you that the most likely outcome here is that we'll just bump strest-grpc to 0.0.7 and move on, but I'd like to understand exactly what's going on first. So please be patient with me while I try to wrap my head around it 😃

My understanding of the strest-grpc issue is that the client leaves streams in the half-closed state by never sending an EOS or RST. This means that Linkerd (correctly) maintains state for these streams.

You say that the client respects Linkerd's maxConcurrentStreamsPerConnection of 1. In order to send more requests, does the client open more connections? Do we see unbounded connection growth? Or does the client tear down the connection and establish a new one? In that case I would expect Linkerd to GC the stream state when the connection goes away.

Well, because, as mentioned earlier, the client fails to send an RST frame, causing the stream to transition into a closed state while still preserving the frames in the rcv queue on the linkerd side. Now the client is free to create another stream since the first one failed, and it does so. It does so fast enough to deplete the resources of linkerd. And all of this is because of our definition of an active stream + the fact that the client does not respect the protocol.

I'm having a hard time understanding what you mean here. A half-closed stream should always count towards the stream limit. If the client does not send a RST or EOS frame, it should not attempt to create a new stream because this would violate the max concurrent streams value of 1. If the client is badly behaved and attempts to create a second stream without closing the first one, the server should recognize this and send a GO_AWAY. It sounds like that's the piece of functionality that Linkerd is missing. So based on my understanding, having Linkerd enforce the max concurrent streams on the server side should fix the leak. Am I missing something?

@zaharidichev (Member, Author)

Thanks for the detailed write-up! I agree with you that the most likely outcome here is that we'll just bump strest-grpc to 0.0.7 and move on, but I'd like to understand exactly what's going on first. So please be patient with me while I try to wrap my head around it

Yes, it's interesting to understand. Plenty of patience here. And yes, I feel you, every time I think about it I get a bit dizzy :)

So let's take a look at what happens on both the H2 server and the H2 client, frame by frame, as I think this is where the bones are buried.

Frames from the perspective of Server Dispatcher (Linkerd H2 server listening to strest-client)

  1. [strest-client -> Linkerd] Receives a headers frame from the client to initiate a stream.
  2. [strest-client -> Linkerd] Receives a sequence of data frames from the client.
  3. [Linkerd -> strest-client] Sends EOS because of Response(200, Stream(isEmpty=true, onEnd=Some(Return(())))) coming from the strest server (the other side of the proxy). At that point we are in SEND_CLOSED mode for the H2Server transport for that stream.
  4. [strest-client -> Linkerd] Continues to receive data frames (after all, we are only in half-closed mode).
  5. [strest-client -> Linkerd] We receive an EOS data frame and transition to CLOSED mode, at which point the stream is removed from the "streams" map in the base dispatcher.
  6. [strest-client -> Linkerd] A new stream is initiated by the strest client.

Frames from the perspective of Client Dispatcher (Linkerd H2 client connecting to strest-server)

  1. [Linkerd -> strest-server] Sends a headers frame to the server to initiate a stream.
  2. [Linkerd -> strest-server] Sends a sequence of data frames to the server (the ones received by the H2 server at step 2).
  3. [strest-server -> Linkerd] Gets a DefaultHttp2Headers[:status: 200, content-type: application/grpc, grpc-status: 2, grpc-message: strest-grpc stream error for ErrorRate: 1], endStream=true, padding=0 (what is piped to the H2 server at point 3). At this point we are in RcvClosed mode, so there is nothing more to get from the server.
  4. Continues trying to send data frames to the server.

Point 4 on the Linkerd H2 client will eventually fail and reset the stream, because the strest server is not accepting these frames. However, our server dispatcher is at that point ready to accept new streams with new data frames that will accumulate in the queue. So there is a window of time where we have accumulated data frames that need to be sent to the strest server but never will be. And in this very window of time we are ready to accept new streams, because our stream in the H2 server is in CLOSED state. The critical part here is that after we transition into CLOSED state on the H2 server stream, we are not guaranteeing that the frames we ingested have successfully reached their destination and have been released.

The part where the protocol is not respected comes from the fact that in normal circumstances the client sends end-of-stream when it's done sending, and the server finishes the RPC with a trailer (which also carries end-of-stream). So when the client receives an end-of-stream before it has sent its own end-of-stream, it should respond with RST_STREAM. This is what has changed in the go-grpc library, and when that works the way it should, we will see an RST_STREAM coming from the strest client in response to the server issuing EOS first, and this will reset the stream on the H2 server side instead of us trying to send the frames to the other side of the fence and hoping that they will get there.

@zaharidichev (Member, Author) commented Sep 5, 2019

You say that the client respects Linkerd's maxConcurrentStreamsPerConnection of 1. In order to send more requests, does the client open more connections? Do we see unbounded connection growth? Or does the client tear down the connection and establish a new one? In that case I would expect Linkerd to GC the stream state when the connection goes away.

The client does not open new connections. It uses the same connection for creating a new stream. What I was referring to is that I have configured the strest client to only use one connection and one concurrent stream per connection.

I'm having a hard time understanding what you mean here. A half-closed stream should always count towards the stream limit. If the client does not send a RST or EOS frame, it should not attempt to create a new stream because this would violate the max concurrent streams value of 1.

The client needs to send an RST, which would drain the stream in this case, as it has received an EOS from the server before it has sent its own EOS, which is unexpected as far as I gather. This RST would cause the frame queue to be drained, which is what needs to happen. Instead, it sends an EOS on the last data frame that it sends. Again, the client sends an EOS so we transition into a closed state on our side, but this is not what should happen, because the data frames that we have accumulated are sitting with us much longer than they should, as already explained.

If the client is badly behaved and attempts to create a second stream without closing the first one, the server should recognize this and send a GO_AWAY. It sounds like that's the piece of functionality that Linkerd is missing. So based on my understanding, having Linkerd enforce the max concurrent streams on the server side should fix the leak. Am I missing something?

Technically this is not a GO_AWAY, as that would tear down the connection. This should be a stream error, not a connection error, as defined in the RFC:

Endpoints MUST NOT exceed the limit set by their peer. An endpoint that receives a HEADERS frame that causes its advertised concurrent stream limit to be exceeded MUST treat this as a stream error (Section 5.4.2) of type PROTOCOL_ERROR or REFUSED_STREAM. The choice of error code determines whether the endpoint wishes to enable automatic retry (see Section 8.1.4) for details).

PS: All this is super confusing and there are multiple small details. I hope I did an OK job explaining it, and I hope things are clearer now.

@adleong (Member) commented Sep 5, 2019

Ahh, I see!

So the client isn't violating the HTTP/2 spec at all, it's just sending a lot of data quickly on short-lived streams. Because each stream is short-lived, stream flow control doesn't help us, and we have not enabled connection flow control.

I feel good about this now. It's good to know that this is basically a consequence of #1044 and knowing that may influence how we prioritize #1044 in the future. I don't think it's a good idea to enable connection flow control without a way to configure the window size.

I think we've established that this issue is unrelated to maxConcurrentStreamsPerConnection. However, you discovered that Linkerd is not enforcing this limit on the server side. Do you think it's worthwhile to fix that anyway?

@zaharidichev (Member, Author)

So yes, we can make it so maxConcurrentStreams is respected on the H2 server in linkerd. I already did that in the first commit.

We can also consider putting end-to-end backpressure in place, like I did in the second commit. The main takeaway here is that H2 is supposed to be an e2e backpressured protocol. Linkerd is proxying this traffic but in essence is not really enforcing end-to-end backpressure, as it's allowing this depletion of resources. In my head, the fact that this issue even occurs points to a problem that needs to be addressed.

Regarding connection-level flow control, it's also important to keep in mind that, as far as I see it, this is still something that can be violated by the other side, and I am not sure we protect against it.

So what do you suggest we do?

  1. Simply enforce maxConcurrentStreams and deal with connection-level flow control later, trusting that this solves the problem.
  2. Consider counting a stream towards the active streams until it has been drained or failed by the other side (as in my second commit).

Don't want to create more problems than I am solving here.

@adleong (Member) commented Sep 9, 2019

I think we should go with option number 1. I don't think we should deviate from the spec in terms of how we define an open stream.

@adleong (Member) commented Sep 10, 2019

Closed in favor of #2327

@adleong closed this Sep 10, 2019
adleong pushed a commit that referenced this pull request Sep 19, 2019
This commit adds the functionality to respect maxConcurrentStreams for server dispatchers. The reason for this is that H2 settings such as `maxConcurrentStreams` are merely advertisement-based. We as a server need to protect against misbehaving clients that do not respect the settings we have advertised. Otherwise we are prone to memory leaks and all sorts of resource depletion problems.

This PR is a limited version of #2325, which was a more opinionated version of this work.

Signed-off-by: Zahari Dichev <zaharidichev@gmail.com>