IPTables Redirect rule causes tcp connection timeouts with multiple ports #38982

@klarose

Bug Description

Symptoms

A virtual service for TCP traffic occasionally fails to connect to a Kubernetes pod that exposes multiple ports: TCP connections from our Istio ingress to a pod with an Istio sidecar intermittently fail with the UF,URX response flags.
Adding retries to the affected TCP proxies removed the connectivity problem from the client's perspective. However, establishing the connection still took > 1.5s, because the ingress has to wait for the connection timeout before it detects the failure. That is not an acceptable performance hit when it is otherwise avoidable.

Digging in, we found that the failures were caused by the pod dropping the SYN packet and replying to a different connection with an RFC 5961 challenge ACK. In particular, we saw the Istio ingress send a SYN to port 5002. The destination pod did not respond to that SYN. Instead, it sent a challenge ACK to the same IP and port, but from port 5000. After some thought, it became apparent that the iptables redirect rules, which map all inbound connections to the same destination port, were causing collisions in the pod's connection table, and thus timeouts on the colliding connections.

We could see the TcpExtTCPChallengeACK counter in the Linux nstat tool incrementing when this occurred, which aligns with our theory.
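
For anyone trying to confirm the same behaviour, the counter can be checked with something like the following (a diagnostic sketch; run it on the node, or inside the pod's network namespace, while reproducing):

# TcpExtTCPChallengeACK counts challenge ACKs sent by the stack (RFC 5961);
# it increments when a SYN lands on an already-established connection, as
# described above. -a dumps absolute values, -z includes zero counters.
$ nstat -az TcpExtTCPChallengeACK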

Setup

  • One service with port 5000 and port 5002.
  • Two virtual services: one forwarding HTTP from host X to port 5000, the other SNI-forwarding from host Y to port 5002.

Analysis

The Istio init container configures iptables rules which ensure that the Istio sidecar receives all inbound traffic. It does this by redirecting all connections to a single destination port (15006 by default). You can see this by viewing the configured listeners:

0.0.0.0       15006 Trans: raw_buffer; Addr: *:5002                                                                 Cluster: inbound|5002||
0.0.0.0       15006 Trans: tls; Addr: *:5002                                                                        Cluster: inbound|5002||
0.0.0.0       15006 Trans: tls; App: istio,istio-peer-exchange,istio-http/1.0,istio-http/1.1,istio-h2; Addr: *:5002 Cluster: inbound|5002||
0.0.0.0       15006 Trans: raw_buffer; Addr: *:5000                                                                 Cluster: inbound|5000||
0.0.0.0       15006 Trans: tls; App: istio,istio-peer-exchange,istio-http/1.0,istio-http/1.1,istio-h2; Addr: *:5000 Cluster: inbound|5000||

This means that when a connection arrives at the pod on any port X, its destination port is rewritten to 15006 (aside from a few exclusions specific to the Istio infrastructure itself). This rewrite is what introduces the problem.
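
The rules behind this are installed by istio-init in the nat table. Simplified, they have roughly the following shape (an illustration of the redirect only; the real rule set also contains the exclusion chains and is generated from the injected initContainer arguments):

# Sketch of the inbound redirect installed by istio-init (simplified).
# Every inbound TCP connection, whatever its original destination port,
# has that port rewritten to the sidecar's single inbound port, 15006.
iptables -t nat -N ISTIO_INBOUND
iptables -t nat -N ISTIO_IN_REDIRECT
iptables -t nat -A PREROUTING -p tcp -j ISTIO_INBOUND
iptables -t nat -A ISTIO_INBOUND -p tcp -j ISTIO_IN_REDIRECT
iptables -t nat -A ISTIO_IN_REDIRECT -p tcp -j REDIRECT --to-ports 15006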

It is perfectly possible for the following entries to exist in the connection table simultaneously:

sip  sport  dip  dport
A    X      B    Y
A    X      B    Z

Note that the source port and source IP are the same. This is fine, because the full four-tuple is what uniquely identifies a connection for a given protocol (e.g. TCP). However, this is not acceptable:

sip  sport  dip  dport
A    X      B    15006
A    X      B    15006

You cannot have two entries with the same four-tuple.

This is what we see happening. A long-lived connection to 5000 (A, X, B, 5000) collides with a new connection (A, X, B, 5002). After the rewrite, the kernel sees two entries for (A, X, B, 15006), so it sends a challenge ACK to (A, X) from port 5000 and discards the SYN.

It is worth noting that as the number of concurrent connections between A and B increases, so does the chance of a collision, since it becomes more likely that a source port is reused while an existing connection to the other destination port still holds it.
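
One way to watch the collision set up is to list the sidecar's inbound sockets while reproducing (a diagnostic sketch; the pod name is a placeholder and it assumes ss is available in the istio-proxy container):

# Every inbound flow, whatever port it originally targeted, shows up on the
# sidecar's port 15006. At this layer the only thing distinguishing a flow
# that was headed to 5000 from one headed to 5002 is the peer's source port,
# which is exactly what collides.
$ kubectl exec <pod> -c istio-proxy -- ss -tn 'sport = :15006'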

Example

For a concrete example, consider the following setup and sequence of events.

10.1.134.168 is my ingress IP.
10.1.134.72 is my pod IP.

My pod exposes two services: one on port 5000, one on port 5002.

0. A new connection from (10.1.134.168, 38208) arrives at (10.1.134.72, 5000).
1. iptables redirects it to port 15006. The connection is now (10.1.134.168, 38208, 10.1.134.72, 15006).
2. Traffic runs between (10.1.134.168, 38208) -> (10.1.134.72, 15006), all fine and dandy, for a while.
3. Some time later, a new SYN from (10.1.134.168, 38208) arrives for (10.1.134.72, 5002).
4. iptables redirects it to port 15006. The connection is now (10.1.134.168, 38208, 10.1.134.72, 15006).
5. The kernel sees that there is already a connection (10.1.134.168, 38208, 10.1.134.72, 15006). It sends a challenge ACK to the source of the original connection.
6. The original connection is mapped back (reverse NAT) to (10.1.134.168, 38208, 10.1.134.72, 5000).
7. The challenge ACK is therefore sent to (10.1.134.168, 38208) from port 5000.
8. The SYN is dropped.
9. ~1.5s later the connection attempt from 10.1.134.168 times out, as steps 3 through 8 repeat with each SYN retransmission.

You can see this in my attached capture: synchallenge.pcap.zip

There is ongoing traffic between 38208 and 5000. Packet 17 shows the SYN. Packet 18 is the challenge ACK. The stack retries the SYN at packet 23, and again you see a challenge ACK.
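
A similar capture can be taken on the ingress node with an ordinary tcpdump filter along these lines (the interface name is a placeholder):

# Capture both pod ports; the challenge ACK shows up as a bare ACK from
# port 5000 answering a SYN that was sent to port 5002.
$ tcpdump -ni eth0 -w synchallenge.pcap 'host 10.1.134.72 and (tcp port 5000 or tcp port 5002)'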

Workaround

Exclude port 5002 from the redirect via the pod template annotations (sketched after the Sidecar below), and set up an explicit Sidecar object for it. E.g.:

apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: disable-redirect
  namespace: my-namespace
spec:
  workloadSelector:
    labels:
      app: my-app
  ingress:
    - port:
        number: 5002
        protocol: TCP
      defaultEndpoint: 127.0.0.1:6002
      captureMode: NONE
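
For completeness, the annotation half of the workaround is an injection override in the pod template. A sketch, assuming the standard traffic.sidecar.istio.io/excludeInboundPorts annotation (double-check the name against the sidecar injection docs for your Istio version):

# Pod template section of the Deployment (sketch): skip iptables capture for
# inbound port 5002 so it reaches the explicit ingress listener declared in
# the Sidecar above instead of being rewritten to 15006.
template:
  metadata:
    annotations:
      traffic.sidecar.istio.io/excludeInboundPorts: "5002"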

What I expect to happen

I do not expect connections to randomly fail between the ingress and the sidecar when my cluster is otherwise stable and the system is not under significant load. An occasional failure is fine; retries can handle that. However, I measured a failure rate of around 0.1% in one of my tests, which is far too high. Further, this smells like it could lead to other issues beyond the occasional connectivity failure. Fundamentally, mapping multiple destination ports many-to-one is asking for trouble.

I do not expect to need to apply manual workarounds in the form of Sidecar config + injection overrides in the pod template when the configuration I have deployed seems perfectly in line with the configuration model -- i.e. it's not some weird corner case.

Reproduction

I'll try to come up with a simpler scenario, but it is basically: a pod with a sidecar and two exposed ports, a few long-lived connections on one port, and a series of 'cycling' connections on the other. The easiest way to get that is to make one port HTTP (so the connection pool is in play) and the other TCP or TLS.

Keep a few HTTP requests going to the HTTP endpoint so the connection pool stays primed. Then send a continuous stream of new connections into the ingress targeting the TLS endpoint. Some of them will fail to connect with the default configuration (which has no TCP retries). If TCP retries are enabled, you'll instead see the connections that would have failed take much longer to establish.
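
A minimal sketch of the two traffic generators (INGRESS_IP, HTTP_PORT, TLS_PORT, hostX and hostY are placeholders for whatever the gateway and virtual services expose):

# Terminal 1: keep the HTTP connection pool to port 5000 primed.
while true; do
  curl -s -o /dev/null -H 'Host: hostX' "http://$INGRESS_IP:$HTTP_PORT/"
  sleep 0.2
done

# Terminal 2: open a fresh TLS connection per iteration toward the SNI route
# for port 5002 and time each handshake; failures or ~1.5s outliers indicate
# a colliding SYN being dropped.
while true; do
  time openssl s_client -connect "$INGRESS_IP:$TLS_PORT" -servername hostY </dev/null >/dev/null 2>&1
done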

Version

$ istioctl version
client version: 1.10.6
control plane version: 1.10.6
data plane version: 1.10.6 (126 proxies)

$ istioctl version --short
client version: 1.10.6
control plane version: 1.10.6
data plane version: 1.10.6 (126 proxies)

I also reproduced the issue and workaround on Istio 1.13.3.

Additional Information

bug report

istioctl bug-report

Target cluster context: XXXXXX

Running with the following config: 

istio-namespace: istio-system
full-secrets: false
timeout (mins): 30
include: {  }
exclude: { Namespaces: kube-system, kube-public, kube-node-lease, local-path-storage } AND { Namespaces: kube-system, kube-public, kube-node-lease, local-path-storage }
end-time: 2022-05-17 17:06:18.541145161 -0400 EDT


The following Istio control plane revisions/versions were found in the cluster:
Revision default:
&version.MeshInfo{
    {
        Component: "pilot",
        Info:      version.BuildInfo{Version:"1.10.6", GitRevision:"fd053c6165d21105d66dac6e3d0649db2dde5b86", GolangVersion:"", BuildStatus:"Clean", GitTag:"1.10.6"},
    },
    {
        Component: "pilot",
        Info:      version.BuildInfo{Version:"1.10.6", GitRevision:"fd053c6165d21105d66dac6e3d0649db2dde5b86", GolangVersion:"", BuildStatus:"Clean", GitTag:"1.10.6"},
    },
}

The following proxy revisions/versions were found in the cluster:
Revision default: Versions {1.10.6}

I cannot include any more information as it contains too many sensitive details. Let me know if there is anything in particular which would be useful and I can try to gather it.

Environments tested on

  • gke 1.21, istio 1.10.6
  • microk8s v1.21.12 on ubuntu 21.10. Istio 1.10.6
  • microk8s v1.21.12 on ubuntu 21.10. Istio 1.13.3

Labels

area/networking, lifecycle/automatically-closed (indicates a PR or issue that has been closed automatically), lifecycle/stale (indicates a PR or issue hasn't been manipulated by an Istio team member for a while)
