Bug Description
Symptoms
A virtual service for TCP traffic occasionally fails to connect to a Kubernetes pod that exposes multiple ports. From time to time, TCP connections from our Istio ingress to a pod with an Istio sidecar fail with the UF,URX response flags.
Adding retries to the affected TCP proxies hid the connectivity problem from the client's perspective. However, establishing the connection still took more than 1.5s, because the ingress has to wait for the connect timeout before it detects the failure and retries, which is not an acceptable performance hit when it is otherwise avoidable.
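For reference, this is roughly how TCP connect retries can be switched on at the gateway; a minimal sketch using an EnvoyFilter, where the resource name and selector labels are illustrative rather than our exact config:

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: tcp-connect-retries   # illustrative name
  namespace: istio-system
spec:
  workloadSelector:
    labels:
      istio: ingressgateway   # assumes the default ingress gateway labels
  configPatches:
  - applyTo: NETWORK_FILTER
    match:
      context: GATEWAY
      listener:
        filterChain:
          filter:
            name: envoy.filters.network.tcp_proxy
    patch:
      operation: MERGE
      value:
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
          # Envoy retries the upstream connection this many times before
          # failing the downstream connection.
          max_connect_attempts: 3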
Digging in, we found that the failures were caused by the pod dropping the SYN packet and replying, on a different connection, with an RFC 5961 challenge ACK. Specifically, the Istio ingress would send a SYN to port 5002. The destination pod did not respond to that SYN; instead it sent a challenge ACK to the same IP and port, but from port 5000. After some thought, it became apparent that the iptables redirect rules, which map all inbound connections to the same destination port, were leading to collisions in the pod's connection table, and thus to timeouts on the colliding connections.
We could see the TcpExtTCPChallengeACK counter in the Linux nstat tool incrementing whenever this occurred, which supports this theory.
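You can watch that counter while reproducing; run this inside the pod's network namespace (for example by exec-ing into one of its containers):

# TcpExtTCPChallengeACK increments each time the kernel answers a colliding
# SYN with a challenge ACK instead of completing the handshake.
watch -n 1 'nstat -az | grep TcpExtTCPChallengeACK'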
Setup
- One Service exposing port 5000 and port 5002.
- Two virtual services: one forwarding HTTP from host X to port 5000, the other SNI-forwarding TLS from host Y to port 5002 (both sketched below).
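For reference, the setup looks roughly like this; a sketch in which the service name, hostnames, and gateway are placeholders rather than our real configuration:

# One Service exposing both ports.
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app
  ports:
  - name: http-web    # plain HTTP, routed by the first virtual service
    port: 5000
  - name: tls-api     # TLS, routed by SNI
    port: 5002
---
# HTTP from host X to port 5000.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: host-x-http
spec:
  hosts:
  - host-x.example.com
  gateways:
  - my-gateway
  http:
  - route:
    - destination:
        host: my-app
        port:
          number: 5000
---
# SNI forwarding from host Y to port 5002.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: host-y-sni
spec:
  hosts:
  - host-y.example.com
  gateways:
  - my-gateway
  tls:
  - match:
    - sniHosts:
      - host-y.example.com
    route:
    - destination:
        host: my-app
        port:
          number: 5002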
Analysis
The istio-init container configures iptables rules that ensure the Istio sidecar receives all inbound traffic. It does this by redirecting every inbound connection to a single destination port (15006). You can see this by viewing the configured listeners:
0.0.0.0 15006 Trans: raw_buffer; Addr: *:5002 Cluster: inbound|5002||
0.0.0.0 15006 Trans: tls; Addr: *:5002 Cluster: inbound|5002||
0.0.0.0 15006 Trans: tls; App: istio,istio-peer-exchange,istio-http/1.0,istio-http/1.1,istio-h2; Addr: *:5002 Cluster: inbound|5002||
0.0.0.0 15006 Trans: raw_buffer; Addr: *:5000 Cluster: inbound|5000||
0.0.0.0 15006 Trans: tls; App: istio,istio-peer-exchange,istio-http/1.0,istio-http/1.1,istio-h2; Addr: *:5000 Cluster: inbound|5000||
This means that when a connection arrives at the pod on any port X, its destination port is rewritten to 15006 (aside from a few exclusions specific to the Istio infrastructure itself). This many-to-one mapping is what introduces the problem.
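For reference, the nat rules installed by istio-init look roughly like the following (abridged iptables -t nat -S output from inside the pod; the exact exclusion ports vary by Istio version):

-A PREROUTING -p tcp -j ISTIO_INBOUND
-A ISTIO_INBOUND -p tcp --dport 15090 -j RETURN
-A ISTIO_INBOUND -p tcp --dport 15021 -j RETURN
-A ISTIO_INBOUND -p tcp --dport 15020 -j RETURN
-A ISTIO_INBOUND -p tcp -j ISTIO_IN_REDIRECT
-A ISTIO_IN_REDIRECT -p tcp -j REDIRECT --to-ports 15006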
It is perfectly possible for the following entries to exist in the connection table simultaneously:
sip | sport | dip | dport |
---|---|---|---|
A | X | B | Y |
A | X | B | Z |
Note that the source port and source IP are the same. This is okay, because the four-tuple is what uniquely identifies a connection for a given protocol (e.g. TCP). However, this is not acceptable:
sip | sport | dip | dport |
---|---|---|---|
A | X | B | 15006 |
A | X | B | 15006 |
You cannot have two entries with the same four-tuple.
This is what we see happening. A long-lived connection to port 5000 (A, X, B, 5000) collides with a new connection (A, X, B, 5002). After the rewrite, the kernel sees two entries for A|X|B|15006, so it sends a challenge ACK to (A, X) from port 5000 and discards the SYN.
It is worth noting that as the number of concurrent connections between A and B grows, so does the chance of a collision, because the chance of the ingress reusing a source port that is still held by another connection increases.
Example
For a concrete example, consider the following setup and sequence of events.
- 10.1.134.168 is my ingress IP.
- 10.1.134.72 is my pod IP.
- My pod exposes two services: one on port 5000, one on port 5002.
0. A new connection from (10.1.134.168, 38208) arrives at (10.1.134.72, 5000).
1. iptables redirects it to port 15006. The connection is now (10.1.134.168, 38208, 10.1.134.72, 15006).
2. Traffic runs between (10.1.134.168, 38208) -> (10.1.134.72, 15006), all fine and dandy, for a while.
3. Some time later, a new SYN from (10.1.134.168, 38208) arrives for (10.1.134.72, 5002).
4. iptables redirects it to port 15006. The new connection would also be (10.1.134.168, 38208, 10.1.134.72, 15006).
5. The kernel sees that there is already a connection for (10.1.134.168, 38208, 10.1.134.72, 15006), so it sends a challenge ACK to the source of the original connection.
6. The original connection is mapped back to (10.1.134.168, 38208, 10.1.134.72, 5000).
7. The challenge ACK is therefore sent to (10.1.134.168, 38208) from port 5000.
8. The SYN is dropped.
9. ~1.5s later, the connection attempt from 10.1.134.168 times out, after steps 0 through 8 repeat for each retransmitted SYN.
You can see this in my attached capture: synchallenge.pcap.zip
There is ongoing traffic between 38208 and 5000. Packet 17 shows the SYN, and packet 18 is the challenge ACK. The stack retries the SYN at packet 23, and again you see a challenge ACK.
Workaround
Exclude port 5002 from the redirect via the pod template annotations, and set up an explicit Sidecar object for it (both shown below), e.g.:
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: disable-redirect
  namespace: my-namespace
spec:
  workloadSelector:
    labels:
      app: my-app
  ingress:
  - port:
      number: 5002
      protocol: TCP
    defaultEndpoint: 127.0.0.1:6002
    captureMode: NONE
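The matching pod template change is the standard inbound-port exclusion annotation; only the relevant fragment of the deployment's pod template is shown:

# Pod template metadata: keep port 5002 out of the sidecar's inbound capture
# so it is never rewritten to 15006.
metadata:
  annotations:
    traffic.sidecar.istio.io/excludeInboundPorts: "5002"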
What I expect to happen
I do not expect connections to fail at random between the ingress and the sidecar when my cluster is otherwise stable and my system is not under meaningful load. An occasional failure is fine; retries can handle that. However, I measured a failure rate of around 0.1% in one of my tests, which is far too high. Further, this smells like it could lead to problems beyond the occasional failed connection. Fundamentally, mapping multiple destination ports many-to-one onto a single port is asking for trouble.
I also do not expect to need manual workarounds in the form of a Sidecar resource plus injection overrides in the pod template, when the configuration I have deployed seems perfectly in line with the configuration model -- i.e. it is not some weird corner case.
Reproduction
I'll try to come up with a simpler scenario, but the basic setup is a pod with a sidecar and two exposed ports: a few long-lived connections on one port, and a series of "cycling" connections on the other. The easiest way to get that is to make one port HTTP (so the connection pool is in play) and the other TCP or TLS.
Keep a few HTTP requests going to the HTTP endpoint so the connection pool stays primed. Then send a continuous stream of new connections into the ingress targeting the TLS endpoint. Some of them will fail to connect with the default configuration (which has no TCP retries). If TCP retries are enabled, you'll instead see that the connections which would have failed take much longer to establish.
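Roughly, something like the following works as a driver; a sketch in which the ingress address and hostnames are placeholders for the two virtual services described above:

#!/usr/bin/env bash
# Placeholders: substitute the real ingress address and virtual service hosts.
INGRESS=203.0.113.10

# Keep the HTTP connection pool between the ingress and the sidecar warm.
while true; do
  curl -sS --resolve host-x.example.com:443:"$INGRESS" \
    https://host-x.example.com/ -o /dev/null
  sleep 0.2
done &

# Continuously open fresh TLS connections to the SNI-routed port; with the
# default configuration (no TCP retries) some of these fail to connect.
while true; do
  openssl s_client -connect "$INGRESS":443 -servername host-y.example.com \
    </dev/null >/dev/null 2>&1
done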
Version
$ istioctl version
client version: 1.10.6
control plane version: 1.10.6
data plane version: 1.10.6 (126 proxies)
$ istioctl version --short
client version: 1.10.6
control plane version: 1.10.6
data plane version: 1.10.6 (126 proxies)
I also reproduced the issue and workaround on Istio 1.13.3.
Additional Information
Output of istioctl bug-report:
Target cluster context: XXXXXX
Running with the following config:
istio-namespace: istio-system
full-secrets: false
timeout (mins): 30
include: { }
exclude: { Namespaces: kube-system, kube-public, kube-node-lease, local-path-storage } AND { Namespaces: kube-system, kube-public, kube-node-lease, local-path-storage }
end-time: 2022-05-17 17:06:18.541145161 -0400 EDT
The following Istio control plane revisions/versions were found in the cluster:
Revision default:
&version.MeshInfo{
{
Component: "pilot",
Info: version.BuildInfo{Version:"1.10.6", GitRevision:"fd053c6165d21105d66dac6e3d0649db2dde5b86", GolangVersion:"", BuildStatus:"Clean", GitTag:"1.10.6"},
},
{
Component: "pilot",
Info: version.BuildInfo{Version:"1.10.6", GitRevision:"fd053c6165d21105d66dac6e3d0649db2dde5b86", GolangVersion:"", BuildStatus:"Clean", GitTag:"1.10.6"},
},
}
The following proxy revisions/versions were found in the cluster:
Revision default: Versions {1.10.6}
I cannot include any more information as it contains too many sensitive details. Let me know if there is anything in particular which would be useful and I can try to gather it.
Environments tested on
- GKE 1.21, Istio 1.10.6
- MicroK8s v1.21.12 on Ubuntu 21.10, Istio 1.10.6
- MicroK8s v1.21.12 on Ubuntu 21.10, Istio 1.13.3