Description
From #35485 we observed how the L7 proxy (Envoy) manages TCP sockets.
Let's focus on this traffic model:
pod1 -> egress proxy -> eth0 -> wire -> eth0 -> ingress proxy -> pod2
Key points:
- The egress proxy holds an idle client socket (pod1 -> pod2, sent by the egress proxy, direction egress proxy -> eth0) for 5 min.
- The ingress proxy holds an idle accepted socket (pod1 -> pod2, sent by the egress proxy, direction eth0 -> ingress proxy) for 10 min.
These idle timeouts are a normal optimization, but they interact badly with Cilium network policy changes.
During ci-ipsec-e2e (--test "pod-to-pod-encryption"), we apply the ingress and egress policies, curl from pod1 to pod2, assert connectivity, then delete the previously applied policies.
However, the idle sockets remain in the ESTABLISHED state in both the egress proxy and the ingress proxy. After the 5 min and 10 min idle timeouts expire, each proxy closes its socket and sends a FIN to its peer. Because the ingress and egress policies are already gone, the FINs will likely be dropped by the kernel with reason NO_SOCKET, and the host network namespace answers with a TCP reset. This TCP reset carries an L3 tuple such as "from pod1 to pod2" or vice versa, but not the proxy mark (0xa00 or 0xb00). Without the mark, the Cilium datapath (from_host@cilium_host, i.e. tunnel routing) can't recognize the TCP reset as "needs encryption", so the reset reaches the wire unencrypted.
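The classification gap can be modeled with a small Go sketch. This is a simplified illustration of the reasoning only, not Cilium's datapath code: the names `markProxyA`, `markProxyB`, `markMask`, and `hasProxyMark` are hypothetical, the mask value `0xf00` is an assumption, and the sketch does not say which mark value belongs to which proxy direction.

```go
package main

import "fmt"

// Proxy mark values quoted in the issue. The mask is an assumption made
// for illustration; the names are hypothetical.
const (
	markProxyA uint32 = 0xa00
	markProxyB uint32 = 0xb00
	markMask   uint32 = 0xf00 // assumed mask isolating the mark bits
)

// hasProxyMark reports whether a packet mark carries one of the two
// proxy marks after masking.
func hasProxyMark(mark uint32) bool {
	m := mark & markMask
	return m == markProxyA || m == markProxyB
}

func main() {
	// A segment sent by the proxy itself carries a proxy mark.
	fmt.Println(hasProxyMark(0xa00)) // true
	// A kernel-generated RST from the host network namespace carries none,
	// so it is not recognized as proxy traffic.
	fmt.Println(hasProxyMark(0x0)) // false
}
```

In this model the reset is indistinguishable from ordinary host traffic, which matches the observation that it leaves the node without encryption.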
The short-term solution is to ignore this kind of TCP reset in ci-ipsec-e2e. The long-term solution can rely on #36345: once the IPsec hooks move to the native device, the TCP reset will be caught by IPsec before leaving the node.
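As a rough sketch of the short-term workaround, a leak check could skip unencrypted packets that have the RST flag set. The `observed` type and the `isLeak` function below are hypothetical and do not reflect how ci-ipsec-e2e is implemented; only the TCP flag bit values are standard.

```go
package main

import "fmt"

// TCP flag bits as laid out in the TCP header.
const (
	flagFIN uint8 = 0x01
	flagRST uint8 = 0x04
	flagACK uint8 = 0x10
)

// observed is a hypothetical record of one unencrypted pod-to-pod packet
// captured on the wire.
type observed struct {
	src, dst string
	tcpFlags uint8
}

// isLeak reports whether an unencrypted packet should fail the test.
// Packets with RST set are ignored, since the kernel may emit them
// without the proxy mark after the policies are deleted.
func isLeak(p observed) bool {
	return p.tcpFlags&flagRST == 0
}

func main() {
	rst := observed{src: "pod2", dst: "pod1", tcpFlags: flagRST | flagACK}
	data := observed{src: "pod1", dst: "pod2", tcpFlags: flagACK}
	fmt.Println(isLeak(rst))  // false
	fmt.Println(isLeak(data)) // true
}
```

Ignoring every RST is deliberately coarse; a narrower check could also match on the pod addresses, at the cost of more test plumbing.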