Skip to content

Conversation

jrajahalme
Copy link
Member

@jrajahalme jrajahalme commented Jul 8, 2025

Require TCP ACK flag to not be set when SYN is set to recreate a CT entry.

This addresses the problem where CT entry is created in the reply direction without a proxy redirect flag when a forward direction CT entry with proxy redirect flag already exists, when a packet from an Envoy upstream connection destination reaches bpf_lxc of the source pod.

The CT proxy redirect flag exists for the purpose of routing reply direction packets to the proxy when they reach the source pod's bpf_lxc program. Recreting the CT entry on the basis of the TCP SYN flag without requiring the ACK flag to be unset defeats this purpose and stalls traffic on source pod/Envoy (downstream) connection.

Example of creation of CT entry in the reply direction (only showing reply direction flows, SYN/ACK in the middle is for a proxy upstream connection that needs to be redirected to the proxy instead of the source pod):

-> endpoint 73 flow 0xba1ee241 , identity 60249->44892 state reply ifindex lxce2292d80c218 orig-ip 10.244.0.215: 10.244.0.215:80 -> 10.244.1.234:39194 tcp SYN, ACK
-> endpoint 73 flow 0xba1ee241 , identity 60249->44892 state reply ifindex lxce2292d80c218 orig-ip 10.244.0.215: 10.244.0.215:80 -> 10.244.1.234:39194 tcp ACK
-> endpoint 73 flow 0xba1ee241 , identity 60249->44892 state reply ifindex lxce2292d80c218 orig-ip 10.244.0.215: 10.244.0.215:80 -> 10.244.1.234:39194 tcp ACK
-> endpoint 73 flow 0x61a4168c , identity 60249->44892 state new ifindex lxce2292d80c218 orig-ip 10.244.0.215: 10.244.0.215:80 -> 10.244.1.234:39194 tcp SYN, ACK
-> endpoint 73 flow 0xba1ee241 , identity 60249->44892 state reply ifindex lxce2292d80c218 orig-ip 10.244.0.215: 10.244.0.215:80 -> 10.244.1.234:39194 tcp ACK
-> endpoint 73 flow 0x1fa7df2a , identity 60249->44892 state established ifindex lxce2292d80c218 orig-ip 10.244.0.215: 10.244.0.215:80 -> 10.244.1.234:39194 tcp SYN, ACK
-> endpoint 73 flow 0xba1ee241 , identity 60249->44892 state reply ifindex lxce2292d80c218 orig-ip 10.244.0.215: 10.244.0.215:80 -> 10.244.1.234:39194 tcp ACK

CT entries, the second one being created in the reply direction:

TCP OUT 10.244.1.234:44892 -> 10.244.0.215:80 expires=1595143 Packets=0 Bytes=0 RxFlagsSeen=0x1b LastRxReport=1587142 TxFlagsSeen=0x00 LastTxReport=1587115 Flags=0x0051 [ RxClosing SeenNonSyn ProxyRedirect ] RevNAT=0 SourceSecurityID=44892 BackendID=0
TCP IN 10.244.0.215:80 -> 10.244.1.234:44892 expires=1595143 Packets=0 Bytes=0 RxFlagsSeen=0x12 LastRxReport=1587116 TxFlagsSeen=0x19 LastTxReport=1587142 Flags=0x0412 [ TxClosing SeenNonSyn FromTunnel ] RevNAT=0 SourceSecurityID=60249 BackendID=0

Requiring the ACK flag be cleared when seeing the SYN flag being set prevents the second CT entry from being created.

Related: #32653

Bpf datapath TCP conntrack entries are (re)created only in the forward direction, solving an issue with freezing proxy connections when backend connection is re-opened.

@jrajahalme jrajahalme requested a review from a team as a code owner July 8, 2025 19:40
@jrajahalme jrajahalme requested a review from YutaroHayakawa July 8, 2025 19:40
@jrajahalme jrajahalme added kind/bug This is a bug in the Cilium logic. area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. labels Jul 8, 2025
@jrajahalme jrajahalme added area/proxy Impacts proxy components, including DNS, Kafka, Envoy and/or XDS servers. release-note/bug This PR fixes an issue in a previous release of Cilium. release-blocker/1.17 This issue will prevent the release of the next version of Cilium. needs-backport/1.17 This PR / issue needs backporting to the v1.17 branch release-blocker/1.18 This issue will prevent the release of the next version of Cilium. needs-backport/1.18 This PR / issue needs backporting to the v1.18 branch labels Jul 8, 2025
@jrajahalme
Copy link
Member Author

/test

Copy link
Member

@borkmann borkmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm! Do we also have a Fixes tag with the commit that changed this behavior compared to 1.15? Would be nice to add the Fixes tag to the commit desc.

@jrajahalme jrajahalme force-pushed the bpf-conntrack-recreate-fix branch from cef35d8 to d2fd754 Compare July 9, 2025 13:27
@jrajahalme
Copy link
Member Author

/test

@jrajahalme jrajahalme removed the request for review from YutaroHayakawa July 9, 2025 13:33
Require TCP ACK flag to not be set when SYN is set to recreate a CT entry.

This addresses the problem where CT entry is created in the reply
direction without a proxy redirect flag when a forward direction CT entry
with proxy redirect flag already exists, when a packet from an Envoy
upstream connection destination reaches bpf_lxc of the source pod.

The CT proxy redirect flag exists for the purpose of routing reply
direction packets to the proxy when they reach the source pod's bpf_lxc
program. Recreting the CT entry on the basis of the TCP SYN flag without
requiring the ACK flag to be unset defeats this purpose and stalls
traffic on source pod/Envoy (downstream) connection.

Example of creation of CT entry in the reply direction (only showing
reply direction flows, SYN/ACK in the middle is for a proxy upstream
connection that needs to be redirected to the proxy instead of the source
pod):

-> endpoint 73 flow 0xba1ee241 , identity 60249->44892 state reply ifindex lxce2292d80c218 orig-ip 10.244.0.215: 10.244.0.215:80 -> 10.244.1.234:39194 tcp SYN, ACK
-> endpoint 73 flow 0xba1ee241 , identity 60249->44892 state reply ifindex lxce2292d80c218 orig-ip 10.244.0.215: 10.244.0.215:80 -> 10.244.1.234:39194 tcp ACK
-> endpoint 73 flow 0xba1ee241 , identity 60249->44892 state reply ifindex lxce2292d80c218 orig-ip 10.244.0.215: 10.244.0.215:80 -> 10.244.1.234:39194 tcp ACK
-> endpoint 73 flow 0x61a4168c , identity 60249->44892 state new ifindex lxce2292d80c218 orig-ip 10.244.0.215: 10.244.0.215:80 -> 10.244.1.234:39194 tcp SYN, ACK
-> endpoint 73 flow 0xba1ee241 , identity 60249->44892 state reply ifindex lxce2292d80c218 orig-ip 10.244.0.215: 10.244.0.215:80 -> 10.244.1.234:39194 tcp ACK
-> endpoint 73 flow 0x1fa7df2a , identity 60249->44892 state established ifindex lxce2292d80c218 orig-ip 10.244.0.215: 10.244.0.215:80 -> 10.244.1.234:39194 tcp SYN, ACK
-> endpoint 73 flow 0xba1ee241 , identity 60249->44892 state reply ifindex lxce2292d80c218 orig-ip 10.244.0.215: 10.244.0.215:80 -> 10.244.1.234:39194 tcp ACK

CT entries, the second one being created in the reply direction:

TCP OUT 10.244.1.234:44892 -> 10.244.0.215:80 expires=1595143 Packets=0 Bytes=0 RxFlagsSeen=0x1b LastRxReport=1587142 TxFlagsSeen=0x00 LastTxReport=1587115 Flags=0x0051 [ RxClosing SeenNonSyn ProxyRedirect ] RevNAT=0 SourceSecurityID=44892 BackendID=0
TCP IN 10.244.0.215:80 -> 10.244.1.234:44892 expires=1595143 Packets=0 Bytes=0 RxFlagsSeen=0x12 LastRxReport=1587116 TxFlagsSeen=0x19 LastTxReport=1587142 Flags=0x0412 [ TxClosing SeenNonSyn FromTunnel ] RevNAT=0 SourceSecurityID=60249 BackendID=0

Requiring the ACK flag be cleared when seeing the SYN flag being set
prevents the second CT entry from being created.

Related: cilium#32653
Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
@jrajahalme jrajahalme force-pushed the bpf-conntrack-recreate-fix branch from d2fd754 to 2678877 Compare July 9, 2025 14:11
@jrajahalme
Copy link
Member Author

rebased in hopes to get rid of a verifier test failure:

    verifier_test.go:238: 	tail_handle_nat_fwd_ipv4: 619 insns
    verifier_test.go:245: Error: program tail_nodeport_nat_egress_ipv4: load program: permission denied: 1030: (71) r1 = *(u8 *)(r9 +23): R9 invalid mem access 'inv' (646 line(s) omitted)
        Verifier error tail: load program: permission denied:
        	(638 line(s) omitted)
        	1026: (61) r1 = *(u32 *)(r10 -240)
        	; if (tunnel_endpoint) {
        	1027: (54) w1 &= 1
        	1028: (56) if w1 != 0x0 goto pc+178
        	 R0=inv(id=0) R1_w=inv0 R2_w=map_value(id=0,off=0,ks=14,vs=40,imm=0) R3_w=ctx(id=0,off=0,imm=0) R6=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R7=inv0 R8=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R9=inv(id=0,umax_value=2147483495,var_off=(0x0; 0x7fffff67)) R10=fp0 fp-48=mmmmmmmm fp-56=mmmmmmmm fp-64=mmmmmmmm fp-72=mmmmmmmm fp-80=mmmmmmmm fp-112=??mmmmmm fp-120=mmmmmmmm fp-128=????0000 fp-136=00000000 fp-144=00000000 fp-152=00000000 fp-160=00000000 fp-168=00000000 fp-176=00000000 fp-184=0000mmmm fp-192=0000000m fp-200=??mmmmmm fp-208=mmmmmmmm fp-216=ctx fp-224=????mmmm fp-232=inv fp-240=????mmmm fp-248=????mmmm fp-256=00000000 fp-264=map_value fp-272=????mmmm fp-280=????mmmm fp-288=????mmmm
        	1029: (79) r9 = *(u64 *)(r10 -256)
        	; if (info->flag_ipv6_tunnel_ep)
        	1030: (71) r1 = *(u8 *)(r9 +23)
        	R9 invalid mem access 'inv'
        	processed 3650 insns (limit 1000000) max_states_per_insn 6 total_states 206 peak_states 205 mark_read 45

@jrajahalme
Copy link
Member Author

/test

@joestringer joestringer moved this from Proposed to Active in Release blockers Jul 9, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jul 9, 2025
@jrajahalme jrajahalme added this pull request to the merge queue Jul 9, 2025
Merged via the queue into cilium:main with commit 92a3319 Jul 9, 2025
68 checks passed
@jrajahalme jrajahalme deleted the bpf-conntrack-recreate-fix branch July 9, 2025 18:43
@github-project-automation github-project-automation bot moved this from Active to Done in Release blockers Jul 9, 2025
@jrajahalme jrajahalme added backport/author The backport will be carried out by the author of the PR. and removed needs-backport/1.18 This PR / issue needs backporting to the v1.18 branch labels Jul 9, 2025
@jrajahalme jrajahalme added backport-pending/1.18 The backport for Cilium 1.18.x for this PR is in progress. needs-backport/1.16 This PR / issue needs backporting to the v1.16 branch backport-pending/1.17 The backport for Cilium 1.17.x for this PR is in progress. release-blocker/1.16 This issue will prevent the release of the next version of Cilium. backport-pending/1.16 The backport for Cilium 1.16.x for this PR is in progress. and removed needs-backport/1.17 This PR / issue needs backporting to the v1.17 branch needs-backport/1.16 This PR / issue needs backporting to the v1.16 branch labels Jul 9, 2025
@joestringer joestringer added backport-done/1.16 The backport for Cilium 1.16.x for this PR is done. backport-done/1.17 The backport for Cilium 1.17.x for this PR is done. backport-done/1.18 The backport for Cilium 1.18.x for this PR is done. and removed backport-pending/1.16 The backport for Cilium 1.16.x for this PR is in progress. backport-pending/1.17 The backport for Cilium 1.17.x for this PR is in progress. backport-pending/1.18 The backport for Cilium 1.18.x for this PR is in progress. labels Jul 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. area/proxy Impacts proxy components, including DNS, Kafka, Envoy and/or XDS servers. backport/author The backport will be carried out by the author of the PR. backport-done/1.16 The backport for Cilium 1.16.x for this PR is done. backport-done/1.17 The backport for Cilium 1.17.x for this PR is done. backport-done/1.18 The backport for Cilium 1.18.x for this PR is done. kind/bug This is a bug in the Cilium logic. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-blocker/1.16 This issue will prevent the release of the next version of Cilium. release-blocker/1.17 This issue will prevent the release of the next version of Cilium. release-blocker/1.18 This issue will prevent the release of the next version of Cilium. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants