-
-
Notifications
You must be signed in to change notification settings - Fork 68
bpf_metadata: Use original source address even if the destination is external #742
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…external Use the original source address if permitted even if the destination is external to the cluster. Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
4916203
to
c28a13f
Compare
// locally allocated identity, is not classified as WORLD, and the destination is not in the | ||
// same node. | ||
} else if (!use_original_source_address_ || npmap_->exists(other_ip)) { | ||
// Otherwise only use the original source address if permitted and the destination is not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: the comment reads a little bit strange. Otherwise
would more relate to the else
case (instead of else if
).
I would rephrase this to something like this (to help documenting the actual condition)
Don't use the original source address if not configured or the destination is on the same node.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I intended to document the whole else if
. It now reads "Otherwise if ". I guess this depends on how you read the comment; by itself, or as "else "?
Hi @jrajahalme , may I confirm as of this PR, proxy uses original src IP even for go-to-world traffic, like pod to 1.1.1.1 ? Edit: nevermind, I was mistaken. |
We have an iptables rule to set 0x200 mark for transparent socket: ``` *mangle -A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle -A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff ``` This rule is in the mangle PREROUTING which checks packets ingressed from a netdev. Let's then focus on the pod to world traffic when IPsec=on + proxy=on + tunnel=off. Currently, a pod-to-world packet will go through the path: 1. @lxc ingress: skb->mark is set to 0x200 and returned to stack 2. @stack iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy 3. @Proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world) 4. @stack routing: the new skb is routed to eth0 5. @stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain 6. @eth0 egress: the new skb is going to world Please note the new skb won't hit PREROUTING chain, where there is a rule setting skb->mark=0x200. To fix #31984, we are going to change the routing for packets from egress proxy: on the step 4 above, the new skb will be routed to cilium_host instead: 4. @stack routing: the new skb is routed to cilium_host 5. @cilium_host egress: the new skb is returned to stack 6. @cilium_net ingress: the new skb is returned to stack 7. @stack routing: the new skb is routed to eth0 8. @stack iptables: the new skb is traversing PREROUTING, FORWARD, POSTROUTING Look at step 8, we are hitting PREROUTING! Because of cilium/proxy#742, this to-world skb is also linked to a transparent socket, matching the "-m socket --transparent" condition. If we do nothing, this to-world skb will be set 0x200 mark, and then hit routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to local. It should have gone to the world. This patch fixes this future issue as a precaution (otherwise we'll break git-bisect). This patch provides a straightforward solution: at step 5 @cilium_host, we set a specical mark 0x800 (MARK_MAGIC_GOTO_WORLD), then iptables can exclude this mark using "-m mark ! --mark 0x800/0xf00". Signed-off-by: gray <gray.liang@isovalent.com>
After cilium/proxy#742, proxy traffic keeps original pod IP as source IP for to-world packets, which must be masqueraded to eth0 IP. there is no issue for now, but the new routing rule (0xb00 lookup 2005) to be added for #31984 will cause a side effect breaking masquerading. This patch fixes the that side effect as a precaution, otherwise git-bisect breaks. The new routing rule (0xb00 lookup 2005) will cause proxy packets going through POSTROUTING for twice: first time happens when proxy sends packets which are routed to cilium_host, these are hitting OUTPUT + **POSTROUTING**; the second time takes place after packets ingressed from cilium_net, these skbs will traverse PREROUTING + FORWARD + **POSTROUTING**. However, due to kernel's implementation details, an skb won't be processed by nat POSTROUTING for twice: after the first POSTROUTING traversal, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status IPS_SRC_NAT_DONE to skip the further traversal at all. To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first round iptables ct, just for proxy traffic which is characterized by 0xb00 mark. Signed-off-by: gray <gray.liang@isovalent.com>
We have an iptables rule to set 0x200 mark for transparent socket: ``` *mangle -A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle -A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff ``` This rule is in the mangle PREROUTING which checks packets ingressed from a netdev. Let's then focus on the pod to world traffic when IPsec=on + proxy=on + tunnel=off. Currently, a pod-to-world packet will go through the path: 1. @lxc ingress: skb->mark is set to 0x200 and returned to stack 2. @stack iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy 3. @Proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world) 4. @stack routing: the new skb is routed to eth0 5. @stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain 6. @eth0 egress: the new skb is going to world Please note the new skb won't hit PREROUTING chain, where there is a rule setting skb->mark=0x200. To fix #31984, we are going to change the routing for packets from egress proxy; consequently, on the step 4 above, the new skb will be routed to cilium_host instead: 4. @stack routing: the new skb is routed to cilium_host 5. @cilium_host egress: the new skb is returned to stack 6. @cilium_net ingress: the new skb is returned to stack 7. @stack routing: the new skb is routed to eth0 8. @stack iptables: the new skb is traversing PREROUTING, FORWARD, POSTROUTING Look at step 8, we are hitting PREROUTING! Because of cilium/proxy#742, this to-world skb is also linked to a transparent socket, matching the "-m socket --transparent" condition, the packet will fortunately have the 0x200 mark. If we do nothing, this to-world skb marked with 0x200 will then hit routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to local. It should have gone to the world. This patch fixes this future issue as a precaution (otherwise we'll break git-bisect). This patch provides a straightforward solution: at step 5 @cilium_host, we set a specical mark 0x800 (MARK_MAGIC_GOTO_WORLD), then iptables can exclude this mark using "-m mark ! --mark 0x800/0xf00". Signed-off-by: gray <gray.liang@isovalent.com>
After cilium/proxy#742, proxy traffic keeps original pod IP as source IP for to-world packets, which must be masqueraded to eth0 IP. there is no issue for now, but the new routing rule (0xb00 lookup 2005) to be added for #31984 will cause a side effect breaking masquerading. This patch fixes the that side effect as a precaution, otherwise git-bisect breaks. The new routing rule (0xb00 lookup 2005) will cause proxy packets going through POSTROUTING for twice: first time happens when proxy sends packets which are routed to cilium_host, these are hitting OUTPUT + **POSTROUTING**; the second time takes place after packets ingressed from cilium_net, these skbs will traverse PREROUTING + FORWARD + **POSTROUTING**. However, due to kernel's implementation details, an skb won't be processed by nat POSTROUTING for twice: after the first POSTROUTING traversal, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status IPS_SRC_NAT_DONE to skip the further traversal at all. To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first round iptables ct, just for proxy traffic which is characterized by 0xb00 mark. Signed-off-by: gray <gray.liang@isovalent.com>
We have an iptables rule to set 0x200 mark for transparent socket: ``` *mangle -A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle -A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff ``` This rule is in the mangle PREROUTING which checks packets ingressed from a netdev. Let's then focus on the pod to world traffic when IPsec=on + proxy=on + tunnel=off. Currently, a pod-to-world packet will go through the path: 1. @lxc ingress: skb->mark is set to 0x200 and returned to stack 2. @stack iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy 3. @Proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world) 4. @stack routing: the new skb is routed to eth0 5. @stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain 6. @eth0 egress: the new skb is going to world Please note the new skb won't hit PREROUTING chain, where there is a rule setting skb->mark=0x200. To fix #31984, we are going to change the routing for packets from egress proxy; consequently, on the step 4 above, the new skb will be routed to cilium_host instead: 4. @stack routing: the new skb is routed to cilium_host 5. @cilium_host egress: the new skb is returned to stack 6. @cilium_net ingress: the new skb is returned to stack 7. @stack routing: the new skb is routed to eth0 8. @stack iptables: the new skb is traversing PREROUTING, FORWARD, POSTROUTING Look at step 8, we are hitting PREROUTING! Because of cilium/proxy#742, this to-world skb is also linked to a transparent socket, matching the "-m socket --transparent" condition, the packet will fortunately have the 0x200 mark. If we do nothing, this to-world skb marked with 0x200 will then hit routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to local. It should have gone to the world. This patch fixes this future issue as a precaution (otherwise we'll break git-bisect). This patch provides a straightforward solution: at step 5 @cilium_host, we set a specical mark 0x800 (MARK_MAGIC_GOTO_WORLD), then iptables can exclude this mark using "-m mark ! --mark 0x800/0xf00". Signed-off-by: gray <gray.liang@isovalent.com>
After cilium/proxy#742, proxy traffic keeps original pod IP as source IP for to-world packets, which must be masqueraded to eth0 IP. there is no issue for now, but the new routing rule (0xb00 lookup 2005) to be added for #31984 will cause a side effect breaking masquerading. This patch fixes the that side effect as a precaution, otherwise git-bisect breaks. The new routing rule (0xb00 lookup 2005) will cause proxy packets going through POSTROUTING for twice: first time happens when proxy sends packets which are routed to cilium_host, these are hitting OUTPUT + **POSTROUTING**; the second time takes place after packets ingressed from cilium_net, these skbs will traverse PREROUTING + FORWARD + **POSTROUTING**. However, due to kernel's implementation details, an skb won't be processed by nat POSTROUTING for twice: after the first POSTROUTING traversal, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status IPS_SRC_NAT_DONE to skip the further traversal at all. To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first round iptables ct, just for proxy traffic which is characterized by 0xb00 mark. Signed-off-by: gray <gray.liang@isovalent.com>
We have an iptables rule to set 0x200 mark for transparent socket: ``` *mangle -A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle -A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff ``` This rule is in the mangle PREROUTING which checks packets ingressed from a netdev. Let's then focus on the pod to world traffic when IPsec=on + proxy=on + tunnel=off. Currently, a pod-to-world packet will go through the path: 1. @lxc ingress: skb->mark is set to 0x200 and returned to stack 2. @stack iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy 3. @Proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world) 4. @stack routing: the new skb is routed to eth0 5. @stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain 6. @eth0 egress: the new skb is going to world Please note the new skb won't hit PREROUTING chain, where there is a rule setting skb->mark=0x200. To fix #31984, we are going to change the routing for packets from egress proxy; consequently, on the step 4 above, the new skb will be routed to cilium_host instead: 4. @stack routing: the new skb is routed to cilium_host 5. @cilium_host egress: the new skb is returned to stack 6. @cilium_net ingress: the new skb is returned to stack 7. @stack routing: the new skb is routed to eth0 8. @stack iptables: the new skb is traversing PREROUTING, FORWARD, POSTROUTING Look at step 8, we are hitting PREROUTING! Because of cilium/proxy#742, this to-world skb is also linked to a transparent socket, matching the "-m socket --transparent" condition, the packet will fortunately have the 0x200 mark. If we do nothing, this to-world skb marked with 0x200 will then hit routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to local. It should have gone to the world. This patch fixes this future issue as a precaution (otherwise we'll break git-bisect). This patch provides a straightforward solution: at step 5 @cilium_host, we set a specical mark 0x800 (MARK_MAGIC_PROXY_TO_WORLD), then iptables can exclude this mark using "-m mark ! --mark 0x800/0xf00". Signed-off-by: gray <gray.liang@isovalent.com>
After cilium/proxy#742, proxy traffic keeps original pod IP as source IP for to-world packets, which must be masqueraded to eth0 IP. there is no issue for now, but the new routing rule (0xb00 lookup 2005) to be added for #31984 will cause a side effect breaking masquerading. This patch fixes the that side effect as a precaution, otherwise git-bisect breaks. The new routing rule (0xb00 lookup 2005) will cause proxy packets going through POSTROUTING for twice: first time happens when proxy sends packets which are routed to cilium_host, these are hitting OUTPUT + **POSTROUTING**; the second time takes place after packets ingressed from cilium_net, these skbs will traverse PREROUTING + FORWARD + **POSTROUTING**. However, due to kernel's implementation details, an skb won't be processed by nat POSTROUTING for twice: after the first POSTROUTING traversal, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status IPS_SRC_NAT_DONE to skip the further traversal at all. To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first round iptables ct, just for proxy traffic which is characterized by 0xb00 mark. Signed-off-by: gray <gray.liang@isovalent.com>
We have an iptables rule to set 0x200 mark for transparent socket: ``` *mangle -A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle -A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff ``` This rule is in the mangle PREROUTING which checks packets ingressed from a netdev. Let's then focus on the pod to world traffic when IPsec=on + proxy=on + tunnel=off. Currently, a pod-to-world packet will go through the path: 1. @lxc ingress: skb->mark is set to 0x200 and returned to stack 2. @stack iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy 3. @Proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world) 4. @stack routing: the new skb is routed to eth0 5. @stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain 6. @eth0 egress: the new skb is going to world Please note the new skb won't hit PREROUTING chain, where there is a rule setting skb->mark=0x200. To fix #31984, we are going to change the routing for packets from egress proxy; consequently, on the step 4 above, the new skb will be routed to cilium_host instead: 4. @stack routing: the new skb is routed to cilium_host 5. @cilium_host egress: the new skb is returned to stack 6. @cilium_net ingress: the new skb is returned to stack 7. @stack routing: the new skb is routed to eth0 8. @stack iptables: the new skb is traversing PREROUTING, FORWARD, POSTROUTING Look at step 8, we are hitting PREROUTING! Because of cilium/proxy#742, this to-world skb is also linked to a transparent socket, matching the "-m socket --transparent" condition, the packet will fortunately have the 0x200 mark. If we do nothing, this to-world skb marked with 0x200 will then hit routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to local. It should have gone to the world. This patch fixes this future issue as a precaution (otherwise we'll break git-bisect). This patch provides a straightforward solution: at step 5 @cilium_host, we set a specical mark 0x800 (MARK_MAGIC_PROXY_TO_WORLD), then iptables can exclude this mark using "-m mark ! --mark 0x800/0xf00". Signed-off-by: gray <gray.liang@isovalent.com>
After cilium/proxy#742, proxy traffic keeps original pod IP as source IP for to-world packets, which must be masqueraded to eth0 IP. there is no issue for now, but the new routing rule (0xb00 lookup 2005) to be added for #31984 will cause a side effect breaking masquerading. This patch fixes the that side effect as a precaution, otherwise git-bisect breaks. The new routing rule (0xb00 lookup 2005) will cause proxy packets going through POSTROUTING for twice: first time happens when proxy sends packets which are routed to cilium_host, these are hitting OUTPUT + **POSTROUTING**; the second time takes place after packets ingressed from cilium_net, these skbs will traverse PREROUTING + FORWARD + **POSTROUTING**. However, due to kernel's implementation details, an skb won't be processed by nat POSTROUTING for twice: after the first POSTROUTING traversal, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status IPS_SRC_NAT_DONE to skip the further traversal at all. To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first round iptables ct, just for proxy traffic which is characterized by 0xb00 mark. Signed-off-by: gray <gray.liang@isovalent.com>
After cilium/proxy#742, proxy traffic keeps original pod IP as source IP for to-world packets, which must be masqueraded to eth0 IP. There is no issue for now, but the new routing rule (0xb00 lookup 2005) to be added for #31984 will cause a side effect breaking masquerading. This patch fixes the that side effect as a precaution, otherwise git-bisect breaks. The new routing rule (0xb00 lookup 2005) will cause proxy packets going through POSTROUTING for twice: first time happens when proxy sends packets which are routed to cilium_host, these are hitting OUTPUT + **POSTROUTING**; the second time takes place after packets ingressed from cilium_net, these skbs will traverse PREROUTING + FORWARD + **POSTROUTING**. However, due to kernel's implementation details, an skb won't be processed by nat POSTROUTING for twice: after the first POSTROUTING check, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status IPS_SRC_NAT_DONE to skip the further traversal at all. To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first round iptables ct, just for proxy traffic which is characterized by 0xb00 mark. Signed-off-by: gray <gray.liang@isovalent.com>
After cilium/proxy#742, proxy traffic keeps original pod IP as source IP for to-world packets, which must be masqueraded to eth0 IP. There is no issue for now, but the new routing rule (0xb00 lookup 2005) to be added for #31984 will cause a side effect breaking masquerading. This patch fixes the that side effect as a precaution, otherwise git-bisect breaks. The new routing rule (0xb00 lookup 2005) will cause proxy packets going through POSTROUTING for twice: first time happens when proxy sends packets which are routed to cilium_host, these are hitting OUTPUT + **POSTROUTING**; the second time takes place after packets ingressed from cilium_net, these skbs will traverse PREROUTING + FORWARD + **POSTROUTING**. However, due to kernel's implementation details, an skb won't be processed by nat POSTROUTING for twice: after the first POSTROUTING check, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status IPS_SRC_NAT_DONE to skip the further traversal at all. [1] To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first round iptables ct, just for proxy traffic which is characterized by 0xb00 mark. [1] https://elixir.bootlin.com/linux/v6.6.2/source/net/netfilter/nf_nat_core.c#L825 [1] https://elixir.bootlin.com/linux/v6.6.2/source/include/net/netfilter/nf_nat.h#L111 Signed-off-by: gray <gray.liang@isovalent.com>
We have an iptables rule to set 0x200 mark for transparent socket: ``` *mangle -A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle -A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff ``` This rule is in the mangle PREROUTING which checks packets ingressed from a netdev. Let's then focus on the pod to world traffic when IPsec=on + proxy=on + tunnel=off. Currently, a pod-to-world packet will go through the path: 1. from_lxc@lxc: skb->mark is set to 0x200 and returned to stack 2. iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy 3. proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world) 4. stack routing: the new skb is routed to eth0 5. stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain 6. to_netdev@eth0: the new skb is going to world Please note the new skb won't hit PREROUTING chain, where there is a rule setting skb->mark=0x200. To fix #31984, we are going to change the routing for packets from egress proxy; consequently, on the step 4 above, the new skb will be routed to cilium_host instead: 4. stack routing: the new skb is routed to cilium_host 5. from_host@cilium_host: the new skb is returned to stack 6. to_host@cilium_net: the new skb is returned to stack 7. stack: PREROUTING, routing, FORWARD, POSTROUTING Look at step 7, we are hitting PREROUTING! Because of cilium/proxy#742, this to-world skb is also linked to a transparent socket, matching the "-m socket --transparent" condition, the packet will fortunately have the 0x200 mark. If we do nothing, this to-world skb marked with 0x200 will then hit routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to local. It should have gone to the world. This patch fixes this future issue as a precaution (otherwise we'll break git-bisect). This patch provides a straightforward solution: at step 5 from_host@cilium_host, we set a specical mark 0x800 (MARK_MAGIC_PROXY_TO_WORLD), then iptables can exclude this mark using "-m mark ! --mark 0x800/0xf00". Signed-off-by: gray <gray.liang@isovalent.com>
After cilium/proxy#742, proxy traffic keeps original pod IP as source IP for to-world packets, which must be masqueraded to eth0 IP. There is no issue for now, but the new routing rule (0xb00 lookup 2005) to be added for #31984 will cause a side effect breaking masquerading. This patch fixes the that side effect as a precaution, otherwise git-bisect breaks. The new routing rule (0xb00 lookup 2005) will cause proxy packets going through POSTROUTING for twice: first time happens when proxy sends packets which are routed to cilium_host, these are hitting OUTPUT + **POSTROUTING**; the second time takes place after packets ingressed from cilium_net, these skbs will traverse PREROUTING + FORWARD + **POSTROUTING**. However, due to kernel's implementation details, an skb won't be processed by nat POSTROUTING for twice: after the first POSTROUTING check, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status IPS_SRC_NAT_DONE to skip the further traversal at all. [1] To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first round iptables ct, just for proxy traffic which is characterized by 0xb00 mark. [1] https://elixir.bootlin.com/linux/v6.6.2/source/net/netfilter/nf_nat_core.c#L825 [1] https://elixir.bootlin.com/linux/v6.6.2/source/include/net/netfilter/nf_nat.h#L111 Signed-off-by: gray <gray.liang@isovalent.com>
We have an iptables rule to set 0x200 mark for transparent socket: ``` *mangle -A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle -A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff ``` This rule is in the mangle PREROUTING which checks packets ingressed from a netdev. Let's then focus on the pod to world traffic when IPsec=on + proxy=on + tunnel=off. Currently, a pod-to-world packet will go through the path: 1. from_lxc@lxc: skb->mark is set to 0x200 and returned to stack 2. iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy 3. proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world) 4. stack routing: the new skb is routed to eth0 5. stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain 6. to_netdev@eth0: the new skb is going to world Please note the new skb won't hit PREROUTING chain, where there is a rule setting skb->mark=0x200. To fix #31984, we are going to change the routing for packets from egress proxy; consequently, on the step 4 above, the new skb will be routed to cilium_host instead: 4. stack routing: the new skb is routed to cilium_host 5. from_host@cilium_host: the new skb is returned to stack 6. to_host@cilium_net: the new skb is returned to stack 7. stack: PREROUTING, routing, FORWARD, POSTROUTING Look at step 7, we are hitting PREROUTING! Because of cilium/proxy#742, this to-world skb is also linked to a transparent socket, matching the "-m socket --transparent" condition, the packet will fortunately have the 0x200 mark. If we do nothing, this to-world skb marked with 0x200 will then hit routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to local. It should have gone to the world. This patch fixes this future issue as a precaution (otherwise we'll break git-bisect). This patch provides a straightforward solution: at step 5 from_host@cilium_host, we set a specical mark 0x800 (MARK_MAGIC_PROXY_TO_WORLD), then iptables can exclude this mark using "-m mark ! --mark 0x800/0xf00". Signed-off-by: gray <gray.liang@isovalent.com>
After cilium/proxy#742, proxy traffic keeps original pod IP as source IP for to-world packets, which must be masqueraded to eth0 IP. There is no issue for now, but the new routing rule (0xb00 lookup 2005) to be added for #31984 will cause a side effect breaking masquerading. This patch fixes the that side effect as a precaution, otherwise git-bisect breaks. The new routing rule (0xb00 lookup 2005) will cause proxy packets going through POSTROUTING for twice: first time happens when proxy sends packets which are routed to cilium_host, these are hitting OUTPUT + **POSTROUTING**; the second time takes place after packets ingressed from cilium_net, these skbs will traverse PREROUTING + FORWARD + **POSTROUTING**. However, due to kernel's implementation details, an skb won't be processed by nat POSTROUTING for twice: after the first POSTROUTING check, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status IPS_SRC_NAT_DONE to skip the further traversal at all. [1] To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first round iptables ct, just for proxy traffic which is characterized by 0xb00 mark. [1] https://elixir.bootlin.com/linux/v6.6.2/source/net/netfilter/nf_nat_core.c#L825 [1] https://elixir.bootlin.com/linux/v6.6.2/source/include/net/netfilter/nf_nat.h#L111 Signed-off-by: gray <gray.liang@isovalent.com>
[ upstream commit: 3384d73 ] After cilium/proxy#742, proxy traffic keeps original pod IP as source IP for to-world packets, which must be masqueraded to eth0 IP. There is no issue for now, but the new routing rule (0xb00 lookup 2005) to be added for #31984 will cause a side effect breaking masquerading. This patch fixes the that side effect as a precaution, otherwise git-bisect breaks. The new routing rule (0xb00 lookup 2005) will cause proxy packets going through POSTROUTING for twice: first time happens when proxy sends packets which are routed to cilium_host, these are hitting OUTPUT + **POSTROUTING**; the second time takes place after packets ingressed from cilium_net, these skbs will traverse PREROUTING + FORWARD + **POSTROUTING**. However, due to kernel's implementation details, an skb won't be processed by nat POSTROUTING for twice: after the first POSTROUTING check, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status IPS_SRC_NAT_DONE to skip the further traversal at all. [1] To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first round iptables ct, just for proxy traffic which is characterized by 0xb00 mark. [1] https://elixir.bootlin.com/linux/v6.6.2/source/net/netfilter/nf_nat_core.c#L825 [1] https://elixir.bootlin.com/linux/v6.6.2/source/include/net/netfilter/nf_nat.h#L111 Signed-off-by: gray <gray.liang@isovalent.com>
[ upstream commit: f93a40c ] We have an iptables rule to set 0x200 mark for transparent socket: ``` *mangle -A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle -A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff ``` This rule is in the mangle PREROUTING which checks packets ingressed from a netdev. Let's then focus on the pod to world traffic when IPsec=on + proxy=on + tunnel=off. Currently, a pod-to-world packet will go through the path: 1. from_lxc@lxc: skb->mark is set to 0x200 and returned to stack 2. iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy 3. proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world) 4. stack routing: the new skb is routed to eth0 5. stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain 6. to_netdev@eth0: the new skb is going to world Please note the new skb won't hit PREROUTING chain, where there is a rule setting skb->mark=0x200. To fix #31984, we are going to change the routing for packets from egress proxy; consequently, on the step 4 above, the new skb will be routed to cilium_host instead: 4. stack routing: the new skb is routed to cilium_host 5. from_host@cilium_host: the new skb is returned to stack 6. to_host@cilium_net: the new skb is returned to stack 7. stack: PREROUTING, routing, FORWARD, POSTROUTING Look at step 7, we are hitting PREROUTING! Because of cilium/proxy#742, this to-world skb is also linked to a transparent socket, matching the "-m socket --transparent" condition, the packet will fortunately have the 0x200 mark. If we do nothing, this to-world skb marked with 0x200 will then hit routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to local. It should have gone to the world. This patch fixes this future issue as a precaution (otherwise we'll break git-bisect). This patch provides a straightforward solution: at step 5 from_host@cilium_host, we set a specical mark 0x800 (MARK_MAGIC_PROXY_TO_WORLD), then iptables can exclude this mark using "-m mark ! --mark 0x800/0xf00". Signed-off-by: gray <gray.liang@isovalent.com>
[ upstream commit: 3384d73 ] After cilium/proxy#742, proxy traffic keeps original pod IP as source IP for to-world packets, which must be masqueraded to eth0 IP. There is no issue for now, but the new routing rule (0xb00 lookup 2005) to be added for #31984 will cause a side effect breaking masquerading. This patch fixes the that side effect as a precaution, otherwise git-bisect breaks. The new routing rule (0xb00 lookup 2005) will cause proxy packets going through POSTROUTING for twice: first time happens when proxy sends packets which are routed to cilium_host, these are hitting OUTPUT + **POSTROUTING**; the second time takes place after packets ingressed from cilium_net, these skbs will traverse PREROUTING + FORWARD + **POSTROUTING**. However, due to kernel's implementation details, an skb won't be processed by nat POSTROUTING for twice: after the first POSTROUTING check, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status IPS_SRC_NAT_DONE to skip the further traversal at all. [1] To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first round iptables ct, just for proxy traffic which is characterized by 0xb00 mark. [1] https://elixir.bootlin.com/linux/v6.6.2/source/net/netfilter/nf_nat_core.c#L825 [1] https://elixir.bootlin.com/linux/v6.6.2/source/include/net/netfilter/nf_nat.h#L111 Signed-off-by: gray <gray.liang@isovalent.com>
[ upstream commit: f93a40c ] We have an iptables rule to set 0x200 mark for transparent socket: ``` *mangle -A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle -A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff ``` This rule is in the mangle PREROUTING which checks packets ingressed from a netdev. Let's then focus on the pod to world traffic when IPsec=on + proxy=on + tunnel=off. Currently, a pod-to-world packet will go through the path: 1. from_lxc@lxc: skb->mark is set to 0x200 and returned to stack 2. iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy 3. proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world) 4. stack routing: the new skb is routed to eth0 5. stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain 6. to_netdev@eth0: the new skb is going to world Please note the new skb won't hit PREROUTING chain, where there is a rule setting skb->mark=0x200. To fix #31984, we are going to change the routing for packets from egress proxy; consequently, on the step 4 above, the new skb will be routed to cilium_host instead: 4. stack routing: the new skb is routed to cilium_host 5. from_host@cilium_host: the new skb is returned to stack 6. to_host@cilium_net: the new skb is returned to stack 7. stack: PREROUTING, routing, FORWARD, POSTROUTING Look at step 7, we are hitting PREROUTING! Because of cilium/proxy#742, this to-world skb is also linked to a transparent socket, matching the "-m socket --transparent" condition, the packet will fortunately have the 0x200 mark. If we do nothing, this to-world skb marked with 0x200 will then hit routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to local. It should have gone to the world. This patch fixes this future issue as a precaution (otherwise we'll break git-bisect). This patch provides a straightforward solution: at step 5 from_host@cilium_host, we set a specical mark 0x800 (MARK_MAGIC_PROXY_TO_WORLD), then iptables can exclude this mark using "-m mark ! --mark 0x800/0xf00". Signed-off-by: gray <gray.liang@isovalent.com>
[ upstream commit: 3384d73 ] After cilium/proxy#742, proxy traffic keeps original pod IP as source IP for to-world packets, which must be masqueraded to eth0 IP. There is no issue for now, but the new routing rule (0xb00 lookup 2005) to be added for #31984 will cause a side effect breaking masquerading. This patch fixes the that side effect as a precaution, otherwise git-bisect breaks. The new routing rule (0xb00 lookup 2005) will cause proxy packets going through POSTROUTING for twice: first time happens when proxy sends packets which are routed to cilium_host, these are hitting OUTPUT + **POSTROUTING**; the second time takes place after packets ingressed from cilium_net, these skbs will traverse PREROUTING + FORWARD + **POSTROUTING**. However, due to kernel's implementation details, an skb won't be processed by nat POSTROUTING for twice: after the first POSTROUTING check, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status IPS_SRC_NAT_DONE to skip the further traversal at all. [1] To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first round iptables ct, just for proxy traffic which is characterized by 0xb00 mark. [1] https://elixir.bootlin.com/linux/v6.6.2/source/net/netfilter/nf_nat_core.c#L825 [1] https://elixir.bootlin.com/linux/v6.6.2/source/include/net/netfilter/nf_nat.h#L111 Signed-off-by: gray <gray.liang@isovalent.com>
[ upstream commit: f93a40c ] We have an iptables rule to set 0x200 mark for transparent socket: ``` *mangle -A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle -A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff ``` This rule is in the mangle PREROUTING which checks packets ingressed from a netdev. Let's then focus on the pod to world traffic when IPsec=on + proxy=on + tunnel=off. Currently, a pod-to-world packet will go through the path: 1. from_lxc@lxc: skb->mark is set to 0x200 and returned to stack 2. iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy 3. proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world) 4. stack routing: the new skb is routed to eth0 5. stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain 6. to_netdev@eth0: the new skb is going to world Please note the new skb won't hit PREROUTING chain, where there is a rule setting skb->mark=0x200. To fix #31984, we are going to change the routing for packets from egress proxy; consequently, on the step 4 above, the new skb will be routed to cilium_host instead: 4. stack routing: the new skb is routed to cilium_host 5. from_host@cilium_host: the new skb is returned to stack 6. to_host@cilium_net: the new skb is returned to stack 7. stack: PREROUTING, routing, FORWARD, POSTROUTING Look at step 7, we are hitting PREROUTING! Because of cilium/proxy#742, this to-world skb is also linked to a transparent socket, matching the "-m socket --transparent" condition, the packet will fortunately have the 0x200 mark. If we do nothing, this to-world skb marked with 0x200 will then hit routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to local. It should have gone to the world. This patch fixes this future issue as a precaution (otherwise we'll break git-bisect). This patch provides a straightforward solution: at step 5 from_host@cilium_host, we set a specical mark 0x800 (MARK_MAGIC_PROXY_TO_WORLD), then iptables can exclude this mark using "-m mark ! --mark 0x800/0xf00". Signed-off-by: gray <gray.liang@isovalent.com>
[ upstream commit: 3384d73 ] After cilium/proxy#742, proxy traffic keeps original pod IP as source IP for to-world packets, which must be masqueraded to eth0 IP. There is no issue for now, but the new routing rule (0xb00 lookup 2005) to be added for #31984 will cause a side effect breaking masquerading. This patch fixes the that side effect as a precaution, otherwise git-bisect breaks. The new routing rule (0xb00 lookup 2005) will cause proxy packets going through POSTROUTING for twice: first time happens when proxy sends packets which are routed to cilium_host, these are hitting OUTPUT + **POSTROUTING**; the second time takes place after packets ingressed from cilium_net, these skbs will traverse PREROUTING + FORWARD + **POSTROUTING**. However, due to kernel's implementation details, an skb won't be processed by nat POSTROUTING for twice: after the first POSTROUTING check, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status IPS_SRC_NAT_DONE to skip the further traversal at all. [1] To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first round iptables ct, just for proxy traffic which is characterized by 0xb00 mark. [1] https://elixir.bootlin.com/linux/v6.6.2/source/net/netfilter/nf_nat_core.c#L825 [1] https://elixir.bootlin.com/linux/v6.6.2/source/include/net/netfilter/nf_nat.h#L111 Signed-off-by: gray <gray.liang@isovalent.com>
[ upstream commit: f93a40c ] We have an iptables rule to set 0x200 mark for transparent socket: ``` *mangle -A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle -A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff ``` This rule is in the mangle PREROUTING which checks packets ingressed from a netdev. Let's then focus on the pod to world traffic when IPsec=on + proxy=on + tunnel=off. Currently, a pod-to-world packet will go through the path: 1. from_lxc@lxc: skb->mark is set to 0x200 and returned to stack 2. iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy 3. proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world) 4. stack routing: the new skb is routed to eth0 5. stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain 6. to_netdev@eth0: the new skb is going to world Please note the new skb won't hit PREROUTING chain, where there is a rule setting skb->mark=0x200. To fix #31984, we are going to change the routing for packets from egress proxy; consequently, on the step 4 above, the new skb will be routed to cilium_host instead: 4. stack routing: the new skb is routed to cilium_host 5. from_host@cilium_host: the new skb is returned to stack 6. to_host@cilium_net: the new skb is returned to stack 7. stack: PREROUTING, routing, FORWARD, POSTROUTING Look at step 7, we are hitting PREROUTING! Because of cilium/proxy#742, this to-world skb is also linked to a transparent socket, matching the "-m socket --transparent" condition, the packet will fortunately have the 0x200 mark. If we do nothing, this to-world skb marked with 0x200 will then hit routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to local. It should have gone to the world. This patch fixes this future issue as a precaution (otherwise we'll break git-bisect). This patch provides a straightforward solution: at step 5 from_host@cilium_host, we set a specical mark 0x800 (MARK_MAGIC_PROXY_TO_WORLD), then iptables can exclude this mark using "-m mark ! --mark 0x800/0xf00". Signed-off-by: gray <gray.liang@isovalent.com>
[ upstream commit: 3384d73 ] After cilium/proxy#742, proxy traffic keeps original pod IP as source IP for to-world packets, which must be masqueraded to eth0 IP. There is no issue for now, but the new routing rule (0xb00 lookup 2005) to be added for #31984 will cause a side effect breaking masquerading. This patch fixes the that side effect as a precaution, otherwise git-bisect breaks. The new routing rule (0xb00 lookup 2005) will cause proxy packets going through POSTROUTING for twice: first time happens when proxy sends packets which are routed to cilium_host, these are hitting OUTPUT + **POSTROUTING**; the second time takes place after packets ingressed from cilium_net, these skbs will traverse PREROUTING + FORWARD + **POSTROUTING**. However, due to kernel's implementation details, an skb won't be processed by nat POSTROUTING for twice: after the first POSTROUTING check, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status IPS_SRC_NAT_DONE to skip the further traversal at all. [1] To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first round iptables ct, just for proxy traffic which is characterized by 0xb00 mark. [1] https://elixir.bootlin.com/linux/v6.6.2/source/net/netfilter/nf_nat_core.c#L825 [1] https://elixir.bootlin.com/linux/v6.6.2/source/include/net/netfilter/nf_nat.h#L111 Signed-off-by: gray <gray.liang@isovalent.com>
[ upstream commit: f93a40c ] We have an iptables rule to set 0x200 mark for transparent socket: ``` *mangle -A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle -A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff ``` This rule is in the mangle PREROUTING which checks packets ingressed from a netdev. Let's then focus on the pod to world traffic when IPsec=on + proxy=on + tunnel=off. Currently, a pod-to-world packet will go through the path: 1. from_lxc@lxc: skb->mark is set to 0x200 and returned to stack 2. iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy 3. proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world) 4. stack routing: the new skb is routed to eth0 5. stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain 6. to_netdev@eth0: the new skb is going to world Please note the new skb won't hit PREROUTING chain, where there is a rule setting skb->mark=0x200. To fix #31984, we are going to change the routing for packets from egress proxy; consequently, on the step 4 above, the new skb will be routed to cilium_host instead: 4. stack routing: the new skb is routed to cilium_host 5. from_host@cilium_host: the new skb is returned to stack 6. to_host@cilium_net: the new skb is returned to stack 7. stack: PREROUTING, routing, FORWARD, POSTROUTING Look at step 7, we are hitting PREROUTING! Because of cilium/proxy#742, this to-world skb is also linked to a transparent socket, matching the "-m socket --transparent" condition, the packet will fortunately have the 0x200 mark. If we do nothing, this to-world skb marked with 0x200 will then hit routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to local. It should have gone to the world. This patch fixes this future issue as a precaution (otherwise we'll break git-bisect). This patch provides a straightforward solution: at step 5 from_host@cilium_host, we set a specical mark 0x800 (MARK_MAGIC_PROXY_TO_WORLD), then iptables can exclude this mark using "-m mark ! --mark 0x800/0xf00". Signed-off-by: gray <gray.liang@isovalent.com>
[ upstream commit: 3384d73 ] After cilium/proxy#742, proxy traffic keeps original pod IP as source IP for to-world packets, which must be masqueraded to eth0 IP. There is no issue for now, but the new routing rule (0xb00 lookup 2005) to be added for #31984 will cause a side effect breaking masquerading. This patch fixes the that side effect as a precaution, otherwise git-bisect breaks. The new routing rule (0xb00 lookup 2005) will cause proxy packets going through POSTROUTING for twice: first time happens when proxy sends packets which are routed to cilium_host, these are hitting OUTPUT + **POSTROUTING**; the second time takes place after packets ingressed from cilium_net, these skbs will traverse PREROUTING + FORWARD + **POSTROUTING**. However, due to kernel's implementation details, an skb won't be processed by nat POSTROUTING for twice: after the first POSTROUTING check, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status IPS_SRC_NAT_DONE to skip the further traversal at all. [1] To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first round iptables ct, just for proxy traffic which is characterized by 0xb00 mark. [1] https://elixir.bootlin.com/linux/v6.6.2/source/net/netfilter/nf_nat_core.c#L825 [1] https://elixir.bootlin.com/linux/v6.6.2/source/include/net/netfilter/nf_nat.h#L111 Signed-off-by: gray <gray.liang@isovalent.com>
[ upstream commit: f93a40c ] We have an iptables rule to set 0x200 mark for transparent socket: ``` *mangle -A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle -A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff ``` This rule is in the mangle PREROUTING which checks packets ingressed from a netdev. Let's then focus on the pod to world traffic when IPsec=on + proxy=on + tunnel=off. Currently, a pod-to-world packet will go through the path: 1. from_lxc@lxc: skb->mark is set to 0x200 and returned to stack 2. iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy 3. proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world) 4. stack routing: the new skb is routed to eth0 5. stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain 6. to_netdev@eth0: the new skb is going to world Please note the new skb won't hit PREROUTING chain, where there is a rule setting skb->mark=0x200. To fix #31984, we are going to change the routing for packets from egress proxy; consequently, on the step 4 above, the new skb will be routed to cilium_host instead: 4. stack routing: the new skb is routed to cilium_host 5. from_host@cilium_host: the new skb is returned to stack 6. to_host@cilium_net: the new skb is returned to stack 7. stack: PREROUTING, routing, FORWARD, POSTROUTING Look at step 7, we are hitting PREROUTING! Because of cilium/proxy#742, this to-world skb is also linked to a transparent socket, matching the "-m socket --transparent" condition, the packet will fortunately have the 0x200 mark. If we do nothing, this to-world skb marked with 0x200 will then hit routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to local. It should have gone to the world. This patch fixes this future issue as a precaution (otherwise we'll break git-bisect). This patch provides a straightforward solution: at step 5 from_host@cilium_host, we set a specical mark 0x800 (MARK_MAGIC_PROXY_TO_WORLD), then iptables can exclude this mark using "-m mark ! --mark 0x800/0xf00". Signed-off-by: gray <gray.liang@isovalent.com>
[ upstream commit: 3384d73 ] After cilium/proxy#742, proxy traffic keeps original pod IP as source IP for to-world packets, which must be masqueraded to eth0 IP. There is no issue for now, but the new routing rule (0xb00 lookup 2005) to be added for #31984 will cause a side effect breaking masquerading. This patch fixes the that side effect as a precaution, otherwise git-bisect breaks. The new routing rule (0xb00 lookup 2005) will cause proxy packets going through POSTROUTING for twice: first time happens when proxy sends packets which are routed to cilium_host, these are hitting OUTPUT + **POSTROUTING**; the second time takes place after packets ingressed from cilium_net, these skbs will traverse PREROUTING + FORWARD + **POSTROUTING**. However, due to kernel's implementation details, an skb won't be processed by nat POSTROUTING for twice: after the first POSTROUTING check, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status IPS_SRC_NAT_DONE to skip the further traversal at all. [1] To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first round iptables ct, just for proxy traffic which is characterized by 0xb00 mark. [1] https://elixir.bootlin.com/linux/v6.6.2/source/net/netfilter/nf_nat_core.c#L825 [1] https://elixir.bootlin.com/linux/v6.6.2/source/include/net/netfilter/nf_nat.h#L111 Signed-off-by: gray <gray.liang@isovalent.com>
[ upstream commit: f93a40c ] We have an iptables rule to set 0x200 mark for transparent socket: ``` *mangle -A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle -A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff ``` This rule is in the mangle PREROUTING which checks packets ingressed from a netdev. Let's then focus on the pod to world traffic when IPsec=on + proxy=on + tunnel=off. Currently, a pod-to-world packet will go through the path: 1. from_lxc@lxc: skb->mark is set to 0x200 and returned to stack 2. iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy 3. proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world) 4. stack routing: the new skb is routed to eth0 5. stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain 6. to_netdev@eth0: the new skb is going to world Please note the new skb won't hit PREROUTING chain, where there is a rule setting skb->mark=0x200. To fix #31984, we are going to change the routing for packets from egress proxy; consequently, on the step 4 above, the new skb will be routed to cilium_host instead: 4. stack routing: the new skb is routed to cilium_host 5. from_host@cilium_host: the new skb is returned to stack 6. to_host@cilium_net: the new skb is returned to stack 7. stack: PREROUTING, routing, FORWARD, POSTROUTING Look at step 7, we are hitting PREROUTING! Because of cilium/proxy#742, this to-world skb is also linked to a transparent socket, matching the "-m socket --transparent" condition, the packet will fortunately have the 0x200 mark. If we do nothing, this to-world skb marked with 0x200 will then hit routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to local. It should have gone to the world. This patch fixes this future issue as a precaution (otherwise we'll break git-bisect). This patch provides a straightforward solution: at step 5 from_host@cilium_host, we set a specical mark 0x800 (MARK_MAGIC_PROXY_TO_WORLD), then iptables can exclude this mark using "-m mark ! --mark 0x800/0xf00". Signed-off-by: gray <gray.liang@isovalent.com>
[ upstream commit: 3384d73 ] After cilium/proxy#742, proxy traffic keeps original pod IP as source IP for to-world packets, which must be masqueraded to eth0 IP. There is no issue for now, but the new routing rule (0xb00 lookup 2005) to be added for #31984 will cause a side effect breaking masquerading. This patch fixes the that side effect as a precaution, otherwise git-bisect breaks. The new routing rule (0xb00 lookup 2005) will cause proxy packets going through POSTROUTING for twice: first time happens when proxy sends packets which are routed to cilium_host, these are hitting OUTPUT + **POSTROUTING**; the second time takes place after packets ingressed from cilium_net, these skbs will traverse PREROUTING + FORWARD + **POSTROUTING**. However, due to kernel's implementation details, an skb won't be processed by nat POSTROUTING for twice: after the first POSTROUTING check, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status IPS_SRC_NAT_DONE to skip the further traversal at all. [1] To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first round iptables ct, just for proxy traffic which is characterized by 0xb00 mark. [1] https://elixir.bootlin.com/linux/v6.6.2/source/net/netfilter/nf_nat_core.c#L825 [1] https://elixir.bootlin.com/linux/v6.6.2/source/include/net/netfilter/nf_nat.h#L111 Signed-off-by: gray <gray.liang@isovalent.com>
[ upstream commit: f93a40c ] We have an iptables rule to set 0x200 mark for transparent socket: ``` *mangle -A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle -A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff ``` This rule is in the mangle PREROUTING which checks packets ingressed from a netdev. Let's then focus on the pod to world traffic when IPsec=on + proxy=on + tunnel=off. Currently, a pod-to-world packet will go through the path: 1. from_lxc@lxc: skb->mark is set to 0x200 and returned to stack 2. iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy 3. proxy process: the old skb is consumed by proxy, an new skb is sent to upstream (world) 4. stack routing: the new skb is routed to eth0 5. stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain 6. to_netdev@eth0: the new skb is going to world Please note the new skb won't hit PREROUTING chain, where there is a rule setting skb->mark=0x200. To fix #31984, we are going to change the routing for packets from egress proxy; consequently, on the step 4 above, the new skb will be routed to cilium_host instead: 4. stack routing: the new skb is routed to cilium_host 5. from_host@cilium_host: the new skb is returned to stack 6. to_host@cilium_net: the new skb is returned to stack 7. stack: PREROUTING, routing, FORWARD, POSTROUTING Look at step 7, we are hitting PREROUTING! Because of cilium/proxy#742, this to-world skb is also linked to a transparent socket, matching the "-m socket --transparent" condition, the packet will fortunately have the 0x200 mark. If we do nothing, this to-world skb marked with 0x200 will then hit routiong rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to local. It should have gone to the world. This patch fixes this future issue as a precaution (otherwise we'll break git-bisect). This patch provides a straightforward solution: at step 5 from_host@cilium_host, we set a specical mark 0x800 (MARK_MAGIC_PROXY_TO_WORLD), then iptables can exclude this mark using "-m mark ! --mark 0x800/0xf00". Signed-off-by: gray <gray.liang@isovalent.com>
[ upstream commit: 3384d73 ] After cilium/proxy#742, proxy traffic keeps original pod IP as source IP for to-world packets, which must be masqueraded to eth0 IP. There is no issue for now, but the new routing rule (0xb00 lookup 2005) to be added for #31984 will cause a side effect breaking masquerading. This patch fixes the that side effect as a precaution, otherwise git-bisect breaks. The new routing rule (0xb00 lookup 2005) will cause proxy packets going through POSTROUTING for twice: first time happens when proxy sends packets which are routed to cilium_host, these are hitting OUTPUT + **POSTROUTING**; the second time takes place after packets ingressed from cilium_net, these skbs will traverse PREROUTING + FORWARD + **POSTROUTING**. However, due to kernel's implementation details, an skb won't be processed by nat POSTROUTING for twice: after the first POSTROUTING check, skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has a status IPS_SRC_NAT_DONE to skip the further traversal at all. [1] To avoid being set the IPS_SRC_NAT_DONE flag, this patch adds an iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip the first round iptables ct, just for proxy traffic which is characterized by 0xb00 mark. [1] https://elixir.bootlin.com/linux/v6.6.2/source/net/netfilter/nf_nat_core.c#L825 [1] https://elixir.bootlin.com/linux/v6.6.2/source/include/net/netfilter/nf_nat.h#L111 Signed-off-by: gray <gray.liang@isovalent.com>
Allow Envoy listener chaining via the loopback device by not routing transparent proxy traffic destined to the loopback device to the cilium_host device. Fixes: cilium#32683, cilium/proxy#742 Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Allow Envoy listener chaining via the loopback device by not routing transparent proxy traffic destined to the loopback device to the cilium_host device. Fixes: #32683, cilium/proxy#742 Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Allow Envoy listener chaining via the loopback device by not routing transparent proxy traffic destined to the loopback device to the cilium_host device. Fixes: cilium#32683, cilium/proxy#742 Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
[ upstream commit 4c9cf37 ] Allow Envoy listener chaining via the loopback device by not routing transparent proxy traffic destined to the loopback device to the cilium_host device. Fixes: #32683, cilium/proxy#742 Signed-off-by: Jarno Rajahalme <jarno@isovalent.com> Signed-off-by: gray <greyschwinger@gmail.com>
[ upstream commit 4c9cf37 ] Allow Envoy listener chaining via the loopback device by not routing transparent proxy traffic destined to the loopback device to the cilium_host device. Fixes: #32683, cilium/proxy#742 Signed-off-by: Jarno Rajahalme <jarno@isovalent.com> Signed-off-by: gray <greyschwinger@gmail.com>
Use the original source address if permitted even if the destination is external to the cluster.