jrajahalme
Member

Use the original source address if permitted even if the destination is external to the cluster.

@jrajahalme jrajahalme added the wip work-in-progress, no need to review label May 2, 2024
@jrajahalme jrajahalme requested a review from a team as a code owner May 2, 2024 14:37
@jrajahalme jrajahalme requested review from mhofstetter and removed request for a team May 2, 2024 14:37
…external

Use the original source address if permitted even if the destination is
external to the cluster.

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
@jrajahalme jrajahalme force-pushed the world-use-original-source-address branch from 4916203 to c28a13f Compare May 2, 2024 14:59
@jrajahalme jrajahalme added kind/enhancement and removed wip work-in-progress, no need to review labels May 8, 2024
@jrajahalme jrajahalme requested a review from mhofstetter May 8, 2024 12:57
@jrajahalme jrajahalme merged commit 7358b64 into main May 8, 2024
@jrajahalme jrajahalme deleted the world-use-original-source-address branch May 8, 2024 12:58
// locally allocated identity, is not classified as WORLD, and the destination is not in the
// same node.
} else if (!use_original_source_address_ || npmap_->exists(other_ip)) {
// Otherwise only use the original source address if permitted and the destination is not
Member

nit: the comment reads a little strangely. "Otherwise" relates more to the `else` case than to the `else if`.

I would rephrase it to something like this (to help document the actual condition):

Don't use the original source address if not configured or the destination is on the same node.

Member Author

I intended to document the whole else if. It now reads "Otherwise if ". I guess this depends on how you read the comment; by itself, or as "else "?
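
For readers following this thread, the `else if` condition under discussion can be sketched in isolation. This is a minimal illustration, not the filter's actual code: `LocalEndpoints`, `KeepsOriginalSource`, and the `std::set` standing in for `npmap_` are hypothetical, and the earlier branches (locally allocated identity, WORLD classification) are left out.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical stand-in for npmap_->exists(ip): the set holds the IPs of
// endpoints that live on this node.
using LocalEndpoints = std::set<std::string>;

// Mirrors only the condition quoted in the review:
//   !use_original_source_address_ || npmap_->exists(other_ip)
// When that condition is true, the original source address is NOT used,
// so this helper returns its negation.
inline bool KeepsOriginalSource(bool use_original_source_address,
                                const LocalEndpoints& local_endpoints,
                                const std::string& other_ip) {
  const bool destination_is_local = local_endpoints.count(other_ip) > 0;
  return use_original_source_address && !destination_is_local;
}

// Usage sketch with made-up addresses.
inline void DemoKeepsOriginalSource() {
  const LocalEndpoints local{"10.0.0.5"};
  assert(KeepsOriginalSource(true, local, "1.1.1.1"));    // permitted, external destination
  assert(!KeepsOriginalSource(true, local, "10.0.0.5"));  // destination on the same node
  assert(!KeepsOriginalSource(false, local, "1.1.1.1"));  // not configured
}
```

Read this way, the reviewer's suggested wording describes the negated case, while the author's wording describes the branch as a whole.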

jschwinger233 commented May 23, 2024

Hi @jrajahalme, may I confirm that, as of this PR, the proxy uses the original source IP even for to-world traffic, such as pod to 1.1.1.1?

Edit: never mind, I was mistaken.

jschwinger233 added a commit to cilium/cilium that referenced this pull request May 29, 2024
We have an iptables rule to set 0x200 mark for transparent socket:

```
*mangle
-A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle
-A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff
```

This rule is in the mangle PREROUTING which checks packets ingressed
from a netdev.

Let's then focus on the pod to world traffic when IPsec=on + proxy=on +
tunnel=off.

Currently, a pod-to-world packet will go through the path:
1. @lxc ingress: skb->mark is set to 0x200 and returned to stack
2. @stack iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy
3. @Proxy process: the old skb is consumed by the proxy, and a new skb is sent upstream (world)
4. @stack routing: the new skb is routed to eth0
5. @stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain
6. @eth0 egress: the new skb is going to world

Please note the new skb won't hit PREROUTING chain, where there is a
rule setting skb->mark=0x200.

To fix #31984, we are going to
change the routing for packets from egress proxy: on the step 4 above,
the new skb will be routed to cilium_host instead:

4. @stack routing: the new skb is routed to cilium_host
5. @cilium_host egress: the new skb is returned to stack
6. @cilium_net ingress: the new skb is returned to stack
7. @stack routing: the new skb is routed to eth0
8. @stack iptables: the new skb is traversing PREROUTING, FORWARD, POSTROUTING
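
The difference between the two paths, in terms of which iptables chains the proxy's new skb passes through, can be summarized in a small sketch. It is a simplification based only on the steps listed here and in the related masquerading commit message; `ProxyToWorldChains` is a hypothetical helper, and tables, devices, and BPF programs are not modeled.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns the iptables chains traversed by the proxy's upstream skb.
// via_cilium_host == false: current path, routed straight to eth0.
// via_cilium_host == true:  planned path, routed to cilium_host first and
//                           re-entering the stack from cilium_net.
inline std::vector<std::string> ProxyToWorldChains(bool via_cilium_host) {
  // A locally generated packet always starts with OUTPUT and POSTROUTING.
  std::vector<std::string> chains{"OUTPUT", "POSTROUTING"};
  if (via_cilium_host) {
    // After coming back in from cilium_net the skb is treated as forwarded.
    chains.push_back("PREROUTING");
    chains.push_back("FORWARD");
    chains.push_back("POSTROUTING");
  }
  return chains;
}

// Usage sketch.
inline void DemoProxyToWorldChains() {
  assert(ProxyToWorldChains(false).size() == 2);
  assert(ProxyToWorldChains(true).size() == 5);
  assert(ProxyToWorldChains(true)[2] == "PREROUTING");
}
```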

Look at step 8: we are hitting PREROUTING! Because of
cilium/proxy#742, this to-world skb is also
linked to a transparent socket, so it matches the "-m socket --transparent"
condition.

If we do nothing, this to-world skb will be given the 0x200 mark, then hit the
routing rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed
locally. It should have gone to the world.

This patch fixes this future issue as a precaution (otherwise we'll
break git-bisect).

This patch provides a straightforward solution: at step 5 @cilium_host,
we set a special mark 0x800 (MARK_MAGIC_GOTO_WORLD), so that iptables can
exclude this mark using "-m mark ! --mark 0x800/0xf00".
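
The effect of the extra exclusion can be sketched as below. This is an illustrative model, not the patch itself: it assumes the new `! --mark 0x800/0xf00` match sits alongside the existing `! --mark 0xe00/0xf00` match in the same rule, it considers only packets already linked to a transparent socket, and `PreMangle`, `TableForFwmark`, and `kNoRuleMatched` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kMarkProxyToWorld = 0x800;  // special mark from the commit message
constexpr int kNoRuleMatched = -1;                  // illustrative sentinel, not a kernel value

inline bool FwmarkMatches(std::uint32_t mark, std::uint32_t value, std::uint32_t mask) {
  return (mark & mask) == value;
}

// PREROUTING mangle step for a packet linked to a transparent socket.
// exclude_proxy_to_world toggles the added "! --mark 0x800/0xf00" match.
inline std::uint32_t PreMangle(std::uint32_t mark, bool exclude_proxy_to_world) {
  if (FwmarkMatches(mark, 0xe00, 0xf00)) return mark;
  if (exclude_proxy_to_world && FwmarkMatches(mark, kMarkProxyToWorld, 0xf00)) return mark;
  return 0x200;  // --set-xmark 0x200/0xffffffff overwrites the whole mark
}

// Policy routing rule "from all fwmark 0x200/0xf00 lookup 2004".
inline int TableForFwmark(std::uint32_t mark) {
  return FwmarkMatches(mark, 0x200, 0xf00) ? 2004 : kNoRuleMatched;
}

// Usage sketch: the skb arrives in PREROUTING carrying the 0x800 mark.
inline void DemoProxyToWorldMark() {
  // Without the exclusion it is re-marked 0x200 and sent to table 2004.
  assert(TableForFwmark(PreMangle(kMarkProxyToWorld, false)) == 2004);
  // With the exclusion the mark survives and the 0x200 rule no longer applies.
  assert(TableForFwmark(PreMangle(kMarkProxyToWorld, true)) == kNoRuleMatched);
}
```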

Signed-off-by: gray <gray.liang@isovalent.com>
jschwinger233 added a commit to cilium/cilium that referenced this pull request May 29, 2024
After cilium/proxy#742, proxy traffic keeps
the original pod IP as the source IP for to-world packets, which must be
masqueraded to the eth0 IP. There is no issue for now, but the new
routing rule (0xb00 lookup 2005) to be added for #31984
will have a side effect that breaks masquerading. This patch fixes
that side effect as a precaution; otherwise git-bisect breaks.

The new routing rule (0xb00 lookup 2005) will cause proxy packets to go
through POSTROUTING twice: the first time happens when the proxy sends
packets that are routed to cilium_host, which hit OUTPUT +
**POSTROUTING**; the second time takes place after the packets ingress
from cilium_net, when these skbs traverse PREROUTING + FORWARD +
**POSTROUTING**.

However, due to kernel implementation details, an skb won't be
processed by nat POSTROUTING twice: after the first POSTROUTING
traversal, the skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has the status
IPS_SRC_NAT_DONE, which causes any further traversal to be skipped.
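
The expression `(struct nf_conn*)(skb->_nfct & ~7)` relies on pointer tagging: the conntrack pointer is at least 8-byte aligned, so its low three bits are free to carry the ctinfo value. A userspace sketch of that idea follows. `NfConn`, `PackNfct`, and the other names are hypothetical stand-ins for the kernel definitions, and the flag is assumed here to be bit 7 of the status word.

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-in for struct nf_conn; alignment guarantees free low bits.
struct alignas(8) NfConn {
  unsigned long status = 0;
};

constexpr std::uintptr_t kInfoMask = 7;          // low 3 bits hold the ctinfo value
constexpr unsigned long kSrcNatDone = 1UL << 7;  // assumed position of IPS_SRC_NAT_DONE

// Packs a conntrack pointer and a ctinfo value the way skb->_nfct does.
inline std::uintptr_t PackNfct(NfConn* ct, unsigned ctinfo) {
  return reinterpret_cast<std::uintptr_t>(ct) | (ctinfo & kInfoMask);
}

// Equivalent of (struct nf_conn*)(skb->_nfct & ~7).
inline NfConn* CtFromNfct(std::uintptr_t nfct) {
  return reinterpret_cast<NfConn*>(nfct & ~kInfoMask);
}

// True when source NAT has already been decided for this connection.
inline bool SrcNatAlreadyDone(std::uintptr_t nfct) {
  const NfConn* ct = CtFromNfct(nfct);
  return ct != nullptr && (ct->status & kSrcNatDone) != 0;
}

// Usage sketch.
inline void DemoNfctTagging() {
  NfConn ct;
  const std::uintptr_t nfct = PackNfct(&ct, 2);
  assert(CtFromNfct(nfct) == &ct);   // masking recovers the pointer
  assert(!SrcNatAlreadyDone(nfct));
  ct.status |= kSrcNatDone;          // what the first POSTROUTING pass leaves behind
  assert(SrcNatAlreadyDone(nfct));
}
```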

To avoid having the IPS_SRC_NAT_DONE flag set, this patch adds an
iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip conntrack
in the first round of iptables, just for proxy traffic, which is
characterized by the 0xb00 mark.
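
Why skipping conntrack on the first pass restores masquerading can be shown with a toy two-pass simulation. Everything here is an assumption made for illustration: the IP addresses are made up, the masquerade rule is assumed to match only on the second pass (towards eth0), both passes are assumed to share one conntrack entry when tracking is not skipped, and the flag is assumed to be set on the first tracked pass even if no rule rewrote the address.

```cpp
#include <cassert>
#include <string>

// Toy conntrack entry: only the "source NAT already decided" flag is modeled.
struct Conn {
  bool src_nat_done = false;
};

// Toy nat POSTROUTING step. ct == nullptr models an untracked (notrack) skb.
inline void NatPostrouting(std::string& src_ip, Conn* ct, bool masquerade_rule_matches,
                           const std::string& node_ip) {
  if (ct == nullptr) return;      // untracked: the nat step is skipped entirely
  if (ct->src_nat_done) return;   // decision already made for this connection
  if (masquerade_rule_matches) src_ip = node_ip;
  ct->src_nat_done = true;        // set even when the address was left unchanged
}

// Returns the source IP the packet leaves the node with.
inline std::string SimulateProxyToWorld(bool notrack_at_output) {
  Conn conn;
  std::string src_ip = "10.0.1.23";          // hypothetical pod IP
  const std::string node_ip = "192.0.2.10";  // hypothetical eth0 IP
  // Pass 1: OUTPUT + POSTROUTING towards cilium_host.
  NatPostrouting(src_ip, notrack_at_output ? nullptr : &conn, false, node_ip);
  // Pass 2: PREROUTING + FORWARD + POSTROUTING towards eth0.
  NatPostrouting(src_ip, &conn, true, node_ip);
  return src_ip;
}

// Usage sketch.
inline void DemoNotrack() {
  assert(SimulateProxyToWorld(false) == "10.0.1.23");  // masquerading is skipped
  assert(SimulateProxyToWorld(true) == "192.0.2.10");  // masquerading happens on pass 2
}
```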

Signed-off-by: gray <gray.liang@isovalent.com>
jschwinger233 added a commit to cilium/cilium that referenced this pull request May 29, 2024
We have an iptables rule to set 0x200 mark for transparent socket:

```
*mangle
-A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle
-A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff
```

This rule is in the mangle PREROUTING which checks packets ingressed
from a netdev.

Let's then focus on the pod to world traffic when IPsec=on + proxy=on +
tunnel=off.

Currently, a pod-to-world packet will go through the path:
1. @lxc ingress: skb->mark is set to 0x200 and returned to stack
2. @stack iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy
3. @Proxy process: the old skb is consumed by the proxy, and a new skb is sent upstream (world)
4. @stack routing: the new skb is routed to eth0
5. @stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain
6. @eth0 egress: the new skb is going to world

Please note the new skb won't hit PREROUTING chain, where there is a
rule setting skb->mark=0x200.

To fix #31984, we are going to
change the routing for packets from egress proxy; consequently, on the
step 4 above, the new skb will be routed to cilium_host instead:

4. @stack routing: the new skb is routed to cilium_host
5. @cilium_host egress: the new skb is returned to stack
6. @cilium_net ingress: the new skb is returned to stack
7. @stack routing: the new skb is routed to eth0
8. @stack iptables: the new skb is traversing PREROUTING, FORWARD, POSTROUTING

Look at step 8: we are hitting PREROUTING! Because of
cilium/proxy#742, this to-world skb is also
linked to a transparent socket, so it matches the "-m socket --transparent"
condition and the packet will unfortunately be given the 0x200 mark.

If we do nothing, this to-world skb marked with 0x200 will then hit the
routing rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed
locally. It should have gone to the world.

This patch fixes this future issue as a precaution (otherwise we'll
break git-bisect).

This patch provides a straightforward solution: at step 5 @cilium_host,
we set a special mark 0x800 (MARK_MAGIC_GOTO_WORLD), so that iptables can
exclude this mark using "-m mark ! --mark 0x800/0xf00".

Signed-off-by: gray <gray.liang@isovalent.com>
jschwinger233 added a commit to cilium/cilium that referenced this pull request May 29, 2024
We have an iptables rule to set 0x200 mark for transparent socket:

```
*mangle
-A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle
-A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff
```

This rule is in the mangle PREROUTING which checks packets ingressed
from a netdev.

Let's then focus on the pod to world traffic when IPsec=on + proxy=on +
tunnel=off.

Currently, a pod-to-world packet will go through the path:
1. @lxc ingress: skb->mark is set to 0x200 and returned to stack
2. @stack iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy
3. @Proxy process: the old skb is consumed by the proxy, and a new skb is sent upstream (world)
4. @stack routing: the new skb is routed to eth0
5. @stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain
6. @eth0 egress: the new skb is going to world

Please note the new skb won't hit PREROUTING chain, where there is a
rule setting skb->mark=0x200.

To fix #31984, we are going to
change the routing for packets from egress proxy; consequently, on the
step 4 above, the new skb will be routed to cilium_host instead:

4. @stack routing: the new skb is routed to cilium_host
5. @cilium_host egress: the new skb is returned to stack
6. @cilium_net ingress: the new skb is returned to stack
7. @stack routing: the new skb is routed to eth0
8. @stack iptables: the new skb is traversing PREROUTING, FORWARD, POSTROUTING

Look at step 8: we are hitting PREROUTING! Because of
cilium/proxy#742, this to-world skb is also
linked to a transparent socket, so it matches the "-m socket --transparent"
condition and the packet will unfortunately be given the 0x200 mark.

If we do nothing, this to-world skb marked with 0x200 will then hit the
routing rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed
locally. It should have gone to the world.

This patch fixes this future issue as a precaution (otherwise we'll
break git-bisect).

This patch provides a straightforward solution: at step 5 @cilium_host,
we set a special mark 0x800 (MARK_MAGIC_PROXY_TO_WORLD), so that iptables can
exclude this mark using "-m mark ! --mark 0x800/0xf00".

Signed-off-by: gray <gray.liang@isovalent.com>
jschwinger233 added a commit to cilium/cilium that referenced this pull request May 29, 2024
After cilium/proxy#742, proxy traffic keeps
the original pod IP as the source IP for to-world packets, which must be
masqueraded to the eth0 IP. There is no issue for now, but the new
routing rule (0xb00 lookup 2005) to be added for #31984
will have a side effect that breaks masquerading. This patch fixes
that side effect as a precaution; otherwise git-bisect breaks.

The new routing rule (0xb00 lookup 2005) will cause proxy packets to go
through POSTROUTING twice: the first time happens when the proxy sends
packets that are routed to cilium_host, which hit OUTPUT +
**POSTROUTING**; the second time takes place after the packets ingress
from cilium_net, when these skbs traverse PREROUTING + FORWARD +
**POSTROUTING**.

However, due to kernel implementation details, an skb won't be
processed by nat POSTROUTING twice: after the first POSTROUTING
check, the skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has the status
IPS_SRC_NAT_DONE, which causes any further traversal to be skipped.

To avoid having the IPS_SRC_NAT_DONE flag set, this patch adds an
iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip conntrack
in the first round of iptables, just for proxy traffic, which is
characterized by the 0xb00 mark.

Signed-off-by: gray <gray.liang@isovalent.com>
jschwinger233 added a commit to cilium/cilium that referenced this pull request May 29, 2024
After cilium/proxy#742, proxy traffic keeps
the original pod IP as the source IP for to-world packets, which must be
masqueraded to the eth0 IP. There is no issue for now, but the new
routing rule (0xb00 lookup 2005) to be added for #31984
will have a side effect that breaks masquerading. This patch fixes
that side effect as a precaution; otherwise git-bisect breaks.

The new routing rule (0xb00 lookup 2005) will cause proxy packets to go
through POSTROUTING twice: the first time happens when the proxy sends
packets that are routed to cilium_host, which hit OUTPUT +
**POSTROUTING**; the second time takes place after the packets ingress
from cilium_net, when these skbs traverse PREROUTING + FORWARD +
**POSTROUTING**.

However, due to kernel implementation details, an skb won't be
processed by nat POSTROUTING twice: after the first POSTROUTING
check, the skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has the status
IPS_SRC_NAT_DONE, which causes any further traversal to be skipped. [1]

To avoid having the IPS_SRC_NAT_DONE flag set, this patch adds an
iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip conntrack
in the first round of iptables, just for proxy traffic, which is
characterized by the 0xb00 mark.

[1] https://elixir.bootlin.com/linux/v6.6.2/source/net/netfilter/nf_nat_core.c#L825
[1] https://elixir.bootlin.com/linux/v6.6.2/source/include/net/netfilter/nf_nat.h#L111

Signed-off-by: gray <gray.liang@isovalent.com>
jschwinger233 added a commit to cilium/cilium that referenced this pull request Jun 3, 2024
We have an iptables rule to set 0x200 mark for transparent socket:

```
*mangle
-A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle
-A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff
```

This rule is in the mangle PREROUTING which checks packets ingressed
from a netdev.

Let's then focus on the pod to world traffic when IPsec=on + proxy=on +
tunnel=off.

Currently, a pod-to-world packet will go through the path:
1. from_lxc@lxc: skb->mark is set to 0x200 and returned to stack
2. iptables: skb is hijacked by tproxy (due to 0x200), to be accepted by proxy
3. proxy process: the old skb is consumed by the proxy, and a new skb is sent upstream (world)
4. stack routing: the new skb is routed to eth0
5. stack iptables: the new skb is traversing OUTPUT chain and POSTROUTING chain
6. to_netdev@eth0: the new skb is going to world

Please note the new skb won't hit PREROUTING chain, where there is a
rule setting skb->mark=0x200.

To fix #31984, we are going to
change the routing for packets from egress proxy; consequently, on the
step 4 above, the new skb will be routed to cilium_host instead:

4. stack routing: the new skb is routed to cilium_host
5. from_host@cilium_host: the new skb is returned to stack
6. to_host@cilium_net: the new skb is returned to stack
7. stack: PREROUTING, routing, FORWARD, POSTROUTING

Look at step 7: we are hitting PREROUTING! Because of
cilium/proxy#742, this to-world skb is also
linked to a transparent socket, so it matches the "-m socket --transparent"
condition and the packet will unfortunately be given the 0x200 mark.

If we do nothing, this to-world skb marked with 0x200 will then hit the
routing rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed
locally. It should have gone to the world.

This patch fixes this future issue as a precaution (otherwise we'll
break git-bisect).

This patch provides a straightforward solution: at step 5
from_host@cilium_host, we set a special mark 0x800
(MARK_MAGIC_PROXY_TO_WORLD), so that iptables can exclude this mark using
"-m mark ! --mark 0x800/0xf00".

Signed-off-by: gray <gray.liang@isovalent.com>
jschwinger233 added a commit to cilium/cilium that referenced this pull request Jun 3, 2024
github-merge-queue bot pushed a commit to cilium/cilium that referenced this pull request Jun 5, 2024
github-merge-queue bot pushed a commit to cilium/cilium that referenced this pull request Jun 5, 2024
jschwinger233 added a commit to cilium/cilium that referenced this pull request Jun 6, 2024
[ upstream commit: 3384d73 ]

jschwinger233 added a commit to cilium/cilium that referenced this pull request Jun 6, 2024
[ upstream commit: f93a40c ]

We have an iptables rule that sets the 0x200 mark on packets that belong to a transparent socket:

```
*mangle
-A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle
-A CILIUM_PRE_mangle -m socket --transparent -m mark ! --mark 0xe00/0xf00 -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff
```

This rule lives in the mangle PREROUTING chain, which only sees packets
that ingress from a netdev.
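
As a rough illustration of how the mark match and the `--set-xmark` target in that rule behave, here is a minimal Python sketch. The function names are made up for this example; only the value/mask semantics (compare the masked bits; clear the masked bits, then XOR in the value) follow the iptables documentation.

```python
def mark_matches(mark, value, mask):
    """`-m mark --mark value/mask`: compare only the bits selected by mask."""
    return (mark & mask) == value


def set_xmark(mark, value, mask):
    """`-j MARK --set-xmark value/mask`: clear the masked bits, then XOR in value."""
    return ((mark & ~mask) ^ value) & 0xFFFFFFFF


def cilium_pre_mangle(mark, transparent_socket):
    """Model of the CILIUM_PRE_mangle rule quoted above (hypothetical helper)."""
    if transparent_socket and not mark_matches(mark, 0xE00, 0xF00):
        return set_xmark(mark, 0x200, 0xFFFFFFFF)
    return mark
```

Because the mask in the rule is 0xffffffff, the whole mark is overwritten with 0x200 whenever the rule fires; a packet whose mark already carries 0xe00 in the 0xf00 bits is left untouched.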

Let's now focus on pod-to-world traffic when IPsec=on + proxy=on +
tunnel=off.

Currently, a pod-to-world packet takes the following path:
1. from_lxc@lxc: skb->mark is set to 0x200 and the skb is returned to the stack
2. iptables: the skb is intercepted by tproxy (due to 0x200) and accepted by the proxy
3. proxy process: the old skb is consumed by the proxy, and a new skb is sent upstream (to the world)
4. stack routing: the new skb is routed to eth0
5. stack iptables: the new skb traverses the OUTPUT and POSTROUTING chains
6. to_netdev@eth0: the new skb leaves for the world

Note that the new skb never hits the PREROUTING chain, where the rule
setting skb->mark=0x200 lives.

To fix #31984, we are going to
change the routing for packets from the egress proxy; consequently, from
step 4 above onward, the new skb will be routed to cilium_host instead:

4. stack routing: the new skb is routed to cilium_host
5. from_host@cilium_host: the new skb is returned to stack
6. to_host@cilium_net: the new skb is returned to stack
7. stack: PREROUTING, routing, FORWARD, POSTROUTING

Look at step 7: we are hitting PREROUTING! Because of
cilium/proxy#742, this to-world skb is also
linked to a transparent socket, so it matches the "-m socket --transparent"
condition and the packet unfortunately gets the 0x200 mark.

If we do nothing, this to-world skb marked with 0x200 will then hit the
routing rule "from all fwmark 0x200/0xf00 lookup 2004" and be routed to
local delivery, although it should have gone to the world.
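
How a fwmark-based rule captures such a packet can be sketched as follows. This is a simplified Python model of policy-routing rule selection: the rule priorities are hypothetical, only the "fwmark 0x200/0xf00 lookup 2004" rule comes from the text above, and the real kernel behaviour of falling through to the next rule when a table has no matching route is deliberately left out.

```python
# (priority, fwmark value, fwmark mask, table); priorities are hypothetical.
RULES = [
    (10, 0x200, 0xF00, "2004"),
    (32766, None, None, "main"),  # catch-all rule without an fwmark selector
]


def select_table(mark, rules=RULES):
    """Return the table of the first rule, in priority order, that matches mark."""
    for _prio, value, mask, table in sorted(rules, key=lambda r: r[0]):
        if value is None or (mark & mask) == value:
            return table
    return None
```

With these rules, `select_table(0x200)` yields table 2004, while an unmarked packet (`select_table(0)`) falls through to the main table.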

This patch fixes this future issue as a precaution (otherwise we'll
break git-bisect).

This patch provides a straightforward solution: at step 5
(from_host@cilium_host) we set a special mark 0x800
(MARK_MAGIC_PROXY_TO_WORLD), so that iptables can exclude this mark using
"-m mark ! --mark 0x800/0xf00".

Signed-off-by: gray <gray.liang@isovalent.com>
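
The effect of the extra exclusion from the commit message above can be sketched in Python. This is a model only: it assumes the new "! --mark 0x800/0xf00" condition sits on the same mangle rule as the existing 0xe00 exclusion, and the function and parameter names are invented for the example.

```python
MARK_MAGIC_PROXY_TO_WORLD = 0x800


def pre_mangle(mark, transparent_socket, exclude_proxy_to_world):
    """Model of the mangle rule with and without the added 0x800 exclusion."""
    excluded = {0xE00}
    if exclude_proxy_to_world:
        excluded.add(MARK_MAGIC_PROXY_TO_WORLD)
    if transparent_socket and (mark & 0xF00) not in excluded:
        return 0x200  # --set-xmark 0x200/0xffffffff overwrites the whole mark
    return mark
```

Without the exclusion, a packet marked 0x800 at from_host@cilium_host is re-marked to 0x200 in PREROUTING; with it, the packet keeps 0x800 and no longer matches the "fwmark 0x200/0xf00" routing rule.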
jschwinger233 added a commit to cilium/cilium that referenced this pull request Jun 6, 2024
[ upstream commit: 3384d73 ]

jschwinger233 added a commit to cilium/cilium that referenced this pull request Jun 6, 2024
[ upstream commit: f93a40c ]

jschwinger233 added a commit to cilium/cilium that referenced this pull request Jun 6, 2024
[ upstream commit: 3384d73 ]

jschwinger233 added a commit to cilium/cilium that referenced this pull request Jun 7, 2024
[ upstream commit: f93a40c ]

jschwinger233 added a commit to cilium/cilium that referenced this pull request Jun 7, 2024
[ upstream commit: 3384d73 ]

jschwinger233 added a commit to cilium/cilium that referenced this pull request Jun 7, 2024
[ upstream commit: f93a40c ]

jschwinger233 added a commit to cilium/cilium that referenced this pull request Jun 7, 2024
[ upstream commit: 3384d73 ]

qmonnet pushed a commit to cilium/cilium that referenced this pull request Jun 7, 2024
[ upstream commit: f93a40c ]

qmonnet pushed a commit to cilium/cilium that referenced this pull request Jun 7, 2024
[ upstream commit: 3384d73 ]

jschwinger233 added a commit to cilium/cilium that referenced this pull request Jun 7, 2024
[ upstream commit: f93a40c ]

jschwinger233 added a commit to cilium/cilium that referenced this pull request Jun 7, 2024
[ upstream commit: 3384d73 ]

julianwiedmann pushed a commit to cilium/cilium that referenced this pull request Jun 7, 2024
[ upstream commit: f93a40c ]

julianwiedmann pushed a commit to cilium/cilium that referenced this pull request Jun 7, 2024
[ upstream commit: 3384d73 ]

dylandreimerink pushed a commit to cilium/cilium that referenced this pull request Jun 11, 2024
[ upstream commit: f93a40c ]

dylandreimerink pushed a commit to cilium/cilium that referenced this pull request Jun 11, 2024
[ upstream commit: 3384d73 ]

After cilium/proxy#742, proxy traffic keeps
original pod IP as source IP for to-world packets, which must be
masqueraded to eth0 IP. There is no issue for now, but the new
routing rule (0xb00 lookup 2005) to be added for #31984
will cause a side effect breaking masquerading. This patch fixes the
that side effect as a precaution, otherwise git-bisect breaks.

The new routing rule (0xb00 lookup 2005) will cause proxy packets going
through POSTROUTING for twice: first time happens when proxy sends
packets which are routed to cilium_host, these are hitting OUTPUT +
**POSTROUTING**; the second time takes place after packets ingressed
from cilium_net, these skbs will traverse PREROUTING + FORWARD +
**POSTROUTING**.

However, due to the kernel's implementation details, an skb won't be
processed by nat POSTROUTING twice: after the first POSTROUTING
check, the skb's ct `(struct nf_conn*)(skb->_nfct & ~7)` has the status
IPS_SRC_NAT_DONE set, which skips any further traversal. [1]
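
As a rough illustration of the expression above, the sketch below models how the conntrack pointer and the status check could be read. The constant values (a 3-bit info mask and bit 7 for IPS_SRC_NAT_DONE) are my reading of the kernel headers and should be checked against the linked source; the function names and the sample address are invented for the example.

```python
NFCT_INFOMASK = 7            # assumed: low three bits of skb->_nfct carry the ctinfo
IPS_SRC_NAT_DONE = 1 << 7    # assumed: IPS_SRC_NAT_DONE_BIT is 7


def split_nfct(nfct: int) -> tuple[int, int]:
    """Split a packed _nfct word into (nf_conn pointer, ctinfo)."""
    return nfct & ~NFCT_INFOMASK, nfct & NFCT_INFOMASK


def src_nat_already_done(status: int) -> bool:
    """True when source NAT was already decided for this connection."""
    return bool(status & IPS_SRC_NAT_DONE)


# Made-up 8-byte-aligned address with ctinfo 2 packed into the low bits.
ct_ptr, ctinfo = split_nfct(0xFFFF888012345600 | 2)

# Once the flag is set, a later nat POSTROUTING pass is skipped.
second_pass_skipped = src_nat_already_done(IPS_SRC_NAT_DONE)
```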

To avoid having the IPS_SRC_NAT_DONE flag set, this patch adds an
iptables rule `--mark 0xb00 -j CT --notrack` at OUTPUT to skip conntrack
in the first iptables round, only for proxy traffic, which is identified
by the 0xb00 mark.
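
The effect of the notrack rule can be pictured with a toy model of the two POSTROUTING passes. This is only a sketch under stated assumptions: the masquerade rule is assumed not to match on the first pass (towards cilium_host) and to match on the second, an untracked packet is assumed to bypass NAT entirely, and one connection object stands in for the conntrack entry. None of the names below exist in the kernel or in Cilium.

```python
IPS_SRC_NAT_DONE = 1 << 7  # assumed value, see the kernel source linked as [1]


class Conn:
    """Stand-in for a conntrack entry; only the status bits are modelled."""

    def __init__(self) -> None:
        self.status = 0


def nat_postrouting(ct, masquerade_matches: bool) -> bool:
    """Return True if the packet is masqueraded in this pass."""
    if ct is None:
        return False  # untracked (--notrack): NAT does not look at the packet
    if ct.status & IPS_SRC_NAT_DONE:
        return False  # decision already made earlier, rules are not consulted
    ct.status |= IPS_SRC_NAT_DONE  # the decision is recorded even without a match
    return masquerade_matches


def traverse(notrack_at_output: bool) -> tuple[bool, bool]:
    """Run OUTPUT+POSTROUTING, then PREROUTING+FORWARD+POSTROUTING."""
    ct = Conn()
    first = nat_postrouting(None if notrack_at_output else ct, masquerade_matches=False)
    second = nat_postrouting(ct, masquerade_matches=True)
    return first, second


without_fix = traverse(notrack_at_output=False)
with_fix = traverse(notrack_at_output=True)
```

In this model the packet is never masqueraded without the rule, because the first pass consumes the one-time NAT decision, while with the rule the second pass is the one that masquerades.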

[1] https://elixir.bootlin.com/linux/v6.6.2/source/net/netfilter/nf_nat_core.c#L825
[1] https://elixir.bootlin.com/linux/v6.6.2/source/include/net/netfilter/nf_nat.h#L111

Signed-off-by: gray <gray.liang@isovalent.com>
jrajahalme added a commit to jrajahalme/cilium that referenced this pull request Jul 31, 2024
Allow Envoy listener chaining via the loopback device by not routing transparent proxy traffic
destined to the loopback device to the cilium_host device.

Fixes: cilium#32683, cilium/proxy#742

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
github-merge-queue bot pushed a commit to cilium/cilium that referenced this pull request Aug 8, 2024
Allow Envoy listener chaining via the loopback device by not routing transparent proxy traffic
destined to the loopback device to the cilium_host device.

Fixes: #32683, cilium/proxy#742

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
ti-mo pushed a commit to ti-mo/cilium that referenced this pull request Aug 8, 2024
Allow Envoy listener chaining via the loopback device by not routing transparent proxy traffic
destined to the loopback device to the cilium_host device.

Fixes: cilium#32683, cilium/proxy#742

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
jschwinger233 pushed a commit to cilium/cilium that referenced this pull request Aug 12, 2024
[ upstream commit 4c9cf37 ]

Allow Envoy listener chaining via the loopback device by not routing transparent proxy traffic
destined to the loopback device to the cilium_host device.

Fixes: #32683, cilium/proxy#742

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Signed-off-by: gray <greyschwinger@gmail.com>
github-merge-queue bot pushed a commit to cilium/cilium that referenced this pull request Aug 13, 2024
[ upstream commit 4c9cf37 ]

Allow Envoy listener chaining via the loopback device by not routing transparent proxy traffic
destined to the loopback device to the cilium_host device.

Fixes: #32683, cilium/proxy#742

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Signed-off-by: gray <greyschwinger@gmail.com>