Conversation

qmonnet
Member

@qmonnet qmonnet commented Aug 4, 2020

Description

Multi-node NodePort traffic for EKS needs a set of specific rules that are usually set by the aws daemonset:

# sysctl -w net.ipv4.conf.eth0.rp_filter=2
# iptables -t mangle -A PREROUTING -i eth0 -m comment --comment "AWS, primary ENI" -m addrtype --dst-type LOCAL --limit-iface-in -j CONNMARK --set-xmark 0x80/0x80
# iptables -t mangle -A PREROUTING -i eni+ -m comment --comment "AWS, primary ENI" -j CONNMARK --restore-mark --nfmask 0x80 --ctmask 0x80
# ip rule add fwmark 0x80/0x80 lookup main

These rules mark packets coming from another node through eth0, and restore the mark on the return path to force a lookup in the main routing table. Without them, the ip rules set by the cilium-cni plugin make the host look up the routing table associated with the VPC for which the endpoint's CIDR has been configured.

We want to reproduce equivalent rules, otherwise multi-node NodePort traffic will not be routed correctly. This could be observed with the pod-to-b-multi-node-nodeport pod from the connectivity check never getting ready.

This commit makes the loader and iptables module create the relevant rules when IPAM is ENI and egress masquerading is in use. The rules are nearly identical to those from the aws daemonset (different comments, a different interface prefix for the conntrack return path, and an explicit preference for the ip rule):

# sysctl -w net.ipv4.conf.<egressMasqueradeInterfaces>.rp_filter=2
# iptables -t mangle -A PREROUTING -i <egressMasqueradeInterfaces> -m comment --comment "cilium: primary ENI" -m addrtype --dst-type LOCAL --limit-iface-in -j CONNMARK --set-xmark 0x80/0x80
# iptables -t mangle -A PREROUTING -i lxc+ -m comment --comment "cilium: primary ENI" -j CONNMARK --restore-mark --nfmask 0x80 --ctmask 0x80
# ip rule add fwmark 0x80/0x80 lookup main pref 109
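
For illustration, here is a minimal standalone Go sketch that applies these rules by shelling out to sysctl, iptables and ip, mirroring the commands above. The actual change wires this into the loader (sysSettings), the iptables module (runProg, CILIUM_PRE_mangle chain) and route.ReplaceRule, as the incremental diff further down shows; the eth0 interface name here is just an assumption for the egress masquerade interface:

package main

import (
	"fmt"
	"os/exec"
)

// run executes a command and returns a wrapped error including its output.
func run(name string, args ...string) error {
	out, err := exec.Command(name, args...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("%s %v: %v: %s", name, args, err, out)
	}
	return nil
}

func main() {
	iface := "eth0" // assumption: the interface used for egress masquerading

	// Loosen reverse-path filtering on the primary ENI.
	if err := run("sysctl", "-w", "net.ipv4.conf."+iface+".rp_filter=2"); err != nil {
		panic(err)
	}

	// Mark connections for NodePort traffic entering through the primary ENI.
	if err := run("iptables", "-t", "mangle", "-A", "PREROUTING",
		"-i", iface, "-m", "comment", "--comment", "cilium: primary ENI",
		"-m", "addrtype", "--dst-type", "LOCAL", "--limit-iface-in",
		"-j", "CONNMARK", "--set-xmark", "0x80/0x80"); err != nil {
		panic(err)
	}

	// Restore the mark on the return path coming back from the endpoints.
	if err := run("iptables", "-t", "mangle", "-A", "PREROUTING",
		"-i", "lxc+", "-m", "comment", "--comment", "cilium: primary ENI",
		"-j", "CONNMARK", "--restore-mark", "--nfmask", "0x80", "--ctmask", "0x80"); err != nil {
		panic(err)
	}

	// Make marked reply traffic use the main routing table.
	if err := run("ip", "rule", "add", "fwmark", "0x80/0x80", "lookup", "main", "pref", "109"); err != nil {
		panic(err)
	}
}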

Steps for verification

Apply the patch, build and push the Docker image:

$ DOCKER_DEV_ACCOUNT=qmonnet make dev-docker-image
$ docker push qmonnet/cilium-dev:latest

Follow the instructions to install Cilium on AWS EKS. When deploying Cilium with Helm, do not forget to pass the location of the custom image for the agent:

$ helm install cilium [...] --set agent.image=docker.io/qmonnet/cilium-dev

Deploy the connectivity check. Pick the version from Cilium v1.8 (latest version does not have pod-to-b-multi-node-nodeport):

$ kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/v1.8/examples/kubernetes/connectivity-check/connectivity-check.yaml

Wait a few seconds. On a cilium pod, the newly-added rules should be visible:

# sysctl net.ipv4.conf.eth0.rp_filter
net.ipv4.conf.eth0.rp_filter = 2
# iptables-save|grep -w 0x80
-A CILIUM_PRE_mangle -i eth0 -m comment --comment "cilium: primary ENI" -m addrtype --dst-type LOCAL --limit-iface-in -j CONNMARK --set-xmark 0x80/0x80
-A CILIUM_PRE_mangle -i lxc+ -m comment --comment "cilium: primary ENI" -j CONNMARK --restore-mark --nfmask 0x80 --ctmask 0x80
# ip rule|grep -w 0x80
109:    from all fwmark 0x80/0x80 lookup main 

All pods from the connectivity-check should be marked as ready, in particular the pod-to-b-multi-node-nodeport:

$ kubectl get pods
NAME                                                     READY   STATUS    RESTARTS   AGE
echo-a-58dd59998d-ct49w                                  1/1     Running   0          3m8s
echo-b-865969889d-xlb2v                                  1/1     Running   0          3m8s
echo-b-host-659c674bb6-895qn                             1/1     Running   2          3m8s
host-to-b-multi-node-clusterip-6fb94d9df6-7lk9s          1/1     Running   1          3m8s
host-to-b-multi-node-headless-7c4ff79cd-9zk4c            1/1     Running   1          3m8s
pod-to-a-5c8dcf69f7-5jnmh                                1/1     Running   0          3m8s
pod-to-a-allowed-cnp-75684d58cc-flpb9                    1/1     Running   0          3m8s
pod-to-a-external-1111-669ccfb85f-mljxd                  1/1     Running   0          3m7s
pod-to-a-l3-denied-cnp-7b8bfcb66c-tnqqf                  1/1     Running   0          3m8s
pod-to-b-intra-node-74997967f8-phb8v                     1/1     Running   0          3m8s
pod-to-b-intra-node-nodeport-775f967f47-8tjkq            1/1     Running   1          3m8s
pod-to-b-multi-node-clusterip-587678cbc4-4kzpg           1/1     Running   0          3m7s
pod-to-b-multi-node-headless-574d9f5894-vsbjt            1/1     Running   1          3m7s
pod-to-b-multi-node-nodeport-7944d9f9fc-wn4j2            1/1     Running   1          3m7s
pod-to-external-fqdn-allow-google-cnp-6dd57bc859-4h4pj   1/1     Running   0          3m7s

Discussion

I am not entirely sure of my implementation. In particular, reviewers might want to consider:

  • Is this the best location to add the rules? I initially considered the AWS-related code for the cilium-cni plugin, but I don't think it is used to manage the node itself.
  • Should the rules depend on masquerading being used? I was under the impression we always use it for EKS, and that the interface we use for egress masquerading is the one to target with the rules. But I may be wrong.
  • I picked a preference for the ip rule (linux_defaults.RulePriorityEgress - 1) so it gets looked up just before the rule we want to avoid, but maybe I should create a new constant in linux_defaults for that? Not sure if this is worth it.
  • Style: The processing of interface names for the ip rule addition feels a bit cumbersome (with this + prefix to process; see the sketch after this list). I'm not sure we have a helper for this somewhere; I could not find one (other lxc+-style names are passed directly to iptables from what I could see). We do have a helper somewhere for setting rp_filter, but it is not used in Reinitialize() from the loader, so I just stuck to local code and appended to sysSettings.
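
For the interface-name handling mentioned in the last point, here is a minimal sketch of the prefix matching, assuming made-up link names (it mirrors what the incremental diff below ends up doing with strings.TrimSuffix and strings.HasPrefix):

package main

import (
	"fmt"
	"strings"
)

// matchIfaces returns the link names matching an iptables-style interface
// pattern such as "eth+": trim a trailing "+" and match on the prefix.
func matchIfaces(pattern string, links []string) []string {
	prefix := strings.TrimSuffix(pattern, "+")
	var ifaces []string
	for _, name := range links {
		if strings.HasPrefix(name, prefix) {
			ifaces = append(ifaces, name)
		}
	}
	return ifaces
}

func main() {
	links := []string{"eth0", "eth1", "lo", "lxc1234"} // illustrative link names
	fmt.Println(matchIfaces("eth+", links))            // prints [eth0 eth1]
}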

Tests

None in this PR. Ready state can be tested with the connectivity check from Cilium v1.8 (I didn't realise it was gone until writing this description).

Follow-up: Re-introduce pod-to-b-multi-node-nodeport or equivalent in connectivity checks.

Fixes: #12098

Fix bug in ENI environments where connections to NodePort would fail due to asymmetric routing

@qmonnet qmonnet added kind/bug This is a bug in the Cilium logic. area/loader Impacts the loading of BPF programs into the kernel. priority/high This is considered vital to an upcoming release. release-note/bug This PR fixes an issue in a previous release of Cilium. area/eni Impacts ENI based IPAM. needs-backport/1.7 labels Aug 4, 2020
@qmonnet qmonnet requested review from tklauser, tgraf and a team August 4, 2020 16:06
@qmonnet
Member Author

qmonnet commented Aug 4, 2020

test-me-please

@qmonnet qmonnet marked this pull request as ready for review August 4, 2020 16:19
@qmonnet qmonnet requested a review from a team August 4, 2020 16:19
@qmonnet
Member Author

qmonnet commented Aug 4, 2020

Sidenote: Discussing with @joestringer on Slack, it appears that the removal of pod-to-b-multi-node-nodeport from the connectivity checks was unintentional. I'll work on restoring it, but this may be for a follow-up PR and should not be blocking this one.

[Edit] See #12788 [Later edit] Now merged.

@coveralls

coveralls commented Aug 4, 2020

Coverage Status

Coverage decreased (-0.02%) to 37.096% when pulling 2de4789 on pr/qmonnet/eks_nodeport_rulefix into 9cbc151 on master.

Member

@tklauser tklauser left a comment

Thanks a lot for the detailed descriptions and the verification steps in the PR description! Using these steps I was able to verify that this PR indeed fixes the pod-to-b-multi-node-nodeport connectivity check on EKS.

Some small nits inline regarding coding style and some typos. Other than that LGTM.

I'm not sure (or not knowledgeable enough) how to judge the first 3 discussion bullet points in the PR description. I'd be glad if the other reviewers could comment on these.

Regarding the fourth bullet point (Style) I think it's fine as is for now and could be refactored in a follow-up PR if needed. I don't think we have an existing helper for this (or at least I couldn't find one with my grep skills).

Member

@christarazi christarazi left a comment

Thanks for the thorough explanations. Just a few minor points to address.

Member

@joestringer joestringer left a comment

Looks reasonable, couple of comments below on your discussion points. Please also address the feedback from other reviewers.

@@ -1140,3 +1151,25 @@ func (m *IptablesManager) addCiliumNoTrackXfrmRules() error {
}
return nil
}

func (m *IptablesManager) addCiliumENIRules() error {
Member

My next question is, what's our plan for a non-iptables version of this functionality? :-)

Member Author

No plan at the moment. I'm mostly focusing on getting a fix so NodePort works as expected on EKS; I don't know if we want or need a non-iptables version. It's only two rules per interface used between the nodes, so it shouldn't add too much overhead, but maybe finding an alternative is desirable... What is your opinion on the topic?

Member

@joestringer joestringer Aug 6, 2020

We're aiming to provide an entirely iptables-free implementation, so that's where my question comes from. I think it's very reasonable to have this PR to solve the functional question first, then we can address the non-iptables question later. This is mostly fishing for thoughts :-)

Just got bitten by this; thankfully, after 2h of research I landed here 😿

I have multi-interface nodes and wanted to go full Cilium, replacing kube-proxy, kube-router and metallb.

Is there any tracking ticket to support this without kube-proxy? Should it be added as a warning on the kube-proxy replacement doc page in the meantime?

Member

@n0rad Please open a bug report. This discussion is about getting rid of iptables, but it seems you are looking at getting this to work with our kube-proxy replacement, which is a different thing.

@qmonnet qmonnet force-pushed the pr/qmonnet/eks_nodeport_rulefix branch 4 times, most recently from 3aba9e4 to 2de4789 on August 6, 2020 14:02
@qmonnet
Member Author

qmonnet commented Aug 6, 2020

Thanks a lot Tobias, Chris and Joe for the verification and the reviews! It's been really helpful and much appreciated.

I addressed the comments from Tobias and Chris; there are some points that I still need to sort out regarding Joe's feedback (see inline answers).

Incremental diff
diff --git a/pkg/datapath/iptables/iptables.go b/pkg/datapath/iptables/iptables.go
index c59a76f5399a..b6f424ca646d 100644
--- a/pkg/datapath/iptables/iptables.go
+++ b/pkg/datapath/iptables/iptables.go
@@ -1037,10 +1037,10 @@ func (m *IptablesManager) InstallRules(ifName string) error {
 	}
 
 	// EKS expects the aws daemonset to set up specific rules for marking
-	// packets It marks packets coming from another node and restores the
+	// packets. It marks packets coming from another node and restores the
 	// mark on the return path to force a lookup into the main routing
 	// table. We want to reproduce something similar. Please see note in
-	// Reanitialize() in package "loader" for more details.
+	// Reanitialize() in pkg/datapath/loader for more details.
 	if option.Config.Masquerade && option.Config.IPAM == ipamOption.IPAMENI {
 		if err := m.addCiliumENIRules(); err != nil {
 			return fmt.Errorf("cannot install rules for ENI multi-node NodePort: %w", err)
@@ -1153,6 +1153,8 @@ func (m *IptablesManager) addCiliumNoTrackXfrmRules() error {
 }
 
 func (m *IptablesManager) addCiliumENIRules() error {
+	nfmask := fmt.Sprintf("%#08x", linux_defaults.MarkMultinodeNodeport)
+	ctmask := fmt.Sprintf("%#08x", linux_defaults.MaskMultinodeNodeport)
 	if err := runProg("iptables", append(
 		m.waitArgs,
 		"-t", "mangle",
@@ -1160,7 +1162,7 @@ func (m *IptablesManager) addCiliumENIRules() error {
 		"-i", option.Config.EgressMasqueradeInterfaces,
 		"-m", "comment", "--comment", "cilium: primary ENI",
 		"-m", "addrtype", "--dst-type", "LOCAL", "--limit-iface-in",
-		"-j", "CONNMARK", "--set-xmark", "0x80/0x80"),
+		"-j", "CONNMARK", "--set-xmark", nfmask+"/"+ctmask),
 		false); err != nil {
 		return err
 	}
@@ -1170,6 +1172,6 @@ func (m *IptablesManager) addCiliumENIRules() error {
 		"-A", ciliumPreMangleChain,
 		"-i", getDeliveryInterface(""),
 		"-m", "comment", "--comment", "cilium: primary ENI",
-		"-j", "CONNMARK", "--restore-mark", "--nfmask", "0x80", "--ctmask", "0x80"),
+		"-j", "CONNMARK", "--restore-mark", "--nfmask", nfmask, "--ctmask", ctmask),
 		false)
 }
diff --git a/pkg/datapath/linux/linux_defaults/linux_defaults.go b/pkg/datapath/linux/linux_defaults/linux_defaults.go
index 8d9dda86755b..33040aa7324c 100644
--- a/pkg/datapath/linux/linux_defaults/linux_defaults.go
+++ b/pkg/datapath/linux/linux_defaults/linux_defaults.go
@@ -34,6 +34,15 @@ const (
 	// RouteMarkMask is the mask required for the route mark value
 	RouteMarkMask = 0xF00
 
+	// MarkMultinodeNodeport is used on AWS EKS to mark traffic from another
+	// node, so that it gets routed back through the relevant interface.
+	MarkMultinodeNodeport = 0x80
+
+	// MaskMultinodeNodeport is the mask associated with the
+	// RouterMarkNodePort for NodePort traffic from remote nodes on AWS
+	// EKS.
+	MaskMultinodeNodeport = 0x80
+
 	// IPSecProtocolID IP protocol ID for IPSec defined in RFC4303
 	RouteProtocolIPSec = 50
 
@@ -46,6 +55,13 @@ const (
 	// of endpoints. This priority is after the local table priority.
 	RulePriorityEgress = 110
 
+	// RulePriorityNodeport is the priority of the rule used on AWS EKS, to
+	// make sure that lookups for multi-node NodePort traffic are NOT done
+	// from the table for the VPC to which the endpoint's CIDR is
+	// associated, but from the main routing table instead.
+	// This priority is before the egress priority.
+	RulePriorityNodeport = RulePriorityEgress - 1
+
 	// TunnelDeviceName the default name of the tunnel device when using vxlan
 	TunnelDeviceName = "cilium_vxlan"
 
diff --git a/pkg/datapath/loader/base.go b/pkg/datapath/loader/base.go
index 629cebf118ee..75ef8197fb53 100644
--- a/pkg/datapath/loader/base.go
+++ b/pkg/datapath/loader/base.go
@@ -137,8 +137,9 @@ func getEgressMasqueradeInterfaces() ([]string, error) {
 			ifaces = append(ifaces, l.Attrs().Name)
 		}
 	} else {
+		ifPrefix := strings.TrimSuffix(*egMasqIfOpt, "+")
 		for _, l := range links {
-			if strings.HasPrefix(l.Attrs().Name, (*egMasqIfOpt)[:len(*egMasqIfOpt)-1]) {
+			if strings.HasPrefix(l.Attrs().Name, ifPrefix) {
 				ifaces = append(ifaces, l.Attrs().Name)
 			}
 		}
@@ -358,9 +359,9 @@ func (l *Loader) Reinitialize(ctx context.Context, o datapath.BaseProgramOwner,
 				setting{"net.ipv4.conf." + iface + ".rp_filter", "2", false})
 		}
 		if err := route.ReplaceRule(route.Rule{
-			Priority: linux_defaults.RulePriorityEgress - 1,
-			Mark:     0x80,
-			Mask:     0x80,
+			Priority: linux_defaults.RulePriorityNodeport,
+			Mark:     linux_defaults.MarkMultinodeNodeport,
+			Mask:     linux_defaults.MaskMultinodeNodeport,
 			Table:    route.MainTable,
 		}); err != nil {
 			return fmt.Errorf("unable to install ip rule for ENI multi-node NodePort: %w", err)

Member

@tklauser tklauser left a comment

👍 for my requested changes

@joestringer
Member

From discussion on Slack, while this is important to fix and it would be great to include in a release shortly, it should not block other important fixes from being released. Removed the corresponding label.

YutaroHayakawa added a commit to YutaroHayakawa/cilium that referenced this pull request Sep 16, 2022
ENI CNI uses the 7th bit (0x80/0x80) of the mark for PBR. Cilium also
uses the 7th bit to carry the uppermost bit of the security identity in
some cases (e.g. Pod to Pod communication on the same node). ENI CNI
uses the CONNMARK iptables target to record that the flow came from the
external network and restores it on the reply path.

When ENI "restores" the mark for a new connection (connmark == 0), it
zeros the 7th bit of the mark and changes the identity. This becomes a
problem when Cilium is carrying the identity and the uppermost bit is 1.
The uppermost bit becomes 1 when the ClusterID of the source endpoint is
128-255, since we are encoding the ClusterID into the uppermost 8 bits
of the identity.

There are two possible iptables rules which cause this error.

1. The rule in the CILIUM_PRE_mangle chain (managed by us, introduced in cilium#12770)

```
-A CILIUM_PRE_mangle -i lxc+ -m comment --comment "cilium: primary ENI" -j CONNMARK --restore-mark --nfmask 0x80 --ctmask 0x80
```

2. The rule in the PREROUTING chain of nat table (managed by AWS)

```
-A PREROUTING -m comment --comment "AWS, CONNMARK" -j CONNMARK --restore-mark --nfmask 0x80 --ctmask 0x80
```

There are three possible setups affected by this issue:

1. Cilium is running with ENI mode after uninstalling AWS VPC CNI (Cilium
   ENI mode iptables rule exists, AWS VPC CNI iptables rule exists (can delete it))
2. Cilium is running with AWS VPC CNI with chaining mode (Cilium ENI mode iptables
   rule doesn’t exist, AWS VPC CNI iptables rule exists (cannot delete it))
3. Cilium is running with ENI mode without AWS VPC CNI from the beginning (Cilium
   ENI mode iptables rule exists, AWS VPC iptables rule doesn’t exist)

In setup 3, we can fix the issue by only modifying rule 1. This is what this
commit focuses on. It can be resolved by adding a mark match so that we don't
restore the connmark when we are already carrying the identity mark; we can
check whether MARK_MAGIC_IDENTITY is set.

```
-A CILIUM_PRE_mangle -i lxc+ -m comment --comment "cilium: primary ENI" -m mark ! --mark 0x0F00/0x0F00 -j CONNMARK --restore-mark --nfmask 0x80 --ctmask 0x80
```

Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
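
To make the clobbering concrete, here is a tiny self-contained Go illustration of the restore operation described above, under the assumption that the packet mark's 0x80 bit carries the uppermost identity bit:

package main

import "fmt"

func main() {
	// Assumption for illustration only: a packet mark whose 0x80 bit carries
	// the uppermost identity bit, as described in the commit message above.
	mark := uint32(0x80)    // set by Cilium for an identity with ClusterID >= 128
	connmark := uint32(0x0) // new connection, no connmark recorded yet

	// CONNMARK --restore-mark --nfmask 0x80 --ctmask 0x80: bit 0x80 of the
	// packet mark is replaced by bit 0x80 of the (zero) connmark.
	mark = (mark &^ 0x80) | (connmark & 0x80)

	fmt.Printf("mark after restore: %#x\n", mark) // 0x0: the identity bit is lost
}
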
kernel-patches-daemon-bpf bot pushed a commit to kernel-patches/bpf that referenced this pull request Mar 26, 2024
Anton Protopopov says:

====================
This patch series adds policy routing support in bpf_fib_lookup.
This is useful functionality that has been missing for a long time,
as without it some networking setups can't be implemented in BPF.
One example can be found here [1].

A while ago there was an attempt to add this functionality [2] by
Rumen Telbizov and David Ahern. I've completely refactored the code,
except that the changes to the struct bpf_fib_lookup were copy-pasted
from the original patch.

The first patch implements the functionality, the second patch adds
a few selftests, the third patch adds a build time check of the size
of the struct bpf_fib_lookup.

  [1] cilium/cilium#12770
  [2] https://lore.kernel.org/all/20210629185537.78008-2-rumen.telbizov@menlosecurity.com/

v1 -> v2:
  - simplify the selftests (Martin)
  - add a static check for sizeof(struct bpf_fib_lookup) (David)
====================

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Successfully merging this pull request may close these issues.

pod-to-b-multi-node-nodeport connectivity test failing on EKS with 1.8.0-rc3