iptables, loader: add rules for multi-node NodePort traffic on EKS #12770
Conversation
test-me-please
Sidenote: Discussing with @joestringer on Slack, it appears that the removal of [Edit] See #12788. [Later edit] Now merged.
Thanks a lot for the detailed descriptions and the verification steps in the PR description! Using these steps I was able to verify that this PR indeed fixes the pod-to-b-multi-node-nodeport connectivity check on EKS.
Some small nits inline regarding coding style and some typos. Other than that LGTM.
I'm not sure I'm knowledgeable enough to judge the first 3 discussion bullet points in the PR description. I'd be glad if the other reviewers could comment on these.
Regarding the fourth bullet point (Style), I think it's fine as is for now and could be refactored in a follow-up PR if needed. I don't think we have an existing helper for this (or at least I couldn't find one with my grep skills).
Thanks for the thorough explanations. Just a few minor points to address.
Looks reasonable, couple of comments below on your discussion points. Please also address the feedback from other reviewers.
@@ -1140,3 +1151,25 @@ func (m *IptablesManager) addCiliumNoTrackXfrmRules() error {
	}
	return nil
}

func (m *IptablesManager) addCiliumENIRules() error {
My next question is, what's our plan for a non-iptables version of this functionality? :-)
No plan at the moment. I'm mostly focusing on getting a fix so NodePort works as expected on EKS; I don't know if we want or need a non-iptables version. It's only two rules per interface used between the nodes, so it shouldn't add too much overhead, but maybe finding an alternative is desirable... What is your opinion on the topic?
We're aiming to provide an entirely iptables-free implementation, so that's where my question comes from. I think it's very reasonable to have this PR to solve the functional question first, then we can address the non-iptables question later. This is mostly fishing for thoughts :-)
Just got bitten by this, thankfully after 2h of research I landed here 😿
I have nodes with multiple interfaces and wanted to go full Cilium, replacing kube-proxy, kube-router and MetalLB.
Is there a tracking ticket for supporting this without kube-proxy? Should a warning be added to the kube-proxy replacement doc page in the meantime?
@n0rad Please open a bug report. This discussion is about getting rid of iptables, but it seems you are looking at getting this to work with our kube-proxy replacement, which is a different thing.
3aba9e4 to 2de4789
Thanks a lot Tobias, Chris and Joe for the verification and the reviews! It's been really helpful and much appreciated. I addressed the comments from Tobias and Chris; there are some points that I still need to sort out regarding Joe's feedback (see inline answers).

Incremental diff:

diff --git a/pkg/datapath/iptables/iptables.go b/pkg/datapath/iptables/iptables.go
index c59a76f5399a..b6f424ca646d 100644
--- a/pkg/datapath/iptables/iptables.go
+++ b/pkg/datapath/iptables/iptables.go
@@ -1037,10 +1037,10 @@ func (m *IptablesManager) InstallRules(ifName string) error {
}
// EKS expects the aws daemonset to set up specific rules for marking
- // packets It marks packets coming from another node and restores the
+ // packets. It marks packets coming from another node and restores the
// mark on the return path to force a lookup into the main routing
// table. We want to reproduce something similar. Please see note in
- // Reanitialize() in package "loader" for more details.
+ // Reanitialize() in pkg/datapath/loader for more details.
if option.Config.Masquerade && option.Config.IPAM == ipamOption.IPAMENI {
if err := m.addCiliumENIRules(); err != nil {
return fmt.Errorf("cannot install rules for ENI multi-node NodePort: %w", err)
@@ -1153,6 +1153,8 @@ func (m *IptablesManager) addCiliumNoTrackXfrmRules() error {
}
func (m *IptablesManager) addCiliumENIRules() error {
+ nfmask := fmt.Sprintf("%#08x", linux_defaults.MarkMultinodeNodeport)
+ ctmask := fmt.Sprintf("%#08x", linux_defaults.MaskMultinodeNodeport)
if err := runProg("iptables", append(
m.waitArgs,
"-t", "mangle",
@@ -1160,7 +1162,7 @@ func (m *IptablesManager) addCiliumENIRules() error {
"-i", option.Config.EgressMasqueradeInterfaces,
"-m", "comment", "--comment", "cilium: primary ENI",
"-m", "addrtype", "--dst-type", "LOCAL", "--limit-iface-in",
- "-j", "CONNMARK", "--set-xmark", "0x80/0x80"),
+ "-j", "CONNMARK", "--set-xmark", nfmask+"/"+ctmask),
false); err != nil {
return err
}
@@ -1170,6 +1172,6 @@ func (m *IptablesManager) addCiliumENIRules() error {
"-A", ciliumPreMangleChain,
"-i", getDeliveryInterface(""),
"-m", "comment", "--comment", "cilium: primary ENI",
- "-j", "CONNMARK", "--restore-mark", "--nfmask", "0x80", "--ctmask", "0x80"),
+ "-j", "CONNMARK", "--restore-mark", "--nfmask", nfmask, "--ctmask", ctmask),
false)
}
diff --git a/pkg/datapath/linux/linux_defaults/linux_defaults.go b/pkg/datapath/linux/linux_defaults/linux_defaults.go
index 8d9dda86755b..33040aa7324c 100644
--- a/pkg/datapath/linux/linux_defaults/linux_defaults.go
+++ b/pkg/datapath/linux/linux_defaults/linux_defaults.go
@@ -34,6 +34,15 @@ const (
// RouteMarkMask is the mask required for the route mark value
RouteMarkMask = 0xF00
+ // MarkMultinodeNodeport is used on AWS EKS to mark traffic from another
+ // node, so that it gets routed back through the relevant interface.
+ MarkMultinodeNodeport = 0x80
+
+ // MaskMultinodeNodeport is the mask associated with the
+ // RouterMarkNodePort for NodePort traffic from remote nodes on AWS
+ // EKS.
+ MaskMultinodeNodeport = 0x80
+
// IPSecProtocolID IP protocol ID for IPSec defined in RFC4303
RouteProtocolIPSec = 50
@@ -46,6 +55,13 @@ const (
// of endpoints. This priority is after the local table priority.
RulePriorityEgress = 110
+ // RulePriorityNodeport is the priority of the rule used on AWS EKS, to
+ // make sure that lookups for multi-node NodePort traffic are NOT done
+ // from the table for the VPC to which the endpoint's CIDR is
+ // associated, but from the main routing table instead.
+ // This priority is before the egress priority.
+ RulePriorityNodeport = RulePriorityEgress - 1
+
// TunnelDeviceName the default name of the tunnel device when using vxlan
TunnelDeviceName = "cilium_vxlan"
diff --git a/pkg/datapath/loader/base.go b/pkg/datapath/loader/base.go
index 629cebf118ee..75ef8197fb53 100644
--- a/pkg/datapath/loader/base.go
+++ b/pkg/datapath/loader/base.go
@@ -137,8 +137,9 @@ func getEgressMasqueradeInterfaces() ([]string, error) {
ifaces = append(ifaces, l.Attrs().Name)
}
} else {
+ ifPrefix := strings.TrimSuffix(*egMasqIfOpt, "+")
for _, l := range links {
- if strings.HasPrefix(l.Attrs().Name, (*egMasqIfOpt)[:len(*egMasqIfOpt)-1]) {
+ if strings.HasPrefix(l.Attrs().Name, ifPrefix) {
ifaces = append(ifaces, l.Attrs().Name)
}
}
@@ -358,9 +359,9 @@ func (l *Loader) Reinitialize(ctx context.Context, o datapath.BaseProgramOwner,
setting{"net.ipv4.conf." + iface + ".rp_filter", "2", false})
}
if err := route.ReplaceRule(route.Rule{
- Priority: linux_defaults.RulePriorityEgress - 1,
- Mark: 0x80,
- Mask: 0x80,
+ Priority: linux_defaults.RulePriorityNodeport,
+ Mark: linux_defaults.MarkMultinodeNodeport,
+ Mask: linux_defaults.MaskMultinodeNodeport,
Table: route.MainTable,
}); err != nil {
return fmt.Errorf("unable to install ip rule for ENI multi-node NodePort: %w", err) |
👍 for my requested changes
From discussion on Slack: while this is important to fix and it would be great to include in a release shortly, it should not block other important fixes from being released. Removed the corresponding label.
ENI CNI uses the 7th bit (0x80/0x80) of the mark for PBR. Cilium also uses the 7th bit to carry the uppermost bit of the security identity in some cases (e.g. Pod-to-Pod communication on the same node). ENI CNI uses the CONNMARK iptables target to record that a flow came from the external network, and restores it on the reply path. When ENI "restores" the mark for a new connection (connmark == 0), it zeroes the 7th bit of the mark and changes the identity. This becomes a problem when Cilium is carrying the identity and the uppermost bit is 1. The uppermost bit becomes 1 when the ClusterID of the source endpoint is 128-255, since we encode the ClusterID into the uppermost 8 bits of the identity.

There are two possible iptables rules which cause this error.

1. The rule in the CILIUM_PRE_mangle chain (managed by us, introduced in cilium#12770)

```
-A CILIUM_PRE_mangle -i lxc+ -m comment --comment "cilium: primary ENI" -j CONNMARK --restore-mark --nfmask 0x80 --ctmask 0x80
```

2. The rule in the PREROUTING chain of the nat table (managed by AWS)

```
-A PREROUTING -m comment --comment "AWS, CONNMARK" -j CONNMARK --restore-mark --nfmask 0x80 --ctmask 0x80
```

There are three possible setups affected by this issue:

1. Cilium is running in ENI mode after uninstalling AWS VPC CNI (e.g. create an EKS cluster and install Cilium after that)
2. Cilium is running with AWS VPC CNI in chaining mode
3. Cilium is running in ENI mode without AWS VPC CNI from the beginning (e.g. a self-hosted k8s cluster on EC2 hosts)

In setup 3, we can fix the issue by only modifying rule 1. This is what this commit focuses on. It can be resolved by adding a match so that we don't restore the connmark when we are carrying the mark. We can check whether MARK_MAGIC_IDENTITY is set.

```
-A CILIUM_PRE_mangle -i lxc+ -m comment --comment "cilium: primary ENI" -m mark ! --mark 0x0F00/0x0F00 -j CONNMARK --restore-mark --nfmask 0x80 --ctmask 0x80
```

The corresponding IP rule to look up the main routing table based on the mark value 0x80 has also been updated to account for this exclusion.

Co-developed-by: Hemanth Malla <hemanth.malla@datadoghq.com>
Signed-off-by: Hemanth Malla <hemanth.malla@datadoghq.com>
Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
Suggested-by: Eric Mountain <eric.mountain@datadoghq.com>
Anton Protopopov says:
====================
This patch series adds policy routing support in bpf_fib_lookup. This is a useful functionality which was missing for a long time, as without it some networking setups can't be implemented in BPF. One example can be found here [1].

A while ago there was an attempt to add this functionality [2] by Rumen Telbizov and David Ahern. I've completely refactored the code, except that the changes to the struct bpf_fib_lookup were copy-pasted from the original patch.

The first patch implements the functionality, the second patch adds a few selftests, the third patch adds a build time check of the size of the struct bpf_fib_lookup.

[1] cilium/cilium#12770
[2] https://lore.kernel.org/all/20210629185537.78008-2-rumen.telbizov@menlosecurity.com/

v1 -> v2:
- simplify the selftests (Martin)
- add a static check for sizeof(struct bpf_fib_lookup) (David)
====================
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Description
Multi-node NodePort traffic on EKS needs a set of specific rules that are usually set up by the aws daemonset. These rules mark packets coming from another node through eth0, and restore the mark on the return path to force a lookup into the main routing table. Without them, the ip rules set by the cilium-cni plugin tell the host to look up the table related to the VPC for which the CIDR used by the endpoint has been configured.

We want to reproduce equivalent rules to ensure correct routing; otherwise multi-node NodePort traffic is not routed correctly. This could be observed with the pod-to-b-multi-node-nodeport pod from the connectivity check never getting ready.

This commit makes the loader and the iptables module create the relevant rules when IPAM is ENI and egress masquerading is in use. The rules are nearly identical to those from the aws daemonset (different comments, different interface prefix for the conntrack return path, explicit preference for the ip rule):
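As a rough sketch, here is approximately what gets installed, reconstructed from the diff in this PR rather than copied from the agent's output. The ingress interface assumes egress-masquerade-interfaces=eth0, lxc+ is the endpoint (delivery) interface prefix, and preference 109 corresponds to linux_defaults.RulePriorityEgress - 1:

```
# Mark NodePort traffic arriving from another node for an address local
# to this node, and remember the mark in conntrack.
iptables -t mangle -A CILIUM_PRE_mangle -i eth0 \
    -m comment --comment "cilium: primary ENI" \
    -m addrtype --dst-type LOCAL --limit-iface-in \
    -j CONNMARK --set-xmark 0x80/0x80
# Restore the mark on the return path coming back from the endpoint.
iptables -t mangle -A CILIUM_PRE_mangle -i lxc+ \
    -m comment --comment "cilium: primary ENI" \
    -j CONNMARK --restore-mark --nfmask 0x80 --ctmask 0x80
# Route marked return traffic via the main routing table, before the
# per-ENI egress rules are consulted (preference 109 = RulePriorityEgress - 1).
ip rule add fwmark 0x80/0x80 lookup main pref 109
```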
Steps for verification
Apply the patch, build and push the Docker image:
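For example, a minimal sketch assuming a Dockerfile at the repository root and a placeholder registry and image name; the tree's own image build targets can be used instead:

```
# Build a development image from the patched tree and push it to a
# registry reachable from the EKS nodes (names are placeholders).
docker build -t <registry>/<user>/cilium-dev:nodeport-eks .
docker push <registry>/<user>/cilium-dev:nodeport-eks
```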
Follow the instructions to install Cilium on AWS EKS. When deploying Cilium with Helm, do not forget to pass the location of the custom image for the agent:
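A hedged sketch of the Helm invocation: the value names below follow the Cilium 1.8-era EKS guide from memory and should be treated as assumptions, and the key used to point the agent at the custom image varies by chart version, so check the chart's values before running:

```
# Assumed 1.8-era value names; verify against the chart in use.
helm install cilium cilium/cilium --version 1.8.2 \
    --namespace kube-system \
    --set global.eni=true \
    --set global.egressMasqueradeInterfaces=eth0 \
    --set global.tunnel=disabled \
    --set global.nodeinit.enabled=true
# ...plus whatever value the chart version uses to select the custom
# agent image pushed in the previous step.
```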
Deploy the connectivity check. Pick the version from Cilium v1.8 (the latest version does not have pod-to-b-multi-node-nodeport):
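For instance, assuming the manifest path on the v1.8 branch (verify the URL before use):

```
# Deploy the v1.8 connectivity check, which still includes
# pod-to-b-multi-node-nodeport.
kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/v1.8/examples/kubernetes/connectivity-check/connectivity-check.yaml
```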
Wait a few seconds. On a Cilium pod, the newly-added rules should be visible:
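One way to check, with <cilium-pod> standing in for one of the agent pods:

```
# Inspect the mangle chain and the ip rule added by this PR.
kubectl -n kube-system exec <cilium-pod> -- iptables -t mangle -S CILIUM_PRE_mangle
kubectl -n kube-system exec <cilium-pod> -- ip rule show
```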
All pods from the connectivity-check should be marked as ready, in particular pod-to-b-multi-node-nodeport:
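For example:

```
# The pod should eventually report Ready.
kubectl get pods | grep pod-to-b-multi-node-nodeport
```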
I am not entirely sure of my implementation. In particular, reviewers might want to consider:
- I picked the priority for the ip rule (linux_defaults.RulePriorityEgress - 1) so it gets looked up just before the rule we want to avoid, but maybe I should create a new constant in linux_defaults for that? Not sure if this is worth it.
- The ip rule addition feels a bit cumbersome (with this + prefix to process). I'm not sure we have a helper for this somewhere, I could not find one (other lxc+-style names are passed directly to iptables from what I could see).
- We do have a helper somewhere for setting rp_filter, but it is not used in Reinitialize() from the loader, so I just stuck to local code and appended to sysSettings.
Tests
None in this PR. Ready-state can be tested with the connectivity check from Cilium v1.8 (I didn't realise it was gone until writing this description).
Follow-up: Re-introduce pod-to-b-multi-node-nodeport or equivalent in the connectivity checks.
Fixes: #12098