iptables, loader: add rules for multi-node NodePort traffic on EKS #12770
Conversation
test-me-please
Sidenote: Discussing with @joestringer on Slack, it appears that the removal of [Edit] See #12788. [Later edit] Now merged.
Thanks a lot for the detailed descriptions and the verification steps in the PR description! Using these steps I was able to verify that this PR indeed fixes the pod-to-b-multi-node-nodeport connectivity check on EKS.
Some small nits inline regarding coding style and some typos. Other than that LGTM.
I'm not sure I'm knowledgeable enough to judge the first 3 discussion bullet points in the PR description. I'd be glad if the other reviewers could comment on these.
Regarding the fourth bullet point (Style), I think it's fine as is for now and could be refactored in a follow-up PR if needed. I don't think we have an existing helper for this (or at least I couldn't find one with my grep skills).
Thanks for the thorough explanations. Just a few minor points to address.
Looks reasonable, couple of comments below on your discussion points. Please also address the feedback from other reviewers.
@@ -1140,3 +1151,25 @@ func (m *IptablesManager) addCiliumNoTrackXfrmRules() error {
	}
	return nil
}

func (m *IptablesManager) addCiliumENIRules() error {
My next question is, what's our plan for a non-iptables version of this functionality? :-)
No plan at the moment. I'm mostly focusing on getting a fix so NodePort works as expected on EKS; I don't know if we want or need a non-iptables version. It's only two rules per interface used between the nodes, so it shouldn't add too much overhead, but maybe finding an alternative is desirable... What is your opinion on the topic?
We're aiming to provide an entirely iptables-free implementation, so that's where my question comes from. I think it's very reasonable to have this PR to solve the functional question first, then we can address the non-iptables question later. This is mostly fishing for thoughts :-)
Just got bitten by this, thankfully after 2h of research I landed here 😿
I have nodes with multiple interfaces and wanted to go full Cilium, replacing kube-proxy, kube-router and MetalLB.
Is there a tracking ticket for supporting this without kube-proxy? Should a warning be added to the kube-proxy replacement doc page in the meantime?
@n0rad Please open a bug report. This discussion is about getting rid of iptables, but it seems you are looking at getting this to work with our kube-proxy replacement, which is a different thing.
3aba9e4 to 2de4789
Thanks a lot Tobias, Chris and Joe for the verification and the reviews! It's been really helpful and much appreciated. I addressed the comments from Tobias and Chris; there are some points that I still need to sort out regarding Joe's feedback (see inline answers).

Incremental diff:

diff --git a/pkg/datapath/iptables/iptables.go b/pkg/datapath/iptables/iptables.go
index c59a76f5399a..b6f424ca646d 100644
--- a/pkg/datapath/iptables/iptables.go
+++ b/pkg/datapath/iptables/iptables.go
@@ -1037,10 +1037,10 @@ func (m *IptablesManager) InstallRules(ifName string) error {
}
// EKS expects the aws daemonset to set up specific rules for marking
- // packets It marks packets coming from another node and restores the
+ // packets. It marks packets coming from another node and restores the
// mark on the return path to force a lookup into the main routing
// table. We want to reproduce something similar. Please see note in
- // Reanitialize() in package "loader" for more details.
+ // Reanitialize() in pkg/datapath/loader for more details.
if option.Config.Masquerade && option.Config.IPAM == ipamOption.IPAMENI {
if err := m.addCiliumENIRules(); err != nil {
return fmt.Errorf("cannot install rules for ENI multi-node NodePort: %w", err)
@@ -1153,6 +1153,8 @@ func (m *IptablesManager) addCiliumNoTrackXfrmRules() error {
}
func (m *IptablesManager) addCiliumENIRules() error {
+ nfmask := fmt.Sprintf("%#08x", linux_defaults.MarkMultinodeNodeport)
+ ctmask := fmt.Sprintf("%#08x", linux_defaults.MaskMultinodeNodeport)
if err := runProg("iptables", append(
m.waitArgs,
"-t", "mangle",
@@ -1160,7 +1162,7 @@ func (m *IptablesManager) addCiliumENIRules() error {
"-i", option.Config.EgressMasqueradeInterfaces,
"-m", "comment", "--comment", "cilium: primary ENI",
"-m", "addrtype", "--dst-type", "LOCAL", "--limit-iface-in",
- "-j", "CONNMARK", "--set-xmark", "0x80/0x80"),
+ "-j", "CONNMARK", "--set-xmark", nfmask+"/"+ctmask),
false); err != nil {
return err
}
@@ -1170,6 +1172,6 @@ func (m *IptablesManager) addCiliumENIRules() error {
"-A", ciliumPreMangleChain,
"-i", getDeliveryInterface(""),
"-m", "comment", "--comment", "cilium: primary ENI",
- "-j", "CONNMARK", "--restore-mark", "--nfmask", "0x80", "--ctmask", "0x80"),
+ "-j", "CONNMARK", "--restore-mark", "--nfmask", nfmask, "--ctmask", ctmask),
false)
}
diff --git a/pkg/datapath/linux/linux_defaults/linux_defaults.go b/pkg/datapath/linux/linux_defaults/linux_defaults.go
index 8d9dda86755b..33040aa7324c 100644
--- a/pkg/datapath/linux/linux_defaults/linux_defaults.go
+++ b/pkg/datapath/linux/linux_defaults/linux_defaults.go
@@ -34,6 +34,15 @@ const (
// RouteMarkMask is the mask required for the route mark value
RouteMarkMask = 0xF00
+ // MarkMultinodeNodeport is used on AWS EKS to mark traffic from another
+ // node, so that it gets routed back through the relevant interface.
+ MarkMultinodeNodeport = 0x80
+
+ // MaskMultinodeNodeport is the mask associated with the
+ // RouterMarkNodePort for NodePort traffic from remote nodes on AWS
+ // EKS.
+ MaskMultinodeNodeport = 0x80
+
// IPSecProtocolID IP protocol ID for IPSec defined in RFC4303
RouteProtocolIPSec = 50
@@ -46,6 +55,13 @@ const (
// of endpoints. This priority is after the local table priority.
RulePriorityEgress = 110
+ // RulePriorityNodeport is the priority of the rule used on AWS EKS, to
+ // make sure that lookups for multi-node NodePort traffic are NOT done
+ // from the table for the VPC to which the endpoint's CIDR is
+ // associated, but from the main routing table instead.
+ // This priority is before the egress priority.
+ RulePriorityNodeport = RulePriorityEgress - 1
+
// TunnelDeviceName the default name of the tunnel device when using vxlan
TunnelDeviceName = "cilium_vxlan"
diff --git a/pkg/datapath/loader/base.go b/pkg/datapath/loader/base.go
index 629cebf118ee..75ef8197fb53 100644
--- a/pkg/datapath/loader/base.go
+++ b/pkg/datapath/loader/base.go
@@ -137,8 +137,9 @@ func getEgressMasqueradeInterfaces() ([]string, error) {
ifaces = append(ifaces, l.Attrs().Name)
}
} else {
+ ifPrefix := strings.TrimSuffix(*egMasqIfOpt, "+")
for _, l := range links {
- if strings.HasPrefix(l.Attrs().Name, (*egMasqIfOpt)[:len(*egMasqIfOpt)-1]) {
+ if strings.HasPrefix(l.Attrs().Name, ifPrefix) {
ifaces = append(ifaces, l.Attrs().Name)
}
}
@@ -358,9 +359,9 @@ func (l *Loader) Reinitialize(ctx context.Context, o datapath.BaseProgramOwner,
setting{"net.ipv4.conf." + iface + ".rp_filter", "2", false})
}
if err := route.ReplaceRule(route.Rule{
- Priority: linux_defaults.RulePriorityEgress - 1,
- Mark: 0x80,
- Mask: 0x80,
+ Priority: linux_defaults.RulePriorityNodeport,
+ Mark: linux_defaults.MarkMultinodeNodeport,
+ Mask: linux_defaults.MaskMultinodeNodeport,
Table: route.MainTable,
}); err != nil {
return fmt.Errorf("unable to install ip rule for ENI multi-node NodePort: %w", err) |
👍 for my requested changes
From discussion on Slack: while this is important to fix and it would be great to include in a release shortly, it should not block other important fixes from being released. Removed the corresponding label.
ENI CNI uses the 7th bit (0x80/0x80) of the mark for PBR. Cilium also uses the 7th bit to carry the uppermost bit of the security identity in some cases (e.g. Pod-to-Pod communication on the same node). ENI CNI uses the CONNMARK iptables target to record that a flow came from the external network, and restores it on the reply path. When ENI "restores" the mark for a new connection (connmark == 0), it zeroes the 7th bit of the mark and changes the identity. This becomes a problem when Cilium is carrying the identity and the uppermost bit is 1. The uppermost bit becomes 1 when the ClusterID of the source endpoint is 128-255, since we encode the ClusterID into the uppermost 8 bits of the identity.

There are two possible iptables rules which cause this error.

1. The rule in the CILIUM_PRE_mangle chain (managed by us, introduced in cilium#12770)

```
-A CILIUM_PRE_mangle -i lxc+ -m comment --comment "cilium: primary ENI" -j CONNMARK --restore-mark --nfmask 0x80 --ctmask 0x80
```

2. The rule in the PREROUTING chain of the nat table (managed by AWS)

```
-A PREROUTING -m comment --comment "AWS, CONNMARK" -j CONNMARK --restore-mark --nfmask 0x80 --ctmask 0x80
```

There are three possible setups affected by this issue:

1. Cilium is running in ENI mode after uninstalling AWS VPC CNI (e.g. create an EKS cluster and install Cilium after that)
2. Cilium is running with AWS VPC CNI in chaining mode
3. Cilium is running in ENI mode without AWS VPC CNI from the beginning (e.g. a self-hosted k8s cluster on EC2 hosts)

In setup 3, we can fix the issue by only modifying rule 1. This is what this commit focuses on. It can be resolved by adding a match so that we don't restore the connmark when we are carrying the mark. We can check whether MARK_MAGIC_IDENTITY is set.

```
-A CILIUM_PRE_mangle -i lxc+ -m comment --comment "cilium: primary ENI" -m mark ! --mark 0x0F00/0x0F00 -j CONNMARK --restore-mark --nfmask 0x80 --ctmask 0x80
```

The corresponding IP rule to look up the main routing table based on the mark value 0x80 has also been updated to account for this exclusion.

Co-developed-by: Hemanth Malla <hemanth.malla@datadoghq.com>
Signed-off-by: Hemanth Malla <hemanth.malla@datadoghq.com>
Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
Suggested-by: Eric Mountain <eric.mountain@datadoghq.com>
Anton Protopopov says:
====================
This patch series adds policy routing support in bpf_fib_lookup. This is a useful functionality which was missing for a long time, as without it some networking setups can't be implemented in BPF. One example can be found here [1].

A while ago there was an attempt to add this functionality [2] by Rumen Telbizov and David Ahern. I've completely refactored the code, except that the changes to the struct bpf_fib_lookup were copy-pasted from the original patch.

The first patch implements the functionality, the second patch adds a few selftests, the third patch adds a build time check of the size of the struct bpf_fib_lookup.

[1] cilium/cilium#12770
[2] https://lore.kernel.org/all/20210629185537.78008-2-rumen.telbizov@menlosecurity.com/

v1 -> v2:
- simplify the selftests (Martin)
- add a static check for sizeof(struct bpf_fib_lookup) (David)
====================
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Description
Multi-node NodePort traffic on EKS needs a set of specific rules that are usually set up by the aws daemonset. These rules mark packets coming from another node through eth0, and restore the mark on the return path to force a lookup into the main routing table. Without them, the ip rules set by the cilium-cni plugin tell the host to look up the table related to the VPC for which the CIDR used by the endpoint has been configured.

We want to reproduce equivalent rules to ensure correct routing; otherwise multi-node NodePort traffic is not routed correctly. This could be observed with the pod-to-b-multi-node-nodeport pod from the connectivity check never getting ready.

This commit makes the loader and the iptables module create the relevant rules when IPAM is ENI and egress masquerading is in use. The rules are nearly identical to those from the aws daemonset (different comments, different interface prefix for the conntrack return path, explicit preference for the ip rule):
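As a rough sketch, here is approximately what gets installed, reconstructed from the diff in this PR rather than copied from the agent's output. The ingress interface assumes egress-masquerade-interfaces=eth0, lxc+ is the endpoint (delivery) interface prefix, and preference 109 corresponds to linux_defaults.RulePriorityEgress - 1:

```
# Mark NodePort traffic arriving from another node for an address local
# to this node, and remember the mark in conntrack.
iptables -t mangle -A CILIUM_PRE_mangle -i eth0 \
    -m comment --comment "cilium: primary ENI" \
    -m addrtype --dst-type LOCAL --limit-iface-in \
    -j CONNMARK --set-xmark 0x80/0x80
# Restore the mark on the return path coming back from the endpoint.
iptables -t mangle -A CILIUM_PRE_mangle -i lxc+ \
    -m comment --comment "cilium: primary ENI" \
    -j CONNMARK --restore-mark --nfmask 0x80 --ctmask 0x80
# Route marked return traffic via the main routing table, before the
# per-ENI egress rules are consulted (preference 109 = RulePriorityEgress - 1).
ip rule add fwmark 0x80/0x80 lookup main pref 109
```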
Steps for verification
Apply the patch, build and push the Docker image:
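For example, a minimal sketch assuming a Dockerfile at the repository root and a placeholder registry and image name; the tree's own image build targets can be used instead:

```
# Build a development image from the patched tree and push it to a
# registry reachable from the EKS nodes (names are placeholders).
docker build -t <registry>/<user>/cilium-dev:nodeport-eks .
docker push <registry>/<user>/cilium-dev:nodeport-eks
```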
Follow the instructions to install Cilium on AWS EKS. When deploying Cilium with Helm, do not forget to pass the location of the custom image for the agent:
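A hedged sketch of the Helm invocation: the value names below follow the Cilium 1.8-era EKS guide from memory and should be treated as assumptions, and the key used to point the agent at the custom image varies by chart version, so check the chart's values before running:

```
# Assumed 1.8-era value names; verify against the chart in use.
helm install cilium cilium/cilium --version 1.8.2 \
    --namespace kube-system \
    --set global.eni=true \
    --set global.egressMasqueradeInterfaces=eth0 \
    --set global.tunnel=disabled \
    --set global.nodeinit.enabled=true
# ...plus whatever value the chart version uses to select the custom
# agent image pushed in the previous step.
```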
Deploy the connectivity check. Pick the version from Cilium v1.8 (the latest version does not have pod-to-b-multi-node-nodeport):
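For instance, assuming the manifest path on the v1.8 branch (verify the URL before use):

```
# Deploy the v1.8 connectivity check, which still includes
# pod-to-b-multi-node-nodeport.
kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/v1.8/examples/kubernetes/connectivity-check/connectivity-check.yaml
```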
Wait a few seconds. On a Cilium pod, the newly-added rules should be visible:
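One way to check, with <cilium-pod> standing in for one of the agent pods:

```
# Inspect the mangle chain and the ip rule added by this PR.
kubectl -n kube-system exec <cilium-pod> -- iptables -t mangle -S CILIUM_PRE_mangle
kubectl -n kube-system exec <cilium-pod> -- ip rule show
```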
All pods from the connectivity-check should be marked as ready, in particular pod-to-b-multi-node-nodeport:
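For example:

```
# The pod should eventually report Ready.
kubectl get pods | grep pod-to-b-multi-node-nodeport
```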
I am not entirely sure of my implementation. In particular, reviewers might want to consider:
- I picked the priority for the ip rule (linux_defaults.RulePriorityEgress - 1) so it gets looked up just before the rule we want to avoid, but maybe I should create a new constant in linux_defaults for that? Not sure if this is worth it.
- The ip rule addition feels a bit cumbersome (with this + prefix to process). I'm not sure we have a helper for this somewhere, I could not find one (other lxc+-style names are passed directly to iptables from what I could see).
- We do have a helper somewhere for setting rp_filter, but it is not used in Reinitialize() from the loader, so I just stuck to local code and appended to sysSettings.
Tests
None in this PR. Ready-state can be tested with the connectivity check from Cilium v1.8 (I didn't realise it was gone until writing this description).
Follow-up: Re-introduce pod-to-b-multi-node-nodeport or equivalent in the connectivity checks.
Fixes: #12098