Conversation

dylandreimerink (Member)
SystemD versions >= 245 create sysctl config which sets the `rp_filter` value for all network interfaces to 1. This conflicts with Cilium, which requires `rp_filter` to be 0 on interfaces it uses.

This commit adds a small utility, `sysctlfix`, which inserts a config file with the highest priority into the `/etc/sysctl.d` directory, containing directives to disable `rp_filter` (and possibly other sysctl config in the future).

This utility is called as an init container before the cilium-agent starts. Because the sysctl config is in place before the agent starts, all interfaces created by the agent that match the pattern in the config file will have `rp_filter` disabled, even when SystemD >= 245 is installed.

Fixes: #10645
Fixes: #19909

Fixed SystemD >=245 sysctl (`rp_filter`) config incompatibility

Fixes: cilium#10645
Fixes: cilium#19909
Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
@maintainer-s-little-helper bot added the `dont-merge/needs-release-note-label` label Jun 3, 2022
@dylandreimerink added the `area/datapath`, `area/daemon`, and `release-note/bug` labels Jun 3, 2022
@maintainer-s-little-helper bot removed the `dont-merge/needs-release-note-label` label Jun 3, 2022
@dylandreimerink marked this pull request as ready for review and requested reviews from several teams as code owners June 3, 2022 10:46
sayboras (Member) left a comment:

LGTM 💯

sayboras (Member) commented Jun 3, 2022

Below is from my local minikube; do we need to do something with `net.ipv4.conf.all|default.rp_filter=1`?

```
docker@minikube:/etc/sysctl.d$ grep -Ris "rp_filter" *
10-network-security.conf:net.ipv4.conf.default.rp_filter=1
10-network-security.conf:net.ipv4.conf.all.rp_filter=1
99-sysctl.conf:#net.ipv4.conf.default.rp_filter=1
99-sysctl.conf:#net.ipv4.conf.all.rp_filter=1
99-zzz-override_cilium.conf:# Disable rp_filter on Cilium interfaces since it may cause mangled packets to be dropped
99-zzz-override_cilium.conf:net.ipv4.conf.lxc*.rp_filter = 0
99-zzz-override_cilium.conf:net.ipv4.conf.cilium_*.rp_filter = 0
```

rastislavs (Contributor)

> Below is from my local minikube, do we need to do something with net.ipv4.conf.all|default.rp_filter=1?

Yeah, in our case from #19761, the proposed change is not enough to resolve the issue, and setting `net.ipv4.conf.all.rp_filter` would solve it.

dylandreimerink (Member, Author)

@sayboras @rastislavs is the goal of adding `net.ipv4.conf.all|default.rp_filter=1|0` to change this setting for all interfaces? Asking because I am looking into it atm. Setting it only actually changes `net.ipv4.conf.all.rp_filter` for me:

```
$ sysctl -a | grep rp_filter
net.ipv4.conf.all.arp_filter = 0
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.cilium_host.arp_filter = 0
net.ipv4.conf.cilium_host.rp_filter = 2
net.ipv4.conf.cilium_net.arp_filter = 0
net.ipv4.conf.cilium_net.rp_filter = 2
net.ipv4.conf.cilium_vxlan.arp_filter = 0
net.ipv4.conf.cilium_vxlan.rp_filter = 2
net.ipv4.conf.default.arp_filter = 0
net.ipv4.conf.default.rp_filter = 2
net.ipv4.conf.docker0.arp_filter = 0
net.ipv4.conf.docker0.rp_filter = 2
net.ipv4.conf.enp0s3.arp_filter = 0
net.ipv4.conf.enp0s3.rp_filter = 2
net.ipv4.conf.enp0s8.arp_filter = 0
net.ipv4.conf.enp0s8.rp_filter = 2
net.ipv4.conf.lo.arp_filter = 0
net.ipv4.conf.lo.rp_filter = 2
net.ipv4.conf.lxc_health.arp_filter = 0
net.ipv4.conf.lxc_health.rp_filter = 2
net.ipv4.conf.veth5086343.arp_filter = 0
net.ipv4.conf.veth5086343.rp_filter = 2
net.ipv4.conf.veth8e6cce6.arp_filter = 0
net.ipv4.conf.veth8e6cce6.rp_filter = 2
net.ipv4.conf.vetha39077f.arp_filter = 0
net.ipv4.conf.vetha39077f.rp_filter = 2
```

I have to add `net.ipv4.conf.*.rp_filter` to get it to change all interfaces.

```
$ sudo systemctl restart systemd-sysctl.service
$ sysctl -a | grep rp_filter
net.ipv4.conf.all.arp_filter = 0
net.ipv4.conf.all.rp_filter = 0
net.ipv4.conf.cilium_host.arp_filter = 0
net.ipv4.conf.cilium_host.rp_filter = 0
net.ipv4.conf.cilium_net.arp_filter = 0
net.ipv4.conf.cilium_net.rp_filter = 0
net.ipv4.conf.cilium_vxlan.arp_filter = 0
net.ipv4.conf.cilium_vxlan.rp_filter = 0
net.ipv4.conf.default.arp_filter = 0
net.ipv4.conf.default.rp_filter = 2
net.ipv4.conf.docker0.arp_filter = 0
net.ipv4.conf.docker0.rp_filter = 0
net.ipv4.conf.enp0s3.arp_filter = 0
net.ipv4.conf.enp0s3.rp_filter = 0
net.ipv4.conf.enp0s8.arp_filter = 0
net.ipv4.conf.enp0s8.rp_filter = 0
net.ipv4.conf.lo.arp_filter = 0
net.ipv4.conf.lo.rp_filter = 0
net.ipv4.conf.lxc_health.arp_filter = 0
net.ipv4.conf.lxc_health.rp_filter = 0
net.ipv4.conf.veth5086343.arp_filter = 0
net.ipv4.conf.veth5086343.rp_filter = 0
net.ipv4.conf.veth8e6cce6.arp_filter = 0
net.ipv4.conf.veth8e6cce6.rp_filter = 0
net.ipv4.conf.vetha39077f.arp_filter = 0
net.ipv4.conf.vetha39077f.rp_filter = 0
```
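A note on why setting only `all` is not enough to flip the per-interface values above: per the kernel's ip-sysctl documentation, source validation on an interface uses the maximum of `conf/all/rp_filter` and `conf/<iface>/rp_filter`, so a per-interface value of 2 keeps loose mode active until both knobs are lowered. A tiny sketch of the rule:

```go
package main

import "fmt"

// effectiveRPFilter mirrors the kernel rule from ip-sysctl.txt:
// the value actually used for an interface is max(conf/all, conf/<iface>).
func effectiveRPFilter(all, iface int) int {
	if all > iface {
		return all
	}
	return iface
}

func main() {
	// First run above: all=1, per-interface=2 -> loose mode (2) still in effect.
	fmt.Println(effectiveRPFilter(1, 2))
	// Second run: all=0 and per-interface=0 -> rp_filter fully disabled.
	fmt.Println(effectiveRPFilter(0, 0))
}
```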

ungureanuvladvictor (Member)
@dylandreimerink -- I was reading this and was curious whether this can be done in the CNI plugin itself (something like https://github.com/cilium/cilium/blob/master/plugins/cilium-cni/cilium-cni.go#L516-L527). I am not familiar with the problem itself, but wanted to offer a possible alternative way of achieving the goal (edit the sysctl and reload inside the netns).

dylandreimerink (Member, Author)

@ungureanuvladvictor that is something to look into. It just occurred to me that this might also affect Cilium in 'standalone LB' mode (@brb, @borkmann?), so setting this up via an init container might not be the way to go. Perhaps we should move this into the agent/daemon itself.

joestringer (Member)

CNI plugin is an interesting option, the awkward thing there is that the CNI binary is run semi-frequently and each time is an independent execution of the binary. So it'd just need to be idempotent. The nice thing about running in CNI plugin context is that it's running as an application directly on the host with privileges, so we shouldn't need to worry about the security context at all.

Thinking about the option to move it into the agent, this becomes more tricky because then we're further expanding the set of things that the cilium-agent container can access at runtime, and we're trying to avoid granting blanket access to lots of the system at all times from the cilium-agent. Cilium needs to be pretty powerful already, but do we really need access to reconfigure random /etc/sysctl.d settings in addition to all the other stuff? At least if it's an init container, this access is relatively short-lived.

aditighag (Member) left a comment:

🚀


```go
var sysctlConfig = strings.Join([]string{
	"# Disable rp_filter on Cilium interfaces since it may cause mangled packets to be dropped",
	"net.ipv4.conf.lxc*.rp_filter = 0",
	"net.ipv4.conf.cilium_*.rp_filter = 0",
}, "\n")
```
A member commented on this snippet:
I overlooked that we are completely disabling the filter. What about setting it to the Loose mode? That's all we should need to pass the packets with asymmetric routing. Details - #13130. That's the value we are currently setting as well.

dylandreimerink (Member, Author) replied Jun 3, 2022:

I set it to disabled since that seemed to be the suggested fix in #10645. According to @aanm, this commit in SystemD is the source of the issue. In the commit the rp_filter default is changed from 1 (Strict) to 2 (Loose). In #10645 this commit is mentioned as the culprit (also a change to 2 (Loose)).

The description of 'Loose' mode is:

> Loose mode as defined in RFC3704 Loose Reverse Path
> Each incoming packet's source address is also tested against the FIB
> and if the source address is not reachable via any interface
> the packet check will fail.

I am not sure if there are common use cases in Cilium where we handle packets for source addresses which don't appear in the FIB, but it wouldn't surprise me if there are.

A member replied:
Cilium isn't the only entity that could be running on a host. The issues you've linked above indicate that systemd overwrites rp_filter settings set by Cilium. Regardless, I would like for us to understand why we would need to disable source address checks altogether in case of Cilium.

dylandreimerink (Member, Author)

> Thinking about the option to move it into the agent, this becomes more tricky because then we're further expanding the set of things that the cilium-agent container can access at runtime

That is a good point; I hadn't considered that. Running the tool should be a one-time thing, so we can make it part of the installation process/guide for the standalone LB if it ever becomes an issue.

> So it'd just need to be idempotent.

Updating the config file frequently doesn't do any harm, but we should only trigger a reload of the systemd unit when the file has actually changed. I will take this into account when updating the tool, so we can move this logic to the CNI plugin if we wish.

ungureanuvladvictor (Member)

@dylandreimerink -- another thing that crossed my mind just now is that I don't think this approach is going to work when deploying Cilium into OpenShift. There, files on disk (especially config files for various systems: sshd, systemd, etc.) are configured via this mechanism, and when updates are done outside of it, the node itself will get into a degraded state.

Maybe this is not worth changing the approach for now, but it would be good to give it a try: install into OpenShift and then run an OpenShift upgrade to see if the system stays happy.

@qmonnet mentioned this pull request Jun 30, 2022
@qmonnet added the `backport-pending/1.11` and `backport-done/1.11` labels, and removed the `needs-backport/1.11` label Jun 30, 2022
sayboras added a commit to sayboras/cilium that referenced this pull request Jul 24, 2022
The newly added init container (e.g. sysctl init) requires hostproc
volume mount, however, this volume is only mounted based on the helm
flag .Values.cgroup.autoMount.enabled. This commit is to make sure
that such condition is added to avoid any failure.

Relates: cilium#20072
Fixes: cilium#20626
Signed-off-by: Tam Mach <tam.mach@cilium.io>
gandro pushed a commit that referenced this pull request Jul 26, 2022
nbusseneau pushed a commit to nbusseneau/cilium that referenced this pull request Aug 9, 2022

[ upstream commit 6d920b2 ]

[ backporter's notes: original file `install/kubernetes/cilium/templates/cilium-agent/daemonset.yaml` does not exist in v1.10 due to the Helm refactor from 8c49b83; changes were applied to `install/kubernetes/cilium/templates/cilium-agent-daemonset.yaml` instead. ]
dezmodue pushed a commit to dezmodue/cilium that referenced this pull request Aug 10, 2022
nbusseneau pushed a commit to nbusseneau/cilium that referenced this pull request Aug 10, 2022

[ upstream commit 6d920b2 ]

[ backporter's notes: conflicts due to privileges rework not having been backported to `v1.11`. ]
nbusseneau pushed a commit to nbusseneau/cilium that referenced this pull request Aug 10, 2022
tklauser pushed a commit that referenced this pull request Aug 11, 2022
tklauser pushed a commit that referenced this pull request Aug 11, 2022
tklauser pushed a commit that referenced this pull request Aug 11, 2022
Labels

- area/daemon: Impacts operation of the Cilium daemon.
- area/datapath: Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
- backport-done/1.11: The backport for Cilium 1.11.x for this PR is done.
- ready-to-merge: This PR has passed all tests and received consensus from code owners to merge.
- release-note/bug: This PR fixes an issue in a previous release of Cilium.