Use ipcache for tunneling #37146
Conversation
Force-pushed from 78f57ff to 5b495e3.
/test
Force-pushed from 5b495e3 to 8a98f2c.
Store allocation CIDRs from each remote node into the ipcache, using the world identity. This lays the groundwork for using the ipcache for tunneling instead of relying on the additional tunnel map. Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
The ipcache map now stores entries describing remote node pod CIDRs, including the tunnel endpoint and encryption key. Therefore, when a packet must be encapsulated or encrypted, we can leverage the information from the ipcache lookup, avoiding the additional lookup into the tunnel map. Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
Given that the datapath now relies on the ipcache map for tunneling, it is possible to remove the insertions and deletions of allocation CIDRs in the deprecated tunnel map. Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
When handling the deletion of a node, we should take into account all of the node's allocation CIDRs, not just the primary one. Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
As a consequence of using the ipcache map to support tunneling, multi-pool IPAM is now compatible with it, so we can lift the startup check that stops the agent when both features are enabled. Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
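To make the intent of these commits concrete, here is a minimal standalone sketch of the idea (my own simplified Go model, not Cilium's actual datapath, which implements this in BPF C): once the ipcache entry for a remote node's pod CIDR carries the tunnel endpoint and encryption key alongside the identity, a single longest-prefix-match lookup answers both "what identity is this destination" and "where do I encapsulate to", so the separate tunnel map lookup becomes unnecessary.

```go
package main

import (
	"fmt"
	"net/netip"
)

// ipcacheEntry models only the fields relevant to this PR.
type ipcacheEntry struct {
	identity       uint32     // e.g. the reserved world identity
	tunnelEndpoint netip.Addr // remote node IP to encapsulate towards
	encryptKey     uint8      // encryption key index, 0 if encryption is off
}

// lookup does a naive longest-prefix match over pod-CIDR entries.
func lookup(entries map[netip.Prefix]ipcacheEntry, ip netip.Addr) (ipcacheEntry, bool) {
	var (
		best    ipcacheEntry
		bestLen = -1
		found   bool
	)
	for p, e := range entries {
		if p.Contains(ip) && p.Bits() > bestLen {
			best, bestLen, found = e, p.Bits(), true
		}
	}
	return best, found
}

func main() {
	entries := map[netip.Prefix]ipcacheEntry{
		// Remote node's pod CIDR, as inserted by the node manager.
		netip.MustParsePrefix("10.1.0.0/24"): {
			identity:       2,
			tunnelEndpoint: netip.MustParseAddr("192.168.0.2"),
			encryptKey:     7,
		},
	}
	if e, ok := lookup(entries, netip.MustParseAddr("10.1.0.42")); ok {
		// One lookup yields everything needed for encapsulation/encryption.
		fmt.Printf("encap to %s with key %d\n", e.tunnelEndpoint, e.encryptKey)
	}
}
```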
Force-pushed from 8a98f2c to d820645.
/test
@@ -734,6 +734,49 @@ func (m *manager) NodeUpdated(n nodeTypes.Node) {
	m.ipsetMgr.AddToIPSet(ipset.CiliumNodeIPSetV4, ipset.INetFamily, v4Addrs...)
	m.ipsetMgr.AddToIPSet(ipset.CiliumNodeIPSetV6, ipset.INet6Family, v6Addrs...)

	if !n.IsLocal() {
Why is inserting the pod CIDRs into the ipcache gated on whether the node is remote? Don't we need the pod CIDRs in the ipcache for the local node?
Ah I missed that it was answered here already: #37026 (comment)
In that case, could we add a comment to the code to clarify why this is being done?
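For illustration, a comment along the following lines might address this, layered over the hunk shown above (the rationale is my reading of the linked discussion, i.e. that only remote pod CIDRs need tunnel information in the ipcache, and should be double-checked against the actual answer):

```go
// Pod CIDR entries are inserted into the ipcache so that the datapath can
// resolve the tunnel endpoint and encryption key for traffic towards another
// node's pods. Traffic to the local node's pod CIDRs is delivered locally and
// never needs encapsulation, hence no entries are added for the local node.
if !n.IsLocal() {
	// ... upsert pod CIDR metadata into the ipcache ...
}
```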
Did a quick review to familiarize myself with the code
	}

	metadata := []ipcache.IPMetadata{
		ipcacheTypes.RequestedIdentity(identity.ReservedIdentityWorldIPv4),
I believe this needs to use `GetWorldIdentityFromIP` for non-dual-stack setups.
Well, as it turns out (see below), this is actually completely ignored, which is probably why CI was passing
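For context, a minimal standalone sketch of the family-aware selection referred to above (my own simplified model; the real `GetWorldIdentityFromIP` consults the agent's dual-stack configuration rather than taking a parameter, and the numeric values are only illustrative):

```go
package main

import (
	"fmt"
	"net/netip"
)

type NumericIdentity uint32

// Illustrative reserved identities: single-stack clusters only use the plain
// world identity, while dual-stack splits it per address family.
const (
	ReservedIdentityWorld     NumericIdentity = 2
	ReservedIdentityWorldIPv4 NumericIdentity = 9
	ReservedIdentityWorldIPv6 NumericIdentity = 10
)

// worldIdentityFor mirrors the intent of GetWorldIdentityFromIP: pick the
// per-family world identity when both families are enabled, otherwise fall
// back to the single reserved world identity. Hard-coding the IPv4 variant
// would be wrong for non-dual-stack setups and for IPv6 prefixes.
func worldIdentityFor(addr netip.Addr, dualStack bool) NumericIdentity {
	if !dualStack {
		return ReservedIdentityWorld
	}
	if addr.Is4() || addr.Is4In6() {
		return ReservedIdentityWorldIPv4
	}
	return ReservedIdentityWorldIPv6
}

func main() {
	fmt.Println(worldIdentityFor(netip.MustParseAddr("10.1.0.0"), true)) // 9
	fmt.Println(worldIdentityFor(netip.MustParseAddr("fd00::1"), false)) // 2
}
```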
I don't think `RequestedIdentity` is correct here. It's only used to suggest a numeric identity when using the identity allocator to allocate a local identity - but that's not what should happen here, since we don't want to allocate a local identity, but use a reserved one instead. We likely want to use `OverrideIdentity` here instead, i.e. force IPCache to use the `reserved:world` identity.
Not sure how this is working with `RequestedIdentity`. If we are allocating a CIDR identity for the pod CIDR (which I think is what is happening here), that might work too - but I'm not sure if this is what we want.
So it turns out that the PR still works, but probably only by accident. As I mentioned, `RequestedIdentity` is completely ignored. The way the pod CIDR still gets the correct identity is thanks to `mergeParentLabels`: it makes the pod CIDR inherit the `reserved:world` label (which doesn't have a CIDR label), and thus `AllocateLocalIdentity` will return the reserved world identity rather than attempting to allocate a new one.
cilium/pkg/ipcache/metadata.go, line 688 at 5710ac4: `if lbls.HasHostLabel() {`
cilium/pkg/identity/cache/allocator.go, line 415 at 71cebc7: `}).Debug("Resolving reserved identity")`
I'll make this a bit more explicit by using identity overwrite instead. However, I also want to test the interaction with the node-cidr feature, which I think could conflict.
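To spell out the resolution behavior described above, a standalone sketch of the assumed logic (not the actual allocator code): a prefix whose merged label set is exactly `reserved:world` resolves to the reserved world identity, whereas an additional `cidr:*` label forces allocation of a local identity.

```go
package main

import "fmt"

type Labels map[string]struct{}

const reservedWorld = "reserved:world"

// resolveIdentity mimics the "resolve a reserved identity first" step: if the
// label set consists solely of the reserved world label (no cidr:* label was
// merged in by a policy source), return the reserved world identity;
// otherwise a local CIDR identity would be allocated.
func resolveIdentity(lbls Labels) string {
	if len(lbls) == 1 {
		if _, ok := lbls[reservedWorld]; ok {
			return "reserved world identity"
		}
	}
	return "locally allocated CIDR identity"
}

func main() {
	// Pod CIDR upserted by the node manager: inherits only reserved:world.
	fmt.Println(resolveIdentity(Labels{reservedWorld: {}}))
	// Same prefix also selected by a ToCIDR rule: a cidr:* label is merged in.
	fmt.Println(resolveIdentity(Labels{reservedWorld: {}, "cidr:10.1.0.0/24": {}}))
}
```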
On closer inspection, we probably don't want to force an identity override, and instead just make sure IPCache uses the reserved world identity if no other sources add additional labels. But if there are other label sources, we likely want to merge the labels used by other sources.
It is possible to write a ToCIDR policy with a PodCIDR - for example, one could legitimately write a policy that allows traffic to endpoints (i.e. matching on the podIP/32 entries), but disallows traffic to the pod CIDR (i.e. matching on podCIDR/24 entries), which would in practice deny traffic to unknown pod IPs (which might be desirable to protect against control plane lag issues where CEP propagation takes a long time). This is a use-case we probably want to preserve.
> It is possible to write a ToCIDR policy with a PodCIDR - for example, one could legitimately write a policy that allows traffic to endpoints (i.e. matching on the podIP/32 entries), but disallows traffic to the pod CIDR (i.e. matching on podCIDR/24 entries), which would in practice deny traffic to unknown pod IPs (which might be desirable to protect against control plane lag issues where CEP propagation takes a long time). This is a use-case we probably want to preserve.
As a network-policy newbie, I have trouble imagining what this could look like 🙂. Wouldn't this require knowing which podIPs are assigned in the cluster (== need a /32 "allow" entry), and constantly updating the Network Policy to reflect this IP assignment?
> Wouldn't this require knowing which podIPs are assigned in the cluster (== need a /32 "allow" entry), and constantly updating the Network Policy to reflect this IP assignment?
You wouldn't have to watch the pod IPs, just the node's pod CIDRs. The `$podIP/32` entry would be inserted by the CEP watcher, no user interaction required. But I think you do raise a good point that the outlined scenario is a bit unrealistic - the user would specifically have to specify the pod CIDR of a particular node, rather than adding a `ToCIDR` rule that covers the pod CIDRs of all nodes.
Taking a step back, what we're doing with this PR is adding a `$podCIDR -> [reserved:world]` mapping to IPCache.

The question to ask is: what else can perform IPCache upserts for `$podCIDR`, and do we deal with those cases correctly? Here's what I came up with:

- A CNP contains a `ToCIDR: $podCIDR` entry. There is not really a good use case for it, but it is expressible today. What happens here is that the policy ingestor will do an IPCache metadata upsert for `$podCIDR -> [reserved:world,cidr:$podCIDR]`. This will cause the allocation of a local CIDR identity, but the tunnel endpoint is preserved and the new identity is compatible with the old one (both are `reserved:world`).
- A CiliumCIDRGroup contains a `$podCIDR`. This will cause a `$podCIDR -> [reserved:world,cidrgroup:$name]` metadata upsert. As above, the tunnel endpoint is preserved and the identity is compatible.

With multi-pool IPAM, there is a special case to consider, because multi-pool now allows the `$podCIDR` to be of size `/32`. Therefore, there can be additional cases where IPCache entries are upserted:

- The /32 pod "CIDR" is used as a pod IP. This means that the `$podIP/32 -> [reserved:world]` entry introduced by this PR is replaced by a `$podIP -> [k8s:foo,k8s:bar]` entry (upserted by the CEP watcher). In contrast to the cases above, we don't actually merge any information here. Instead, the CEP entry shadows the node manager entry. But since the CEP entry contains the encryption key and tunnel endpoint, no loss of information happens here. When the CEP goes away, the old entry is restored.

So all cases seem to behave correctly. Assuming we don't introduce any logic in the datapath which assumes that these `$podCIDR` entries have a particular identity, we should be fine.

In any case, I'll make sure to introduce some unit tests to assert the behavior of these three cases going forward.
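A sketch of what such a table-driven test might look like (hypothetical helper names, not the PR's actual test code; a real test would drive the IPCache metadata API directly instead of the stub below):

```go
package ipcachetest

import "testing"

// upsertAndResolve is a placeholder for driving the real IPCache metadata API:
// it would seed the node-manager entry for a pod CIDR (reserved:world label,
// tunnel endpoint, encrypt key), apply the extra metadata from a competing
// source, and report whether the resolved entry still carries the tunnel
// endpoint. Here it only documents the expected invariant.
func upsertAndResolve(t *testing.T, extraLabels []string) (tunnelPreserved bool) {
	t.Helper()
	_ = extraLabels // a real test would upsert these via a second resource
	return true
}

func TestPodCIDRMetadataMerge(t *testing.T) {
	cases := []struct {
		name        string
		extraLabels []string // labels added by the competing source
	}{
		{"CNP ToCIDR selects the pod CIDR", []string{"cidr:10.1.0.0/24"}},
		{"CiliumCIDRGroup contains the pod CIDR", []string{"cidrgroup:my-group"}},
		{"multi-pool /32 pod CIDR shadowed by a CEP entry", []string{"k8s:foo", "k8s:bar"}},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			if !upsertAndResolve(t, tc.extraLabels) {
				t.Fatal("tunnel endpoint was lost after the metadata merge")
			}
		})
	}
}
```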
> Assuming we don't introduce any logic in the datapath which assumes that these $podCIDR entries have a particular identity, we should be fine.
So far the thinking was "start with `WORLD_ID`, march on, and during the v1.18 development window see whether we find any users in the datapath that want something `$better`".

This could be either:

- actual Network Policy reasons, with new identities for (node-specific?) PodCIDRs
- feature-specific reasons ("I matched a PodCIDR, let me do something unique")
> So far the thinking was "start with `WORLD_ID`, march on, and during the v1.18 development window see whether we find any users in the datapath that want something `$better`".
Yeah.. in general, I guess the main problem is that a pod CIDR can not only be a pod CIDR belonging to a remote node, but also a CIDR either selected by `ToCIDR` or by a CiliumCIDRGroup.

Case 1 ("new identities for PodCIDRs that can be matched in CNPs") will still just work, I think. Similar to e.g. `reserved:kube-apiserver`, which can also either be the reserved identity 7 or a regular CIDR identity. No special treatment is needed here in the datapath. This concept would purely be implemented in the user-space policy engine, which already handles the merging of labels just fine.

Case 2 ("I matched a PodCIDR, let me do something unique") however is harder. It would require us to use a special numeric range for such identities, similar to how we deal with the "I matched a remote node" case if an identity is both a remote node and a CIDR identity (c.f. `identity_is_remote_node`). The downside here is that there are only so many bits in the numeric identity space, so we can pull this stunt only a limited number of times.
> Case 2 ("I matched a PodCIDR, let me do something unique") however is harder. It would require us to use a special numeric range for such identities, similar to how we deal with the "I matched a remote node" case if an identity is both a remote node and a CIDR identity (c.f. `identity_is_remote_node`). The downside here is that there are only so many bits in the numeric identity space, so we can pull this stunt only a limited number of times.
ack, I'm not sure it's the best fit either. I could easily see a flag in the ipcache entry ("this is a CIDR entry") being good enough (tm) for a while.
		metadata = append(metadata, ipcacheTypes.EncryptKey(oldNode.EncryptionKey))
	}

	m.ipcache.RemoveMetadata(prefix, resource, metadata...)
Found another bug - this unconditionally removes pod CIDRs whenever a node is updated. We need to diff here like we do for the other IPCache updates in this function (c.f. `ingressIPsAdded`).
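A standalone sketch of the kind of diffing being asked for (my own illustration; the real fix would operate on the old and new node's allocation CIDRs and call `RemoveMetadata` only for prefixes that actually disappeared):

```go
package main

import (
	"fmt"
	"net/netip"
)

// removedPrefixes returns the prefixes present in oldPrefixes but not in
// newPrefixes; only these should have their ipcache metadata removed on a
// node update, so an unchanged pod CIDR is left alone.
func removedPrefixes(oldPrefixes, newPrefixes []netip.Prefix) []netip.Prefix {
	still := make(map[netip.Prefix]struct{}, len(newPrefixes))
	for _, p := range newPrefixes {
		still[p] = struct{}{}
	}
	var removed []netip.Prefix
	for _, p := range oldPrefixes {
		if _, ok := still[p]; !ok {
			removed = append(removed, p)
		}
	}
	return removed
}

func main() {
	oldP := []netip.Prefix{
		netip.MustParsePrefix("10.1.0.0/24"),
		netip.MustParsePrefix("10.2.0.0/24"),
	}
	newP := []netip.Prefix{netip.MustParsePrefix("10.1.0.0/24")}
	// Only 10.2.0.0/24 should have its metadata removed.
	fmt.Println(removedPrefixes(oldP, newP))
}
```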
Superseded by #38483
-- WORK IN PROGRESS, NOT READY FOR REVIEW --