Conversation

pippolo84 (Member)

-- WORK IN PROGRESS, NOT READY FOR REVIEW --

@pippolo84 pippolo84 added the dont-merge/preview-only Only for preview or testing, don't merge it. label Jan 21, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jan 21, 2025
@pippolo84

This comment was marked as outdated.

@pippolo84 pippolo84 force-pushed the pr/pippolo84/use-ipcache-map-for-tunneling branch from 78f57ff to 5b495e3 on January 21, 2025 18:26
@pippolo84 (Member, Author)

/test

@pippolo84 pippolo84 force-pushed the pr/pippolo84/use-ipcache-map-for-tunneling branch from 5b495e3 to 8a98f2c on January 29, 2025 16:11
Store the allocation CIDRs from each remote node in the ipcache, using the
world identity. This lays the groundwork for using the ipcache for
tunneling instead of relying on the separate tunnel map.

Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
The ipcache map now stores entries describing remote node pod CIDRs,
including the tunnel endpoint and encryption key.

Therefore, when a packet must be encapsulated or encrypted, the datapath
can reuse the information returned by the ipcache lookup, avoiding an
additional lookup in the tunnel map.

Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
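A minimal sketch of what these two commits amount to, assuming the metadata-based ipcache API already visible in the diff excerpts below; the helper name and the TunnelPeer metadata type are assumptions here, not code taken from this PR:

```go
package example

import (
	"net/netip"

	"github.com/cilium/cilium/pkg/identity"
	"github.com/cilium/cilium/pkg/ipcache"
	ipcacheTypes "github.com/cilium/cilium/pkg/ipcache/types"
)

// podCIDRMetadata is a hypothetical helper: it builds, for one allocation CIDR
// of a remote node, the ipcache metadata that replaces the tunnel map entry -
// the tunnel endpoint, the encryption key, and the reserved world identity.
// The resulting slice would be handed to the ipcache metadata upsert API, the
// counterpart of the RemoveMetadata(prefix, resource, metadata...) call quoted
// later in this conversation. TunnelPeer usage is an assumption; only
// RequestedIdentity and EncryptKey appear in the diff itself.
func podCIDRMetadata(tunnelEndpoint netip.Addr, encryptKey uint8) []ipcache.IPMetadata {
	return []ipcache.IPMetadata{
		ipcacheTypes.TunnelPeer{Addr: tunnelEndpoint}, // encapsulation target
		ipcacheTypes.EncryptKey(encryptKey),           // IPsec key ID
		ipcacheTypes.RequestedIdentity(identity.ReservedIdentityWorldIPv4),
	}
}
```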
Given that the datapath now relies on the ipcache map for tunneling, the
insertions and deletions of allocation CIDRs in the deprecated tunnel map
can be removed.

Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
When handling the deletion of a node, we should take into account all of
the node's allocation CIDRs, not just the primary one.

Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
As a consequence of using the ipcache map to support tunneling, multi-pool
IPAM is now compatible with it, so lift the startup check that stops the
agent when both features are enabled.

Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
@pippolo84 pippolo84 force-pushed the pr/pippolo84/use-ipcache-map-for-tunneling branch from 8a98f2c to d820645 on January 29, 2025 16:22
@pippolo84 (Member, Author)

/test

@@ -734,6 +734,49 @@ func (m *manager) NodeUpdated(n nodeTypes.Node) {
m.ipsetMgr.AddToIPSet(ipset.CiliumNodeIPSetV4, ipset.INetFamily, v4Addrs...)
m.ipsetMgr.AddToIPSet(ipset.CiliumNodeIPSetV6, ipset.INet6Family, v6Addrs...)

if !n.IsLocal() {
Member

Why is inserting pod CIDRs into the ipcache gated on whether the node is remote? Don't we need the pod CIDRs in the ipcache for the local node as well?

Member
Ah I missed that it was answered here already: #37026 (comment)

In that case, could we add a comment to the code to clarify why this is being done?

@gandro gandro left a comment (Member)

Did a quick review to familiarize myself with the code

}

metadata := []ipcache.IPMetadata{
ipcacheTypes.RequestedIdentity(identity.ReservedIdentityWorldIPv4),
Member

I believe this needs to use GetWorldIdentityFromIP for non-dual-stack setups

Member

Well, as it turns out (see below), this is actually completely ignored, which is probably why CI was passing
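For reference, a hedged sketch of the earlier suggestion; the exact GetWorldIdentityFromIP signature and behavior are assumptions drawn from its name and the comment above:

```go
package example

import (
	"net/netip"

	"github.com/cilium/cilium/pkg/identity"
	"github.com/cilium/cilium/pkg/ipcache"
	ipcacheTypes "github.com/cilium/cilium/pkg/ipcache/types"
)

// worldIdentityMetadata sketches the suggestion above: request the world
// identity appropriate for the prefix (plain reserved:world in non-dual-stack
// setups, the per-family world identity in dual-stack) rather than hard-coding
// the IPv4 one.
func worldIdentityMetadata(prefix netip.Prefix) []ipcache.IPMetadata {
	return []ipcache.IPMetadata{
		// Assumed signature: GetWorldIdentityFromIP(netip.Addr) NumericIdentity.
		ipcacheTypes.RequestedIdentity(identity.GetWorldIdentityFromIP(prefix.Addr())),
	}
}
```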

}

metadata := []ipcache.IPMetadata{
ipcacheTypes.RequestedIdentity(identity.ReservedIdentityWorldIPv4),
@gandro gandro (Member), Feb 27, 2025

I don't think RequestedIdentity is correct here. It's only used to suggest a numeric identity when the identity allocator allocates a local identity - but that's not what should happen here, since we don't want to allocate a local identity, but use a reserved one instead. We likely want to use OverrideIdentity here instead, i.e. force IPCache to use the reserved:world identity.

Not sure how this is working with RequestedIdentity. If we are allocating a CIDR identity for the pod CIDR (which I think is what is happening here), that might work too - but I'm not sure if this is what we want.
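For context, a hedged illustration of the two metadata shapes being contrasted; the exact form and semantics of OverrideIdentity are assumed here, not confirmed by this thread:

```go
package example

import (
	"github.com/cilium/cilium/pkg/identity"
	"github.com/cilium/cilium/pkg/ipcache"
	ipcacheTypes "github.com/cilium/cilium/pkg/ipcache/types"
	"github.com/cilium/cilium/pkg/labels"
)

// metadataVariants contrasts the two approaches. RequestedIdentity is only a
// hint consulted when a local identity is allocated for the prefix; the
// assumed OverrideIdentity usage pins the prefix to the identity derived from
// the supplied labels (here reserved:world), bypassing local allocation.
func metadataVariants() (suggested, forced []ipcache.IPMetadata) {
	suggested = []ipcache.IPMetadata{
		ipcacheTypes.RequestedIdentity(identity.ReservedIdentityWorldIPv4),
	}
	forced = []ipcache.IPMetadata{
		labels.LabelWorld,                   // labels the forced identity derives from
		ipcacheTypes.OverrideIdentity(true), // assumed: force identity from labels
	}
	return suggested, forced
}
```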

@gandro gandro (Member), Feb 27, 2025

So it turns out the PR still works, but probably only by accident. As I mentioned, RequestedIdentity is completely ignored. The way the pod CIDR still gets the correct identity is thanks to mergeParentLabels. It makes the pod CIDR inherit the reserved:world label (which doesn't have a CIDR label), and thus AllocateLocalIdentity will return the reserved world identity rather than attempting to allocate a new one.

if lbls.HasHostLabel() {

}).Debug("Resolving reserved identity")

I'll make this a bit more explicit by using the identity override instead. However, I also want to test the interaction with the node-cidr feature, which I think could conflict.
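A conceptual sketch of the resolution path described above; the helper here is hypothetical, and the real logic lives in the identity allocator's AllocateLocalIdentity path:

```go
package example

import (
	"github.com/cilium/cilium/pkg/identity"
	"github.com/cilium/cilium/pkg/labels"
)

// resolvePodCIDRIdentity is a hypothetical helper: if, after parent-label
// merging, the pod CIDR carries exactly a reserved label set such as
// reserved:world (and no cidr: label), the reserved identity is reused
// instead of allocating a new local CIDR identity.
func resolvePodCIDRIdentity(lbls labels.Labels) identity.NumericIdentity {
	if reserved := identity.LookupReservedIdentityByLabels(lbls); reserved != nil {
		return reserved.ID // e.g. reserved:world resolves to identity 2
	}
	// Otherwise a local (CIDR-scoped) identity would be allocated.
	return identity.IdentityUnknown
}
```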

@gandro gandro (Member), Mar 3, 2025

On closer inspection, we probably don't want to force an identity override, and instead just make sure IPCache uses the reserved world identity if no other sources add additional labels. If there are other label sources, we likely want to merge their labels.

It is possible to write a ToCIDR policy with a PodCIDR - for example, one could legitimately write a policy that allows traffic to endpoints (i.e. matching on the podIP/32 entries), but disallows traffic to the pod CIDR (i.e. matching on podCIDR/24 entries), which would in practice deny traffic to unknown pod IPs (which might be desirable to protect against control plane lag issues where CEP propagation takes a long time). This is a use-case we probably want to preserve.

Member

It is possible to write a ToCIDR policy with a PodCIDR - for example, one could legitimately write a policy that allows traffic to endpoints (i.e. matching on the podIP/32 entries), but disallows traffic to the pod CIDR (i.e. matching on podCIDR/24 entries), which would in practice deny traffic to unknown pod IPs (which might be desirable to protect against control plane lag issues where CEP propagation takes a long time). This is a use-case we probably want to preserve.

As a network-policy newbie, I have trouble imagining what this could look like 🙂. Wouldn't this require knowing which podIPs are assigned in the cluster (== need a /32 "allow" entry), and constantly updating the Network Policy to reflect this IP assignment?

Member

Wouldn't this require knowing which podIPs are assigned in the cluster (== need a /32 "allow" entry), and constantly updating the Network Policy to reflect this IP assignment?

You wouldn't have to watch the pod IPs, just the node's pod CIDRs. The $podIP/32 entry would be inserted by the CEP watcher, no user interaction required. But I think you do raise a good point that the outlined scenario is a bit unrealistic - the user would have to spell out the pod CIDR of one specific node, rather than adding a ToCIDR rule that covers the pod CIDRs of all nodes.

@gandro gandro (Member), Mar 5, 2025

Taking a step back, what we're doing with this PR is adding a $podCIDR -> [reserved:world] mapping to IPCache.

The question to ask is: what else can perform IPCache upserts for $podCIDR, and do we deal with those cases correctly? Here's what I came up with:

  1. A CNP contains a ToCIDR: $podCIDR entry. There is not really a good use case for it, but it is expressible today. What happens here is that the policy ingestor will do an IPCache metadata upsert for $podCIDR -> [reserved:world,cidr:$podCIDR]. This will cause the allocation of a local CIDR identity, but the tunnel endpoint is preserved and the new identity is compatible with the old one (both carry reserved:world).
  2. A CiliumCIDRGroup contains a $podCIDR. This will cause a $podCIDR -> [reserved:world,cidrgroup:$name] metadata upsert. As above, the tunnel endpoint is preserved and the identity is compatible.

With multi-pool IPAM, there is a special case to consider, because multi-pool now allows the $podCIDR to be of size /32. Therefore, there can be additional cases where IPCache entries are upserted:

  1. The /32 pod "CIDR" is used as a pod IP. This means that the $podIP/32 -> [reserved:world] entry introduced by this PR is replaced by a $podIP -> [k8s:foo,k8s:bar] entry (upserted by the CEP watcher). In contrast to the cases above, we don't actually merge any information here. Instead, the CEP entry shadows the node manager entry. But since the CEP entry contains the tunnel endpoint and encryption key, no information is lost. When the CEP goes away, the old entry is restored.

So all cases seem to behave correctly. Assuming we don't introduce any logic in the datapath which assumes that these $podCIDR entries have a particular identity, we should be fine.

In any case, I'll make sure to introduce some unit tests to assert the behavior of these three cases going forward.
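A self-contained conceptual model of the merge behavior these cases rely on (plain Go, not Cilium code): labels from different sources accumulate on the same prefix, while a previously set tunnel endpoint and encryption key survive upserts that don't touch them:

```go
package main

import "fmt"

// entry is a stand-in for the metadata IPCache tracks per prefix.
type entry struct {
	labels     map[string]bool
	tunnelPeer string
	encryptKey uint8
}

// upsert merges labels and keeps previously set tunnel/encryption information
// when the new source does not provide any.
func (e *entry) upsert(lbls []string, tunnelPeer string, encryptKey uint8) {
	if e.labels == nil {
		e.labels = map[string]bool{}
	}
	for _, l := range lbls {
		e.labels[l] = true
	}
	if tunnelPeer != "" {
		e.tunnelPeer = tunnelPeer
	}
	if encryptKey != 0 {
		e.encryptKey = encryptKey
	}
}

func main() {
	var e entry
	// Node manager: $podCIDR -> reserved:world, plus tunnel endpoint and key.
	e.upsert([]string{"reserved:world"}, "192.0.2.10", 3)
	// Policy ingestor (ToCIDR) or CiliumCIDRGroup: adds labels only.
	e.upsert([]string{"cidr:10.0.1.0/24"}, "", 0)
	// Labels are merged; tunnel endpoint and encryption key are preserved.
	fmt.Println(e.labels, e.tunnelPeer, e.encryptKey)
}
```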

Member

Assuming we don't introduce any logic in the datapath which assumes that these $podCIDR entries have a particular identity, we should be fine.

So far the thinking was "start with WORLD_ID, march on, and during v1.18 development window see whether we find any users in the datapath that want something $better".

This could be either

  1. actual Network Policy reasons, with new identities for (node-specific?) PodCIDRs
  2. feature-specific reasons ("I matched a PodCIDR, let me do something unique")

@gandro gandro (Member), Mar 5, 2025

So far the thinking was "start with WORLD_ID, march on, and during v1.18 development window see whether we find any users in the datapath that want something $better".

Yeah... in general, I guess the main problem is that a pod CIDR can be not only a pod CIDR belonging to a remote node, but also a CIDR selected by either a ToCIDR rule or a CiliumCIDRGroup.

Case 1 ("new identities for PodCIDRs that can be matched in CNPs") will still just work I think. Similar to e.g. reserved:kube-apiserver, which can also either be the reserved identity 7 or a regular CIDR identity. No special treatment is needed here in the datapath. This concept would purely be implemented in the user-space policy engine, which already handles the case merging of labels just fine.

Case 2 ("I matched a PodCIDR, let me do something unique") however is harder. It would require us to use a special numeric range for such identities, similar to how we deal with the "I matched a remote node" case if an identity is both a remote node and a CIDR identity (c.f. identity_is_remote_node). The downside here is that there are only so many bits in the numeric identity space, so we can pull this stunt only a limited amount of time.

Member

Case 2 ("I matched a PodCIDR, let me do something unique") however is harder. It would require us to use a special numeric range for such identities, similar to how we deal with the "I matched a remote node" case if an identity is both a remote node and a CIDR identity (c.f. identity_is_remote_node). The downside here is that there are only so many bits in the numeric identity space, so we can pull this stunt only a limited amount of time.

ack, I'm not sure it's the best fit either. I could easily see a flag in the ipcache entry ("this is a CIDR entry") being good enough (tm) for a while.

metadata = append(metadata, ipcacheTypes.EncryptKey(oldNode.EncryptionKey))
}

m.ipcache.RemoveMetadata(prefix, resource, metadata...)
Member

Found another bug - this unconditionally removes pod CIDRs whenever a node is updated. We need to diff here like we do for the other IPCache updates in this function (c.f. ingressIPsAdded).
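A minimal sketch of the kind of diffing meant here, in the spirit of ingressIPsAdded; the helper name and the use of netip.Prefix slices are illustrative assumptions:

```go
package example

import "net/netip"

// diffPodCIDRs computes which pod CIDRs were added and which were removed
// between the old and the updated node, so that metadata is removed only for
// prefixes that are actually gone and upserted only for new ones.
func diffPodCIDRs(oldCIDRs, newCIDRs []netip.Prefix) (added, removed []netip.Prefix) {
	oldSet := make(map[netip.Prefix]struct{}, len(oldCIDRs))
	for _, p := range oldCIDRs {
		oldSet[p] = struct{}{}
	}
	newSet := make(map[netip.Prefix]struct{}, len(newCIDRs))
	for _, p := range newCIDRs {
		newSet[p] = struct{}{}
		if _, ok := oldSet[p]; !ok {
			added = append(added, p)
		}
	}
	for _, p := range oldCIDRs {
		if _, ok := newSet[p]; !ok {
			removed = append(removed, p)
		}
	}
	return added, removed
}
```

NodeUpdated would then call RemoveMetadata only for the removed prefixes and upsert metadata only for the added ones.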

@pippolo84 (Member, Author)

Superseded by #38483

@pippolo84 pippolo84 closed this Mar 25, 2025