Use ipcache for tunneling #37146
Conversation
Force-pushed from 78f57ff to 5b495e3.
/test
Force-pushed from 5b495e3 to 8a98f2c.
Store allocation CIDRs from each remote node into the ipcache, using the world identity. This lays the groundwork for using the ipcache for tunneling instead of relying on the additional tunnel map. Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
The ipcache map now stores entries describing remote node pod CIDRs, including the tunnel endpoint and encryption key. Therefore, when a packet must be encapsulated or encrypted, we can leverage the information from the ipcache lookup, avoiding the additional lookup into the tunnel map. Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
Given that the datapath now relies on the ipcache map for tunneling, it is possible to remove the insertions and deletions of allocation CIDRs in the deprecated tunnel map. Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
When handling the deletion of a node, we should take into account all of the node's allocation CIDRs, not just the primary one. Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
As a consequence of using the ipcache map to support tunneling, multi-pool IPAM is now compatible with it, so we can lift the startup check that stops the agent when both features are enabled. Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
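To make the intent of these commits concrete, here is a minimal standalone sketch of the idea (my own simplified Go model, not Cilium's actual datapath, which implements this in BPF C): once the ipcache entry for a remote node's pod CIDR carries the tunnel endpoint and encryption key alongside the identity, a single longest-prefix-match lookup answers both "what identity is this destination" and "where do I encapsulate to", so the separate tunnel map lookup becomes unnecessary.

```go
package main

import (
	"fmt"
	"net/netip"
)

// ipcacheEntry models only the fields relevant to this PR.
type ipcacheEntry struct {
	identity       uint32     // e.g. the reserved world identity
	tunnelEndpoint netip.Addr // remote node IP to encapsulate towards
	encryptKey     uint8      // encryption key index, 0 if encryption is off
}

// lookup does a naive longest-prefix match over pod-CIDR entries.
func lookup(entries map[netip.Prefix]ipcacheEntry, ip netip.Addr) (ipcacheEntry, bool) {
	var (
		best    ipcacheEntry
		bestLen = -1
		found   bool
	)
	for p, e := range entries {
		if p.Contains(ip) && p.Bits() > bestLen {
			best, bestLen, found = e, p.Bits(), true
		}
	}
	return best, found
}

func main() {
	entries := map[netip.Prefix]ipcacheEntry{
		// Remote node's pod CIDR, as inserted by the node manager.
		netip.MustParsePrefix("10.1.0.0/24"): {
			identity:       2,
			tunnelEndpoint: netip.MustParseAddr("192.168.0.2"),
			encryptKey:     7,
		},
	}
	if e, ok := lookup(entries, netip.MustParseAddr("10.1.0.42")); ok {
		// One lookup yields everything needed for encapsulation/encryption.
		fmt.Printf("encap to %s with key %d\n", e.tunnelEndpoint, e.encryptKey)
	}
}
```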
Force-pushed from 8a98f2c to d820645.
/test
@@ -734,6 +734,49 @@ func (m *manager) NodeUpdated(n nodeTypes.Node) {
	m.ipsetMgr.AddToIPSet(ipset.CiliumNodeIPSetV4, ipset.INetFamily, v4Addrs...)
	m.ipsetMgr.AddToIPSet(ipset.CiliumNodeIPSetV6, ipset.INet6Family, v6Addrs...)

	if !n.IsLocal() {
Why is inserting the pod CIDRs into the ipcache gated on whether the node is remote? Don't we need the pod CIDRs in the ipcache for the local node?
Ah I missed that it was answered here already: #37026 (comment)
In that case, could we add a comment to the code to clarify why this is being done?
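For illustration, a comment along the following lines might address this, layered over the hunk shown above (the rationale is my reading of the linked discussion, i.e. that only remote pod CIDRs need tunnel information in the ipcache, and should be double-checked against the actual answer):

```go
// Pod CIDR entries are inserted into the ipcache so that the datapath can
// resolve the tunnel endpoint and encryption key for traffic towards another
// node's pods. Traffic to the local node's pod CIDRs is delivered locally and
// never needs encapsulation, hence no entries are added for the local node.
if !n.IsLocal() {
	// ... upsert pod CIDR metadata into the ipcache ...
}
```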
Did a quick review to familiarize myself with the code
	}

	metadata := []ipcache.IPMetadata{
		ipcacheTypes.RequestedIdentity(identity.ReservedIdentityWorldIPv4),
I believe this needs to use `GetWorldIdentityFromIP` for non-dual-stack setups.
Well, as it turns out (see below), this is actually completely ignored, which is probably why CI was passing
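For context, a minimal standalone sketch of the family-aware selection referred to above (my own simplified model; the real `GetWorldIdentityFromIP` consults the agent's dual-stack configuration rather than taking a parameter, and the numeric values are only illustrative):

```go
package main

import (
	"fmt"
	"net/netip"
)

type NumericIdentity uint32

// Illustrative reserved identities: single-stack clusters only use the plain
// world identity, while dual-stack splits it per address family.
const (
	ReservedIdentityWorld     NumericIdentity = 2
	ReservedIdentityWorldIPv4 NumericIdentity = 9
	ReservedIdentityWorldIPv6 NumericIdentity = 10
)

// worldIdentityFor mirrors the intent of GetWorldIdentityFromIP: pick the
// per-family world identity when both families are enabled, otherwise fall
// back to the single reserved world identity. Hard-coding the IPv4 variant
// would be wrong for non-dual-stack setups and for IPv6 prefixes.
func worldIdentityFor(addr netip.Addr, dualStack bool) NumericIdentity {
	if !dualStack {
		return ReservedIdentityWorld
	}
	if addr.Is4() || addr.Is4In6() {
		return ReservedIdentityWorldIPv4
	}
	return ReservedIdentityWorldIPv6
}

func main() {
	fmt.Println(worldIdentityFor(netip.MustParseAddr("10.1.0.0"), true)) // 9
	fmt.Println(worldIdentityFor(netip.MustParseAddr("fd00::1"), false)) // 2
}
```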
I don't think `RequestedIdentity` is correct here. It's only used to suggest a numeric identity when using the identity allocator to allocate a local identity - but that's not what should happen here, since we don't want to allocate a local identity, but use a reserved one instead. We likely want to use `OverrideIdentity` here instead, i.e. force IPCache to use the `reserved:world` identity.
Not sure how this is working with `RequestedIdentity`. If we are allocating a CIDR identity for the pod CIDR (which I think is what is happening here), that might work too - but I'm not sure if this is what we want.
So it turns out that the PR still works, but probably only by accident. As I mentioned, `RequestedIdentity` is completely ignored. The way the pod CIDR still gets the correct identity is thanks to `mergeParentLabels`: it makes the pod CIDR inherit the `reserved:world` label (which doesn't have a CIDR label), and thus `AllocateLocalIdentity` will return the reserved world identity rather than attempting to allocate a new one.
cilium/pkg/ipcache/metadata.go, line 688 at 5710ac4: `if lbls.HasHostLabel() {`
cilium/pkg/identity/cache/allocator.go, line 415 at 71cebc7: `}).Debug("Resolving reserved identity")`
I'll make this a bit more explicit by using identity overwrite instead. However, I also want to test the interaction with the node-cidr feature, which I think could conflict.
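To spell out the resolution behavior described above, a standalone sketch of the assumed logic (not the actual allocator code): a prefix whose merged label set is exactly `reserved:world` resolves to the reserved world identity, whereas an additional `cidr:*` label forces allocation of a local identity.

```go
package main

import "fmt"

type Labels map[string]struct{}

const reservedWorld = "reserved:world"

// resolveIdentity mimics the "resolve a reserved identity first" step: if the
// label set consists solely of the reserved world label (no cidr:* label was
// merged in by a policy source), return the reserved world identity;
// otherwise a local CIDR identity would be allocated.
func resolveIdentity(lbls Labels) string {
	if len(lbls) == 1 {
		if _, ok := lbls[reservedWorld]; ok {
			return "reserved world identity"
		}
	}
	return "locally allocated CIDR identity"
}

func main() {
	// Pod CIDR upserted by the node manager: inherits only reserved:world.
	fmt.Println(resolveIdentity(Labels{reservedWorld: {}}))
	// Same prefix also selected by a ToCIDR rule: a cidr:* label is merged in.
	fmt.Println(resolveIdentity(Labels{reservedWorld: {}, "cidr:10.1.0.0/24": {}}))
}
```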
On closer inspection, we probably don't want to force an identity override, and instead just make sure IPCache uses the reserved world identity if no other sources add additional labels. But if there are other label sources, we likely want to merge the labels used by other sources.
It is possible to write a ToCIDR policy with a PodCIDR - for example, one could legitimately write a policy that allows traffic to endpoints (i.e. matching on the podIP/32 entries), but disallows traffic to the pod CIDR (i.e. matching on podCIDR/24 entries), which would in practice deny traffic to unknown pod IPs (which might be desirable to protect against control plane lag issues where CEP propagation takes a long time). This is a use-case we probably want to preserve.
> It is possible to write a ToCIDR policy with a PodCIDR - for example, one could legitimately write a policy that allows traffic to endpoints (i.e. matching on the podIP/32 entries), but disallows traffic to the pod CIDR (i.e. matching on podCIDR/24 entries), which would in practice deny traffic to unknown pod IPs (which might be desirable to protect against control plane lag issues where CEP propagation takes a long time). This is a use-case we probably want to preserve.
As a network-policy newbie, I have trouble imagining what this could look like 🙂. Wouldn't this require knowing which podIPs are assigned in the cluster (== need a /32 "allow" entry), and constantly updating the Network Policy to reflect this IP assignment?
> Wouldn't this require knowing which podIPs are assigned in the cluster (== need a /32 "allow" entry), and constantly updating the Network Policy to reflect this IP assignment?
You wouldn't have to watch the pod IPs, just the node's pod CIDRs. The `$podIP/32` entry would be inserted by the CEP watcher, no user interaction required. But I think you do raise a good point that the outlined scenario is a bit unrealistic - the user would specifically have to specify the pod CIDR of a particular node, rather than adding a `ToCIDR` rule that covers the pod CIDRs of all nodes.
Taking a step back, what we're doing with this PR is adding a `$podCIDR -> [reserved:world]` mapping to IPCache.

The question to ask is: what else can perform IPCache upserts for `$podCIDR`, and do we deal with those cases correctly? Here's what I came up with:

- A CNP contains a `ToCIDR: $podCIDR` entry. There is not really a good use case for it, but it is expressible today. What happens here is that the policy ingestor will do an IPCache metadata upsert for `$podCIDR -> [reserved:world,cidr:$podCIDR]`. This will cause the allocation of a local CIDR identity, but the tunnel endpoint is preserved and the new identity is compatible with the old one (both are `reserved:world`).
- A CiliumCIDRGroup contains a `$podCIDR`. This will cause a `$podCIDR -> [reserved:world,cidrgroup:$name]` metadata upsert. As above, the tunnel endpoint is preserved and the identity is compatible.

With multi-pool IPAM, there is a special case to consider, because multi-pool now allows the `$podCIDR` to be of size `/32`. Therefore, there can be additional cases where IPCache entries are upserted:

- The /32 pod "CIDR" is used as a pod IP. This means that the `$podIP/32 -> [reserved:world]` entry introduced by this PR is replaced by a `$podIP -> [k8s:foo,k8s:bar]` entry (upserted by the CEP watcher). In contrast to the cases above, we don't actually merge any information here. Instead, the CEP entry shadows the node manager entry. But since the CEP entry contains the encryption key and tunnel endpoint, no loss of information happens here. When the CEP goes away, the old entry is restored.

So all cases seem to behave correctly. Assuming we don't introduce any logic in the datapath which assumes that these `$podCIDR` entries have a particular identity, we should be fine.

In any case, I'll make sure to introduce some unit tests to assert the behavior of these three cases going forward.
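A sketch of what such a table-driven test might look like (hypothetical helper names, not the PR's actual test code; a real test would drive the IPCache metadata API directly instead of the stub below):

```go
package ipcachetest

import "testing"

// upsertAndResolve is a placeholder for driving the real IPCache metadata API:
// it would seed the node-manager entry for a pod CIDR (reserved:world label,
// tunnel endpoint, encrypt key), apply the extra metadata from a competing
// source, and report whether the resolved entry still carries the tunnel
// endpoint. Here it only documents the expected invariant.
func upsertAndResolve(t *testing.T, extraLabels []string) (tunnelPreserved bool) {
	t.Helper()
	_ = extraLabels // a real test would upsert these via a second resource
	return true
}

func TestPodCIDRMetadataMerge(t *testing.T) {
	cases := []struct {
		name        string
		extraLabels []string // labels added by the competing source
	}{
		{"CNP ToCIDR selects the pod CIDR", []string{"cidr:10.1.0.0/24"}},
		{"CiliumCIDRGroup contains the pod CIDR", []string{"cidrgroup:my-group"}},
		{"multi-pool /32 pod CIDR shadowed by a CEP entry", []string{"k8s:foo", "k8s:bar"}},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			if !upsertAndResolve(t, tc.extraLabels) {
				t.Fatal("tunnel endpoint was lost after the metadata merge")
			}
		})
	}
}
```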
> Assuming we don't introduce any logic in the datapath which assumes that these $podCIDR entries have a particular identity, we should be fine.
So far the thinking was "start with `WORLD_ID`, march on, and during the v1.18 development window see whether we find any users in the datapath that want something `$better`".

This could be either:

- actual Network Policy reasons, with new identities for (node-specific?) PodCIDRs
- feature-specific reasons ("I matched a PodCIDR, let me do something unique")
> So far the thinking was "start with `WORLD_ID`, march on, and during the v1.18 development window see whether we find any users in the datapath that want something `$better`".
Yeah.. in general, I guess the main problem is that a pod CIDR can not only be a pod CIDR belonging to a remote node, but also a CIDR either selected by `ToCIDR` or by a CiliumCIDRGroup.

Case 1 ("new identities for PodCIDRs that can be matched in CNPs") will still just work, I think. Similar to e.g. `reserved:kube-apiserver`, which can also either be the reserved identity 7 or a regular CIDR identity. No special treatment is needed here in the datapath. This concept would purely be implemented in the user-space policy engine, which already handles the merging of labels just fine.

Case 2 ("I matched a PodCIDR, let me do something unique") however is harder. It would require us to use a special numeric range for such identities, similar to how we deal with the "I matched a remote node" case if an identity is both a remote node and a CIDR identity (c.f. `identity_is_remote_node`). The downside here is that there are only so many bits in the numeric identity space, so we can pull this stunt only a limited number of times.
> Case 2 ("I matched a PodCIDR, let me do something unique") however is harder. It would require us to use a special numeric range for such identities, similar to how we deal with the "I matched a remote node" case if an identity is both a remote node and a CIDR identity (c.f. `identity_is_remote_node`). The downside here is that there are only so many bits in the numeric identity space, so we can pull this stunt only a limited number of times.
ack, I'm not sure it's the best fit either. I could easily see a flag in the ipcache entry ("this is a CIDR entry") being good enough (tm) for a while.
		metadata = append(metadata, ipcacheTypes.EncryptKey(oldNode.EncryptionKey))
	}

	m.ipcache.RemoveMetadata(prefix, resource, metadata...)
Found another bug - this unconditionally removes pod CIDRs whenever a node is updated. We need to diff here like we do for the other IPCache updates in this function (c.f. `ingressIPsAdded`).
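A standalone sketch of the kind of diffing being asked for (my own illustration; the real fix would operate on the old and new node's allocation CIDRs and call `RemoveMetadata` only for prefixes that actually disappeared):

```go
package main

import (
	"fmt"
	"net/netip"
)

// removedPrefixes returns the prefixes present in oldPrefixes but not in
// newPrefixes; only these should have their ipcache metadata removed on a
// node update, so an unchanged pod CIDR is left alone.
func removedPrefixes(oldPrefixes, newPrefixes []netip.Prefix) []netip.Prefix {
	still := make(map[netip.Prefix]struct{}, len(newPrefixes))
	for _, p := range newPrefixes {
		still[p] = struct{}{}
	}
	var removed []netip.Prefix
	for _, p := range oldPrefixes {
		if _, ok := still[p]; !ok {
			removed = append(removed, p)
		}
	}
	return removed
}

func main() {
	oldP := []netip.Prefix{
		netip.MustParsePrefix("10.1.0.0/24"),
		netip.MustParsePrefix("10.2.0.0/24"),
	}
	newP := []netip.Prefix{netip.MustParsePrefix("10.1.0.0/24")}
	// Only 10.2.0.0/24 should have its metadata removed.
	fmt.Println(removedPrefixes(oldP, newP))
}
```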
Superseded by #38483
-- WORK IN PROGRESS, NOT READY FOR REVIEW --