-
Notifications
You must be signed in to change notification settings - Fork 3.4k
Optimize scalability of CiliumIdentity operation #11275
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…dpoint Rework the CiliumIdentity garbage collector to derive liveness of a CiliumIdentity based on the identity usage of CiliumEndpoint information. As the CiliumEndpoint garbage collector scans endpoints, it will maintain a list of identities with heartbeat signals. As the CiliumIdentity garbage collector runs, it will use the heartbeat signals before removing stale identities. In order to make the gabrage collector with the heartbeat timeout requirements work, the default scan intervals are reduced slightly. Signed-off-by: Thomas Graf <thomas@cilium.io>
With the introduction of the CiliumEndpoint based garbage collection of CiliumIdentities. The status field of the CiliumIdentity can remain unpopulated. This will massively increase the scalability of identity allocation as no PATCHing is required after the identity has been allocated. Signed-off-by: Thomas Graf <thomas@cilium.io>
46948c0
to
24abe2f
Compare
test-me-please |
Signed-off-by: Thomas Graf <thomas@cilium.io>
24abe2f
to
30ec2e5
Compare
test-me-please Looks like the CI is subject to some underlying network issues:
|
test-me-please |
defer i.mutex.Unlock() | ||
|
||
for identity, lifesign := range i.lastLifesign { | ||
if time.Since(lifesign) > 10*i.timeout { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the 10*
? can't this live live somewhere else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
10 is a safe assumption for when a life sign is definitely never needed again.
@@ -70,6 +71,7 @@ func enableCiliumEndpointSyncGC() { | |||
podsCache[pod.Namespace+"/"+pod.Name] = &pod |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It worries me a little bit that we are doing an entire Pod List of the cluster every N minutes. This function should be re-written once #11195 is merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is your worry that we do this every 10min instead of 30min now? We have always been doing this so far. This only changes the interval.
We can definitely optimize this. I'm not sure a watcher is always better though. it will pay the cost of churn. Whereas right now we pay the snapshot cost of all pods which can be a lot less or a lot more. It will vary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't aware about this listing at all, that's why I made the comment. The watcher will be good because we can optimize it and only keep the pod name and namespace in memory (instead of all fields of the pod). Can be optimized as a follow up.
@@ -96,6 +98,10 @@ func enableCiliumEndpointSyncGC() { | |||
|
|||
// For each CEP we fetched, check if we know about it | |||
for _, cep := range ceps.Items { | |||
if cep.Status.Identity != nil { | |||
identityHeartbeat.MarkAlive(strconv.FormatInt(cep.Status.Identity.ID, 10), timeNow) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't this logic be in the CEP watcher? With a reference counter we would be able to know which CEPs are using a particular identity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which CEP watcher?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apparently a non-existing one (I thought we had one in the operator). I was thinking on having a CEP watcher in the operator that would keep a reference counter for the used identities.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The watcher has a downside though. If we ever miss the notification, we won't ever clean up again unless the identity gets re-used and released again. We are also no longer talking about a heartbeat-based mechanism but event notification with the usual pros and cons.
@@ -70,6 +71,7 @@ func enableCiliumEndpointSyncGC() { | |||
podsCache[pod.Namespace+"/"+pod.Name] = &pod |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't aware about this listing at all, that's why I made the comment. The watcher will be good because we can optimize it and only keep the pod name and namespace in memory (instead of all fields of the pod). Can be optimized as a follow up.
@@ -96,6 +98,10 @@ func enableCiliumEndpointSyncGC() { | |||
|
|||
// For each CEP we fetched, check if we know about it | |||
for _, cep := range ceps.Items { | |||
if cep.Status.Identity != nil { | |||
identityHeartbeat.MarkAlive(strconv.FormatInt(cep.Status.Identity.ID, 10), timeNow) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apparently a non-existing one (I thought we had one in the operator). I was thinking on having a CEP watcher in the operator that would keep a reference counter for the used identities.
Following the deprecation notice from cilium#11275, remove the field. Signed-off-by: Chris Tarazi <chris@isovalent.com>
Following the deprecation notice from #11275, remove the field. Signed-off-by: Chris Tarazi <chris@isovalent.com>
Following the deprecation notice from #11275, remove the field. Signed-off-by: Chris Tarazi <chris@isovalent.com>
Following the deprecation notice from #11275, remove the field. Signed-off-by: Chris Tarazi <chris@isovalent.com>
This PR changes the garbage collection mechanism of identities to use the CiliumEndpoint information instead of requiring all nodes to maintain status information about who is using a particular identity.
This improves scalability of identity management in CRD mode massively.
Depends on: #11270