Skip to content

Conversation

tgraf
Copy link
Member

@tgraf tgraf commented Apr 30, 2020

This PR changes the garbage collection mechanism of identities to use the CiliumEndpoint information instead of requiring all nodes to maintain status information about who is using a particular identity.

This improves scalability of identity management in CRD mode massively.

Depends on: #11270

tgraf added 2 commits May 1, 2020 01:26
…dpoint

Rework the CiliumIdentity garbage collector to derive liveness of a
CiliumIdentity based on the identity usage of CiliumEndpoint
information. As the CiliumEndpoint garbage collector scans endpoints, it
will maintain a list of identities with heartbeat signals.

As the CiliumIdentity garbage collector runs, it will use the heartbeat
signals before removing stale identities.

In order to make the gabrage collector with the heartbeat timeout
requirements work, the default scan intervals are reduced slightly.

Signed-off-by: Thomas Graf <thomas@cilium.io>
With the introduction of the CiliumEndpoint based garbage collection of
CiliumIdentities. The status field of the CiliumIdentity can remain
unpopulated. This will massively increase the scalability of identity
allocation as no PATCHing is required after the identity has been
allocated.

Signed-off-by: Thomas Graf <thomas@cilium.io>
@tgraf tgraf added pending-review kind/performance There is a performance impact of this. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. labels Apr 30, 2020
@tgraf tgraf requested review from a team as code owners April 30, 2020 23:28
@tgraf tgraf requested a review from a team April 30, 2020 23:28
@tgraf tgraf force-pushed the pr/tgraf/optimize-identity-gc branch from 46948c0 to 24abe2f Compare April 30, 2020 23:38
@tgraf
Copy link
Member Author

tgraf commented Apr 30, 2020

test-me-please

Signed-off-by: Thomas Graf <thomas@cilium.io>
@tgraf tgraf force-pushed the pr/tgraf/optimize-identity-gc branch from 24abe2f to 30ec2e5 Compare May 1, 2020 06:19
@coveralls
Copy link

Coverage Status

Coverage increased (+0.1%) to 44.699% when pulling 30ec2e5 on pr/tgraf/optimize-identity-gc into ba3a74b on master.

@tgraf
Copy link
Member Author

tgraf commented May 1, 2020

test-me-please

Looks like the CI is subject to some underlying network issues:

/home/jenkins/workspace/Cilium-PR-Ginkgo-Tests-Validated/runtime-gopath/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:431
5 of 1  pings to "8.8.8.8" failed
Expected
    <int>: 5
to be <=
    <int>: 1

@tgraf
Copy link
Member Author

tgraf commented May 1, 2020

test-me-please

@tgraf tgraf requested a review from aanm May 1, 2020 11:36
defer i.mutex.Unlock()

for identity, lifesign := range i.lastLifesign {
if time.Since(lifesign) > 10*i.timeout {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the 10*? can't this live live somewhere else?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

10 is a safe assumption for when a life sign is definitely never needed again.

@@ -70,6 +71,7 @@ func enableCiliumEndpointSyncGC() {
podsCache[pod.Namespace+"/"+pod.Name] = &pod
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It worries me a little bit that we are doing an entire Pod List of the cluster every N minutes. This function should be re-written once #11195 is merged.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is your worry that we do this every 10min instead of 30min now? We have always been doing this so far. This only changes the interval.

We can definitely optimize this. I'm not sure a watcher is always better though. it will pay the cost of churn. Whereas right now we pay the snapshot cost of all pods which can be a lot less or a lot more. It will vary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't aware about this listing at all, that's why I made the comment. The watcher will be good because we can optimize it and only keep the pod name and namespace in memory (instead of all fields of the pod). Can be optimized as a follow up.

@@ -96,6 +98,10 @@ func enableCiliumEndpointSyncGC() {

// For each CEP we fetched, check if we know about it
for _, cep := range ceps.Items {
if cep.Status.Identity != nil {
identityHeartbeat.MarkAlive(strconv.FormatInt(cep.Status.Identity.ID, 10), timeNow)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't this logic be in the CEP watcher? With a reference counter we would be able to know which CEPs are using a particular identity.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which CEP watcher?

Copy link
Member

@aanm aanm May 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently a non-existing one (I thought we had one in the operator). I was thinking on having a CEP watcher in the operator that would keep a reference counter for the used identities.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The watcher has a downside though. If we ever miss the notification, we won't ever clean up again unless the identity gets re-used and released again. We are also no longer talking about a heartbeat-based mechanism but event notification with the usual pros and cons.

@@ -70,6 +71,7 @@ func enableCiliumEndpointSyncGC() {
podsCache[pod.Namespace+"/"+pod.Name] = &pod
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't aware about this listing at all, that's why I made the comment. The watcher will be good because we can optimize it and only keep the pod name and namespace in memory (instead of all fields of the pod). Can be optimized as a follow up.

@@ -96,6 +98,10 @@ func enableCiliumEndpointSyncGC() {

// For each CEP we fetched, check if we know about it
for _, cep := range ceps.Items {
if cep.Status.Identity != nil {
identityHeartbeat.MarkAlive(strconv.FormatInt(cep.Status.Identity.ID, 10), timeNow)
Copy link
Member

@aanm aanm May 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently a non-existing one (I thought we had one in the operator). I was thinking on having a CEP watcher in the operator that would keep a reference counter for the used identities.

@tgraf tgraf marked this pull request as draft May 4, 2020 10:26
@tgraf tgraf marked this pull request as ready for review May 6, 2020 14:41
@aanm aanm merged commit c0e278a into master May 6, 2020
@aanm aanm deleted the pr/tgraf/optimize-identity-gc branch May 6, 2020 14:48
christarazi added a commit to christarazi/cilium that referenced this pull request Sep 2, 2020
Following the deprecation notice from
cilium#11275, remove the field.

Signed-off-by: Chris Tarazi <chris@isovalent.com>
christarazi added a commit that referenced this pull request Sep 4, 2020
Following the deprecation notice from
#11275, remove the field.

Signed-off-by: Chris Tarazi <chris@isovalent.com>
christarazi added a commit that referenced this pull request Sep 7, 2020
Following the deprecation notice from
#11275, remove the field.

Signed-off-by: Chris Tarazi <chris@isovalent.com>
nebril pushed a commit that referenced this pull request Sep 7, 2020
Following the deprecation notice from
#11275, remove the field.

Signed-off-by: Chris Tarazi <chris@isovalent.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/performance There is a performance impact of this. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants