Skip to content

Conversation

dlapcevic
Copy link
Contributor

@dlapcevic dlapcevic commented Sep 13, 2024

CFP: cilium/design-cfps#59

Related work done in cilium-operator in
https://github.com/cilium/cilium/tree/main/operator/pkg/ciliumidentity.

Note: No user facing changes. The flag is disabled by default and hidden.

Introduce changes to agent's identity allocation system to support the case when operator is managing (create, update, delete) Cilium Identities.

Agent no longer creates Cilium Identities (global identities). Agent instead only gets identity from the local cache, which is populated by the Cilium Identities watcher events. In case identity is not found in the local cache, identity allocation will be retried. After a certain amount of retries, identity allocation will fail. This affects pod creation -- CNI fails and pod doesn’t get into ready state. This allows easier error detection in the identity management system. Cilium Identity is expected to exist for all pods, because operator creates them on pod object creation, before the pods are scheduled.

The Cilium Identity GC remains to be handled by operator without changes, for now. Agent no longer acquires reference to identities locally, and agent no longer removes the HeartBeatAnnotation ("io.cilium.heartbeat"). This isn’t a problem because GC will only delete Cilium Identities after they are no longer used in any Cilium Endpoints or Cilium Endpoint Slices. Practically, there shouldn’t be a case when operator GC deletes a Cilium Identity that is used by agent. But even if it happens, operator will recreate it immediately, because there are pods that match that identity, in the same way agents used to do. The plan is to eventually move GC to be part of the Cilium Identity controller in operator, at which point the HeartBeatAnnotation will be deprecated, and this concern will no longer exist.

Strategy
Including operator ID management in the existing allocator is a relatively small change and the first step towards adopting operator ID management. Over time, as there is proof that this feature is reliable and beneficial (additionally because of further enhancements that the feature enables), the goal would be to refactor the allocator code, to make it shorter, simpler and easier to extend, and eventually make the feature default. Until then, it makes sense not to decouple it from the current allocator implementation, because it would make it more difficult for other modifications/enhancements in the allocator to be applied (because there would be two implementations), while operator ID management implementation would anyway reuse nearly all the code of the current allocator implementation.

Manual test
Tested manually with the CID controller running (https://github.com/cilium/cilium/tree/main/operator/pkg/ciliumidentity) in cilium-operator and operator-manages-identities flag enabled for both cilium-agent and cilium-operator.

Signed-off-by: Dorde Lapcevic <dordel@google.com>

@dlapcevic dlapcevic requested review from a team as code owners September 13, 2024 15:52
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Sep 13, 2024
@github-actions github-actions bot added the sig/policy Impacts whether traffic is allowed or denied based on user-defined policies. label Sep 13, 2024
@dlapcevic
Copy link
Contributor Author

@dlapcevic
Copy link
Contributor Author

Hi. I would greatly appreciate suggestions on how I could expand testing to cover other features that utilize the ID allocation system.

It appears that the ID allocation system is mostly tested for KVStore ID allocation mode.

I want to find a way to set up integration tests for a broader scope of features using different ID allocation modes and with/without operator ID management.

@dlapcevic dlapcevic force-pushed the operator-manages-global-identities-7 branch from 4ab4d7e to 6ed96e4 Compare September 15, 2024 17:04
@dlapcevic
Copy link
Contributor Author

cc @pchaigno

@pchaigno pchaigno added the release-note/misc This PR makes changes that have no direct user impact. label Sep 16, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Sep 16, 2024
@dlapcevic
Copy link
Contributor Author

cc: @cilium/sig-policy

@pchaigno pchaigno requested a review from pippolo84 September 20, 2024 08:34
@pchaigno
Copy link
Member

Ping @doniacld @tommyp1ckles @thorn3r @giorio94 🙂

@squeed
Copy link
Contributor

squeed commented Sep 20, 2024

the goal would be to refactor the allocator code, to make it shorter, simpler and easier to extend, and eventually make the feature default.

That would be nice. The current structure is a bit confusing, for sure!

@dlapcevic dlapcevic force-pushed the operator-manages-global-identities-7 branch 2 times, most recently from 3f68fd7 to 6e91079 Compare September 20, 2024 10:41
@dlapcevic dlapcevic requested a review from a team as a code owner September 20, 2024 10:41
@dlapcevic dlapcevic force-pushed the operator-manages-global-identities-7 branch 2 times, most recently from feff69c to f0b76bf Compare September 20, 2024 13:56
@dlapcevic dlapcevic requested a review from squeed September 20, 2024 14:22
@marseel marseel removed the request for review from thorn3r September 24, 2024 12:16
Copy link
Contributor

@marseel marseel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good for my codeowners, thanks!

@dlapcevic
Copy link
Contributor Author

Hi @squeed, could you please help me move this forward?

@christarazi christarazi added area/operator Impacts the cilium-operator component kind/feature This introduces new functionality. labels Sep 26, 2024
Copy link
Contributor

@ovidiutirla ovidiutirla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution 🚀
Some small nits

@dlapcevic dlapcevic force-pushed the operator-manages-global-identities-7 branch from f0b76bf to 6b13b8b Compare October 3, 2024 13:57
@dlapcevic dlapcevic requested a review from ovidiutirla October 3, 2024 14:00
@dlapcevic dlapcevic force-pushed the operator-manages-global-identities-7 branch 3 times, most recently from 136d914 to 999ae07 Compare October 4, 2024 13:36
@dlapcevic
Copy link
Contributor Author

/ci-eks

@dlapcevic
Copy link
Contributor Author

/ci-ginkgo

@dlapcevic
Copy link
Contributor Author

/ci-runtime

@pchaigno pchaigno added this pull request to the merge queue Oct 10, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 10, 2024
@pchaigno pchaigno added this pull request to the merge queue Oct 10, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 10, 2024
@dlapcevic dlapcevic force-pushed the operator-manages-global-identities-7 branch from 999ae07 to 4846a1c Compare October 10, 2024 14:38
CFP: cilium/design-cfps#59

Related work done in cilium-operator in
https://github.com/cilium/cilium/tree/main/operator/pkg/ciliumidentity.

Introduce changes to agent's identity allocation system to support the case when operator is managing (create, update, delete) Cilium Identities.

Agent no longer creates Cilium Identities (global identities).
Agent instead only gets identity from the local cache, which is populated by the Cilium Identities watcher events.
In case identity is not found in the local cache, identity allocation will be retried.
After a certain amount of retries, identity allocation will fail. This affects pod creation -- CNI fails and pod doesn’t get into ready state. This allows easier error detection in the identity management system.
Cilium Identity is expected to exist for all pods, because operator creates them on pod object creation, before the pods are scheduled.

The Cilium Identity GC remains to be handled by operator without changes, for now.
Agent no longer acquires reference to identities locally, and agent no longer removes the HeartBeatAnnotation ("io.cilium.heartbeat").
This isn’t a problem because GC will only delete Cilium Identities after they are no longer used in any Cilium Endpoints or Cilium Endpoint Slices.
Practically, there shouldn’t be a case when operator GC deletes a Cilium Identity that is used by agent.
But even if it happens, operator will recreate it immediately, because there are pods that match that identity, in the same way agents used to do.
The plan is to eventually move GC to be part of the Cilium Identity controller in operator, at which point the HeartBeatAnnotation will be deprecated, and this concern will no longer exist.

Strategy
Including operator ID management in the existing allocator is a relatively small change and the first step towards adopting operator ID management. Over time, as there is proof that this feature is reliable and beneficial (additionally because of further enhancements that the feature enables), the goal would be to refactor the allocator code, to make it shorter, simpler and easier to extend, and eventually make the feature default. Until then, it makes sense not to decouple it from the current allocator implementation, because it would make it more difficult for other modifications/enhancements in the allocator to be applied (because there would be two implementations), while operator ID management implementation would anyway reuse nearly all the code of the current allocator implementation.

Manual test
Tested manually with the CID controller running (https://github.com/cilium/cilium/tree/main/operator/pkg/ciliumidentity) in cilium-operator and operator-manages-identities flag enabled for both cilium-agent and cilium-operator.

Signed-off-by: Dorde Lapcevic <dordel@google.com>
@dlapcevic dlapcevic force-pushed the operator-manages-global-identities-7 branch from 4846a1c to ef2c61a Compare October 10, 2024 14:55
@pchaigno pchaigno enabled auto-merge October 10, 2024 15:10
@pchaigno
Copy link
Member

/test

@dlapcevic
Copy link
Contributor Author

Rebased and fixed Lint Source Code from Go Related Checks.

@pchaigno pchaigno added this pull request to the merge queue Oct 10, 2024
Merged via the queue into cilium:main with commit c7d6f00 Oct 10, 2024
63 checks passed
@joestringer
Copy link
Member

We should probably have a discussion about goals for this feature as we approach v1.17 feature freeze, and how to communicate the changes with users. I noticed this is marked as release-note/misc which is probably accurate particularly because the flag is disabled+hidden. But at some point we probably will want to have a release note that references this, and maybe feature this somehow in the release announcement. Not sure if that's a v1.17 thing or for a subsequent release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/operator Impacts the cilium-operator component kind/feature This introduces new functionality. release-note/misc This PR makes changes that have no direct user impact. sig/policy Impacts whether traffic is allowed or denied based on user-defined policies.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants