-
Notifications
You must be signed in to change notification settings - Fork 3.4k
operator-id-management: agent waits for global identities #34867
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
operator-id-management: agent waits for global identities #34867
Conversation
Hi. I would greatly appreciate suggestions on how I could expand testing to cover other features that utilize the ID allocation system. It appears that the ID allocation system is mostly tested for KVStore ID allocation mode. I want to find a way to set up integration tests for a broader scope of features using different ID allocation modes and with/without operator ID management. |
4ab4d7e
to
6ed96e4
Compare
cc @pchaigno |
cc: @cilium/sig-policy |
Ping @doniacld @tommyp1ckles @thorn3r @giorio94 🙂 |
That would be nice. The current structure is a bit confusing, for sure! |
3f68fd7
to
6e91079
Compare
feff69c
to
f0b76bf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good for my codeowners, thanks!
Hi @squeed, could you please help me move this forward? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution 🚀
Some small nits
f0b76bf
to
6b13b8b
Compare
136d914
to
999ae07
Compare
/ci-eks |
/ci-ginkgo |
/ci-runtime |
999ae07
to
4846a1c
Compare
CFP: cilium/design-cfps#59 Related work done in cilium-operator in https://github.com/cilium/cilium/tree/main/operator/pkg/ciliumidentity. Introduce changes to agent's identity allocation system to support the case when operator is managing (create, update, delete) Cilium Identities. Agent no longer creates Cilium Identities (global identities). Agent instead only gets identity from the local cache, which is populated by the Cilium Identities watcher events. In case identity is not found in the local cache, identity allocation will be retried. After a certain amount of retries, identity allocation will fail. This affects pod creation -- CNI fails and pod doesn’t get into ready state. This allows easier error detection in the identity management system. Cilium Identity is expected to exist for all pods, because operator creates them on pod object creation, before the pods are scheduled. The Cilium Identity GC remains to be handled by operator without changes, for now. Agent no longer acquires reference to identities locally, and agent no longer removes the HeartBeatAnnotation ("io.cilium.heartbeat"). This isn’t a problem because GC will only delete Cilium Identities after they are no longer used in any Cilium Endpoints or Cilium Endpoint Slices. Practically, there shouldn’t be a case when operator GC deletes a Cilium Identity that is used by agent. But even if it happens, operator will recreate it immediately, because there are pods that match that identity, in the same way agents used to do. The plan is to eventually move GC to be part of the Cilium Identity controller in operator, at which point the HeartBeatAnnotation will be deprecated, and this concern will no longer exist. Strategy Including operator ID management in the existing allocator is a relatively small change and the first step towards adopting operator ID management. Over time, as there is proof that this feature is reliable and beneficial (additionally because of further enhancements that the feature enables), the goal would be to refactor the allocator code, to make it shorter, simpler and easier to extend, and eventually make the feature default. Until then, it makes sense not to decouple it from the current allocator implementation, because it would make it more difficult for other modifications/enhancements in the allocator to be applied (because there would be two implementations), while operator ID management implementation would anyway reuse nearly all the code of the current allocator implementation. Manual test Tested manually with the CID controller running (https://github.com/cilium/cilium/tree/main/operator/pkg/ciliumidentity) in cilium-operator and operator-manages-identities flag enabled for both cilium-agent and cilium-operator. Signed-off-by: Dorde Lapcevic <dordel@google.com>
4846a1c
to
ef2c61a
Compare
/test |
Rebased and fixed Lint Source Code from Go Related Checks. |
We should probably have a discussion about goals for this feature as we approach v1.17 feature freeze, and how to communicate the changes with users. I noticed this is marked as |
CFP: cilium/design-cfps#59
Related work done in cilium-operator in
https://github.com/cilium/cilium/tree/main/operator/pkg/ciliumidentity.
Note: No user facing changes. The flag is disabled by default and hidden.
Introduce changes to agent's identity allocation system to support the case when operator is managing (create, update, delete) Cilium Identities.
Agent no longer creates Cilium Identities (global identities). Agent instead only gets identity from the local cache, which is populated by the Cilium Identities watcher events. In case identity is not found in the local cache, identity allocation will be retried. After a certain amount of retries, identity allocation will fail. This affects pod creation -- CNI fails and pod doesn’t get into ready state. This allows easier error detection in the identity management system. Cilium Identity is expected to exist for all pods, because operator creates them on pod object creation, before the pods are scheduled.
The Cilium Identity GC remains to be handled by operator without changes, for now. Agent no longer acquires reference to identities locally, and agent no longer removes the HeartBeatAnnotation ("io.cilium.heartbeat"). This isn’t a problem because GC will only delete Cilium Identities after they are no longer used in any Cilium Endpoints or Cilium Endpoint Slices. Practically, there shouldn’t be a case when operator GC deletes a Cilium Identity that is used by agent. But even if it happens, operator will recreate it immediately, because there are pods that match that identity, in the same way agents used to do. The plan is to eventually move GC to be part of the Cilium Identity controller in operator, at which point the HeartBeatAnnotation will be deprecated, and this concern will no longer exist.
Strategy
Including operator ID management in the existing allocator is a relatively small change and the first step towards adopting operator ID management. Over time, as there is proof that this feature is reliable and beneficial (additionally because of further enhancements that the feature enables), the goal would be to refactor the allocator code, to make it shorter, simpler and easier to extend, and eventually make the feature default. Until then, it makes sense not to decouple it from the current allocator implementation, because it would make it more difficult for other modifications/enhancements in the allocator to be applied (because there would be two implementations), while operator ID management implementation would anyway reuse nearly all the code of the current allocator implementation.
Manual test
Tested manually with the CID controller running (https://github.com/cilium/cilium/tree/main/operator/pkg/ciliumidentity) in cilium-operator and
operator-manages-identities
flag enabled for both cilium-agent and cilium-operator.Signed-off-by: Dorde Lapcevic <dordel@google.com>