Skip to content

Conversation

tommyp1ckles
Copy link
Contributor

@tommyp1ckles tommyp1ckles commented Nov 1, 2023

Currently, the CiliumIdentity K8s watcher OnDelete handler has a mechanism to recreate wrongly deleted CiliumIdentities.

This is done by checking if:

  • If enableMasterKeyProtection is enabled (default=true).
  • If this agent is the "owning" agent (i.e. the cid originates from this node and the cid is in the agents localStore).

If so, K8s Delete events are handled by recreating the local CID object. However this is not done resiliently - where if there is a failure at that time with the create request to apiserver then the error is logged and the issue never resolved.

To improve this, this will perform the request attempt in a controller, terminating only if:

  • Since the last attempt, the ciliumidentity was not recreated (i.e. still doesn't exist).
  • The identity is still locally owned.

This will provide resiliency against issues where the apiserver is temporarily unavailable, the agent has been rate limited, etc.

Testing

Kind: ensuring ciliumidentities are recreated by deleting the cid and looking for relevant logs:

level=debug msg="OnDelete MasterKeyProtection update succeeded" id=42281 subsys=allocator
When master key protection is enabled, failed attempts at recreating k8s identity resources will now be retried. 

@maintainer-s-little-helper
Copy link

Commit f6baedd does not match "(?m)^Signed-off-by:".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@maintainer-s-little-helper maintainer-s-little-helper bot added dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Nov 1, 2023
@tommyp1ckles tommyp1ckles force-pushed the pr/tp/ciliumidentity-resiliency-improvements-v1 branch from 2b58614 to 3761d25 Compare November 1, 2023 02:28
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Nov 1, 2023
@tommyp1ckles
Copy link
Contributor Author

/test

1 similar comment
@tommyp1ckles
Copy link
Contributor Author

/test

@tommyp1ckles tommyp1ckles force-pushed the pr/tp/ciliumidentity-resiliency-improvements-v1 branch from 3761d25 to 2a0392a Compare November 1, 2023 05:05
@tommyp1ckles
Copy link
Contributor Author

/test

@tommyp1ckles tommyp1ckles force-pushed the pr/tp/ciliumidentity-resiliency-improvements-v1 branch 2 times, most recently from d5e4786 to c497e09 Compare November 1, 2023 20:36
@tommyp1ckles tommyp1ckles changed the title Pr/tp/ciliumidentity resiliency improvements v1 ciliumidentity resiliency improvements Nov 1, 2023
@tommyp1ckles tommyp1ckles added the release-note/minor This PR changes functionality that users may find relevant to operating Cilium. label Nov 1, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Nov 1, 2023
@tommyp1ckles tommyp1ckles force-pushed the pr/tp/ciliumidentity-resiliency-improvements-v1 branch from c497e09 to ce82268 Compare November 1, 2023 20:58
@tommyp1ckles tommyp1ckles marked this pull request as ready for review November 2, 2023 15:02
@tommyp1ckles tommyp1ckles requested review from a team as code owners November 2, 2023 15:02
@tommyp1ckles tommyp1ckles changed the title ciliumidentity resiliency improvements ciliumidentity resiliency improvement Nov 2, 2023
@christarazi christarazi added area/daemon Impacts operation of the Cilium daemon. area/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers. labels Nov 2, 2023
@tommyp1ckles tommyp1ckles added the affects/v1.14 This issue affects v1.14 branch label Nov 2, 2023
Copy link
Member

@giorio94 giorio94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The overall structure looks good to me. I've left a few comment inline

@tommyp1ckles tommyp1ckles force-pushed the pr/tp/ciliumidentity-resiliency-improvements-v1 branch 2 times, most recently from 240ed3e to 6cab386 Compare November 5, 2023 17:17
Copy link
Member

@giorio94 giorio94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Just a couple more comments inline.

@tommyp1ckles tommyp1ckles force-pushed the pr/tp/ciliumidentity-resiliency-improvements-v1 branch from 6cab386 to 6f35c61 Compare November 6, 2023 18:01
@tommyp1ckles
Copy link
Contributor Author

/test

@tommyp1ckles
Copy link
Contributor Author

Hitting: #26351 in e2e

@tommyp1ckles
Copy link
Contributor Author

tommyp1ckles commented Nov 6, 2023

Seeing

Error response from daemon: received unexpected HTTP status: 502 Bad Gateway
Error: Process completed with exit code 1.

on the ginkgo tests, seems like possibly a sporadic issue with the docker registry, retrying...

@tommyp1ckles tommyp1ckles force-pushed the pr/tp/ciliumidentity-resiliency-improvements-v1 branch from 6f35c61 to 1a04e3f Compare November 6, 2023 20:54
@tommyp1ckles
Copy link
Contributor Author

/test

@tommyp1ckles
Copy link
Contributor Author

rebasing back one commit on main to see if that resolves/narrows down the gw conformance failures.

@tommyp1ckles
Copy link
Contributor Author

/test

1 similar comment
@tommyp1ckles
Copy link
Contributor Author

/test

@tommyp1ckles tommyp1ckles removed the affects/v1.14 This issue affects v1.14 branch label Nov 7, 2023
@tommyp1ckles tommyp1ckles reopened this Nov 7, 2023
@tommyp1ckles
Copy link
Contributor Author

/test

OnDelete k8s handler will now use a controller to reattempt master key
recreation (i.e. situations where the owning agent sees a delete event
on the k8s cid crd).

This will use the localKeys store as a source of truth of local
ciliumidentities, with sync attempts only happening if the numeric identity
exists in localKeys.

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
onDeleteLocked is used when draining the cache to remove all keys.
In this case we want to remove all keys regardless of master keys.

This switches this functionality to make the master key recreation
optional, and only using this behavior when handling OnDelete events.

Other uses of onDeleteLocked should always remove the identity locally.

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
This test should cover:
* OnDelete with master key protection enabled causing resilient sync of
  cid crd.
* OnDelete without master key protection.

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
These tests have been commented out for over four years, so we should
probably just get rid of them.

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
@tommyp1ckles tommyp1ckles force-pushed the pr/tp/ciliumidentity-resiliency-improvements-v1 branch from 1a04e3f to 452c640 Compare November 7, 2023 18:27
@tommyp1ckles
Copy link
Contributor Author

/test

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Nov 7, 2023
@squeed squeed merged commit 64d59aa into cilium:main Nov 8, 2023
@aanm
Copy link
Member

aanm commented Dec 5, 2023

@tommyp1ckles can we also have a release note for this one? Thanks!

@tommyp1ckles
Copy link
Contributor Author

@tommyp1ckles can we also have a release note for this one? Thanks!

yup added

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/daemon Impacts operation of the Cilium daemon. area/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants