Skip to content

Conversation

joestringer
Copy link
Member

  • k8s: Refactor Endpoint{,Slice} initialization
  • k8s: Make k8s resource init and wait consistent
  • k8s: Consolidate resource registration
  • k8s: Only register CRDs for enabled resources

Cilium should no longer register or wait to sync k8s resources that are disabled such as LRP.

See commits for more details.

Previously merged in #17145 and reverted in #17675 due to breakage of external
workloads and clustermesh testsuites which were marked optional at the time.

Refactor this logic into a dedicated function to simplify later commits.
No functional changes.

Signed-off-by: Joe Stringer <joe@cilium.io>
Previously, the list of resources to be initialized would be prepared in
k.EnableK8sWatcher(), but the logic to handle the wait for cache sync
could select a different set of resources.

This PR tidies these up into smaller functions divided by the
functionality they provide (core CNI or feature additions), and
consistently enables these using the pkg/option configurations.

In future we can potentially even remove unused CRDs from the
cluster and Cilium will not complain that they are unavailable; this
commit doesn't try to go quite that far yet.

Signed-off-by: Joe Stringer <joe@cilium.io>
Shift the declaration of the CRD resources used by the agent & other
applications into pkg/k8s/synced, then declare a mapping from that
canonical list of resources to the corresponding groups inside
pkg/k8s/watchers. This makes both pieces of code synchronized. If for
instance CiliumEndpoint CRDs are disabled, they are globally disabled
everywhere rather than just in part of the agent code.

Signed-off-by: Joe Stringer <joe@cilium.io>
Some resources may be dynamically disabled by configuration options in
the agent / operator. It doesn't make sense to register these resources
into Kubernetes if they are disabled in the agent, so don't register
them in that case.

Signed-off-by: Joe Stringer <joe@cilium.io>
@joestringer joestringer requested review from a team as code owners October 22, 2021 18:31
@joestringer joestringer requested review from a team and christarazi October 22, 2021 18:31
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Oct 22, 2021
@joestringer joestringer added the release-note/misc This PR makes changes that have no direct user impact. label Oct 22, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Oct 22, 2021
@joestringer
Copy link
Member Author

ci-multicluster

@joestringer
Copy link
Member Author

ci-external-workloads

@joestringer
Copy link
Member Author

/test

@joestringer
Copy link
Member Author

The two testsuites that were broken by the previous implementation are now passing:

The problem was that this PR makes the CRD registration conditional for optional features like LRP, egress gateway. However I mistakenly set it up so that some of the daemons would wait for all resources to be created on startup, not just the ones that are currently enabled. This led to the following issue in clustermesh-apiserver:

2021-10-20T18:03:34.020650997Z github.com/cilium/cilium/pkg/k8s/synced/crd.go:139: Failed to watch *v1.PartialObjectMetadata: unknown
2021-10-20T18:03:34.020709156Z level=error msg="github.com/cilium/cilium/pkg/k8s/synced/crd.go:139: Failed to watch *v1.PartialObjectMetadata: unknown" subsys=klog
2021-10-20T18:03:52.405794056Z level=fatal msg="Unable to find all Cilium CRDs necessary within 5m0s timeout. Please ensure that Cilium Operator is running, as it's responsible for registering all the Cilium CRDs. The following CRDs were not found: [crd:ciliumegressnatpolicies.cilium.io crd:ciliumlocalredirectpolicies.cilium.io]" error="context deadline exceeded" subsys=k8s

This is now fixed in the latest version.

@joestringer
Copy link
Member Author

joestringer commented Oct 22, 2021

Here's the diff from the previous version I posted, basically the notion of "override" didn't really make sense; all code should just respect the conditional enablement of specific resources.

diff --git a/pkg/k8s/apis/cilium.io/client/register.go b/pkg/k8s/apis/cilium.io/client/register.go
index 7c5e513e7113..7eef1402f674 100644
--- a/pkg/k8s/apis/cilium.io/client/register.go
+++ b/pkg/k8s/apis/cilium.io/client/register.go
@@ -83,7 +83,7 @@ func CreateCustomResourceDefinitions(clientset apiextensionsclient.Interface) er
                synced.CRDResourceName(k8sconstv2.CLRPName):       createCLRPCRD,
                synced.CRDResourceName(k8sconstv2alpha1.CENPName): createCENPCRD,
        }
-       for _, r := range synced.OperatorCRDResourceNames() {
+       for _, r := range synced.AllCRDResourceNames() {
                fn, ok := resourceToCreateFnMapping[r]
                if !ok {
                        log.Fatalf("Unknown resource %s. Please update pkg/k8s/apis/cilium.io/client to understand this type.", r)
diff --git a/pkg/k8s/synced/crd.go b/pkg/k8s/synced/crd.go
index c3e49da5115d..b03a916a19ac 100644
--- a/pkg/k8s/synced/crd.go
+++ b/pkg/k8s/synced/crd.go
@@ -36,7 +36,7 @@ func CRDResourceName(crd string) string {
        return "crd:" + crd
 }
 
-func agentCRDResourceNames(override bool) []string {
+func agentCRDResourceNames() []string {
        result := []string{
                CRDResourceName(v2.CNPName),
                CRDResourceName(v2.CCNPName),
@@ -44,13 +44,13 @@ func agentCRDResourceNames(override bool) []string {
                CRDResourceName(v2.CIDName),
        }
 
-       if override || !option.Config.DisableCiliumEndpointCRD {
+       if !option.Config.DisableCiliumEndpointCRD {
                result = append(result, CRDResourceName(v2.CEPName))
        }
-       if override || option.Config.EnableEgressGateway {
+       if option.Config.EnableEgressGateway {
                result = append(result, CRDResourceName(v2alpha1.CENPName))
        }
-       if override || option.Config.EnableLocalRedirectPolicy {
+       if option.Config.EnableLocalRedirectPolicy {
                result = append(result, CRDResourceName(v2.CLRPName))
        }
 
@@ -60,19 +60,13 @@ func agentCRDResourceNames(override bool) []string {
 // AgentCRDResourceNames returns a list of all CRD resource names the Cilium
 // agent needs to wait to be registered before initializing any k8s watchers.
 func AgentCRDResourceNames() []string {
-       return agentCRDResourceNames(false)
-}
-
-// OperatorCRDResourceNames returns a list of all CRD resource names that the
-// operator may register.
-func OperatorCRDResourceNames() []string {
-       return append(agentCRDResourceNames(false), CRDResourceName(v2.CEWName))
+       return agentCRDResourceNames()
 }
 
 // AllCRDResourceNames returns a list of all CRD resource names that the
 // clustermesh-apiserver or testsuite may register.
 func AllCRDResourceNames() []string {
-       return append(agentCRDResourceNames(true), CRDResourceName(v2.CEWName))
+       return append(agentCRDResourceNames(), CRDResourceName(v2.CEWName))
 }
 
 // SyncCRDs will sync Cilium CRDs to ensure that they have all been

Copy link
Contributor

@kkourt kkourt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

Comment on lines +179 to +180
// We don't add the service option modifier here, as endpointslices do not
// mirror service proxy name label present in the corresponding service.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The patch just moves code around, but reading above comment got me wondering: does here imply that the "service option modifier" is added elsewhere? If yes, should we mention where this happens in the comment? If no, should we remove here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the original author of the comment @fristonio or some of the other @cilium/sig-k8s folks could chime in here. Original commit that introduced the comment is c6eae50 ("k8s: honor the service.kubernetes.io/service-proxy-name label") , but from a glance the comment doesn't mean much to me.

@joestringer
Copy link
Member Author

The ConfermanceEKS job hit #16938 , but otherwise CI is all green. Should be good to merge given that the PR was previously reviewed via #17145, the diff was reviewed, and the tests which had previously regressed are now fixed with the latest diff.

@joestringer joestringer added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Oct 25, 2021
@kkourt kkourt merged commit aed4798 into cilium:master Oct 25, 2021
@joestringer joestringer deleted the submit/k8s-subsystem-init-cleanup-v2 branch October 25, 2021 09:06
@joestringer joestringer mentioned this pull request May 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/misc This PR makes changes that have no direct user impact.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants