Skip to content

Conversation

YutaroHayakawa
Copy link
Member

We recently introduced several CRD status reporting features (#33411, #35517, #35527, #35650). They are useful for troubleshooting, but there's a trade-off between API server load. For the large cluster with many nodes or many BGP resources may experience a high API server load. In this PR, we introduce a knob to disable BGPv2 CRD status reporting for such an environment.

bgpv2: Add a knob to disable CRD status reporting

@YutaroHayakawa YutaroHayakawa added the release-note/minor This PR changes functionality that users may find relevant to operating Cilium. label Nov 14, 2024
@YutaroHayakawa YutaroHayakawa force-pushed the opt-out-bgpv2-status branch 4 times, most recently from b4a17f1 to f732c49 Compare November 18, 2024 05:33
Status reporting (status.conditions in the ClusterConfig and PeerConfig
or status in the NodeConfig) are useful for troubleshooting. Thus, it is
recommended to enable it. However, if there's an issue like high API
server load, we may want to disable it.

This commit introduces a Helm value to control it. The default is
enabled. We'll introduce an actual agent/operator options in the
subsequent commits.

Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
Add a new agent/operator option enable-bgp-control-plane-status-report
that enables/disables status reporting (status.conditions on the
CiliumBGPClusterConfig and CiliumBGPPeerConfig, status on the
CiliumBGPNodeConfig). It is enabled by default, but users can disable it
to reduce the API server load.

So far, we only provide a knob to enable/disable all status reporting.
In the future, we may want to provide more fine-grained knob such as
enabling/disabling status reporting per resource. However, as a starting
point, we'll provide it with the simplest way.

Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
When users disable the status reporting on the running cluster, the
previously reported status won't be updated anymore. That means the
stale status report persists. To avoid this, cleanup the status fields
when the status reporting is disabled.

Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
@YutaroHayakawa
Copy link
Member Author

/test

@YutaroHayakawa YutaroHayakawa marked this pull request as ready for review November 18, 2024 07:40
@YutaroHayakawa YutaroHayakawa requested review from a team as code owners November 18, 2024 07:40
Copy link
Contributor

@rastislavs rastislavs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

Copy link
Member

@gandro gandro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code-wise, this looks good to me.

We used to have a similar feature for CNPs, that we eventually turned off by default and then removed completely because it suffered from the same scalability issues.

I do wonder if we should also disable this feature by default.

Copy link
Member

@gandro gandro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline. Turns out the scalability issues are less pronounced with the BGP status compared to the old CNP status I mentioned. So keeping this enabled by default seems like a reasonable tradeoff.

In the CNP case, we used to edit global CNP from multiple nodes while they are watched by all nodes. On the other hand, the BGP resources only have a single writer.

@ldelossa ldelossa removed their request for review November 18, 2024 17:56
@YutaroHayakawa
Copy link
Member Author

Cilium E2E Upgrade: #35774
Conformance External Workloads: Provisioning timeout
Conformance Ginkgo: #18889

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Nov 20, 2024
@youngnick youngnick added this pull request to the merge queue Nov 21, 2024
Merged via the queue into cilium:main with commit f114b99 Nov 21, 2024
82 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants