
fix: stop watches when TCP is scaled to zero #771


Merged: 1 commit merged into clastix:master on Apr 7, 2025

Conversation

avorima (Contributor) commented on Apr 4, 2025

We sometimes scale down unused control planes to preserve resources. They can be scaled up on demand, so this is always meant to be a transitory state.
We have observed some issues with this in our cluster in the past. There were, of course, the network errors logged by the controller-runtime watches because the control plane was unavailable, but there was also a strange behavior where new control planes would not get valid certificates.
I decided to run the full finalizer cleanup when the TCP is scaled to zero, because this mirrors what we see in our environments: users sometimes delete control planes outright once they notice they have been scaled down, i.e. have been unused for a while.
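
To illustrate the shape of the change, here is a rough controller-runtime sketch; the finalizer name, `isScaledToZero`, and `cleanupResources` are placeholders for illustration, not the actual Kamaji code:

```go
package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

	kamajiv1alpha1 "github.com/clastix/kamaji/api/v1alpha1"
)

// Placeholder finalizer name, not necessarily the one Kamaji uses.
const finalizerName = "finalizer.kamaji.clastix.io/cleanup"

type TenantControlPlaneReconciler struct {
	client.Client
}

func (r *TenantControlPlaneReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	tcp := &kamajiv1alpha1.TenantControlPlane{}
	if err := r.Get(ctx, req.NamespacedName, tcp); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	if isScaledToZero(tcp) {
		// Treat scale-to-zero like deletion: stop the per-TCP watches, tear
		// down dependent resources, and drop the finalizer bookkeeping.
		if err := cleanupResources(ctx, r.Client, tcp); err != nil {
			return ctrl.Result{}, err
		}
		if controllerutil.RemoveFinalizer(tcp, finalizerName) {
			return ctrl.Result{}, r.Update(ctx, tcp)
		}
		return ctrl.Result{}, nil
	}

	// Normal path: make sure the finalizer is set so the same cleanup also
	// runs when a user deletes the TCP outright.
	if controllerutil.AddFinalizer(tcp, finalizerName) {
		if err := r.Update(ctx, tcp); err != nil {
			return ctrl.Result{}, err
		}
	}

	// ... reconcile the running control plane as usual ...
	return ctrl.Result{}, nil
}

// isScaledToZero and cleanupResources are illustrative stubs: the real code
// reads the desired replica count from the TenantControlPlane spec and tears
// down the resources owned by the TCP (watches, deployments, secrets, ...).
func isScaledToZero(*kamajiv1alpha1.TenantControlPlane) bool { return false }

func cleanupResources(context.Context, client.Client, *kamajiv1alpha1.TenantControlPlane) error {
	return nil
}
```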

Preliminary testing of this change in our environment looked promising, but I'll let it sit over the weekend to get more data.

Another issue we see when scaling TCPs down to 0 is that the datastore controllers keep logging "channel is full" well after the control planes are scaled back up again.
I have yet to figure out 1) why the channel fills up in the first place and 2) why it stays full after the scale-up.

netlify bot commented on Apr 4, 2025

Deploy Preview for kamaji-documentation canceled.

🔨 Latest commit: 563bcba
🔍 Latest deploy log: https://app.netlify.com/sites/kamaji-documentation/deploys/67efb4cd3d87dd0008d69939

prometherion (Member) commented

> Another issue we see when scaling TCPs down to 0 is that the datastore controllers keep logging "channel is full" well after the control planes are scaled back up again

I suspect it's related to a misbehaviour in our controllers: we're pushing a generic event to the source channel, which then pushes to controllers that are unable to run.
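
Roughly the pattern I mean, as a sketch (the buffer size, names, and log wording are illustrative, not the actual Kamaji code): the producer does a non-blocking send of a GenericEvent into a buffered channel backing a channel source, so if the consuming controller isn't draining it, the buffer fills up and every later notification just logs that the channel is full.

```go
package controllers

import (
	"log"

	"sigs.k8s.io/controller-runtime/pkg/event"

	kamajiv1alpha1 "github.com/clastix/kamaji/api/v1alpha1"
)

// Buffered channel used as a channel source for a secondary controller.
// The capacity is illustrative.
var tcpEvents = make(chan event.GenericEvent, 256)

// notifyTCP sketches the producer side: the datastore controller broadcasts a
// GenericEvent for a TenantControlPlane. The send is non-blocking, so if the
// consumer never drains the channel (e.g. its watches were stopped while the
// TCP was scaled to zero), the buffer eventually fills and every further
// notification only results in a "channel is full" log line.
func notifyTCP(tcp *kamajiv1alpha1.TenantControlPlane) {
	select {
	case tcpEvents <- event.GenericEvent{Object: tcp}:
	default:
		log.Printf("channel is full, dropping event for %s/%s", tcp.GetNamespace(), tcp.GetName())
	}
}

// On the consumer side, tcpEvents would typically be wired into the manager as
// a raw source (e.g. via source.Channel) so a controller can enqueue requests
// from these events; if that controller isn't running, nothing reads tcpEvents.
```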

Are you still facing this error despite the proposed changes?

avorima (Contributor, Author) commented on Apr 7, 2025

> Another issue we see when scaling TCPs down to 0 is that the datastore controllers keep logging "channel is full" well after the control planes are scaled back up again

> I suspect it's related to a misbehaviour in our controllers: we're pushing a generic event to the source channel, which then pushes to controllers that are unable to run.

> Are you still facing this error despite the proposed changes?

Yes, and it looks like it's logged for all TCPs (even the ones that are scaled up) whenever a TCP event is received, e.g. when a TCP is created, updated, or deleted.

avorima marked this pull request as ready for review on April 7, 2025, 09:03
avorima (Contributor, Author) commented on Apr 7, 2025

The main issue was fixed by this PR, so I'm undrafting.

prometherion merged commit dc18f27 into clastix:master on Apr 7, 2025
11 checks passed
prometherion (Member) commented

Thanks, this PR was very good.
