
fix: stop watches when TCP is scaled to zero #771


Merged: 1 commit merged into clastix:master on Apr 7, 2025

Conversation

avorima (Contributor) commented on Apr 4, 2025

We sometimes scale down unused control planes to preserve resources. They can be scaled up on demand, so this is always meant to be a transitory state.
We have observed some issues with this in our cluster in the past. There were, of course, the network errors logged by the controller-runtime watches because the control plane was unavailable, but there was also a strange behavior where new control planes would not get valid certificates.
I decided to run the full finalizer cleanup when the TCP is scaled to zero, because this mirrors what we see in our environments: users sometimes delete control planes outright once they notice they have been scaled down, i.e. have been unused for a while.
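
To illustrate the shape of the change, here is a rough controller-runtime sketch; the finalizer name, `isScaledToZero`, and `cleanupResources` are placeholders for illustration, not the actual Kamaji code:

```go
package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

	kamajiv1alpha1 "github.com/clastix/kamaji/api/v1alpha1"
)

// Placeholder finalizer name, not necessarily the one Kamaji uses.
const finalizerName = "finalizer.kamaji.clastix.io/cleanup"

type TenantControlPlaneReconciler struct {
	client.Client
}

func (r *TenantControlPlaneReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	tcp := &kamajiv1alpha1.TenantControlPlane{}
	if err := r.Get(ctx, req.NamespacedName, tcp); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	if isScaledToZero(tcp) {
		// Treat scale-to-zero like deletion: stop the per-TCP watches, tear
		// down dependent resources, and drop the finalizer bookkeeping.
		if err := cleanupResources(ctx, r.Client, tcp); err != nil {
			return ctrl.Result{}, err
		}
		if controllerutil.RemoveFinalizer(tcp, finalizerName) {
			return ctrl.Result{}, r.Update(ctx, tcp)
		}
		return ctrl.Result{}, nil
	}

	// Normal path: make sure the finalizer is set so the same cleanup also
	// runs when a user deletes the TCP outright.
	if controllerutil.AddFinalizer(tcp, finalizerName) {
		if err := r.Update(ctx, tcp); err != nil {
			return ctrl.Result{}, err
		}
	}

	// ... reconcile the running control plane as usual ...
	return ctrl.Result{}, nil
}

// isScaledToZero and cleanupResources are illustrative stubs: the real code
// reads the desired replica count from the TenantControlPlane spec and tears
// down the resources owned by the TCP (watches, deployments, secrets, ...).
func isScaledToZero(*kamajiv1alpha1.TenantControlPlane) bool { return false }

func cleanupResources(context.Context, client.Client, *kamajiv1alpha1.TenantControlPlane) error {
	return nil
}
```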

Preliminary testing of this change in our environment looked promising, but I'll let it sit over the weekend to get more data.

Another issue we see when scaling TCPs down to 0 is that the datastore controllers keep logging "channel is full" well after the control planes are scaled back up again.
I have yet to figure out 1) why the channel fills up in the first place and 2) why it stays full after the scale-up.

netlify bot commented on Apr 4, 2025

Deploy Preview for kamaji-documentation canceled.

🔨 Latest commit: 563bcba
🔍 Latest deploy log: https://app.netlify.com/sites/kamaji-documentation/deploys/67efb4cd3d87dd0008d69939

prometherion (Member) commented

> Another issue we see when scaling TCPs down to 0 is that the datastore controllers keep logging "channel is full" well after the control planes are scaled back up again

I suspect it's related to a misbehaviour in our controllers: we're pushing a generic event to the source channel, which then pushes to controllers that are unable to run.
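
Roughly the pattern I mean, as a sketch (the buffer size, names, and log wording are illustrative, not the actual Kamaji code): the producer does a non-blocking send of a GenericEvent into a buffered channel backing a channel source, so if the consuming controller isn't draining it, the buffer fills up and every later notification just logs that the channel is full.

```go
package controllers

import (
	"log"

	"sigs.k8s.io/controller-runtime/pkg/event"

	kamajiv1alpha1 "github.com/clastix/kamaji/api/v1alpha1"
)

// Buffered channel used as a channel source for a secondary controller.
// The capacity is illustrative.
var tcpEvents = make(chan event.GenericEvent, 256)

// notifyTCP sketches the producer side: the datastore controller broadcasts a
// GenericEvent for a TenantControlPlane. The send is non-blocking, so if the
// consumer never drains the channel (e.g. its watches were stopped while the
// TCP was scaled to zero), the buffer eventually fills and every further
// notification only results in a "channel is full" log line.
func notifyTCP(tcp *kamajiv1alpha1.TenantControlPlane) {
	select {
	case tcpEvents <- event.GenericEvent{Object: tcp}:
	default:
		log.Printf("channel is full, dropping event for %s/%s", tcp.GetNamespace(), tcp.GetName())
	}
}

// On the consumer side, tcpEvents would typically be wired into the manager as
// a raw source (e.g. via source.Channel) so a controller can enqueue requests
// from these events; if that controller isn't running, nothing reads tcpEvents.
```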

Are you still facing this error despite the proposed changes?

avorima (Contributor, Author) commented on Apr 7, 2025

> Another issue we see when scaling TCPs down to 0 is that the datastore controllers keep logging "channel is full" well after the control planes are scaled back up again

> I suspect it's related to a misbehaviour in our controllers: we're pushing a generic event to the source channel, which then pushes to controllers that are unable to run.

> Are you still facing this error despite the proposed changes?

Yes, and it looks like it's logged for all TCPs (even the ones that are scaled up) whenever a TCP event is received, e.g. when a TCP is created, updated, or deleted.

avorima marked this pull request as ready for review on April 7, 2025, 09:03
avorima (Contributor, Author) commented on Apr 7, 2025

The main issue was fixed by this PR, so I'm undrafting.

prometherion merged commit dc18f27 into clastix:master on Apr 7, 2025
11 checks passed
prometherion (Member) commented

Thanks, this PR was very good.
