
feat: enhancing concurrent reconciliations #790


Merged: 6 commits merged into clastix:master on Apr 23, 2025

Conversation

prometherion
Member

Strictly related to #787; this started from a benchmark of Kamaji reconciling multiple Tenant Control Planes.

When dealing with multiple Tenant Control Plane creations, my suggestion is to increase the --max-concurrent-tcp-reconciles flag according to the benchmarked scenarios.

e.g.: with the default value of 1 and creating a dozen clusters:

source /tmp/script  # refer to the one in the original issue
tenantcontrolplane.kamaji.clastix.io/stress-0 created
tenantcontrolplane.kamaji.clastix.io/stress-1 created
tenantcontrolplane.kamaji.clastix.io/stress-2 created
tenantcontrolplane.kamaji.clastix.io/stress-3 created
tenantcontrolplane.kamaji.clastix.io/stress-4 created
tenantcontrolplane.kamaji.clastix.io/stress-5 created
tenantcontrolplane.kamaji.clastix.io/stress-6 created
tenantcontrolplane.kamaji.clastix.io/stress-7 created
tenantcontrolplane.kamaji.clastix.io/stress-8 created
tenantcontrolplane.kamaji.clastix.io/stress-9 created
cat /tmp/log|grep "channel is full"|wc -l
739
source /tmp/check  # refer to the one in the original issue
stress-0 succeeded
stress-1 succeeded
stress-2 succeeded
stress-3 succeeded
stress-4 succeeded
stress-5 succeeded
stress-6 succeeded
stress-7 succeeded
stress-8 succeeded
stress-9 succeeded

When increasing said flag (e.g.: to 10):

source /tmp/check  # refer to the one in the original issue
stress-0 succeeded
stress-1 succeeded
stress-2 succeeded
stress-3 succeeded
stress-4 succeeded
stress-5 succeeded
stress-6 succeeded
stress-7 succeeded
stress-8 succeeded
stress-9 succeeded

cat /tmp/log|grep "channel is full"|wc -l
0
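
For reference, the flag presumably maps onto controller-runtime's MaxConcurrentReconciles option. Below is a minimal sketch of how such a flag is typically wired into the controller builder; the function name and import paths are illustrative assumptions, not the exact Kamaji code:

```go
package controllers

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	kamajiv1alpha1 "github.com/clastix/kamaji/api/v1alpha1" // assumed import path
)

// setupTenantControlPlaneController registers the TenantControlPlane reconciler,
// letting up to maxConcurrentTCPReconciles objects be reconciled in parallel.
func setupTenantControlPlaneController(mgr ctrl.Manager, r reconcile.Reconciler, maxConcurrentTCPReconciles int) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&kamajiv1alpha1.TenantControlPlane{}).
		// MaxConcurrentReconciles is the knob the --max-concurrent-tcp-reconciles flag would feed.
		WithOptions(controller.Options{MaxConcurrentReconciles: maxConcurrentTCPReconciles}).
		Complete(r)
}
```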

For the sake of transparency, I tested Kamaji on my laptop and hit some network issues due to CPU spikes: the soot manager controllers were restarted multiple times due to the following error:

2025-04-17T08:51:50Z    ERROR   controller-runtime.source.EventHandler  failed to get informer from cache       {"error": "failed to get server groups: Get \"https://stress-0.default.svc:6443/api?timeout=10s\": dial tcp 10.96.221.160:6443: connect: connection refused"}


netlify bot commented Apr 17, 2025

Deploy Preview for kamaji-documentation ready!

🔨 Latest commit: 5ec7ebc
🔍 Latest deploy log: https://app.netlify.com/sites/kamaji-documentation/deploys/680921c43e89da00082ad4d4
😎 Deploy Preview: https://deploy-preview-790--kamaji-documentation.netlify.app

@prometherion
Member Author

The e2e test is successful; it's just a false positive that happens from time to time with GH Actions.

@avorima
Contributor

avorima commented Apr 17, 2025

I got 5/10 ready with --max-concurrent-tcp-reconciles=1, and with --max-concurrent-tcp-reconciles=10 most TCPs don't even get ready:

NAME             VERSION   STATUS         CONTROL-PLANE ENDPOINT   KUBECONFIG                        DATASTORE   AGE
stress-0-k4mtf   v1.31.5   Provisioning   10.233.20.235:6443       stress-0-k4mtf-admin-kubeconfig   default     22m
stress-1-8j72d   v1.31.5   Provisioning   10.233.43.171:6443       stress-1-8j72d-admin-kubeconfig   default     22m
stress-2-csr85   v1.31.5   Provisioning   10.233.1.107:6443        stress-2-csr85-admin-kubeconfig   default     22m
stress-3-4pp48   v1.31.5   Ready          10.233.42.73:6443        stress-3-4pp48-admin-kubeconfig   default     22m
stress-4-grx7l   v1.31.5   Provisioning   10.233.49.97:6443        stress-4-grx7l-admin-kubeconfig   default     22m
stress-5-kw2xc   v1.31.5   Provisioning   10.233.45.196:6443       stress-5-kw2xc-admin-kubeconfig   default     22m
stress-6-4hdhj   v1.31.5   Provisioning   10.233.29.232:6443       stress-6-4hdhj-admin-kubeconfig   default     22m
stress-7-bbh46   v1.31.5   Provisioning   10.233.52.176:6443       stress-7-bbh46-admin-kubeconfig   default     22m
stress-8-fw684   v1.31.5   Provisioning   10.233.50.67:6443        stress-8-fw684-admin-kubeconfig   default     22m
stress-9-sndhv   v1.31.5   Provisioning   10.233.59.17:6443        stress-9-sndhv-admin-kubeconfig   default     22m

@prometherion
Member Author

Why are those TCPs in Provisioning?

It seems to me this is unrelated to the changes introduced with this PR. Are the Deployment and related Pods matching the expected replicas?

@prometherion
Member Author

@avorima it seems we can skip checking the channel status; I just added a safety net with a timeout for the Generic Event communication.

It would be great if you could give it a shot.

Channels used for GenericEvent feeding for cross-controller triggers
are now buffered according to --max-concurrent-tcp-reconciles: this
is required to avoid "channel is full" errors when dealing with large
management clusters serving a sizeable number of Tenant Control Planes.

Increasing this value will put more pressure on memory (mostly for GC)
and CPU (provisioning multiple certificates at the same time).

Signed-off-by: Dario Tranchitella <dario@tranchitella.eu>
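
A minimal sketch of what this commit describes, i.e. sizing the GenericEvent channel buffer from the concurrency setting; the names are illustrative, not the exact Kamaji code:

```go
package controllers

import (
	"sigs.k8s.io/controller-runtime/pkg/event"
)

// newTriggerChannel returns a GenericEvent channel whose buffer matches the
// configured number of concurrent TCP reconciles, making cross-controller
// triggers far less likely to hit a "channel is full" condition under load.
func newTriggerChannel(maxConcurrentTCPReconciles int) chan event.GenericEvent {
	return make(chan event.GenericEvent, maxConcurrentTCPReconciles)
}
```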
This change introduces a grace period of 10 seconds before abruptly
terminating the Tenant Control Plane deployment, allowing the soot
manager to complete its exit procedure and avoiding false positive
errors caused by the API Server becoming unresponsive upon user deletion.

The aim of this change is to reduce the amount of false positive errors
upon mass deletion of Tenant Control Plane objects.

Signed-off-by: Dario Tranchitella <dario@tranchitella.eu>
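
One plausible way to express such a grace period, assuming it maps to the pod spec's terminationGracePeriodSeconds field of the Tenant Control Plane Deployment (a sketch only, not necessarily how the commit implements it):

```go
package controllers

import (
	appsv1 "k8s.io/api/apps/v1"
)

// withSootGracePeriod gives the soot manager ten seconds to complete its exit
// procedure before the Tenant Control Plane pods are forcefully terminated.
// Assumption: the grace period is applied via the pod spec field.
func withSootGracePeriod(deployment *appsv1.Deployment) {
	grace := int64(10)
	deployment.Spec.Template.Spec.TerminationGracePeriodSeconds = &grace
}
```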
WatchesRawSource is non-blocking, so there is no need to check whether the
channel is full. To prevent deadlocks, a WithTimeout check has been introduced.

Signed-off-by: Dario Tranchitella <dario@tranchitella.eu>
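
A sketch of the timeout safety net mentioned above: the GenericEvent is sent inside a select guarded by context.WithTimeout, so a slow or stuck consumer cannot deadlock the sender. Function and parameter names are illustrative:

```go
package controllers

import (
	"context"
	"fmt"
	"time"

	"sigs.k8s.io/controller-runtime/pkg/event"
)

// forwardWithTimeout pushes a GenericEvent onto the trigger channel, giving up
// after the provided timeout instead of blocking forever.
func forwardWithTimeout(ctx context.Context, ch chan<- event.GenericEvent, evt event.GenericEvent, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	select {
	case ch <- evt:
		return nil
	case <-ctx.Done():
		return fmt.Errorf("timed out forwarding generic event: %w", ctx.Err())
	}
}
```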
@prometherion prometherion merged commit b027e23 into clastix:master Apr 23, 2025
10 of 11 checks passed
@prometherion prometherion deleted the issues/787 branch April 23, 2025 19:00
@jonas27

jonas27 commented Apr 24, 2025

--max-concurrent-tcp-reconciles=1

Did you increase the CPU limits? That caused all kinds of weird issues for me.
