
feat: enhancing concurrent reconciliations #790


Merged: 6 commits merged into clastix:master on Apr 23, 2025

Conversation

prometherion
Member

Strictly related to #787; this started from a benchmark of Kamaji reconciling multiple Tenant Control Planes.

When dealing with multiple Tenant Control Plane creations, my suggestion is to increase the --max-concurrent-tcp-reconciles flag according to the benchmarked scenarios.

e.g.: with the default value of 1 and creating a dozen clusters:

source /tmp/script  # refer to the one in the original issue
tenantcontrolplane.kamaji.clastix.io/stress-0 created
tenantcontrolplane.kamaji.clastix.io/stress-1 created
tenantcontrolplane.kamaji.clastix.io/stress-2 created
tenantcontrolplane.kamaji.clastix.io/stress-3 created
tenantcontrolplane.kamaji.clastix.io/stress-4 created
tenantcontrolplane.kamaji.clastix.io/stress-5 created
tenantcontrolplane.kamaji.clastix.io/stress-6 created
tenantcontrolplane.kamaji.clastix.io/stress-7 created
tenantcontrolplane.kamaji.clastix.io/stress-8 created
tenantcontrolplane.kamaji.clastix.io/stress-9 created
cat /tmp/log|grep "channel is full"|wc -l
739
source /tmp/check  # refer to the one in the original issue
stress-0 succeeded
stress-1 succeeded
stress-2 succeeded
stress-3 succeeded
stress-4 succeeded
stress-5 succeeded
stress-6 succeeded
stress-7 succeeded
stress-8 succeeded
stress-9 succeeded

When increasing said flag (e.g.: to 10):

source /tmp/check  # refer to the one in the original issue
stress-0 succeeded
stress-1 succeeded
stress-2 succeeded
stress-3 succeeded
stress-4 succeeded
stress-5 succeeded
stress-6 succeeded
stress-7 succeeded
stress-8 succeeded
stress-9 succeeded

cat /tmp/log|grep "channel is full"|wc -l
0
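
For reference, the flag presumably maps onto controller-runtime's MaxConcurrentReconciles option. Below is a minimal sketch of how such a flag is typically wired into the controller builder; the function name and import paths are illustrative assumptions, not the exact Kamaji code:

```go
package controllers

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	kamajiv1alpha1 "github.com/clastix/kamaji/api/v1alpha1" // assumed import path
)

// setupTenantControlPlaneController registers the TenantControlPlane reconciler,
// letting up to maxConcurrentTCPReconciles objects be reconciled in parallel.
func setupTenantControlPlaneController(mgr ctrl.Manager, r reconcile.Reconciler, maxConcurrentTCPReconciles int) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&kamajiv1alpha1.TenantControlPlane{}).
		// MaxConcurrentReconciles is the knob the --max-concurrent-tcp-reconciles flag would feed.
		WithOptions(controller.Options{MaxConcurrentReconciles: maxConcurrentTCPReconciles}).
		Complete(r)
}
```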

For the sake of transparency, I tested Kamaji on my laptop and hit some network issues due to CPU spikes: the soot manager controllers were restarted multiple times due to the following error:

2025-04-17T08:51:50Z    ERROR   controller-runtime.source.EventHandler  failed to get informer from cache       {"error": "failed to get server groups: Get \"https://stress-0.default.svc:6443/api?timeout=10s\": dial tcp 10.96.221.160:6443: connect: connection refused"}


netlify bot commented Apr 17, 2025

Deploy Preview for kamaji-documentation ready!

🔨 Latest commit: 5ec7ebc
🔍 Latest deploy log: https://app.netlify.com/sites/kamaji-documentation/deploys/680921c43e89da00082ad4d4
😎 Deploy Preview: https://deploy-preview-790--kamaji-documentation.netlify.app

@prometherion
Member Author

The e2e test is successful; it's just a false positive that happens from time to time with GH Actions.

@avorima
Contributor

avorima commented Apr 17, 2025

I got 5/10 ready with --max-concurrent-tcp-reconciles=1, and with --max-concurrent-tcp-reconciles=10 most TCPs don't even get ready:

NAME             VERSION   STATUS         CONTROL-PLANE ENDPOINT   KUBECONFIG                        DATASTORE   AGE
stress-0-k4mtf   v1.31.5   Provisioning   10.233.20.235:6443       stress-0-k4mtf-admin-kubeconfig   default     22m
stress-1-8j72d   v1.31.5   Provisioning   10.233.43.171:6443       stress-1-8j72d-admin-kubeconfig   default     22m
stress-2-csr85   v1.31.5   Provisioning   10.233.1.107:6443        stress-2-csr85-admin-kubeconfig   default     22m
stress-3-4pp48   v1.31.5   Ready          10.233.42.73:6443        stress-3-4pp48-admin-kubeconfig   default     22m
stress-4-grx7l   v1.31.5   Provisioning   10.233.49.97:6443        stress-4-grx7l-admin-kubeconfig   default     22m
stress-5-kw2xc   v1.31.5   Provisioning   10.233.45.196:6443       stress-5-kw2xc-admin-kubeconfig   default     22m
stress-6-4hdhj   v1.31.5   Provisioning   10.233.29.232:6443       stress-6-4hdhj-admin-kubeconfig   default     22m
stress-7-bbh46   v1.31.5   Provisioning   10.233.52.176:6443       stress-7-bbh46-admin-kubeconfig   default     22m
stress-8-fw684   v1.31.5   Provisioning   10.233.50.67:6443        stress-8-fw684-admin-kubeconfig   default     22m
stress-9-sndhv   v1.31.5   Provisioning   10.233.59.17:6443        stress-9-sndhv-admin-kubeconfig   default     22m

@prometherion
Member Author

Why are those TCPs in Provisioning?

It seems to me this is unrelated to the changes introduced with this PR. Are the Deployment and related Pods matching the expected replicas?

@prometherion
Member Author

@avorima it seems we can skip checking the channel status; I just added a safety net with a timeout for the Generic Event communication.

It would be great if you could give it a shot.

Channels used for GenericEvent feeding for cross-controller triggers
are now buffered according to --max-concurrent-tcp-reconciles: this
is required to avoid "channel is full" errors when dealing with large
management clusters serving a sizeable number of Tenant Control Planes.

Increasing this value will put more pressure on memory (mostly for GC)
and CPU (provisioning multiple certificates at the same time).

Signed-off-by: Dario Tranchitella <dario@tranchitella.eu>
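
A minimal sketch of what this commit describes, i.e. sizing the GenericEvent channel buffer from the concurrency setting; the names are illustrative, not the exact Kamaji code:

```go
package controllers

import (
	"sigs.k8s.io/controller-runtime/pkg/event"
)

// newTriggerChannel returns a GenericEvent channel whose buffer matches the
// configured number of concurrent TCP reconciles, making cross-controller
// triggers far less likely to hit a "channel is full" condition under load.
func newTriggerChannel(maxConcurrentTCPReconciles int) chan event.GenericEvent {
	return make(chan event.GenericEvent, maxConcurrentTCPReconciles)
}
```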
This change introduces a grace period of 10 seconds before abruptly
terminating the Tenant Control Plane deployment, allowing the soot
manager to complete its exit procedure and avoiding false positive
errors caused by the API Server becoming unresponsive upon user deletion.

The aim of this change is to reduce the amount of false positive errors
upon mass deletion of Tenant Control Plane objects.

Signed-off-by: Dario Tranchitella <dario@tranchitella.eu>
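
One plausible way to express such a grace period, assuming it maps to the pod spec's terminationGracePeriodSeconds field of the Tenant Control Plane Deployment (a sketch only, not necessarily how the commit implements it):

```go
package controllers

import (
	appsv1 "k8s.io/api/apps/v1"
)

// withSootGracePeriod gives the soot manager ten seconds to complete its exit
// procedure before the Tenant Control Plane pods are forcefully terminated.
// Assumption: the grace period is applied via the pod spec field.
func withSootGracePeriod(deployment *appsv1.Deployment) {
	grace := int64(10)
	deployment.Spec.Template.Spec.TerminationGracePeriodSeconds = &grace
}
```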
WatchesRawSource is non-blocking, so there is no need to check whether the
channel is full. To prevent deadlocks, a WithTimeout check has been introduced.

Signed-off-by: Dario Tranchitella <dario@tranchitella.eu>
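
A sketch of the timeout safety net mentioned above: the GenericEvent is sent inside a select guarded by context.WithTimeout, so a slow or stuck consumer cannot deadlock the sender. Function and parameter names are illustrative:

```go
package controllers

import (
	"context"
	"fmt"
	"time"

	"sigs.k8s.io/controller-runtime/pkg/event"
)

// forwardWithTimeout pushes a GenericEvent onto the trigger channel, giving up
// after the provided timeout instead of blocking forever.
func forwardWithTimeout(ctx context.Context, ch chan<- event.GenericEvent, evt event.GenericEvent, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	select {
	case ch <- evt:
		return nil
	case <-ctx.Done():
		return fmt.Errorf("timed out forwarding generic event: %w", ctx.Err())
	}
}
```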
@prometherion prometherion merged commit b027e23 into clastix:master Apr 23, 2025
10 of 11 checks passed
@prometherion prometherion deleted the issues/787 branch April 23, 2025 19:00
@jonas27

jonas27 commented Apr 24, 2025

--max-concurrent-tcp-reconciles=1

Did you increase the CPU limits? That caused all kinds of weird issues for me.
