Skip to content

Conversation

giorio94
Copy link
Member

The first two commits modify the sysdump collection logic to actually respect the configured worker count, while the third moves again the collection of Cilium profiling data as first task to prevent it from being impacted by the other sysdump collection tasks. Refer to the individual commit messages for additional details.

@giorio94 giorio94 added release-note/minor This PR changes functionality that users may find relevant to operating Cilium. cilium-cli This PR contains changes related with cilium-cli labels Nov 11, 2024
@github-actions github-actions bot added the cilium-cli-exclusive This PR only impacts cilium-cli binary label Nov 11, 2024
@giorio94
Copy link
Member Author

/scale-100

@giorio94
Copy link
Member Author

Collection of sysdump on a 100 nodes cluster took ~9 minutes, which is not significantly different from previous runs of the same workflow (~7 minutes), and can be explained by the lower default number of workers.

Currently, the number of workers used by the sysdump command can be
configured via the dedicated --worker-count option, which defaults
to the number of CPU cores of the local host. However, this option
is practically always overruled by the number of tasks creating
subtasks (~30 currently), to prevent a deadlock due to Submit being
blocking if no worker is available to pick the task.

Let's slightly rework the logic to always wait for the current task
to have successfully spawned all of its subtasks before moving to
the subsequent one, hence preventing the deadlock situation (as long
as there are at least two workers), and allowing to actually respect
the configured worker count.

Additionally, let's set the default number of worker threads to not
modify the current parallelism too much when running on a host with
a limited number of cores. Additionally, CPU cores don't appear to
be a good estimator in this case, given that practically none of the
tasks are CPU intensive.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Serial tasks are intended to be executed initially and in their own
workerpool, to prevent interference with other tasks executed in
parallel. Currently, the number of workers is fixed at either one or
two, depending on whether the given task creates additional subtasks.
However, this is suboptimal in case of tasks creating a subtask for
each node in the cluster (e.g., to capture profiling/tracing data),
as the collection on separate nodes may be performed in parallel to
speed-up the overall execution time. Hence, let's respect the worker
count parameter in this case as well, given that the overall logic
is already ensuring that we'll wait for their completion before
moving to the next task.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
This reverts commit fbcd8ee.

Now that the serial task execution logic has been modified to respect
the configured worker count, hence speeding up the execution time,
let's move again the collection of profiling data to the set of serial
tasks, to prevent it from being impacted by the other sysdump collection
tasks.

Reporting the message of the original commit below, for convenience.

Move the collection of profiling data to the list of serial tasks that
are executed at the beginning, to prevent it from being impacted by the
other sysdump collection tasks. Indeed, many bugtool operations invoke
the Cilium API (potentially performing expensive operations), hence
impacting the pprof results. Differently, the collection of the cilium
operator and clustermesh-apiserver profiling data is not moved, as not
expected to be impacted by the parallel tasks.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
@giorio94 giorio94 force-pushed the mio/sysdump-profiling-sequential branch from b76bf19 to 1b60deb Compare November 12, 2024 11:30
@giorio94
Copy link
Member Author

Rebased onto main to pick CI fixes.

@giorio94
Copy link
Member Author

/test

@giorio94 giorio94 marked this pull request as ready for review November 12, 2024 12:48
@giorio94 giorio94 requested a review from a team as a code owner November 12, 2024 12:48
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Nov 15, 2024
@aanm aanm added this pull request to the merge queue Nov 18, 2024
Merged via the queue into cilium:main with commit 109fd93 Nov 18, 2024
67 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cilium-cli This PR contains changes related with cilium-cli cilium-cli-exclusive This PR only impacts cilium-cli binary ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants