-
Notifications
You must be signed in to change notification settings - Fork 3.4k
sysdump: respect worker count and collect Cilium profiling data as first task #35897
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
/scale-100 |
Collection of sysdump on a 100 nodes cluster took ~9 minutes, which is not significantly different from previous runs of the same workflow (~7 minutes), and can be explained by the lower default number of workers. |
Currently, the number of workers used by the sysdump command can be configured via the dedicated --worker-count option, which defaults to the number of CPU cores of the local host. However, this option is practically always overruled by the number of tasks creating subtasks (~30 currently), to prevent a deadlock due to Submit being blocking if no worker is available to pick the task. Let's slightly rework the logic to always wait for the current task to have successfully spawned all of its subtasks before moving to the subsequent one, hence preventing the deadlock situation (as long as there are at least two workers), and allowing to actually respect the configured worker count. Additionally, let's set the default number of worker threads to not modify the current parallelism too much when running on a host with a limited number of cores. Additionally, CPU cores don't appear to be a good estimator in this case, given that practically none of the tasks are CPU intensive. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Serial tasks are intended to be executed initially and in their own workerpool, to prevent interference with other tasks executed in parallel. Currently, the number of workers is fixed at either one or two, depending on whether the given task creates additional subtasks. However, this is suboptimal in case of tasks creating a subtask for each node in the cluster (e.g., to capture profiling/tracing data), as the collection on separate nodes may be performed in parallel to speed-up the overall execution time. Hence, let's respect the worker count parameter in this case as well, given that the overall logic is already ensuring that we'll wait for their completion before moving to the next task. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
This reverts commit fbcd8ee. Now that the serial task execution logic has been modified to respect the configured worker count, hence speeding up the execution time, let's move again the collection of profiling data to the set of serial tasks, to prevent it from being impacted by the other sysdump collection tasks. Reporting the message of the original commit below, for convenience. Move the collection of profiling data to the list of serial tasks that are executed at the beginning, to prevent it from being impacted by the other sysdump collection tasks. Indeed, many bugtool operations invoke the Cilium API (potentially performing expensive operations), hence impacting the pprof results. Differently, the collection of the cilium operator and clustermesh-apiserver profiling data is not moved, as not expected to be impacted by the parallel tasks. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
b76bf19
to
1b60deb
Compare
Rebased onto main to pick CI fixes. |
/test |
nathanjsweet
approved these changes
Nov 15, 2024
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
cilium-cli
This PR contains changes related with cilium-cli
cilium-cli-exclusive
This PR only impacts cilium-cli binary
ready-to-merge
This PR has passed all tests and received consensus from code owners to merge.
release-note/minor
This PR changes functionality that users may find relevant to operating Cilium.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
The first two commits modify the sysdump collection logic to actually respect the configured worker count, while the third moves again the collection of Cilium profiling data as first task to prevent it from being impacted by the other sysdump collection tasks. Refer to the individual commit messages for additional details.