Add 100 node scale test workflow #29214
e9abeaa to de4ddb9
de4ddb9 to 2fa3394
Thanks! Looks good, just one comment :)
Added the
2fa3394 to 614ddec
Thanks for the reviews everybody! I made some changes and addressed some feedback. Please feel free to let me know if I missed something or if something else needs to be changed.
0e2801f to ee4035f
Hey @marseel, just a heads up: the last node throughput test failed because p99 pod startup latency was 76ms over the 1m threshold.
This commit modularizes steps in the scale test workflow by turning them into their own actions. These actions can be used by new scale test workflows in the future to reduce code duplication. Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>
This commit adds a 100 node scale test workflow, which creates a 100 node cluster and runs a full CL2 test suite. Three changes were needed to the current scale test actions:
1. Adjust the create-cluster action to use a larger network. Kops provisions the 100 node cluster in a /20 by default, but this doesn't leave enough address space.
2. Add a new action to create an additional instance group within the cluster. This action is used to deploy a larger node inside the 100 node cluster, which CL2 uses to host Prometheus.
3. Bump the timeout used for the validate-cluster action from 20m to 45m.
Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>
Co-authored-by: Ryan Drew <ryan.drew@isovalent.com>
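For a sense of scale behind the first change, the sketch below computes how many IPv4 addresses fit in a given prefix length. The /20 comes from the commit message; the /16 is purely illustrative, since the commit message does not say which size the create-cluster action switched to, and `cidr_size` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Number of IPv4 addresses in a prefix of the given length: 2^(32 - length).
# cidr_size is a hypothetical helper, not part of kops or the workflow.
cidr_size() { echo $(( 1 << (32 - $1) )); }

size_20=$(cidr_size 20)   # the kops default mentioned in the commit message
size_16=$(cidr_size 16)   # illustrative larger network, not taken from the source

echo "/20: ${size_20} addresses"
echo "/16: ${size_16} addresses"
```

How much of that space a cluster actually consumes depends on how node and pod ranges are carved out of the network, which is not described here.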
This commit adds a dedicated variable for the target cluster's name, meaning steps in the workflow do not have to construct the name from other variables. Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>
Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>
When creating a cluster using kops for scale tests run with CL2, a manual modification is made to a firewall rule that kops deploys to manage traffic sent from workers to control plane nodes. This manual modification is reset every time a call to `kops update` is made. This commit creates a separate action for this step so it can be called at the appropriate time in scale test workflows. See: kubernetes/perf-tests#2319 Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>
This commit updates the commit SHA for ClusterLoader2 in the 100 node scale test workflow, which pulls in recent changes to allow the 100 node scale test workflow to successfully pass. Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>
This commit changes the cron schedule of the 100 node scale test to run every business day. After two weeks, this commit will be reverted. The goal is to collect lots of data for the next two weeks and ensure the workflow is working properly. Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>
Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>
This commit adds an explicit timeout to the step in the node throughput scale test workflow that runs CL2. Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>
All scale-test related actions, meaning everything under .github/actions/scale-tests, have been moved to the cilium/scale-tests-action repository. The fork of kubernetes/perf-tests has been moved from learnitall/perf-tests to cilium/perf-tests. Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>
This bucket is exposed to the internet and meant to be accessed by the community, therefore the bucket name shouldn't be a secret. Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>
This commit modifies the CL2 step in the node throughput scale test to specify 'bash' as the shell, which enables the pipefail option. This is important because, without the pipefail option, this step would always succeed. Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>
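The effect of pipefail can be reproduced outside the workflow. The sketch below assumes the step pipes the test command into another command such as `tee`; the commit message does not show the actual pipeline, and `failing_test` is a hypothetical stand-in for a failing CL2 run.

```shell
#!/usr/bin/env bash
# failing_test is a hypothetical stand-in for a test run that fails.
failing_test() { echo "test output"; return 1; }

# Without pipefail, the pipeline's exit status is that of the last command
# (tee), so the failure of failing_test is masked.
set +o pipefail
failing_test | tee /dev/null >/dev/null
status_without=$?

# With pipefail, the pipeline's exit status is the last non-zero status of
# any command in the pipeline, so the failure propagates.
set -o pipefail
failing_test | tee /dev/null >/dev/null
status_with=$?

echo "without pipefail: ${status_without}"
echo "with pipefail: ${status_with}"
```

In a CI step, the first case is reported as a success even though the test command failed, which matches the behavior the commit message describes.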
ee4035f to 2af035b
Hey @aanm, I made the changes you requested and the workflows are running successfully. The Node Throughput Test is failing, but that's due to PodStartupLatency rising above our configured threshold. Investigating the increase is out of scope for this PR, and we can do that as a follow-up.
2af035b to 47e1ac6
Marking as ready to merge because:
test_name was not set, causing both tests to export results to the same gs bucket directory. Fixes cilium#29214 Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>
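A minimal sketch of how an unset test_name can make two tests collide, assuming the export directory is built by joining the bucket, the test name, and a run identifier. The path layout, the `export_dir` helper, the bucket name, and the test names are all hypothetical; the real workflow may construct the path differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper: joins bucket, test name, and run id into a gs:// path.
export_dir() {
  local bucket="$1" test_name="$2" run_id="$3"
  echo "gs://${bucket}/${test_name}/${run_id}"
}

# With test_name unset (empty), two different tests that share a run id
# resolve to the same directory.
dir_a=$(export_dir "example-bucket" "" "1234")
dir_b=$(export_dir "example-bucket" "" "1234")

# With test_name set, each test gets its own directory.
dir_c=$(export_dir "example-bucket" "100-node-scale" "1234")
dir_d=$(export_dir "example-bucket" "node-throughput" "1234")

echo "unset: ${dir_a} and ${dir_b}"
echo "set:   ${dir_c} and ${dir_d}"
```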
This PR culminates in adding a workflow for running a ClusterLoader2 load test in a 100-node cluster. It builds on the initial workflow added in #28362.
This PR also contains work that modularizes common steps between scale test workflows into separate actions.
Co-authored-by: @marseel