Skip to content

Conversation

learnitall
Copy link
Contributor

Please ensure your pull request adheres to the following guidelines:

  • For first time contributors, read Submitting a pull request
  • All code is covered by unit and/or runtime tests where feasible.
  • All commits contain a well written commit description including a title,
    description and a Fixes: #XXX line if the commit addresses a particular
    GitHub issue.
  • If your commit description contains a Fixes: <commit-id> tag, then
    please add the commit author[s] as reviewer[s] to this issue.
  • All commits are signed off. See the section Developer’s Certificate of Origin
  • Provide a title or release-note blurb suitable for the release notes.
  • Are you a user of Cilium? Please add yourself to the Users doc
  • Thanks for contributing!

This PR culminates in adding a workflow for running a ClusterLoader2 load test in a 100-node cluster. It builds on the initial workflow added in #28362.

This PR also contains work that modularizes common steps between scale test workflows into separate actions.

Co-authored-by: @marseel

Add 100 node scale test workflow

@learnitall learnitall requested review from a team as code owners November 15, 2023 23:45
@learnitall learnitall requested a review from brlbil November 15, 2023 23:45
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Nov 15, 2023
@learnitall learnitall added area/CI Continuous Integration testing issue or flake kind/performance There is a performance impact of this. release-note/ci This PR makes changes to the CI. labels Nov 15, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Nov 15, 2023
@learnitall learnitall force-pushed the pr/learnitall/scale-test-100-nodes branch 2 times, most recently from e9abeaa to de4ddb9 Compare November 15, 2023 23:57
@marseel marseel self-requested a review November 16, 2023 10:35
@learnitall learnitall force-pushed the pr/learnitall/scale-test-100-nodes branch from de4ddb9 to 2fa3394 Compare November 16, 2023 21:49
Copy link
Contributor

@marseel marseel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Looks good, just one comment :)

@aanm aanm added the dont-merge/preview-only Only for preview or testing, don't merge it. label Nov 20, 2023
@aanm
Copy link
Member

aanm commented Nov 20, 2023

Added the dont-merge/preview-only label because of the temporary commit.

@learnitall learnitall force-pushed the pr/learnitall/scale-test-100-nodes branch from 2fa3394 to 614ddec Compare November 21, 2023 23:56
@learnitall
Copy link
Contributor Author

Thanks for the reviews everybody! I made some changes and addressed some feedback. Please feel free to let me know if I missed something of if something else needs to be changed.

@learnitall learnitall force-pushed the pr/learnitall/scale-test-100-nodes branch 3 times, most recently from 0e2801f to ee4035f Compare November 29, 2023 19:14
@learnitall
Copy link
Contributor Author

Hey @marseel, just a heads up the last node throughput test failed due to p99 pod startup latency being 76ms over the 1m threshold.

marseel and others added 12 commits November 29, 2023 14:00
This commit modularizes steps in the scale test workflow by
turning them into their own actions. These actions can be used
by new scale test workflows in the future to reduce code
duplication.

Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>
This commit adds a 100 node scale test workflow, which creates a 100
node cluster and runs a full CL2 test suite. Three changes were needed
to the current scale test actions:

1. Adjust the create-cluster action to use a larger network. Kops
   will provision the 100 node cluster in a /20 by default, however
   this doesn't leave enough address space.
2. Add a new action to create an additional instance group within
   the cluster. This action is used to deploy a larger node inside
   the 100 node cluster which CL2 will use to host Prometheus.
3. Bump the timeout used for the validate-cluster action from 20m to 45m

Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>
Co-authored-by: Ryan Drew <ryan.drew@isovalent.com>
This commit adds a dedicated variable for the target cluster's name,
meaning steps in the workflow do not have to construct the name from
other variables.

Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>
Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>
When creating a cluster using kops for scale tests run with CL2, a
manual modification is made to a firewall rule that kops deploys to
manage traffic sent from workers to control plane nodes. This manual
modification is reset every time a call to `kops update` is made. This
commit creates a separate action for this step so it can be called at
the appropriate time in scale test workflows.

See: kubernetes/perf-tests#2319

Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>
This commit updates the commit SHA for ClusterLoader2 in the 100 node
scale test workflow, which pulls in recent changes to allow the 100 node
scale test workflow to successfully pass.

Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>
This commit changes the cron schedule of the 100 node scale test to run
every business day. After two weeks, this commit will be reverted. The
goal is to collect lots of data for the next two weeks and ensure the
workflow is working properly.

Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>
Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>
This commit adds an explicit timeout to the step in the node throughput
scale test workflow that runs CL2.

Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>
All scale-test related actions, meaning, everything under
.github/actions/scale-tests, has been moved to the
cilium/scale-tests-action repository. The fork of kubernetes/perf-tests
has been moved from learnitall/perf-tests to cilium/perf-tests.

Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>
This bucket is exposed to the internet and meant to be accessed by the
community, therefore the bucket name shouldn't be a secret.

Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>
This commit modifies the CL2 step in the node throughput scale test to
specify 'bash' as the shell, which toggles the pipefail option. This is
important, because without the pipefail option, this step would always
succeed.

Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>
@learnitall learnitall force-pushed the pr/learnitall/scale-test-100-nodes branch from ee4035f to 2af035b Compare November 29, 2023 21:00
@learnitall learnitall requested a review from aanm November 29, 2023 21:51
@learnitall
Copy link
Contributor Author

Hey @aanm I made the changes you requested and the workflows are running successfully. The Node Throughput Test is failing, but that's due to an increase in PodStartupLatency above our configured threshold. An investigation into the increase is out of the scope of this PR and we can do that as a follow-up.

@aanm aanm removed the dont-merge/preview-only Only for preview or testing, don't merge it. label Nov 30, 2023
@learnitall learnitall added the dont-merge/preview-only Only for preview or testing, don't merge it. label Nov 30, 2023
@learnitall learnitall force-pushed the pr/learnitall/scale-test-100-nodes branch from 2af035b to 47e1ac6 Compare November 30, 2023 17:38
@learnitall learnitall removed the dont-merge/preview-only Only for preview or testing, don't merge it. label Nov 30, 2023
@learnitall
Copy link
Contributor Author

Marking as ready to merge because:

  • Received approvals from reviewers
  • Does not need full test suite, since this PR only involves changes to GH workflows
  • Workflows are running successfully (see previous comments about context for the node throughput workflow)
  • Temporary commit used for testing workflow changes has been removed

@learnitall learnitall added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Nov 30, 2023
@aanm aanm merged commit 92726f2 into main Dec 1, 2023
@aanm aanm deleted the pr/learnitall/scale-test-100-nodes branch December 1, 2023 09:53
marseel added a commit to marseel/cilium that referenced this pull request Dec 4, 2023
test_name was not set causing both tests to export results to the same
gs bucket directory.

Fixes cilium#29214

Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>
aanm pushed a commit that referenced this pull request Dec 4, 2023
test_name was not set causing both tests to export results to the same
gs bucket directory.

Fixes #29214

Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>
pjablonski123 pushed a commit to pjablonski123/cilium that referenced this pull request Dec 15, 2023
test_name was not set causing both tests to export results to the same
gs bucket directory.

Fixes cilium#29214

Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/CI Continuous Integration testing issue or flake kind/performance There is a performance impact of this. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/ci This PR makes changes to the CI.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants