Add 100 node scale test workflow #29214

learnitall · 2023-11-15T23:45:42Z

Please ensure your pull request adheres to the following guidelines:

For first time contributors, read Submitting a pull request
All code is covered by unit and/or runtime tests where feasible.
All commits contain a well written commit description including a title,
description and a Fixes: #XXX line if the commit addresses a particular
GitHub issue.
If your commit description contains a Fixes: <commit-id> tag, then
please add the commit author[s] as reviewer[s] to this issue.
All commits are signed off. See the section Developer’s Certificate of Origin
Provide a title or release-note blurb suitable for the release notes.
Are you a user of Cilium? Please add yourself to the Users doc
Thanks for contributing!

This PR culminates in adding a workflow for running a ClusterLoader2 load test in a 100-node cluster. It builds on the initial workflow added in #28362.

This PR also contains work that modularizes common steps between scale test workflows into separate actions.

Co-authored-by: @marseel

Add 100 node scale test workflow

.github/workflows/scale-test-100-gce.yaml

.github/actions/scale-tests/cleanup-cluster/action.yaml

.github/actions/scale-tests/create-cluster/action.yaml

.github/actions/scale-tests/export-results/action.yaml

.github/actions/scale-tests/setup-firewall/action.yaml

.github/actions/scale-tests/validate-cluster/action.yaml

.github/workflows/scale-test-node-throughput-gce.yaml

.github/workflows/scale-test-100-gce.yaml

.github/actions/scale-tests/create-ig/action.yaml

marseel

Thanks! Looks good, just one comment :)

.github/actions/scale-tests/cleanup-cluster/action.yaml

.github/workflows/scale-test-100-gce.yaml

aanm · 2023-11-20T22:18:24Z

Added the dont-merge/preview-only label because of the temporary commit.

learnitall · 2023-11-21T23:58:04Z

Thanks for the reviews everybody! I made some changes and addressed some feedback. Please feel free to let me know if I missed something of if something else needs to be changed.

.github/actions/scale-tests/cleanup-cluster/action.yaml

.github/workflows/scale-test-100-gce.yaml

learnitall · 2023-11-29T20:59:50Z

Hey @marseel, just a heads up the last node throughput test failed due to p99 pod startup latency being 76ms over the 1m threshold.

This commit modularizes steps in the scale test workflow by turning them into their own actions. These actions can be used by new scale test workflows in the future to reduce code duplication. Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>

This commit adds a 100 node scale test workflow, which creates a 100 node cluster and runs a full CL2 test suite. Three changes were needed to the current scale test actions: 1. Adjust the create-cluster action to use a larger network. Kops will provision the 100 node cluster in a /20 by default, however this doesn't leave enough address space. 2. Add a new action to create an additional instance group within the cluster. This action is used to deploy a larger node inside the 100 node cluster which CL2 will use to host Prometheus. 3. Bump the timeout used for the validate-cluster action from 20m to 45m Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com> Co-authored-by: Ryan Drew <ryan.drew@isovalent.com>

This commit adds a dedicated variable for the target cluster's name, meaning steps in the workflow do not have to construct the name from other variables. Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>

Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>

When creating a cluster using kops for scale tests run with CL2, a manual modification is made to a firewall rule that kops deploys to manage traffic sent from workers to control plane nodes. This manual modification is reset every time a call to `kops update` is made. This commit creates a separate action for this step so it can be called at the appropriate time in scale test workflows. See: kubernetes/perf-tests#2319 Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>

This commit updates the commit SHA for ClusterLoader2 in the 100 node scale test workflow, which pulls in recent changes to allow the 100 node scale test workflow to successfully pass. Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>

This commit changes the cron schedule of the 100 node scale test to run every business day. After two weeks, this commit will be reverted. The goal is to collect lots of data for the next two weeks and ensure the workflow is working properly. Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>

Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>

This commit adds an explicit timeout to the step in the node throughput scale test workflow that runs CL2. Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>

All scale-test related actions, meaning, everything under .github/actions/scale-tests, has been moved to the cilium/scale-tests-action repository. The fork of kubernetes/perf-tests has been moved from learnitall/perf-tests to cilium/perf-tests. Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>

This bucket is exposed to the internet and meant to be accessed by the community, therefore the bucket name shouldn't be a secret. Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>

This commit modifies the CL2 step in the node throughput scale test to specify 'bash' as the shell, which toggles the pipefail option. This is important, because without the pipefail option, this step would always succeed. Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>

learnitall · 2023-11-29T21:53:59Z

Hey @aanm I made the changes you requested and the workflows are running successfully. The Node Throughput Test is failing, but that's due to an increase in PodStartupLatency above our configured threshold. An investigation into the increase is out of the scope of this PR and we can do that as a follow-up.

learnitall · 2023-11-30T17:41:13Z

Marking as ready to merge because:

Received approvals from reviewers
Does not need full test suite, since this PR only involves changes to GH workflows
Workflows are running successfully (see previous comments about context for the node throughput workflow)
Temporary commit used for testing workflow changes has been removed

test_name was not set causing both tests to export results to the same gs bucket directory. Fixes cilium#29214 Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>

test_name was not set causing both tests to export results to the same gs bucket directory. Fixes #29214 Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>

test_name was not set causing both tests to export results to the same gs bucket directory. Fixes cilium#29214 Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>

learnitall requested review from a team as code owners November 15, 2023 23:45

learnitall requested a review from brlbil November 15, 2023 23:45

maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Nov 15, 2023

learnitall added area/CI Continuous Integration testing issue or flake kind/performance There is a performance impact of this. release-note/ci This PR makes changes to the CI. labels Nov 15, 2023

maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Nov 15, 2023

learnitall force-pushed the pr/learnitall/scale-test-100-nodes branch 2 times, most recently from e9abeaa to de4ddb9 Compare November 15, 2023 23:57

marseel self-requested a review November 16, 2023 10:35

marseel reviewed Nov 16, 2023

View reviewed changes

.github/workflows/scale-test-100-gce.yaml Outdated Show resolved Hide resolved

learnitall force-pushed the pr/learnitall/scale-test-100-nodes branch from de4ddb9 to 2fa3394 Compare November 16, 2023 21:49

brlbil reviewed Nov 20, 2023

View reviewed changes

marseel reviewed Nov 20, 2023

View reviewed changes

.github/workflows/scale-test-100-gce.yaml Show resolved Hide resolved

marseel reviewed Nov 20, 2023

View reviewed changes

.github/actions/scale-tests/create-ig/action.yaml Outdated Show resolved Hide resolved

marseel approved these changes Nov 20, 2023

View reviewed changes

aanm requested changes Nov 20, 2023

View reviewed changes

aanm added the dont-merge/preview-only Only for preview or testing, don't merge it. label Nov 20, 2023

learnitall force-pushed the pr/learnitall/scale-test-100-nodes branch from 2fa3394 to 614ddec Compare November 21, 2023 23:56

learnitall requested review from brlbil, marseel and aanm November 21, 2023 23:57

aanm requested changes Nov 22, 2023

View reviewed changes

.github/actions/scale-tests/cleanup-cluster/action.yaml Outdated Show resolved Hide resolved

.github/workflows/scale-test-100-gce.yaml Outdated Show resolved Hide resolved

.github/workflows/scale-test-100-gce.yaml Outdated Show resolved Hide resolved

learnitall mentioned this pull request Nov 27, 2023

Adjustments for 100 node scale test cilium/perf-tests#1

Merged

learnitall force-pushed the pr/learnitall/scale-test-100-nodes branch 3 times, most recently from 0e2801f to ee4035f Compare November 29, 2023 19:14

marseel and others added 12 commits November 29, 2023 14:00

ci: Modularize scale-test workflow

bf58947

This commit modularizes steps in the scale test workflow by turning them into their own actions. These actions can be used by new scale test workflows in the future to reduce code duplication. Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>

ci: Add CLUSTER_NAME variable in vars step in scale test workflows

4b65e58

This commit adds a dedicated variable for the target cluster's name, meaning steps in the workflow do not have to construct the name from other variables. Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>

ci: Change filename of scale-test-gce workflow to be more specific

4ac7d2b

Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>

ci: Update CL2 in 100 node scale test workflow

db4bae0

This commit updates the commit SHA for ClusterLoader2 in the 100 node scale test workflow, which pulls in recent changes to allow the 100 node scale test workflow to successfully pass. Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>

ci: Fix capitalization of node throughput workflow name

cac7c29

Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>

ci: Add timeout to CL2 step in node throughput scale test workflow

561747c

This commit adds an explicit timeout to the step in the node throughput scale test workflow that runs CL2. Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>

ci: Use env var instead of secret for scale tests results bucket

0130443

This bucket is exposed to the internet and meant to be accessed by the community, therefore the bucket name shouldn't be a secret. Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>

learnitall force-pushed the pr/learnitall/scale-test-100-nodes branch from ee4035f to 2af035b Compare November 29, 2023 21:00

learnitall requested a review from aanm November 29, 2023 21:51

aanm removed the dont-merge/preview-only Only for preview or testing, don't merge it. label Nov 30, 2023

aanm approved these changes Nov 30, 2023

View reviewed changes

brlbil approved these changes Nov 30, 2023

View reviewed changes

learnitall added the dont-merge/preview-only Only for preview or testing, don't merge it. label Nov 30, 2023

learnitall force-pushed the pr/learnitall/scale-test-100-nodes branch from 2af035b to 47e1ac6 Compare November 30, 2023 17:38

learnitall removed the dont-merge/preview-only Only for preview or testing, don't merge it. label Nov 30, 2023

learnitall added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Nov 30, 2023

aanm merged commit 92726f2 into main Dec 1, 2023

aanm deleted the pr/learnitall/scale-test-100-nodes branch December 1, 2023 09:53

marseel mentioned this pull request Dec 4, 2023

Fix exporting results to gs bucket. #29587

Merged

aanm pushed a commit that referenced this pull request Dec 4, 2023

Fix exporting results to gs bucket.

ea09e90

test_name was not set causing both tests to export results to the same gs bucket directory. Fixes #29214 Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>

Add 100 node scale test workflow #29214

Add 100 node scale test workflow #29214

Uh oh!

Conversation

learnitall commented Nov 15, 2023

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

marseel left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

aanm commented Nov 20, 2023

Uh oh!

learnitall commented Nov 21, 2023

Uh oh!

Uh oh!

Uh oh!

Uh oh!

learnitall commented Nov 29, 2023

Uh oh!

learnitall commented Nov 29, 2023

Uh oh!

learnitall commented Nov 30, 2023

Uh oh!

Uh oh!