Description
CI failure
Due to Black Friday / Christmas time, it seems like the AWS cloud is running close to capacity and spot instances are often terminated.
To reduce flakiness, I propose we switch from spot instances to on-demand instances as a temporary measure, but only for scheduled workflows. This will give us a more reliable signal for releases.
Example runs:
https://github.com/cilium/cilium/actions/runs/6966367287/job/18956418664
During test execution: https://github.com/cilium/cilium/actions/runs/6961628303/job/18943657274
One node reports lastHeartbeatTime: "2023-11-22T19:18:24Z" vs another node with lastHeartbeatTime: "2023-11-22T19:21:07Z".
Probably this run as well: https://github.com/cilium/cilium/actions/runs/6964276880/job/18951275067
The post-test information gathering at the end shows the pod client-996668d96-nwk6g running, whereas client-996668d96-gcwdm is the one used in the tests (there is no sysdump to verify this, though). Similarly, cilium-7f4bn no longer exists at the end of the test.
Estimated cost increase
setup-eks-cluster is used in the two workflows above. We use two spot instances: t3.medium and t4g.medium.
Both EKS and AWS CNI are run every 6h with the following regions:
"region": "eu-central-1"
"region": "ap-northeast-1"
"region": "us-east-2"
"region": "ca-central-1"
"region": "eu-north-1"
Spot t3.medium instances (USD per hour):
"region": "eu-central-1" - $0.0153
"region": "ap-northeast-1" - $0.0176
"region": "us-east-2" - $0.0126
"region": "ca-central-1" - $0.0149
"region": "eu-north-1" - $0.017
vs on-demand:
"region": "eu-central-1" - $0.048
"region": "ap-northeast-1" - $0.0544
"region": "us-east-2" - $0.0416
"region": "ca-central-1" - $0.0464
"region": "eu-north-1" - $0.0432
So roughly a 3x increase.
Spot t4g.medium instances (USD per hour):
"region": "eu-central-1" - $0.0113
"region": "ap-northeast-1" - $0.0121
"region": "us-east-2" - $0.0055
"region": "ca-central-1" - $0.0064
"region": "eu-north-1" - $0.0105
vs on-demand:
"region": "eu-central-1" - $0.0384
"region": "ap-northeast-1" - $0.0432
"region": "us-east-2" - $0.0336
"region": "ca-central-1" - $0.0368
"region": "eu-north-1" - $0.0344
So roughly a 4x increase.
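As a sanity check on the "x3" and "x4" figures, here is a minimal Python sketch that recomputes the on-demand to spot ratio from the prices listed above. The names (`SPOT`, `ON_DEMAND`, `price_ratio`, `per_region_ratios`) are hypothetical and only for illustration. The overall ratio is taken over the summed prices, which assumes every region runs on the same schedule.

```python
# Hourly prices in USD, copied from the lists above, in this region order.
REGIONS = ["eu-central-1", "ap-northeast-1", "us-east-2", "ca-central-1", "eu-north-1"]

SPOT = {
    "t3.medium": [0.0153, 0.0176, 0.0126, 0.0149, 0.017],
    "t4g.medium": [0.0113, 0.0121, 0.0055, 0.0064, 0.0105],
}
ON_DEMAND = {
    "t3.medium": [0.048, 0.0544, 0.0416, 0.0464, 0.0432],
    "t4g.medium": [0.0384, 0.0432, 0.0336, 0.0368, 0.0344],
}


def price_ratio(instance_type: str) -> float:
    """On-demand price divided by spot price, summed over all regions."""
    return sum(ON_DEMAND[instance_type]) / sum(SPOT[instance_type])


def per_region_ratios(instance_type: str) -> dict[str, float]:
    """On-demand / spot ratio for each region separately."""
    return {
        region: on_demand / spot
        for region, on_demand, spot in zip(
            REGIONS, ON_DEMAND[instance_type], SPOT[instance_type]
        )
    }


for instance_type in SPOT:
    print(f"{instance_type}: x{price_ratio(instance_type):.2f} overall")
    for region, ratio in per_region_ratios(instance_type).items():
        print(f"  {region}: x{ratio:.2f}")
# Overall ratios: t3.medium x3.02, t4g.medium x4.07
```

The per-region breakdown shows that the multiplier is not uniform: regions with very cheap spot prices (for example us-east-2 for t4g.medium) see a larger relative increase than the overall figure suggests.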
Current total flakiness for these tests (scheduled workflows, main branch only):
"Conformance AWS-CNI (ci-awscni)" - 20 failed out of 50 = 40%
"Conformance EKS (ci-eks)" - 17 failed out of 50 = 34%
Proposal
Let's switch to on-demand for two weeks as an experiment, gather an additional ~50 runs, and check flakiness again.
After that, we can decide to:
- If flakiness doesn't improve significantly, revert this change.
- If flakiness improves significantly, stick with on-demand instances and either lower the schedule rate or accept the higher cost.
Estimated total additional cost for the experiment, assuming we run for 2 weeks, 4x a day, 2 workflows, 5 test cases each, with a duration of 1h per test case (this is actually an overestimate):
14 days * 4 runs/day * 2 workflows * 5 test cases/workflow * 1 hour-duration/test case = 560 hours
price increase for instance t3.medium: (0.048 + 0.0544 + 0.0416 + 0.0464 + 0.0432) - (0.0153 + 0.0176 + 0.0126 + 0.0149 + 0.017) = 0.1562
price increase for instance t4g.medium: (0.0384 + 0.0432 + 0.0336 + 0.0368 + 0.0344) - (0.0113 + 0.0121 + 0.0055 + 0.0064 + 0.0105) = 0.1406
Total additional cost: 560 * (0.1562 + 0.1406) = 166.208, i.e. about 166.21 USD
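The estimate above can be reproduced with a short Python sketch. It mirrors the arithmetic in this issue exactly, including multiplying the region-summed hourly increase by the full 560 hours, which makes the result a conservative upper bound if each test case runs in a single region. All names (`total_hours`, `price_increase`, `extra_cost`) are hypothetical.

```python
# Schedule assumptions from the estimate above.
DAYS = 14
RUNS_PER_DAY = 4
WORKFLOWS = 2
TEST_CASES_PER_WORKFLOW = 5
HOURS_PER_TEST_CASE = 1

# Hourly prices in USD per region (eu-central-1, ap-northeast-1,
# us-east-2, ca-central-1, eu-north-1), copied from this issue.
T3_ON_DEMAND = [0.048, 0.0544, 0.0416, 0.0464, 0.0432]
T3_SPOT = [0.0153, 0.0176, 0.0126, 0.0149, 0.017]
T4G_ON_DEMAND = [0.0384, 0.0432, 0.0336, 0.0368, 0.0344]
T4G_SPOT = [0.0113, 0.0121, 0.0055, 0.0064, 0.0105]


def total_hours() -> int:
    """Total instance hours over the whole experiment."""
    return (
        DAYS
        * RUNS_PER_DAY
        * WORKFLOWS
        * TEST_CASES_PER_WORKFLOW
        * HOURS_PER_TEST_CASE
    )


def price_increase(on_demand: list[float], spot: list[float]) -> float:
    """Hourly price increase, summed over all regions."""
    return sum(on_demand) - sum(spot)


def extra_cost() -> float:
    """Additional cost in USD of on-demand over spot for the experiment."""
    hourly_increase = price_increase(T3_ON_DEMAND, T3_SPOT) + price_increase(
        T4G_ON_DEMAND, T4G_SPOT
    )
    return total_hours() * hourly_increase


print(f"hours: {total_hours()}")            # hours: 560
print(f"extra cost: {extra_cost():.2f} USD")  # extra cost: 166.21 USD
```

Changing `RUNS_PER_DAY` or `DAYS` shows how the cost would scale if the schedule rate is lowered after the experiment.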