CI: Conformance EKS (ci-eks) / Conformance AWS-CNI (ci-awscni) flaky due to spot instance usage #29365

@marseel

CI failure

With Black Friday and Christmas approaching, AWS appears to be running close to capacity, and spot instances are often terminated.
To reduce flakiness, I propose we switch from spot instances to on-demand instances as a temporary measure, but only for scheduled workflows. This will give us a more reliable signal for releases.

Example runs:
https://github.com/cilium/cilium/actions/runs/6966367287/job/18956418664
during test execution:
Screenshot from 2023-11-24 11-30-21

https://github.com/cilium/cilium/actions/runs/6961628303/job/18943657274

One node's last heartbeat:
lastHeartbeatTime: "2023-11-22T19:18:24Z"
vs. another node:
lastHeartbeatTime: "2023-11-22T19:21:07Z"

Probably this run as well: https://github.com/cilium/cilium/actions/runs/6964276880/job/18951275067
The post-test information gathering shows pod client-996668d96-nwk6g running, whereas the tests used client-996668d96-gcwdm (there is no sysdump to verify this, though). Similarly, cilium-7f4bn no longer exists at the end of the test.

Estimated cost increase

setup-eks-cluster is used in the two workflows above. We use two spot instance types: t3.medium and t4g.medium.
Both the EKS and AWS-CNI workflows run every 6h in the following regions:

      "region": "eu-central-1"
      "region": "ap-northeast-1"
      "region": "us-east-2"
      "region": "ca-central-1"
      "region": "eu-north-1"

Spot t3.medium instances (USD per hour):

      "region": "eu-central-1" - $0.0153
      "region": "ap-northeast-1" - $0.0176
      "region": "us-east-2" - $0.0126
      "region": "ca-central-1" - $0.0149
      "region": "eu-north-1" - $0.017

vs on-demand:

      "region": "eu-central-1" - $0.048
      "region": "ap-northeast-1" - $0.0544
      "region": "us-east-2" - $0.0416
      "region": "ca-central-1" - $0.0464
      "region": "eu-north-1" - $0.0432

So roughly a 3x increase.

Spot t4g.medium instances (USD per hour):

      "region": "eu-central-1" - $0.0113
      "region": "ap-northeast-1" - $0.0121
      "region": "us-east-2" - $0.0055
      "region": "ca-central-1" - $0.0064
      "region": "eu-north-1" - $0.0105

vs on-demand:

      "region": "eu-central-1" - $0.0384
      "region": "ap-northeast-1" - $0.0432
      "region": "us-east-2" - $0.0336
      "region": "ca-central-1" - $0.0368
      "region": "eu-north-1" - $0.0344

So roughly a 4x increase.
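
The rough multipliers above can be checked by dividing the summed on-demand prices by the summed spot prices for each instance type. A minimal Python sketch using only the prices listed above; the names `SPOT`, `ON_DEMAND`, and `price_ratio` are illustrative and not part of any workflow:

```python
# Hourly prices in USD, copied from the lists above. Region order:
# eu-central-1, ap-northeast-1, us-east-2, ca-central-1, eu-north-1.
SPOT = {
    "t3.medium": [0.0153, 0.0176, 0.0126, 0.0149, 0.017],
    "t4g.medium": [0.0113, 0.0121, 0.0055, 0.0064, 0.0105],
}
ON_DEMAND = {
    "t3.medium": [0.048, 0.0544, 0.0416, 0.0464, 0.0432],
    "t4g.medium": [0.0384, 0.0432, 0.0336, 0.0368, 0.0344],
}


def price_ratio(instance_type: str) -> float:
    """Summed on-demand price divided by summed spot price across regions."""
    return sum(ON_DEMAND[instance_type]) / sum(SPOT[instance_type])


for instance_type in SPOT:
    print(f"{instance_type}: x{price_ratio(instance_type):.2f}")
```

Note that the per-region ratios vary more than the aggregate suggests: for t4g.medium, us-east-2 and ca-central-1 are closer to x6, while the other regions are around x3.5.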

Current total flakiness for these tests (scheduled workflows, main branch only):
"Conformance AWS-CNI (ci-awscni)" - 20 failed out of 50 = 40%
"Conformance EKS (ci-eks)" - 17 failed out of 50 = 34%

Proposal

Let's switch to on-demand for two weeks as an experiment, gather an additional ~50 runs, and check flakiness again.
After that, we can decide to:

  • If flakiness doesn't improve significantly, revert this change.
  • If flakiness improves significantly, either lower the schedule rate and stick with on-demand instances, or accept the higher cost.

Estimated additional cost of the experiment, assuming we run for 2 weeks, 4 times a day, 2 workflows, 5 test cases each, with each test case taking 1 hour (this is an overestimate):
14 days * 4 runs/day * 2 workflows * 5 test cases/workflow * 1 hour/test case = 560 hours
Hourly price increase for t3.medium, summed over the five regions: (0.048 + 0.0544 + 0.0416 + 0.0464 + 0.0432) - (0.0153 + 0.0176 + 0.0126 + 0.0149 + 0.017) = 0.1562 USD
Hourly price increase for t4g.medium, summed over the five regions: (0.0384 + 0.0432 + 0.0336 + 0.0368 + 0.0344) - (0.0113 + 0.0121 + 0.0055 + 0.0064 + 0.0105) = 0.1406 USD

Total additional cost: 560 * (0.1562 + 0.1406) = 166.208 USD
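
The estimate above can be reproduced with a short script, which makes it easy to change the schedule or duration assumptions. This is a sketch of the arithmetic only; the constant and function names are hypothetical, and the prices are the ones quoted in this issue:

```python
# Experiment assumptions, as stated in the estimate above.
DAYS = 14
RUNS_PER_DAY = 4
WORKFLOWS = 2
TEST_CASES_PER_WORKFLOW = 5
HOURS_PER_TEST_CASE = 1

# Hourly prices in USD per region, as (spot, on-demand) lists.
PRICES = {
    "t3.medium": (
        [0.0153, 0.0176, 0.0126, 0.0149, 0.017],
        [0.048, 0.0544, 0.0416, 0.0464, 0.0432],
    ),
    "t4g.medium": (
        [0.0113, 0.0121, 0.0055, 0.0064, 0.0105],
        [0.0384, 0.0432, 0.0336, 0.0368, 0.0344],
    ),
}


def total_hours() -> int:
    """Total instance-hours assumed for the whole experiment."""
    return (DAYS * RUNS_PER_DAY * WORKFLOWS
            * TEST_CASES_PER_WORKFLOW * HOURS_PER_TEST_CASE)


def hourly_increase() -> float:
    """On-demand minus spot price, summed over regions and instance types."""
    return sum(sum(on_demand) - sum(spot) for spot, on_demand in PRICES.values())


def extra_cost() -> float:
    """Additional cost in USD of running on-demand instead of spot."""
    return total_hours() * hourly_increase()


print(f"{total_hours()} hours, extra cost {extra_cost():.3f} USD")
```

If each of the five test cases runs in a single region (an assumption, not confirmed here), multiplying 560 hours by a price increase already summed over five regions counts each region five times, so the figure is an upper bound, in line with the overestimate noted above.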

Labels

area/CI (Continuous Integration testing issue or flake), ci/flake (This is a known failure that occurs in the tree. Please investigate me!)