CI: Conformance EKS (ci-eks) / Conformance AWS-CNI (ci-awscni) flaky due to spot instance usage

# CI failure

With Black Friday and Christmas approaching, AWS appears to be running close to capacity, and spot instances are often terminated.
To reduce flakiness, I propose switching from spot instances to on-demand instances as a temporary measure, **but only for scheduled workflows**. This will give us a more reliable signal for releases.
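
A minimal sketch of the "only for scheduled workflows" rule, assuming the decision can be keyed on the workflow trigger. `GITHUB_EVENT_NAME` is the environment variable GitHub Actions sets for the triggering event, and `ON_DEMAND` / `SPOT` are the capacity type values of EKS managed node groups. The function name `capacity_type` is hypothetical, and how `setup-eks-cluster` would actually consume the value is not covered here.

```python
import os


def capacity_type(event_name: str) -> str:
    """Pick the node capacity type for a given workflow trigger (hypothetical helper)."""
    # Scheduled runs provide the release signal, so they get on-demand nodes.
    # Every other trigger (pull_request, workflow_dispatch, ...) stays on spot.
    return "ON_DEMAND" if event_name == "schedule" else "SPOT"


if __name__ == "__main__":
    # GITHUB_EVENT_NAME is set by GitHub Actions; the default is for local runs.
    print(capacity_type(os.environ.get("GITHUB_EVENT_NAME", "pull_request")))
```

Keeping the choice in one place would make the experiment easy to revert: only the condition changes, not the cluster setup itself.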

Example runs:
https://github.com/cilium/cilium/actions/runs/6966367287/job/18956418664
during test execution:
![Screenshot from 2023-11-24 11-30-21](https://github.com/cilium/cilium/assets/2011575/0b28bf41-db0b-4e53-b114-69f10f70f67b)

https://github.com/cilium/cilium/actions/runs/6961628303/job/18943657274
```
lastHeartbeatTime: "2023-11-22T19:18:24Z"
vs another node:
lastHeartbeatTime: "2023-11-22T19:21:07Z"
```

Probably this run as well: https://github.com/cilium/cilium/actions/runs/6964276880/job/18951275067
The post-test information gathering at the end shows pod `client-996668d96-nwk6g` running, whereas the tests used `client-996668d96-gcwdm` (there is no sysdump to verify this). Similarly, `cilium-7f4bn` no longer exists at the end of the test.


# Estimated cost increase
`setup-eks-cluster` is used in the two workflows above. We use two spot instances: `t3.medium` and `t4g.medium`.
Both EKS and AWS CNI run every 6h in the following regions:
```
      "region": "eu-central-1"
      "region": "ap-northeast-1"
      "region": "us-east-2"
      "region": "ca-central-1"
      "region": "eu-north-1",
```

Spot `t3.medium` instances (USD per hour):
```
      "region": "eu-central-1" - $0.0153
      "region": "ap-northeast-1" - $0.0176
      "region": "us-east-2" - $0.0126
      "region": "ca-central-1" - $0.0149
      "region": "eu-north-1" - $0.017
```
vs on-demand:
```
      "region": "eu-central-1" - $0.048
      "region": "ap-northeast-1" - $0.0544
      "region": "us-east-2" - $0.0416
      "region": "ca-central-1" - $0.0464
      "region": "eu-north-1" - $0.0432
```
So roughly a 3x increase.

Spot `t4g.medium` instances (USD per hour):
```
      "region": "eu-central-1" - $0.0113
      "region": "ap-northeast-1" - $0.0121
      "region": "us-east-2" - $0.0055
      "region": "ca-central-1" - $0.0064
      "region": "eu-north-1" - $0.0105
```
vs on-demand:
```
      "region": "eu-central-1" - $0.0384
      "region": "ap-northeast-1" - $0.0432
      "region": "us-east-2" - $0.0336
      "region": "ca-central-1" - $0.0368
      "region": "eu-north-1" - $0.0344
```
So roughly a 4x increase.
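
The multipliers above can be rechecked with a short script. This is a sketch that takes the prices from the tables in this issue and compares the summed on-demand price with the summed spot price across the five regions. The per-region ratios vary, so the aggregate ratio is only a rough guide.

```python
# Hourly prices in USD, copied from the tables in this issue.
SPOT = {
    "t3.medium": {"eu-central-1": 0.0153, "ap-northeast-1": 0.0176,
                  "us-east-2": 0.0126, "ca-central-1": 0.0149, "eu-north-1": 0.017},
    "t4g.medium": {"eu-central-1": 0.0113, "ap-northeast-1": 0.0121,
                   "us-east-2": 0.0055, "ca-central-1": 0.0064, "eu-north-1": 0.0105},
}
ON_DEMAND = {
    "t3.medium": {"eu-central-1": 0.048, "ap-northeast-1": 0.0544,
                  "us-east-2": 0.0416, "ca-central-1": 0.0464, "eu-north-1": 0.0432},
    "t4g.medium": {"eu-central-1": 0.0384, "ap-northeast-1": 0.0432,
                   "us-east-2": 0.0336, "ca-central-1": 0.0368, "eu-north-1": 0.0344},
}


def price_ratio(instance_type: str) -> float:
    """On-demand to spot ratio, using prices summed over all listed regions."""
    return sum(ON_DEMAND[instance_type].values()) / sum(SPOT[instance_type].values())


if __name__ == "__main__":
    for instance_type in SPOT:
        print(f"{instance_type}: x{price_ratio(instance_type):.2f}")
```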

Current total flakiness for these tests (scheduled workflows, main branch only):
"Conformance AWS-CNI (ci-awscni)" - 20 failed out of 50 = 40%
"Conformance EKS (ci-eks)" - 17 failed out of 50 = 34%


# Proposal

Let's switch to on-demand for two weeks as an experiment, gather an additional ~50 runs, and check flakiness again.
After that, we can decide:
- If flakiness doesn't improve significantly, revert this change.
- If flakiness improves significantly, either lower the schedule rate and stick with on-demand instances, or accept the higher cost.
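
One possible way to make "improves significantly" concrete is a one-sided two-proportion z-test on the failure counts before and after the switch. This is only a sketch: the issue does not prescribe a test, the function name is hypothetical, and any follow-up counts passed to it are placeholders until the experiment has actually run.

```python
import math


def improvement_p_value(fail_before: int, runs_before: int,
                        fail_after: int, runs_after: int) -> float:
    """One-sided p-value that the failure rate dropped (two-proportion z-test)."""
    rate_before = fail_before / runs_before
    rate_after = fail_after / runs_after
    # Pooled failure rate under the null hypothesis of "no change".
    pooled = (fail_before + fail_after) / (runs_before + runs_after)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / runs_before + 1 / runs_after))
    if std_err == 0:
        return 1.0  # no failures or only failures in both samples: nothing to compare
    z = (rate_before - rate_after) / std_err
    # P(Z >= z) for a standard normal variable.
    return 0.5 * math.erfc(z / math.sqrt(2))


if __name__ == "__main__":
    # Baseline from this issue: 20 failures in 50 ci-awscni runs.
    # The follow-up counts (5 of 50) are hypothetical placeholders.
    print(improvement_p_value(20, 50, 5, 50))
```

With only ~50 runs per side, small improvements will not stand out from noise, so agreeing on a threshold before the experiment starts avoids arguing about it afterwards.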

Estimated additional cost of the experiment, assuming 2 weeks, 4 runs a day, 2 workflows, 5 test cases each, and 1h per test case (this is an overestimate):
`14 days * 4 runs/day * 2 workflows * 5 test cases/workflow * 1 hour/test case = 560 hours`
Price increase for `t3.medium`: `(0.048 + 0.0544 + 0.0416 + 0.0464 + 0.0432) - (0.0153 + 0.0176 + 0.0126 + 0.0149 + 0.017) = 0.1562`
Price increase for `t4g.medium`: `(0.0384 + 0.0432 + 0.0336 + 0.0368 + 0.0344) - (0.0113 + 0.0121 + 0.0055 + 0.0064 + 0.0105) = 0.1406`

Total additional cost: `560 * (0.1562 + 0.1406) = 166.208 USD`
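
The arithmetic above can be reproduced with the sketch below, which uses the prices listed in this issue. It also prints a second figure under an assumption that is not stated in the issue: that each of the 5 test cases runs in exactly one of the 5 regions. In that case the summed five-region delta should be multiplied by the hours per region (112) rather than by all 560 hours, so the 166 USD figure would be an upper bound. The names `ISSUE_HOURS` and `PER_REGION_HOURS` are made up for illustration.

```python
# Hourly prices in USD per region, in the order listed in this issue.
SPOT = {
    "t3.medium": [0.0153, 0.0176, 0.0126, 0.0149, 0.017],
    "t4g.medium": [0.0113, 0.0121, 0.0055, 0.0064, 0.0105],
}
ON_DEMAND = {
    "t3.medium": [0.048, 0.0544, 0.0416, 0.0464, 0.0432],
    "t4g.medium": [0.0384, 0.0432, 0.0336, 0.0368, 0.0344],
}

# 14 days * 4 runs/day * 2 workflows * 5 test cases * 1 hour, as in the issue.
ISSUE_HOURS = 14 * 4 * 2 * 5 * 1
# Assumption: one test case per region, so each region only sees 1/5 of the hours.
PER_REGION_HOURS = 14 * 4 * 2 * 1


def summed_delta() -> float:
    """On-demand minus spot price, summed over regions and both instance types."""
    return sum(sum(ON_DEMAND[t]) - sum(SPOT[t]) for t in ON_DEMAND)


def extra_cost(hours: float) -> float:
    """Additional cost in USD when the summed delta is charged for `hours`."""
    return hours * summed_delta()


if __name__ == "__main__":
    print(f"issue estimate:      {extra_cost(ISSUE_HOURS):.2f} USD")
    print(f"one region per case: {extra_cost(PER_REGION_HOURS):.2f} USD")
```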


CI: Conformance EKS (ci-eks) / Conformance AWS-CNI (ci-awscni) flaky due to spot instance usage #29365
