@giorio94 giorio94 commented Mar 4, 2025

Add a step at the end of the egress gateway scale test to assess the maximum number of parallel connections that can be opened via the egress gateway. The test leverages an extended version of the egw-scale-utils tools, with the server configured to keep connections open, and the client opening new connections until repeated failures are encountered.
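As a minimal sketch of the client behaviour described above (not the actual egw-scale-utils code), the loop below keeps opening connections and stops once a number of consecutive attempts fail. The names `open_until_failures` and `make_capped_connector`, and the threshold of three failures, are assumptions made for illustration only.

```python
from typing import Callable, List


def open_until_failures(
    connect: Callable[[], object], max_consecutive_failures: int = 3
) -> List[object]:
    """Open connections until `max_consecutive_failures` attempts fail in a row.

    The returned connections are kept open (held in the list), so their
    count approximates the maximum number of parallel connections.
    """
    held: List[object] = []
    failures = 0
    while failures < max_consecutive_failures:
        try:
            held.append(connect())
            failures = 0  # a success resets the failure streak
        except OSError:
            failures += 1
    return held


def make_capped_connector(capacity: int) -> Callable[[], object]:
    """Hypothetical stand-in for a real socket connect, failing past `capacity`."""
    state = {"opened": 0}

    def connect() -> object:
        if state["opened"] >= capacity:
            raise OSError("connection attempt failed")
        state["opened"] += 1
        return object()

    return connect


if __name__ == "__main__":
    print(len(open_until_failures(make_capped_connector(5))))
```

Requiring several failures in a row, rather than stopping at the first one, avoids treating a single transient error as the connection limit.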

The test is repeated four times, assessing the following combinations:

  1. base: client hosted on node N1, towards target at port P1.
  2. same-port-node: client hosted on node N1, towards target at port P1; it evaluates whether a client hosted on the same node and targeting the same destination is affected by the other open connections.
  3. diff-node: client hosted on node N2, towards target at port P1; it evaluates whether a client hosted on a different node and targeting the same destination is affected by the other open connections.
  4. diff-port: client hosted on node N1, towards target at port P2; it evaluates whether a client hosted on the same node but targeting a different destination is affected by the other open connections.
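
The four combinations above can be summarized as a small matrix. This is only an illustrative sketch: the `Scenario` type and the `differs_from_base` helper are hypothetical, and N1, N2, P1, P2 are the placeholders used in the description rather than real node names or ports.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Scenario:
    name: str
    client_node: str
    target_port: str


SCENARIOS = [
    Scenario("base", "N1", "P1"),
    Scenario("same-port-node", "N1", "P1"),
    Scenario("diff-node", "N2", "P1"),
    Scenario("diff-port", "N1", "P2"),
]


def differs_from_base(scenario: Scenario) -> List[str]:
    """List which dimensions (node, port) differ from the base scenario."""
    base = SCENARIOS[0]
    changed = []
    if scenario.client_node != base.client_node:
        changed.append("node")
    if scenario.target_port != base.target_port:
        changed.append("port")
    return changed


if __name__ == "__main__":
    for s in SCENARIOS:
        print(s.name, differs_from_base(s))
```

Each follow-up run changes at most one dimension with respect to the base run, which makes it possible to tell whether the connections already open interfere per node, per destination, or both.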

Please review commit by commit, and refer to the individual commit descriptions for additional details.

@giorio94 giorio94 added the area/CI, release-note/ci, and feature/egress-gateway labels Mar 4, 2025
@giorio94 giorio94 force-pushed the pr/giorio94/main/scale-test-egw-conn branch 3 times, most recently from b33ed4a to f2a382a Compare March 5, 2025 09:55
@giorio94 giorio94 requested a review from marseel March 5, 2025 09:55
@giorio94 giorio94 marked this pull request as ready for review March 5, 2025 09:55
@giorio94 giorio94 requested review from a team as code owners March 5, 2025 09:55
@giorio94 giorio94 added the dont-merge/blocked label Mar 5, 2025

giorio94 commented Mar 5, 2025

Marking ready for review to start collecting feedback. The last commit is temporary, and needs to be replaced once cilium/scaffolding#196 is merged.

giorio94 commented Mar 5, 2025

Converting back to draft, as a8e2e46 turned out not to be enough.

@giorio94 giorio94 marked this pull request as draft March 5, 2025 10:11
@giorio94 giorio94 force-pushed the pr/giorio94/main/scale-test-egw-conn branch from f2a382a to c3fcdfe Compare March 5, 2025 11:20

giorio94 commented Mar 5, 2025

/scale-egw

giorio94 added 5 commits March 5, 2025 14:00
By default, credentials expire after 1h. Given that the test duration is
now close to that threshold, let's refresh the credentials before the
cluster cleanup step, so that it can always complete successfully.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Currently, ClusterLoader2 does not play well with EKS 1.32, because
anonymous authentication to the API Server metrics endpoint is no
longer possible [1]. Let's ensure we use a supported version until
the issue gets addressed upstream [2].

[1]: https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions-standard.html#kubernetes-1-32
[2]: kubernetes/perf-tests#3214

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Thresholds are ignored if the enableViolations field is not set.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
By default, Cilium disables masquerading towards destinations within the
local VPC, as they are directly reachable. However, although the external
target is deployed in the same VPC for convenience, we want traffic
directed to it to be masqueraded, to reflect the most common type of real
deployment, in which the target is located in an external network. Hence,
let's configure the native routing CIDR to only match the availability
zone hosting the cluster nodes, considering that the external target is
explicitly pinned to a different one.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
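
To illustrate the reasoning in the commit above, here is a simplified model of the decision: a destination inside the native routing CIDR is treated as directly reachable, while anything outside it is masqueraded. The CIDRs and the target address are hypothetical values chosen for the example, not the ones used by the actual test, and `is_masqueraded` is an illustrative helper rather than Cilium code.

```python
import ipaddress


def is_masqueraded(destination: str, native_routing_cidr: str) -> bool:
    """Simplified model: traffic is masqueraded only if the destination
    falls outside the native routing CIDR."""
    network = ipaddress.ip_network(native_routing_cidr)
    return ipaddress.ip_address(destination) not in network


# Hypothetical addressing, assumed for illustration only.
VPC_CIDR = "10.0.0.0/16"        # whole VPC
NODES_AZ_CIDR = "10.0.0.0/20"   # availability zone hosting the cluster nodes
EXTERNAL_TARGET = "10.0.16.10"  # target pinned to a different availability zone

if __name__ == "__main__":
    # Native routing CIDR covering the whole VPC: the target is not masqueraded.
    print(is_masqueraded(EXTERNAL_TARGET, VPC_CIDR))
    # Native routing CIDR restricted to the nodes' zone: the target is masqueraded.
    print(is_masqueraded(EXTERNAL_TARGET, NODES_AZ_CIDR))
```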
So that we trigger an explicit failure in case the metrics don't
include all the expected client pods for whatever reason.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
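
A minimal sketch of this kind of completeness check, assuming the collected metrics are available as a mapping from pod name to sample value. The function name, the pod names, and the data shape are assumptions for illustration and do not reflect how the scale test actually stores its metrics.

```python
from typing import Dict, Iterable


def check_all_pods_reported(
    expected_pods: Iterable[str], samples: Dict[str, float]
) -> None:
    """Raise an explicit error if any expected client pod has no sample."""
    missing = sorted(set(expected_pods) - set(samples))
    if missing:
        raise RuntimeError(
            "metrics missing for client pods: " + ", ".join(missing)
        )


if __name__ == "__main__":
    # Hypothetical pod names and values.
    check_all_pods_reported(["client-0", "client-1"], {"client-0": 0.2, "client-1": 0.3})
    print("all expected client pods reported metrics")
```

Failing loudly here prevents a run with partial data from being mistaken for a run that met its thresholds.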
@giorio94 giorio94 force-pushed the pr/giorio94/main/scale-test-egw-conn branch 2 times, most recently from 585926b to 14ccc60 Compare March 5, 2025 14:06

giorio94 commented Mar 5, 2025

/scale-egw

@giorio94 giorio94 removed the dont-merge/blocked label Mar 5, 2025

giorio94 commented Mar 5, 2025

Ready for review again.

@giorio94 giorio94 marked this pull request as ready for review March 5, 2025 15:11

giorio94 commented Mar 5, 2025

/test

@giorio94 giorio94 force-pushed the pr/giorio94/main/scale-test-egw-conn branch from 14ccc60 to aeb53ee Compare March 6, 2025 11:02

giorio94 commented Mar 6, 2025

giorio94 commented Mar 6, 2025

/scale-egw

giorio94 commented Mar 6, 2025

/test

@marseel marseel left a comment

Thanks, only non-blocking nits.

giorio94 added 5 commits March 6, 2025 15:16
Make the metric names uniform, so that they are always prefixed with the
test name for easier visualization in perfdash. Additionally, let's rework
the collection of the Cilium CPU/memory metrics during the masquerade
test, to make their purpose explicit and mirror the others. The queries
are slightly adapted considering the limited number of nodes. The other
metrics are no longer collected, as they are not significant in this test.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Sometimes the masquerade delay metrics collection returns no samples,
as the test completes very quickly, and Prometheus may not even have
discovered the scrape target at that point. As a workaround, let's
explicitly wait for it to appear before moving on with the actual test.
Once the target is known, the discovery of the actual pods (once
created) is supposed to be significantly faster.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
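
The workaround above boils down to polling until the scrape target is known. The sketch below shows that pattern in isolation: `list_jobs` is a hypothetical callable returning the job names Prometheus currently knows about (in practice it could be backed by the Prometheus targets API), and the job name, timeout, and interval are assumptions rather than values from the actual test.

```python
import time
from typing import Callable, Iterable


def wait_for_target(
    list_jobs: Callable[[], Iterable[str]],
    job: str,
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
) -> bool:
    """Poll until `job` appears among the discovered scrape targets.

    Returns True once the job is found, or False if the timeout expires.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if job in set(list_jobs()):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


if __name__ == "__main__":
    # Fake discovery that only reports the job on the third poll.
    responses = iter([[], [], ["egw-client"]])
    print(wait_for_target(lambda: next(responses), "egw-client", interval_s=0))
```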
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Add a step at the end of the egress gateway scale test to assess the
maximum number of parallel connections that can be opened via the egress
gateway. The test leverages an extended version of the egw-scale-utils
tools, with the server configured to keep connections open, and the
client opening new connections until repeated failures are encountered.

The test is repeated four times, assessing the following combinations:

1. base: client hosted on node N1, towards target at port P1.
2. same-port-node: client hosted on node N1, towards target at port P1;
   it evaluates whether a client hosted on the same node and targeting
   the same destination is affected by the other open connections.
3. diff-node: client hosted on node N2, towards target at port P1; it
   evaluates whether a client hosted on a different node and targeting
   the same destination is affected by the other open connections.
4. diff-port: client hosted on node N1, towards target at port P2; it
   evaluates whether a client hosted on the same node but targeting
   a different destination is affected by the other open connections.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Update the egress gateway scale test README file to include an overview
of the additional tests recently introduced. Overall, the README file is
simplified and generalized a bit, omitting more specific details, to
reduce the maintenance burden and the likelihood of divergences whenever
minor modifications to the scale test are performed.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
@giorio94 giorio94 force-pushed the pr/giorio94/main/scale-test-egw-conn branch from aeb53ee to 7238c20 Compare March 6, 2025 14:17

giorio94 commented Mar 6, 2025

/scale-egw

giorio94 commented Mar 6, 2025

/test

giorio94 commented Mar 6, 2025

The failure in the last scale test run is unrelated, and was caused by a flake during features collection. The scale test itself completed successfully, so I won't re-run it.

@giorio94 giorio94 enabled auto-merge March 6, 2025 15:27
@giorio94 giorio94 added this pull request to the merge queue Mar 6, 2025
Merged via the queue into main with commit f01f60f Mar 6, 2025
292 of 294 checks passed
@giorio94 giorio94 deleted the pr/giorio94/main/scale-test-egw-conn branch March 6, 2025 17:32