Egress gateway parallel connections testing #37981
b33ed4a to f2a382a
Marking ready for review to start collecting feedback. The last commit is temporary, and needs to be replaced once cilium/scaffolding#196 is merged.
Converting back to draft, as a8e2e46 turned out not to be enough.
f2a382a to c3fcdfe
/scale-egw
By default, credentials expire after 1h. Given that the test duration is now close to that threshold, let's refresh the credentials before the cluster cleanup step, so that it can always complete successfully. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Currently, ClusterLoader2 does not play well with EKS 1.32, because anonymous authentication to the API Server metrics endpoint is no longer possible [1]. Let's ensure we use a supported version until the issue gets addressed upstream [2].
[1]: https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions-standard.html#kubernetes-1-32
[2]: kubernetes/perf-tests#3214
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Thresholds are ignored if the enableViolations field is not set. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
By default, Cilium disables masquerading towards destinations within the local VPC, as they are directly reachable. Yet, although the external target is deployed in the same VPC for convenience, we want traffic directed to it to be masqueraded, to reflect the most common type of real deployment, in which the target is located in an external network. Hence, let's configure the native routing CIDR to only match the availability zone hosting the cluster nodes, considering that the external target is explicitly pinned to a different one. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
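To illustrate the effect of narrowing the native routing CIDR, here is a minimal Go sketch. It assumes a simplified rule in which traffic is masqueraded only when the destination falls outside the native routing CIDR; the real datapath takes more conditions into account. The `shouldMasquerade` helper and all addresses are hypothetical and not taken from the PR.

```go
package main

import (
	"fmt"
	"net/netip"
)

// shouldMasquerade reports whether traffic towards dst would be masqueraded
// under the simplified assumption that only destinations outside the native
// routing CIDR are masqueraded.
func shouldMasquerade(nativeRoutingCIDR, dst string) (bool, error) {
	prefix, err := netip.ParsePrefix(nativeRoutingCIDR)
	if err != nil {
		return false, fmt.Errorf("parsing native routing CIDR: %w", err)
	}
	addr, err := netip.ParseAddr(dst)
	if err != nil {
		return false, fmt.Errorf("parsing destination: %w", err)
	}
	return !prefix.Contains(addr), nil
}

func main() {
	// Hypothetical addressing: the VPC is 10.0.0.0/16, the cluster nodes live
	// in the 10.0.0.0/18 availability zone subnet, and the external target is
	// pinned to a different zone (10.0.64.0/18).
	const target = "10.0.64.10"

	for _, cidr := range []string{"10.0.0.0/16", "10.0.0.0/18"} {
		masq, err := shouldMasquerade(cidr, target)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("native routing CIDR %s -> masquerade traffic to %s: %v\n", cidr, target, masq)
	}
}
```

With the whole VPC as native routing CIDR the target is treated as directly reachable; restricting the CIDR to the nodes' zone makes the same target look external.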
So that we trigger an explicit failure in case the metrics don't include all the expected client pods for whatever reason. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
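A check of this kind can be sketched in Go as follows. This is only an illustration: the `missingPods` helper, the pod names and the sample values are hypothetical, and the actual test performs the verification through its own metrics pipeline.

```go
package main

import (
	"fmt"
	"sort"
)

// missingPods returns the expected pod names for which the collected metrics
// contain no sample, sorted for stable output.
func missingPods(expected []string, samples map[string]float64) []string {
	var missing []string
	for _, pod := range expected {
		if _, ok := samples[pod]; !ok {
			missing = append(missing, pod)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Hypothetical data: three client pods were created, but only two of
	// them show up in the collected metrics.
	expected := []string{"client-0", "client-1", "client-2"}
	samples := map[string]float64{"client-0": 0.12, "client-2": 0.09}

	if missing := missingPods(expected, samples); len(missing) > 0 {
		// Fail explicitly rather than silently evaluating a partial data set.
		fmt.Printf("metrics are missing %d expected client pod(s): %v\n", len(missing), missing)
		return
	}
	fmt.Println("all expected client pods are present in the metrics")
}
```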
585926b to 14ccc60
/scale-egw
Ready for review again.
/test
14ccc60 to aeb53ee
Additionally gathering the NAT source port saturation metric: https://github.com/cilium/cilium/compare/14ccc604bf7c4bd7cbfa9a380f913ea0b1cb7324..aeb53eec4cc667416a5bfd4776a1b9654a38b720
/scale-egw
/test
Thanks, only non-blocking nits.
Unify the metric names, so that they are always prefixed with the test name for easier visualization in perfdash. Additionally, let's rework the collection of the Cilium CPU/memory metrics during the masquerade test, to make their purpose explicit and mirror the others. The queries are slightly adapted considering the limited number of nodes. The other metrics are no longer collected, as they are not significant in this test. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Sometimes the masquerade delay metrics collection returns no samples, as the test completes very quickly, and Prometheus may not even have discovered the scrape target at that point. As a workaround, let's explicitly wait for it to appear before moving on with the actual test. Once the target is known, the discovery of the actual pods (once created) is supposed to be significantly faster. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
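The wait-before-test workaround boils down to a bounded polling loop. The Go sketch below shows the general pattern only; `waitForTarget`, `errTargetNotFound` and the stand-in check function are hypothetical names, and the real test relies on its own tooling to query Prometheus.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTargetNotFound is returned when the target never shows up.
var errTargetNotFound = errors.New("scrape target not discovered")

// waitForTarget calls check up to attempts times, sleeping interval between
// attempts, and returns nil as soon as check reports the target as known.
func waitForTarget(attempts int, interval time.Duration, check func() (bool, error)) error {
	for i := 0; i < attempts; i++ {
		known, err := check()
		if err != nil {
			return err
		}
		if known {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(interval)
		}
	}
	return errTargetNotFound
}

func main() {
	// Stand-in for a query against Prometheus: here the target becomes
	// visible on the third poll.
	polls := 0
	targetKnown := func() (bool, error) {
		polls++
		return polls >= 3, nil
	}

	if err := waitForTarget(10, 10*time.Millisecond, targetKnown); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("target discovered after", polls, "polls; starting the test")
}
```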
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Add a step at the end of the egress gateway scale test to assess the maximum number of parallel connections that can be opened via the egress gateway. The test leverages an extended version of the egw-scale-utils tools, with the server configured to keep connections open, and the client opening new connections until repeated failures are encountered. The test is repeated four times, assessing the following combinations:
1. base: client hosted on node N1, towards target at port P1.
2. same-port-node: client hosted on node N1, towards target at port P1; it evaluates whether a client hosted on the same node and targeting the same destination is affected by the other open connections.
3. diff-node: client hosted on node N2, towards target at port P1; it evaluates whether a client hosted on a different node and targeting the same destination is affected by the other open connections.
4. diff-port: client hosted on node N1, towards target at port P2; it evaluates whether a client hosted on the same node but targeting a different destination is affected by the other open connections.
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
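The client behaviour described above (keep opening connections until repeated failures occur) can be sketched in Go as follows. This is not the egw-scale-utils implementation: `openUntilFailures`, `newFakeDialer`, the capacity of 1000 and the threshold of 5 consecutive failures are all assumptions made for illustration. In a real client the dial function would wrap something like `net.Dial("tcp", addr)`.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// openUntilFailures keeps calling dial and holding the resulting connections
// open, and stops once maxFailures consecutive dial attempts have failed.
// It returns the connections that are still open.
func openUntilFailures(dial func() (io.Closer, error), maxFailures int) []io.Closer {
	var conns []io.Closer
	failures := 0
	for failures < maxFailures {
		c, err := dial()
		if err != nil {
			failures++
			continue
		}
		failures = 0 // only consecutive failures count towards the limit
		conns = append(conns, c)
	}
	return conns
}

// fakeConn is a no-op connection used by the fake dialer.
type fakeConn struct{}

func (fakeConn) Close() error { return nil }

// newFakeDialer returns a dialer that succeeds capacity times and then fails,
// mimicking a gateway that has run out of resources for new connections.
func newFakeDialer(capacity int) func() (io.Closer, error) {
	opened := 0
	return func() (io.Closer, error) {
		if opened >= capacity {
			return nil, errors.New("dial failed: capacity exhausted")
		}
		opened++
		return fakeConn{}, nil
	}
}

func main() {
	conns := openUntilFailures(newFakeDialer(1000), 5)
	fmt.Println("parallel connections opened:", len(conns))
	for _, c := range conns {
		c.Close()
	}
}
```

Counting only consecutive failures avoids stopping on a single transient error, which is one plausible reading of "repeated failures" in the description.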
Update the egress gateway scale test README file to include an overview of the additional tests recently introduced. Overall, the README file is simplified and generalized a bit, omitting more specific details, to reduce the maintenance burden and the likelihood of divergences whenever minor modifications to the scale test are performed. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
aeb53ee to 7238c20
/scale-egw
/test
The failure in the last scale test run is unrelated, and caused by a flake during features collection. The scale test itself completed successfully, so I won't re-run it.
Add a step at the end of the egress gateway scale test to assess the maximum number of parallel connections that can be opened via the egress gateway. The test leverages an extended version of the egw-scale-utils tools, with the server configured to keep connections open, and the client opening new connections until repeated failures are encountered.
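The test is repeated four times, assessing the following combinations:
1. base: client hosted on node N1, towards target at port P1.
2. same-port-node: client hosted on node N1, towards target at port P1; it evaluates whether a client hosted on the same node and targeting the same destination is affected by the other open connections.
3. diff-node: client hosted on node N2, towards target at port P1; it evaluates whether a client hosted on a different node and targeting the same destination is affected by the other open connections.
4. diff-port: client hosted on node N1, towards target at port P2; it evaluates whether a client hosted on the same node but targeting a different destination is affected by the other open connections.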
Please review commit by commit, and refer to the individual commit descriptions for additional details.