Skip to content

Conversation

dylandreimerink
Copy link
Member

We already have extensive tests for actual behavior. These tests are written such that we directly call into the event processing code. This is done so all code is executed synchronously which makes them very stable.

A flaw of calling directly into the event processing code is that we don't test the actual lifecycle of the LB-IPAM controller and bugs that might arise in the path between the API client and the event processing logic.

This commit adds a test to exercise the lifecycle of the LB-IPAM controller. To do so we create a hive with smallest setup which still works. LB-IPAM now increments a few atomic counters (only during testing) which we use to glean information about the lifecycle of the controller and if its processing events or not. We start the hive as we would in production. Since everything now runs in its own goroutine we do not have strong guarantees about when the controller will process events. To compensate, relatively large timeouts are used when waiting for something to happen, in an attempt to prevent flaky tests. However due to the nature of the tests, it might still happen, and we will have to keep an eye on it.

When stress testing it locally it seems stable:

stress -o /tmp/test- -p 100 ./lbipam.test -test.run "TestLBIPAMStartupRestartShutdown" -test.timeout=30s
5s: 0 runs so far, 0 failures, 100 active
10s: 0 runs so far, 0 failures, 100 active
15s: 100 runs so far, 0 failures, 100 active
20s: 100 runs so far, 0 failures, 100 active
25s: 103 runs so far, 0 failures, 100 active
30s: 200 runs so far, 0 failures, 100 active
35s: 200 runs so far, 0 failures, 100 active
40s: 300 runs so far, 0 failures, 100 active
45s: 300 runs so far, 0 failures, 100 active
50s: 349 runs so far, 0 failures, 100 active
55s: 400 runs so far, 0 failures, 100 active
1m0s: 400 runs so far, 0 failures, 100 active
1m5s: 500 runs so far, 0 failures, 100 active
1m10s: 500 runs so far, 0 failures, 100 active
1m15s: 586 runs so far, 0 failures, 100 active
1m20s: 600 runs so far, 0 failures, 100 active
1m25s: 600 runs so far, 0 failures, 100 active
1m30s: 700 runs so far, 0 failures, 100 active
1m35s: 700 runs so far, 0 failures, 100 active
1m40s: 800 runs so far, 0 failures, 100 active
1m45s: 800 runs so far, 0 failures, 100 active
1m50s: 800 runs so far, 0 failures, 100 active
1m55s: 900 runs so far, 0 failures, 100 active
2m0s: 900 runs so far, 0 failures, 100 active

This should hopefully catch regressions of #37965 or similar cases.

@dylandreimerink dylandreimerink added the release-note/ci This PR makes changes to the CI. label Apr 15, 2025
@dylandreimerink dylandreimerink force-pushed the feature/lbipam-lifecycle-tests branch 3 times, most recently from 176bee7 to 0d4ef85 Compare April 16, 2025 10:37
@dylandreimerink
Copy link
Member Author

/test

@dylandreimerink dylandreimerink marked this pull request as ready for review April 22, 2025 09:36
@dylandreimerink dylandreimerink requested a review from a team as a code owner April 22, 2025 09:36
@dylandreimerink dylandreimerink requested a review from aspsk April 22, 2025 09:36
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Apr 23, 2025
@dylandreimerink dylandreimerink added this pull request to the merge queue Apr 23, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 23, 2025
@dylandreimerink dylandreimerink force-pushed the feature/lbipam-lifecycle-tests branch from 0d4ef85 to dee886f Compare April 23, 2025 09:42
We already have extensive tests for actual behavior. These tests are
written such that we directly call into the event processing code.
This is done so all code is executed synchronously which makes them
very stable.

A flaw of calling directly into the event processing code is that we
don't test the actual lifecycle of the LB-IPAM controller and bugs that
might arise in the path between the API client and the event processing
logic.

This commit adds a test to exercise the lifecycle of the LB-IPAM
controller. To do so we create a hive with smallest setup which still
works. LB-IPAM now increments a few atomic counters (only during
testing) which we use to glean information about the lifecycle of the
controller and if its processing events or not. We start the hive as
we would in production. Since everything now runs in its own goroutine
we do not have strong guarantees about when the controller will process
events. To compensate, relatively large timeouts are used when waiting
for something to happen, in an attempt to prevent flaky tests. However
due to the nature of the tests, it might still happen, and we will have
to keep an eye on it.

Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
@dylandreimerink dylandreimerink force-pushed the feature/lbipam-lifecycle-tests branch from dee886f to ff7ec47 Compare April 23, 2025 09:44
@dylandreimerink
Copy link
Member Author

/test

1 similar comment
@dylandreimerink
Copy link
Member Author

/test

@dylandreimerink dylandreimerink added this pull request to the merge queue Apr 23, 2025
Merged via the queue into cilium:main with commit 2b47cfd Apr 23, 2025
65 of 66 checks passed
@dylandreimerink dylandreimerink deleted the feature/lbipam-lifecycle-tests branch April 23, 2025 13:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/ci This PR makes changes to the CI.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants