-
Notifications
You must be signed in to change notification settings - Fork 3.4k
clustermesh: fix remote service deletion on endpointslicesync #32961
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
clustermesh: fix remote service deletion on endpointslicesync #32961
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I've left a couple of comments inline.
47aae40
to
8bf26e9
Compare
8bf26e9
to
ce016b6
Compare
Thanks for the feedback and finding the issue + the root cause of the test @giorio94 🙏. I unfortunately had to remove one of the test involving a remote service and no local service because I cannot really create an endpointslice locally and check that it's removed because this deletion is made through the normal ownerreferences. |
/test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of final comments, but overall looks good to me, thanks!
How about asserting in one of the tests that the resulting endpointslice has the correct owner reference? That would give us confidence that k8s will then take care of removing the stale endpointslices. |
e41827b
to
6e91b95
Compare
/test |
6639d0a
to
922c957
Compare
/test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, just a couple of nits inline. I've re-run the stress test, and hit no failures after 5k runs.
922c957
to
ea9d527
Compare
/test |
ea9d527
to
caf7894
Compare
/test |
@MrFreezeex One of the unit test failures are legitimate:
|
caf7894
to
16fdde9
Compare
Woops indeed my bad! |
/test |
@MrFreezeex, could you please rebase this PR? That should fix the failing tests |
16fdde9
to
057adb4
Compare
/test |
/ci-ingress |
/ci-ginkgo |
/ci-eks |
/ci-ipsec-upgrade |
/ci-ingress |
@MrFreezeex can you rebase? it looks there are a couple of conflicting files. |
Use EventuallyWithT instead of waiting for the controller queue to be empty. The queue being empty does not signify that the reconciliation is done as the controller pops elements when the reconciliation starts and not when it ends. Suggested-by: Marco Iorio <marco.iorio@isovalent.com> Signed-off-by: Arthur Outhenin-Chalandre <arthur@cri.epita.fr>
This commit fixes the case when a remote service gets deleted while the local service remains. Previously the Reconcile on the controller wasn't called because we weren't returning the deleted Service in the service informer List method which may leave some EndpointSlice in the cluster that should be deleted. In an usual Kubernetes environment this is not an issue since the OwnerReference should be deleting the EndpointSlices as well. In our case the actual Service still exist because we have this mechanism of creating "virtual" service by adding the cluster name in suffix to trigger a reconcile on each remote cluster. To fix that we are now returning the combination of all the local Services and the remote clusters instead of returning all the remote services. This allows to trigger a reconcile on all the possible services including some of them that don't exist which would make the Get method of the Service informer to return a not found error which will then trigger a deletion via our cleanup hook. Signed-off-by: Arthur Outhenin-Chalandre <arthur@cri.epita.fr>
Signed-off-by: Arthur Outhenin-Chalandre <arthur@cri.epita.fr>
057adb4
to
217acee
Compare
/test |
This fixes remote service deletion in endpointslicesync (see second commit).
And it also fixes tests by using EventuallyWithT instead of waiting for the controller queue to be empty. The queue being empty does not signify that the reconciliation is done as the controller pops elements when the reconciliation starts and not when it ends.
cc @giorio94
Fixes #32931