Skip to content

Conversation

MrFreezeex
Copy link
Member

@MrFreezeex MrFreezeex commented Jun 7, 2024

This fixes remote service deletion in endpointslicesync (see second commit).

And it also fixes tests by using EventuallyWithT instead of waiting for the controller queue to be empty. The queue being empty does not signify that the reconciliation is done as the controller pops elements when the reconciliation starts and not when it ends.

cc @giorio94
Fixes #32931

@MrFreezeex MrFreezeex requested a review from a team as a code owner June 7, 2024 13:14
@MrFreezeex MrFreezeex requested a review from thorn3r June 7, 2024 13:14
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jun 7, 2024
@giorio94 giorio94 added area/clustermesh Relates to multi-cluster routing functionality in Cilium. release-note/misc This PR makes changes that have no direct user impact. labels Jun 7, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jun 7, 2024
Copy link
Member

@giorio94 giorio94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I've left a couple of comments inline.

@MrFreezeex MrFreezeex force-pushed the endpointslice-sync-race-test-fix branch from 47aae40 to 8bf26e9 Compare June 8, 2024 00:12
@MrFreezeex MrFreezeex changed the title clustermesh: use EventuallyWithT in endpointslicemeshsync clustermesh: fix remote service deletion on endpointslicesync Jun 8, 2024
@MrFreezeex MrFreezeex force-pushed the endpointslice-sync-race-test-fix branch from 8bf26e9 to ce016b6 Compare June 8, 2024 00:17
@MrFreezeex
Copy link
Member Author

Thanks for the feedback and finding the issue + the root cause of the test @giorio94 🙏.

I unfortunately had to remove one of the test involving a remote service and no local service because I cannot really create an endpointslice locally and check that it's removed because this deletion is made through the normal ownerreferences.

@MrFreezeex
Copy link
Member Author

/test

@MrFreezeex MrFreezeex requested a review from giorio94 June 8, 2024 00:24
Copy link
Member

@giorio94 giorio94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of final comments, but overall looks good to me, thanks!

@giorio94
Copy link
Member

I unfortunately had to remove one of the test involving a remote service and no local service because I cannot really create an endpointslice locally and check that it's removed because this deletion is made through the normal ownerreferences.

How about asserting in one of the tests that the resulting endpointslice has the correct owner reference? That would give us confidence that k8s will then take care of removing the stale endpointslices.

@MrFreezeex MrFreezeex force-pushed the endpointslice-sync-race-test-fix branch 2 times, most recently from e41827b to 6e91b95 Compare June 10, 2024 11:07
@MrFreezeex MrFreezeex requested a review from a team as a code owner June 10, 2024 12:37
@MrFreezeex MrFreezeex requested a review from nathanjsweet June 10, 2024 12:37
@MrFreezeex
Copy link
Member Author

/test

@MrFreezeex MrFreezeex force-pushed the endpointslice-sync-race-test-fix branch from 6639d0a to 922c957 Compare June 10, 2024 13:41
@MrFreezeex
Copy link
Member Author

/test

Copy link
Member

@giorio94 giorio94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, just a couple of nits inline. I've re-run the stress test, and hit no failures after 5k runs.

@giorio94 giorio94 removed the request for review from thorn3r June 11, 2024 08:03
@MrFreezeex MrFreezeex force-pushed the endpointslice-sync-race-test-fix branch from 922c957 to ea9d527 Compare June 11, 2024 08:48
@MrFreezeex
Copy link
Member Author

/test

@MrFreezeex MrFreezeex force-pushed the endpointslice-sync-race-test-fix branch from ea9d527 to caf7894 Compare June 11, 2024 09:00
@MrFreezeex
Copy link
Member Author

/test

@giorio94
Copy link
Member

@MrFreezeex One of the unit test failures are legitimate:

 --- FAIL: Test_meshEndpointSlice_Reconcile/Create_service_then_global_service (0.01s)

I0611 09:13:26.678082   49847 endpointslice_controller.go:296] "Starting endpoint slice controller"
I0611 09:13:26.678103   49847 shared_informer.go:313] Waiting for caches to sync for endpoint_slice
I0611 09:13:26.678110   49847 shared_informer.go:320] Caches are synced for endpoint_slice
    endpointslice_test.go:161: 
        	Error Trace:	/home/runner/work/cilium/cilium/pkg/clustermesh/endpointslicesync/endpointslice_test.go:161
        	Error:      	Not equal: 
        	            	expected: "service"
        	            	actual  : "Service"
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1 +1 @@
        	            	-service
        	            	+Service
        	Test:       	Test_meshEndpointSlice_Reconcile/Create_service_then_global_service

@MrFreezeex MrFreezeex force-pushed the endpointslice-sync-race-test-fix branch from caf7894 to 16fdde9 Compare June 13, 2024 12:48
@MrFreezeex
Copy link
Member Author

@MrFreezeex One of the unit test failures are legitimate:

 --- FAIL: Test_meshEndpointSlice_Reconcile/Create_service_then_global_service (0.01s)

I0611 09:13:26.678082   49847 endpointslice_controller.go:296] "Starting endpoint slice controller"
I0611 09:13:26.678103   49847 shared_informer.go:313] Waiting for caches to sync for endpoint_slice
I0611 09:13:26.678110   49847 shared_informer.go:320] Caches are synced for endpoint_slice
    endpointslice_test.go:161: 
        	Error Trace:	/home/runner/work/cilium/cilium/pkg/clustermesh/endpointslicesync/endpointslice_test.go:161
        	Error:      	Not equal: 
        	            	expected: "service"
        	            	actual  : "Service"
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1 +1 @@
        	            	-service
        	            	+Service
        	Test:       	Test_meshEndpointSlice_Reconcile/Create_service_then_global_service

Woops indeed my bad!

@MrFreezeex
Copy link
Member Author

/test

@giorio94
Copy link
Member

@MrFreezeex, could you please rebase this PR? That should fix the failing tests

@MrFreezeex MrFreezeex force-pushed the endpointslice-sync-race-test-fix branch from 16fdde9 to 057adb4 Compare June 14, 2024 15:50
@MrFreezeex
Copy link
Member Author

/test

@MrFreezeex
Copy link
Member Author

/ci-ingress

@MrFreezeex
Copy link
Member Author

/ci-ginkgo

@MrFreezeex
Copy link
Member Author

/ci-eks

@MrFreezeex
Copy link
Member Author

/ci-ipsec-upgrade

@MrFreezeex
Copy link
Member Author

/ci-ingress

@aanm aanm added the kind/bug This is a bug in the Cilium logic. label Jun 17, 2024
@aanm
Copy link
Member

aanm commented Jun 17, 2024

@MrFreezeex can you rebase? it looks there are a couple of conflicting files.

Use EventuallyWithT instead of waiting for the controller queue to be
empty. The queue being empty does not signify that the reconciliation is
done as the controller pops elements when the reconciliation starts and
not when it ends.

Suggested-by: Marco Iorio <marco.iorio@isovalent.com>
Signed-off-by: Arthur Outhenin-Chalandre <arthur@cri.epita.fr>
This commit fixes the case when a remote service gets deleted while the
local service remains. Previously the Reconcile on the controller wasn't
called because we weren't returning the deleted Service in the service
informer List method which may leave some EndpointSlice in the cluster
that should be deleted.

In an usual Kubernetes environment this is not an issue since the
OwnerReference should be deleting the EndpointSlices as well.
In our case the actual Service still exist because we have this
mechanism of creating "virtual" service by adding the cluster name in
suffix to trigger a reconcile on each remote cluster.

To fix that we are now returning the combination of all the local
Services and the remote clusters instead of returning all the remote
services. This allows to trigger a reconcile on all the possible
services including some of them that don't exist which would
make the Get method of the Service informer to return a not found error
which will then trigger a deletion via our cleanup hook.

Signed-off-by: Arthur Outhenin-Chalandre <arthur@cri.epita.fr>
Signed-off-by: Arthur Outhenin-Chalandre <arthur@cri.epita.fr>
@MrFreezeex MrFreezeex force-pushed the endpointslice-sync-race-test-fix branch from 057adb4 to 217acee Compare June 17, 2024 08:52
@MrFreezeex
Copy link
Member Author

/test

@aanm aanm enabled auto-merge June 17, 2024 10:48
@aanm aanm added this pull request to the merge queue Jun 17, 2024
Merged via the queue into cilium:main with commit b7237b1 Jun 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/clustermesh Relates to multi-cluster routing functionality in Cilium. kind/bug This is a bug in the Cilium logic. release-note/misc This PR makes changes that have no direct user impact.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CI: Integration tests - Test_meshEndpointSlice_Reconcile
3 participants