Skip to content

Conversation

jibi
Copy link
Member

@jibi jibi commented Jun 1, 2021

When an endpoint's identity is updated, Cilium does not sync immediately
the new state with k8s, but rather waits up to 10 seconds for the
sync-to-k8s-ciliumendpoint controller to run, meaning that the the new
identity can remain unannounced for up to 10 seconds.

This commit fixes this by explicitly triggering the k8s sync controller
whenever an endpoint's identity is updated.

Fixes #15097

@jibi jibi added area/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers. release-note/bug This PR fixes an issue in a previous release of Cilium. labels Jun 1, 2021
@jibi jibi requested review from gandro, aanm and a team June 1, 2021 13:43
@jibi jibi requested a review from a team as a code owner June 1, 2021 13:43
@jibi jibi requested a review from a team June 1, 2021 13:43
@maintainer-s-little-helper maintainer-s-little-helper bot assigned aanm and gandro and unassigned aanm Jun 1, 2021
Copy link
Member

@gandro gandro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a ton for taking care of this! Looks good overall. I have some minor nits (none of which need to be blocking)

@@ -135,7 +136,7 @@ func (m *Manager) removeController(ctrl *Controller) {
ctrl.getLogger().Debug("Removed controller")
}

func (m *Manager) lookup(name string) *Controller {
func (m *Manager) Lookup(name string) *Controller {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No action required, just a thought:

One thing that I just noticed is that this export here is the first export that gives access to a Controller struct outside of the package. All other methods of the Manager do not allow direct access to the Controller. While it seems safe (i.e. it seems that m.mutex is not protecting it), I wonder if it might be cleaner and more future-proof to expose a TriggerController(name string) method instead (similar to TerminationChannel), instead of allowing a a direct lookup.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, makes sense, that's where (e *Endpoint) TriggerController(name string) should actually live 👍

Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
@jibi jibi force-pushed the pr/jibi/ipcache-propagation-delay branch from a3e881a to 26d3fca Compare June 1, 2021 14:25
@jibi
Copy link
Member Author

jibi commented Jun 1, 2021

test-me-please

@aanm
Copy link
Member

aanm commented Jun 1, 2021

@jibi can you add the GH issue ID this fixes?

@jibi
Copy link
Member Author

jibi commented Jun 1, 2021

@jibi can you add the GH issue ID this fixes?

@aanm it's in the second commit (I only linked it to #15097 as given the latest comment I don't think this fully fixes also #16302)

Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I have one suggestion below, but LMK what you think on it.

@aanm
Copy link
Member

aanm commented Jun 1, 2021

@jibi can you add the GH issue ID this fixes?

@aanm it's in the second commit (I only linked it to #15097 as given the latest comment I don't think this fully fixes also #16302)

@jibi thank you, I meant in the PR description so that GH will close them automatically as soon this PR is merged.

When an endpoint's identity is updated, Cilium does not sync immediately
the new state with k8s, but rather waits up to 10 seconds for the
sync-to-k8s-ciliumendpoint controller to run, meaning that the the new
identity can remain unannounced for up to 10 seconds.

This commit fixes this by explicitly triggering the k8s sync controller
whenever an endpoint's identity is updated.

Fixes: #15097
Suggested-by: Sebastian Wicki <sebastian@isovalent.com>
Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
@jibi jibi force-pushed the pr/jibi/ipcache-propagation-delay branch from 26d3fca to 35b2ed8 Compare June 2, 2021 06:34
@jibi
Copy link
Member Author

jibi commented Jun 2, 2021

@jibi can you add the GH issue ID this fixes?

@aanm it's in the second commit (I only linked it to #15097 as given the latest comment I don't think this fully fixes also #16302)

@jibi thank you, I meant in the PR description so that GH will close them automatically as soon this PR is merged.

My bad, I thought GH would look for that also in commit messages 🤦‍♂️

@jibi
Copy link
Member Author

jibi commented Jun 2, 2021

test-me-please

@jibi
Copy link
Member Author

jibi commented Jun 2, 2021

retest-5.4

@gandro gandro added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jun 2, 2021
@gandro
Copy link
Member

gandro commented Jun 2, 2021

Marked as ready to merge. All tests have passed, but the bot is currently known to be broken due to Validate & Build HTML being skipped on Non-Docs PR.

@nbusseneau
Copy link
Member

Added for backporting to 1.10, otherwise we'll continue having these flakes in 1.10 automated jobs ad vitam æternam.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CI: AutoDirectNodeRoutes Check connectivity with automatic direct nodes routes
8 participants