Skip to content

Conversation

gandro
Copy link
Member

@gandro gandro commented Jul 21, 2021

This fixes a bug where the IncTimer.After would fire immediately due
to a rare race. See the comment in the diff as to how the race
could occur. This commit also adds a unit test which has a high
likelihood of triggering the bug in the old code.

Discovered this will debugging #15442 (which is not fully fixed by this PR,
but the reason why that other unit test faililure surfaced sporadically was
because of this bug here)

Fix bug where timers used for retries sometimes fired immediately

@gandro gandro added kind/bug This is a bug in the Cilium logic. needs-backport/1.10 labels Jul 21, 2021
@gandro gandro requested a review from a team as a code owner July 21, 2021 11:48
@maintainer-s-little-helper maintainer-s-little-helper bot added dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Jul 21, 2021
@gandro gandro added the release-note/bug This PR fixes an issue in a previous release of Cilium. label Jul 21, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jul 21, 2021
@gandro gandro force-pushed the pr/gandro/fix-inctimer-firing-too-soon branch from 8c28cb6 to 47531c0 Compare July 21, 2021 11:51
Copy link
Member

@rolinh rolinh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch ! And the fix is also extremely well documented 🚀
image

@gandro gandro force-pushed the pr/gandro/fix-inctimer-firing-too-soon branch from 47531c0 to 68123f3 Compare July 21, 2021 12:00
@gandro
Copy link
Member Author

gandro commented Jul 21, 2021

test-me-please

gandro added a commit to gandro/cilium that referenced this pull request Jul 21, 2021
This commit fixes cilium#15442 (and variants), where the `done` channel used
to indicate completion to the test driver could be closed twice.  This
happened because at the end of a test, most mock client will start
returning `io.EOF`.

Due to cilium#16955, this sometimes caused the peer manager to reconnect
immediately and create a new mock client, which would then attempt to
re-run the test-logic again.

This commit addresses this issue by ensuring that all mock clients
within a test share the same state (i.e. the `i` counter and `once`
instance). This way, each mock client instance will continue the work of
its predecessor instead of replaying the whole test sequence.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
@gandro
Copy link
Member Author

gandro commented Jul 21, 2021

test-1.16-netnext

Edit: net-next hit a variant of #16399
https://jenkins.cilium.io/job/Cilium-PR-K8s-1.16-net-next/1135/

Since this PR fixes another CI flake, this should not block this PR.

Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor nit and one curious question. The documentation is as Michi would say, "amazing man" 🚀

This fixes a bug where the `IncTimer.After` would fire immediately due
to a rare race. See the comment in the diff as to how the race
could occur. This commit also adds a unit test which has a high
likelihood of triggering the bug in the old code.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
@gandro gandro force-pushed the pr/gandro/fix-inctimer-firing-too-soon branch from 68123f3 to 65cd5fb Compare July 22, 2021 08:19
@gandro
Copy link
Member Author

gandro commented Jul 22, 2021

test-me-please

aanm pushed a commit that referenced this pull request Jul 22, 2021
This commit fixes #15442 (and variants), where the `done` channel used
to indicate completion to the test driver could be closed twice.  This
happened because at the end of a test, most mock client will start
returning `io.EOF`.

Due to #16955, this sometimes caused the peer manager to reconnect
immediately and create a new mock client, which would then attempt to
re-run the test-logic again.

This commit addresses this issue by ensuring that all mock clients
within a test share the same state (i.e. the `i` counter and `once`
instance). This way, each mock client instance will continue the work of
its predecessor instead of replaying the whole test sequence.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
@gandro
Copy link
Member Author

gandro commented Jul 22, 2021

The failure in net-next is unrelated, very similar to #16928: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.16-net-next/1140/testReport/junit/Suite-k8s-1/16/K8sPolicyTest_Basic_Test_Validate_to_entities_policies_Validate_toEntities_All/

2021-07-22T12:09:00.762112917Z level=error msg="Error while rewriting endpoint BPF program" containerID= datapathPolicyRevision=40 desiredPolicyRevision=41 endpointID=61 error="Failed to replace Qdisc for lxc92200f0d9c8b: Link not found" identity=33765 ipv4= ipv6= k8sPodName=/ subsys=endpoint

Marking this ready to merge as soon as remaining reviews are in, as this is a bug fix and thus should be exempt from the merge freeze.

jibi pushed a commit that referenced this pull request Jul 26, 2021
[ upstream commit 31176f7 ]

This commit fixes #15442 (and variants), where the `done` channel used
to indicate completion to the test driver could be closed twice.  This
happened because at the end of a test, most mock client will start
returning `io.EOF`.

Due to #16955, this sometimes caused the peer manager to reconnect
immediately and create a new mock client, which would then attempt to
re-run the test-logic again.

This commit addresses this issue by ensuring that all mock clients
within a test share the same state (i.e. the `i` counter and `once`
instance). This way, each mock client instance will continue the work of
its predecessor instead of replaying the whole test sequence.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
jibi pushed a commit that referenced this pull request Jul 26, 2021
[ upstream commit 31176f7 ]

This commit fixes #15442 (and variants), where the `done` channel used
to indicate completion to the test driver could be closed twice.  This
happened because at the end of a test, most mock client will start
returning `io.EOF`.

Due to #16955, this sometimes caused the peer manager to reconnect
immediately and create a new mock client, which would then attempt to
re-run the test-logic again.

This commit addresses this issue by ensuring that all mock clients
within a test share the same state (i.e. the `i` counter and `once`
instance). This way, each mock client instance will continue the work of
its predecessor instead of replaying the whole test sequence.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
@gandro gandro added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jul 26, 2021
jibi pushed a commit that referenced this pull request Jul 27, 2021
[ upstream commit 31176f7 ]

This commit fixes #15442 (and variants), where the `done` channel used
to indicate completion to the test driver could be closed twice.  This
happened because at the end of a test, most mock client will start
returning `io.EOF`.

Due to #16955, this sometimes caused the peer manager to reconnect
immediately and create a new mock client, which would then attempt to
re-run the test-logic again.

This commit addresses this issue by ensuring that all mock clients
within a test share the same state (i.e. the `i` counter and `once`
instance). This way, each mock client instance will continue the work of
its predecessor instead of replaying the whole test sequence.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
@brb brb merged commit a315b45 into cilium:master Jul 27, 2021
aanm pushed a commit that referenced this pull request Jul 27, 2021
[ upstream commit 31176f7 ]

This commit fixes #15442 (and variants), where the `done` channel used
to indicate completion to the test driver could be closed twice.  This
happened because at the end of a test, most mock client will start
returning `io.EOF`.

Due to #16955, this sometimes caused the peer manager to reconnect
immediately and create a new mock client, which would then attempt to
re-run the test-logic again.

This commit addresses this issue by ensuring that all mock clients
within a test share the same state (i.e. the `i` counter and `once`
instance). This way, each mock client instance will continue the work of
its predecessor instead of replaying the whole test sequence.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
brb pushed a commit that referenced this pull request Jul 28, 2021
[ upstream commit 31176f7 ]

This commit fixes #15442 (and variants), where the `done` channel used
to indicate completion to the test driver could be closed twice.  This
happened because at the end of a test, most mock client will start
returning `io.EOF`.

Due to #16955, this sometimes caused the peer manager to reconnect
immediately and create a new mock client, which would then attempt to
re-run the test-logic again.

This commit addresses this issue by ensuring that all mock clients
within a test share the same state (i.e. the `i` counter and `once`
instance). This way, each mock client instance will continue the work of
its predecessor instead of replaying the whole test sequence.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Signed-off-by: Gilberto Bertin <gilberto@isovalent.com>
@christarazi
Copy link
Member

Marking for backport to v1.9 for the same reasons as #14380 (comment).

krishgobinath pushed a commit to krishgobinath/cilium that referenced this pull request Oct 20, 2021
This commit fixes cilium#15442 (and variants), where the `done` channel used
to indicate completion to the test driver could be closed twice.  This
happened because at the end of a test, most mock client will start
returning `io.EOF`.

Due to cilium#16955, this sometimes caused the peer manager to reconnect
immediately and create a new mock client, which would then attempt to
re-run the test-logic again.

This commit addresses this issue by ensuring that all mock clients
within a test share the same state (i.e. the `i` counter and `once`
instance). This way, each mock client instance will continue the work of
its predecessor instead of replaying the whole test sequence.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug This is a bug in the Cilium logic. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants