Skip to content

Conversation

tgraf
Copy link
Member

@tgraf tgraf commented Jul 20, 2020

Several code paths which are blocking on other channels or waiting for certain conditions to be met do not properly respect the user and client context to abort.

The most serious consequence can be endless loops attempting to retry an etcd operation forever.

tgraf added 4 commits July 20, 2020 13:30
…ncelled

Connected() is being used for two use cases:

1. To wait for a particular client to be connected and reach quorum.
2. To wait for the overall agent to connect to a kvstore and reach quorum.

The first use case is tied to a particular kvstore client. The second use case
is not.

The current implementation will block Connected() forever until
isConnectedAndHasQuorum() returns success. If the context is cancelled or if
the client is closed, this never happens as an error will always be returned.

Change etcd's Connected() implementation to check the context and the etcd
client context and return an error on the channel. This allows Watch() to
return if this condition is met to properly return the Watch() call which is
made against a particular client. This fixes a Watch() call never returning in
case the etcd client is closed before connectivity was ever achieved.

To account for the second use case, introduce an overall Connected() function
which will block closing the channel until kvstore connectivity has been
achieved.

Signed-off-by: Thomas Graf <thomas@cilium.io>
When Watch() is called on an etcd client that is closing or the context has
been cancelled then the Get() call will fail. The error handler will retry the
loop with no chance of success as the context is already cancelled or the
client has been closed. The retry will occur forever. The lack of sleep in
the retry path will further put massive stress on the etcd rate limiter and
thus will reduce the probabilit for other etcd operations to succeed in time.

Fix Watch() to break out of the loop if the user or etcd client context has
been cancelled.

Signed-off-by: Thomas Graf <thomas@cilium.io>
Don't rely on timeout if the client is being cancelled. If the client is
closing, a lock can never succeed again.

Signed-off-by: Thomas Graf <thomas@cilium.io>
Don't rely on timeout of user context while waiting for initial connect if the
client is closing.

Signed-off-by: Thomas Graf <thomas@cilium.io>
@tgraf tgraf added kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. needs-backport/1.7 labels Jul 20, 2020
@tgraf tgraf requested a review from a team July 20, 2020 11:59
@tgraf tgraf requested a review from a team as a code owner July 20, 2020 11:59
@coveralls
Copy link

Coverage Status

Coverage increased (+0.01%) to 37.009% when pulling 0e7e85c on pr/tgraf/etcd-bugs into 2aae988 on master.

@tgraf
Copy link
Member Author

tgraf commented Jul 20, 2020

test-me-please

@tgraf tgraf merged commit f42c7d4 into master Jul 20, 2020
@tgraf tgraf deleted the pr/tgraf/etcd-bugs branch July 20, 2020 14:32
case <-ctx.Done():
return ctx.Err()
if err := <-kvstore.Client().Connected(ctx); err != nil {
return nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this intentional? If yes a comment would be nice to have here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this was not intentional. I'll fix it up.

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jul 20, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug This is a bug in the Cilium logic. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants