Fix hanging sentinels #659

hmac · 2019-06-03T11:50:11Z

This change fixes an issue we were seeing where our sentinels would occasionally freeze, not responding to etcd changes or making any decisions. This could only be resolved by restarting them all.

We tracked this down to an issue with how the sentinels use etcd elections to determine the leader. An error interacting with etcd can lead to old election keys hanging around, blocking any future elections from completing. The end result is that the sentinels are unable to elect a leader and cannot make any progress.

There's more detail on the specifics of this problem in the commit message, which I'll reproduce here:

If etcd returned an error, our previous behaviour would be to report the
error and restart an election. The outer loop never responded to the
error by stopping the election, which would cancel the context, and
instead would restart a new campaign.

The first campaigns session will continue to exist on the context the
election loop had originally provided it, which is context.Background.
We'll come round to perform a new leadership election and the
etcdElection.Campaign() method will block waiting for all existing
leased keys to be removed from the prefix. Because the old campaign
sessions are still alive, they'll continue to renew their leases and
ensure those revisions never get deleted, thus blocking our sentinel
campaign loop indefinitely.

This change ensures we end our session whenever we return from the
campaign function which should ensure leased values can't persist beyond
the lifetime of our campaign. In future we should look to simplify this
election interface, as I have no confidence that we don't have other
bugs of a similar vein waiting to catch us.

The fix is subtle, but hopefully not too controversial. I'd love to get some feedback on whether this makes sense to you and if there's anything you'd like us to add or change. Ideally we'd have a test for this but I think this is going to be a very hard thing to test...

sgotti · 2019-06-04T13:40:10Z

@hmac Great catch!

Such condition should only happen if inside campaign there's an error on NewSession or NewElection since in such case the session is not closed (but I wasn't able to reproduce such error condition).

But, I think this case will happen also with the other stores (libkv based) and probably the solution is just to Stop the election when an error occurs inside the sentinel election loop instead of adding another subcontext inside etcdv3 campaign function.

Can you please try with this patch?

hmac · 2019-06-04T14:45:21Z

@sgotti thanks! We'll try this out and get back to you.

lawrencejones · 2019-06-05T12:31:32Z

Hey @sgotti- it makes sense for me to take this over from @hmac as I did the initial work catching this.

Such condition should only happen if inside campaign there's an error on NewSession or NewElection since in such case the session is not closed (but I wasn't able to reproduce such error condition).

The Campaign() call fails very reproducibly if the etcd cluster is configured to auto-compact frequently. Because etcd waits for deletes using a watch stream, compaction can terminate the watch when revisions you're watching have been compacted. This was leading to all our sentinels jamming after just a couple of hours running.

But, I think this case will happen also with the other stores (libkv based) and probably the solution is just to Stop the election when an error occurs inside the sentinel election loop instead of adding another subcontext inside etcdv3 campaign function.

If this is an issue with the other stores then this makes a lot of sense and calling Stop() will cancel the primary context, leading to the session terminating, fixing this issue.

If we choose to leave the etcd campaign() implementation as is, I'm slightly worried that failing to clean-up existing sessions and relying on the caller to know to call Stop() might be a nasty surprise for people later on. I think we have two options here:

Document RunForElection() with a clear warning that developers must call Stop() in response to any error sent down the error channel if they want to avoid stalling future elections (across all machines involved in the election)
Keep the change to campaign() that cleans up sessions if the election subsequently fails

If this issue exists in other stores, perhaps we'll need to document this caveat regardless or audit the other implementations to remove the problem?

sgotti · 2019-06-06T11:08:28Z

The Campaign() call fails very reproducibly if the etcd cluster is configured to auto-compact frequently. Because etcd waits for deletes using a watch stream, compaction can terminate the watch when revisions you're watching have been compacted. This was leading to all our sentinels jamming after just a couple of hours running.

Good to know. But I wasn't able to reproduce it and nor create a test case and I never seen this in our clusters 😞

If we choose to leave the etcd campaign() implementation as is, I'm slightly worried that failing to clean-up existing sessions and relying on the caller to know to call Stop() might be a nasty surprise for people later on. I think we have two options here:

These are all good points. Since it's an internal package used only by the sentinel I'm not very worried on api stability or improper use.
So I'll go with documenting. I'll avoid adding another subcontext just to workaround a missing call to Stop().

Some history: I modeled the etcdv3 and k8s store campaign around the github.com/docker/leadership api since it was already used by the previous libkv stores (to avoid adding two different implementations). If you look at it's doc it's not documented the behavior so it must be discovered understanding the code.

lawrencejones · 2019-06-06T11:21:29Z

Since it's an internal package used only by the sentinel I'm not very worried on api stability or improper use. So I'll go with documenting.

Great, going to modify this change so we document the sharp edge of the RunForElection() API and stop each campaign in the loop. Will remove the sub-context in the etcdv3 store if we're assuming this is just part of the API.

Thanks for the feedback!

In e7a0ed8, the etcdv3 election process was modified to ensure the session terminates if the election fails, preventing an awkward situation where sentinels hang indefinitely with no elected leader. After discussion in sorintlab#659, it became clear that most of the store implementations are likely to have similar issues, and instead of fixing the underlying implementations we decided to amend the interface to warn consumers that they need to terminate existing elections before starting new ones. This is a sharp edge of the election API and one that might catch us in a future refactor. This is balanced against a hope that we'll refactor the election interface before a second use case is likely to arise.

lawrencejones · 2019-06-13T16:53:43Z

I think this should now include the changes we discussed @sgotti. Ready for another review!

sgotti

@lawrencejones can you please also add the changes to etcdv3.go in sgotti@4d81843 ?

In e7a0ed8, the etcdv3 election process was modified to ensure the session terminates if the election fails, preventing an awkward situation where sentinels hang indefinitely with no elected leader. After discussion in sorintlab#659, it became clear that most of the store implementations are likely to have similar issues, and instead of fixing the underlying implementations we decided to amend the interface to warn consumers that they need to terminate existing elections before starting new ones. This is a sharp edge of the election API and one that might catch us in a future refactor. This is balanced against a hope that we'll refactor the election interface before a second use case is likely to arise.

lawrencejones · 2019-06-18T14:00:15Z

@sgotti thanks, had forgotten those.

sgotti · 2019-07-03T14:04:31Z

@lawrencejones Thanks. Can you please squash in a single commit and rebase so I can merge it?

lawrencejones · 2019-07-03T17:52:24Z

Do we want to squash these or keep them? I wrote the messages to explain why the previous approach was changed, so it may be worth keeping the commits.

sgotti · 2019-07-09T07:43:02Z

@lawrencejones It just feels strange to keep a commit and then subsequently amend it (the code review is used to iterate until the final version before merging). It could be better to just describe in the final commit why the other approach was not used. But it's not a big issue, we can keep both commit.

lawrencejones · 2019-07-09T12:19:56Z

No this is ok- I'll squash them! My reasoning is that no one knows what the other approach is when you squash the PR, at which point the commit message (and associated explanation) is redundant. I'd usually use the merge commits to signify the final state of what is merged, and leave commits that produced a working state of the code and were viable.

This is not a big deal though, just wanted to explain my reasoning on a general practice level. Always good to share ideas and perspectives!

If etcd returned an error, our previous behaviour would be to report the error and restart an election. The outer loop never responded to the error by stopping the election, which would cancel the context, and instead would restart a new campaign. The first campaigns session will continue to exist on the context the election loop had originally provided it, which is context.Background. We'll come round to perform a new leadership election and the etcdElection.Campaign() method will block waiting for all existing leased keys to be removed from the prefix. Because the old campaign sessions are still alive, they'll continue to renew their leases and ensure those revisions never get deleted, thus blocking our sentinel campaign loop indefinitely. After discussion in sorintlab#659, it became clear that most of the store implementations are likely to have similar issues, and instead of fixing the underlying implementations we decided to amend the interface to warn consumers that they need to terminate existing elections before starting new ones. This is a sharp edge of the election API and one that might catch us in a future refactor. This is balanced against a hope that we'll refactor the election interface before a second use case is likely to arise.

lawrencejones · 2019-07-16T16:14:14Z

Finally remembered to squash and rebase: thanks @sgotti!

sgotti · 2019-07-20T07:07:22Z

@hmac @lawrencejones Thanks! Merging.

If etcd returned an error, our previous behaviour would be to report the error and restart an election. The outer loop never responded to the error by stopping the election, which would cancel the context, and instead would restart a new campaign. The first campaigns session will continue to exist on the context the election loop had originally provided it, which is context.Background. We'll come round to perform a new leadership election and the etcdElection.Campaign() method will block waiting for all existing leased keys to be removed from the prefix. Because the old campaign sessions are still alive, they'll continue to renew their leases and ensure those revisions never get deleted, thus blocking our sentinel campaign loop indefinitely. After discussion in sorintlab#659, it became clear that most of the store implementations are likely to have similar issues, and instead of fixing the underlying implementations we decided to amend the interface to warn consumers that they need to terminate existing elections before starting new ones. This is a sharp edge of the election API and one that might catch us in a future refactor. This is balanced against a hope that we'll refactor the election interface before a second use case is likely to arise.

sgotti added this to the v0.14.0 milestone Jun 6, 2019

sgotti requested changes Jun 14, 2019

View reviewed changes

lawrencejones force-pushed the hmac/fix-hanging-sentinels branch from 2bb7cff to a547003 Compare June 18, 2019 14:00

sgotti approved these changes Jul 3, 2019

View reviewed changes

lawrencejones force-pushed the hmac/fix-hanging-sentinels branch from a547003 to beadd2a Compare July 16, 2019 16:13

sgotti merged commit 54df13a into sorintlab:master Jul 20, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix hanging sentinels #659

Fix hanging sentinels #659

Uh oh!

hmac commented Jun 3, 2019

Uh oh!

sgotti commented Jun 4, 2019

Uh oh!

hmac commented Jun 4, 2019

Uh oh!

lawrencejones commented Jun 5, 2019 •

edited

Loading

Uh oh!

sgotti commented Jun 6, 2019

Uh oh!

lawrencejones commented Jun 6, 2019 •

edited

Loading

Uh oh!

lawrencejones commented Jun 13, 2019

Uh oh!

sgotti left a comment

Uh oh!

lawrencejones commented Jun 18, 2019

Uh oh!

sgotti commented Jul 3, 2019

Uh oh!

lawrencejones commented Jul 3, 2019

Uh oh!

sgotti commented Jul 9, 2019

Uh oh!

lawrencejones commented Jul 9, 2019

Uh oh!

lawrencejones commented Jul 16, 2019

Uh oh!

sgotti commented Jul 20, 2019

Uh oh!

Uh oh!

Fix hanging sentinels #659

Fix hanging sentinels #659

Uh oh!

Conversation

hmac commented Jun 3, 2019

Uh oh!

sgotti commented Jun 4, 2019

Uh oh!

hmac commented Jun 4, 2019

Uh oh!

lawrencejones commented Jun 5, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sgotti commented Jun 6, 2019

Uh oh!

lawrencejones commented Jun 6, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lawrencejones commented Jun 13, 2019

Uh oh!

sgotti left a comment

Choose a reason for hiding this comment

Uh oh!

lawrencejones commented Jun 18, 2019

Uh oh!

sgotti commented Jul 3, 2019

Uh oh!

lawrencejones commented Jul 3, 2019

Uh oh!

sgotti commented Jul 9, 2019

Uh oh!

lawrencejones commented Jul 9, 2019

Uh oh!

lawrencejones commented Jul 16, 2019

Uh oh!

sgotti commented Jul 20, 2019

Uh oh!

Uh oh!

lawrencejones commented Jun 5, 2019 •

edited

Loading

lawrencejones commented Jun 6, 2019 •

edited

Loading