Skip to content

Conversation

hmac
Copy link
Contributor

@hmac hmac commented Jun 3, 2019

This change fixes an issue we were seeing where our sentinels would occasionally freeze, not responding to etcd changes or making any decisions. This could only be resolved by restarting them all.

We tracked this down to an issue with how the sentinels use etcd elections to determine the leader. An error interacting with etcd can lead to old election keys hanging around, blocking any future elections from completing. The end result is that the sentinels are unable to elect a leader and cannot make any progress.

There's more detail on the specifics of this problem in the commit message, which I'll reproduce here:

If etcd returned an error, our previous behaviour would be to report the
error and restart an election. The outer loop never responded to the
error by stopping the election, which would cancel the context, and
instead would restart a new campaign.

The first campaigns session will continue to exist on the context the
election loop had originally provided it, which is context.Background.
We'll come round to perform a new leadership election and the
etcdElection.Campaign() method will block waiting for all existing
leased keys to be removed from the prefix. Because the old campaign
sessions are still alive, they'll continue to renew their leases and
ensure those revisions never get deleted, thus blocking our sentinel
campaign loop indefinitely.

This change ensures we end our session whenever we return from the
campaign function which should ensure leased values can't persist beyond
the lifetime of our campaign. In future we should look to simplify this
election interface, as I have no confidence that we don't have other
bugs of a similar vein waiting to catch us.

The fix is subtle, but hopefully not too controversial. I'd love to get some feedback on whether this makes sense to you and if there's anything you'd like us to add or change. Ideally we'd have a test for this but I think this is going to be a very hard thing to test...

@sgotti
Copy link
Member

sgotti commented Jun 4, 2019

@hmac Great catch!

Such condition should only happen if inside campaign there's an error on NewSession or NewElection since in such case the session is not closed (but I wasn't able to reproduce such error condition).

But, I think this case will happen also with the other stores (libkv based) and probably the solution is just to Stop the election when an error occurs inside the sentinel election loop instead of adding another subcontext inside etcdv3 campaign function.

Can you please try with this patch?

@hmac
Copy link
Contributor Author

hmac commented Jun 4, 2019

@sgotti thanks! We'll try this out and get back to you.

@lawrencejones
Copy link
Contributor

lawrencejones commented Jun 5, 2019

Hey @sgotti- it makes sense for me to take this over from @hmac as I did the initial work catching this.

Such condition should only happen if inside campaign there's an error on NewSession or NewElection since in such case the session is not closed (but I wasn't able to reproduce such error condition).

The Campaign() call fails very reproducibly if the etcd cluster is configured to auto-compact frequently. Because etcd waits for deletes using a watch stream, compaction can terminate the watch when revisions you're watching have been compacted. This was leading to all our sentinels jamming after just a couple of hours running.

But, I think this case will happen also with the other stores (libkv based) and probably the solution is just to Stop the election when an error occurs inside the sentinel election loop instead of adding another subcontext inside etcdv3 campaign function.

If this is an issue with the other stores then this makes a lot of sense and calling Stop() will cancel the primary context, leading to the session terminating, fixing this issue.

If we choose to leave the etcd campaign() implementation as is, I'm slightly worried that failing to clean-up existing sessions and relying on the caller to know to call Stop() might be a nasty surprise for people later on. I think we have two options here:

  • Document RunForElection() with a clear warning that developers must call Stop() in response to any error sent down the error channel if they want to avoid stalling future elections (across all machines involved in the election)
  • Keep the change to campaign() that cleans up sessions if the election subsequently fails

If this issue exists in other stores, perhaps we'll need to document this caveat regardless or audit the other implementations to remove the problem?

@sgotti
Copy link
Member

sgotti commented Jun 6, 2019

The Campaign() call fails very reproducibly if the etcd cluster is configured to auto-compact frequently. Because etcd waits for deletes using a watch stream, compaction can terminate the watch when revisions you're watching have been compacted. This was leading to all our sentinels jamming after just a couple of hours running.

Good to know. But I wasn't able to reproduce it and nor create a test case and I never seen this in our clusters 😞

If we choose to leave the etcd campaign() implementation as is, I'm slightly worried that failing to clean-up existing sessions and relying on the caller to know to call Stop() might be a nasty surprise for people later on. I think we have two options here:

These are all good points. Since it's an internal package used only by the sentinel I'm not very worried on api stability or improper use.
So I'll go with documenting. I'll avoid adding another subcontext just to workaround a missing call to Stop().

Some history: I modeled the etcdv3 and k8s store campaign around the github.com/docker/leadership api since it was already used by the previous libkv stores (to avoid adding two different implementations). If you look at it's doc it's not documented the behavior so it must be discovered understanding the code.

@lawrencejones
Copy link
Contributor

lawrencejones commented Jun 6, 2019

Since it's an internal package used only by the sentinel I'm not very worried on api stability or improper use. So I'll go with documenting.

Great, going to modify this change so we document the sharp edge of the RunForElection() API and stop each campaign in the loop. Will remove the sub-context in the etcdv3 store if we're assuming this is just part of the API.

Thanks for the feedback!

@sgotti sgotti added this to the v0.14.0 milestone Jun 6, 2019
lawrencejones added a commit to gocardless/stolon that referenced this pull request Jun 13, 2019
In e7a0ed8, the etcdv3 election process was modified to ensure the
session terminates if the election fails, preventing an awkward
situation where sentinels hang indefinitely with no elected leader.

After discussion in sorintlab#659, it became clear that most of the store
implementations are likely to have similar issues, and instead of fixing
the underlying implementations we decided to amend the interface to warn
consumers that they need to terminate existing elections before starting
new ones.

This is a sharp edge of the election API and one that might catch us in
a future refactor. This is balanced against a hope that we'll refactor
the election interface before a second use case is likely to arise.
@lawrencejones
Copy link
Contributor

I think this should now include the changes we discussed @sgotti. Ready for another review!

Copy link
Member

@sgotti sgotti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lawrencejones can you please also add the changes to etcdv3.go in sgotti@4d81843 ?

@lawrencejones lawrencejones force-pushed the hmac/fix-hanging-sentinels branch from 2bb7cff to a547003 Compare June 18, 2019 14:00
lawrencejones added a commit to gocardless/stolon that referenced this pull request Jun 18, 2019
In e7a0ed8, the etcdv3 election process was modified to ensure the
session terminates if the election fails, preventing an awkward
situation where sentinels hang indefinitely with no elected leader.

After discussion in sorintlab#659, it became clear that most of the store
implementations are likely to have similar issues, and instead of fixing
the underlying implementations we decided to amend the interface to warn
consumers that they need to terminate existing elections before starting
new ones.

This is a sharp edge of the election API and one that might catch us in
a future refactor. This is balanced against a hope that we'll refactor
the election interface before a second use case is likely to arise.
@lawrencejones
Copy link
Contributor

@sgotti thanks, had forgotten those.

@sgotti
Copy link
Member

sgotti commented Jul 3, 2019

@lawrencejones Thanks. Can you please squash in a single commit and rebase so I can merge it?

@lawrencejones
Copy link
Contributor

Do we want to squash these or keep them? I wrote the messages to explain why the previous approach was changed, so it may be worth keeping the commits.

@sgotti
Copy link
Member

sgotti commented Jul 9, 2019

@lawrencejones It just feels strange to keep a commit and then subsequently amend it (the code review is used to iterate until the final version before merging). It could be better to just describe in the final commit why the other approach was not used. But it's not a big issue, we can keep both commit.

@lawrencejones
Copy link
Contributor

No this is ok- I'll squash them! My reasoning is that no one knows what the other approach is when you squash the PR, at which point the commit message (and associated explanation) is redundant. I'd usually use the merge commits to signify the final state of what is merged, and leave commits that produced a working state of the code and were viable.

This is not a big deal though, just wanted to explain my reasoning on a general practice level. Always good to share ideas and perspectives!

If etcd returned an error, our previous behaviour would be to report the
error and restart an election. The outer loop never responded to the
error by stopping the election, which would cancel the context, and
instead would restart a new campaign.

The first campaigns session will continue to exist on the context the
election loop had originally provided it, which is context.Background.
We'll come round to perform a new leadership election and the
etcdElection.Campaign() method will block waiting for all existing
leased keys to be removed from the prefix. Because the old campaign
sessions are still alive, they'll continue to renew their leases and
ensure those revisions never get deleted, thus blocking our sentinel
campaign loop indefinitely.

After discussion in sorintlab#659, it became clear that most of the store
implementations are likely to have similar issues, and instead of fixing
the underlying implementations we decided to amend the interface to warn
consumers that they need to terminate existing elections before starting
new ones.

This is a sharp edge of the election API and one that might catch us in
a future refactor. This is balanced against a hope that we'll refactor
the election interface before a second use case is likely to arise.
@lawrencejones lawrencejones force-pushed the hmac/fix-hanging-sentinels branch from a547003 to beadd2a Compare July 16, 2019 16:13
@lawrencejones
Copy link
Contributor

Finally remembered to squash and rebase: thanks @sgotti!

@sgotti
Copy link
Member

sgotti commented Jul 20, 2019

@hmac @lawrencejones Thanks! Merging.

@sgotti sgotti merged commit 54df13a into sorintlab:master Jul 20, 2019
johannesboon pushed a commit to johannesboon/stolon that referenced this pull request Dec 25, 2019
If etcd returned an error, our previous behaviour would be to report the
error and restart an election. The outer loop never responded to the
error by stopping the election, which would cancel the context, and
instead would restart a new campaign.

The first campaigns session will continue to exist on the context the
election loop had originally provided it, which is context.Background.
We'll come round to perform a new leadership election and the
etcdElection.Campaign() method will block waiting for all existing
leased keys to be removed from the prefix. Because the old campaign
sessions are still alive, they'll continue to renew their leases and
ensure those revisions never get deleted, thus blocking our sentinel
campaign loop indefinitely.

After discussion in sorintlab#659, it became clear that most of the store
implementations are likely to have similar issues, and instead of fixing
the underlying implementations we decided to amend the interface to warn
consumers that they need to terminate existing elections before starting
new ones.

This is a sharp edge of the election API and one that might catch us in
a future refactor. This is balanced against a hope that we'll refactor
the election interface before a second use case is likely to arise.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants