-
Notifications
You must be signed in to change notification settings - Fork 446
Fix hanging sentinels #659
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix hanging sentinels #659
Conversation
@hmac Great catch! Such condition should only happen if inside But, I think this case will happen also with the other stores (libkv based) and probably the solution is just to Stop the election when an error occurs inside the sentinel election loop instead of adding another subcontext inside etcdv3 campaign function. Can you please try with this patch? |
@sgotti thanks! We'll try this out and get back to you. |
Hey @sgotti- it makes sense for me to take this over from @hmac as I did the initial work catching this.
The
If this is an issue with the other stores then this makes a lot of sense and calling If we choose to leave the etcd
If this issue exists in other stores, perhaps we'll need to document this caveat regardless or audit the other implementations to remove the problem? |
Good to know. But I wasn't able to reproduce it and nor create a test case and I never seen this in our clusters 😞
These are all good points. Since it's an internal package used only by the sentinel I'm not very worried on api stability or improper use. Some history: I modeled the etcdv3 and k8s store campaign around the |
Great, going to modify this change so we document the sharp edge of the Thanks for the feedback! |
In e7a0ed8, the etcdv3 election process was modified to ensure the session terminates if the election fails, preventing an awkward situation where sentinels hang indefinitely with no elected leader. After discussion in sorintlab#659, it became clear that most of the store implementations are likely to have similar issues, and instead of fixing the underlying implementations we decided to amend the interface to warn consumers that they need to terminate existing elections before starting new ones. This is a sharp edge of the election API and one that might catch us in a future refactor. This is balanced against a hope that we'll refactor the election interface before a second use case is likely to arise.
I think this should now include the changes we discussed @sgotti. Ready for another review! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lawrencejones can you please also add the changes to etcdv3.go in sgotti@4d81843 ?
2bb7cff
to
a547003
Compare
In e7a0ed8, the etcdv3 election process was modified to ensure the session terminates if the election fails, preventing an awkward situation where sentinels hang indefinitely with no elected leader. After discussion in sorintlab#659, it became clear that most of the store implementations are likely to have similar issues, and instead of fixing the underlying implementations we decided to amend the interface to warn consumers that they need to terminate existing elections before starting new ones. This is a sharp edge of the election API and one that might catch us in a future refactor. This is balanced against a hope that we'll refactor the election interface before a second use case is likely to arise.
@sgotti thanks, had forgotten those. |
@lawrencejones Thanks. Can you please squash in a single commit and rebase so I can merge it? |
Do we want to squash these or keep them? I wrote the messages to explain why the previous approach was changed, so it may be worth keeping the commits. |
@lawrencejones It just feels strange to keep a commit and then subsequently amend it (the code review is used to iterate until the final version before merging). It could be better to just describe in the final commit why the other approach was not used. But it's not a big issue, we can keep both commit. |
No this is ok- I'll squash them! My reasoning is that no one knows what the other approach is when you squash the PR, at which point the commit message (and associated explanation) is redundant. I'd usually use the merge commits to signify the final state of what is merged, and leave commits that produced a working state of the code and were viable. This is not a big deal though, just wanted to explain my reasoning on a general practice level. Always good to share ideas and perspectives! |
If etcd returned an error, our previous behaviour would be to report the error and restart an election. The outer loop never responded to the error by stopping the election, which would cancel the context, and instead would restart a new campaign. The first campaigns session will continue to exist on the context the election loop had originally provided it, which is context.Background. We'll come round to perform a new leadership election and the etcdElection.Campaign() method will block waiting for all existing leased keys to be removed from the prefix. Because the old campaign sessions are still alive, they'll continue to renew their leases and ensure those revisions never get deleted, thus blocking our sentinel campaign loop indefinitely. After discussion in sorintlab#659, it became clear that most of the store implementations are likely to have similar issues, and instead of fixing the underlying implementations we decided to amend the interface to warn consumers that they need to terminate existing elections before starting new ones. This is a sharp edge of the election API and one that might catch us in a future refactor. This is balanced against a hope that we'll refactor the election interface before a second use case is likely to arise.
a547003
to
beadd2a
Compare
Finally remembered to squash and rebase: thanks @sgotti! |
@hmac @lawrencejones Thanks! Merging. |
If etcd returned an error, our previous behaviour would be to report the error and restart an election. The outer loop never responded to the error by stopping the election, which would cancel the context, and instead would restart a new campaign. The first campaigns session will continue to exist on the context the election loop had originally provided it, which is context.Background. We'll come round to perform a new leadership election and the etcdElection.Campaign() method will block waiting for all existing leased keys to be removed from the prefix. Because the old campaign sessions are still alive, they'll continue to renew their leases and ensure those revisions never get deleted, thus blocking our sentinel campaign loop indefinitely. After discussion in sorintlab#659, it became clear that most of the store implementations are likely to have similar issues, and instead of fixing the underlying implementations we decided to amend the interface to warn consumers that they need to terminate existing elections before starting new ones. This is a sharp edge of the election API and one that might catch us in a future refactor. This is balanced against a hope that we'll refactor the election interface before a second use case is likely to arise.
This change fixes an issue we were seeing where our sentinels would occasionally freeze, not responding to etcd changes or making any decisions. This could only be resolved by restarting them all.
We tracked this down to an issue with how the sentinels use etcd elections to determine the leader. An error interacting with etcd can lead to old election keys hanging around, blocking any future elections from completing. The end result is that the sentinels are unable to elect a leader and cannot make any progress.
There's more detail on the specifics of this problem in the commit message, which I'll reproduce here:
The fix is subtle, but hopefully not too controversial. I'd love to get some feedback on whether this makes sense to you and if there's anything you'd like us to add or change. Ideally we'd have a test for this but I think this is going to be a very hard thing to test...