reset-master operation: wait for replication to stop #762
go/inst/instance_dao.go
Outdated
ioThreadRunning := (m.GetString("Slave_IO_Running") == "Yes")
sqlThreadRunning := (m.GetString("Slave_SQL_Running") == "Yes")
replicationThreadsRunning = ioThreadRunning && sqlThreadRunning
ioThreadRunning = (m.GetString("Slave_IO_Running") == "Yes")
An issue with this is that you're deciding that Slave_*_Running = Yes is the boolean check against which to decide that replication is fully running or fully stopped. I'm not sure about Slave_SQL_Running's options, but Slave_IO_Running can also be "Connecting". So that's at least one case where this check would say that replication is not running even though the IO thread is (or is starting to, or trying to).
That's a good point.
fixed
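To illustrate the concern raised in this review thread, here is a minimal sketch of a three-way classification that does not treat "anything other than Yes" as stopped. The names `threadState`, `classifyThread` and `replicationFullyStopped` are hypothetical and are not taken from this PR; the sketch only assumes the status strings mentioned above ("Yes", "No", "Connecting").

```go
package main

import "fmt"

// threadState is a hypothetical three-way classification of a replication thread.
type threadState int

const (
	threadStopped threadState = iota // status column reports "No"
	threadRunning                    // status column reports "Yes"
	threadOther                      // anything else, e.g. "Connecting" for the IO thread
)

// classifyThread maps a Slave_*_Running value to a threadState.
func classifyThread(status string) threadState {
	switch status {
	case "Yes":
		return threadRunning
	case "No":
		return threadStopped
	default:
		return threadOther
	}
}

// replicationFullyStopped reports true only when both threads explicitly report "No".
// A value such as "Connecting" is therefore not mistaken for a stopped thread.
func replicationFullyStopped(ioStatus, sqlStatus string) bool {
	return classifyThread(ioStatus) == threadStopped &&
		classifyThread(sqlStatus) == threadStopped
}

func main() {
	fmt.Println(replicationFullyStopped("No", "No"))         // true
	fmt.Println(replicationFullyStopped("Connecting", "No")) // false
	fmt.Println(replicationFullyStopped("Yes", "Yes"))       // false
}
```

Checking for an explicit "No" rather than negating a "Yes" comparison is one way to avoid the ambiguity; the actual fix in this PR may differ.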
Co-Authored-By: shlomi-noach <shlomi-noach@github.com>
I will merge this PR even though we haven't yet verified the reasoning for the crash. Reiterating an internal issue, @ggunson suggests the retries can be the cause of the crash, as follows:
We're yet to reproduce this reliably, but will not be working on this actively in the short term.
As pointed out by @ggunson, ErrantGTIDResetMaster() issues a reset master immediately following a stop slave operation, but without verifying that replication has indeed stopped, e.g. the SQL thread could still be busy.
We've seen crashes in production when running reset master.
In this PR we actively wait (or time out) for replication to stop before running reset master.