reset-master operation: wait for replication to stop #762

shlomi-noach · 2018-12-19T08:01:52Z

As pointed out by @ggunson, ErrantGTIDResetMaster() issues a reset master immediately after a stop slave operation, without verifying that replication has actually stopped. For example, the SQL thread could still be busy.

We've seen crashes in production when running reset master.

In this PR we actively wait for replication to stop (or time out) before running reset master.
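A minimal sketch of the wait-or-time-out idea, not the code in this PR. The helper name `waitForCondition`, its parameters, and the stand-in check are assumptions made for illustration; a real check would query the replica's status instead of counting polls.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// waitForCondition is a hypothetical helper. It polls check every interval
// until check reports true, check fails, or timeout has elapsed.
func waitForCondition(check func() (bool, error), interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		done, err := check()
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		if time.Now().After(deadline) {
			return errors.New("timed out waiting for replication to stop")
		}
		time.Sleep(interval)
	}
}

func main() {
	// Stand-in for a real check that would inspect the replication threads.
	// Here replication is reported as stopped on the third poll.
	polls := 0
	replicationStopped := func() (bool, error) {
		polls++
		return polls >= 3, nil
	}

	if err := waitForCondition(replicationStopped, 10*time.Millisecond, time.Second); err != nil {
		fmt.Println("not running reset master:", err)
		return
	}
	fmt.Println("replication stopped after", polls, "polls; safe to proceed")
}
```

The caller only proceeds to reset master when the helper returns nil; on timeout it gets an error and can abort instead.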

ggunson · 2018-12-19T22:41:07Z

go/inst/instance_dao.go

-		ioThreadRunning := (m.GetString("Slave_IO_Running") == "Yes")
-		sqlThreadRunning := (m.GetString("Slave_SQL_Running") == "Yes")
-		replicationThreadsRunning = ioThreadRunning && sqlThreadRunning
+		ioThreadRunning = (m.GetString("Slave_IO_Running") == "Yes")


One issue here: this treats Slave_*_Running = Yes as the boolean check that decides whether replication is fully running or fully stopped.

I'm not sure about the possible values of Slave_SQL_Running, but Slave_IO_Running can also be "Connecting". That's at least one case where this check would report that replication is not running even though the IO thread is running (or starting, or trying to connect).
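To illustrate the point, here is a sketch that maps the status string to three states instead of a boolean, so that "Connecting" is never mistaken for "stopped". The type and function names are hypothetical and are not taken from this PR. Treating every value other than "Yes" and "No" as "other" is an assumption chosen to be conservative.

```go
package main

import "fmt"

// threadState is a hypothetical three-way state for a replication thread.
type threadState int

const (
	threadStopped threadState = iota // status column reads "No"
	threadRunning                    // status column reads "Yes"
	threadOther                      // anything else, e.g. "Connecting"
)

// parseThreadState converts a Slave_IO_Running / Slave_SQL_Running value
// into a threadState. Unknown values are deliberately not treated as stopped.
func parseThreadState(value string) threadState {
	switch value {
	case "No":
		return threadStopped
	case "Yes":
		return threadRunning
	default:
		return threadOther
	}
}

// replicationFullyStopped is true only when both threads are explicitly stopped.
func replicationFullyStopped(ioValue, sqlValue string) bool {
	return parseThreadState(ioValue) == threadStopped &&
		parseThreadState(sqlValue) == threadStopped
}

func main() {
	fmt.Println(replicationFullyStopped("Connecting", "No")) // false
	fmt.Println(replicationFullyStopped("No", "No"))         // true
}
```

With a plain `== "Yes"` comparison, the first case would count the IO thread as not running. With the three-way state it is not considered stopped.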

That's a good point.

Co-Authored-By: shlomi-noach <shlomi-noach@github.com>

shlomi-noach · 2018-12-23T06:01:43Z

I will merge this PR even though we haven't yet verified the cause of the reset master crash, or at least haven't found a reliable way to reproduce it.

Reiterating an internal issue: @ggunson suggests the retries could be the cause of the crash, as follows:

- a 1st connection breaks on Connection Invalid
- but the reset master it invoked is still underway
- orchestrator issues a retry, running a 2nd, concurrent reset master
- crash

We have yet to reproduce this reliably, and will not be working on it actively in the short term.
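The retry scenario suggested above can be sketched as follows. This is an illustration of the hypothesis only, not orchestrator's retry code: `execWithRetries`, `errConnectionInvalid`, and the stand-in operation are hypothetical names. The sketch shows that capping attempts at 1 means a statement that may still be running on the server is never issued a second time.

```go
package main

import (
	"errors"
	"fmt"
)

// errConnectionInvalid stands in for a client-side connection failure.
var errConnectionInvalid = errors.New("connection invalid")

// execWithRetries is a hypothetical helper. It calls op up to maxAttempts
// times, stopping at the first success, and returns the number of attempts
// made together with the last error.
func execWithRetries(maxAttempts int, op func() error) (attempts int, err error) {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	for attempts < maxAttempts {
		attempts++
		if err = op(); err == nil {
			return attempts, nil
		}
	}
	return attempts, err
}

func main() {
	// Stand-in for a reset master whose connection breaks while the server
	// may still be working on the statement.
	issued := 0
	resetMaster := func() error {
		issued++
		return errConnectionInvalid
	}

	// With retries, the statement is issued again after each failure.
	execWithRetries(3, resetMaster)
	fmt.Println("with retries, statements issued:", issued)

	// With a single attempt, the error is surfaced and nothing is reissued.
	issued = 0
	_, err := execWithRetries(1, resetMaster)
	fmt.Println("single attempt, statements issued:", issued, "error:", err)
}
```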

Shlomi Noach added 2 commits December 19, 2018 09:59

reset-master operation: wait for replication to stop

9fac01f

language

8cd8113

shlomi-noach temporarily deployed to production/mysql_cluster=conductor December 19, 2018 08:05 Inactive

Shlomi Noach added 2 commits December 19, 2018 10:19

comments

3936ad1

wording

e97a5cf

shlomi-noach temporarily deployed to production/mysql_cluster=conductor December 19, 2018 08:25 Inactive

increase sleep interval

6a14635

shlomi-noach temporarily deployed to production/mysql_cluster=conductor December 19, 2018 09:22 Inactive

tomkrouper reviewed Dec 19, 2018


ggunson reviewed Dec 19, 2018


Update go/inst/instance_dao.go

c117c5a

Co-Authored-By: shlomi-noach <shlomi-noach@github.com>

shlomi-noach temporarily deployed to production/mysql_cluster=conductor December 20, 2018 06:02 Inactive

more accurate replication thread state expectation

61c07dd

shlomi-noach temporarily deployed to production/mysql_cluster=conductor December 20, 2018 11:07 Inactive

really long interval time

6067d5f

shlomi-noach temporarily deployed to production/mysql_cluster=conductor December 23, 2018 05:56 Inactive

Merge branch 'master' into reset-master-wait-for-replication-to-stop

69e1cb7

shlomi-noach temporarily deployed to production/mysql_cluster=conductor December 23, 2018 13:33 Inactive

NoReplicationThreadRunning

302bfc4

shlomi-noach temporarily deployed to production/mysql_cluster=conductor December 24, 2018 06:09 Inactive

shlomi-noach merged commit 82757e9 into master Dec 24, 2018

shlomi-noach deleted the reset-master-wait-for-replication-to-stop branch December 24, 2018 06:18

shlomi-noach mentioned this pull request Dec 24, 2018

Experiment: No retries for RESET MASTER on errant GTID cleanup #761

Closed

ggunson mentioned this pull request Dec 26, 2018

Incorrect usage of ReplicaRunning() #763

Closed
