Incorrect usage of ReplicaRunning()

In investigating https://github.com/github/orchestrator/pull/761 with the concern that `RESET MASTER` might be being issued before replication had fully stopped, I found that the check for whether replication was running or not wasn't complete.

The check used in some parts of orchestrator to tell that replication was running on a server was:

https://github.com/github/orchestrator/blob/a58ad0fdd55b42fddf4553f7db02002db9d6c9d6/go/inst/instance.go#L256-L259

The function returns true if the host is a replica and both sql and IO threads are "Yes". It can then return false if one thread is still running (or the IO thread is "Connecting") as well as when both threads are stopped. So ReplicaRunning() is an invalid function to use to check that replication has 100% stopped.

Currently the function is being used in a few places, but the places where I see a possible concern are:

- [SyncReplicaRelayLogs()](https://github.com/github/orchestrator/blob/master/go/agent/instance_topology_agent.go#L27) -- if the IO thread is still running then the relay log contents are still being written to, and if the SQL thread is still on then it may still be applying relay log contents.

- [StartSlaveUntilMasterCoordinates()](https://github.com/github/orchestrator/blob/a58ad0fdd55b42fddf4553f7db02002db9d6c9d6/go/inst/instance_topology_dao.go#L479) -- orc might try running `start slave until` on a host where one of the threads is already running, which in the console results in only a warning and **no change** to the `until` values. 

- [EnableMasterSSL()](https://github.com/github/orchestrator/blob/master/go/inst/instance_topology_dao.go#L578) -- `CHANGE MASTER` could try executing on a replica with one of the threads running (which should error in this case)*

- [ChangeMasterTo()](https://github.com/github/orchestrator/blob/master/go/inst/instance_topology_dao.go#L605) -- similar to above

- [ResetSlave()](https://github.com/github/orchestrator/blob/master/go/inst/instance_topology_dao.go#L706) -- [RESET SLAVE](https://dev.mysql.com/doc/refman/5.7/en/reset-slave.html) could be run on a replica where one of the threads is still running, which I believe would result in an error but you'll want to double-check

- [ResetMaster()](https://github.com/github/orchestrator/blob/master/go/inst/instance_topology_dao.go#L739) -- RESET MASTER could happen while, say, the SQL thread is still running, so writes could happen after the RESET MASTER that you didn't intend (and if you're going to change gtid_purged or something after this it'll be incorrect)

- [DetachReplica()](https://github.com/github/orchestrator/blob/master/go/inst/instance_topology_dao.go#L867) -- orc could detach a replica from a topology with the SQL thread still running and get the logged coordinates for the `CHANGE MASTER` wrong

- [ReattachReplica()](https://github.com/github/orchestrator/blob/master/go/inst/instance_topology_dao.go#L901) -- less likely as the previous function but it still just checks for both threads running

*Note with MySQL 5.7 `CHANGE MASTER` will run in some cases when replication is still partially up (either IO or SQL threads running). See https://dev.mysql.com/doc/refman/5.7/en/change-master-to.html starting at "In MySQL 5.7.4 and later". 

/cc https://github.com/github/orchestrator/pull/762 
/cc https://github.com/github/orchestrator/pull/761
/cc @github/database-infrastructure 


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Incorrect usage of ReplicaRunning() #763

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Incorrect usage of ReplicaRunning() #763

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions