Skip to content
This repository was archived by the owner on Feb 18, 2025. It is now read-only.
This repository was archived by the owner on Feb 18, 2025. It is now read-only.

Incorrect usage of ReplicaRunning() #763

@ggunson

Description

@ggunson

In investigating #761 with the concern that RESET MASTER might be being issued before replication had fully stopped, I found that the check for whether replication was running or not wasn't complete.

The check used in some parts of orchestrator to tell that replication was running on a server was:

https://github.com/github/orchestrator/blob/a58ad0fdd55b42fddf4553f7db02002db9d6c9d6/go/inst/instance.go#L256-L259

The function returns true if the host is a replica and both sql and IO threads are "Yes". It can then return false if one thread is still running (or the IO thread is "Connecting") as well as when both threads are stopped. So ReplicaRunning() is an invalid function to use to check that replication has 100% stopped.

Currently the function is being used in a few places, but the places where I see a possible concern are:

  • SyncReplicaRelayLogs() -- if the IO thread is still running then the relay log contents are still being written to, and if the SQL thread is still on then it may still be applying relay log contents.

  • StartSlaveUntilMasterCoordinates() -- orc might try running start slave until on a host where one of the threads is already running, which in the console results in only a warning and no change to the until values.

  • EnableMasterSSL() -- CHANGE MASTER could try executing on a replica with one of the threads running (which should error in this case)*

  • ChangeMasterTo() -- similar to above

  • ResetSlave() -- RESET SLAVE could be run on a replica where one of the threads is still running, which I believe would result in an error but you'll want to double-check

  • ResetMaster() -- RESET MASTER could happen while, say, the SQL thread is still running, so writes could happen after the RESET MASTER that you didn't intend (and if you're going to change gtid_purged or something after this it'll be incorrect)

  • DetachReplica() -- orc could detach a replica from a topology with the SQL thread still running and get the logged coordinates for the CHANGE MASTER wrong

  • ReattachReplica() -- less likely as the previous function but it still just checks for both threads running

*Note with MySQL 5.7 CHANGE MASTER will run in some cases when replication is still partially up (either IO or SQL threads running). See https://dev.mysql.com/doc/refman/5.7/en/change-master-to.html starting at "In MySQL 5.7.4 and later".

/cc #762
/cc #761
/cc @github/database-infrastructure

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions