Orchestrator prematurely promotes lagging replica when configured with MariaDB GTID

Hi!

(This issue was created after a follow-up discussion on the MySQL Community Slack: https://mysqlcommunity.slack.com/archives/C9AB5JVNG/p1604228957063100)

I am setting up automatic failovers on MariaDB and chose orchestrator as the tool for it. My tests so far are working great. However, while testing for data loss on failover, I can't get `FailMasterPromotionIfSQLThreadNotUpToDate` or `DelayMasterPromotionIfSQLThreadNotUpToDate` to work.

Test environment:
* primary with GTID and semi-sync enabled (with a very large semi-sync timeout; I would rather be down than lose data).
* secondaries with GTID and semi-sync enabled (`relay_log_purge=0` and `relay-log-recovery=0`).

One of my tests is meant to prove that orchestrator lets the candidate replica apply all of its relay logs before orchestrator resets the replica's configuration and promotes it to primary.
To do that, I run sysbench on a fourth node with enough workload to make the replicas lag. Once they are lagging by about 30 seconds, I shut down the primary node.
After the primary is shut down, orchestrator promotes one replica, but it neither waits for the replica to apply all the relay logs (with `DelayMasterPromotionIfSQLThreadNotUpToDate`) nor aborts the failover (with `FailMasterPromotionIfSQLThreadNotUpToDate`).

It seems that when `master_use_gtid` is set to `current_pos` (or `slave_pos`) and both replica threads (I/O and SQL) are restarted, the I/O thread purges all relay logs and resumes fetching from the last position the SQL thread has applied.

In orchestrator, this happens when `RestartReplicationQuick` restarts both threads [here](https://github.com/openark/orchestrator/blob/master/go/inst/instance_topology_dao.go#L244). Because the I/O thread and the SQL thread then report the same binlog coordinates, orchestrator concludes that the replica is in sync and proceeds with the failover [here](https://github.com/openark/orchestrator/blob/37c255e150545b6176c00186471e39e005930638/go/logic/topology_recovery.go#L864-L872).

To recap:

- this has nothing to do with heartbeat/lag evaluation
- in MariaDB
- upon master failure
- orchestrator runs a `RestartReplicationQuick` (stop slave sql_thread, stop slave io_thread, start slave io_thread, start slave sql_thread). This happens specifically when replicas are lagging at the time the master fails
- causing the IO thread to jump back to an earlier position (this only happens with MariaDB)
- causing both SQL and IO threads to point to same position
- leading orchestrator to think SQL thread is up to date with IO thread
- promoting the replica prematurely
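
To make the chain above concrete, here is a minimal Go simulation of it. This is only a sketch of the behaviour described in this issue, not orchestrator's actual code: the `Coordinates` and `Replica` types, their fields, and the methods `restartReplication` and `sqlThreadUpToDate` are all hypothetical names invented for illustration, and the binlog file names and positions are made up.

```go
package main

import "fmt"

// Coordinates is a simplified binlog position (hypothetical type for this sketch).
type Coordinates struct {
	LogFile string
	LogPos  int64
}

// Replica models only the two positions that matter for this issue.
type Replica struct {
	UsesMariaDBGTID bool
	ReadCoords      Coordinates // how far the IO thread has fetched
	ExecCoords      Coordinates // how far the SQL thread has applied
}

// restartReplication models stopping and starting both replica threads.
// Assumption taken from the issue: with MariaDB GTID the IO thread drops
// the unapplied relay logs and resumes from the SQL thread's position.
func (r *Replica) restartReplication() {
	if r.UsesMariaDBGTID {
		r.ReadCoords = r.ExecCoords
	}
}

// sqlThreadUpToDate is the naive check: the SQL thread counts as caught up
// when both threads report the same coordinates.
func (r *Replica) sqlThreadUpToDate() bool {
	return r.ReadCoords == r.ExecCoords
}

func main() {
	// A lagging replica: fetched far more than it has applied.
	r := Replica{
		UsesMariaDBGTID: true,
		ReadCoords:      Coordinates{LogFile: "mysql-bin.000010", LogPos: 9000},
		ExecCoords:      Coordinates{LogFile: "mysql-bin.000010", LogPos: 1000},
	}
	fmt.Println("up to date before restart:", r.sqlThreadUpToDate())

	r.restartReplication()
	fmt.Println("up to date after restart: ", r.sqlThreadUpToDate())
}
```

In this model the check reports the replica as lagging before the restart and as caught up after it, even though nothing was applied in between. That is the false positive that lets the promotion go ahead.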

Thank you!

