Hook for take-master / GracefulIntermediateMasterTakeover #799
Description
Hello Shlomi, we have the following test topology and db hosts:
1 MASTER --> shadowmaster (has all schemas)
3 SLAVES --> sm hosts (have only certain schemas being replicated, but they all have the same filter)
Initial topology:
shadowmaster:3306 [unknown,invalid,Unknown,rw,nobinlog,downtimed]
+ sm-ohq-applogdb-1:3306 [0s,ok,10.3.12-MariaDB-log,rw,MIXED,>>]
  + sm-atl-applogdb-3:3306 [0s,ok,10.3.12-MariaDB-log,ro,MIXED,>>]
  + sm-ohq-applogdb-2:3306 [0s,ok,10.3.12-MariaDB-log,ro,MIXED,>>]
We are using AutoPseudoGTID and everything is working as expected.
Here's a scenario I'm trying to make work, but so far have not been able to:
We would like to be able to drag/drop (promote) sm-atl-applogdb-3 so it becomes master of both sm-ohq-applogdb-1 and sm-ohq-applogdb-2, and have sm-atl-applogdb-3 replicate from shadowmaster, as shown below:
shadowmaster:3306 [unknown,invalid,Unknown,rw,nobinlog,downtimed]
+ sm-atl-applogdb-3:3306 [0s,ok,10.3.12-MariaDB-log,rw,MIXED,>>]
  + sm-ohq-applogdb-1:3306 [0s,ok,10.3.12-MariaDB-log,ro,MIXED,>>]
  + sm-ohq-applogdb-2:3306 [0s,ok,10.3.12-MariaDB-log,ro,MIXED,>>]
Unfortunately, this does not happen. We end up with the following instead (notice that sm-ohq-applogdb-2 remains a slave of the old intermediate master, sm-ohq-applogdb-1):
shadowmaster:3306 [unknown,invalid,Unknown,rw,nobinlog,downtimed]
+ sm-atl-applogdb-3:3306 [0s,ok,10.3.12-MariaDB-log,ro,MIXED,>>]
  + sm-ohq-applogdb-1:3306 [0s,ok,10.3.12-MariaDB-log,rw,MIXED,>>]
    + sm-ohq-applogdb-2:3306 [0s,ok,10.3.12-MariaDB-log,ro,MIXED,>>]
I attempted to use a hook, as I thought this would fall under PostIntermediateMasterFailoverProcesses. I created a hook that would move all slaves of the old intermediate master (in this case sm-ohq-applogdb-1) under the new one (sm-atl-applogdb-3), but it never got called.
While troubleshooting why the PostIntermediateMasterFailoverProcesses hook was not being called, I noticed that it never gets triggered. My guess is that the operation is handled as a plain take-master call and not as a graceful intermediate master promotion, so no failover hooks run.
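For reference, the two steps I would like to end up with can be sketched as below. This is only a rough illustration, not how orchestrator does it internally. The API address `http://localhost:3000/api` and the helper names `takeover_urls` and `run_takeover` are made up for the example. The `take-master` path is the one visible in the logs below, while the `relocate-replicas` path is my assumption about the API and should be checked against the installed orchestrator version.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Hypothetical address of the orchestrator HTTP API; adjust for your setup.
ORCHESTRATOR_API = "http://localhost:3000/api"


def takeover_urls(new_master, old_master, port=3306, api=ORCHESTRATOR_API):
    """Return the API calls, in order, for the takeover described above."""
    new_host = quote(new_master)
    old_host = quote(old_master)
    return [
        # Step 1: the same call that appears in the logs; it swaps
        # new_master with its current master (old_master).
        f"{api}/take-master/{new_host}/{port}",
        # Step 2 (assumed endpoint): move the replicas still attached to
        # old_master so that they replicate from new_master instead.
        f"{api}/relocate-replicas/{old_host}/{port}/{new_host}/{port}",
    ]


def run_takeover(new_master, old_master, port=3306, api=ORCHESTRATOR_API):
    """Issue the calls one after another, stopping at the first HTTP error."""
    for url in takeover_urls(new_master, old_master, port, api):
        # urlopen raises HTTPError/URLError on failure, which aborts the loop.
        with urlopen(url, timeout=30) as response:
            print(url, response.status)
```

With our hosts this would be `run_takeover("sm-atl-applogdb-3", "sm-ohq-applogdb-1")`. The second call is exactly what I hoped a hook could do automatically.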
Here are the logs:
2019-02-08 17:05:18 DEBUG raft leader is 10.0.84.117:10008 (this host); state: Leader
[martini] Started GET /api/take-master/sm-atl-applogdb-3/3306 for 69.41.14.254:15162
2019-02-08 17:05:22 DEBUG TakeMaster: will attempt making sm-atl-applogdb-3:3306 take its master sm-ohq-applogdb-1:3306, now resolved as sm-ohq-applogdb-1:3306
2019-02-08 17:05:22 INFO Stopped replication on sm-ohq-applogdb-1:3306, Self:mysql-bin-sm-ohq-applogdb-1.000057:169115058, Exec:shadowmaster.027720:455315683
2019-02-08 17:05:23 DEBUG analysis: IsMaster: true, LastCheckValid: false, LastCheckPartialSuccess: true, CountReplicas: 1, CountValidReplicatingReplicas: 0, CountLaggingReplicas: 0, CountDelayedReplicas: 0,
2019-02-08 17:05:23 DEBUG raft leader is 10.0.84.117:10008 (this host); state: Leader
2019-02-08 17:05:23 DEBUG orchestrator/raft: applying command 1863: request-health-report
[martini] Started GET /api/raft-follower-health-report/4c36b85e/sm-ohq-proxysql-1/sm-ohq-proxysql-1 for 10.0.84.117:50200
[martini] Completed 200 OK in 582.534µs
[martini] Started GET /api/raft-follower-health-report/4c36b85e/sm-ohq-proxysql-2/sm-ohq-proxysql-2 for 10.0.84.118:10344
[martini] Completed 200 OK in 580.334µs
[martini] Started GET /api/raft-follower-health-report/4c36b85e/sm-atl-proxysql-3/sm-atl-proxysql-3 for 10.5.4.171:47266
[martini] Completed 200 OK in 566.785µs
2019-02-08 17:05:24 INFO Stopped replication on sm-atl-applogdb-3:3306, Self:mysql-bin-sm-atl-applogdb-3.000057:169115074, Exec:mysql-bin-sm-ohq-applogdb-1.000057:169115058
2019-02-08 17:05:24 INFO Will start replication on sm-atl-applogdb-3:3306 until coordinates: mysql-bin-sm-ohq-applogdb-1.000057:169115058
2019-02-08 17:05:26 INFO Stopped replication on sm-atl-applogdb-3:3306, Self:mysql-bin-sm-atl-applogdb-3.000057:169115074, Exec:mysql-bin-sm-ohq-applogdb-1.000057:169115058
2019-02-08 17:05:26 DEBUG ChangeMasterTo: will attempt changing master on sm-atl-applogdb-3:3306 to shadowmaster:3306, shadowmaster.027720:455315683
2019-02-08 17:05:26 INFO ChangeMasterTo: Changed master on sm-atl-applogdb-3:3306 to: shadowmaster:3306, shadowmaster.027720:455315683. GTID: false
2019-02-08 17:05:26 DEBUG ChangeMasterTo: will attempt changing master on sm-ohq-applogdb-1:3306 to sm-atl-applogdb-3:3306, mysql-bin-sm-atl-applogdb-3.000057:169115074
2019-02-08 17:05:26 INFO ChangeMasterTo: Changed master on sm-ohq-applogdb-1:3306 to: sm-atl-applogdb-3:3306, mysql-bin-sm-atl-applogdb-3.000057:169115074. GTID: false
2019-02-08 17:05:27 WARNING executeCheckAndRecoverFunction: ignoring analysisEntry that has no action plan: AllIntermediateMasterSlavesNotReplicating; key: sm-atl-applogdb-3:3306
2019-02-08 17:05:27 INFO Started replication on sm-atl-applogdb-3:3306
2019-02-08 17:05:28 DEBUG raft leader is 10.0.84.117:10008 (this host); state: Leader
2019-02-08 17:05:28 INFO Started replication on sm-ohq-applogdb-1:3306
2019-02-08 17:05:29 INFO auditType:take-master instance:sm-atl-applogdb-3:3306 cluster:shadowmaster:3306 message:took master: sm-ohq-applogdb-1:3306
Would it be possible to create a GracefulIntermediateMasterTakeover hook or a hook for the take-master call above?
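In the meantime, the only marker of the takeover I can see in the logs is the `auditType:take-master` line, so a stopgap could be an external watcher that picks that line up and then runs the relocation itself. Below is a minimal sketch of the parsing part only. The function name `parse_take_master_audit` is hypothetical, and the pattern is derived solely from the single audit line in the logs above, so it may not match other orchestrator versions.

```python
import re

# Pattern based only on the audit line shown in the logs above.
TAKE_MASTER_AUDIT = re.compile(
    r"auditType:take-master "
    r"instance:(?P<instance>\S+) "
    r"cluster:(?P<cluster>\S+) "
    r"message:took master: (?P<old_master>\S+)"
)


def parse_take_master_audit(line):
    """Return the hosts named in a take-master audit line, or None."""
    match = TAKE_MASTER_AUDIT.search(line)
    if match is None:
        return None
    return match.groupdict()
```

For the audit line in the logs above, this yields `sm-atl-applogdb-3:3306` as the instance and `sm-ohq-applogdb-1:3306` as the old master, which is all the information a relocation script would need.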
Thanks for your time, please let me know if you have any questions and I can try to explain more if needed.