Skip to content
This repository was archived by the owner on Feb 18, 2025. It is now read-only.

Conversation

shlomi-noach
Copy link
Collaborator

As pointed out by @ggunson , ErrantGTIDResetMaster() issues a reset master immediately following a stop slave operation, but without verifying that replication has indeed stopped. e.g. SQL thread could still be busy.

We've seen crashes in production at running reset master.

In this PR we actively wait (or timeout) for replication to stop, before running reset master.

@shlomi-noach shlomi-noach temporarily deployed to production/mysql_cluster=conductor December 19, 2018 08:05 Inactive
Shlomi Noach added 2 commits December 19, 2018 10:19
@shlomi-noach shlomi-noach temporarily deployed to production/mysql_cluster=conductor December 19, 2018 08:25 Inactive
@shlomi-noach shlomi-noach temporarily deployed to production/mysql_cluster=conductor December 19, 2018 09:22 Inactive
ioThreadRunning := (m.GetString("Slave_IO_Running") == "Yes")
sqlThreadRunning := (m.GetString("Slave_SQL_Running") == "Yes")
replicationThreadsRunning = ioThreadRunning && sqlThreadRunning
ioThreadRunning = (m.GetString("Slave_IO_Running") == "Yes")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An issue with this is that you're deciding that Slave_*_Running = Yes is the boolean check against which to decide that replication is fully running or fully stopped.

I'm not sure about Slave_SQL_Running's options but Slave_IO_Running can also be "Connecting". So that's at least one case where this check would say that replication is not running even though the IO thread is (or, starting to, or trying to).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Co-Authored-By: shlomi-noach <shlomi-noach@github.com>
@shlomi-noach shlomi-noach temporarily deployed to production/mysql_cluster=conductor December 20, 2018 06:02 Inactive
@shlomi-noach shlomi-noach temporarily deployed to production/mysql_cluster=conductor December 20, 2018 11:07 Inactive
@shlomi-noach shlomi-noach temporarily deployed to production/mysql_cluster=conductor December 23, 2018 05:56 Inactive
@shlomi-noach
Copy link
Collaborator Author

I will merge this PR even though we haven't yet verified the reasoning for the reset master crash, or at least have not provided a reliable way to reproduce the error.

Reiterating an internal issue, @ggunson suggests the retries can be the cause of the crash, as follows:

  • a 1st connection breaks on Connection Invalid
  • but the reset master it invoked is still underway
  • orchestrator issues a retry, running a 2nd, concurrent reset master
  • crash

We're yet to reproduce this reliably, but will not be working on this actively in the short term.

@shlomi-noach shlomi-noach temporarily deployed to production/mysql_cluster=conductor December 23, 2018 13:33 Inactive
@shlomi-noach shlomi-noach temporarily deployed to production/mysql_cluster=conductor December 24, 2018 06:09 Inactive
@shlomi-noach shlomi-noach merged commit 82757e9 into master Dec 24, 2018
@shlomi-noach shlomi-noach deleted the reset-master-wait-for-replication-to-stop branch December 24, 2018 06:18
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants