This repository was archived by the owner on Feb 18, 2025. It is now read-only.

RestartReplicationQuick causing increase in replication lag #1308

@gsraman

Description

When an UnreachableMasterWithLaggingReplicas is detected on the master, the replicas' SQL and I/O threads are restarted as part of the emergent action taken by Orchestrator.

We noticed that stopping and starting the SQL thread on the replicas increases replication lag, because the transaction being applied has to be rolled back and re-applied from the start.

The SQL thread restart was introduced as part of #1010, and we believe it is causing this issue.

func RestartReplicationQuick(instanceKey *InstanceKey) error {
	for _, cmd := range []string{`stop slave sql_thread`, `stop slave io_thread`, `start slave io_thread`, `start slave sql_thread`} {
		if _, err := ExecInstance(instanceKey, cmd); err != nil {
			return log.Errorf("%+v: RestartReplicationQuick: '%q' failed: %+v", *instanceKey, cmd, err)
		} else {
			log.Infof("%s on %+v as part of RestartReplicationQuick", cmd, *instanceKey)
		}
	}
	return nil
} 

Orchestrator would still be able to detect the "Too Many Connections" issue even if only the replica's I/O thread is restarted.

@shlomi-noach As discussed, I will submit a PR reverting the code to restart only the I/O thread.
