RestartReplicationQuick causing increase in replication lag

When an **UnreachableMasterWithLaggingReplicas** is detected on the master, Orchestrator restarts both the SQL thread and the I/O thread on the replicas as part of its emergent action.

We noticed that stopping and starting the SQL thread on the replicas increases replication lag, because the transaction being applied has to be rolled back and re-applied from the start.

This behavior was introduced in https://github.com/openark/orchestrator/pull/1010, which restarts the SQL thread in addition to the I/O thread. We believe the SQL thread restart is what causes this issue.

```go
func RestartReplicationQuick(instanceKey *InstanceKey) error {
	for _, cmd := range []string{`stop slave sql_thread`, `stop slave io_thread`, `start slave io_thread`, `start slave sql_thread`} {
		if _, err := ExecInstance(instanceKey, cmd); err != nil {
			return log.Errorf("%+v: RestartReplicationQuick: '%q' failed: %+v", *instanceKey, cmd, err)
		} else {
			log.Infof("%s on %+v as part of RestartReplicationQuick", cmd, *instanceKey)
		}
	}
	return nil
} 
```

Orchestrator would still be able to detect the "Too Many Connections" issue even if only the I/O thread of the replica is restarted.
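A minimal sketch of what an I/O-thread-only restart could look like, for discussion. It relies on the fact that the I/O thread is the one holding the replica's connection to the master, so restarting it alone forces a reconnect without interrupting the transaction the SQL thread is applying. The names `restartIOThreadOnly` and `execFunc` are hypothetical; `execFunc` stands in for `ExecInstance` so the example runs without a MySQL server, and this is not the actual patch.

```go
package main

import (
	"fmt"
	"log"
)

// execFunc is a hypothetical stand-in for orchestrator's ExecInstance:
// it runs a single statement against one replica.
type execFunc func(cmd string) error

// ioThreadRestartCommands restarts only the I/O thread. The SQL thread is
// left running, so the in-flight transaction is not rolled back.
var ioThreadRestartCommands = []string{`stop slave io_thread`, `start slave io_thread`}

// restartIOThreadOnly issues the commands in order and stops at the first failure.
func restartIOThreadOnly(replica string, exec execFunc) error {
	for _, cmd := range ioThreadRestartCommands {
		if err := exec(cmd); err != nil {
			return fmt.Errorf("%s: restartIOThreadOnly: %q failed: %w", replica, cmd, err)
		}
		log.Printf("%s on %s as part of restartIOThreadOnly", cmd, replica)
	}
	return nil
}

func main() {
	// Fake executor that records statements instead of contacting MySQL.
	var issued []string
	record := func(cmd string) error {
		issued = append(issued, cmd)
		return nil
	}
	if err := restartIOThreadOnly("replica-1:3306", record); err != nil {
		log.Fatal(err)
	}
	fmt.Println(issued)
}
```

Compared with the current `RestartReplicationQuick`, the only intended difference is that the two `sql_thread` statements are dropped from the command list.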

@shlomi-noach As discussed, I will submit a PR reverting the code to restart only the I/O thread.


RestartReplicationQuick causing increase in replication lag #1308
