This repository was archived by the owner on Feb 18, 2025. It is now read-only.

Instance check leads to UnreachableMaster (LastCheckValid: false) if instance check takes longer than 1s #1367

@binwiederhier

Our MySQL hosts are under considerable load, to the point that quite frequently, the analysis returns LastCheckValid: false, even though the host is up and responsive:

analysis: ClusterName: ...:3306, IsMaster: true, LastCheckValid: false, LastCheckPartialSuccess: true, 
CountReplicas: 2, CountValidReplicas: 2, CountValidReplicatingReplicas: 2, CountLaggingReplicas: 0, 
CountDelayedReplicas: 0, CountReplicasFailingToConnectToMaster: 0

After looking at Wireshark dumps, it seems that our hosts sometimes take slightly longer than 1s to perform these checks (2-3s, sometimes even longer). That is arguably not great, but it is still acceptable for us. From my understanding of the code, if a check takes longer than 1s (hardcoded), Orchestrator considers the host to be down, which may lead to emergent actions and, if those fail, eventually to failovers.

I've deduced this from these parts of the code:

I propose making this 1s timeout configurable as ReasonableInstanceCheckSeconds.

Did I get all of this right? What do you think about the proposal? I am happy to provide all the info that's needed here.
