Incorrect computation of initial-cluster-state during single member restoration which can lead to cluster ID mismatch errors #847

@unmarshall

How to categorize this issue?

/area control-plane
/kind bug

What happened:
A specific Gardener e2e kind test fails frequently: Shoot Tests Hibernated Shoot [It] Create, Migrate and Delete [Shoot, control-plane-migration, hibernated]

The creation, migration, and hibernation steps succeed. Deleting the migrated shoot, which is hibernated at that point, requires waking up the etcd cluster first. At this stage the etcd cluster does not become ready.

In one such occurrence we see the following logs in etcd-events-2 (backup-restore container):

2025-02-17T12:45:52.969873914Z stderr F 2025-02-17 12:45:52.968607 E | rafthttp: request sent was ignored (cluster ID mismatch: peer[6fdaf30df04c0245]=4ffa550a92b87675, local=39b1e34c77b1db7a)
2025-02-17T12:45:52.970531124Z stderr F 2025-02-17 12:45:52.970317 E | rafthttp: request sent was ignored (cluster ID mismatch: peer[6fdaf30df04c0245]=4ffa550a92b87675, local=39b1e34c77b1db7a)
2025-02-17T12:45:53.055124837Z stderr F 2025-02-17 12:45:53.054945 E | rafthttp: request sent was ignored (cluster ID mismatch: peer[6fdaf30df04c0245]=4ffa550a92b87675, local=39b1e34c77b1db7a)
2025-02-17T12:45:53.062374513Z stderr F 2025-02-17 12:45:53.062106 E | rafthttp: request sent was ignored (cluster ID mismatch: peer[6fdaf30df04c0245]=4ffa550a92b87675, local=39b1e34c77b1db7a)
2025-02-17T12:45:53.153435731Z stderr F 2025-02-17 12:45:53.153314 E | rafthttp: request sent was ignored (cluster ID mismatch: peer[6fdaf30df04c0245]=4ffa550a92b87675, local=39b1e34c77b1db7a)
2025-02-17T12:45:53.160917167Z stderr F 2025-02-17 12:45:53.160807 E | rafthttp: request sent was ignored (cluster ID mismatch: peer[6fdaf30df04c0245]=4ffa550a92b87675, local=39b1e34c77b1db7a)
2025-02-17T12:45:53.251792044Z stderr F 2025-02-17 12:45:53.251680 E | rafthttp: request sent was ignored (cluster ID mismatch: peer[6fdaf30df04c0245]=4ffa550a92b87675, local=39b1e34c77b1db7a)
2025-02-17T12:45:53.264667024Z stderr F 2025-02-17 12:45:53.264552 E | rafthttp: request sent was ignored (cluster ID mismatch: peer[6fdaf30df04c0245]=4ffa550a92b87675, local=39b1e34c77b1db7a)

For complete logs see: etcd-events-2-backup-restore.log

A cluster ID mismatch typically occurs in the 3 scenarios that are documented here.

Before the embedded etcd process starts, etcd-wrapper triggers initialization. Once initialization succeeds, etcd-wrapper requests the etcd config, which etcd-backup-restore computes here. One of the key parameters in the etcd config is initial-cluster-state. It is determined here and tells etcd whether this member bootstraps a new cluster or joins an existing one.

If the member list API call fails (see IsLearnerPresent) for any reason, this function correctly returns an error. The calling function swallows that error (see here) and assumes initial-cluster-state=new. This is done for the 0->3 replicas bootstrap case: while a new cluster is bootstrapping, etcd Member API calls will never succeed, so even on errors the config has to be served with initial-cluster-state=new to let the bootstrap succeed.

However, the above code-flow has a negative consequence as well. Consider the following case:

  • The data directory of one of the etcd members gets corrupted while the cluster is being brought up from 0->3.
  • Etcd-backup-restore validates the data directory and finds it corrupt. It then triggers single-member restoration (see this for more information).
  • As part of single-member restoration, it adds this member as a learner and then triggers initialization. Once initialization succeeds, it serves an etcd config.
  • If the etcd Member API call fails while initial-cluster-state is being computed (for example due to transient quorum loss, possibly caused by a VPA eviction), initial-cluster-state is assumed to be new. This is not the correct initial-cluster-state for a learner, which joins an existing cluster, and it leads to the cluster ID mismatch.
  • The member is forced to start as if it were bootstrapping a new cluster, with newly generated IDs that will never match the IDs known to the other 2 members of the etcd cluster. When it dials the other 2 members, they reject the call with the cluster ID mismatch response.
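One possible way to disambiguate the two cases in the list above is to let locally known facts take precedence over the Member API result. The sketch below is only an illustration of that idea, not a fix proposed or adopted in this issue: `restoreContext`, its `AddedAsLearner` field, `learnerCheck`, and `resolveInitialClusterState` are hypothetical names, and it assumes the restoration step can record that it already added this member as a learner.

```go
package main

import "errors"

const (
	stateNew      = "new"
	stateExisting = "existing"
)

// errMemberAPI is a hypothetical error standing in for a failed Member API call.
var errMemberAPI = errors.New("member list: etcd cluster unavailable")

// restoreContext carries hypothetical, locally known facts about this member.
type restoreContext struct {
	// AddedAsLearner is true when single-member restoration has already
	// registered this member as a learner with the existing cluster.
	AddedAsLearner bool
}

// learnerCheck is a hypothetical stand-in for the IsLearnerPresent call.
type learnerCheck func() (bool, error)

// resolveInitialClusterState prefers local knowledge over the Member API.
// Only when nothing indicates an existing cluster does an API error fall
// back to "new", which keeps the 0->3 bootstrap case working.
func resolveInitialClusterState(ctx restoreContext, check learnerCheck) string {
	if ctx.AddedAsLearner {
		// The member was added to a running cluster, so it must join it,
		// even if the Member API is transiently unavailable.
		return stateExisting
	}
	present, err := check()
	if err != nil {
		// No evidence of an existing cluster: assume bootstrap.
		return stateNew
	}
	if present {
		return stateExisting
	}
	return stateNew
}
```

With this shape, a transient Member API failure after single-member restoration no longer flips the state to `new`, while a fresh bootstrap, where `AddedAsLearner` is false and the API fails, still gets `new`. How such a flag would be persisted across container restarts is left open here.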

What you expected to happen:
initial-cluster-state should always be computed correctly.
