Hi 👋
We have discovered an issue with the library that results in unstable leadership when a running server is demoted or removed without its participation, which we have managed to reproduce reasonably reliably in Consul OSS/Enterprise >= v1.13.0 and Vault Enterprise >= v1.11.0. We believe the livelock behavior was uncovered by the logically correct check introduced in #477 and released as part of Raft v1.3.7.
Reproduction: https://github.com/szechuen/raft-524
Precondition
A running server is demoted or removed from the cluster without its participation. Realistically, this is likely to occur when a server is temporarily disconnected from the rest of the cluster and, before it reconnects, is automatically demoted/removed by the remaining servers via a process like raft-autopilot, which both Consul and Vault use.
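For illustration, below is a minimal sketch of the kind of membership change the remaining servers can commit entirely on their own through the library's API; the server ID and the choice of demotion vs. removal are placeholders for what autopilot would decide.

```go
package example

import (
	"log"
	"time"

	"github.com/hashicorp/raft"
)

// cleanupDeadServer sketches how a leader (or an automated process such as
// raft-autopilot acting through it) demotes or removes an unreachable server.
// The affected server never sees the resulting configuration entry and still
// believes it is a voter.
func cleanupDeadServer(r *raft.Raft, id raft.ServerID) {
	// Vault-style: demote the unreachable server to a non-voter...
	if err := r.DemoteVoter(id, 0, 10*time.Second).Error(); err != nil {
		log.Printf("demote failed: %v", err)
	}
	// ...or Consul-style: remove it from the configuration entirely.
	if err := r.RemoveServer(id, 0, 10*time.Second).Error(); err != nil {
		log.Printf("remove failed: %v", err)
	}
}
```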
Consul
As of Consul v1.13.x, the behavior is tied to the CleanupDeadServers option (default: true) offered by raft-autopilot, which when invoked removes all failed servers subject to the configured MinQuorum (default: 0). This is triggered once LastContactThreshold (default: 200ms) is exceeded and the server is considered unhealthy.
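For reference, a minimal sketch of those defaults expressed as a raft-autopilot configuration; the field names are assumed to mirror the option names above, and this is illustrative rather than Consul's actual wiring.

```go
package example

import (
	"time"

	autopilot "github.com/hashicorp/raft-autopilot"
)

// consulDefaults sketches the Consul defaults described above.
func consulDefaults() *autopilot.Config {
	return &autopilot.Config{
		CleanupDeadServers:   true,                   // remove failed servers automatically
		LastContactThreshold: 200 * time.Millisecond, // unhealthy once exceeded
		MinQuorum:            0,                      // no lower bound on voters
	}
}
```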
Vault
As of Vault v1.11.x, the cleanup_dead_servers option is false by default and, even when enabled, defaults to the much higher dead_server_last_contact_threshold of 24h, so this is less likely to occur with reasonable configuration.
However, the Automated Upgrades Autopilot feature introduced in the Enterprise variant is enabled by default (disable_upgrade_migration, default: false) and automatically demotes to a non-voter any server with an UpgradeVersionTag not matching the majority of the servers. There must be a leadership transfer during the loss of connectivity for the last known information to be reset and thus trigger the behavior; this is ensured when the server cut off is the existing leader.
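The demotion itself is an ordinary configuration change like the one sketched under Precondition. Purely for illustration (this is not Vault Enterprise's implementation, and the version bookkeeping below is hypothetical), the gist of the described behavior is:

```go
package example

import (
	"time"

	"github.com/hashicorp/raft"
)

// demoteMismatchedVoters demotes every server whose upgrade version differs
// from the most common one, mirroring the behavior described above. The
// versions map is hypothetical bookkeeping, not a real autopilot structure.
func demoteMismatchedVoters(r *raft.Raft, versions map[raft.ServerID]string) {
	counts := make(map[string]int)
	for _, v := range versions {
		counts[v]++
	}
	var majority string
	for v, n := range counts {
		if n > counts[majority] {
			majority = v
		}
	}
	for id, v := range versions {
		if v != majority {
			// Committed by the rest of the cluster; the demoted server does
			// not need to participate or even be reachable.
			_ = r.DemoteVoter(id, 0, 10*time.Second).Error()
		}
	}
}
```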
Behavior
All log samples reference the Consul and Vault issue reproduction linked above, with a 3-node (node0, node1, node2) cluster setup where node0 is the initially bootstrapped node (hence the existing leader) and is disconnected from the remaining nodes for a period of 30s after the cluster has been formed.
First, ensure that the cluster has stabilized and the remaining nodes have been promoted to voters if the ServerStabilizationTime Autopilot config is set (default: 10s for both Consul and Vault).
# Consul
consul-node0-1 | 2022-09-06T17:02:28.227Z [INFO] agent.server.autopilot: Promoting server: id=6d5eb694-fa44-1622-2b81-8e8f5a162251 address=172.20.0.2:8300 name=node2
consul-node0-1 | 2022-09-06T17:02:28.230Z [INFO] agent.server.raft: updating configuration: command=AddVoter server-id=6d5eb694-fa44-1622-2b81-8e8f5a162251 server-addr=172.20.0.2:8300 servers="[{Suffrage:Voter ID:2332eb10-d58f-1820-d7fc-153abfaa7111 Address:172.20.0.3:8300} {Suffrage:Voter ID:6d5eb694-fa44-1622-2b81-8e8f5a162251 Address:172.20.0.2:8300} {Suffrage:Nonvoter ID:11e503a2-f8e5-a39c-d1c0-834b15ae01a6 Address:172.20.0.4:8300}]"
consul-node0-1 | 2022-09-06T17:02:28.231Z [INFO] agent.server.autopilot: Promoting server: id=11e503a2-f8e5-a39c-d1c0-834b15ae01a6 address=172.20.0.4:8300 name=node1
consul-node0-1 | 2022-09-06T17:02:28.231Z [INFO] agent.server.raft: updating configuration: command=AddVoter server-id=11e503a2-f8e5-a39c-d1c0-834b15ae01a6 server-addr=172.20.0.4:8300 servers="[{Suffrage:Voter ID:2332eb10-d58f-1820-d7fc-153abfaa7111 Address:172.20.0.3:8300} {Suffrage:Voter ID:6d5eb694-fa44-1622-2b81-8e8f5a162251 Address:172.20.0.2:8300} {Suffrage:Voter ID:11e503a2-f8e5-a39c-d1c0-834b15ae01a6 Address:172.20.0.4:8300}]"
# Vault
vault-node0-1 | 2022-09-06T17:04:24.259Z [INFO] storage.raft.autopilot: Promoting server: id=node1 address=node1:8201 name=node1
vault-node0-1 | 2022-09-06T17:04:24.259Z [INFO] storage.raft: updating configuration: command=AddVoter server-id=node1 server-addr=node1:8201 servers="[{Suffrage:Voter ID:node0 Address:node0:8201} {Suffrage:Voter ID:node1 Address:node1:8201} {Suffrage:Nonvoter ID:node2 Address:node2:8201}]"
vault-node0-1 | 2022-09-06T17:04:24.261Z [INFO] storage.raft.autopilot: Promoting server: id=node2 address=node2:8201 name=node2
vault-node0-1 | 2022-09-06T17:04:24.261Z [INFO] storage.raft: updating configuration: command=AddVoter server-id=node2 server-addr=node2:8201 servers="[{Suffrage:Voter ID:node0 Address:node0:8201} {Suffrage:Voter ID:node1 Address:node1:8201} {Suffrage:Voter ID:node2 Address:node2:8201}]"
Next, node0 is disconnected from the cluster for 30s, during which it is either demoted to a non-voter (Vault) or removed as a server entirely (Consul). Since node0 can't learn about the configuration change or hear from the leader, it continuously attempts to start elections and requests votes as a candidate with progressively higher terms.
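The term inflation is inherent to how a candidate behaves; a simplified sketch (not the library's exact code) of the loop node0 is stuck in while partitioned:

```go
package example

// runCandidateSketch is a simplified model of a Raft candidate: every failed
// election bumps the term before votes are requested again. A partitioned
// node therefore keeps inflating its term relative to the rest of the cluster.
func runCandidateSketch(currentTerm uint64, requestVotes func(term uint64) bool) uint64 {
	for {
		currentTerm++ // new election => higher term
		if requestVotes(currentTerm) {
			return currentTerm // won: become leader at this term
		}
		// No quorum of votes before the election timeout: loop and try again.
	}
}
```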
# Consul
consul-node0-1 | Disconnecting node...
consul-node0-1 | 2022-09-06T17:02:40.621Z [WARN] agent.server.raft: failed to contact quorum of nodes, stepping down
consul-node0-1 | 2022-09-06T17:02:40.621Z [INFO] agent.server.raft: entering follower state: follower="Node at 172.20.0.3:8300 [Follower]" leader-address= leader-id=
consul-node0-1 | 2022-09-06T17:02:45.924Z [WARN] agent.server.raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
consul-node0-1 | 2022-09-06T17:02:45.924Z [INFO] agent.server.raft: entering candidate state: node="Node at 172.20.0.3:8300 [Candidate]" term=3
consul-node0-1 | 2022-09-06T17:02:51.257Z [WARN] agent.server.raft: Election timeout reached, restarting election
consul-node0-1 | 2022-09-06T17:02:51.257Z [INFO] agent.server.raft: entering candidate state: node="Node at 172.20.0.3:8300 [Candidate]" term=4
consul-node2-1 | 2022-09-06T17:02:52.733Z [INFO] agent.server.raft: election won: tally=2
consul-node2-1 | 2022-09-06T17:02:52.733Z [INFO] agent.server.raft: entering leader state: leader="Node at 172.20.0.2:8300 [Leader]"
consul-node2-1 | 2022-09-06T17:02:58.170Z [INFO] agent.server.autopilot: Attempting removal of failed server node: id=2332eb10-d58f-1820-d7fc-153abfaa7111 name=node0 address=172.20.0.3:8300
consul-node2-1 | 2022-09-06T17:02:58.172Z [INFO] agent.server.raft: updating configuration: command=RemoveServer server-id=2332eb10-d58f-1820-d7fc-153abfaa7111 server-addr= servers="[{Suffrage:Voter ID:6d5eb694-fa44-1622-2b81-8e8f5a162251 Address:172.20.0.2:8300} {Suffrage:Voter ID:11e503a2-f8e5-a39c-d1c0-834b15ae01a6 Address:172.20.0.4:8300}]"
consul-node2-1 | 2022-09-06T17:02:58.175Z [INFO] agent.server.autopilot: removed server: id=2332eb10-d58f-1820-d7fc-153abfaa7111
# Vault
vault-node0-1 | Disconnecting node...
vault-node0-1 | 2022-09-06T17:04:42.574Z [WARN] storage.raft: failed to contact quorum of nodes, stepping down
vault-node0-1 | 2022-09-06T17:04:42.574Z [INFO] storage.raft: entering follower state: follower="Node at node0:8201 [Follower]" leader-address= leader-id=
vault-node0-1 | 2022-09-06T17:04:50.062Z [WARN] storage.raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
vault-node0-1 | 2022-09-06T17:04:50.062Z [INFO] storage.raft: entering candidate state: node="Node at node0:8201 [Candidate]" term=4
vault-node0-1 | 2022-09-06T17:04:58.072Z [WARN] storage.raft: Election timeout reached, restarting election
vault-node0-1 | 2022-09-06T17:04:58.072Z [INFO] storage.raft: entering candidate state: node="Node at node0:8201 [Candidate]" term=5
vault-node2-1 | 2022-09-06T17:04:52.566Z [INFO] storage.raft: election won: tally=2
vault-node2-1 | 2022-09-06T17:04:52.566Z [INFO] storage.raft: entering leader state: leader="Node at node2:8201 [Leader]"
vault-node2-1 | 2022-09-06T17:05:02.612Z [ERROR] storage.raft.autopilot: UpgradeVersionTag not found in server metadata: id=node0 name=node0 address=node0:8201 upgrade_version_tag=upgrade_version
vault-node2-1 | 2022-09-06T17:05:02.612Z [INFO] storage.raft.autopilot: Demoting server: id=node0 address=node0:8201 name=node0
vault-node2-1 | 2022-09-06T17:05:02.612Z [INFO] storage.raft: updating configuration: command=DemoteVoter server-id=node0 server-addr= servers="[{Suffrage:Nonvoter ID:node0 Address:node0:8201} {Suffrage:Voter ID:node1 Address:node1:8201} {Suffrage:Voter ID:node2 Address:node2:8201}]"
Once node0 is reconnected, we see that it continues to start elections and request votes as a candidate since it never learns about the configuration change. The check introduced in #477 makes the remaining nodes ignore the vote request, but because node0 has a higher term, they abandon their quorum and step down to followers before restarting elections. However, the terms on node1/node2 are not updated to the highest term seen (from node0), so the next round of elections is still held with a lower term than node0's, causing the process to repeat all over again.
Note that prior to #477, node1/node2 would have accepted the vote request from node0 despite it not being a voter and elected it as the new leader, which, while incorrect, would have allowed the cluster to make forward progress.
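A simplified sketch of that interaction (not the library's exact code): the #477 check rejects candidates that are not voters in the receiver's current configuration, and the rejection happens without the receiver adopting the candidate's newer term, which is why node1/node2 keep campaigning with stale terms.

```go
package example

// handleRequestVote models the vote handling described above. Identifiers and
// structure are illustrative, not the library's actual types.
type voteRequest struct {
	term        uint64
	candidateID string
}

func handleRequestVote(req voteRequest, currentTerm *uint64, voters map[string]bool) bool {
	// Check from #477: ignore candidates that are not voters in our latest
	// configuration. node1/node2 take this path for node0's requests, and
	// *currentTerm is left untouched, so their next election is still held
	// at a term below node0's.
	if !voters[req.candidateID] {
		return false
	}
	if req.term < *currentTerm {
		return false // stale candidate
	}
	if req.term > *currentTerm {
		*currentTerm = req.term // a votable candidate's newer term would be adopted here
	}
	// ...log recency checks and vote persistence omitted...
	return true
}
```

Meanwhile, the acting leader still learns of node0's higher term through its failed replication to node0 (the "peer has newer term, stopping replication" lines below) and steps down, so quorum is repeatedly lost.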
# Consul
consul-node0-1 | Reconnecting node...
consul-node2-1 | 2022-09-06T17:03:08.443Z [INFO] agent.server.raft: updating configuration: command=AddNonvoter server-id=2332eb10-d58f-1820-d7fc-153abfaa7111 server-addr=172.20.0.3:8300 servers="[{Suffrage:Voter ID:6d5eb694-fa44-1622-2b81-8e8f5a162251 Address:172.20.0.2:8300} {Suffrage:Voter ID:11e503a2-f8e5-a39c-d1c0-834b15ae01a6 Address:172.20.0.4:8300} {Suffrage:Nonvoter ID:2332eb10-d58f-1820-d7fc-153abfaa7111 Address:172.20.0.3:8300}]"
consul-node2-1 | 2022-09-06T17:03:08.448Z [ERROR] agent.server.raft: peer has newer term, stopping replication: peer="{Nonvoter 2332eb10-d58f-1820-d7fc-153abfaa7111 172.20.0.3:8300}"
consul-node2-1 | 2022-09-06T17:03:08.448Z [INFO] agent.server.raft: entering follower state: follower="Node at 172.20.0.2:8300 [Follower]" leader-address= leader-id=
consul-node2-1 | 2022-09-06T17:03:08.448Z [ERROR] agent.server.autopilot: failed to add raft non-voting peer: id=2332eb10-d58f-1820-d7fc-153abfaa7111 address=172.20.0.3:8300 error="leadership lost while committing log"
consul-node0-1 | 2022-09-06T17:03:12.909Z [WARN] agent.server.raft: Election timeout reached, restarting election
consul-node0-1 | 2022-09-06T17:03:12.909Z [INFO] agent.server.raft: entering candidate state: node="Node at 172.20.0.3:8300 [Candidate]" term=7
consul-node1-1 | 2022-09-06T17:03:12.916Z [WARN] agent.server.raft: rejecting vote request since node is not a voter: from=172.20.0.3:8300
consul-node2-1 | 2022-09-06T17:03:12.916Z [WARN] agent.server.raft: rejecting vote request since node is not a voter: from=172.20.0.3:8300
consul-node2-1 | 2022-09-06T17:03:17.405Z [WARN] agent.server.raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
consul-node2-1 | 2022-09-06T17:03:17.405Z [INFO] agent.server.raft: entering candidate state: node="Node at 172.20.0.2:8300 [Candidate]" term=5
consul-node2-1 | 2022-09-06T17:03:17.413Z [INFO] agent.server.raft: election won: tally=2
consul-node2-1 | 2022-09-06T17:03:17.414Z [INFO] agent.server.raft: entering leader state: leader="Node at 172.20.0.2:8300 [Leader]"
consul-node2-1 | 2022-09-06T17:03:17.414Z [INFO] agent.server.raft: added peer, starting replication: peer=11e503a2-f8e5-a39c-d1c0-834b15ae01a6
consul-node2-1 | 2022-09-06T17:03:17.414Z [INFO] agent.server.raft: pipelining replication: peer="{Voter 11e503a2-f8e5-a39c-d1c0-834b15ae01a6 172.20.0.4:8300}"
consul-node2-1 | 2022-09-06T17:03:17.415Z [ERROR] agent.server.raft: peer has newer term, stopping replication: peer="{Nonvoter 2332eb10-d58f-1820-d7fc-153abfaa7111 172.20.0.3:8300}"
consul-node2-1 | 2022-09-06T17:03:17.416Z [INFO] agent.server.raft: entering follower state: follower="Node at 172.20.0.2:8300 [Follower]" leader-address= leader-id=
# Vault
vault-node0-1 | Reconnecting node...
vault-node0-1 | 2022-09-06T17:05:12.453Z [WARN] storage.raft: Election timeout reached, restarting election
vault-node0-1 | 2022-09-06T17:05:12.453Z [INFO] storage.raft: entering candidate state: node="Node at node0:8201 [Candidate]" term=7
vault-node1-1 | 2022-09-06T17:05:12.477Z [WARN] storage.raft: rejecting vote request since node is not a voter: from=node0:8201
vault-node2-1 | 2022-09-06T17:05:12.484Z [WARN] storage.raft: rejecting vote request since node is not a voter: from=node0:8201
vault-node2-1 | 2022-09-06T17:05:12.558Z [ERROR] storage.raft: failed to appendEntries to: peer="{Voter node0 node0:8201}" error="dial tcp 172.19.0.3:8201: i/o timeout"
vault-node2-1 | 2022-09-06T17:05:12.570Z [ERROR] storage.raft: peer has newer term, stopping replication: peer="{Voter node0 node0:8201}"
vault-node2-1 | 2022-09-06T17:05:12.570Z [INFO] storage.raft: entering follower state: follower="Node at node2:8201 [Follower]" leader-address= leader-id=
vault-node0-1 | 2022-09-06T17:05:17.671Z [WARN] storage.raft: Election timeout reached, restarting election
vault-node0-1 | 2022-09-06T17:05:17.671Z [INFO] storage.raft: entering candidate state: node="Node at node0:8201 [Candidate]" term=8
vault-node2-1 | 2022-09-06T17:05:17.675Z [WARN] storage.raft: rejecting vote request since node is not a voter: from=node0:8201
vault-node1-1 | 2022-09-06T17:05:17.675Z [WARN] storage.raft: rejecting vote request since node is not a voter: from=node0:8201
vault-node2-1 | 2022-09-06T17:05:21.425Z [INFO] storage.raft: entering candidate state: node="Node at node2:8201 [Candidate]" term=6
vault-node2-1 | 2022-09-06T17:05:21.432Z [INFO] storage.raft: election won: tally=2
vault-node2-1 | 2022-09-06T17:05:21.432Z [INFO] storage.raft: entering leader state: leader="Node at node2:8201 [Leader]"
vault-node2-1 | 2022-09-06T17:05:21.432Z [INFO] storage.raft: added peer, starting replication: peer=node0
vault-node2-1 | 2022-09-06T17:05:21.432Z [ERROR] storage.raft: peer has newer term, stopping replication: peer="{Nonvoter node0 node0:8201}"
vault-node2-1 | 2022-09-06T17:05:21.434Z [INFO] storage.raft: entering follower state: follower="Node at node2:8201 [Follower]" leader-address= leader-id=
The result is a cluster that is trapped in a livelock from this point on and never able to regain stable leadership until node0 is reset.