Hi 👋
We have discovered an issue with the library that results in unstable leadership when a running server is demoted or removed without its participation, which we have managed to reproduce reasonably reliably in Consul OSS/Enterprise >= v1.13.0 and Vault Enterprise >= v1.11.0. We believe the livelock behavior was uncovered by the logically correct check introduced in #477 and released as part of Raft v1.3.7.
Reproduction: https://github.com/szechuen/raft-524
Precondition
A running server is demoted or removed from the cluster without its participation. Realistically, this is likely to occur when a server is temporarily disconnected from the rest of the cluster and, before it reconnects, is automatically demoted/removed by the remaining servers via a process like raft-autopilot, which both Consul and Vault use.
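For illustration, below is a minimal sketch of the kind of membership change the remaining servers can commit entirely on their own through the library's API; the server ID and the choice of demotion vs. removal are placeholders for what autopilot would decide.

```go
package example

import (
	"log"
	"time"

	"github.com/hashicorp/raft"
)

// cleanupDeadServer sketches how a leader (or an automated process such as
// raft-autopilot acting through it) demotes or removes an unreachable server.
// The affected server never sees the resulting configuration entry and still
// believes it is a voter.
func cleanupDeadServer(r *raft.Raft, id raft.ServerID) {
	// Vault-style: demote the unreachable server to a non-voter...
	if err := r.DemoteVoter(id, 0, 10*time.Second).Error(); err != nil {
		log.Printf("demote failed: %v", err)
	}
	// ...or Consul-style: remove it from the configuration entirely.
	if err := r.RemoveServer(id, 0, 10*time.Second).Error(); err != nil {
		log.Printf("remove failed: %v", err)
	}
}
```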
Consul
As of Consul v1.13.x, the behavior is tied to the CleanupDeadServers option (default: true) offered by raft-autopilot, which when invoked removes all failed servers subject to the configured MinQuorum (default: 0). This is triggered once LastContactThreshold (default: 200ms) is exceeded and the server is considered unhealthy.
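For reference, a minimal sketch of those defaults expressed as a raft-autopilot configuration; the field names are assumed to mirror the option names above, and this is illustrative rather than Consul's actual wiring.

```go
package example

import (
	"time"

	autopilot "github.com/hashicorp/raft-autopilot"
)

// consulDefaults sketches the Consul defaults described above.
func consulDefaults() *autopilot.Config {
	return &autopilot.Config{
		CleanupDeadServers:   true,                   // remove failed servers automatically
		LastContactThreshold: 200 * time.Millisecond, // unhealthy once exceeded
		MinQuorum:            0,                      // no lower bound on voters
	}
}
```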
Vault
As of Vault v1.11.x, the cleanup_dead_servers option is false by default and, even when enabled, defaults to the much higher dead_server_last_contact_threshold of 24h, so this is less likely to occur with reasonable configuration.
However, the Automated Upgrades Autopilot feature introduced in the Enterprise variant is enabled by default (disable_upgrade_migration, default: false) and automatically demotes to a non-voter any server with an UpgradeVersionTag not matching the majority of the servers. There must be a leadership transfer during the loss of connectivity for the last known information to be reset and thus trigger the behavior; this is ensured when the server cut off is the existing leader.
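The demotion itself is an ordinary configuration change like the one sketched under Precondition. Purely for illustration (this is not Vault Enterprise's implementation, and the version bookkeeping below is hypothetical), the gist of the described behavior is:

```go
package example

import (
	"time"

	"github.com/hashicorp/raft"
)

// demoteMismatchedVoters demotes every server whose upgrade version differs
// from the most common one, mirroring the behavior described above. The
// versions map is hypothetical bookkeeping, not a real autopilot structure.
func demoteMismatchedVoters(r *raft.Raft, versions map[raft.ServerID]string) {
	counts := make(map[string]int)
	for _, v := range versions {
		counts[v]++
	}
	var majority string
	for v, n := range counts {
		if n > counts[majority] {
			majority = v
		}
	}
	for id, v := range versions {
		if v != majority {
			// Committed by the rest of the cluster; the demoted server does
			// not need to participate or even be reachable.
			_ = r.DemoteVoter(id, 0, 10*time.Second).Error()
		}
	}
}
```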
Behavior
All log samples reference the Consul and Vault issue reproduction linked above, with a 3-node (node0, node1, node2) cluster setup where node0 is the initially bootstrapped node (hence the existing leader) and is disconnected from the remaining nodes for a period of 30s after the cluster has been formed.
First, ensure that the cluster has stabilized and the remaining nodes have been promoted to voters if the ServerStabilizationTime Autopilot config is set (default: 10s for both Consul and Vault).
# Consul
consul-node0-1 | 2022-09-06T17:02:28.227Z [INFO] agent.server.autopilot: Promoting server: id=6d5eb694-fa44-1622-2b81-8e8f5a162251 address=172.20.0.2:8300 name=node2
consul-node0-1 | 2022-09-06T17:02:28.230Z [INFO] agent.server.raft: updating configuration: command=AddVoter server-id=6d5eb694-fa44-1622-2b81-8e8f5a162251 server-addr=172.20.0.2:8300 servers="[{Suffrage:Voter ID:2332eb10-d58f-1820-d7fc-153abfaa7111 Address:172.20.0.3:8300} {Suffrage:Voter ID:6d5eb694-fa44-1622-2b81-8e8f5a162251 Address:172.20.0.2:8300} {Suffrage:Nonvoter ID:11e503a2-f8e5-a39c-d1c0-834b15ae01a6 Address:172.20.0.4:8300}]"
consul-node0-1 | 2022-09-06T17:02:28.231Z [INFO] agent.server.autopilot: Promoting server: id=11e503a2-f8e5-a39c-d1c0-834b15ae01a6 address=172.20.0.4:8300 name=node1
consul-node0-1 | 2022-09-06T17:02:28.231Z [INFO] agent.server.raft: updating configuration: command=AddVoter server-id=11e503a2-f8e5-a39c-d1c0-834b15ae01a6 server-addr=172.20.0.4:8300 servers="[{Suffrage:Voter ID:2332eb10-d58f-1820-d7fc-153abfaa7111 Address:172.20.0.3:8300} {Suffrage:Voter ID:6d5eb694-fa44-1622-2b81-8e8f5a162251 Address:172.20.0.2:8300} {Suffrage:Voter ID:11e503a2-f8e5-a39c-d1c0-834b15ae01a6 Address:172.20.0.4:8300}]"
# Vault
vault-node0-1 | 2022-09-06T17:04:24.259Z [INFO] storage.raft.autopilot: Promoting server: id=node1 address=node1:8201 name=node1
vault-node0-1 | 2022-09-06T17:04:24.259Z [INFO] storage.raft: updating configuration: command=AddVoter server-id=node1 server-addr=node1:8201 servers="[{Suffrage:Voter ID:node0 Address:node0:8201} {Suffrage:Voter ID:node1 Address:node1:8201} {Suffrage:Nonvoter ID:node2 Address:node2:8201}]"
vault-node0-1 | 2022-09-06T17:04:24.261Z [INFO] storage.raft.autopilot: Promoting server: id=node2 address=node2:8201 name=node2
vault-node0-1 | 2022-09-06T17:04:24.261Z [INFO] storage.raft: updating configuration: command=AddVoter server-id=node2 server-addr=node2:8201 servers="[{Suffrage:Voter ID:node0 Address:node0:8201} {Suffrage:Voter ID:node1 Address:node1:8201} {Suffrage:Voter ID:node2 Address:node2:8201}]"
Next, node0 is disconnected from the cluster for 30s, during which it is either demoted to a non-voter (Vault) or removed as a server entirely (Consul). Since node0 can't learn about the configuration change or hear from the leader, it continuously attempts to start elections and requests votes as a candidate with progressively higher terms.
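The term inflation is inherent to how a candidate behaves; a simplified sketch (not the library's exact code) of the loop node0 is stuck in while partitioned:

```go
package example

// runCandidateSketch is a simplified model of a Raft candidate: every failed
// election bumps the term before votes are requested again. A partitioned
// node therefore keeps inflating its term relative to the rest of the cluster.
func runCandidateSketch(currentTerm uint64, requestVotes func(term uint64) bool) uint64 {
	for {
		currentTerm++ // new election => higher term
		if requestVotes(currentTerm) {
			return currentTerm // won: become leader at this term
		}
		// No quorum of votes before the election timeout: loop and try again.
	}
}
```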
# Consul
consul-node0-1 | Disconnecting node...
consul-node0-1 | 2022-09-06T17:02:40.621Z [WARN] agent.server.raft: failed to contact quorum of nodes, stepping down
consul-node0-1 | 2022-09-06T17:02:40.621Z [INFO] agent.server.raft: entering follower state: follower="Node at 172.20.0.3:8300 [Follower]" leader-address= leader-id=
consul-node0-1 | 2022-09-06T17:02:45.924Z [WARN] agent.server.raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
consul-node0-1 | 2022-09-06T17:02:45.924Z [INFO] agent.server.raft: entering candidate state: node="Node at 172.20.0.3:8300 [Candidate]" term=3
consul-node0-1 | 2022-09-06T17:02:51.257Z [WARN] agent.server.raft: Election timeout reached, restarting election
consul-node0-1 | 2022-09-06T17:02:51.257Z [INFO] agent.server.raft: entering candidate state: node="Node at 172.20.0.3:8300 [Candidate]" term=4
consul-node2-1 | 2022-09-06T17:02:52.733Z [INFO] agent.server.raft: election won: tally=2
consul-node2-1 | 2022-09-06T17:02:52.733Z [INFO] agent.server.raft: entering leader state: leader="Node at 172.20.0.2:8300 [Leader]"
consul-node2-1 | 2022-09-06T17:02:58.170Z [INFO] agent.server.autopilot: Attempting removal of failed server node: id=2332eb10-d58f-1820-d7fc-153abfaa7111 name=node0 address=172.20.0.3:8300
consul-node2-1 | 2022-09-06T17:02:58.172Z [INFO] agent.server.raft: updating configuration: command=RemoveServer server-id=2332eb10-d58f-1820-d7fc-153abfaa7111 server-addr= servers="[{Suffrage:Voter ID:6d5eb694-fa44-1622-2b81-8e8f5a162251 Address:172.20.0.2:8300} {Suffrage:Voter ID:11e503a2-f8e5-a39c-d1c0-834b15ae01a6 Address:172.20.0.4:8300}]"
consul-node2-1 | 2022-09-06T17:02:58.175Z [INFO] agent.server.autopilot: removed server: id=2332eb10-d58f-1820-d7fc-153abfaa7111
# Vault
vault-node0-1 | Disconnecting node...
vault-node0-1 | 2022-09-06T17:04:42.574Z [WARN] storage.raft: failed to contact quorum of nodes, stepping down
vault-node0-1 | 2022-09-06T17:04:42.574Z [INFO] storage.raft: entering follower state: follower="Node at node0:8201 [Follower]" leader-address= leader-id=
vault-node0-1 | 2022-09-06T17:04:50.062Z [WARN] storage.raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
vault-node0-1 | 2022-09-06T17:04:50.062Z [INFO] storage.raft: entering candidate state: node="Node at node0:8201 [Candidate]" term=4
vault-node0-1 | 2022-09-06T17:04:58.072Z [WARN] storage.raft: Election timeout reached, restarting election
vault-node0-1 | 2022-09-06T17:04:58.072Z [INFO] storage.raft: entering candidate state: node="Node at node0:8201 [Candidate]" term=5
vault-node2-1 | 2022-09-06T17:04:52.566Z [INFO] storage.raft: election won: tally=2
vault-node2-1 | 2022-09-06T17:04:52.566Z [INFO] storage.raft: entering leader state: leader="Node at node2:8201 [Leader]"
vault-node2-1 | 2022-09-06T17:05:02.612Z [ERROR] storage.raft.autopilot: UpgradeVersionTag not found in server metadata: id=node0 name=node0 address=node0:8201 upgrade_version_tag=upgrade_version
vault-node2-1 | 2022-09-06T17:05:02.612Z [INFO] storage.raft.autopilot: Demoting server: id=node0 address=node0:8201 name=node0
vault-node2-1 | 2022-09-06T17:05:02.612Z [INFO] storage.raft: updating configuration: command=DemoteVoter server-id=node0 server-addr= servers="[{Suffrage:Nonvoter ID:node0 Address:node0:8201} {Suffrage:Voter ID:node1 Address:node1:8201} {Suffrage:Voter ID:node2 Address:node2:8201}]"
Once node0 is reconnected, we see that it continues to start elections and request votes as a candidate since it never learns about the configuration change. The check introduced in #477 makes the remaining nodes ignore the vote request, but because node0 has a higher term, they abandon their quorum and step down to followers before restarting elections. However, the terms on node1/node2 are not updated to the highest term seen (from node0), so the next round of elections is still held with a lower term than node0's, causing the process to repeat all over again.
Note that prior to #477, node1/node2 would have accepted the vote request from node0 despite it not being a voter and elected it as the new leader, which, while incorrect, would have allowed the cluster to make forward progress.
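A simplified sketch of that interaction (not the library's exact code): the #477 check rejects candidates that are not voters in the receiver's current configuration, and the rejection happens without the receiver adopting the candidate's newer term, which is why node1/node2 keep campaigning with stale terms.

```go
package example

// handleRequestVote models the vote handling described above. Identifiers and
// structure are illustrative, not the library's actual types.
type voteRequest struct {
	term        uint64
	candidateID string
}

func handleRequestVote(req voteRequest, currentTerm *uint64, voters map[string]bool) bool {
	// Check from #477: ignore candidates that are not voters in our latest
	// configuration. node1/node2 take this path for node0's requests, and
	// *currentTerm is left untouched, so their next election is still held
	// at a term below node0's.
	if !voters[req.candidateID] {
		return false
	}
	if req.term < *currentTerm {
		return false // stale candidate
	}
	if req.term > *currentTerm {
		*currentTerm = req.term // a votable candidate's newer term would be adopted here
	}
	// ...log recency checks and vote persistence omitted...
	return true
}
```

Meanwhile, the acting leader still learns of node0's higher term through its failed replication to node0 (the "peer has newer term, stopping replication" lines below) and steps down, so quorum is repeatedly lost.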
# Consul
consul-node0-1 | Reconnecting node...
consul-node2-1 | 2022-09-06T17:03:08.443Z [INFO] agent.server.raft: updating configuration: command=AddNonvoter server-id=2332eb10-d58f-1820-d7fc-153abfaa7111 server-addr=172.20.0.3:8300 servers="[{Suffrage:Voter ID:6d5eb694-fa44-1622-2b81-8e8f5a162251 Address:172.20.0.2:8300} {Suffrage:Voter ID:11e503a2-f8e5-a39c-d1c0-834b15ae01a6 Address:172.20.0.4:8300} {Suffrage:Nonvoter ID:2332eb10-d58f-1820-d7fc-153abfaa7111 Address:172.20.0.3:8300}]"
consul-node2-1 | 2022-09-06T17:03:08.448Z [ERROR] agent.server.raft: peer has newer term, stopping replication: peer="{Nonvoter 2332eb10-d58f-1820-d7fc-153abfaa7111 172.20.0.3:8300}"
consul-node2-1 | 2022-09-06T17:03:08.448Z [INFO] agent.server.raft: entering follower state: follower="Node at 172.20.0.2:8300 [Follower]" leader-address= leader-id=
consul-node2-1 | 2022-09-06T17:03:08.448Z [ERROR] agent.server.autopilot: failed to add raft non-voting peer: id=2332eb10-d58f-1820-d7fc-153abfaa7111 address=172.20.0.3:8300 error="leadership lost while committing log"
consul-node0-1 | 2022-09-06T17:03:12.909Z [WARN] agent.server.raft: Election timeout reached, restarting election
consul-node0-1 | 2022-09-06T17:03:12.909Z [INFO] agent.server.raft: entering candidate state: node="Node at 172.20.0.3:8300 [Candidate]" term=7
consul-node1-1 | 2022-09-06T17:03:12.916Z [WARN] agent.server.raft: rejecting vote request since node is not a voter: from=172.20.0.3:8300
consul-node2-1 | 2022-09-06T17:03:12.916Z [WARN] agent.server.raft: rejecting vote request since node is not a voter: from=172.20.0.3:8300
consul-node2-1 | 2022-09-06T17:03:17.405Z [WARN] agent.server.raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
consul-node2-1 | 2022-09-06T17:03:17.405Z [INFO] agent.server.raft: entering candidate state: node="Node at 172.20.0.2:8300 [Candidate]" term=5
consul-node2-1 | 2022-09-06T17:03:17.413Z [INFO] agent.server.raft: election won: tally=2
consul-node2-1 | 2022-09-06T17:03:17.414Z [INFO] agent.server.raft: entering leader state: leader="Node at 172.20.0.2:8300 [Leader]"
consul-node2-1 | 2022-09-06T17:03:17.414Z [INFO] agent.server.raft: added peer, starting replication: peer=11e503a2-f8e5-a39c-d1c0-834b15ae01a6
consul-node2-1 | 2022-09-06T17:03:17.414Z [INFO] agent.server.raft: pipelining replication: peer="{Voter 11e503a2-f8e5-a39c-d1c0-834b15ae01a6 172.20.0.4:8300}"
consul-node2-1 | 2022-09-06T17:03:17.415Z [ERROR] agent.server.raft: peer has newer term, stopping replication: peer="{Nonvoter 2332eb10-d58f-1820-d7fc-153abfaa7111 172.20.0.3:8300}"
consul-node2-1 | 2022-09-06T17:03:17.416Z [INFO] agent.server.raft: entering follower state: follower="Node at 172.20.0.2:8300 [Follower]" leader-address= leader-id=
# Vault
vault-node0-1 | Reconnecting node...
vault-node0-1 | 2022-09-06T17:05:12.453Z [WARN] storage.raft: Election timeout reached, restarting election
vault-node0-1 | 2022-09-06T17:05:12.453Z [INFO] storage.raft: entering candidate state: node="Node at node0:8201 [Candidate]" term=7
vault-node1-1 | 2022-09-06T17:05:12.477Z [WARN] storage.raft: rejecting vote request since node is not a voter: from=node0:8201
vault-node2-1 | 2022-09-06T17:05:12.484Z [WARN] storage.raft: rejecting vote request since node is not a voter: from=node0:8201
vault-node2-1 | 2022-09-06T17:05:12.558Z [ERROR] storage.raft: failed to appendEntries to: peer="{Voter node0 node0:8201}" error="dial tcp 172.19.0.3:8201: i/o timeout"
vault-node2-1 | 2022-09-06T17:05:12.570Z [ERROR] storage.raft: peer has newer term, stopping replication: peer="{Voter node0 node0:8201}"
vault-node2-1 | 2022-09-06T17:05:12.570Z [INFO] storage.raft: entering follower state: follower="Node at node2:8201 [Follower]" leader-address= leader-id=
vault-node0-1 | 2022-09-06T17:05:17.671Z [WARN] storage.raft: Election timeout reached, restarting election
vault-node0-1 | 2022-09-06T17:05:17.671Z [INFO] storage.raft: entering candidate state: node="Node at node0:8201 [Candidate]" term=8
vault-node2-1 | 2022-09-06T17:05:17.675Z [WARN] storage.raft: rejecting vote request since node is not a voter: from=node0:8201
vault-node1-1 | 2022-09-06T17:05:17.675Z [WARN] storage.raft: rejecting vote request since node is not a voter: from=node0:8201
vault-node2-1 | 2022-09-06T17:05:21.425Z [INFO] storage.raft: entering candidate state: node="Node at node2:8201 [Candidate]" term=6
vault-node2-1 | 2022-09-06T17:05:21.432Z [INFO] storage.raft: election won: tally=2
vault-node2-1 | 2022-09-06T17:05:21.432Z [INFO] storage.raft: entering leader state: leader="Node at node2:8201 [Leader]"
vault-node2-1 | 2022-09-06T17:05:21.432Z [INFO] storage.raft: added peer, starting replication: peer=node0
vault-node2-1 | 2022-09-06T17:05:21.432Z [ERROR] storage.raft: peer has newer term, stopping replication: peer="{Nonvoter node0 node0:8201}"
vault-node2-1 | 2022-09-06T17:05:21.434Z [INFO] storage.raft: entering follower state: follower="Node at node2:8201 [Follower]" leader-address= leader-id=
The result is a cluster that is trapped in a livelock from this point on and never able to regain stable leadership until node0 is reset.