Joining new nodes to recovered vault after quorum lost causes "storage.raft.snapshot: failed to move snapshot into place (Invalid handle)" in Win 10 #12116

@ferAbleTech

Describe the bug
I followed this tutorial to create a Vault cluster with Raft as the storage backend. After simulating an outage in which all 3 nodes were lost and recovering a single node using peers.json, I tried joining new nodes to the recovered node. However, after running the join command (vault operator raft join http:...), the recovered node's console periodically logs these errors:

[ERROR] storage.raft: failed to send snapshot to: peer="{Non_voter vault_3 127.0.0.1:8401}" error="sync vault\raft\snapshots: Handle non valido."
[ERROR] storage.raft: failed to get log: index=1 error="log not found"
[ERROR] storage.raft: failed to install snapshot: id=bolt-snapshot error="sync vault\raft\snapshots: Handle non valido."

(The last part of the error message is in Italian, the language of the OS. "Handle non valido" means "Invalid handle".)

The joining node's console logs these errors:

[INFO] storage.raft.snapshot: creating new snapshot: path=vault\raft\snapshots\......
[ERROR] storage.raft.snapshot: failed to move snapshot into place: error="sync vault\raft\snapshots: Handle non valido"
[ERROR] storage.raft.snapshot: failed to finalize snapshot: error="sync vault\raft\snapshots: Handle non valido"
[INFO] storage.raft.snapshot: reaping snapshot: path=vault\raft\snapshots\.......

This behaviour only happens on Windows; it does not occur in the online environment offered in the tutorial.

To Reproduce
Steps to reproduce the behavior:
1. Follow the steps in the tutorial up to "Retry Join" to create a cluster of 3 nodes (plus a server for auto-unseal using the Transit secrets engine).
2. Stop all nodes in the cluster.
3. Recover vault_2 using the peers.json method.
4. Try joining a new node to vault_2.
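
For reference, the peers.json recovery step can be sketched as below. This is a minimal sketch, not the tutorial's script. It assumes the recovered node is vault_2 with cluster address 127.0.0.1:8301 and storage path ./vault (values taken from the configuration in this report), and that Vault reads the file from the raft subdirectory of the storage path, as described in the Vault documentation on lost-quorum recovery. The helper names build_peers and write_peers_json are hypothetical.

```python
import json
from pathlib import Path


def build_peers(node_id: str, cluster_address: str) -> list:
    """Return a peer list containing a single voting node."""
    return [{"id": node_id, "address": cluster_address, "non_voter": False}]


def write_peers_json(storage_path: str, peers: list) -> Path:
    """Write peers.json into <storage_path>/raft and return the file path."""
    raft_dir = Path(storage_path) / "raft"
    # Assumption: the raft directory normally exists already on a node
    # that was part of a cluster; creating it here keeps the sketch safe.
    raft_dir.mkdir(parents=True, exist_ok=True)
    target = raft_dir / "peers.json"
    target.write_text(json.dumps(peers, indent=2), encoding="utf-8")
    return target


if __name__ == "__main__":
    # Only print the payload; call write_peers_json("./vault", peers)
    # while the node is stopped to actually place the file.
    peers = build_peers("vault_2", "127.0.0.1:8301")
    print(json.dumps(peers, indent=2))
```

The file would be written while vault_2 is stopped, and the node started afterwards so that it picks up the single-node peer set.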

As this is a bit tedious to reproduce on Windows, since the automated script offered in the tutorial only works on Linux, I made a semi-automated equivalent using bat files:
https://drive.google.com/file/d/1GHbNmBG0niRkIYB6Qc4KVHdGjPrqPPIi/view?usp=sharing
Follow the README to simulate the error.

Expected behavior
The new nodes successfully join the cluster.

Environment:

  • Vault Server Version: 1.7.3
  • Vault CLI Version: Vault v1.7.3 (5d517c8)
  • Server Operating System/Architecture: Windows 10 x64 20H2

Vault server configuration file(s):

Autounseal vault:

storage "raft" {
  path    = "./vault"
  node_id = "vault_1"
}
listener "tcp" {
  address         = "127.0.0.1:8200"
  cluster_address = "127.0.0.1:8201"
  tls_disable     = true
}

disable_mlock = true
cluster_addr  = "http://127.0.0.1:8201"
api_addr      = "http://127.0.0.1:8200"

Cluster nodes (with different api/cluster ports):

storage "raft" {
  path    = "./vault"
  node_id = "vault_2"
}
listener "tcp" {
  address         = "127.0.0.1:8300"
  cluster_address = "127.0.0.1:8301"
  tls_disable     = true
}

seal "transit" {
  address         = "http://127.0.0.1:8200"
  # token is read from VAULT_TOKEN env
  # token         = ""
  disable_renewal = "false"
  key_name        = "autounseal"
  mount_path      = "transit/"
  tls_skip_verify = "true"
}
disable_mlock = true
cluster_addr  = "http://127.0.0.1:8301"
api_addr      = "http://127.0.0.1:8300"
ui            = true
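
To make "with different api/cluster ports" concrete, the per-node configuration can be sketched as a small generator. This is only an illustration: the port pattern (API port 8100 + 100 × node number, cluster port one higher) is inferred from the values in this report (vault_1 on 8200/8201, vault_2 on 8300/8301, vault_3 with cluster port 8401) and is an assumption for any other node. The function names node_ports and render_node_config are hypothetical.

```python
from typing import Tuple


def node_ports(index: int) -> Tuple[int, int]:
    """Return (api_port, cluster_port) for node vault_<index>.

    Assumed pattern inferred from this report:
    vault_1 -> 8200/8201, vault_2 -> 8300/8301, vault_3 -> .../8401.
    """
    api_port = 8100 + 100 * index
    return api_port, api_port + 1


def render_node_config(index: int) -> str:
    """Render a cluster-node config mirroring the vault_2 example."""
    api_port, cluster_port = node_ports(index)
    lines = [
        'storage "raft" {',
        '  path    = "./vault"',
        f'  node_id = "vault_{index}"',
        '}',
        'listener "tcp" {',
        f'  address         = "127.0.0.1:{api_port}"',
        f'  cluster_address = "127.0.0.1:{cluster_port}"',
        '  tls_disable     = true',
        '}',
        '',
        'seal "transit" {',
        '  address         = "http://127.0.0.1:8200"',
        '  # token is read from VAULT_TOKEN env',
        '  disable_renewal = "false"',
        '  key_name        = "autounseal"',
        '  mount_path      = "transit/"',
        '  tls_skip_verify = "true"',
        '}',
        'disable_mlock = true',
        f'cluster_addr  = "http://127.0.0.1:{cluster_port}"',
        f'api_addr      = "http://127.0.0.1:{api_port}"',
        'ui            = true',
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the config for the node that fails to join in this report.
    print(render_node_config(3))
```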
