@agourlay agourlay commented Feb 21, 2025

Alternative to #6020

This PR makes sure that a failed snapshot restore does not prevent the service from restarting.

This is achieved in two ways:

  • keeping the configuration stored inside the shard's folder when a recovery fails, instead of deleting the folder entirely
  • making the restore process crash-safe using a flag file

Keeping the shard folder satisfies the ShardHolder, which expects one folder per shard.
If no WAL is found for the shard, it is turned into a DummyShard at load time.

ERROR collection::shards::replica_set: Failed to load local shard "./storage/collections/test_collection/0", initializing "dummy" shard instead: Service internal error: Wal error: Can't init WAL: Os { code: 2, kind: NotFound, message: "No such file or directory" }

More importantly, keeping the shard configuration preserves the latest valid replica_state.json, which is used to access the remote shards.
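The fallback to a dummy shard can be illustrated with a minimal sketch. The `LoadedShard` enum, the `load_shard` function and the `wal` directory name are assumptions made for illustration. They are not the actual types or layout used in the code base.

```rust
use std::path::Path;

/// Simplified stand-in for the shard variants (hypothetical, not the real types).
#[derive(Debug, PartialEq)]
enum LoadedShard {
    Local,
    Dummy { reason: String },
}

/// Hypothetical loader: a shard folder without a WAL directory cannot be
/// loaded normally, so a dummy placeholder is returned instead of an error.
fn load_shard(shard_dir: &Path) -> LoadedShard {
    let wal_dir = shard_dir.join("wal"); // assumed directory name
    if wal_dir.is_dir() {
        LoadedShard::Local
    } else {
        LoadedShard::Dummy {
            reason: format!("Can't init WAL: {} not found", wal_dir.display()),
        }
    }
}
```

Because the folder itself still exists, a holder that iterates over one folder per shard still finds an entry for this shard, even though it carries no data.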

For crash safety, a flag file is created before the shard restore and deleted once the restore has succeeded.

The presence of a flag at startup means that either:

  • the restore process crashed midway
  • the restore process failed and left the flag behind

At startup, if the flag is present, the potentially corrupted segment data and WAL are deleted.
The resulting dummy shard can be the target of a new snapshot restore OR manual shard replication to fix it.
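The flag-file mechanism described above can be sketched roughly as follows. The flag name, the `segments` and `wal` directory names and both function names are hypothetical, and clearing the flag after cleanup is an assumption of this sketch. The actual implementation in the PR may differ.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical flag file name; the name used in the PR is not shown here.
const RESTORE_FLAG: &str = ".restore_in_progress";

fn flag_path(shard_dir: &Path) -> PathBuf {
    shard_dir.join(RESTORE_FLAG)
}

/// Runs `restore` with the flag set. The flag is removed only on success,
/// so a crash or an error leaves it behind for the next startup to find.
fn restore_with_flag<F>(shard_dir: &Path, restore: F) -> io::Result<()>
where
    F: FnOnce(&Path) -> io::Result<()>,
{
    fs::create_dir_all(shard_dir)?;
    // Flush the flag to disk before any shard data is touched.
    fs::File::create(flag_path(shard_dir))?.sync_all()?;
    restore(shard_dir)?;
    fs::remove_file(flag_path(shard_dir))
}

/// At startup: if the flag is present, delete the possibly corrupted data
/// directories but keep everything else in the shard folder, such as its
/// configuration. Returns `true` if a cleanup happened.
fn cleanup_interrupted_restore(shard_dir: &Path) -> io::Result<bool> {
    if !flag_path(shard_dir).exists() {
        return Ok(false);
    }
    // Assumed directory names for segment data and WAL.
    for data_dir in ["segments", "wal"] {
        let path = shard_dir.join(data_dir);
        if path.exists() {
            fs::remove_dir_all(&path)?;
        }
    }
    // Assumption: the flag is cleared once the cleanup is done.
    fs::remove_file(flag_path(shard_dir))?;
    Ok(true)
}
```

Creating the flag before the restore and removing it only after success means both failure modes listed above (a crash midway and a returned error) look the same at the next startup, so a single cleanup path handles them.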

An E2E integration test generates a corrupted snapshot to verify that the flag is present after the failed restore, that the service starts up correctly, and that the shard can be recovered.

@agourlay agourlay force-pushed the keep-initial-shard-configuration-on-failed-restore branch from 60bba3f to b322512 Compare February 25, 2025 11:31
@agourlay agourlay merged commit e50c843 into dev Feb 25, 2025
17 checks passed
@agourlay agourlay deleted the keep-initial-shard-configuration-on-failed-restore branch February 25, 2025 12:26
timvisee pushed a commit that referenced this pull request Mar 21, 2025
* Keep initial shard configuration on failed restore

* Set initialization flag for crash safety
@timvisee timvisee mentioned this pull request Mar 21, 2025