Keep initial shard configuration on failed restore #6038

agourlay · 2025-02-21T16:15:28Z

Alternative to #6020

This PR makes sure that a failed snapshot restore does not prevent the service from restarting.

This is achieved in two ways:

* keeping the shard folder's inner configuration when recovery fails, instead of deleting the folder completely
* making the restore process crash safe using a flag file

Not deleting the shard folder satisfies the ShardHolder, which expects one folder per shard.
Because no WAL is found for the shard, it is turned into a DummyShard at load time:

ERROR collection::shards::replica_set: Failed to load local shard "./storage/collections/test_collection/0", initializing "dummy" shard instead: Service internal error: Wal error: Can't init WAL: Os { code: 2, kind: NotFound, message: "No such file or directory" }
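The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `LoadedShard` enum, the `load_shard` function and the `wal` directory name are all assumptions made for the example.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical stand-in for the two outcomes of loading a shard folder.
#[derive(Debug, PartialEq)]
enum LoadedShard {
    /// The shard data was found and loaded normally.
    Local,
    /// The shard folder exists but its data could not be loaded.
    Dummy { reason: String },
}

/// Hypothetical loader: a shard folder without a readable WAL directory
/// is not an error, it becomes a dummy shard instead.
fn load_shard(shard_path: &Path) -> io::Result<LoadedShard> {
    // One folder per shard is expected, so a missing folder is a hard error.
    if !shard_path.is_dir() {
        return Err(io::Error::new(
            io::ErrorKind::NotFound,
            "shard folder is missing",
        ));
    }

    // Assumed layout: the WAL lives in a `wal` subdirectory of the shard folder.
    let wal_path = shard_path.join("wal");
    match fs::read_dir(&wal_path) {
        Ok(_) => Ok(LoadedShard::Local),
        Err(err) => Ok(LoadedShard::Dummy {
            reason: format!("Can't init WAL: {err}"),
        }),
    }
}
```

The point of the sketch is the asymmetry: a missing shard folder stops loading, while a folder with missing data only degrades that one shard.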

More importantly, keeping the shard configuration provides the latest valid replica_state.json to access the remote shards.

For crash safety, a flag file is created before the shard restore and deleted once the restore completes successfully.
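A rough sketch of the restore side of this mechanism, using only the standard library. The flag file name `.restore_in_progress` and the `restore_with_flag` helper are hypothetical; the real names and the actual restore logic live in the files touched by this PR.

```rust
use std::fs::{self, File};
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical flag file name, chosen for this example only.
const RESTORE_FLAG: &str = ".restore_in_progress";

fn flag_path(shard_path: &Path) -> PathBuf {
    shard_path.join(RESTORE_FLAG)
}

/// Runs `restore` with a flag file present for the whole duration.
/// The flag is removed only when `restore` succeeds.
fn restore_with_flag<F>(shard_path: &Path, restore: F) -> io::Result<()>
where
    F: FnOnce(&Path) -> io::Result<()>,
{
    // Create the flag before touching any shard data and flush it to disk,
    // so a crash during the restore still leaves the flag behind.
    let flag = File::create(flag_path(shard_path))?;
    flag.sync_all()?;
    drop(flag);

    // On failure `?` returns early and the flag stays in place.
    restore(shard_path)?;

    // Only a completed restore clears the flag.
    fs::remove_file(flag_path(shard_path))
}
```

Because the flag is only cleared on the success path, a crash and a failed restore look the same at the next startup, which is what allows a single cleanup path to handle both.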

The presence of a flag at startup means that either:

* the restore process crashed midway
* the restore process failed and left the flag behind

At startup, if the flag is present, the potentially corrupted segment data and WAL are deleted, so the shard is loaded as a dummy shard.
This dummy shard can then be the target of a new snapshot restore or of a manual shard replication to fix it.
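The startup side could look roughly like the sketch below. It is written under several assumptions: the flag file name `.restore_in_progress`, the `segments` and `wal` directory names, and the helper name are hypothetical, and removing the flag after cleanup is a choice made for this example that the description above does not spell out.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical flag file name, chosen for this example only.
const RESTORE_FLAG: &str = ".restore_in_progress";
/// Assumed names of the data directories inside a shard folder.
const DATA_DIRS: [&str; 2] = ["segments", "wal"];

/// If the flag is present, deletes the possibly corrupted data directories
/// while leaving every other file (such as replica_state.json) in place.
/// Returns `true` when a cleanup was performed.
fn clean_after_interrupted_restore(shard_path: &Path) -> io::Result<bool> {
    let flag = shard_path.join(RESTORE_FLAG);
    if !flag.exists() {
        // No interrupted restore, nothing to do.
        return Ok(false);
    }

    for dir in DATA_DIRS {
        match fs::remove_dir_all(shard_path.join(dir)) {
            Ok(()) => {}
            // A directory that was never created is fine.
            Err(err) if err.kind() == io::ErrorKind::NotFound => {}
            Err(err) => return Err(err),
        }
    }

    // Assumption of this sketch: the flag is cleared once cleanup is done.
    fs::remove_file(&flag)?;
    Ok(true)
}
```

After this cleanup the shard folder still exists with its configuration but has no WAL, so it would be loaded as a dummy shard.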

An E2E integration test generates a corrupted snapshot to verify that the flag is present after the failed restore, that the service starts up correctly, and that the shard can be recovered.

lib/collection/src/shards/shard_holder/mod.rs

lib/collection/src/shards/local_shard/mod.rs

* Keep initial shard configuration on failed restore
* Set initialization flag for crash safety

github-actions bot mentioned this pull request Feb 21, 2025

Flaky test hnsw_discover_test::hnsw_discover_precision #2973

Open

agourlay marked this pull request as ready for review February 21, 2025 17:04

agourlay requested review from generall and timvisee February 21, 2025 17:05

agourlay mentioned this pull request Feb 21, 2025

Handle malformed shard after incomplete restore #6020

Closed

timvisee requested changes Feb 24, 2025


lib/collection/src/shards/shard_holder/mod.rs

lib/collection/src/shards/shard_holder/mod.rs

lib/collection/src/shards/local_shard/mod.rs

agourlay added 2 commits February 25, 2025 12:16

Keep initial shard configuration on failed restore

66b5a6f

Set initialization flag for crash safety

b322512

agourlay force-pushed the keep-initial-shard-configuration-on-failed-restore branch from 60bba3f to b322512 Compare February 25, 2025 11:31

timvisee approved these changes Feb 25, 2025


agourlay merged commit e50c843 into dev Feb 25, 2025
17 checks passed

agourlay deleted the keep-initial-shard-configuration-on-failed-restore branch February 25, 2025 12:26

timvisee pushed a commit that referenced this pull request Mar 21, 2025

Keep initial shard configuration on failed restore (#6038)

2791bb1

* Keep initial shard configuration on failed restore
* Set initialization flag for crash safety

timvisee mentioned this pull request Mar 21, 2025

Bump version to 1.13.5 #6223

Merged
