Handle malformed shard after incomplete restore #6020

agourlay · 2025-02-19T10:52:58Z

Alternative to #5984

This PR makes sure that a failed snapshot restore does not prevent the service from restarting.

This is achieved by making the restore process crash-safe using a flag file.
The flag is created before the shard restore starts and deleted once the restore has completed successfully.

The presence of a flag at startup means that either:

the restore process crashed midway
the restore process failed and left the flag behind

At startup, if the flag is present, the service loads an empty shard marked as dead.
This empty shard can then be the target of a new snapshot restore.
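
The flag protocol described above can be sketched roughly as follows. This is a minimal, synchronous illustration using only the standard library, not the code from this PR: `restore_with_flag`, `startup_action`, `StartupAction` and the `ShardId = u32` alias are hypothetical names introduced for the example, and the real restore logic is replaced by a caller-supplied closure.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

// Assumption for this sketch: shard ids are plain integers.
type ShardId = u32;

/// Flag file path, following the naming used in the diff under review.
fn shard_initialized_flag_path(collection_path: &Path, shard_id: ShardId) -> PathBuf {
    collection_path.join(format!("shard_{shard_id}.initialized"))
}

/// Hypothetical decision taken at startup for one shard.
#[derive(Debug, PartialEq)]
enum StartupAction {
    /// No flag found: load the shard from disk as usual.
    LoadShard,
    /// Flag found: the restore crashed or failed, load an empty dead shard.
    LoadEmptyDeadShard,
}

/// Runs `restore` while the flag file exists on disk.
/// The flag is removed only when `restore` returns `Ok`.
fn restore_with_flag<F>(collection_path: &Path, shard_id: ShardId, restore: F) -> io::Result<()>
where
    F: FnOnce(&Path) -> io::Result<()>,
{
    let flag = shard_initialized_flag_path(collection_path, shard_id);
    // Create the flag and flush it before touching the shard directory.
    fs::File::create(&flag)?.sync_all()?;

    let shard_dir = collection_path.join(format!("{shard_id}"));
    // On error (or on a crash inside `restore`) the flag stays behind.
    restore(&shard_dir)?;

    fs::remove_file(&flag)
}

/// Checks for a leftover flag when the service starts.
fn startup_action(collection_path: &Path, shard_id: ShardId) -> StartupAction {
    if shard_initialized_flag_path(collection_path, shard_id).exists() {
        StartupAction::LoadEmptyDeadShard
    } else {
        StartupAction::LoadShard
    }
}
```

A failed or interrupted call to `restore_with_flag` leaves the flag on disk, so the next `startup_action` call reports `LoadEmptyDeadShard`. A later successful restore removes the flag again.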

An E2E integration test generates a corrupted snapshot to verify that the flag is left behind and that the service still starts up correctly.

Status

At this point, it is possible to restart the service and restore another, non-corrupted snapshot to fix the shard.

However, we want to be able to resync the node "automatically", which requires a shard transfer.
The absence of a proper replica state makes this difficult to achieve without hacks.

agourlay · 2025-02-20T15:53:23Z

I am still trying to improve the test to show that the shard can be moved.

coszio · 2025-02-20T18:43:30Z

lib/collection/src/shards/mod.rs

@@ -41,6 +41,11 @@ pub fn shard_path(collection_path: &Path, shard_id: ShardId) -> PathBuf {
    collection_path.join(format!("{shard_id}"))
 }

+/// Path to the flag file marking a shard whose restore has not completed
+pub fn shard_initialized_flag_path(collection_path: &Path, shard_id: ShardId) -> PathBuf {
+    collection_path.join(format!("shard_{shard_id}.initialized"))


Suggested change

collection_path.join(format!("shard_{shard_id}.initialized"))

collection_path.join(format!("shard_{shard_id}.initializing"))

To indicate that the file is temporary

coszio · 2025-02-20T18:49:56Z

lib/collection/src/shards/shard_holder/mod.rs

+                // Delete the initialized flag
+                tokio::fs::remove_file(&initialized_flag)
+                    .await
+                    .unwrap_or_else(|e| {
+                        log::error!("Failed to remove initialized flag for shard {collection_id}:{shard_id}: {e}");
+                    });


Should we delete the directory as well? Otherwise I suspect the service can be restarted again and load a corrupt shard

timvisee · 2025-02-21T10:49:27Z

lib/collection/src/shards/transfer/helpers.rs

    let is_active = matches!(
        source_replicas.get(&transfer.from),
-        Some(ReplicaState::Active | ReplicaState::ReshardingScaleDown),
+        Some(ReplicaState::Active | ReplicaState::ReshardingScaleDown) | None,


In many other cases we consider None to be equal to Dead.

I'll have to check if this is compatible with all the other places where we explicitly use None (which I'll do a bit later).

Edit: if we keep the replica state, as discussed in the call, this may be unnecessary altogether.
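
To make the concern concrete, here is a small sketch of the two readings of a missing replica state. It is an assumption-laden illustration: `ReplicaState` is reduced to three variants, `PeerId` is taken to be `u64`, and `is_active_strict` / `is_active_lenient` are hypothetical helper names, not functions from the codebase.

```rust
use std::collections::HashMap;

// Assumption for this sketch: peers are identified by an integer.
type PeerId = u64;

/// Simplified stand-in for the real replica state enum.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ReplicaState {
    Active,
    Dead,
    ReshardingScaleDown,
}

/// Behaviour before the change: a peer without a state is not active.
fn is_active_strict(replicas: &HashMap<PeerId, ReplicaState>, peer: PeerId) -> bool {
    matches!(
        replicas.get(&peer),
        Some(ReplicaState::Active | ReplicaState::ReshardingScaleDown),
    )
}

/// Behaviour after the change: a peer without a state also counts as active.
fn is_active_lenient(replicas: &HashMap<PeerId, ReplicaState>, peer: PeerId) -> bool {
    matches!(
        replicas.get(&peer),
        Some(ReplicaState::Active | ReplicaState::ReshardingScaleDown) | None,
    )
}
```

For a peer that is absent from the map, `is_active_strict` returns `false` while `is_active_lenient` returns `true`. Code elsewhere that equates `None` with `Dead` would therefore disagree with the lenient check.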

timvisee · 2025-02-21T11:40:07Z

lib/collection/src/shards/shard_holder/mod.rs

+                // Delete the initialized flag
+                tokio::fs::remove_file(&initialized_flag)
+                    .await
+                    .unwrap_or_else(|e| {
+                        log::error!("Failed to remove initialized flag for shard {collection_id}:{shard_id}: {e}");
+                    });


I don't think we can just remove the directory. The replica (local shard) must exist, because consensus tells us it should.

If we just remove the directory, we must remove the replica from consensus (and confirm that, which is not trivial).

You make a good point though. Just removing the initialized file here may be incorrect, because after another crash a restart would no longer bring us back to the same state.

I think we must do one of the following:

load dummy shard, mark shard as dead (if not done already), confirm with consensus, only then remove initialized file

remove replica entirely, confirm with consensus, remove directory

I strongly prefer the first, because I see two more problems with the second approach:

we might accidentally remove the last replica of some shard

we lose the replica set state, which we should always have, even if we don't have a local replica
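
The ordering in the first option can be sketched as below. This is only an illustration of the sequencing argument: the `Consensus` trait, its `confirm_replica_dead` method, `FakeConsensus` and `finish_failed_restore` are hypothetical names, and loading the dummy shard is assumed to have happened before the function is called.

```rust
use std::fs;
use std::path::Path;

/// Hypothetical stand-in for the consensus layer.
trait Consensus {
    /// Proposes marking the local replica as dead.
    /// Returns `Ok` only once the change has been confirmed.
    fn confirm_replica_dead(&mut self, shard_id: u32) -> Result<(), String>;
}

/// Test double that either accepts or rejects every proposal.
struct FakeConsensus {
    accept: bool,
}

impl Consensus for FakeConsensus {
    fn confirm_replica_dead(&mut self, shard_id: u32) -> Result<(), String> {
        if self.accept {
            Ok(())
        } else {
            Err(format!("consensus did not confirm dead state for shard {shard_id}"))
        }
    }
}

/// Removes the flag only after consensus has confirmed the dead state.
/// If confirmation fails, or the process crashes before it arrives, the
/// flag is still on disk and the next startup takes the same recovery path.
fn finish_failed_restore<C: Consensus>(
    consensus: &mut C,
    flag: &Path,
    shard_id: u32,
) -> Result<(), String> {
    consensus.confirm_replica_dead(shard_id)?;
    fs::remove_file(flag).map_err(|e| e.to_string())
}
```

Because the flag removal is the last step, every earlier failure leaves the on-disk state unchanged, which is the property the comment above asks for.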

agourlay · 2025-02-21T17:07:11Z

superseded by #6038

timvisee · 2025-02-25T10:23:30Z

Closing for #6038

github-actions bot mentioned this pull request Feb 19, 2025

Flaky test hnsw_discover_test::hnsw_discover_precision #2973

Open

agourlay force-pushed the handle-malfomed-shard-after-incomplete-restore branch from 3e272fd to beb0bf4 Compare February 19, 2025 18:29

github-actions bot mentioned this pull request Feb 19, 2025

Flaky test payload_index_test::test_keyword_facet #5059

Closed

agourlay force-pushed the handle-malfomed-shard-after-incomplete-restore branch from 4de4dc6 to de44188 Compare February 20, 2025 09:34

github-actions bot mentioned this pull request Feb 20, 2025

Flaky test multivector_filtrable_hnsw_test::test_multi_filterable_hnsw::case_5_recommend_eq #5902

Closed

agourlay marked this pull request as ready for review February 20, 2025 15:52

agourlay requested a review from timvisee February 20, 2025 15:53

coszio reviewed Feb 20, 2025

agourlay added 6 commits February 21, 2025 10:15

Handle malformed shard after incomplete restore

a5e050d

add logs

8168a82

hmmm

d378b22

use empty dead shard instead of dummy shard

a94bdad

delete initialized flag after recreating empty shard

414b6ee

add E2E consensus test for corrupted snapshot restore

9824b57

agourlay force-pushed the handle-malfomed-shard-after-incomplete-restore branch from a520e79 to e271679 Compare February 21, 2025 09:20

add log that replica can not be automatically restored

6cdc332

agourlay force-pushed the handle-malfomed-shard-after-incomplete-restore branch from e271679 to 6cdc332 Compare February 21, 2025 09:24

github-actions bot mentioned this pull request Feb 21, 2025

Flaky test hnsw_quantized_search_test::hnsw_quantized_search_manhattan_test #3978

Open

agourlay added 2 commits February 21, 2025 11:35

hack to enable recovering the placeholder recovery shard

7eec464

another hack to restore one remote shard pointer

4eadff0

timvisee reviewed Feb 21, 2025

agourlay mentioned this pull request Feb 21, 2025

Keep initial shard configuration on failed restore #6038

Merged

timvisee closed this Feb 25, 2025
