Conversation

@KShivendu (Member) commented Apr 1, 2025

Fixes the bug demonstrated in #6292

All Submissions:

  • Contributions should target the dev branch. Did you create your branch from dev?
  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?

TODO:

  • Delete the shard initializing flag after initiating transfer to the dummy shard for recovery
  • Try to load shards even if they are dirty, instead of directly recreating them as empty shards. Especially important when we have 0 replicas? Not desirable anymore.

@KShivendu changed the title from "Heal dirty shards" to "Recover dirty shards using other replicas" Apr 1, 2025
@KShivendu marked this pull request as draft April 1, 2025 16:02
```diff
@@ -681,7 +681,7 @@ impl Collection {
                 continue;
             }

-            if this_peer_state != Some(Dead) || replica_set.is_dummy().await {
+            if !(this_peer_state == Some(Dead) || replica_set.is_dirty().await) {
```
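
The new condition is the De Morgan form of "the local peer is Dead or the shard is dirty", i.e. it holds only when the peer is alive and the shard is clean. A self-contained sketch with stand-in types (not Qdrant's real definitions) to make the change explicit:

```rust
// Stand-in for Qdrant's replica state; only the Dead variant matters here.
#[derive(PartialEq, Clone, Copy)]
enum ReplicaState {
    Active,
    Dead,
}
use ReplicaState::*;

// Old condition: peer is not Dead, or the local shard is a dummy.
fn old_condition(this_peer_state: Option<ReplicaState>, is_dummy: bool) -> bool {
    this_peer_state != Some(Dead) || is_dummy
}

// New condition: NOT (peer is Dead or shard is dirty),
// i.e. peer is alive AND shard is clean.
fn new_condition(this_peer_state: Option<ReplicaState>, is_dirty: bool) -> bool {
    !(this_peer_state == Some(Dead) || is_dirty)
}

fn main() {
    // Per the PR discussion, a dirty shard now fails this check even when
    // the local peer state alone would not, making it a recovery candidate.
    assert!(!new_condition(Some(Active), true));
    // The old check keyed off is_dummy() instead of the dirty state.
    assert!(old_condition(Some(Active), true));
}
```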
Member

I wonder whether we might always want to recover dummy shards here. In that case we can remove the dirty flag and just use is_dummy(). If indeed possible, I'd prefer to remove it, to keep our state simpler.

This might conflict with recovery mode though.

Let me have a bit of a thought about this.

KShivendu (Member, Author)

> This might conflict with recovery mode though.

Exactly why I avoided it.

Member

We should probably disable shard transfers altogether when in recovery mode.

If we do that, we can simply check is_dummy() here without problems.

KShivendu (Member, Author) commented Apr 3, 2025

Recovery mode means is_dummy() => true, is_dirty() => false.
A dirty shard means is_dummy() => true, is_dirty() => true.

> disable shard transfers altogether when in recovery mode

Okay, but we must init a transfer when a shard is dirty in order to fix it. That's why I'm using replica_set.is_dirty().await, not replica_set.is_dummy().await.
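
A tiny illustration of that state matrix (illustrative enum and free functions, not the actual DummyShard / ShardReplicaSet API):

```rust
// Illustrative only: both recovery mode and a dirty shard are represented
// by a dummy local shard; the dirty flag is what tells them apart.
#[derive(Clone, Copy, PartialEq)]
enum DummyReason {
    RecoveryMode, // is_dummy() => true, is_dirty() => false
    DirtyShard,   // is_dummy() => true, is_dirty() => true
}

fn is_dummy(_reason: DummyReason) -> bool {
    true // every dummy shard reports is_dummy() == true
}

fn is_dirty(reason: DummyReason) -> bool {
    reason == DummyReason::DirtyShard
}

fn main() {
    // is_dummy() alone cannot distinguish the two cases, which is why the
    // transfer guard above uses is_dirty() instead.
    assert!(is_dummy(DummyReason::RecoveryMode) && !is_dirty(DummyReason::RecoveryMode));
    assert!(is_dummy(DummyReason::DirtyShard) && is_dirty(DummyReason::DirtyShard));
}
```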

Member

I don't see checking a dirty flag as a stable way to tell whether recovery mode is enabled.

Instead we should explicitly check for recovery mode in the place where we drive transfers.

So I still prefer to:

  • just check is_dummy() here
  • explicitly check recovery mode
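
A minimal sketch of that preferred shape, assuming a hypothetical in_recovery_mode flag (the real check would live wherever Qdrant tracks recovery mode; names are illustrative):

```rust
// Hypothetical sketch: check recovery mode explicitly where transfers are
// driven, so the per-replica check can simply be is_dummy().
fn should_auto_recover(peer_is_dead: bool, is_dummy: bool, in_recovery_mode: bool) -> bool {
    // Never initiate automatic shard transfers while in recovery mode.
    if in_recovery_mode {
        return false;
    }
    // Outside recovery mode, a Dead peer or a dummy local shard is a
    // candidate for automatic recovery.
    peer_is_dead || is_dummy
}

fn main() {
    // Recovery mode suppresses auto-recovery even for dummy shards...
    assert!(!should_auto_recover(false, true, true));
    // ...while outside recovery mode a dummy (e.g. dirty) shard is recovered.
    assert!(should_auto_recover(false, true, false));
}
```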

Comment on lines 422 to 441
```rust
if replica_set.is_dummy().await {
    // Check if shard was dirty before init_empty_local_shard
    let was_dirty = replica_set.is_dirty().await;
    // TODO: If dirty, still try to load the shard and init empty shard only if it's not recoverable?
    replica_set.init_empty_local_shard().await?;

    if was_dirty {
        let shard_flag = shard_initializing_flag_path(&collection_path, shard_id);
        tokio::fs::remove_file(&shard_flag).await?;
    }
}
```
Member

Yes, we should initialize an empty shard here: we cannot do our recovery process on top of a corrupt shard, so we must empty it first.

Suggested change:

```diff
 if replica_set.is_dummy().await {
-    // Check if shard was dirty before init_empty_local_shard
+    // If shard was dirty, remove initializing flag after initializing empty
     let was_dirty = replica_set.is_dirty().await;
-    // TODO: If dirty, still try to load the shard and init empty shard only if it's not recoverable?
     replica_set.init_empty_local_shard().await?;
     if was_dirty {
         let shard_flag = shard_initializing_flag_path(&collection_path, shard_id);
         tokio::fs::remove_file(&shard_flag).await?;
     }
 }
```

It's fine to empty it here, because at this point we're sure that there's some other replica that still has our data. We therefore have no data loss.

KShivendu (Member, Author) commented Apr 3, 2025

> because at this point we're sure that there's some other replica that still has our data. We therefore have no data loss

Because there was a shard initializing flag and hence there must be a source replica? Hmm, makes sense.

@KShivendu (Member, Author) commented Apr 15, 2025

> So I still prefer to:
>
>   • just check is_dummy() here
>   • explicitly check recovery mode

Okay. Dropped the is_dirty bool from the DummyShard struct, but I'm still keeping an is_dirty() in ShardReplicaSet since it keeps the code cleaner.
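
One plausible shape for such an accessor, assuming dirtiness is derived from the on-disk shard initializing flag discussed above (illustrative struct and field names, not the real ShardReplicaSet):

```rust
use std::path::PathBuf;

// Illustrative stand-in for ShardReplicaSet: DummyShard no longer stores a
// dirty bool, so "dirty" is derived state rather than stored state.
struct ReplicaSetSketch {
    local_is_dummy: bool,
    shard_initializing_flag: PathBuf, // hypothetical flag path
}

impl ReplicaSetSketch {
    fn is_dummy(&self) -> bool {
        self.local_is_dummy
    }

    // A shard is dirty if it is a dummy AND its initializing flag is still
    // present on disk (i.e. shard initialization never completed).
    fn is_dirty(&self) -> bool {
        self.is_dummy() && self.shard_initializing_flag.exists()
    }
}

fn main() {
    let rs = ReplicaSetSketch {
        local_is_dummy: true,
        shard_initializing_flag: PathBuf::from("shard_0/.initializing"),
    };
    // Dummy, but no flag file exists here, so not considered dirty.
    assert!(rs.is_dummy() && !rs.is_dirty());
}
```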

@KShivendu KShivendu requested a review from timvisee April 15, 2025 15:26
@timvisee (Member) commented Apr 15, 2025

> So I still prefer to:
>
>   • just check is_dummy() here
>   • explicitly check recovery mode
>
> Okay. Dropped the is_dirty bool from the DummyShard struct, but I'm still keeping an is_dirty() in ShardReplicaSet since it keeps the code cleaner.

I don't agree. sync_local_state must never initiate a shard transfer if in recovery mode, and so we shall explicitly check it.

Then the dirty flag becomes obsolete.

A dirty flag that magically acts differently based on external state is not obvious and is a potential footgun. I strongly prefer explicit checks here.

I must admit, I don't recall exactly what we agreed on in terms of which states to auto-recover from. I'll give it a proper thought overnight, as all the possible scenarios aren't super obvious.

@timvisee force-pushed the heal-dirty-shards branch from 7593f6c to 1a4ce81 April 15, 2025 16:04
@timvisee requested a review from generall April 16, 2025 13:54
```rust
// We can reach here because of either of these:
// 1. Qdrant is in recovery mode, and the user intentionally triggered a transfer
// 2. Shard is dirty (shard initializing flag), and Qdrant automatically triggered
//    a transfer to recover dead state
//    (note: initializing flag means there must be another replica)
```
Member

AFAIK the initializing flag does NOT mean that there is another replica, but a shard transfer does.

@KShivendu changed the base branch from test-shard-initializing-behaviour to dev April 16, 2025 17:15
@KShivendu changed the title from "Recover dirty shards using other replicas" to "Recover dirty shards using other replicas when marked Dead" Apr 16, 2025
@KShivendu merged commit 8d49c2c into dev Apr 16, 2025
16 checks passed
@KShivendu deleted the heal-dirty-shards branch April 16, 2025 19:52
pull bot pushed a commit to kp-forks/qdrant that referenced this pull request Apr 21, 2025
* Test behaviour of Qdrant with shard initializing flag

* Corrupt shard directory and let Qdrant panic like prod

* Wait for shard transfer

* Restore dirty shards using other replicas

* remove unused code

* Request transfer only if replica is dead or dirty

* fmt

* remove comment

* fix clippy

* Delete shard initializing flag after initializing empty local shard

* Expect test to recover shard in existing test

* Review suggestions

* Run tests for longer

* Simplify tests

* Use 2k points

* condition for point_count

* Add comment

* fix flaky tests

* fix flaky tests

* handle edge case

* Include Active in expected states list

* Introduce is_recovery

* simplify tests

* get rid of is_dirty bool in DummyShard

* add missing negation in condition

* fix condition

* final fix for transfer condition

* Don't auto recover if in recovery mode, simplify state checking

* minor comment improvements

* tests scenario where node is killed after deleting shard initializing flag

* Fix failing CI

* Only automatically recover dead replicas

* Mark replica as dead to recover dummy shard

* fix failing test

* Sleep one second after killing peer, give time to release WAL lock

* Prevent waiting for peer to come online indefinitely

* update comment

* minor typo

---------

Co-authored-by: timvisee <tim@visee.me>