generall
Member

There is a bug that I can't reproduce locally, but it has been observed in practice multiple times:

If a pod was somehow killed during collection creation, or an error occurred while creating the collection (due to file descriptor exhaustion or something like that), some shards of the collection may end up in an inconsistent state, split between initializing and dead.

The local shard thinks the shard is dead, while the other machines in the cluster consider it initializing.

Since the local shard status is dead, it needs to recover from somewhere, but it is also the only shard in the cluster. So the cluster is stuck in this inconsistent state with no way to recover (except by deleting the collection).

This PR extends our check for is_last_active_replica and handles the case of no active replicas in more detail.
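
To illustrate the idea, here is a minimal sketch of such an extended check. It is not the code from this PR: the `ReplicaState` enum is reduced to three variants, and `PeerId` and the function name `is_last_source_of_truth` are hypothetical. It assumes the intended rule is that when no replica is active, the only initializing replica is treated as the last copy and must not be deactivated.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified replica states (the real enum has more variants).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ReplicaState {
    Active,
    Initializing,
    Dead,
}

/// Hypothetical peer identifier.
type PeerId = u64;

/// Returns true if `peer_id` holds the last replica the shard could recover
/// from, so deactivating it would leave the cluster stuck.
fn is_last_source_of_truth(replicas: &HashMap<PeerId, ReplicaState>, peer_id: PeerId) -> bool {
    // Collect the peers whose replica is in the given state.
    let peers_in = |state: ReplicaState| -> Vec<PeerId> {
        replicas
            .iter()
            .filter(|(_, s)| **s == state)
            .map(|(id, _)| *id)
            .collect()
    };

    let active = peers_in(ReplicaState::Active);
    if !active.is_empty() {
        // Usual case: protect the peer only if it is the sole active replica.
        return active.len() == 1 && active[0] == peer_id;
    }

    // No active replicas at all (assumed rule): fall back to initializing
    // replicas and protect the peer if it is the only one of those.
    let initializing = peers_in(ReplicaState::Initializing);
    initializing.len() == 1 && initializing[0] == peer_id
}
```

With a replica map containing only one initializing replica plus dead ones, the sketch reports the initializing peer as the last source of truth, which is the situation described above where deactivating it would leave nothing to recover from.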

@generall generall requested review from timvisee and ffuugoo June 24, 2025 22:26

Member

@timvisee timvisee left a comment

I did not reproduce this either, but I've seen this problem as well. The implementation looks sound 👍

Co-authored-by: Tim Visée <tim+github@visee.me>
@generall generall merged commit c29ab98 into dev Jun 25, 2025
18 checks passed
@generall generall deleted the do-not-deactivate-last-initializing branch June 25, 2025 12:04
generall added a commit that referenced this pull request Jul 17, 2025
* check if we try to deactivate last initializing replica

* consider more cases

* Update lib/collection/src/shards/replica_set/mod.rs

Co-authored-by: Tim Visée <tim+github@visee.me>

---------

Co-authored-by: Tim Visée <tim+github@visee.me>
3 participants