Skip to content

Conversation

agourlay
Copy link
Member

@agourlay agourlay commented Feb 19, 2025

Alternative to #5984

This PR makes sure that a failed snapshot restore does not prevent the service from restarting.

This is achieved by making the restore process crash safe using a flag file.
The flag is created before the shard restore and deleted afterwards.

The presence of a flag at startup means that either:

  • the restore process crashed midway
  • the restore process failed and left the flag behind

When starting up, the presence of the flag is used to load an empty shard marked as dead.
This empty shard can be the target of a new snapshot restore.

There is an E2E integration test that generates a corrupted snapshot to prove the existence of the flag and the correct startup of the service.

Status

At this point, it is possible to restart the service & try another non-corrupted snapshot to fix the shard.

However we want to be able to resync the node "automatically" which requires a shard transfer.
The absence of a proper replica state makes it difficult to achieve without hacks.

@agourlay
Copy link
Member Author

I am still trying to improve the test to show that the shard can be moved.

@agourlay agourlay requested a review from timvisee February 20, 2025 15:53
@@ -41,6 +41,11 @@ pub fn shard_path(collection_path: &Path, shard_id: ShardId) -> PathBuf {
collection_path.join(format!("{shard_id}"))
}

/// Path to a shard directory
pub fn shard_initialized_flag_path(collection_path: &Path, shard_id: ShardId) -> PathBuf {
collection_path.join(format!("shard_{shard_id}.initialized"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
collection_path.join(format!("shard_{shard_id}.initialized"))
collection_path.join(format!("shard_{shard_id}.initializing"))

To indicate that the file is temporary

Comment on lines +638 to +643
// Delete the initialized flag
tokio::fs::remove_file(&initialized_flag)
.await
.unwrap_or_else(|e| {
log::error!("Failed to remove initialized flag for shard {collection_id}:{shard_id}: {e}");
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we delete the directory as well? Otherwise I suspect the service can be restarted again and load a corrupt shard

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can just remove the directory. The replica (local shard) must exist, because consensus tells us it should.

If we just remove the directory, we must remove the replica from consensus (and confirm that, which is not trivial).

You make a good point though. Just removing the initialized file here may be incorrect, because another crash will not get it into the same state on restart.

I think we must do either of:

  • load dummy shard, mark shard as dead (if not done already), confirm with consensus, only then remove initialized file
  • remove replica entirely, confirm with consensus, remove directory

I strongly prefer the first, because I see two more problems with the second approach:

  • we might accidentally remove the last replica of some shard
  • we lose the replica set state, which we should always have, even if not having a local replica

@agourlay agourlay force-pushed the handle-malfomed-shard-after-incomplete-restore branch from a520e79 to e271679 Compare February 21, 2025 09:20
let is_active = matches!(
source_replicas.get(&transfer.from),
Some(ReplicaState::Active | ReplicaState::ReshardingScaleDown),
Some(ReplicaState::Active | ReplicaState::ReshardingScaleDown) | None,
Copy link
Member

@timvisee timvisee Feb 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In many other cases we consider None to be equal to Dead.

I'll have to check if this is compatible with all the other places where we explicitly use None (which I'll do a bit later).

Edit: if we keep the replica state, as discussed in the call, this may be unnecessary all together.

Comment on lines +638 to +643
// Delete the initialized flag
tokio::fs::remove_file(&initialized_flag)
.await
.unwrap_or_else(|e| {
log::error!("Failed to remove initialized flag for shard {collection_id}:{shard_id}: {e}");
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can just remove the directory. The replica (local shard) must exist, because consensus tells us it should.

If we just remove the directory, we must remove the replica from consensus (and confirm that, which is not trivial).

You make a good point though. Just removing the initialized file here may be incorrect, because another crash will not get it into the same state on restart.

I think we must do either of:

  • load dummy shard, mark shard as dead (if not done already), confirm with consensus, only then remove initialized file
  • remove replica entirely, confirm with consensus, remove directory

I strongly prefer the first, because I see two more problems with the second approach:

  • we might accidentally remove the last replica of some shard
  • we lose the replica set state, which we should always have, even if not having a local replica

@agourlay
Copy link
Member Author

superseded by #6038

@timvisee
Copy link
Member

Closing for #6038

@timvisee timvisee closed this Feb 25, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants