Handle malformed shard after incomplete restore #6020
Conversation
Force-pushed from 3e272fd to beb0bf4
Force-pushed from 4de4dc6 to de44188
I am still trying to improve the test to show that the shard can be moved.
```diff
@@ -41,6 +41,11 @@ pub fn shard_path(collection_path: &Path, shard_id: ShardId) -> PathBuf {
     collection_path.join(format!("{shard_id}"))
 }
 
+/// Path to the initialized flag file of a shard
+pub fn shard_initialized_flag_path(collection_path: &Path, shard_id: ShardId) -> PathBuf {
+    collection_path.join(format!("shard_{shard_id}.initialized"))
+}
```
```diff
-    collection_path.join(format!("shard_{shard_id}.initialized"))
+    collection_path.join(format!("shard_{shard_id}.initializing"))
```
To indicate that the file is temporary
```rust
// Delete the initialized flag
tokio::fs::remove_file(&initialized_flag)
    .await
    .unwrap_or_else(|e| {
        log::error!("Failed to remove initialized flag for shard {collection_id}:{shard_id}: {e}");
    });
```
Should we delete the directory as well? Otherwise I suspect the service can be restarted again and load a corrupt shard
I don't think we can just remove the directory. The replica (local shard) must exist, because consensus tells us it should.
If we remove the directory, we must also remove the replica from consensus (and confirm that, which is not trivial).
You make a good point though. Just removing the initialized file here may be incorrect, because after another crash the node will not get back into the same state on restart.
I think we must do one of the following:
- load a dummy shard, mark the shard as dead (if not done already), confirm with consensus, and only then remove the initialized file
- remove the replica entirely, confirm with consensus, and remove the directory

I strongly prefer the first, because I see two more problems with the second approach:
- we might accidentally remove the last replica of some shard
- we lose the replica set state, which we should always have, even if there is no local replica
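For illustration only, a rough sketch of what the first option could look like. The helpers `load_dummy_shard` and `set_replica_state_and_confirm`, as well as the error type, are hypothetical stand-ins rather than the real qdrant APIs; the point is that the flag is only removed once consensus has confirmed the `Dead` state.

```rust
use std::path::Path;

// Hypothetical recovery routine run at startup when the flag file is present.
// All helpers below are illustrative stand-ins, not the actual qdrant API.
async fn recover_flagged_shard(
    collection_path: &Path,
    collection_id: &str,
    shard_id: ShardId,
) -> Result<(), CollectionError> {
    let flag = shard_initialized_flag_path(collection_path, shard_id);
    if !flag.exists() {
        return Ok(()); // restore completed normally, nothing to do
    }

    // 1. Keep the replica: load a dummy (empty) local shard in its place.
    load_dummy_shard(collection_path, shard_id).await?;

    // 2. Mark the replica as Dead and wait until consensus confirms it.
    set_replica_state_and_confirm(collection_id, shard_id, ReplicaState::Dead).await?;

    // 3. Only now remove the flag; a crash before this point leaves the flag
    //    in place, so the same recovery runs again on the next restart.
    tokio::fs::remove_file(&flag).await?;
    Ok(())
}
```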
Force-pushed from a520e79 to e271679
Force-pushed from e271679 to 6cdc332
```diff
 let is_active = matches!(
     source_replicas.get(&transfer.from),
-    Some(ReplicaState::Active | ReplicaState::ReshardingScaleDown),
+    Some(ReplicaState::Active | ReplicaState::ReshardingScaleDown) | None,
 );
```
In many other cases we consider `None` to be equal to `Dead`. I'll have to check if this is compatible with all the other places where we explicitly use `None` (which I'll do a bit later).

Edit: if we keep the replica state, as discussed in the call, this may be unnecessary altogether.
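For context, a minimal self-contained sketch of the convention mentioned in the comment (a missing entry treated the same as `Dead`), as opposed to the change quoted in the diff above, which treats `None` as active. The enum is simplified here and the helper names are illustrative, not the actual code:

```rust
// Simplified stand-in for the real ReplicaState enum.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ReplicaState {
    Active,
    Dead,
    ReshardingScaleDown,
}

// Convention referred to in the comment: no recorded state behaves like Dead.
fn effective_state(state: Option<ReplicaState>) -> ReplicaState {
    state.unwrap_or(ReplicaState::Dead)
}

fn is_active(state: Option<ReplicaState>) -> bool {
    matches!(
        effective_state(state),
        ReplicaState::Active | ReplicaState::ReshardingScaleDown
    )
}
```

Under that convention, `is_active(None)` is `false`, which is exactly what the quoted change relaxes.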
superseded by #6038
Closing for #6038
Alternative to #5984
This PR makes sure that a failed snapshot restore does not prevent the service from restarting.
This is achieved by making the restore process crash-safe using a flag file.
The flag is created before the shard restore and deleted after the restore completes successfully.
The presence of the flag at startup therefore means that a restore was started but never finished, so the shard data cannot be trusted.
In that case, the flag is used to load an empty shard marked as dead.
This empty shard can be the target of a new snapshot restore.
An E2E integration test generates a corrupted snapshot to verify that the flag is left in place and that the service still starts up correctly.
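A minimal sketch of the flag-file pattern described above, independent of the actual qdrant code and using only the standard library. The restore step is passed in as a closure and all names are illustrative:

```rust
use std::fs;
use std::io;
use std::path::Path;

// Wrap a restore operation in a flag file: the flag exists exactly while the
// restore is in flight, so a leftover flag on startup means the restore never
// completed and the shard data cannot be trusted.
fn restore_with_flag(flag: &Path, restore: impl FnOnce() -> io::Result<()>) -> io::Result<()> {
    fs::write(flag, b"")?; // 1. create the flag before touching the shard
    restore()?;            // 2. a crash or error here leaves the flag behind
    fs::remove_file(flag)  // 3. remove the flag only after a successful restore
}

// On startup: decide whether the shard must be replaced by an empty dummy
// shard marked as dead (as this PR does) instead of being loaded normally.
fn shard_needs_recovery(flag: &Path) -> bool {
    flag.exists()
}
```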
Status
At this point, it is possible to restart the service and retry with a non-corrupted snapshot to fix the shard.
However, we want to be able to resync the node "automatically", which requires a shard transfer.
The absence of a proper replica state makes this difficult to achieve without hacks.