-
Notifications
You must be signed in to change notification settings - Fork 693
Ensure snapshots can be committed after migration #37064
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes in general sense to me.
I'm wondering about MigrationSnapshotDirector
why it doesnt directly force/commit the snapshot, instead of using the logic of the other async director where we wait for the commit.
I mean the use cases are different right? I feel it might make sense to adjust this, and have a clear difference betweem them one is sync the other is async.
But I guess in both cases you need the commit position.
@@ -390,7 +391,7 @@ public void onCommit(final long committedPosition) { | |||
public void newPositionCommitted(final long currentCommitPosition) { | |||
actor.run( | |||
() -> { | |||
commitPosition = currentCommitPosition; | |||
commitPosition = Math.max(currentCommitPosition, commitPosition); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❓ Just that I get this right: The idea is that the currentCommitPosition might be 0 because we are leader and haven't seen yet a new application entry right? But isnt then the commitPosition field anyway 0? Because we reinit/restarted the partition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The entries written before this node becomes leader would have been committed. So the actual commit position can be > 0 if there was a leader already before this one.
try (final LogStreamReader logStreamReader = | ||
context.getLogStream().newLogStreamReader()) { | ||
final var commitPosition = logStreamReader.seekToEnd(); | ||
director.onCommit(commitPosition); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❓ Do we need to read the log? Doesnt RAFT or the LogStream on open already? Can we get this from there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we keep the information about the last entry anywhere. So we have to seek to find the information.
when(actorSchedulingService.submitActor(any(), any())) | ||
.thenReturn(TestActorFuture.completedFuture(null)); | ||
transitionContext.setActorSchedulingService(actorSchedulingService); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🙃 How did this ever work? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was only verifying if the during transitions it is removing and installing the snapshot director as expected. So an actual actor was not necessary. Now to verify if commit position is updated, we have to have an actual actor that can run and update it.
if (Role.LEADER.equals(targetRole)) { | ||
// verify that the last position is read to notify snapshot director | ||
verify(logstreamReader, times(1)).seekToEnd(); | ||
assertThat(transitionContext.getSnapshotDirector().getCommitPosition()) | ||
.isEqualTo(LAST_LOG_POSITION); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❓ Couldnt we also validate that this snapshot is now valid/confirmed/commited?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is not taking the snapshot, it is only verifying that commit position is initialized after role transition.
Even though it is forcing the snapshot, it has to still follow the same logic as in AsyncSnapshotDirector because there will be concurrent state changes from StreamProcessor replay or processing. |
Backport failed for Please cherry-pick the changes locally and resolve any conflicts. git fetch origin main
git worktree add -d .worktree/backport-37064-to-main origin/main
cd .worktree/backport-37064-to-main
git switch --create backport-37064-to-main
git cherry-pick -x 6080d32494a3cb658ffdadb5d5ac73ae61d4176c 2836eccbba83e1ff4fe6ab3f90aa91ecc9f087a5 184c872eb5a572f94d22f1b1625c2c71812b6e24 |
Successfully created backport PR for |
But migration starts before all of this, no? |
Migration is done before the StreamProcessor starts. But the snapshot after migration is done async, as far as I can see from the |
Description
MigrationSnapshotDirector
did not become healthy because a snapshot was not taken.AsyncSnapshotDirector
could not commit the snapshot because the commitPosition remained 0. This is becauseRaftServer
only notifies of new commit if there are new application entries written after transitioning to leader. This was ok until now, because snapshots will be eventually taken when new entries are committed. But this prevents the force snapshot after migration if there are no new processing records.The test was failing due to the partition not becoming healthy because it cannot commit the snapshot. However, it is not clear why migration is running again after restart.
Checklist
Related issues
closes #37046