
Conversation

deepthidevaki
Contributor

@deepthidevaki deepthidevaki commented Aug 21, 2025

Description

MigrationSnapshotDirector did not become healthy because a snapshot was not taken. AsyncSnapshotDirector could not commit the snapshot because the commitPosition remained 0. This is because RaftServer only notifies of a new commit if new application entries are written after transitioning to leader. This was acceptable until now, because a snapshot would eventually be taken once new entries were committed. But it prevents the forced snapshot after migration when there are no new processing records.
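The effect of the one-line fix can be sketched in isolation (the class and method names here are illustrative, not the actual Zeebe implementation): since Raft only notifies about commits for entries appended after the leader transition, a notification can carry a stale position of 0, and taking the max keeps the director's view of the commit position monotonic.

```java
// Hypothetical sketch of the commit-position update; names do not
// match the actual Zeebe AsyncSnapshotDirector implementation.
public class Main {
  static long commitPosition = 100L; // seeded from the log on leader transition

  // Old behaviour: blindly overwrite, so a notification carrying
  // position 0 (no new application entries yet) regresses it.
  static void newPositionCommittedOld(long current) {
    commitPosition = current;
  }

  // Fixed behaviour: never move the commit position backwards.
  static void newPositionCommitted(long current) {
    commitPosition = Math.max(current, commitPosition);
  }

  public static void main(String[] args) {
    newPositionCommitted(0L); // stale notification right after transition
    System.out.println(commitPosition); // position is preserved
    newPositionCommitted(150L); // new entries committed later
    System.out.println(commitPosition); // position advances normally
  }
}
```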

The test was failing because the partition could not commit the snapshot and therefore did not become healthy. However, it is not clear why migration runs again after a restart.

Related issues

closes #37046

@deepthidevaki deepthidevaki added the backport stable/8.6 Backport a pull request to stable/8.6 label Aug 21, 2025
@github-actions github-actions bot added the component/zeebe Related to the Zeebe component/team label Aug 21, 2025
@deepthidevaki deepthidevaki added the backport main Forward-port a pull request to main label Aug 22, 2025
@deepthidevaki deepthidevaki marked this pull request as ready for review August 22, 2025 10:48
@deepthidevaki deepthidevaki requested review from a team and ChrisKujawa and removed request for a team August 22, 2025 10:49
Member

@ChrisKujawa ChrisKujawa left a comment

Makes sense to me in general.

I'm wondering why MigrationSnapshotDirector doesn't directly force/commit the snapshot, instead of reusing the logic of the other async director where we wait for the commit.

I mean, the use cases are different, right? I feel it might make sense to adjust this and have a clear difference between them: one is sync, the other is async.

But I guess in both cases you need the commit position.

@@ -390,7 +391,7 @@ public void onCommit(final long committedPosition) {
   public void newPositionCommitted(final long currentCommitPosition) {
     actor.run(
         () -> {
-          commitPosition = currentCommitPosition;
+          commitPosition = Math.max(currentCommitPosition, commitPosition);
Member

❓ Just so I get this right: the idea is that currentCommitPosition might be 0 because we are leader and haven't seen a new application entry yet, right? But isn't the commitPosition field 0 anyway, because we reinitialized/restarted the partition?

Contributor Author

Entries written before this node became leader would already have been committed, so the actual commit position can be > 0 if there was a leader before this one.

Comment on lines +84 to +87
try (final LogStreamReader logStreamReader =
context.getLogStream().newLogStreamReader()) {
final var commitPosition = logStreamReader.seekToEnd();
director.onCommit(commitPosition);
Member

❓ Do we need to read the log? Doesn't Raft or the LogStream already do this on open? Can we get this from there?

Contributor Author

I don't think we keep information about the last entry anywhere, so we have to seek to find it.
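Since the last written position is not cached anywhere, the transition step has to seek to the end of the log. A toy sketch of the idea, using a stand-in reader (the real Zeebe LogStreamReader API differs; this only illustrates seeking to the end to recover the last position):

```java
import java.util.List;

// Stand-in for a log stream reader; not the real Zeebe API.
public class Main {
  static class FakeLogStreamReader implements AutoCloseable {
    private final List<Long> positions;

    FakeLogStreamReader(List<Long> positions) {
      this.positions = positions;
    }

    // Returns the position of the last entry, or -1 for an empty log.
    long seekToEnd() {
      return positions.isEmpty() ? -1L : positions.get(positions.size() - 1);
    }

    @Override
    public void close() {}
  }

  public static void main(String[] args) {
    try (FakeLogStreamReader reader = new FakeLogStreamReader(List.of(10L, 42L, 97L))) {
      // Seed the snapshot director with the last position, so a forced
      // snapshot can be committed even if no new application entries
      // arrive after the leader transition.
      long commitPosition = reader.seekToEnd();
      System.out.println(commitPosition);
    }
  }
}
```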

Comment on lines -54 to -56
when(actorSchedulingService.submitActor(any(), any()))
.thenReturn(TestActorFuture.completedFuture(null));
transitionContext.setActorSchedulingService(actorSchedulingService);
Member

🙃 How did this ever work? 🤔

Contributor Author

It was only verifying that, during transitions, the snapshot director is removed and installed as expected, so an actual actor was not necessary. Now, to verify that the commit position is updated, we need an actual actor that can run and update it.

Comment on lines +105 to +110
if (Role.LEADER.equals(targetRole)) {
// verify that the last position is read to notify snapshot director
verify(logstreamReader, times(1)).seekToEnd();
assertThat(transitionContext.getSnapshotDirector().getCommitPosition())
.isEqualTo(LAST_LOG_POSITION);
}
Member

❓ Couldn't we also validate that this snapshot is now valid/confirmed/committed?

Contributor Author

This test does not take a snapshot; it only verifies that the commit position is initialized after the role transition.

@deepthidevaki
Contributor Author

I'm wondering why MigrationSnapshotDirector doesn't directly force/commit the snapshot, instead of reusing the logic of the other async director where we wait for the commit.

Even though it forces the snapshot, it still has to follow the same logic as AsyncSnapshotDirector, because there will be concurrent state changes from StreamProcessor replay or processing.

@deepthidevaki deepthidevaki added this pull request to the merge queue Aug 22, 2025
Merged via the queue into stable/8.7 with commit e464091 Aug 22, 2025
63 checks passed
@deepthidevaki deepthidevaki deleted the dd-37046-snapshot-after-migration branch August 22, 2025 12:47
@backport-action
Collaborator

Backport failed for main, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin main
git worktree add -d .worktree/backport-37064-to-main origin/main
cd .worktree/backport-37064-to-main
git switch --create backport-37064-to-main
git cherry-pick -x 6080d32494a3cb658ffdadb5d5ac73ae61d4176c 2836eccbba83e1ff4fe6ab3f90aa91ecc9f087a5 184c872eb5a572f94d22f1b1625c2c71812b6e24

@backport-action
Collaborator

Successfully created backport PR for stable/8.6:

github-merge-queue bot pushed a commit that referenced this pull request Aug 22, 2025
…on (#37116)

# Description
Backport of #37064 to `stable/8.6`.

relates to #37046
@ChrisKujawa
Member

I'm wondering why MigrationSnapshotDirector doesn't directly force/commit the snapshot, instead of reusing the logic of the other async director where we wait for the commit.

Even though it forces the snapshot, it still has to follow the same logic as AsyncSnapshotDirector, because there will be concurrent state changes from StreamProcessor replay or processing.

But migration starts before all of this, no?

@deepthidevaki
Contributor Author

But migration starts before all of this, no?

Migration is done before the StreamProcessor starts. But the snapshot after migration is taken asynchronously, as far as I can see from MigrationSnapshotDirector. I assume it runs in parallel to StreamProcessor startup because, in 8.6 and 8.7, we cannot take a snapshot if no new records were processed or replayed.

Labels
backport main Forward-port a pull request to main backport stable/8.6 Backport a pull request to stable/8.6 component/zeebe Related to the Zeebe component/team version:8.6.25