
Conversation

deepthidevaki
Contributor

@deepthidevaki deepthidevaki commented Aug 21, 2025

Description

MigrationSnapshotDirector did not become healthy because a snapshot was not taken. AsyncSnapshotDirector could not commit the snapshot because the commitPosition remained 0. This is because RaftServer only notifies of a new commit if new application entries are written after transitioning to leader. This was acceptable until now, because a snapshot would eventually be taken once new entries were committed. But it prevents the forced snapshot after migration when there are no new processing records.
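The effect of the one-line fix can be sketched in isolation (the class and method names here are illustrative, not the actual Zeebe implementation): since Raft only notifies about commits for entries appended after the leader transition, a notification can carry a stale position of 0, and taking the max keeps the director's view of the commit position monotonic.

```java
// Hypothetical sketch of the commit-position update; names do not
// match the actual Zeebe AsyncSnapshotDirector implementation.
public class Main {
  static long commitPosition = 100L; // seeded from the log on leader transition

  // Old behaviour: blindly overwrite, so a notification carrying
  // position 0 (no new application entries yet) regresses it.
  static void newPositionCommittedOld(long current) {
    commitPosition = current;
  }

  // Fixed behaviour: never move the commit position backwards.
  static void newPositionCommitted(long current) {
    commitPosition = Math.max(current, commitPosition);
  }

  public static void main(String[] args) {
    newPositionCommitted(0L); // stale notification right after transition
    System.out.println(commitPosition); // position is preserved
    newPositionCommitted(150L); // new entries committed later
    System.out.println(commitPosition); // position advances normally
  }
}
```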

The test was failing because the partition could not commit the snapshot and therefore did not become healthy. However, it is not clear why migration runs again after a restart.

Related issues

closes #37046

@deepthidevaki deepthidevaki added the backport stable/8.6 Backport a pull request to stable/8.6 label Aug 21, 2025
@github-actions github-actions bot added the component/zeebe Related to the Zeebe component/team label Aug 21, 2025
@deepthidevaki deepthidevaki added the backport main Forward-port a pull request to main label Aug 22, 2025
@deepthidevaki deepthidevaki marked this pull request as ready for review August 22, 2025 10:48
@deepthidevaki deepthidevaki requested review from a team and ChrisKujawa and removed request for a team August 22, 2025 10:49
Member

@ChrisKujawa ChrisKujawa left a comment

Makes sense to me in general.

I'm wondering why MigrationSnapshotDirector doesn't directly force/commit the snapshot, instead of reusing the logic of the other async director where we wait for the commit.

I mean, the use cases are different, right? I feel it might make sense to adjust this and have a clear difference between them: one is sync, the other is async.

But I guess in both cases you need the commit position.

@@ -390,7 +391,7 @@ public void onCommit(final long committedPosition) {
   public void newPositionCommitted(final long currentCommitPosition) {
     actor.run(
         () -> {
-          commitPosition = currentCommitPosition;
+          commitPosition = Math.max(currentCommitPosition, commitPosition);
Member

❓ Just so I get this right: the idea is that currentCommitPosition might be 0 because we are leader and haven't seen a new application entry yet, right? But isn't the commitPosition field 0 anyway, because we reinitialized/restarted the partition?

Contributor Author

Entries written before this node became leader would already have been committed, so the actual commit position can be > 0 if there was a leader before this one.

Comment on lines +84 to +87
try (final LogStreamReader logStreamReader =
context.getLogStream().newLogStreamReader()) {
final var commitPosition = logStreamReader.seekToEnd();
director.onCommit(commitPosition);
Member

❓ Do we need to read the log? Doesn't Raft or the LogStream already do this on open? Can we get this from there?

Contributor Author

I don't think we keep information about the last entry anywhere, so we have to seek to find it.
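Since the last written position is not cached anywhere, the transition step has to seek to the end of the log. A toy sketch of the idea, using a stand-in reader (the real Zeebe LogStreamReader API differs; this only illustrates seeking to the end to recover the last position):

```java
import java.util.List;

// Stand-in for a log stream reader; not the real Zeebe API.
public class Main {
  static class FakeLogStreamReader implements AutoCloseable {
    private final List<Long> positions;

    FakeLogStreamReader(List<Long> positions) {
      this.positions = positions;
    }

    // Returns the position of the last entry, or -1 for an empty log.
    long seekToEnd() {
      return positions.isEmpty() ? -1L : positions.get(positions.size() - 1);
    }

    @Override
    public void close() {}
  }

  public static void main(String[] args) {
    try (FakeLogStreamReader reader = new FakeLogStreamReader(List.of(10L, 42L, 97L))) {
      // Seed the snapshot director with the last position, so a forced
      // snapshot can be committed even if no new application entries
      // arrive after the leader transition.
      long commitPosition = reader.seekToEnd();
      System.out.println(commitPosition);
    }
  }
}
```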

Comment on lines -54 to -56
when(actorSchedulingService.submitActor(any(), any()))
.thenReturn(TestActorFuture.completedFuture(null));
transitionContext.setActorSchedulingService(actorSchedulingService);
Member

🙃 How did this ever work? 🤔

Contributor Author

It was only verifying that, during transitions, the snapshot director is removed and installed as expected, so an actual actor was not necessary. Now, to verify that the commit position is updated, we need an actual actor that can run and update it.

Comment on lines +105 to +110
if (Role.LEADER.equals(targetRole)) {
// verify that the last position is read to notify snapshot director
verify(logstreamReader, times(1)).seekToEnd();
assertThat(transitionContext.getSnapshotDirector().getCommitPosition())
.isEqualTo(LAST_LOG_POSITION);
}
Member

❓ Couldn't we also validate that this snapshot is now valid/confirmed/committed?

Contributor Author

This test does not take a snapshot; it only verifies that the commit position is initialized after the role transition.

@deepthidevaki
Contributor Author

I'm wondering why MigrationSnapshotDirector doesn't directly force/commit the snapshot, instead of reusing the logic of the other async director where we wait for the commit.

Even though it forces the snapshot, it still has to follow the same logic as AsyncSnapshotDirector, because there will be concurrent state changes from StreamProcessor replay or processing.

@deepthidevaki deepthidevaki added this pull request to the merge queue Aug 22, 2025
Merged via the queue into stable/8.7 with commit e464091 Aug 22, 2025
63 checks passed
@deepthidevaki deepthidevaki deleted the dd-37046-snapshot-after-migration branch August 22, 2025 12:47
@backport-action
Collaborator

Backport failed for main, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin main
git worktree add -d .worktree/backport-37064-to-main origin/main
cd .worktree/backport-37064-to-main
git switch --create backport-37064-to-main
git cherry-pick -x 6080d32494a3cb658ffdadb5d5ac73ae61d4176c 2836eccbba83e1ff4fe6ab3f90aa91ecc9f087a5 184c872eb5a572f94d22f1b1625c2c71812b6e24

@backport-action
Collaborator

Successfully created backport PR for stable/8.6:

github-merge-queue bot pushed a commit that referenced this pull request Aug 22, 2025
…on (#37116)

# Description
Backport of #37064 to `stable/8.6`.

relates to #37046
@ChrisKujawa
Member

I'm wondering why MigrationSnapshotDirector doesn't directly force/commit the snapshot, instead of reusing the logic of the other async director where we wait for the commit.

Even though it forces the snapshot, it still has to follow the same logic as AsyncSnapshotDirector, because there will be concurrent state changes from StreamProcessor replay or processing.

But migration starts before all of this, no?

@deepthidevaki
Contributor Author

But migration starts before all of this, no?

Migration is done before the StreamProcessor starts. But the snapshot after migration is taken asynchronously, as far as I can see from MigrationSnapshotDirector. I assume it runs in parallel to StreamProcessor startup because, in 8.6 and 8.7, we cannot take a snapshot if no new records were processed or replayed.

Labels
backport main Forward-port a pull request to main backport stable/8.6 Backport a pull request to stable/8.6 component/zeebe Related to the Zeebe component/team version:8.6.25