
Conversation

Contributor

@vincent-cognite vincent-cognite commented Apr 29, 2025

  • [x] Bugfix
  • [x] Refactoring (no functional changes, no API changes)

Changes in this PR

We see that workflows can sometimes end up in a state where some tasks are IN_PROGRESS while the workflow itself is TERMINATED, FAILED, TIMED_OUT, ...
This is due to a lack of locking around critical sections where a workflow can be updated concurrently.

To fix this, we suggest improving the ExecutionLockService and introducing a LockInstance that is AutoCloseable.
The new lock mechanism also supports reentrance, and it still relies on the locking support of the underlying storage (Postgres, Redis, ...).

It has proven robust in our testing, and it is what we now use in production.

Something to be aware of is that there can be more contention on workflows with a large number of tasks being updated simultaneously.
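
For illustration, here is a minimal sketch of what an AutoCloseable, reentrant LockInstance could look like. The names below (acquire, DistributedLock, the thread-local hold count) are assumptions for this sketch, not the exact API in the PR:

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: an auto-closeable lock handle with per-thread
// reentrance, delegating only the first acquire and the last release to the
// underlying storage lock (Postgres, Redis, ...). Names are illustrative.
public final class LockInstance implements AutoCloseable {

    // Per-thread hold counts keyed by lockId; this is what provides
    // reentrance without requiring it from the storage DAO.
    private static final ThreadLocal<Map<String, Integer>> HOLD_COUNTS =
            ThreadLocal.withInitial(HashMap::new);

    private final String lockId;
    private final DistributedLock storageLock;

    private LockInstance(String lockId, DistributedLock storageLock) {
        this.lockId = lockId;
        this.storageLock = storageLock;
    }

    public static LockInstance acquire(String lockId, DistributedLock storageLock) {
        Map<String, Integer> counts = HOLD_COUNTS.get();
        int held = counts.getOrDefault(lockId, 0);
        if (held == 0) {
            storageLock.acquireLock(lockId); // only the first acquire hits storage
        }
        counts.put(lockId, held + 1); // nested acquires just bump the count
        return new LockInstance(lockId, storageLock);
    }

    @Override
    public void close() {
        Map<String, Integer> counts = HOLD_COUNTS.get();
        int held = counts.getOrDefault(lockId, 0);
        if (held <= 1) {
            counts.remove(lockId);
            storageLock.releaseLock(lockId); // last holder releases the real lock
        } else {
            counts.put(lockId, held - 1);
        }
    }

    // Minimal abstraction over the storage-backed locking DAOs.
    public interface DistributedLock {
        void acquireLock(String lockId);
        void releaseLock(String lockId);
    }
}

With try-with-resources, the lock is released even if the workflow update throws:

try (LockInstance lock = LockInstance.acquire(workflowId, storageLock)) {
    // update workflow and task state while holding the lock
}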

@vincent-cognite vincent-cognite changed the title extracted relevant parts for the new locking scheme introduce a new locking scheme with reentrance support, and apply it in a number of places Apr 29, 2025
Contributor

@VerstraeteBert VerstraeteBert left a comment

LGTM. Can second that this has been stable in production.

@vincent-cognite vincent-cognite marked this pull request as ready for review April 29, 2025 16:09
@vincent-cognite vincent-cognite changed the title introduce a new locking scheme with reentrance support, and apply it in a number of places introduce a new locking scheme in ExecutionLockService, and apply it in a number of places where race conditions can occur Apr 30, 2025
*
* @param lockId
*/
public void waitForLock(String lockId) {
Contributor Author

unused

@jeffbulltech jeffbulltech added the bug Something isn't working label May 12, 2025
@VerstraeteBert
Copy link
Contributor

Any updates on this, @manan164?

@jeffbulltech
Copy link

@kgoeltner @bradyyie @manan164 Please review

@bradyyie bradyyie merged commit 53b116b into conductor-oss:main Jun 11, 2025
2 checks passed
@@ -112,6 +116,12 @@ public void execute(WorkflowSystemTask systemTask, String taskId) {
boolean hasTaskExecutionCompleted = false;
boolean shouldRemoveTaskFromQueue = false;
String workflowId = task.getWorkflowInstanceId();
Contributor

It doesn't look like this lock gets released.
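
The concern, illustrated with a hedged sketch (the method names on ExecutionLockService are assumptions, and this paraphrases the pattern rather than the exact diff):

// If updateTask throws, releaseLock is never reached and the workflow
// stays locked; try-with-resources on a LockInstance avoids this.
executionLockService.acquireLock(workflowId);
updateTask(task); // may throw
executionLockService.releaseLock(workflowId);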

}

var releaseDistributedLock = false;

Contributor

AFAIK the Redis implementation was already reentrant. If there is an implementation that is not, shouldn't that implementation be changed instead?

Contributor

I don't think that should be the case. The solution provided here is rather simple while offering robust re-entrance. Having to duplicate this logic in some way, shape, or form for all locking DAOs does not feel productive and can lead to more issues down the line.
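
As a concrete illustration of this argument, a single reentrance layer in ExecutionLockService makes nested acquisitions behave the same no matter which locking DAO is configured. A sketch reusing the hypothetical names from the description above:

// Nested acquire on the same thread: only the outer acquire/release
// touches the storage lock, so even a non-reentrant DAO behaves correctly.
try (LockInstance outer = LockInstance.acquire(workflowId, storageLock)) {
    try (LockInstance inner = LockInstance.acquire(workflowId, storageLock)) {
        // inner.close() only decrements the per-thread hold count
    }
} // outer.close() performs the real releaseLock(workflowId)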

Labels
bug Something isn't working
8 participants