
Conversation

Contributor

@vincent-cognite vincent-cognite commented Apr 29, 2025

  • [x] Bugfix
  • [x] Refactoring (no functional changes, no API changes)

Changes in this PR

We see that workflows can sometimes end up in a state where some tasks are IN_PROGRESS while the workflow itself is TERMINATED, FAILED, TIMED_OUT, ...
This is due to a lack of locking around critical sections where a workflow can be updated concurrently.

To fix this, we suggest improving the ExecutionLockService and introducing a LockInstance that is AutoCloseable.
The new lock mechanism also supports reentrance, and it still relies on the locking support of the underlying storage (Postgres, Redis, ...).

It has proven robust in our testing, and it is what we now use in production.

Something to be aware of is that there can be more contention on workflows with a large number of tasks being updated simultaneously.
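
For illustration, here is a minimal sketch of what an AutoCloseable, reentrant LockInstance could look like. The names below (acquire, DistributedLock, the thread-local hold count) are assumptions for this sketch, not the exact API in the PR:

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: an auto-closeable lock handle with per-thread
// reentrance, delegating only the first acquire and the last release to the
// underlying storage lock (Postgres, Redis, ...). Names are illustrative.
public final class LockInstance implements AutoCloseable {

    // Per-thread hold counts keyed by lockId; this is what provides
    // reentrance without requiring it from the storage DAO.
    private static final ThreadLocal<Map<String, Integer>> HOLD_COUNTS =
            ThreadLocal.withInitial(HashMap::new);

    private final String lockId;
    private final DistributedLock storageLock;

    private LockInstance(String lockId, DistributedLock storageLock) {
        this.lockId = lockId;
        this.storageLock = storageLock;
    }

    public static LockInstance acquire(String lockId, DistributedLock storageLock) {
        Map<String, Integer> counts = HOLD_COUNTS.get();
        int held = counts.getOrDefault(lockId, 0);
        if (held == 0) {
            storageLock.acquireLock(lockId); // only the first acquire hits storage
        }
        counts.put(lockId, held + 1); // nested acquires just bump the count
        return new LockInstance(lockId, storageLock);
    }

    @Override
    public void close() {
        Map<String, Integer> counts = HOLD_COUNTS.get();
        int held = counts.getOrDefault(lockId, 0);
        if (held <= 1) {
            counts.remove(lockId);
            storageLock.releaseLock(lockId); // last holder releases the real lock
        } else {
            counts.put(lockId, held - 1);
        }
    }

    // Minimal abstraction over the storage-backed locking DAOs.
    public interface DistributedLock {
        void acquireLock(String lockId);
        void releaseLock(String lockId);
    }
}

With try-with-resources, the lock is released even if the workflow update throws:

try (LockInstance lock = LockInstance.acquire(workflowId, storageLock)) {
    // update workflow and task state while holding the lock
}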

@vincent-cognite vincent-cognite changed the title extracted relevant parts for the new locking scheme introduce a new locking scheme with reentrance support, and apply it in a number of places Apr 29, 2025
Contributor

@VerstraeteBert VerstraeteBert left a comment

LGTM. Can second that this has been stable in production.

@vincent-cognite vincent-cognite marked this pull request as ready for review April 29, 2025 16:09
@vincent-cognite vincent-cognite changed the title introduce a new locking scheme with reentrance support, and apply it in a number of places introduce a new locking scheme in ExecutionLockService, and apply it in a number of places where race conditions can occur Apr 30, 2025
*
* @param lockId
*/
public void waitForLock(String lockId) {
Contributor Author

unused

@jeffbulltech jeffbulltech added the bug Something isn't working label May 12, 2025
@VerstraeteBert
Copy link
Contributor

Any updates on this, @manan164?

@jeffbulltech
Copy link

@kgoeltner @bradyyie @manan164 Please review

@bradyyie bradyyie merged commit 53b116b into conductor-oss:main Jun 11, 2025
2 checks passed
@@ -112,6 +116,12 @@ public void execute(WorkflowSystemTask systemTask, String taskId) {
boolean hasTaskExecutionCompleted = false;
boolean shouldRemoveTaskFromQueue = false;
String workflowId = task.getWorkflowInstanceId();
Contributor

It doesn't look like this lock gets released.
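
The concern, illustrated with a hedged sketch (the method names on ExecutionLockService are assumptions, and this paraphrases the pattern rather than the exact diff):

// If updateTask throws, releaseLock is never reached and the workflow
// stays locked; try-with-resources on a LockInstance avoids this.
executionLockService.acquireLock(workflowId);
updateTask(task); // may throw
executionLockService.releaseLock(workflowId);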

}

var releaseDistributedLock = false;

Contributor

AFAIK the Redis implementation was already reentrant. If there is an implementation that is not, shouldn't that implementation be changed instead?

Contributor

I don't think that should be the case. The solution provided here is rather simple while offering robust re-entrance. Having to duplicate this logic in some way, shape, or form for all locking DAOs does not feel productive and can lead to more issues down the line.
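
As a concrete illustration of this argument, a single reentrance layer in ExecutionLockService makes nested acquisitions behave the same no matter which locking DAO is configured. A sketch reusing the hypothetical names from the description above:

// Nested acquire on the same thread: only the outer acquire/release
// touches the storage lock, so even a non-reentrant DAO behaves correctly.
try (LockInstance outer = LockInstance.acquire(workflowId, storageLock)) {
    try (LockInstance inner = LockInstance.acquire(workflowId, storageLock)) {
        // inner.close() only decrements the per-thread hold count
    }
} // outer.close() performs the real releaseLock(workflowId)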

Labels
bug Something isn't working
8 participants