introduce a new locking scheme in ExecutionLockService, and apply it in a number of places where race conditions can occur #496
Conversation
LGTM. I can second that this has been stable in production.
     *
     * @param lockId
     */
    public void waitForLock(String lockId) {
unused
Any updates on this @manan164?
@kgoeltner @bradyyie @manan164 Please review.
@@ -112,6 +116,12 @@ public void execute(WorkflowSystemTask systemTask, String taskId) {
        boolean hasTaskExecutionCompleted = false;
        boolean shouldRemoveTaskFromQueue = false;
        String workflowId = task.getWorkflowInstanceId();
It doesn't look like this lock gets released.
        }

        var releaseDistributedLock = false;
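For context, here is a minimal, self-contained sketch of the release-gating pattern this flag points at (all names here are hypothetical stand-ins, not the code under review): the lock is released in a finally block, and only by the caller that actually acquired it.

```java
import java.util.concurrent.locks.ReentrantLock;

public class ReleaseGatingExample {

    // Hypothetical stand-in for the lock service used by system-task execution.
    interface LockService {
        boolean acquireLock(String lockId);
        void releaseLock(String lockId);
    }

    public static void main(String[] args) {
        ReentrantLock lock = new ReentrantLock(); // local stand-in for a distributed lock
        LockService executionLockService = new LockService() {
            public boolean acquireLock(String lockId) { return lock.tryLock(); }
            public void releaseLock(String lockId) { lock.unlock(); }
        };

        String workflowId = "wf-123"; // hypothetical workflow id
        // Only release when this call actually acquired the lock, so neither a
        // failed acquisition nor an outer holder gets unlocked by mistake.
        boolean releaseDistributedLock = false;
        try {
            releaseDistributedLock = executionLockService.acquireLock(workflowId);
            if (!releaseDistributedLock) {
                return; // another worker holds the lock; skip this execution
            }
            // ... execute the system task and update the workflow under the lock ...
        } finally {
            if (releaseDistributedLock) {
                executionLockService.releaseLock(workflowId);
            }
        }
    }
}
```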
AFAIK the Redis implementation was already reentrant; if there is an implementation that is not, shouldn't that implementation be changed instead?
I don't think that should be the case. The solution provided here is rather simple while offering robust reentrance. Having to duplicate this logic in some way, shape, or form for every locking DAO does not feel productive, and could lead to more issues down the line.
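To illustrate the argument (a sketch only, under assumed names, not the implementation in this PR): reentrance can be added once in the service layer with a per-thread hold count, so every backend keeps a simple non-reentrant acquire/release contract.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical non-reentrant backend contract (Redis, Postgres, ...).
interface DistributedLock {
    boolean acquireLock(String lockId);
    void releaseLock(String lockId);
}

public class ReentrantLockLayer {
    private final DistributedLock delegate;
    // Per-thread hold counts: only the outermost acquire/release pair
    // touches the underlying distributed lock.
    private final ThreadLocal<Map<String, Integer>> holdCounts =
            ThreadLocal.withInitial(HashMap::new);

    public ReentrantLockLayer(DistributedLock delegate) {
        this.delegate = delegate;
    }

    public boolean acquireLock(String lockId) {
        Map<String, Integer> counts = holdCounts.get();
        Integer held = counts.get(lockId);
        if (held != null) {                  // already held by this thread:
            counts.put(lockId, held + 1);    // just bump the count
            return true;
        }
        if (delegate.acquireLock(lockId)) {  // outermost acquisition
            counts.put(lockId, 1);
            return true;
        }
        return false;
    }

    public void releaseLock(String lockId) {
        Map<String, Integer> counts = holdCounts.get();
        Integer held = counts.get(lockId);
        if (held == null) {
            return;                          // not held by this thread
        }
        if (held > 1) {
            counts.put(lockId, held - 1);
        } else {
            counts.remove(lockId);
            delegate.releaseLock(lockId);    // outermost release frees the backend
        }
    }

    public static void main(String[] args) {
        AtomicBoolean held = new AtomicBoolean(false);
        // Deliberately non-reentrant backend: a second raw acquire would fail.
        DistributedLock backend = new DistributedLock() {
            public boolean acquireLock(String lockId) { return held.compareAndSet(false, true); }
            public void releaseLock(String lockId) { held.set(false); }
        };
        ReentrantLockLayer service = new ReentrantLockLayer(backend);
        System.out.println(service.acquireLock("wf-1")); // true (backend acquired)
        System.out.println(service.acquireLock("wf-1")); // true (reentrant, count bumped)
        service.releaseLock("wf-1");                     // backend still held
        service.releaseLock("wf-1");                     // outermost release, backend freed
    }
}
```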
Changes in this PR
We have seen that workflows can sometimes end up in a state where some tasks are IN_PROGRESS while the workflow itself is TERMINATED, FAILED, TIMED_OUT, ...
This is due to a lack of locking around critical sections where a workflow can be updated concurrently.
To fix this, we suggest improving the ExecutionLockService and introducing a LockInstance that is auto-closeable (see the sketch below).
The new lock mechanism also supports reentrance, and it still relies on the locking support of the underlying storage (Postgres, Redis, ...).
It has proved robust in our testing, and it is what we now use in production.
Something to be aware of: there can be more contention on workflows with a large number of tasks that are updated simultaneously.
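As a rough illustration of the auto-closeable idea (a minimal sketch: LockInstance is named in the description above, but the signatures here are assumptions, and a local ReentrantLock stands in for the distributed lock), try-with-resources guarantees the lock is released even when the update throws:

```java
import java.util.concurrent.locks.ReentrantLock;

public class LockInstanceExample {

    // Hypothetical auto-closeable handle: acquiring returns a resource whose
    // close() releases the lock, so try-with-resources can manage it.
    static class LockInstance implements AutoCloseable {
        private final ReentrantLock lock;

        private LockInstance(ReentrantLock lock) {
            this.lock = lock;
        }

        static LockInstance acquire(ReentrantLock lock) {
            lock.lock(); // reentrant: a nested acquire on the same thread is safe
            return new LockInstance(lock);
        }

        @Override
        public void close() {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        ReentrantLock workflowLock = new ReentrantLock(); // stand-in for the distributed lock

        try (LockInstance ignored = LockInstance.acquire(workflowLock)) {
            // ... update workflow and task state; the lock is released on exit,
            // whether this block completes normally or throws ...
            System.out.println("workflow updated under lock");
        }
    }
}
```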