[BUG] - Lock does not work for distributed instances #841

@rbroggi

Describe the bug

For duration-driven job schedules whose jobs execute relatively quickly, the lock approach does not guarantee a single execution across instances. The schedulers are likely to start at different times on different instances (e.g. during pod rollouts in k8s), so their tickers are shifted relative to one another and the schedule's trigger moments are not synchronized. What ends up happening is that most instances successfully acquire and release the lock when their own execution time comes, so the job runs on most instances instead of only once.

The same problem applies to the other schedule types in the presence of clock skew: one instance is likely to acquire and release the lock before another instance attempts to acquire it.

I don't see a way to fix this, but I think we should either document this situation or remove the distributed lock entirely.

To Reproduce

I can try to create a reproducer, but I think the explanation is sufficient; it's more a functionality bug than a technical bug.

Version

v2.16.1

Expected behavior

Could we either document the distributed lock's shortcomings or remove the functionality?

Additional context

I have opted to use leader-election after identifying the issue above.
