discovery: Try fixing potential deadlocks in discovery #16587
Conversation
Manager.ApplyConfig() uses multiple locks:
- Provider.mu
- Manager.targetsMtx

Manager.cleaner() uses the same locks but in the opposite order:
- First it locks Manager.targetsMtx
- Then it locks Provider.mu

I've seen a few strange cases of Prometheus hanging on shutdown and never completing that shutdown. From a few traces I was given, it appears that while Prometheus is still running, only discovery.Manager and notifier.Manager are left running. From those traces it also seems like they are stuck on a lock in two functions:
- cleaner() waits on an RLock()
- ApplyConfig() waits on a Lock()

I cannot reproduce it, but I suspect this is a race between locks. Imagine this scenario:
- Manager.ApplyConfig() is called
- Manager.ApplyConfig() locks Provider.mu.Lock()
- At the same time cleaner() is called on the same Provider instance and it calls Manager.targetsMtx.Lock()
- Manager.ApplyConfig() now calls Manager.targetsMtx.Lock(), but that lock is already held by the cleaner() function, so ApplyConfig() hangs there
- At the same time cleaner() now wants to lock Provider.mu.RLock(), but that lock is already held by Manager.ApplyConfig()
- We end up with both functions locking each other out with no way to break the deadlock

Re-order lock calls to try to avoid this scenario. I tried writing a test case for it but couldn't hit this issue.

Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
thanks for this. I see also
Make sure the order of locks is always the same in all functions. In ApplyConfig() we have m.targetsMtx.Lock() after the provider is locked, so replicate the same order in allGroups(). Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
Added
lgtm
Was able to capture more goroutine dumps from the affected instance and I do believe this fixes the issue I've observed before. What is running when we are hung:
- ApplyConfig()
- cancelDiscoverers()
- cleaner() x 2
- cleaner()

ApplyConfig() and cleaner() are mutually locking each other out. Here's the dump filtered down to only what's important here:
Does someone else need to review this, or what else needs to happen before this can be merged?
It’s clear that bug gave you a hard time. (For tricky changes like this one, I approve the PR while keeping it open so others have a chance to flag anything we might have missed.)
I saw a test fail on this PR, and now it also failed on main after the merge: https://github.com/prometheus/prometheus/actions/runs/15108680391/job/42462961125. Not sure if it's directly related to the change; maybe the test needs to be adjusted...
Every time I open a PR or push to an existing PR there's a random test that fails, so I doubt this failure is related to my change.
Don't hesitate to open an issue for them or update an existing one, they get fixed eventually. |