
Conversation

prymitive
Contributor

Manager.ApplyConfig() uses multiple locks:

  • Provider.mu
  • Manager.targetsMtx

Manager.cleaner() uses the same locks but in the opposite order:

  • First it locks Manager.targetsMtx
  • Then it locks Provider.mu

I've seen a few strange cases of Prometheus hanging on shutdown and never completing it. From a few traces I was given, it appears that while Prometheus is stuck, only discovery.Manager and notifier.Manager are still running. From those traces it also seems like they are stuck on locks in two functions:

  • cleaner waits on a RLock()
  • ApplyConfig waits on a Lock()

I cannot reproduce it, but I suspect a lock-ordering deadlock. Imagine this scenario:

  • Manager.ApplyConfig() is called
  • Manager.ApplyConfig() calls Provider.mu.Lock()
  • at the same time cleaner() is called on the same Provider instance and it calls Manager.targetsMtx.Lock()
  • Manager.ApplyConfig() now calls Manager.targetsMtx.Lock(), but that lock is already held by cleaner(), so ApplyConfig() hangs there
  • at the same time cleaner() now wants to call Provider.mu.RLock(), but that lock is already held by Manager.ApplyConfig()
  • we end up with both functions locking each other out, with no way to break the deadlock

Re-order lock calls to try to avoid this scenario. I tried writing a test case for it but couldn't hit this issue.

@machine424
Member

Thanks for this.
Yes, it's always a good idea to have them ordered (and I don't see why we can't), even though we cannot reproduce the issue.

I see allGroups also uses both; maybe let's have ApplyConfig try targetsMtx first, since mu is acquired in loops in ApplyConfig and allGroups? Or maybe we can move targetsMtx down the loop in allGroups?

Make sure the order of locks is always the same in all functions. In ApplyConfig() we have m.targetsMtx.Lock() after provider is locked, so replicate the same in allGroups().

Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
@prymitive
Contributor Author

Or maybe we can move targetsMtx down the loop in allGroups?

Added

Member

@machine424 machine424 left a comment


lgtm

@prymitive
Contributor Author

I was able to capture more goroutine dumps from the affected instance, and I do believe this fixes the issue I've observed before.

What is running when we are hung:

ApplyConfig():

  • Holds Manager.mtx.Lock()
  • Holds Provider.mu.Lock()
  • Hangs on Manager.targetsMtx.Lock()

cancelDiscoverers()

  • Hangs on Manager.mtx.RLock()

cleaner() x 2

  • Hangs on Manager.targetsMtx.Lock()

cleaner()

  • Holds m.targetsMtx.Lock()
  • Hangs on Provider.mu.RLock()

ApplyConfig() and cleaner() are deadlocked against each other:
ApplyConfig() holds Provider.mu.Lock() and waits on Manager.targetsMtx.Lock().
At the same time cleaner() holds Manager.targetsMtx.Lock() and waits on Provider.mu.RLock().

Here's the dump filtered down to only what's important here:

1: sync.Mutex.Lock [68 minutes] [Created by run.(*Group).Run in goroutine 1 @ group.go:37]
    sync         sema.go:95               runtime_SemacquireMutex(*uint32(#42), true, 4696079)
    sync         mutex.go:149             (*Mutex).lockSlow(*Mutex(#28))
    sync         mutex.go:70              (*Mutex).Lock(...)
    sync         mutex.go:46              (*Mutex).Lock(...)
    discovery    manager.go:233           (*Manager).ApplyConfig(#27, #180)
    main         main.go:922              main.func5(#126)
    main         main.go:1463             reloadConfig({#191, 0x1b}, 1, #60, #63, #55, {#70, 0xa, #192})
    main         main.go:1140             main.func22()
    run          group.go:38              (*Group).Run.func1(*Group(#46), #56)
1: sync.RWMutex.RLock [68 minutes] [Created by run.(*Group).Run in goroutine 1 @ group.go:37]
    sync         sema.go:100              runtime_SemacquireRWMutexR(*uint32(#31), true, 824644937472)
    sync         rwmutex.go:74            (*RWMutex).RLock(...)
    discovery    manager.go:380           (*Manager).cancelDiscoverers(#27)
    discovery    manager.go:193           (*Manager).Run(#27)
    main         main.go:1034             main.func11()
    run          group.go:38              (*Group).Run.func1(*Group(#65), #66)
10: sync.Mutex.Lock [68 minutes] [Created by discovery.(*Manager).startProvider in goroutine 376 @ manager.go:304]
    sync         sema.go:95               runtime_SemacquireMutex(*, 2, *)
    sync         mutex.go:149             (*Mutex).lockSlow(#28)
    sync         mutex.go:70              (*Mutex).Lock(...)
    sync         mutex.go:46              (*Mutex).Lock(...)
    discovery    manager.go:309           (*Manager).cleaner(#27, *)
    discovery    manager.go:327           (*Manager).updater(#27, {#23, *}, *, *)
5: sync.Mutex.Lock [68 minutes] [Created by discovery.(*Manager).startProvider in goroutine 376 @ manager.go:304]
    sync         sema.go:95               runtime_SemacquireMutex(*, *, 0)
    sync         mutex.go:149             (*Mutex).lockSlow(#28)
    sync         mutex.go:70              (*Mutex).Lock(...)
    sync         mutex.go:46              (*Mutex).Lock(...)
    discovery    manager.go:309           (*Manager).cleaner(#27, *)
    discovery    manager.go:334           (*Manager).updater(#27, {#23, *}, *, *)
1: sync.RWMutex.RLock [68 minutes] [Created by discovery.(*Manager).startProvider in goroutine 376 @ manager.go:304]
    sync         sema.go:100              runtime_SemacquireRWMutexR(*uint32(#28), false, 1)
    sync         rwmutex.go:74            (*RWMutex).RLock(...)
    discovery    manager.go:310           (*Manager).cleaner(#27, #157)
    discovery    manager.go:327           (*Manager).updater(#27, {#23, #120}, #157, #124)

@prymitive
Contributor Author

Does someone else need to review this or what else needs to happen before this can be merged?

@machine424
Member

Does someone else need to review this or what else needs to happen before this can be merged?

It's clear that bug gave you a hard time.
Let’s merge and make this the start of a better week.

(for such tricky changes like this one, I approve the PR while keeping it open so others have a chance to flag anything we might have missed.)

@machine424 machine424 merged commit eb8d34c into prometheus:main May 19, 2025
27 checks passed
@prymitive prymitive deleted the discoveryLocks branch May 19, 2025 09:21
@machine424
Member

I saw a test fail on this PR, and now it has also failed on main after the merge: https://github.com/prometheus/prometheus/actions/runs/15108680391/job/42462961125
I also found the failing run on the PR: https://github.com/prometheus/prometheus/actions/runs/14931598411

Not sure if it's directly related to the change, maybe the test needs to be adjusted...

@prymitive
Contributor Author

Every time I open a PR or push to an existing one there's a random test that fails, so I doubt this failure is related to my change.

@machine424
Member

machine424 commented May 19, 2025

Every time I open a PR or push to an existing one there's a random test that fails, so I doubt this failure is related to my change.

Don't hesitate to open an issue for them or update an existing one, they get fixed eventually.
First time I've encountered it personally; let's see. I opened #16615
