
Conversation

prymitive
Contributor

Manager.ApplyConfig() uses multiple locks:

  • Provider.mu
  • Manager.targetsMtx

Manager.cleaner() uses the same locks but in the opposite order:

  • First it locks Manager.targetsMtx
  • Then it locks Provider.mu

I've seen a few strange cases of Prometheus hanging on shutdown and never completing it. From a few traces I was given, it appears that while Prometheus is stuck, only discovery.Manager and notifier.Manager are still running. From those traces it also seems like they are stuck on locks in two functions:

  • cleaner waits on a RLock()
  • ApplyConfig waits on a Lock()

I cannot reproduce it, but I suspect a lock-ordering deadlock. Imagine this scenario:

  • Manager.ApplyConfig() is called
  • Manager.ApplyConfig() calls Provider.mu.Lock()
  • at the same time cleaner() is called on the same Provider instance and it calls Manager.targetsMtx.Lock()
  • Manager.ApplyConfig() now calls Manager.targetsMtx.Lock(), but that lock is already held by cleaner(), so ApplyConfig() hangs there
  • at the same time cleaner() now wants to call Provider.mu.RLock(), but that lock is already held by Manager.ApplyConfig()
  • we end up with both functions locking each other out, with no way to break the deadlock

Re-order lock calls to try to avoid this scenario. I tried writing a test case for it but couldn't hit this issue.

@machine424
Member

Thanks for this.
Yes, it's always a good idea to have them ordered (and I don't see why we can't), even though we cannot reproduce the issue.

I see allGroups also uses both; maybe let's have ApplyConfig try targetsMtx first, since mu is acquired in loops in ApplyConfig and allGroups? Or maybe we can move targetsMtx down the loop in allGroups?

Make sure the order of locks is always the same in all functions. In ApplyConfig() we have m.targetsMtx.Lock() after provider is locked, so replicate the same in allGroups().

Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
@prymitive
Contributor Author

Or maybe we can move targetsMtx down the loop in allGroups?

Added

Member

@machine424 machine424 left a comment


lgtm

@prymitive
Contributor Author

I was able to capture more goroutine dumps from the affected instance, and I do believe this fixes the issue I've observed before.

What is running when we are hung:

ApplyConfig():

  • Holds Manager.mtx.Lock()
  • Holds Provider.mu.Lock()
  • Hangs on Manager.targetsMtx.Lock()

cancelDiscoverers()

  • Hangs on Manager.mtx.RLock()

cleaner() x 2

  • Hangs on Manager.targetsMtx.Lock()

cleaner()

  • Holds m.targetsMtx.Lock()
  • Hangs on Provider.mu.RLock()

ApplyConfig() and cleaner() are deadlocked against each other:
ApplyConfig() holds Provider.mu.Lock() and waits on Manager.targetsMtx.Lock().
At the same time cleaner() holds Manager.targetsMtx.Lock() and waits on Provider.mu.RLock().

Here's the dump filtered down to only what's important here:

1: sync.Mutex.Lock [68 minutes] [Created by run.(*Group).Run in goroutine 1 @ group.go:37]
    sync         sema.go:95               runtime_SemacquireMutex(*uint32(#42), true, 4696079)
    sync         mutex.go:149             (*Mutex).lockSlow(*Mutex(#28))
    sync         mutex.go:70              (*Mutex).Lock(...)
    sync         mutex.go:46              (*Mutex).Lock(...)
    discovery    manager.go:233           (*Manager).ApplyConfig(#27, #180)
    main         main.go:922              main.func5(#126)
    main         main.go:1463             reloadConfig({#191, 0x1b}, 1, #60, #63, #55, {#70, 0xa, #192})
    main         main.go:1140             main.func22()
    run          group.go:38              (*Group).Run.func1(*Group(#46), #56)
1: sync.RWMutex.RLock [68 minutes] [Created by run.(*Group).Run in goroutine 1 @ group.go:37]
    sync         sema.go:100              runtime_SemacquireRWMutexR(*uint32(#31), true, 824644937472)
    sync         rwmutex.go:74            (*RWMutex).RLock(...)
    discovery    manager.go:380           (*Manager).cancelDiscoverers(#27)
    discovery    manager.go:193           (*Manager).Run(#27)
    main         main.go:1034             main.func11()
    run          group.go:38              (*Group).Run.func1(*Group(#65), #66)
10: sync.Mutex.Lock [68 minutes] [Created by discovery.(*Manager).startProvider in goroutine 376 @ manager.go:304]
    sync         sema.go:95               runtime_SemacquireMutex(*, 2, *)
    sync         mutex.go:149             (*Mutex).lockSlow(#28)
    sync         mutex.go:70              (*Mutex).Lock(...)
    sync         mutex.go:46              (*Mutex).Lock(...)
    discovery    manager.go:309           (*Manager).cleaner(#27, *)
    discovery    manager.go:327           (*Manager).updater(#27, {#23, *}, *, *)
5: sync.Mutex.Lock [68 minutes] [Created by discovery.(*Manager).startProvider in goroutine 376 @ manager.go:304]
    sync         sema.go:95               runtime_SemacquireMutex(*, *, 0)
    sync         mutex.go:149             (*Mutex).lockSlow(#28)
    sync         mutex.go:70              (*Mutex).Lock(...)
    sync         mutex.go:46              (*Mutex).Lock(...)
    discovery    manager.go:309           (*Manager).cleaner(#27, *)
    discovery    manager.go:334           (*Manager).updater(#27, {#23, *}, *, *)
1: sync.RWMutex.RLock [68 minutes] [Created by discovery.(*Manager).startProvider in goroutine 376 @ manager.go:304]
    sync         sema.go:100              runtime_SemacquireRWMutexR(*uint32(#28), false, 1)
    sync         rwmutex.go:74            (*RWMutex).RLock(...)
    discovery    manager.go:310           (*Manager).cleaner(#27, #157)
    discovery    manager.go:327           (*Manager).updater(#27, {#23, #120}, #157, #124)

@prymitive
Contributor Author

Does someone else need to review this or what else needs to happen before this can be merged?

@machine424
Member

Does someone else need to review this or what else needs to happen before this can be merged?

It's clear that bug gave you a hard time.
Let’s merge and make this the start of a better week.

(for such tricky changes like this one, I approve the PR while keeping it open so others have a chance to flag anything we might have missed.)

@machine424 machine424 merged commit eb8d34c into prometheus:main May 19, 2025
27 checks passed
@prymitive prymitive deleted the discoveryLocks branch May 19, 2025 09:21
@machine424
Member

I saw a test fail on this PR, and now it has also failed on main after the merge: https://github.com/prometheus/prometheus/actions/runs/15108680391/job/42462961125
I also found the failing run on the PR: https://github.com/prometheus/prometheus/actions/runs/14931598411

Not sure if it's directly related to the change, maybe the test needs to be adjusted...

@prymitive
Contributor Author

Every time I open a PR or push to an existing one there's a random test that fails, so I doubt this failure is related to my change.

@machine424
Member

machine424 commented May 19, 2025

Every time I open a PR or push to an existing one there's a random test that fails, so I doubt this failure is related to my change.

Don't hesitate to open an issue for them or update an existing one, they get fixed eventually.
First time I've encountered it personally; let's see. I opened #16615
