Description
On our large-scale instance we observe occasional slowdowns where each ATC has hit its limit of 64 open connections. After doing some digging, we were able to track this down to `(db.Pipeline).AcquireSchedulingLock`.
Here's the digging, for posterity:
- Added `/debug/connections` to the debug server, which shows the stack trace of where all open connections were first created.
- Force the issue. We happened to observe this when two very intensive jobs triggered, but it's more likely that it was a random network blip or a bunch of pipeline scheduling all triggering at once.
- While in this state, run the following on an affected ATC:

  ```shell
  while sleep 0.5; do curl -s http://127.0.0.1:8079/debug/connections | grep 'countingConn).Begin' >> conns.log; done
  ```

  This will continually emit `.Begin` call sites to `conns.log` over time. Each line includes pointers unique enough to identify how long an individual stack has kept its connection open, by looking for repeated lines:

  ```shell
  i=0
  for ptr in $(cat conns.log | sort | uniq -c | sort -nr | awk '{print $NF}'); do
    if [ "$i" = "10" ]; then break; fi
    if curl -s http://127.0.0.1:8079/debug/connections | grep -C10 $ptr; then
      i=$(expr $i + 1)
    fi
  done
  ```
  These can then be correlated to `/debug/pprof/goroutine?debug=2` by goroutine ID, to see what the goroutine that created the connection is now doing:

  ```shell
  ... | grep 'goroutine ' | while read _ goroutine _; do
    curl -s 'http://127.0.0.1:8079/debug/pprof/goroutine?debug=2' | grep -C20 "goroutine $goroutine "
  done
  ```
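The connection tracking in the first step can be sketched as follows. This is an illustrative reconstruction, not Concourse's actual `countingConn`: the names `trackedConn`, `begin`, and `dump` are made up, but the idea is the same — record a stack trace whenever a connection is opened, drop it on close, and expose the live set for debugging.

```go
// Sketch: track the creation stack of every open "connection" so an
// endpoint like /debug/connections can show who opened what.
// All names here are hypothetical.
package main

import (
	"fmt"
	"runtime"
	"sync"
)

type trackedConn struct{ id int }

var (
	mu        sync.Mutex
	openConns = map[*trackedConn][]byte{} // conn -> stack at creation
)

// begin opens a tracked "connection", recording the caller's stack.
func begin(id int) *trackedConn {
	c := &trackedConn{id: id}
	buf := make([]byte, 4096)
	n := runtime.Stack(buf, false)
	mu.Lock()
	openConns[c] = buf[:n]
	mu.Unlock()
	return c
}

// close removes the connection from the live set.
func (c *trackedConn) close() {
	mu.Lock()
	delete(openConns, c)
	mu.Unlock()
}

// dump prints each open connection's creation stack, in the spirit of
// the /debug/connections endpoint.
func dump() {
	mu.Lock()
	defer mu.Unlock()
	for c, stack := range openConns {
		fmt.Printf("conn %p (id %d):\n%s\n", c, c.id, stack)
	}
}

func main() {
	c := begin(1)
	dump()
	c.close()
}
```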
This led us to discover that many, many `(db.Pipeline).AcquireSchedulingLock` call stacks were hanging on acquiring a local `*sync.Mutex` in `(*lock).Acquire`, which was being held by another goroutine acquiring or releasing a DB lock. This `(*sync.Mutex).Lock` call was blocked while the transaction opened in `(db.Pipeline).AcquireSchedulingLock` was held open, causing the ATC to quickly reach its client-side connection limit, as scheduling fires every 10 seconds for each pipeline.
So, a few things learned from this:
- Obviously, a transaction should not be held open while doing any other I/O outside of the transaction. Even with locks running against their own dedicated connection (and thus not counting towards the client-side limit), it still led to hitting the limit.
- Having a single `*sync.Mutex` underlying all lock acquisition, even for unrelated locks, will be problematic in the future. If one lock acquisition hangs, all of them will.
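One way to address the second point is to give each named lock its own mutex, so a hung acquisition of one lock cannot serialize unrelated ones. A minimal sketch, with hypothetical names (`lockFactory`, `lockFor`) rather than Concourse's actual types:

```go
// Sketch: one mutex per lock name instead of a single process-wide
// mutex. The factory's own mutex guards only the map lookup and is
// never held across I/O.
package main

import (
	"fmt"
	"sync"
)

type lockFactory struct {
	mu    sync.Mutex             // guards the map only
	locks map[string]*sync.Mutex // one mutex per lock name
}

func newLockFactory() *lockFactory {
	return &lockFactory{locks: map[string]*sync.Mutex{}}
}

// lockFor returns the mutex dedicated to the given lock name,
// creating it on first use.
func (f *lockFactory) lockFor(name string) *sync.Mutex {
	f.mu.Lock()
	defer f.mu.Unlock()
	m, ok := f.locks[name]
	if !ok {
		m = &sync.Mutex{}
		f.locks[name] = m
	}
	return m
}

func main() {
	f := newLockFactory()
	// Holding one pipeline's lock does not block another's.
	f.lockFor("pipeline-a").Lock()
	f.lockFor("pipeline-b").Lock() // acquired immediately
	fmt.Println("unrelated locks acquired independently")
	f.lockFor("pipeline-b").Unlock()
	f.lockFor("pipeline-a").Unlock()
}
```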