
ATC can end up with a bunch of stuck open connections #1734

@vito

Description

On our large-scale instance we observe occasional slowdowns where each ATC has hit its 64 open connections limit.
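
(For reference, the "64 open connections" ceiling is the sort of client-side cap configured on database/sql's connection pool. The snippet below is only an illustration of that mechanism, with a made-up DSN and driver choice, not ATC's actual wiring.)

    package main

    import (
        "database/sql"
        "log"

        _ "github.com/lib/pq" // driver choice here is illustrative
    )

    func main() {
        db, err := sql.Open("postgres", "postgres://127.0.0.1/atc?sslmode=disable")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        // Once 64 connections are checked out, further queries and Begins
        // block until one is returned to the pool, which is the state
        // described above.
        db.SetMaxOpenConns(64)
    }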

After doing some digging, we were able to track this down to (db.Pipeline).AcquireSchedulingLock.

Here's the digging, for posterity:

  1. Added /debug/connections to the debug server, which shows the stack trace of where every open connection was first created. (A rough sketch of that kind of bookkeeping follows this list.)

  2. Forced the issue. We happened to observe this when two very intensive jobs triggered, but it's more likely to be caused by a random network blip or a burst of pipeline scheduling all firing at once.

  3. While in this state, ran the following on an affected ATC:

    # sample the .Begin call sites of currently-open connections every half second
    while sleep 0.5; do
      curl -s http://127.0.0.1:8079/debug/connections | grep 'countingConn).Begin' >> conns.log;
    done
    

    This continually appends .Begin call sites to conns.log. Each line includes pointer values unique enough to tell how long an individual stack has kept its connection open: the more often a line repeats across samples, the longer that connection has been held. The following dumps the surrounding stacks for the ten most-repeated pointers that are still present:

    # rank pointers by how often they recur across samples, then show the
    # surrounding stack for the first ten that are still open
    i=0;
    for ptr in $(cat conns.log | sort | uniq -c | sort -nr | awk '{print $NF}'); do
      if [ "$i" = "10" ]; then
        break;
      fi;

      if curl -s http://127.0.0.1:8079/debug/connections | grep -C10 $ptr; then
        i=$(expr $i + 1);
      fi
    done
    

    These can then be correlated to /debug/pprof/goroutine?debug=2 by goroutine ID, to see what the goroutine that created the connection is now doing:

    # for each goroutine ID seen in the piped-in stacks, show what that
    # goroutine is doing right now
    ... | grep 'goroutine ' | while read _ goroutine _; do
      curl -s 'http://127.0.0.1:8079/debug/pprof/goroutine?debug=2' | grep -C20 "goroutine $goroutine ";
    done
    
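For step 1 above, here is a minimal sketch of the kind of bookkeeping a /debug/connections endpoint needs: wrap the connection's Begin, capture the caller's stack, and dump the recorded stacks over HTTP. The type and function names are invented for illustration and this is not ATC's actual countingConn; clearing entries on Commit/Rollback is also omitted for brevity.

    package debugconns

    import (
        "database/sql"
        "fmt"
        "net/http"
        "runtime/debug"
        "sync"
    )

    // trackingConn is a rough stand-in for a connection wrapper whose Begin
    // records where each open transaction's connection was claimed.
    type trackingConn struct {
        *sql.DB

        mu     sync.Mutex
        stacks map[*sql.Tx][]byte
    }

    func newTrackingConn(db *sql.DB) *trackingConn {
        return &trackingConn{DB: db, stacks: map[*sql.Tx][]byte{}}
    }

    func (c *trackingConn) Begin() (*sql.Tx, error) {
        tx, err := c.DB.Begin()
        if err != nil {
            return nil, err
        }

        c.mu.Lock()
        c.stacks[tx] = debug.Stack() // remember who opened this connection
        c.mu.Unlock()

        return tx, nil
    }

    // ServeHTTP dumps the recorded stacks: a /debug/connections-style view.
    func (c *trackingConn) ServeHTTP(w http.ResponseWriter, r *http.Request) {
        c.mu.Lock()
        defer c.mu.Unlock()

        for tx, stack := range c.stacks {
            fmt.Fprintf(w, "tx %p opened at:\n%s\n\n", tx, stack)
        }
    }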

This led us to discover that a great many (db.Pipeline).AcquireSchedulingLock call stacks were hung acquiring a local *sync.Mutex in (*lock).Acquire, which was held by another goroutine that was itself acquiring or releasing a DB lock. Each of these goroutines was blocked in (*sync.Mutex).Lock with the transaction it had opened in (db.Pipeline).AcquireSchedulingLock still held open, so with scheduling firing every 10 seconds for each pipeline, the ATC quickly reached its client-side connection limit.
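
To make the failure mode concrete, here is a distilled and entirely hypothetical version of that shape (not the actual ATC code): a transaction is opened, and the goroutine then blocks on an in-process mutex shared by all lock traffic, pinning a pooled connection the whole time.

    package schedlock

    import (
        "database/sql"
        "sync"
    )

    // Hypothetical distillation of the failure mode described above.
    func acquireSchedulingLock(db *sql.DB, locks *sync.Mutex) error {
        tx, err := db.Begin() // a pooled connection is checked out here
        if err != nil {
            return err
        }

        // BAD: blocking on an in-process mutex while the transaction is open.
        // If whoever holds `locks` is itself waiting on the database, every
        // caller parks here with a connection pinned; with scheduling firing
        // every 10 seconds per pipeline, a 64-connection cap is hit fast.
        locks.Lock()
        defer locks.Unlock()

        // ... acquire the DB lock, do the scheduling work ...

        return tx.Commit()
    }

The direction lesson 1 below points at is to not have a transaction open across any such wait, for example by taking the in-process lock before calling Begin.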

So, a few things learned from this:

  1. Obviously, a transaction should not be held open while doing any other I/O (or blocking on other goroutines' work) outside of the transaction. Even though the locks run against their own dedicated connection (and so don't count towards the client-side limit), the transactions held open while waiting on them were enough to hit the limit.
  2. Having a single *sync.Mutex underlying all lock acquisition, even for unrelated locks, will be problematic in the future. If one lock acquisition hangs, all of them will.
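
A small sketch of the alternative the second lesson suggests, with invented names: keep one mutex per lock ID instead of a single process-wide mutex, so that a stuck acquisition only stalls other acquisitions of that same lock.

    package lockmap

    import "sync"

    // lockRegistry hands out one mutex per lock ID; only the map lookup is
    // serialized globally, while the per-lock mutex is what gets held across
    // a (potentially slow) acquisition.
    type lockRegistry struct {
        mu    sync.Mutex
        locks map[string]*sync.Mutex
    }

    func newLockRegistry() *lockRegistry {
        return &lockRegistry{locks: map[string]*sync.Mutex{}}
    }

    // forID returns the mutex guarding a single named lock, creating it lazily.
    func (r *lockRegistry) forID(id string) *sync.Mutex {
        r.mu.Lock()
        defer r.mu.Unlock()

        m, ok := r.locks[id]
        if !ok {
            m = new(sync.Mutex)
            r.locks[id] = m
        }
        return m
    }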
