
ATC can end up with a bunch of stuck open connections #1734

@vito

Description

On our large-scale instance we observe occasional slowdowns where each ATC has hit its 64 open connections limit.
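
(For reference, the "64 open connections" ceiling is the sort of client-side cap configured on database/sql's connection pool. The snippet below is only an illustration of that mechanism, with a made-up DSN and driver choice, not ATC's actual wiring.)

    package main

    import (
        "database/sql"
        "log"

        _ "github.com/lib/pq" // driver choice here is illustrative
    )

    func main() {
        db, err := sql.Open("postgres", "postgres://127.0.0.1/atc?sslmode=disable")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        // Once 64 connections are checked out, further queries and Begins
        // block until one is returned to the pool, which is the state
        // described above.
        db.SetMaxOpenConns(64)
    }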

After doing some digging, we were able to track this down to (db.Pipeline).AcquireSchedulingLock.

Here's the digging, for posterity:

  1. Added /debug/connections to the debug server, which shows the stack trace of where every open connection was first created. (A rough sketch of that kind of bookkeeping follows this list.)

  2. Forced the issue. We happened to observe this when two very intensive jobs triggered, but it's more likely to be caused by a random network blip or a burst of pipeline scheduling all firing at once.

  3. While in this state, ran the following on an affected ATC:

    # sample the .Begin call sites of currently-open connections every half second
    while sleep 0.5; do
      curl -s http://127.0.0.1:8079/debug/connections | grep 'countingConn).Begin' >> conns.log;
    done
    

    This continually appends .Begin call sites to conns.log. Each line includes pointer values unique enough to tell how long an individual stack has kept its connection open: the more often a line repeats across samples, the longer that connection has been held. The following dumps the surrounding stacks for the ten most-repeated pointers that are still present:

    # rank pointers by how often they recur across samples, then show the
    # surrounding stack for the first ten that are still open
    i=0;
    for ptr in $(cat conns.log | sort | uniq -c | sort -nr | awk '{print $NF}'); do
      if [ "$i" = "10" ]; then
        break;
      fi;

      if curl -s http://127.0.0.1:8079/debug/connections | grep -C10 $ptr; then
        i=$(expr $i + 1);
      fi
    done
    

    These can then be correlated to /debug/pprof/goroutine?debug=2 by goroutine ID, to see what the goroutine that created the connection is now doing:

    # for each goroutine ID seen in the piped-in stacks, show what that
    # goroutine is doing right now
    ... | grep 'goroutine ' | while read _ goroutine _; do
      curl -s 'http://127.0.0.1:8079/debug/pprof/goroutine?debug=2' | grep -C20 "goroutine $goroutine ";
    done
    
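For step 1 above, here is a minimal sketch of the kind of bookkeeping a /debug/connections endpoint needs: wrap the connection's Begin, capture the caller's stack, and dump the recorded stacks over HTTP. The type and function names are invented for illustration and this is not ATC's actual countingConn; clearing entries on Commit/Rollback is also omitted for brevity.

    package debugconns

    import (
        "database/sql"
        "fmt"
        "net/http"
        "runtime/debug"
        "sync"
    )

    // trackingConn is a rough stand-in for a connection wrapper whose Begin
    // records where each open transaction's connection was claimed.
    type trackingConn struct {
        *sql.DB

        mu     sync.Mutex
        stacks map[*sql.Tx][]byte
    }

    func newTrackingConn(db *sql.DB) *trackingConn {
        return &trackingConn{DB: db, stacks: map[*sql.Tx][]byte{}}
    }

    func (c *trackingConn) Begin() (*sql.Tx, error) {
        tx, err := c.DB.Begin()
        if err != nil {
            return nil, err
        }

        c.mu.Lock()
        c.stacks[tx] = debug.Stack() // remember who opened this connection
        c.mu.Unlock()

        return tx, nil
    }

    // ServeHTTP dumps the recorded stacks: a /debug/connections-style view.
    func (c *trackingConn) ServeHTTP(w http.ResponseWriter, r *http.Request) {
        c.mu.Lock()
        defer c.mu.Unlock()

        for tx, stack := range c.stacks {
            fmt.Fprintf(w, "tx %p opened at:\n%s\n\n", tx, stack)
        }
    }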

This led us to discover that a great many (db.Pipeline).AcquireSchedulingLock call stacks were hung acquiring a local *sync.Mutex in (*lock).Acquire, which was held by another goroutine that was itself acquiring or releasing a DB lock. Each of these goroutines was blocked in (*sync.Mutex).Lock with the transaction it had opened in (db.Pipeline).AcquireSchedulingLock still held open, so with scheduling firing every 10 seconds for each pipeline, the ATC quickly reached its client-side connection limit.
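
To make the failure mode concrete, here is a distilled and entirely hypothetical version of that shape (not the actual ATC code): a transaction is opened, and the goroutine then blocks on an in-process mutex shared by all lock traffic, pinning a pooled connection the whole time.

    package schedlock

    import (
        "database/sql"
        "sync"
    )

    // Hypothetical distillation of the failure mode described above.
    func acquireSchedulingLock(db *sql.DB, locks *sync.Mutex) error {
        tx, err := db.Begin() // a pooled connection is checked out here
        if err != nil {
            return err
        }

        // BAD: blocking on an in-process mutex while the transaction is open.
        // If whoever holds `locks` is itself waiting on the database, every
        // caller parks here with a connection pinned; with scheduling firing
        // every 10 seconds per pipeline, a 64-connection cap is hit fast.
        locks.Lock()
        defer locks.Unlock()

        // ... acquire the DB lock, do the scheduling work ...

        return tx.Commit()
    }

The direction lesson 1 below points at is to not have a transaction open across any such wait, for example by taking the in-process lock before calling Begin.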

So, a few things learned from this:

  1. Obviously, a transaction should not be held open while doing any other I/O (or blocking on other goroutines' work) outside of the transaction. Even though the locks run against their own dedicated connection (and so don't count towards the client-side limit), the transactions held open while waiting on them were enough to hit the limit.
  2. Having a single *sync.Mutex underlying all lock acquisition, even for unrelated locks, will be problematic in the future. If one lock acquisition hangs, all of them will.
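
A small sketch of the alternative the second lesson suggests, with invented names: keep one mutex per lock ID instead of a single process-wide mutex, so that a stuck acquisition only stalls other acquisitions of that same lock.

    package lockmap

    import "sync"

    // lockRegistry hands out one mutex per lock ID; only the map lookup is
    // serialized globally, while the per-lock mutex is what gets held across
    // a (potentially slow) acquisition.
    type lockRegistry struct {
        mu    sync.Mutex
        locks map[string]*sync.Mutex
    }

    func newLockRegistry() *lockRegistry {
        return &lockRegistry{locks: map[string]*sync.Mutex{}}
    }

    // forID returns the mutex guarding a single named lock, creating it lazily.
    func (r *lockRegistry) forID(id string) *sync.Mutex {
        r.mu.Lock()
        defer r.mu.Unlock()

        m, ok := r.locks[id]
        if !ok {
            m = new(sync.Mutex)
            r.locks[id] = m
        }
        return m
    }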
