Skip to content

Conversation

taylorsilva
Copy link
Member

@taylorsilva taylorsilva commented Feb 18, 2025

Changes proposed by this PR

This PR removes all usage of lib/pq and replaces it with pgx. Everything is still wrapped by the database/sql interface which made the transition mostly easy. Further work could be done to move to the pgx interface which would provide significant performance gains when it comes to scanning data from Postgresql into Go structs.

What we do get with switching to pgx right now is support for newer Postgres features. The recent example of this was the addition of binary_params which ended up not really working with lib/pq.

Notes to reviewer

The hardest part of this transition was replacing lib/pq's Listener. pgx did not come with a similar interface and only provides a single blocking function WaitForNotification(). This function alone does not provide the same concurrent functionality of lib/pq's Listener interface.

I had to implement the Listener interface and it's existing behaviour using pgx. This work went through two iterations, all preserved in this PR.

Iteration #1: Using sync.Mutex

I got to a working version of the Listener that would pass all units tests. Running the testflight suite would fail though. My implementation had two locks and eventually two goroutines would deadlock with each other. This resulted in builds appearing to run but never actually running. This deadlocking and trying to manage two locks is why I eventually gave up using Locks.

Iteration #2: Channels; Share by communicating

The current implementation uses channels to control who can use the single connection object. Once I had wrapped my head around channels (again) this proved to be a much more straight forward implementation and easier to reason about. It's working, passing all unit and integration tests.

Release Note

  • Replace lib/pq with pgx as the Postgresql driver.
    • PgBouncer users: The Pgx driver docs state that its out-of-the-box configuration does not support PgBouncer, but recent discussion indicates that may not be the case if you're using PgBouncer >1.21.0. The recent 1.24.0 release also says prepared statement support is on by default, so this may be a non-issue if you're on the most recent version of PgBouncer.

analytically and others added 30 commits February 7, 2025 11:13
there's no need to cast the int array into a pg type array. Seems pgx
does the work for us!

Signed-off-by: Taylor Silva <dev@taydev.net>
had to create our own implementation of pq's listener. Thankfully we had
that wrapped in an interface already.

While digging into the pq implementation I found we couldn't implement
things in the exact same way they did. They actually received
notifications by using a sort of "escape hatch" they built into their
driver. They had a func that would read every single bit of data sent to
the driver. It would then check the type of the data going through the
driver and if it was a "Notification" then it would read the raw data
from a bufffer in the driver. We cannot do the same thing with the Pgx
driver.

I'm not 100% sure my implementation is good. It's passing most tests,
but I'm not sure how robust it is. I have no clue if pgx will recover
from connection errors for us for example.

Signed-off-by: Taylor Silva <dev@taydev.net>
This interface is embedded in the LogConn struct. Because we expect the
interface to have a method named "Conn" this resulted in a naming
conflict when the interface is embedded because the method name now
conflicts with the embedded name of the interface... yeah that's
confusing to explain. Anyways, had to rename Conn so LogConn could
satisfy the Conn interface, that's the gist of this commit.

Signed-off-by: Taylor Silva <dev@taydev.net>
Signed-off-by: Taylor Silva <dev@taydev.net>
Signed-off-by: Taylor Silva <dev@taydev.net>
Signed-off-by: Taylor Silva <dev@taydev.net>
Signed-off-by: Taylor Silva <dev@taydev.net>
and further decrease sleep time

Signed-off-by: Taylor Silva <dev@taydev.net>
Signed-off-by: Taylor Silva <dev@taydev.net>
Signed-off-by: Taylor Silva <dev@taydev.net>
using a pool so we can easily reset the connection if needed. the lib/pq
listener had this behaviour. We're just trying to replicate it here.
Using exponential backoff which will timeout after 15mins. We use this
package in other places in our codebase.

Signed-off-by: Taylor Silva <dev@taydev.net>
Signed-off-by: Taylor Silva <dev@taydev.net>
Signed-off-by: Taylor Silva <dev@taydev.net>
in this case we had row.BuildID with a type of int. The column we were
trying to insert that into was of type "text" in postgres. We have to
convert the int to a string first before passing it into the query.

Signed-off-by: Taylor Silva <dev@taydev.net>
WOW is the pgx driver ANAL about types when it comes to arrays.

So in postgres, if `array_agg()` returns nothing, you get back `{NULL}`.
So this _should_ cast just fine to the sql.NullString type from the Go
sql package. Pgx cannot figure out how to cast to this type at all, you
just get errors like this:

sql: Scan error on column index 3, name "passed": cannot convert to sql.Scanner: cannot find registered type for *[]sql.NullString

and initially I was getting this error:

sql: Scan error on column index 3, name "passed": unsupported Scan, storing driver.Value type string into type *[]sql.NullString

In that second error I found it weird that it said it was trying to turn
a type `string` into type `*[]sql.NullString`. I kept playing around
with the types and using dlv to debug. I eventually figured out pgx
would safely scan into a slice of type `[]*string`, so an array that
contains pointers to strings, which is a VERY annoying array to use!

I eventually figured out this is because `array_agg()` returns `{NULL}`
when it has no results. I figured out that by wrapping `array_agg()`
with `array_remove(array_agg(), NULL)` I could then get pgx to cast to a
normal `[]string` slice.

A bunch of tests still failed a this point because `Passed` on the
various structs would be non-nil. This is why there are length checks on
the slices and we set the var to `nil` if the length is zero before we
pass it to the returning struct.

I think this is actually a bit of a bug in pgx. This workaround is not
bad, but was VERY annoying to figure out, especially when it worked just
fine with lib/pq.

Signed-off-by: Taylor Silva <dev@taydev.net>
Signed-off-by: Taylor Silva <dev@taydev.net>
Signed-off-by: Taylor Silva <dev@taydev.net>
Signed-off-by: Taylor Silva <dev@taydev.net>
Signed-off-by: Taylor Silva <dev@taydev.net>
also learned that pgx does not support sending multiple commands in a
single statement. You have to send it in two separate statements:

jackc/pgx#1090

Signed-off-by: Taylor Silva <dev@taydev.net>
It seems the pgx driver gives up on the initial connection very quickly.
Guessing lib/pq made some re-attempts which is why we never saw this
behaviour previously.

Signed-off-by: Mathias Bogaert <mathias.bogaert@gmail.com>
Co-authored--by: Taylor Silva <dev@taydev.net>
Disables fsync, full_page_writes and synchronous_commit for
Postgres in CI environments. This greatly improves performance
by removing disk durability guarantees which aren't needed
in ephemeral test environments.
Sets random_page_cost=1.1 which tells PostgreSQL's query
planner that random disk access is relatively fast, as is
typical with SSDs. This encourages the planner to consider
index scans over sequential scans when appropriate.
Sets random_page_cost=1.1 which tells PostgreSQL's query
planner that random disk access is relatively fast, as is
typical with SSDs. This encourages the planner to consider
index scans over sequential scans when appropriate.
…tion in PgxListener

With the locks and one channel, we kept ending up in a deadlocking
situation where one goroutine had one lock and another goroutine had
another lock OR a goroutine would get stuck waiting for a message from
the opsDone channel when there was zero chance of any message being
sent from another goroutine. Deadlock!

Definitely due to bad coding/design on my part. I couldn't figure out a
nice way to make the two locking thing work and not eventually deadlock.
Decided to go back to figuring out how to do the whole thing with
channels, which I had tried earlier but failed at.

I re-read Effective Go's piece about concurrency and "share by
communicating": https://go.dev/doc/effective_go#concurrency

> Do not communicate by sharing memory; instead, share memory by communicating.

It got me in the right mindset and I was able to put together the setup
below. There is now zero chance that someone will be blocked sending or
receiving on a channel now.

And we have less vars now! We went from two shared locks and one shared
channel, down to one shared channel, and a second private channel just
for listenerLoop() to keep track of its stuff.

I think there is some potential to refactor listenerLoop() here.

I would love to add a unit test here that could reproduce the race
condition stuff as well, but not sure the best way to achieve that.

Regardless, all the db tests pass and testflight is passing now too,
which is the test suite which originally caught this error.

Signed-off-by: Taylor Silva <dev@taydev.net>
there was a race condition here where we would try listening for
notifications with a nil conn. Had to make sure the check on the
connection was AFTER we were given back control from the comms channel.
Also made it explicit that l.conn is set to nil during close() even
though it appears Release() does this for us. This is to avoid any race
conditions with Release()

Signed-off-by: Taylor Silva <dev@taydev.net>
@taylorsilva taylorsilva requested a review from a team as a code owner February 18, 2025 00:58
@taylorsilva taylorsilva merged commit 996abf9 into master Feb 18, 2025
11 checks passed
@taylorsilva taylorsilva deleted the pgx-migration branch February 18, 2025 15:18
taylorsilva added a commit to concourse/flag that referenced this pull request Feb 18, 2025
We've changed the driver we're using upstream from lib/pq to Pgx.

concourse/concourse#9066

It seems this setting was only needed for folks using PgBouncer and was
based on info from a 2019 blog post:

- concourse/concourse#6936
    - https://blog.bullgare.com/2019/06/pgbouncer-and-prepared-statements/

For PgBouncer to work with pgx it was previously suggested that one
change the Pgx ExecMode to the SimpleProtocol. Recent pgx user reports
indicate that >1.21.0 of PgBouncer should work without any config
changes though: jackc/pgx#1784

The most recent 1.24.0 PgBouncer release also indicates that Prepared
statements are now supported: https://www.pgbouncer.org/2025/01/pgbouncer-1.24.0

Therefore I don't think we need to change any configuration at the sql
driver level now. We definitely don't need the BinaryParameters flag
anymore since it was specific to lib/pq and doesn't work at all with
pgx. Trying to set this flag with pgx results in an error.

Signed-off-by: Taylor Silva <dev@taydev.net>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants