Make backfill batch selection exclude rows inserted or updated after backfill start #648
`RawConn` returns the underlying `*sql.DB` for the connection.
Change the method signature to accept `sql.TxOptions` as the second argument. This allows transactions to run at isolation levels other than the default `READ COMMITTED` level.
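For illustration, the two changes might be used together along these lines. Only `RawConn` returning `*sql.DB` and the `sql.TxOptions` argument are taken from the description above; the PR text does not name the method whose signature changed, so the helper below is a hypothetical stand-in.

```go
package example

import (
	"context"
	"database/sql"
)

// beginRepeatableRead is a hypothetical helper showing the shape of the
// change: the caller supplies sql.TxOptions, so a transaction can run at an
// isolation level other than the default READ COMMITTED.
func beginRepeatableRead(ctx context.Context, db *sql.DB) (*sql.Tx, error) {
	return db.BeginTx(ctx, &sql.TxOptions{Isolation: sql.LevelRepeatableRead})
}
```

Passing `sql.TxOptions` through rather than hard-coding the isolation level is what allows the backfill described below to open a `REPEATABLE READ` transaction against the `*sql.DB` exposed by `RawConn`.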
Allow clients to define the schema where `pgroll` stores its internal state.
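As a rough illustration only, a client selecting a non-default state schema might look like the sketch below; the import path, constructor name, and argument order are assumptions made for illustration, not taken from the description above.

```go
package example

import (
	"context"

	// NOTE: import path and constructor signature are assumptions; consult
	// the pgroll source for the real API.
	"github.com/xataio/pgroll/pkg/state"
)

// newStateInCustomSchema sketches a client storing pgroll's internal state
// in a schema other than the default "pgroll" schema.
func newStateInCustomSchema(ctx context.Context, pgURL string) (*state.State, error) {
	return state.New(ctx, pgURL, "my_pgroll_state")
}
```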
Add tests for:
* Generating SQL to create a batch table
* Generating SQL to select a batch into a batch table
* Generating SQL to update a batch
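For a sense of what those tests cover, the three generated statements might take roughly the following shapes; all identifiers here are placeholders and the exact SQL pgroll generates differs.

```go
package backfill

// Rough shapes of the three generated statements covered by the new tests.
// The identifiers ("pgroll", "batch_table", "users", "id") are placeholders,
// not the literal SQL pgroll emits.
const (
	// Create a batch table keyed on the target table's primary key.
	createBatchTableSQL = `CREATE TABLE IF NOT EXISTS "pgroll"."batch_table" ("id" integer PRIMARY KEY)`

	// Select the next batch of primary keys into the batch table.
	selectBatchSQL = `INSERT INTO "pgroll"."batch_table" ("id")
SELECT "id" FROM "public"."users" WHERE "id" > $1 ORDER BY "id" LIMIT $2`

	// No-op update of the batch to fire the ON UPDATE trigger.
	updateBatchSQL = `UPDATE "public"."users" u SET "id" = u."id"
FROM "pgroll"."batch_table" b WHERE u."id" = b."id"`
)
```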
Backfill only rows present at backfill start. This is a second approach to solving #583; the first one is #634.

Change the backfill algorithm to backfill only rows that were present at the start of the backfill process. Rows inserted or updated after backfill start will be backfilled by the already-installed `up` trigger and do not need to be backfilled by the backfill process (although doing so is safe from a correctness perspective). Skipping rows inserted or updated after backfill start ensures that the backfill process is guaranteed to terminate, even if a large number of rows are inserted or updated while the backfill runs.

The new algorithm works as follows:

* Create a 'batch table' in the `pgroll` schema. The batch table stores the primary key values of each batch of rows to be updated during the backfill process. The table holds at most `batchSize` rows at a time and is `TRUNCATE`d at the start of each batch.
* Begin a `REPEATABLE READ` transaction and take a transaction snapshot. This transaction remains open for the duration of the backfill so that other transactions can use the snapshot.
* For each batch:
  1. The primary key values of the rows in the batch are `INSERT INTO` the batch table. The transaction that performs the `INSERT INTO` uses the snapshot taken in the previous step, so that only rows present at the start of the backfill are visible.
  2. The batch of rows is updated in the table being backfilled by setting their primary keys to themselves (a no-op update). This update causes any `ON UPDATE` trigger to fire for the affected rows.

The 'batch table' is necessary as a temporary store of the primary key values of the rows to be updated because the per-batch transaction that selects those rows runs at `REPEATABLE READ` (by necessity, in order to use the transaction snapshot). Trying to update the selected batch of rows in the same transaction would fail with serialization errors whenever a row in the batch had been updated by a transaction committed after the snapshot was taken. Such serialization errors can safely be ignored, as any rows updated after the snapshot was taken have already been backfilled by the `up` trigger. To avoid the serialization errors, the batch of rows to be updated is therefore written to the 'batch table', from where the batch can be `UPDATE`d in a `READ COMMITTED` transaction that cannot encounter serialization errors.

The largest drawback of this approach is that it requires holding a transaction open for the duration of the backfill. Long-running transactions can cause bloat in the database by preventing vacuuming of dead rows.
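A condensed sketch of the loop described above, assuming Postgres and Go's `database/sql`. The target table (`"public"."users"` with integer primary key `"id"`) and the batch table name are placeholders and error handling is abbreviated, so this is an illustration of the technique rather than the PR's implementation.

```go
package backfill

import (
	"context"
	"database/sql"
	"fmt"
)

// backfillSketch illustrates the batching algorithm: a long-lived
// REPEATABLE READ transaction exports a snapshot; each batch selects primary
// keys under that snapshot into a batch table, then no-op updates those rows
// in a default READ COMMITTED transaction so the `up` trigger fires.
func backfillSketch(ctx context.Context, db *sql.DB, batchSize int) error {
	// Open a REPEATABLE READ transaction and export its snapshot. It stays
	// open for the whole backfill so each per-batch transaction can import
	// the snapshot.
	snapTx, err := db.BeginTx(ctx, &sql.TxOptions{Isolation: sql.LevelRepeatableRead})
	if err != nil {
		return err
	}
	defer snapTx.Rollback()

	var snapshotID string
	if err := snapTx.QueryRowContext(ctx, "SELECT pg_export_snapshot()").Scan(&snapshotID); err != nil {
		return err
	}

	lastID := 0
	for {
		// Empty the batch table before selecting the next batch.
		if _, err := db.ExecContext(ctx, `TRUNCATE "pgroll"."batch_table"`); err != nil {
			return err
		}

		// Select the next batch of primary keys under the exported snapshot,
		// so that only rows present at backfill start are visible.
		selTx, err := db.BeginTx(ctx, &sql.TxOptions{Isolation: sql.LevelRepeatableRead})
		if err != nil {
			return err
		}
		if _, err := selTx.ExecContext(ctx, fmt.Sprintf("SET TRANSACTION SNAPSHOT '%s'", snapshotID)); err != nil {
			selTx.Rollback()
			return err
		}
		if _, err := selTx.ExecContext(ctx, `
			INSERT INTO "pgroll"."batch_table" ("id")
			SELECT "id" FROM "public"."users"
			WHERE "id" > $1 ORDER BY "id" LIMIT $2`, lastID, batchSize); err != nil {
			selTx.Rollback()
			return err
		}
		var maxID sql.NullInt64
		if err := selTx.QueryRowContext(ctx, `SELECT max("id") FROM "pgroll"."batch_table"`).Scan(&maxID); err != nil {
			selTx.Rollback()
			return err
		}
		if err := selTx.Commit(); err != nil {
			return err
		}
		if !maxID.Valid {
			return nil // no rows left that were present at backfill start
		}

		// No-op update the batch in a default READ COMMITTED transaction;
		// this fires the ON UPDATE trigger for each row without risking
		// serialization errors.
		if _, err := db.ExecContext(ctx, `
			UPDATE "public"."users" u SET "id" = u."id"
			FROM "pgroll"."batch_table" b WHERE u."id" = b."id"`); err != nil {
			return err
		}
		lastID = int(maxID.Int64)
	}
}
```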
Closing in favor of #652.
andrew-farries added a commit that referenced this pull request on Feb 18, 2025:
…backfill start (#652)

Backfill only rows present at backfill start. This is the third approach to solving #583. The previous two are:
* #634
* #648

This is the most direct approach to solving the problem. At the same time as the up/down triggers are created to perform a backfill, a `_pgroll_needs_backfill` column is also created on the table to be backfilled. The column has a `DEFAULT` of `true`; the constant default ensures that this extra column can be added quickly, without a lengthy `ACCESS EXCLUSIVE` lock. The column is removed when the operation is rolled back or completed.

The up/down triggers are modified to set `_pgroll_needs_backfill` to `false` whenever they update a row. The backfill itself is updated to select only rows having `_pgroll_needs_backfill` set to `true` - this ensures that only rows created before the triggers were installed are updated by the backfill. The backfill process still needs to *read* every row in the table, including those inserted or updated after backfill start, but only rows created before backfill start will be updated.

The main disadvantage of this approach is that backfill now requires an extra column to be created on the target table.
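For illustration, the `_pgroll_needs_backfill` mechanism might translate into SQL along the lines below; the exact DDL and trigger bodies pgroll generates will differ, and the identifiers are placeholders.

```go
package backfill

// Rough SQL shapes behind the approach adopted in #652. These are
// illustrative placeholders, not the statements pgroll actually generates.
const (
	// Added alongside the up/down triggers. The constant DEFAULT means the
	// column can be added without a table rewrite or a lengthy
	// ACCESS EXCLUSIVE lock.
	addNeedsBackfillColumnSQL = `ALTER TABLE "public"."users"
ADD COLUMN "_pgroll_needs_backfill" boolean DEFAULT true`

	// Fragment inside the trigger function body: any row written through the
	// trigger no longer needs backfilling.
	triggerFragmentSQL = `NEW."_pgroll_needs_backfill" := false;`

	// The backfill selects only rows created before the triggers were
	// installed.
	selectBackfillBatchSQL = `SELECT "id" FROM "public"."users"
WHERE "_pgroll_needs_backfill" = true ORDER BY "id" LIMIT $1`
)
```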