Add a function to restore a snapshot #10373

jrockway · 2024-10-07T23:36:55Z

This adds the ability to restore the database state to that contained in a snapshot. After restoring, the snapshot is still restorable, because we re-upload the database dump and adjust the snapshot to show the new ID. This makes it look like snapshots contain themselves, which is a nice feature ;)

To restore cleanly, we can't rely on queries running against a default schema (public), it seems, so I updated the migrations code to be explicit about the schema the queries are in.

psql is used to restore the dump, because that's what the pg_dump documentation says to do. I thought about parsing the SQL and sending one statement at a time, but it seems like something that is likely to go wrong in weird ways. As a result of this, psql is available in //tools/postgres/psql now, which should be convenient.

We use pg_dump v17 to dump any version of postgres. For various reasons, these dumps can only be restored onto postgres 17. The snapshotting code has been adjusted to remove the v17-specific functionality so that we can take snapshots of v13 and restore onto v13.

Finally, I adjusted the snapshotting code to atomically create the pachyderm snapshot and the postgres snapshot. This involves sharing the transaction between Go and pg_dump which is possible (see pg_export_snapshot). We have to manually dump everything created in that transaction, though, which we now do. (The start READING the same MVCC snapshot, but can't see each other's writes because postgres doesn't have READ UNCOMMITTED isolation. Thus, we make a list of everything we wrote and add it to the end of the dump file ourselves.) This fixes MLDM-143.

github-actions · 2024-10-07T23:40:36Z

size-limit report 📦

Path	Size
Entry	93.13 KB (0%)
Vendor XL	800.86 KB (0%)
Chunks	292.51 KB (0%)
Everything	1.16 MB (0%)

codecov · 2024-10-07T23:57:42Z

Codecov Report

Attention: Patch coverage is 79.23077% with 54 lines in your changes missing coverage. Please review.

Project coverage is 62.26%. Comparing base (8b6e714) to head (fe76302).
Report is 1 commits behind head on master.

Files with missing lines	Patch %	Lines
src/internal/recovery/recovery.go	78.35%	42 Missing ⚠️
src/internal/storage/track/postgres_tracker.go	80.85%	9 Missing ⚠️
src/internal/storage/fileset/storage.go	76.92%	3 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master   #10373      +/-   ##
==========================================
+ Coverage   62.22%   62.26%   +0.04%     
==========================================
  Files        1232     1232              
  Lines       88407    88612     +205     
  Branches     1820     1820              
==========================================
+ Hits        55013    55177     +164     
- Misses      32543    32591      +48     
+ Partials      851      844       -7

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

brendoncarroll · 2024-10-14T16:54:52Z

src/internal/recovery/recovery_test.go

+	ctx := pctx.TestContext(t)
+	db := dockertestenv.NewMigratedTestDB(t, clusterstate.DesiredClusterState)
+	tracker := track.NewPostgresTracker(db)
+	storage := fileset.NewStorage(fileset.NewPostgresStore(db), tracker, chunk.NewStorage(kv.NewMemStore(), nil, db, tracker))


We have a test constructor for this.

fileset.NewTestStorage only works against an unmigrated database, and can only create tables used by the storage system. This test wants to write to the recovery.snapshots table, but there is no way we can create that in a way that doesn't conflict with NewTestStorage. (Applying migrations prevents fileset.NewTestStorage from working.)

I definitely get why storage tests don't want all those tables around, but for this "dump the entire database" test, we want the entire database.

(I also like kv.NewMemStore() better than NewFSStore, though not for any real reason.)

brendoncarroll · 2024-10-14T17:51:41Z

src/internal/recovery/recovery.go

-		return 0, errors.Wrap(err, "create snapshot row")
+	b.WriteRune('\n')
+	b.WriteString("COPY storage.chunksets (id) FROM stdin;\n")
+	fmt.Fprintf(&b, "%v\n", int64(chunksetID))


Shouldn't this be a log line?

No, this is the format of the database dump. We have to manually create it here because pg_dump can't see writes from this transaction even though its reads are in this transaction. The statement after COPY is basically a TSV file of the table content.

src/internal/recovery/recovery.go

brendoncarroll · 2024-10-14T21:27:09Z

I'm concerned about the dumpTrackerRefs and dumpTrackerObjects accessing storage tables which may need to change in future releases. That information is included in the database snapshot, and we should get it back for free when the backup is restored.

I thought the whole point of having ChunkSets was to encapsulate that logic behind an API that could remind stable while the storage representation continued to change.

At the very least, those function should be exported from the storage package, so we know that they are a commitment to the recovery system.

brendoncarroll

We need to sort out the cross-system database queries. see my other comment.

jrockway · 2024-10-18T02:46:53Z

@brendoncarroll I have moved the storage dumping functionality to the storage and tracker package. The reason for this is that pg_dump and Go start with the same database state (i.e. the same MVCC snapshot at the postgres level), but can't see each other's writes. Thus, anything new created by Go in this transaction has to be dumped by Go. We dump our newly-created chunkset row and newly-created snapshot row in this package, and dump the state of the tracker in the tracker package.

(Ideally we would dump the storage.chunksets row in the storage package, but it would be a real mess for other callers, so the mess is contained here instead. storage.chunksets is just an integer id.)

All this complexity is to avoid any new data being added to the system between creating the chunkset and creating the database dump. If data were added between those two points, we'd potentially restore to an unrecoverable state. So we compromise and create + dump in the same transaction, but because we're using two different programs to do the work (pg_dump and pachd), the best we can do is starting from the same postgres snapshot (https://www.postgresql.org/docs/current/sql-set-transaction.html). Unfortunately snapshots just represent the state of the database at tx start time, but can't see changes made since then. So pg_dump is dumping from the start of our WithTx call, and that involves creating snapshot, chunkset, and tracker rows; we then dump those manually after pg_dump is done (since we can read our own writes); thus we end up with a simulation of doing the snapshot creation + dump in one transaction, even though that's not quite what happens.

With all that in mind, I think this has been cleaned up significantly. It should be obvious when modifying the tracker what breaks. I have also made the test fail if chunksets don't actually work, so that should be another thing that catches un-snapshottable-with-this-code tracker changes.

dump and restore in a transaction run database migrations after restore test version compatibility; fix bug make snapshots atomic

…ng a public interface to dump them

jrockway requested review from Zhang-Muyang and brendoncarroll October 10, 2024 09:57

brendoncarroll reviewed Oct 14, 2024

View reviewed changes

src/internal/recovery/recovery.go Show resolved Hide resolved

brendoncarroll suggested changes Oct 14, 2024

View reviewed changes

jrockway force-pushed the jonathan/mldm-107-restore-machinery branch from 335f732 to 81c72f7 Compare October 18, 2024 02:35

robert-uhl requested a review from brendoncarroll October 18, 2024 16:45

brendoncarroll approved these changes Oct 18, 2024

View reviewed changes

jrockway added 6 commits October 18, 2024 18:54

add function to restore a snapshot

6b11604

dump and restore in a transaction run database migrations after restore test version compatibility; fix bug make snapshots atomic

move tracker dumper to the tracker package

f50b447

adjust recovery package to avoid knowing tracker details, instead usi…

5f68498

…ng a public interface to dump them

clarify createSnapshotRow

62b1c83

validate the new fileset, not the old one

471bfcd

use ExecContext

fe76302

jrockway force-pushed the jonathan/mldm-107-restore-machinery branch from cc061f9 to fe76302 Compare October 18, 2024 18:58

jrockway merged commit ade7122 into master Oct 18, 2024
26 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add a function to restore a snapshot #10373

Add a function to restore a snapshot #10373

Uh oh!

jrockway commented Oct 7, 2024 •

edited

Loading

Uh oh!

github-actions bot commented Oct 7, 2024 •

edited

Loading

Uh oh!

codecov bot commented Oct 7, 2024 •

edited

Loading

Uh oh!

brendoncarroll Oct 14, 2024

Uh oh!

jrockway Oct 18, 2024

Uh oh!

brendoncarroll Oct 14, 2024

Uh oh!

jrockway Oct 15, 2024

Uh oh!

Uh oh!

brendoncarroll commented Oct 14, 2024 •

edited

Loading

Uh oh!

brendoncarroll left a comment

Uh oh!

jrockway commented Oct 18, 2024 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Add a function to restore a snapshot #10373

Add a function to restore a snapshot #10373

Uh oh!

Conversation

jrockway commented Oct 7, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Oct 7, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

size-limit report 📦

Uh oh!

codecov bot commented Oct 7, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

brendoncarroll Oct 14, 2024

Choose a reason for hiding this comment

Uh oh!

jrockway Oct 18, 2024

Choose a reason for hiding this comment

Uh oh!

brendoncarroll Oct 14, 2024

Choose a reason for hiding this comment

Uh oh!

jrockway Oct 15, 2024

Choose a reason for hiding this comment

Uh oh!

Uh oh!

brendoncarroll commented Oct 14, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

brendoncarroll left a comment

Choose a reason for hiding this comment

Uh oh!

jrockway commented Oct 18, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jrockway commented Oct 7, 2024 •

edited

Loading

github-actions bot commented Oct 7, 2024 •

edited

Loading

codecov bot commented Oct 7, 2024 •

edited

Loading

brendoncarroll commented Oct 14, 2024 •

edited

Loading

jrockway commented Oct 18, 2024 •

edited

Loading