[ART] WAL serialization, automatic checkpointing, decoupling catalog and storage, index names #9339

taniabogatsch · 2023-10-13T12:48:23Z

This is an involved ART PR that mainly covers four topics. All these topics depend on each other, so I decided to bundle them in on PR—general related issues: https://github.com/duckdblabs/duckdb-internal/issues/199, https://github.com/duckdblabs/duckdb-internal/issues/102.

Serializing small indexes directly to the WAL. Before this PR, we would rebuild the ART when replaying the WAL. This led to hard-to-fix index binder issues and specialized, non-performant index creation code.
Index changes are now accounted for when estimating the automatic checkpoint threshold.
Decoupling index catalog entries and their physical storage by removing the index field from the IndexCatalogEntry. This is a step towards pluggable indexes and also towards custom catalogs.
Primary keys and foreign keys now have names.

WAL Serialization

We reuse most of our existing serialization code for serializing an ART index to the WAL. However, we do not need to write the allocator data onto partial blocks, as we are directly writing the bytes from their respective buffers. Therefore, we pass a flag to the existing index serialization functions. Then, we only return all data_ptr_ts of relevant buffers and their allocation_size. Related fixed issues: https://github.com/duckdblabs/duckdb-internal/issues/345, https://github.com/duckdblabs/duckdb-internal/issues/374

NOTE: I changed the index serialization to work with the WAL and checkpoint managers. Before this PR, we used the metadata manager for serializing non-block index data, which is not accessible from the WAL. Now, we use the Binary(De)Serializer directly. This also means adding custom serialization code to the respective JSON files for new storage information structs. To guarantee backward compatibility, I kept some of the old deserialization code.

NOTE: THIS BREAKS FORWARD STORAGE COMPATIBILITY. Unifying the index serialization code to use the Binary(De)Serializer avoids two separate implementations (WAL and checkpoint manager). If we do not want this, the implementation becomes more cumbersome, and I will implement two different code paths (@Maxxen?).

Account Indexes for Automatic Checkpoints

When estimating the size of all changes in a transaction, we now consider an estimated index size. This enables indexes to trigger automatic checkpoints (#6273).

Decoupling the Index Catalog Entry from the Physical Index Storage

I removed the physical index from the index catalog entry. Also, I added names to primary and foreign keys. Physical index objects now also contain their names. These names are the connection between a catalog entry and an index object. We need this connection in various places where we directly used the linked physical object in the catalog entry previously.

Additionally, I removed the IndexType enum, which is now simply a string. This allows adding arbitrary new index types through extensions without changing a duckdb enum. Related issue: https://github.com/duckdblabs/duckdb-internal/issues/321

Questions

src/include/duckdb/storage/serialization/create_info.json: I changed a type here and removed a field. Is this allowed?

Follow-Up PRs

We now use an initial_art_size variable in the DuckIndexEntry. I plan to remove this variable again in a follow-up PR that dives more into the unique names of indexes. In this follow-up PR, I plan to add a function to traverse the catalog and execute a callback function on all existing indexes (via DuckTableEntry). Two usages for this callback function are (1) checking for unique name conflicts with primary and possibly foreign keys during index creation and (2) retrieving a matching index from the undo buffer during WAL size estimation.

At the moment, conflicting names between CREATE INDEX and primary/foreign keys are not problematic, as we always deserialize primary/unique/foreign indexes before separately created indexes, and because it is not yet possible to set custom names for these constraints.

pdet

Very exciting work and it is cool to see that pluggable indexes are coming! :-)

I've added some comments, mostly NIPs, besides the index_types becoming strings and adding a lot of code to manage old checkpoints. You can see my comments about them around the code, but we can chat about it in more detail if you wish!

For the pluggable indexes, I think it would be cool to see a mock index being plugged as a test, but it's also a bit out of the scope of this PR.

Otherwise, I'm missing a couple more tests.

Can we have a test that loads a DuckDB with an index on an older storage? Different types of indexes, multiple indexes in the same db, and so on.
Can we add tests that check indexes with the same name can't be created? including scenarios with camelcases, invalid names, etc..

src/catalog/catalog_entry/index_catalog_entry.cpp

src/catalog/catalog_entry/duck_table_entry.cpp

src/execution/index/art/art.cpp

src/parser/transform/statement/transform_create_index.cpp

src/planner/expression_binder/index_binder.cpp

src/storage/checkpoint/table_data_writer.cpp

src/storage/checkpoint_manager.cpp

src/storage/wal_replay.cpp

Maxxen · 2023-10-17T10:02:57Z

Re: Index type strings - we need to have some sort of identifier for index types, and a string is probably the most flexible way to do it.

Ideally I think it makes most sense to have "IndexType" being a separate catalog entry, which an index "instance" can reference. Similarly to how we currently can create custom types, and then have tables/functions with columns/arguments referencing these types. Im not sure how these sort of intra-catalog references are made though, is this where you would use an OID? How does it work now when e.g. a table references an index? If that's just by name (i.e string) then I don't see the issue with index types being strings too.

Mytherin

Thanks for the PR! Looks great - some comments from my side:

data/storage/index_0-9-1.db

test/sql/index/art/storage/test_art_names.test

src/include/duckdb/storage/table_storage_info.hpp

src/include/duckdb/catalog/catalog_entry/index_catalog_entry.hpp

src/storage/checkpoint_manager.cpp

src/storage/index.cpp

src/storage/wal_replay.cpp

Mytherin · 2023-11-21T11:43:29Z

Thanks!

Merge pull request duckdb/duckdb#9339 from taniabogatsch/art-wal

taniabogatsch and others added 22 commits October 2, 2023 17:26

WIP index fixes and improvements

22c778f

compiles

70d2b36

fixes

3c60407

tidy fixes

4a18daa

backwards compatibility

9616d4d

backwards compatibility fixes

fe3e095

Merge branch 'feature' into art-wal

600a4c2

minor fixes

656d1f1

Windows include fixes

3aac118

more include clean-up

c371108

directly (de)serializing to/from the WAL

363d77e

auto checkpoint after index creation and insertions

59df0cb

Merge branch 'feature' into art-wal

a38b49a

Merge branch 'feature' into art-wal

21b8123

serializing bytes to the WAL directly

459591f

tidy checks

527f485

more missing exclude and build fixes

96483a7

more fixes

395a271

not changing FixedSizeBuffer during WAL serialization

d9493bf

track initial index size

cfee701

Merge branch 'feature' into art-wal

7611b81

tidying up

92f4d2e

taniabogatsch requested review from Maxxen and pdet October 13, 2023 12:48

taniabogatsch added 3 commits October 13, 2023 14:50

regenerate serialization

b32dcad

missing include

7179d09

Merge branch 'feature' into art-wal

78ca40e

pdet reviewed Oct 16, 2023

View reviewed changes

Merge branch 'feature' into art-wal

eea1015

github-actions bot marked this pull request as draft November 1, 2023 14:39

taniabogatsch added 3 commits November 6, 2023 13:50

missing include

95978bb

Merge branch 'feature' into art-wal

72541e9

Merge branch 'feature' into art-wal

5fe0406

taniabogatsch marked this pull request as ready for review November 8, 2023 10:17

Mytherin added the Ready For Review label Nov 8, 2023

Mytherin reviewed Nov 13, 2023

View reviewed changes

Mytherin added Changes Requested and removed Ready For Review labels Nov 13, 2023

taniabogatsch added 2 commits November 14, 2023 13:43

Merge branch 'feature' into art-wal

4327643

fix index unique names

8188cd9

github-actions bot marked this pull request as draft November 14, 2023 13:46

taniabogatsch added 4 commits November 15, 2023 17:51

review feedback

d79b62f

missing include

25ab6b5

more review feedback

0626b0e

revert to index_name

048dee0

taniabogatsch marked this pull request as ready for review November 17, 2023 08:48

taniabogatsch added Ready For Review and removed Changes Requested labels Nov 17, 2023

Mytherin added Ready To Merge and removed Ready For Review labels Nov 20, 2023

Mytherin changed the base branch from feature to main November 20, 2023 12:45

Merge branch 'main' into art-wal

51c3e90

github-actions bot marked this pull request as draft November 20, 2023 13:29

taniabogatsch marked this pull request as ready for review November 20, 2023 13:34

Mytherin merged commit 1a73610 into duckdb:main Nov 21, 2023

taniabogatsch deleted the art-wal branch November 21, 2023 11:49

taniabogatsch mentioned this pull request Nov 21, 2023

Index sizes not accounted for during auto checkpoints #6273

Closed

krlmlr added a commit to duckdb/duckdb-r that referenced this pull request Dec 11, 2023

chore: Update vendored sources to duckdb/duckdb@1a73610

cf864d8

Merge pull request duckdb/duckdb#9339 from taniabogatsch/art-wal

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[ART] WAL serialization, automatic checkpointing, decoupling catalog and storage, index names #9339

[ART] WAL serialization, automatic checkpointing, decoupling catalog and storage, index names #9339

Uh oh!

taniabogatsch commented Oct 13, 2023

Uh oh!

pdet left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Maxxen commented Oct 17, 2023

Uh oh!

Mytherin left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Mytherin commented Nov 21, 2023

Uh oh!

Uh oh!

[ART] WAL serialization, automatic checkpointing, decoupling catalog and storage, index names #9339

[ART] WAL serialization, automatic checkpointing, decoupling catalog and storage, index names #9339

Uh oh!

Conversation

taniabogatsch commented Oct 13, 2023

WAL Serialization

Account Indexes for Automatic Checkpoints

Decoupling the Index Catalog Entry from the Physical Index Storage

Questions

Follow-Up PRs

Uh oh!

pdet left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Maxxen commented Oct 17, 2023

Uh oh!

Mytherin left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Mytherin commented Nov 21, 2023

Uh oh!

Uh oh!