Skip to content

Conversation

taniabogatsch
Copy link
Contributor

This is an involved ART PR that mainly covers four topics. All these topics depend on each other, so I decided to bundle them in on PR—general related issues: https://github.com/duckdblabs/duckdb-internal/issues/199, https://github.com/duckdblabs/duckdb-internal/issues/102.

  1. Serializing small indexes directly to the WAL. Before this PR, we would rebuild the ART when replaying the WAL. This led to hard-to-fix index binder issues and specialized, non-performant index creation code.
  2. Index changes are now accounted for when estimating the automatic checkpoint threshold.
  3. Decoupling index catalog entries and their physical storage by removing the index field from the IndexCatalogEntry. This is a step towards pluggable indexes and also towards custom catalogs.
  4. Primary keys and foreign keys now have names.

WAL Serialization

We reuse most of our existing serialization code for serializing an ART index to the WAL. However, we do not need to write the allocator data onto partial blocks, as we are directly writing the bytes from their respective buffers. Therefore, we pass a flag to the existing index serialization functions. Then, we only return all data_ptr_ts of relevant buffers and their allocation_size. Related fixed issues: https://github.com/duckdblabs/duckdb-internal/issues/345, https://github.com/duckdblabs/duckdb-internal/issues/374

NOTE: I changed the index serialization to work with the WAL and checkpoint managers. Before this PR, we used the metadata manager for serializing non-block index data, which is not accessible from the WAL. Now, we use the Binary(De)Serializer directly. This also means adding custom serialization code to the respective JSON files for new storage information structs. To guarantee backward compatibility, I kept some of the old deserialization code.

NOTE: THIS BREAKS FORWARD STORAGE COMPATIBILITY. Unifying the index serialization code to use the Binary(De)Serializer avoids two separate implementations (WAL and checkpoint manager). If we do not want this, the implementation becomes more cumbersome, and I will implement two different code paths (@Maxxen?).

Account Indexes for Automatic Checkpoints

When estimating the size of all changes in a transaction, we now consider an estimated index size. This enables indexes to trigger automatic checkpoints (#6273).

Decoupling the Index Catalog Entry from the Physical Index Storage

I removed the physical index from the index catalog entry. Also, I added names to primary and foreign keys. Physical index objects now also contain their names. These names are the connection between a catalog entry and an index object. We need this connection in various places where we directly used the linked physical object in the catalog entry previously.

Additionally, I removed the IndexType enum, which is now simply a string. This allows adding arbitrary new index types through extensions without changing a duckdb enum. Related issue: https://github.com/duckdblabs/duckdb-internal/issues/321

Questions

src/include/duckdb/storage/serialization/create_info.json: I changed a type here and removed a field. Is this allowed?

Follow-Up PRs

We now use an initial_art_size variable in the DuckIndexEntry. I plan to remove this variable again in a follow-up PR that dives more into the unique names of indexes. In this follow-up PR, I plan to add a function to traverse the catalog and execute a callback function on all existing indexes (via DuckTableEntry). Two usages for this callback function are (1) checking for unique name conflicts with primary and possibly foreign keys during index creation and (2) retrieving a matching index from the undo buffer during WAL size estimation.

At the moment, conflicting names between CREATE INDEX and primary/foreign keys are not problematic, as we always deserialize primary/unique/foreign indexes before separately created indexes, and because it is not yet possible to set custom names for these constraints.

@taniabogatsch taniabogatsch requested review from Maxxen and pdet October 13, 2023 12:48
Copy link
Contributor

@pdet pdet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very exciting work and it is cool to see that pluggable indexes are coming! :-)

I've added some comments, mostly NIPs, besides the index_types becoming strings and adding a lot of code to manage old checkpoints. You can see my comments about them around the code, but we can chat about it in more detail if you wish!

For the pluggable indexes, I think it would be cool to see a mock index being plugged as a test, but it's also a bit out of the scope of this PR.

Otherwise, I'm missing a couple more tests.

  1. Can we have a test that loads a DuckDB with an index on an older storage? Different types of indexes, multiple indexes in the same db, and so on.
  2. Can we add tests that check indexes with the same name can't be created? including scenarios with camelcases, invalid names, etc..

@Maxxen
Copy link
Member

Maxxen commented Oct 17, 2023

Re: Index type strings - we need to have some sort of identifier for index types, and a string is probably the most flexible way to do it.

Ideally I think it makes most sense to have "IndexType" being a separate catalog entry, which an index "instance" can reference. Similarly to how we currently can create custom types, and then have tables/functions with columns/arguments referencing these types. Im not sure how these sort of intra-catalog references are made though, is this where you would use an OID? How does it work now when e.g. a table references an index? If that's just by name (i.e string) then I don't see the issue with index types being strings too.

@github-actions github-actions bot marked this pull request as draft November 1, 2023 14:39
@taniabogatsch taniabogatsch marked this pull request as ready for review November 8, 2023 10:17
Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Looks great - some comments from my side:

@github-actions github-actions bot marked this pull request as draft November 14, 2023 13:46
@taniabogatsch taniabogatsch marked this pull request as ready for review November 17, 2023 08:48
@Mytherin Mytherin changed the base branch from feature to main November 20, 2023 12:45
@github-actions github-actions bot marked this pull request as draft November 20, 2023 13:29
@taniabogatsch taniabogatsch marked this pull request as ready for review November 20, 2023 13:34
@Mytherin Mytherin merged commit 1a73610 into duckdb:main Nov 21, 2023
@Mytherin
Copy link
Collaborator

Thanks!

@taniabogatsch taniabogatsch deleted the art-wal branch November 21, 2023 11:49
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request Dec 11, 2023
Merge pull request duckdb/duckdb#9339 from taniabogatsch/art-wal
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants