-
Notifications
You must be signed in to change notification settings - Fork 2.6k
[ART] WAL serialization, automatic checkpointing, decoupling catalog and storage, index names #9339
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very exciting work and it is cool to see that pluggable indexes are coming! :-)
I've added some comments, mostly NIPs, besides the index_types becoming strings and adding a lot of code to manage old checkpoints. You can see my comments about them around the code, but we can chat about it in more detail if you wish!
For the pluggable indexes, I think it would be cool to see a mock index being plugged as a test, but it's also a bit out of the scope of this PR.
Otherwise, I'm missing a couple more tests.
- Can we have a test that loads a DuckDB with an index on an older storage? Different types of indexes, multiple indexes in the same db, and so on.
- Can we add tests that check indexes with the same name can't be created? including scenarios with camelcases, invalid names, etc..
Re: Index type strings - we need to have some sort of identifier for index types, and a string is probably the most flexible way to do it. Ideally I think it makes most sense to have "IndexType" being a separate catalog entry, which an index "instance" can reference. Similarly to how we currently can create custom types, and then have tables/functions with columns/arguments referencing these types. Im not sure how these sort of intra-catalog references are made though, is this where you would use an OID? How does it work now when e.g. a table references an index? If that's just by name (i.e string) then I don't see the issue with index types being strings too. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! Looks great - some comments from my side:
src/include/duckdb/catalog/catalog_entry/index_catalog_entry.hpp
Outdated
Show resolved
Hide resolved
Thanks! |
Merge pull request duckdb/duckdb#9339 from taniabogatsch/art-wal
This is an involved ART PR that mainly covers four topics. All these topics depend on each other, so I decided to bundle them in on PR—general related issues: https://github.com/duckdblabs/duckdb-internal/issues/199, https://github.com/duckdblabs/duckdb-internal/issues/102.
index
field from theIndexCatalogEntry
. This is a step towards pluggable indexes and also towards custom catalogs.WAL Serialization
We reuse most of our existing serialization code for serializing an ART index to the WAL. However, we do not need to write the allocator data onto partial blocks, as we are directly writing the bytes from their respective buffers. Therefore, we pass a flag to the existing index serialization functions. Then, we only return all
data_ptr_t
s of relevant buffers and theirallocation_size
. Related fixed issues: https://github.com/duckdblabs/duckdb-internal/issues/345, https://github.com/duckdblabs/duckdb-internal/issues/374NOTE: I changed the index serialization to work with the WAL and checkpoint managers. Before this PR, we used the metadata manager for serializing non-block index data, which is not accessible from the WAL. Now, we use the
Binary(De)Serializer
directly. This also means adding custom serialization code to the respective JSON files for new storage information structs. To guarantee backward compatibility, I kept some of the old deserialization code.NOTE: THIS BREAKS FORWARD STORAGE COMPATIBILITY. Unifying the index serialization code to use the
Binary(De)Serializer
avoids two separate implementations (WAL and checkpoint manager). If we do not want this, the implementation becomes more cumbersome, and I will implement two different code paths (@Maxxen?).Account Indexes for Automatic Checkpoints
When estimating the size of all changes in a transaction, we now consider an estimated index size. This enables indexes to trigger automatic checkpoints (#6273).
Decoupling the Index Catalog Entry from the Physical Index Storage
I removed the physical index from the index catalog entry. Also, I added names to primary and foreign keys. Physical index objects now also contain their names. These names are the connection between a catalog entry and an index object. We need this connection in various places where we directly used the linked physical object in the catalog entry previously.
Additionally, I removed the
IndexType
enum, which is now simply a string. This allows adding arbitrary new index types through extensions without changing a duckdb enum. Related issue: https://github.com/duckdblabs/duckdb-internal/issues/321Questions
src/include/duckdb/storage/serialization/create_info.json
: I changed a type here and removed a field. Is this allowed?Follow-Up PRs
We now use an
initial_art_size
variable in theDuckIndexEntry
. I plan to remove this variable again in a follow-up PR that dives more into the unique names of indexes. In this follow-up PR, I plan to add a function to traverse the catalog and execute a callback function on all existing indexes (viaDuckTableEntry
). Two usages for this callback function are (1) checking for unique name conflicts with primary and possibly foreign keys during index creation and (2) retrieving a matching index from the undo buffer during WAL size estimation.At the moment, conflicting names between
CREATE INDEX
and primary/foreign keys are not problematic, as we always deserialize primary/unique/foreign indexes before separately created indexes, and because it is not yet possible to set custom names for these constraints.