[Compression] Introduce `DICT_FSST` compression method #15637
…n optional encoded string
…t I accidentally made a corrupted db..
One consideration regarding file storage and cross-version compatibility: I think we should still support writing files that stay compatible with older versions. It might be fine to forbid generating …
Yes, agreed. We should keep the code for writing the old encodings for forwards-compatibility purposes. We should automatically select …
Alright, I'll add the code back in and use the `storage_compatibility_version` to decide whether dictionary+fsst should be loaded.
…storage, where fsst/dictionary are disabled
… for this purpose with the new require flag instead
Implement `Filter` and `Select` for `DICT_FSST`, fix stats gathering to skip `NULL` values and fix `LATEST_STORAGE` tests
…ishj/duckdb into dict_fsst_compression_combined
Thanks!
[Compression] Introduce `DICT_FSST` compression method (duckdb/duckdb#15637) Remove bundled TPCH & TPCDS in Python wheels (duckdb/duckdb#15923)
This PR adds `DICT_FSST`, making use of the new `NO_VALIDITY_REQUIRED` behavior introduced in #15591.

The new method is a mix of Dictionary and FSST (which will be deprecated in favor of this method at a later point).

The method currently has 3 modes that it can decide to use for a block:
- `DICTIONARY`: almost an exact replica of the existing Dictionary method, with no further encoding added.
- `DICT_FSST`: Dictionary, but the created dictionary is further encoded using FSST.
- `FSST_ONLY`: no dictionary encoding is performed in this mode, so no string lookups are performed. Instead, everything is added to the block without deduplication, and everything is FSST encoded.

A couple of separate improvements are made to the existing dictionary / FSST behaviors:
- In addition to the `dictionary_indices` (formerly known as `selection_buffer`), we now also bitpack the `string_lengths` (formerly known as `index_buffer`, which was storing offsets instead of lengths).
- `BitpackingPrimitives::MinimumBitWidth` and `BitpackingPrimitives::GetRequiredSize` calls are only performed when necessary.
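
To illustrate why storing lengths instead of offsets helps bitpacking, here is a minimal standalone sketch. It does not use DuckDB's code: `MinBitWidth`, `WidthFor`, `LengthsOf` and `EndOffsetsFromLengths` are hypothetical helpers written for this example, and the real `BitpackingPrimitives` functions may behave differently. The assumption is only that the bit width needed per value is driven by the largest value in the buffer.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (NOT DuckDB's BitpackingPrimitives::MinimumBitWidth):
// number of bits needed to represent max_value.
static uint8_t MinBitWidth(uint32_t max_value) {
	uint8_t width = 0;
	while (max_value > 0) {
		++width;
		max_value >>= 1;
	}
	return width;
}

// Bit width needed to bitpack every value in the buffer with one fixed width.
static uint8_t WidthFor(const std::vector<uint32_t> &values) {
	uint32_t max_value = 0;
	for (auto v : values) {
		max_value = std::max(max_value, v);
	}
	return MinBitWidth(max_value);
}

// Per-entry string lengths: bounded by the longest dictionary string.
static std::vector<uint32_t> LengthsOf(const std::vector<std::string> &dict) {
	std::vector<uint32_t> lengths;
	lengths.reserve(dict.size());
	for (auto &s : dict) {
		lengths.push_back(static_cast<uint32_t>(s.size()));
	}
	return lengths;
}

// End offsets are the running sum of the lengths, so they grow with the
// total dictionary size. They can be rebuilt from the lengths at scan time.
static std::vector<uint32_t> EndOffsetsFromLengths(const std::vector<uint32_t> &lengths) {
	std::vector<uint32_t> offsets;
	offsets.reserve(lengths.size());
	uint32_t running = 0;
	for (auto len : lengths) {
		running += len;
		offsets.push_back(running);
	}
	return offsets;
}
```

For a dictionary such as `{"apple", "banana", "cherry", "kiwi"}`, the lengths need a width that covers only the longest string, while the offsets need a width that covers the total size of the dictionary. The trade-off in this sketch is that offsets have to be reconstructed with a prefix sum when the data is read.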