Contributor

@Tishj Tishj commented Jan 9, 2025

This PR adds DICT_FSST, making use of the new NO_VALIDITY_REQUIRED behavior introduced in #15591

The new method combines Dictionary and FSST (which this method will deprecate at a later point).

The method currently has 3 modes that it can decide to use for a block:

  • DICTIONARY: almost an exact replica of the existing Dictionary method, with no further encoding added.
  • DICT_FSST: Dictionary encoding, but the created dictionary is further encoded using FSST.
  • FSST_ONLY: no dictionary encoding is performed in this mode, so no string lookups are needed. Instead, every string is added to the block without deduplication and is FSST encoded.
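To make the difference between the three modes concrete, here is a minimal Python sketch of what ends up stored in a block under each mode. It is an illustration only, not the code in this PR: `Mode`, `layout_block`, and `fsst_placeholder` are hypothetical names, and the FSST step is replaced by a pass-through placeholder because the point here is deduplication, not the FSST symbol table.

```python
from enum import Enum


class Mode(Enum):
    DICTIONARY = "DICTIONARY"
    DICT_FSST = "DICT_FSST"
    FSST_ONLY = "FSST_ONLY"


def fsst_placeholder(s: str) -> bytes:
    # Stand-in for FSST encoding (assumption): real FSST replaces frequent
    # substrings with one-byte codes. Here the bytes are passed through as-is.
    return s.encode("utf-8")


def layout_block(strings, mode):
    """Return (stored_strings, dictionary_indices) for a toy block."""
    if mode is Mode.FSST_ONLY:
        # No deduplication and no lookups: every value is stored in order.
        return [fsst_placeholder(s) for s in strings], None

    dictionary = []  # unique strings, in first-seen order
    lookup = {}      # string -> position in the dictionary
    indices = []     # one dictionary index per input value
    for s in strings:
        if s not in lookup:
            lookup[s] = len(dictionary)
            dictionary.append(s)
        indices.append(lookup[s])

    if mode is Mode.DICT_FSST:
        # Same dictionary, but its entries are additionally FSST encoded.
        stored = [fsst_placeholder(s) for s in dictionary]
    else:
        stored = [s.encode("utf-8") for s in dictionary]
    return stored, indices
```

With the input `["a", "b", "a", "a"]`, the two dictionary modes store two strings plus four indices, while FSST_ONLY stores all four strings and no indices.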

A couple of separate improvements are also made to the existing Dictionary / FSST behavior:

  • In addition to bitpacking the dictionary_indices (formerly known as selection_buffer), we now also bitpack the string_lengths (formerly known as index_buffer, which stored offsets instead of lengths).
  • To reduce the amount of computation done for each tuple, the BitpackingPrimitives::MinimumBitWidth and BitpackingPrimitives::GetRequiredSize calls are only performed when necessary.
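The two improvements above can be sketched in Python. This is an assumption-laden illustration, not DuckDB's implementation: `minimum_bit_width` is a simplified stand-in for `BitpackingPrimitives::MinimumBitWidth`, and `WidthTracker` shows just one possible way to skip the width computation when a new value cannot change the result. The first part shows why lengths bitpack better than cumulative offsets; the second counts how often the width actually has to be recomputed.

```python
def minimum_bit_width(max_value: int) -> int:
    """Bits needed to represent every value in [0, max_value]."""
    return max_value.bit_length()


def offsets_from_lengths(lengths):
    """Cumulative end offsets, as an offset-based buffer would store them."""
    offsets, total = [], 0
    for n in lengths:
        total += n
        offsets.append(total)
    return offsets


class WidthTracker:
    """Recomputes the bit width only when a new maximum is seen (hypothetical)."""

    def __init__(self):
        self.max_value = 0
        self.width = 0
        self.recomputations = 0

    def add(self, value: int) -> None:
        # Values at or below the current maximum cannot change the width,
        # so the computation is skipped for them.
        if value > self.max_value:
            self.max_value = value
            self.width = minimum_bit_width(value)
            self.recomputations += 1


lengths = [5, 12, 7, 9, 3, 12, 8, 11] * 100
offsets = offsets_from_lengths(lengths)

length_width = minimum_bit_width(max(lengths))  # longest string is 12 bytes
offset_width = minimum_bit_width(max(offsets))  # last offset is 6700

tracker = WidthTracker()
for n in lengths:
    tracker.add(n)
```

In this toy data the lengths fit in 4 bits while the offsets need 13, and the tracker recomputes the width twice across 800 values.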

Tishj added 24 commits January 7, 2025 13:32
@carlopi
Contributor

carlopi commented Jan 9, 2025

One consideration regarding file storage and cross-version compatibility: I think we should still support writing files in a way that keeps them compatible with older versions.

It might be fine to forbid generating Dictionary / FSST and DICT_FSST entirely when writing files that need to be compatible with older versions, but I am unsure what the costs would be (that is, how much bigger / slower files will become, given that cross-compatibility will be the default).

@Mytherin
Collaborator

Mytherin commented Jan 9, 2025

Yes, agreed: we should keep the code for writing the old encodings, for forward-compatibility purposes. We should automatically select DICT_FSST when targeting newer DuckDB versions (>= 1.2), and Dictionary or FSST when targeting older versions.
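The selection rule described in this comment could look roughly like the Python sketch below. All names here (`parse_version`, `DICT_FSST_MIN_VERSION`, `string_compression_candidates`) are hypothetical and do not correspond to DuckDB functions; the 1.2 threshold is taken from the comment as written.

```python
def parse_version(text: str) -> tuple:
    """Turn 'v1.2.0' or '1.2' into a comparable tuple of integers."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


# Threshold quoted in the comment above; treated here as an assumption.
DICT_FSST_MIN_VERSION = (1, 2)


def string_compression_candidates(target_version: str) -> list:
    """Pick which string compression methods may be written for a target version."""
    if parse_version(target_version) >= DICT_FSST_MIN_VERSION:
        return ["DICT_FSST"]
    # Older readers do not know DICT_FSST, so fall back to the old encodings.
    return ["DICTIONARY", "FSST"]
```

For example, `string_compression_candidates("v1.1.3")` yields the two older methods, while `string_compression_candidates("v1.2.0")` yields only `DICT_FSST`.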

@Tishj
Contributor Author

Tishj commented Jan 9, 2025

Alright I'll add the code back in and use the storage_compatibility_version to decide whether dictionary+fsst should be loaded

@Tishj Tishj marked this pull request as ready for review May 6, 2025 12:07
@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 6, 2025 12:57
@Tishj Tishj marked this pull request as ready for review May 6, 2025 12:57
Mytherin and others added 2 commits May 6, 2025 21:19
Implement `Filter` and `Select` for `DICT_FSST`, fix stats gathering to skip `NULL` values and fix `LATEST_STORAGE` tests
@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 6, 2025 20:28
@Tishj Tishj marked this pull request as ready for review May 6, 2025 20:30
@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 7, 2025 08:32
@Tishj Tishj marked this pull request as ready for review May 7, 2025 09:28
@Mytherin Mytherin merged commit 7152079 into duckdb:main May 7, 2025
47 of 49 checks passed
@Mytherin
Collaborator

Mytherin commented May 7, 2025

Thanks!

krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 18, 2025
[Compression] Introduce `DICT_FSST` compression method (duckdb/duckdb#15637)
Remove bundled TPCH & TPCDS in Python wheels (duckdb/duckdb#15923)
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 19, 2025
[Compression] Introduce `DICT_FSST` compression method (duckdb/duckdb#15637)
Remove bundled TPCH & TPCDS in Python wheels (duckdb/duckdb#15923)
3 participants