
Conversation


@Mytherin Mytherin commented Dec 5, 2024

This PR adjusts the dictionary vector to have an optional dictionary size. This size parameter is set only for duplicate-eliminated dictionaries that are read from disk (either from dictionary-compressed columns in DuckDB's own storage, or from dictionary-compressed columns in Parquet). The dictionary size enables various optimizations on these vectors, as we can operate on the underlying dictionary instead of on the individual values, i.e. only on the unique values in the vector.
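To make the layout concrete, here is a minimal conceptual sketch of a dictionary vector with the new optional size field. The names are hypothetical and do not match DuckDB's actual `Vector`/`DictionaryVector` classes:

```cpp
// Conceptual sketch only: hypothetical names, not DuckDB's actual Vector classes.
#include <cstdint>
#include <optional>
#include <vector>

struct SketchDictionaryVector {
	std::vector<int64_t> dictionary;   // child vector holding the dictionary values
	std::vector<uint32_t> sel;         // per-row selection index into the dictionary
	// set only for duplicate-eliminated dictionaries read from disk
	// (DuckDB's own storage or Parquet); unset otherwise
	std::optional<uint64_t> dictionary_size;
};
```

When the size is set, an operator can work on the `dictionary_size` unique entries instead of on every individual row.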

This PR adjusts the aggregate hash table to utilize both dictionary and constant vectors when inserting groups. In particular, when figuring out which group a value belongs to, we only need to figure this out once per unique value. For constant vectors, that means we only need to probe the hash table once. For dictionary vectors, we only need to probe the unique values within the dictionary.

This behavior is added in the `TryAddCompressedGroups` method. We first figure out which unique values in the dictionary are actually referenced by the vector, and then call `FindOrCreateGroups` only for those unique values.
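A minimal sketch of the idea, using hypothetical helper names (the real method is `TryAddCompressedGroups` in the aggregate hash table): probe the hash table once per referenced dictionary entry, then scatter the resulting group ids back to the rows.

```cpp
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <vector>

// Rows reference dictionary entries through a selection vector `sel`; we only
// want to probe the hash table once per *referenced* dictionary entry.
std::vector<uint64_t> GroupRowsThroughDictionary(
    const std::vector<int64_t> &dictionary, const std::vector<uint32_t> &sel,
    const std::function<uint64_t(int64_t)> &find_or_create_group) {
	// 1. collect the unique dictionary indices that are actually referenced
	std::unordered_map<uint32_t, uint64_t> dict_index_to_group;
	for (auto dict_idx : sel) {
		dict_index_to_group.emplace(dict_idx, 0);
	}
	// 2. probe the hash table once per unique value
	//    (FindOrCreateGroups in the actual PR)
	for (auto &entry : dict_index_to_group) {
		entry.second = find_or_create_group(dictionary[entry.first]);
	}
	// 3. scatter the group ids back to the individual rows
	std::vector<uint64_t> row_groups(sel.size());
	for (size_t row = 0; row < sel.size(); row++) {
		row_groups[row] = dict_index_to_group[sel[row]];
	}
	return row_groups;
}
```

For a constant vector the same idea degenerates to a single probe whose result is re-used for every row.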

UnaryExecutor

We also use dictionary vectors to speed up execution in the `UnaryExecutor` - however, we only do this if the function cannot throw errors (`FunctionErrors`). The reason is that at this layer we compute the function for the entire dictionary instead of only for the elements that are explicitly referenced - if the function could throw, we might execute it on a filtered-out row and introduce spurious errors. Currently we only define `FunctionErrors::CANNOT_ERROR` for compressed materialization - so this is still limited (to be expanded in the future).
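A rough sketch of that fast path, assuming a simplified vector representation (hypothetical signature, and any enum value other than `CANNOT_ERROR` is made up here; the real logic lives in DuckDB's `UnaryExecutor`):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Only CANNOT_ERROR is named in the PR; the other value is a placeholder.
enum class FunctionErrors { CAN_THROW_RUNTIME_ERROR, CANNOT_ERROR };

template <class INPUT, class RESULT, class FUNC>
std::vector<RESULT> ExecuteUnary(const std::vector<INPUT> &dictionary,
                                 const std::vector<uint32_t> &sel, FUNC fun,
                                 FunctionErrors errors) {
	std::vector<RESULT> result(sel.size());
	if (errors == FunctionErrors::CANNOT_ERROR) {
		// fast path: evaluate the function once per dictionary entry,
		// including entries the current vector never references ...
		std::vector<RESULT> dict_result(dictionary.size());
		for (size_t i = 0; i < dictionary.size(); i++) {
			dict_result[i] = fun(dictionary[i]);
		}
		// ... and re-use the selection vector to fill in the per-row result
		for (size_t row = 0; row < sel.size(); row++) {
			result[row] = dict_result[sel[row]];
		}
		return result;
	}
	// slow path: evaluate the function once per row
	for (size_t row = 0; row < sel.size(); row++) {
		result[row] = fun(dictionary[sel[row]]);
	}
	return result;
}
```

Because the fast path touches entries that may have been filtered out, it is gated on the function being unable to throw.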

Benchmarks

Below are some performance numbers for synthetic data (100M rows):

Aggregation over string dictionary

```sql
CREATE TABLE strings AS SELECT CONCAT('thisisastringwithrepetitions', i%100) AS grp, i FROM range(100_000_000) tbl(i);
SELECT grp, SUM(i) FROM strings GROUP BY ALL ORDER BY ALL;
```

| v1.1.3 | new |
|-|-|
| 0.12s | 0.04s |

Aggregation over constant dates

```sql
CREATE TABLE dates AS SELECT DATE '1900-01-01' + INTERVAL (i // 50000) MONTH grp, i FROM range(100_000_000) tbl(i);
SELECT grp, SUM(i) FROM dates GROUP BY ALL ORDER BY ALL;
```

| v1.1.3 | new |
|-|-|
| 0.07s | 0.04s |

Limitations

  • Currently this only works for grouping of individual dictionary vectors (i.e. grouping by a single column). Grouping by multiple dictionary vectors is possible but more challenging, especially when it comes to caching of the data.
  • We limit the dictionary size to 20,000 entries - for larger dictionaries we don't consider the optimization.

@duckdb-draftbot duckdb-draftbot marked this pull request as draft December 6, 2024 12:08
@Mytherin Mytherin marked this pull request as ready for review December 6, 2024 12:08
@Mytherin Mytherin merged commit 9da182a into duckdb:main Dec 6, 2024
44 checks passed
@Mytherin Mytherin deleted the aggregatedictionaryvectors branch December 8, 2024 06:50
Mytherin added a commit that referenced this pull request Dec 9, 2024
…and use this in the aggregate HT to cache look-ups (#15196)

Follow-up from #15152

This PR enables the caching of results computed for dictionaries across
vectors by (optionally) assigning them a unique identifier. The
identifier is a string that must be unique enough to distinguish whether
a vector carries the same dictionary as the previous vector or a
different one. It does not have to be globally unique - it only has to
be unique within the same pipeline for the same thread.

* For our own storage, the dictionary id is set to the `ColumnSegment`
pointer address. Since we never de-allocate `ColumnSegments` while
scanning, this works.
* For Parquet dictionaries, the dictionary id is set to
`[parquet_filename]_[column_name]_[byte_offset]`
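
A minimal sketch of how such an id could drive the caching decision (hypothetical structure; the actual caching sits inside DuckDB's aggregate hash table):

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct DictionaryLookupCache {
	// e.g. the ColumnSegment pointer address for DuckDB storage, or
	// "[parquet_filename]_[column_name]_[byte_offset]" for Parquet
	std::string dictionary_id;
	// cached group id per dictionary entry, computed when the id was first seen
	std::vector<uint64_t> dict_groups;
};

// Returns true if the cached look-ups can be re-used for the incoming vector.
bool CanReuseCache(const DictionaryLookupCache &cache, const std::string &incoming_id) {
	// an empty id means the dictionary is not a duplicate-eliminated
	// dictionary from storage, so nothing can be cached for it
	return !incoming_id.empty() && cache.dictionary_id == incoming_id;
}
```

Since the id only needs to be stable within one pipeline on one thread, a simple comparison against the previously seen id suffices.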
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request Dec 27, 2024
Add dictionary size, and use dictionary/constant vectors in the aggregate hash table to speed up finding groups  (duckdb/duckdb#15152)
Fix Regression workflow running against itself (duckdb/duckdb#15165)
Fix wrongly used github.ref to sha (duckdb/duckdb#15167)
Add cross-version testing CI (duckdb/duckdb#15161)
@lnkuiper lnkuiper mentioned this pull request Jul 3, 2025
Mytherin added a commit that referenced this pull request Jul 13, 2025
PR #15152 added support for
_dictionary aggregation_: when figuring out which group a value belongs
to, we only figure this out once per unique value, if the vector is a
dictionary vector coming from storage.

This PR implements _dictionary functions_: when evaluating a scalar
function on a dictionary vector, we evaluate it once per unique value.
This has two benefits: 1) less time is spent evaluating the function, as
it is evaluated on fewer values, and 2) the resulting vector is again a
dictionary vector instead of a flat one, allowing for subsequent
dictionary optimizations.

For dictionaries that are larger than `STANDARD_VECTOR_SIZE`, we loop
over the dictionary, executing the function on at most
`STANDARD_VECTOR_SIZE` values at a time, similar to the approach in the
list functions. The most efficient approach would be to execute the
function on the entire dictionary at once, but some functions assume
that inputs are never larger than `STANDARD_VECTOR_SIZE`, so this is not
feasible. To avoid copying data between intermediate vectors and the
resulting dictionary, I've added the restriction that both the input and
output must be non-nested.
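
A sketch of that chunked loop, assuming a simplified dictionary representation (hypothetical helper; the actual code operates on DuckDB `Vector`s inside the expression executor):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr size_t STANDARD_VECTOR_SIZE = 2048; // DuckDB's default vector size

template <class INPUT, class RESULT, class FUNC>
std::vector<RESULT> ExecuteOnDictionary(const std::vector<INPUT> &dictionary, FUNC fun) {
	std::vector<RESULT> dict_result(dictionary.size());
	for (size_t offset = 0; offset < dictionary.size(); offset += STANDARD_VECTOR_SIZE) {
		const size_t count = std::min(STANDARD_VECTOR_SIZE, dictionary.size() - offset);
		// execute the function on at most STANDARD_VECTOR_SIZE values at a time
		for (size_t i = 0; i < count; i++) {
			dict_result[offset + i] = fun(dictionary[offset + i]);
		}
	}
	// the result remains dictionary-encoded: the original selection vector can
	// point into dict_result, so no per-row copy is needed
	return dict_result;
}
```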

After implementing this, I noticed that the optimization was almost
never triggered for columns compressed with `DICT_FSST`. This was due to
a restriction that the start of vectors must be aligned with our
bitpacking group size, which only has a 1/32 chance of happening. We
don't have this restriction for regular dictionary compression, so I was
able to remove the restriction using the same approach used there.

Small benchmark (same table as PR
#15152):
```sql
-- generate data
CREATE TABLE strings AS
SELECT CONCAT('thisisastringwithrepetitions', i%100) AS grp, i
FROM range(100_000_000) tbl(i);
-- query
SELECT SUM(b)
FROM (
    SELECT grp LIKE '%42%' b
    FROM strings
);
```

For this query, `LIKE` is converted to the `contains` function, which is
executed on the dictionary.

| main [s] | this PR [s] |
|-|-|
| 0.082 | __0.024__ |

The previous query is sped up by more than 3x. The gains are even higher
for more expensive functions, especially when the output of the function
becomes a group for dictionary aggregation:
```sql
SELECT grp, SUM(i)
FROM (
    SELECT regexp_extract(grp, '[0-9]+') AS grp, i
    FROM strings
)
GROUP BY grp;
```

| main [s] | this PR [s] |
|-|-|
| 2.154 | __0.075__ |

This query is sped up by almost 30x.

This is currently only implemented for `BoundFunctionExpression`, but it
can be implemented for `BOUND_CASE`, `BOUND_CAST`, etc. in future PRs.