Add dictionary size, and use dictionary/constant vectors in the aggregate hash table to speed up finding groups #15152
Merged
Conversation
Mytherin added a commit that referenced this pull request on Dec 9, 2024
…and use this in the aggregate HT to cache look-ups (#15196)

Follow-up from #15152. This PR enables caching the results computed for dictionaries across vectors by (optionally) assigning them a unique identifier. The identifier is a string that must be unique so we can distinguish whether a vector carries the same dictionary as the previous vector or a different one. It does not have to be globally unique (really only within the same pipeline for the same thread).

* For our own storage, the dictionary id is set to the `ColumnSegment` pointer address. Since we never de-allocate `ColumnSegment`s while scanning, this works.
* For Parquet dictionaries, the dictionary id is set to `[parquet_filename]_[column_name]_[byte_offset]`.
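The caching scheme is easy to picture with a small standalone sketch. The code below is illustrative only: `DictionaryVector`, `DictionaryCache`, and `GetHashes` are hypothetical names, and a plain `std::hash` stands in for the real per-group computation; the point is simply that work keyed by the dictionary id is reused as long as consecutive vectors carry the same id.

```cpp
// Minimal sketch of caching per-dictionary results across vectors, keyed by
// a dictionary id. All names here are illustrative, not DuckDB's actual API.
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct DictionaryVector {
    std::string dict_id;           // unique id, e.g. "[file]_[column]_[byte_offset]"
    std::vector<std::string> dict; // the unique values
    std::vector<uint32_t> sel;     // which dictionary entry each row points to
};

struct DictionaryCache {
    std::string cached_id;
    std::vector<uint64_t> cached_hashes; // one cached result per dictionary entry

    const std::vector<uint64_t> &GetHashes(const DictionaryVector &input) {
        // only recompute when the dictionary id changes between vectors
        if (input.dict_id != cached_id) {
            cached_hashes.clear();
            for (auto &value : input.dict) {
                cached_hashes.push_back(std::hash<std::string> {}(value));
            }
            cached_id = input.dict_id;
        }
        return cached_hashes;
    }
};

int main() {
    DictionaryVector v1 {"lineitem.parquet_l_shipmode_1024", {"AIR", "RAIL", "SHIP"}, {0, 2, 2, 1}};
    DictionaryVector v2 = v1; // the next vector references the same dictionary
    DictionaryCache cache;
    cache.GetHashes(v1);
    cache.GetHashes(v2); // cache hit: dictionary id matches, nothing is recomputed
    std::cout << cache.cached_hashes.size() << " hashes cached\n";
}
```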
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request on Dec 27, 2024

Add dictionary size, and use dictionary/constant vectors in the aggregate hash table to speed up finding groups (duckdb/duckdb#15152)
Fix Regression workflow running against itself (duckdb/duckdb#15165)
Fix wrongly used github.ref to sha (duckdb/duckdb#15167)
Add cross-version testing CI (duckdb/duckdb#15161)
github-actions bot added a commit to duckdb/duckdb-r that referenced this pull request on Dec 28, 2024

Add dictionary size, and use dictionary/constant vectors in the aggregate hash table to speed up finding groups (duckdb/duckdb#15152)
Fix Regression workflow running against itself (duckdb/duckdb#15165)
Fix wrongly used github.ref to sha (duckdb/duckdb#15167)
Add cross-version testing CI (duckdb/duckdb#15161)
Co-authored-by: krlmlr <krlmlr@users.noreply.github.com>
Mytherin added a commit that referenced this pull request on Jul 13, 2025
PR #15152 added support for _dictionary aggregation_: when figuring out which group a value belongs to, we only figure this out once per unique value, if the vector is a dictionary vector coming from storage. This PR implements _dictionary functions_: when evaluating a scalar function on a dictionary vector, we evaluate it once per unique value. This has two benefits: (1) less time is spent evaluating the function, as it is evaluated on fewer values, and (2) the resulting vector is again a dictionary vector instead of a flat one, allowing for subsequent dictionary optimizations.

For dictionaries that are larger than `STANDARD_VECTOR_SIZE`, we loop over the dictionary, executing the function on at most `STANDARD_VECTOR_SIZE` values at a time, similar to the approach in the list functions. The most efficient approach would be to execute the function on the entire dictionary at once, but some functions assume that inputs are never larger than `STANDARD_VECTOR_SIZE`, so this is not feasible. To avoid copying data between intermediate vectors and the resulting dictionary, I've added the restriction that both the input and output must be non-nested.

After implementing this, I noticed that the optimization was almost never triggered for columns compressed with `DICT_FSST`. This was due to a restriction that the start of vectors must be aligned with our bitpacking group size, which only has a 1/32 chance of happening. We don't have this restriction for regular dictionary compression, so I was able to remove the restriction using the same approach used there.

Small benchmark (same table as PR #15152):

```sql
-- generate data
CREATE TABLE strings AS
SELECT CONCAT('thisisastringwithrepetitions', i % 100) AS grp, i
FROM range(100_000_000) tbl(i);
-- query
SELECT SUM(b) FROM (SELECT grp LIKE '%42%' b FROM strings);
```

For this query, `LIKE` is converted to the `contains` function, which is executed on the dictionary.

| main [s] | this PR [s] |
|-|-|
| 0.082 | __0.024__ |

The previous query is sped up by more than 3x. The gains are even higher for more expensive functions, especially when the output of the function becomes a group for dictionary aggregation:

```sql
SELECT grp, SUM(i) FROM (
    SELECT regexp_extract(grp, '[0-9]+') AS grp, i FROM strings
) GROUP BY grp;
```

| main [s] | this PR [s] |
|-|-|
| 2.154 | __0.075__ |

This query is sped up by almost 30x. This is currently only implemented for `BoundFunctionExpression`, but it can be implemented for `BOUND_CASE`, `BOUND_CAST`, etc. in future PRs.
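To illustrate the chunked execution described above, here is a hedged, self-contained sketch. `ExecuteOnDictionary` and the surrounding types are hypothetical, and `2048` is only DuckDB's default for `STANDARD_VECTOR_SIZE`, but the loop shows the idea: the function runs once per unique dictionary entry, at most `STANDARD_VECTOR_SIZE` entries at a time.

```cpp
// Illustrative sketch (not DuckDB's implementation) of executing a scalar
// function once per unique dictionary value, in chunks of at most
// STANDARD_VECTOR_SIZE, so that functions which assume bounded input sizes
// still work; the result can then be re-wrapped as a dictionary vector.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

static const size_t STANDARD_VECTOR_SIZE = 2048; // DuckDB's default chunk size

template <class FUNC>
std::vector<bool> ExecuteOnDictionary(const std::vector<std::string> &dict, FUNC f) {
    std::vector<bool> result(dict.size());
    // process at most STANDARD_VECTOR_SIZE dictionary entries per iteration
    for (size_t offset = 0; offset < dict.size(); offset += STANDARD_VECTOR_SIZE) {
        size_t count = std::min(STANDARD_VECTOR_SIZE, dict.size() - offset);
        for (size_t i = 0; i < count; i++) {
            result[offset + i] = f(dict[offset + i]);
        }
    }
    return result;
}

int main() {
    std::vector<std::string> dict(5000, "thisisastringwithrepetitions42");
    auto contains42 = [](const std::string &s) { return s.find("42") != std::string::npos; };
    auto res = ExecuteOnDictionary(dict, contains42);
    // rows keep pointing into `res` through the original selection vector,
    // so the output is again a dictionary vector rather than a flat one
    std::cout << res.size() << " unique values evaluated\n";
}
```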
This PR adjusts the dictionary vector to have an optional dictionary size. This size parameter is set only for duplicate-eliminated dictionaries that are read from disk (either from dictionary-compressed columns in DuckDB's own storage, or from dictionary-compressed columns in Parquet). The dictionary size can be used to speed up various operations on these vectors, as we can operate on the underlying dictionary instead of on the individual values, allowing us to touch only the unique values in the vector.
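As a rough illustration of why the size is optional, consider the following minimal sketch (all names are hypothetical, not DuckDB's actual vector API): an operator can branch on whether the dictionary size is known and, if so, process only the unique values.

```cpp
// Sketch of an *optional* dictionary size: when the size is known
// (duplicate-eliminated dictionary from storage), an operator can work on the
// unique values; otherwise it falls back to per-row processing.
#include <cstdint>
#include <iostream>
#include <optional>
#include <vector>

struct DictVector {
    std::vector<int64_t> dict;       // dictionary entries
    std::vector<uint32_t> sel;       // row -> dictionary index
    std::optional<size_t> dict_size; // set only for on-disk, duplicate-eliminated dictionaries
};

size_t CountValuesToProcess(const DictVector &v) {
    if (v.dict_size) {
        // size is known: operate on the unique values only
        return *v.dict_size;
    }
    // unknown dictionary: process every row individually
    return v.sel.size();
}

int main() {
    DictVector with_size {{1, 2, 3}, {0, 1, 2, 0, 1, 2}, 3};
    DictVector without {{1, 2, 3}, {0, 1, 2, 0, 1, 2}, std::nullopt};
    std::cout << CountValuesToProcess(with_size) << " vs " << CountValuesToProcess(without) << "\n"; // 3 vs 6
}
```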
This PR also adjusts the aggregate hash table to utilize both dictionary and constant vectors when inserting groups. The key observation is that the group a value belongs to only needs to be determined once per unique value. For constant vectors, that means we only need to probe the hash table once. For dictionary vectors, we only need to probe the unique values within the dictionary.
This behavior is added in the `TryAddCompressedGroups` method. We first figure out which unique values are present in the dictionary (and are also referenced in the vector), and then call `FindOrCreateGroups` only for those unique values.
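A minimal standalone sketch of this flow, assuming a toy hash table in place of the real aggregate hash table (`FindOrCreateGroup` below is a stand-in, not DuckDB's actual signature):

```cpp
// Sketch of the TryAddCompressedGroups idea: find the dictionary entries that
// are actually referenced by the vector, resolve a group for each unique entry
// once, then propagate group ids back to all rows through the selection vector.
#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
    std::vector<std::string> dict {"a", "b", "c"};
    std::vector<uint32_t> sel {0, 0, 2, 0, 2, 0}; // "b" is never referenced

    // step 1: which dictionary entries are referenced at all?
    std::vector<bool> referenced(dict.size(), false);
    for (auto idx : sel) {
        referenced[idx] = true;
    }

    // step 2: probe the hash table once per referenced unique value
    std::unordered_map<std::string, int> hash_table; // stand-in for the aggregate HT
    std::vector<int> dict_groups(dict.size(), -1);
    for (size_t i = 0; i < dict.size(); i++) {
        if (!referenced[i]) {
            continue;
        }
        // "FindOrCreateGroup": reuse the existing group or create a new one
        auto entry = hash_table.try_emplace(dict[i], (int)hash_table.size());
        dict_groups[i] = entry.first->second;
    }

    // step 3: assign each row its group via the selection vector (no more probes)
    for (auto idx : sel) {
        std::cout << dict[idx] << " -> group " << dict_groups[idx] << "\n";
    }
}
```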
UnaryExecutor

We also use dictionary vectors to speed up execution of the `UnaryExecutor` - however, we only do this if the function cannot throw errors (`FunctionErrors`). The reason is that at this layer we compute the function for the entire dictionary instead of only the elements that are explicitly referenced - if the function could throw, we might execute it on a filtered-out row and introduce spurious errors. Currently we explicitly define `FunctionErrors::CANNOT_ERROR` only for compressed materialization - so this is still limited (to be expanded in the future).
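The error-safety rule can be sketched as follows. This is a simplified illustration, not the real `UnaryExecutor`; only the `FunctionErrors::CANNOT_ERROR` name comes from the PR, the rest is hypothetical:

```cpp
// Sketch of the safety check: the dictionary fast path evaluates the function
// on the *entire* dictionary, including entries no row references, so it is
// only taken when the function is known not to throw.
#include <cstdint>
#include <iostream>
#include <vector>

enum class FunctionErrors { CANNOT_ERROR, CAN_THROW_RUNTIME_ERROR };

template <class FUNC>
std::vector<int> ExecuteUnary(const std::vector<int> &dict, const std::vector<uint32_t> &sel,
                              FUNC f, FunctionErrors errors) {
    std::vector<int> result;
    if (errors == FunctionErrors::CANNOT_ERROR) {
        // fast path: evaluate once per dictionary entry, even unreferenced ones
        std::vector<int> dict_result(dict.size());
        for (size_t i = 0; i < dict.size(); i++) {
            dict_result[i] = f(dict[i]);
        }
        for (auto idx : sel) {
            result.push_back(dict_result[idx]);
        }
    } else {
        // a throwing function may only see rows that are actually referenced
        for (auto idx : sel) {
            result.push_back(f(dict[idx]));
        }
    }
    return result;
}

int main() {
    std::vector<int> dict {1, 2, 3};
    std::vector<uint32_t> sel {0, 2, 0};
    auto doubled = ExecuteUnary(dict, sel, [](int x) { return 2 * x; }, FunctionErrors::CANNOT_ERROR);
    std::cout << doubled.size() << " rows produced\n";
}
```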
Benchmarks

Below are some performance numbers for synthetic data (100M rows):
- Aggregation over string dictionary
- Aggregation over constant dates
Limitations

The dictionary optimization is only used for dictionaries of size up to `20000` - for larger dictionaries we don't consider the optimization.