
Conversation


@Mytherin Mytherin commented Dec 5, 2024

This PR adjusts the dictionary vector to have an optional dictionary size. This size parameter is set only for duplicate-eliminated dictionaries that are read from disk (either from dictionary-compressed columns in DuckDB's own storage, or from dictionary-compressed columns in Parquet). The dictionary size enables various optimizations on these vectors, as we can operate on the underlying dictionary instead of on the individual values, i.e. only on the unique values in the vector.
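To make the layout concrete, here is a minimal conceptual sketch of a dictionary vector with the new optional size field. The names are hypothetical and do not match DuckDB's actual `Vector`/`DictionaryVector` classes:

```cpp
// Conceptual sketch only: hypothetical names, not DuckDB's actual Vector classes.
#include <cstdint>
#include <optional>
#include <vector>

struct SketchDictionaryVector {
	std::vector<int64_t> dictionary;   // child vector holding the dictionary values
	std::vector<uint32_t> sel;         // per-row selection index into the dictionary
	// set only for duplicate-eliminated dictionaries read from disk
	// (DuckDB's own storage or Parquet); unset otherwise
	std::optional<uint64_t> dictionary_size;
};
```

When the size is set, an operator can work on the `dictionary_size` unique entries instead of on every individual row.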

This PR adjusts the aggregate hash table to utilize both dictionary and constant vectors when inserting groups. In particular, when figuring out which group a value belongs to, we only need to figure this out once per unique value. For constant vectors, that means we only need to probe the hash table once. For dictionary vectors, we only need to probe the unique values within the dictionary.

This behavior is added in the `TryAddCompressedGroups` method. We first figure out which unique values in the dictionary are actually referenced by the vector, and then call `FindOrCreateGroups` only for those unique values.
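A minimal sketch of the idea, using hypothetical helper names (the real method is `TryAddCompressedGroups` in the aggregate hash table): probe the hash table once per referenced dictionary entry, then scatter the resulting group ids back to the rows.

```cpp
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <vector>

// Rows reference dictionary entries through a selection vector `sel`; we only
// want to probe the hash table once per *referenced* dictionary entry.
std::vector<uint64_t> GroupRowsThroughDictionary(
    const std::vector<int64_t> &dictionary, const std::vector<uint32_t> &sel,
    const std::function<uint64_t(int64_t)> &find_or_create_group) {
	// 1. collect the unique dictionary indices that are actually referenced
	std::unordered_map<uint32_t, uint64_t> dict_index_to_group;
	for (auto dict_idx : sel) {
		dict_index_to_group.emplace(dict_idx, 0);
	}
	// 2. probe the hash table once per unique value
	//    (FindOrCreateGroups in the actual PR)
	for (auto &entry : dict_index_to_group) {
		entry.second = find_or_create_group(dictionary[entry.first]);
	}
	// 3. scatter the group ids back to the individual rows
	std::vector<uint64_t> row_groups(sel.size());
	for (size_t row = 0; row < sel.size(); row++) {
		row_groups[row] = dict_index_to_group[sel[row]];
	}
	return row_groups;
}
```

For a constant vector the same idea degenerates to a single probe whose result is re-used for every row.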

UnaryExecutor

We also use dictionary vectors to speed up execution in the `UnaryExecutor` - however, we only do this if the function cannot throw errors (`FunctionErrors`). The reason is that at this layer we compute the function for the entire dictionary instead of only for the elements that are explicitly referenced - if the function could throw, we might execute it on a filtered-out row and introduce spurious errors. Currently we only define `FunctionErrors::CANNOT_ERROR` for compressed materialization - so this is still limited (to be expanded in the future).
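A rough sketch of that fast path, assuming a simplified vector representation (hypothetical signature, and any enum value other than `CANNOT_ERROR` is made up here; the real logic lives in DuckDB's `UnaryExecutor`):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Only CANNOT_ERROR is named in the PR; the other value is a placeholder.
enum class FunctionErrors { CAN_THROW_RUNTIME_ERROR, CANNOT_ERROR };

template <class INPUT, class RESULT, class FUNC>
std::vector<RESULT> ExecuteUnary(const std::vector<INPUT> &dictionary,
                                 const std::vector<uint32_t> &sel, FUNC fun,
                                 FunctionErrors errors) {
	std::vector<RESULT> result(sel.size());
	if (errors == FunctionErrors::CANNOT_ERROR) {
		// fast path: evaluate the function once per dictionary entry,
		// including entries the current vector never references ...
		std::vector<RESULT> dict_result(dictionary.size());
		for (size_t i = 0; i < dictionary.size(); i++) {
			dict_result[i] = fun(dictionary[i]);
		}
		// ... and re-use the selection vector to fill in the per-row result
		for (size_t row = 0; row < sel.size(); row++) {
			result[row] = dict_result[sel[row]];
		}
		return result;
	}
	// slow path: evaluate the function once per row
	for (size_t row = 0; row < sel.size(); row++) {
		result[row] = fun(dictionary[sel[row]]);
	}
	return result;
}
```

Because the fast path touches entries that may have been filtered out, it is gated on the function being unable to throw.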

Benchmarks

Below are some performance numbers for synthetic data (100M rows):

Aggregation over string dictionary

```sql
CREATE TABLE strings AS SELECT CONCAT('thisisastringwithrepetitions', i%100) AS grp, i FROM range(100_000_000) tbl(i);
SELECT grp, SUM(i) FROM strings GROUP BY ALL ORDER BY ALL;
```

| v1.1.3 | new |
|-|-|
| 0.12s | 0.04s |

Aggregation over constant dates

```sql
CREATE TABLE dates AS SELECT DATE '1900-01-01' + INTERVAL (i // 50000) MONTH grp, i FROM range(100_000_000) tbl(i);
SELECT grp, SUM(i) FROM dates GROUP BY ALL ORDER BY ALL;
```

| v1.1.3 | new |
|-|-|
| 0.07s | 0.04s |

Limitations

  • Currently this only works for grouping of individual dictionary vectors (i.e. grouping by a single column). Grouping by multiple dictionary vectors is possible but more challenging, especially when it comes to caching of the data.
  • We limit the dictionary size to 20,000 entries - for larger dictionaries we don't consider the optimization.

@duckdb-draftbot duckdb-draftbot marked this pull request as draft December 6, 2024 12:08
@Mytherin Mytherin marked this pull request as ready for review December 6, 2024 12:08
@Mytherin Mytherin merged commit 9da182a into duckdb:main Dec 6, 2024
44 checks passed
@Mytherin Mytherin deleted the aggregatedictionaryvectors branch December 8, 2024 06:50
Mytherin added a commit that referenced this pull request Dec 9, 2024
…and use this in the aggregate HT to cache look-ups (#15196)

Follow-up from #15152

This PR enables the caching of results computed for dictionaries across
vectors by (optionally) assigning them a unique identifier. The
identifier is a string that must be unique enough to distinguish whether
a vector carries the same dictionary as the previous vector or a
different one. It does not have to be globally unique - it only has to
be unique within the same pipeline for the same thread.

* For our own storage, the dictionary id is set to the `ColumnSegment`
pointer address. Since we never de-allocate `ColumnSegments` while
scanning, this works.
* For Parquet dictionaries, the dictionary id is set to
`[parquet_filename]_[column_name]_[byte_offset]`
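
A minimal sketch of how such an id could drive the caching decision (hypothetical structure; the actual caching sits inside DuckDB's aggregate hash table):

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct DictionaryLookupCache {
	// e.g. the ColumnSegment pointer address for DuckDB storage, or
	// "[parquet_filename]_[column_name]_[byte_offset]" for Parquet
	std::string dictionary_id;
	// cached group id per dictionary entry, computed when the id was first seen
	std::vector<uint64_t> dict_groups;
};

// Returns true if the cached look-ups can be re-used for the incoming vector.
bool CanReuseCache(const DictionaryLookupCache &cache, const std::string &incoming_id) {
	// an empty id means the dictionary is not a duplicate-eliminated
	// dictionary from storage, so nothing can be cached for it
	return !incoming_id.empty() && cache.dictionary_id == incoming_id;
}
```

Since the id only needs to be stable within one pipeline on one thread, a simple comparison against the previously seen id suffices.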
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request Dec 27, 2024
Add dictionary size, and use dictionary/constant vectors in the aggregate hash table to speed up finding groups  (duckdb/duckdb#15152)
Fix Regression workflow running against itself (duckdb/duckdb#15165)
Fix wrongly used github.ref to sha (duckdb/duckdb#15167)
Add cross-version testing CI (duckdb/duckdb#15161)
@lnkuiper lnkuiper mentioned this pull request Jul 3, 2025
Mytherin added a commit that referenced this pull request Jul 13, 2025
PR #15152 added support for
_dictionary aggregation_: when figuring out which group a value belongs
to, we only figure this out once per unique value, if the vector is a
dictionary vector coming from storage.

This PR implements _dictionary functions_: when evaluating a scalar
function on a dictionary vector, we evaluate it once per unique value.
This has two benefits: 1) less time is spent evaluating the function, as
it is evaluated on fewer values, and 2) the resulting vector is again a
dictionary vector instead of a flat one, allowing for subsequent
dictionary optimizations.

For dictionaries that are larger than `STANDARD_VECTOR_SIZE`, we loop
over the dictionary, executing the function on at most
`STANDARD_VECTOR_SIZE` values at a time, similar to the approach in the
list functions. The most efficient approach would be to execute the
function on the entire dictionary at once, but some functions assume
that inputs are never larger than `STANDARD_VECTOR_SIZE`, so this is not
feasible. To avoid copying data between intermediate vectors and the
resulting dictionary, I've added the restriction that both the input and
output must be non-nested.
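
A sketch of that chunked loop, assuming a simplified dictionary representation (hypothetical helper; the actual code operates on DuckDB `Vector`s inside the expression executor):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr size_t STANDARD_VECTOR_SIZE = 2048; // DuckDB's default vector size

template <class INPUT, class RESULT, class FUNC>
std::vector<RESULT> ExecuteOnDictionary(const std::vector<INPUT> &dictionary, FUNC fun) {
	std::vector<RESULT> dict_result(dictionary.size());
	for (size_t offset = 0; offset < dictionary.size(); offset += STANDARD_VECTOR_SIZE) {
		const size_t count = std::min(STANDARD_VECTOR_SIZE, dictionary.size() - offset);
		// execute the function on at most STANDARD_VECTOR_SIZE values at a time
		for (size_t i = 0; i < count; i++) {
			dict_result[offset + i] = fun(dictionary[offset + i]);
		}
	}
	// the result remains dictionary-encoded: the original selection vector can
	// point into dict_result, so no per-row copy is needed
	return dict_result;
}
```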

After implementing this, I noticed that the optimization was almost
never triggered for columns compressed with `DICT_FSST`. This was due to
a restriction that the start of vectors must be aligned with our
bitpacking group size, which only has a 1/32 chance of happening. We
don't have this restriction for regular dictionary compression, so I was
able to remove the restriction using the same approach used there.

Small benchmark (same table as PR
#15152):
```sql
-- generate data
CREATE TABLE strings AS
SELECT CONCAT('thisisastringwithrepetitions', i%100) AS grp, i
FROM range(100_000_000) tbl(i);
-- query
SELECT SUM(b)
FROM (
    SELECT grp LIKE '%42%' b
    FROM strings
);
```

For this query, `LIKE` is converted to the `contains` function, which is
executed on the dictionary.

| main [s] | this PR [s] |
|-|-|
| 0.082 | __0.024__ |

The previous query is sped up by more than 3x. The gains are even higher
for more expensive functions, especially when the output of the function
becomes a group for dictionary aggregation:
```sql
SELECT grp, SUM(i)
FROM (
    SELECT regexp_extract(grp, '[0-9]+') AS grp, i
    FROM strings
)
GROUP BY grp;
```

| main [s] | this PR [s] |
|-|-|
| 2.154 | __0.075__ |

This query is sped up by almost 30x.

This is currently only implemented for `BoundFunctionExpression`, but it
can be implemented for `BOUND_CASE`, `BOUND_CAST`, etc. in future PRs.