Skip to content

Conversation

Mytherin
Copy link
Collaborator

The current UNNEST implementation handles one list a time. While that works fine for large lists, we often unnest small lists. One scenario where we unnest small lists is in UNPIVOT, where the list size is equivalent to the amount of columns we are unpivoting. For example, the query:

UNPIVOT (SELECT l_orderkey, l_returnflag, l_shipinstruct FROM lineitem) ON l_returnflag, l_shipinstruct;

Is translated into the below query:

SELECT l_orderkey, UNNEST(['l_returnflag', 'l_shipinstruct']) AS name, UNNEST([l_returnflag, l_shipinstruct]) AS value
FROM lineitem;

This frequently leads to unnesting of small lists (in the above example lists of size 2).

The current UNNEST implementation handles this inefficiently because it will, in this scenario, emit vectors of at most size 2. This PR reworks the UNNEST to handle multiple lists at once. In addition, it also gets rid of the copying that UNNEST would previously do by instead rewriting UNNEST to only work using selection vectors.

Effectively we traverse the input lists and construct a selection vector with the rows we need from (1) the input payload, and (2) the list entries. We then slice the input/list children with these selection vectors.

When unnesting lists of different sizes we must add NULL entries. We do this as a separate pass - we slice the first row of the given list vector, and then call Flatten + assign NULL to the entries for which this is required. This is slower - but still does not require any special code. The flattening could be avoided if the list child has NULL entries itself but that is not guaranteed, so for now this optimization is skipped. I also presume that unnesting multiple lists of different lengths is probably pretty rare (but have no data to back that up).

Performance

Below is the performance of the given UNPIVOT query running on TPC-H SF10:

UNPIVOT (SELECT l_orderkey, l_returnflag, l_shipinstruct FROM lineitem) ON l_returnflag, l_shipinstruct;
v1.2.0 New
8.9s 0.37s

@Mytherin Mytherin marked this pull request as draft February 12, 2025 17:14
@Mytherin Mytherin marked this pull request as ready for review February 12, 2025 17:14
@Mytherin Mytherin merged commit 1ee8a17 into duckdb:main Feb 13, 2025
48 of 49 checks passed
@Mytherin
Copy link
Collaborator Author

Performance numbers of RealNest:


benchmark/realnest/hep/q01.benchmark
Old timing: 0.001858
New timing: 0.001481

benchmark/realnest/hep/q02.benchmark
Old timing: 0.008656
New timing: 0.010143

benchmark/realnest/hep/q03.benchmark
Old timing: 0.010393
New timing: 0.011433

benchmark/realnest/hep/q04.benchmark
Old timing: 0.001963
New timing: 0.00163

benchmark/realnest/hep/q06_1.benchmark
Old timing: 11.362823
New timing: 10.356383

benchmark/realnest/hep/q06_2.benchmark
Old timing: 10.34189
New timing: 10.565738

benchmark/realnest/hep/q07.benchmark
Old timing: 0.061034
New timing: 0.060379

benchmark/realnest/micro/01_aggregate-first-level-struct-members.benchmark
Old timing: 0.011481
New timing: 0.012341

benchmark/realnest/micro/02_list_sort.benchmark
Old timing: 0.077561
New timing: 0.087428

benchmark/realnest/micro/03_create_table_from_unnested_structs.benchmark
Old timing: 5.434604
New timing: 0.794048

benchmark/realnest/micro/04_list_transform_and_list_aggregate.benchmark
Old timing: 0.455226
New timing: 0.492806

benchmark/realnest/micro/05_list_filter.benchmark
Old timing: 0.351521
New timing: 0.379369

benchmark/realnest/micro/06_list_filter_on_unnested_structure.benchmark
Old timing: 2.579773
New timing: 0.081754

benchmark/realnest/micro/07_list_unique_on_transformed_and_aggregated_list.benchmark
Old timing: 0.003845
New timing: 0.002441

benchmark/realnest/micro/08_count_map_keys.benchmark
Old timing: 0.327125
New timing: 0.051345

benchmark/realnest/micro/09_array_agg.benchmark
Old timing: 0.119122
New timing: 0.125681

benchmark/realnest/micro/11_list_sort_reduce_transform.benchmark
Old timing: 0.031223
New timing: 0.032346

benchmark/realnest/micro/12_map_list_values.benchmark
Old timing: 2.115604
New timing: 2.05503

benchmark/realnest/micro/13_multi_join_nested_data_with_filtering.benchmark
Old timing: 0.444702
New timing: 0.444692

benchmark/realnest/micro/14_list_slice.benchmark
Old timing: 2.401286
New timing: 2.386497

benchmark/realnest/micro/15_list_sort.benchmark
Old timing: 10.534116
New timing: 10.405631

benchmark/realnest/micro/16_most_common_list_aggregates.benchmark
Old timing: 2.11207
New timing: 2.09038


Old timing geometric mean: 0.20996531115304076
New timing geometric mean: 0.14991209428739685, roughly 28% faster

Antonov548 added a commit to Antonov548/duckdb-r that referenced this pull request Feb 27, 2025
Improve performance of `UNNEST/UNPIVOT` by using selection vectors to unnest multiple lists at once (duckdb/duckdb#16210)
[fix] Use bigobj when building with MSVC (duckdb/duckdb#16200)
krlmlr pushed a commit to duckdb/duckdb-r that referenced this pull request Mar 5, 2025
Improve performance of `UNNEST/UNPIVOT` by using selection vectors to unnest multiple lists at once (duckdb/duckdb#16210)
[fix] Use bigobj when building with MSVC (duckdb/duckdb#16200)
@Mytherin Mytherin deleted the unnestperformance branch April 2, 2025 09:24
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 15, 2025
Improve performance of `UNNEST/UNPIVOT` by using selection vectors to unnest multiple lists at once (duckdb/duckdb#16210)
[fix] Use bigobj when building with MSVC (duckdb/duckdb#16200)
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 15, 2025
Improve performance of `UNNEST/UNPIVOT` by using selection vectors to unnest multiple lists at once (duckdb/duckdb#16210)
[fix] Use bigobj when building with MSVC (duckdb/duckdb#16200)
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 17, 2025
Improve performance of `UNNEST/UNPIVOT` by using selection vectors to unnest multiple lists at once (duckdb/duckdb#16210)
[fix] Use bigobj when building with MSVC (duckdb/duckdb#16200)
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 18, 2025
Improve performance of `UNNEST/UNPIVOT` by using selection vectors to unnest multiple lists at once (duckdb/duckdb#16210)
[fix] Use bigobj when building with MSVC (duckdb/duckdb#16200)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant