Skip to content

Conversation

hawkfish
Copy link
Contributor

Rewrite of moving quantile and mad to use better/faster data structures.

  • Add new window_init aggregate API to allow custom window functions to create global data structures
  • Compute and pass down statistics for frame computations to support global vs local performance optimisations
  • Add third party skip list code for thread-local implementations
  • Add (rewritten) third party merge sort tree code for global implementations
  • Extend MST SelectNth method to handle new EXCLUDE functionality
  • Fix problem with ValidityMask::Resize where it was always allocating memory instead of deferring if all rows were valid

fixes: duckdblabs/duckdb-internal#330

Richard Wesley and others added 30 commits September 21, 2023 18:23
Add Python implementation of skip lists to libraries.
Convert basic scalar quantile window to use skip lists.
Performance on 1e9 rows, single partition, 16 cores
went from 11:11 to 00:36.
Convert scalar and list windowed quantile to use skip lists.
Performance on 1e9 rows, single partition, 16 cores
went from 11:11 to 00:36.
Create MST template for use in quantile and other aggregates.
* Define QuantileSortTree for quantile acceleration.
* Add aggregate_wininit_t for constructing global windowing state.
* Update aggregate_window_t to take optional global windowing state.
* Add quantile WindowInit function (not used yet).
Call window initialisation if it is defined.
Switch quantile computations to use merge sort trees.
MAD not converted yet.
Refactor, simplify and deploy to Median Absolute Deviation.
Plumb through statistics and extract min/max for ROWS.
Use frame expression statistics to choose between
Skip Lists and Merge Sort Trees
Test statistics and handle no propagation situation.
Fix fanout misplacement.
Finish implementing and testing merge sort trees
by supporting EXCLUDE/multiple frames.
Fix CI issues (formatting, includes)
Remove non-portable #pragma mark lines.
Standardise SubFrame typing.
Push subframes into the Merge Sort Tree.
@hawkfish
Copy link
Contributor Author

I didn't have an in-depth look at src/core_functions/aggregate/holistic/quantile.cpp and src/include/duckdb/execution/merge_sort_tree.hpp yet. I can do that at a later point this week. I assume that I do not have to look at the skip list code?

The merge sort tree code is Adrian's but I modified it so we could compile it and we collaborated on adding the EXCLUDE changes. I also removed all his messy debugging code and added some whitespace.

The skip list code is a direct import of the code used by Python (it is a separate project). I just added a very simple one element memory pool (pooling is on their TODO list.) and you can look at the diff there (just routing new and delete through an internal class.)

Integrate Tania' feedback (thanks!)
@github-actions github-actions bot marked this pull request as draft October 30, 2023 20:34
Copy link
Contributor

@carlopi carlopi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool results, I only took a look at the skiplist library, minor comments

@hawkfish hawkfish marked this pull request as ready for review November 1, 2023 19:09
@hawkfish hawkfish requested a review from hannes November 1, 2023 20:36
Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Some minor comments. I also launched a NightlyTests to check the code coverage on this feature. Perhaps some more tests could be added after that to ensure the new code is fully tested? I'm not sure how many tests we have that correctly test all of the edge cases here.

@hawkfish
Copy link
Contributor Author

hawkfish commented Nov 3, 2023

Thanks for the PR! Some minor comments. I also launched a NightlyTests to check the code coverage on this feature. Perhaps some more tests could be added after that to ensure the new code is fully tested? I'm not sure how many tests we have that correctly test all of the edge cases here.

Yeah there is coverage in the benchmark suite but I could certainly move those queries into a test_coverage file now that they are faster...

Richard Wesley added 2 commits November 3, 2023 12:00
Clean them up in response to PR feedback.
Note that ListWindow is being naughty and modifying the
partition data, which is not thread-safe.
Restore the statistics API to use signed values.
Add coverage tests for quantile variants.
@github-actions github-actions bot marked this pull request as draft November 3, 2023 21:34
@hawkfish hawkfish marked this pull request as ready for review November 3, 2023 22:38
@github-actions github-actions bot marked this pull request as draft November 7, 2023 15:55
@hawkfish hawkfish marked this pull request as ready for review November 7, 2023 15:56
@Mytherin Mytherin changed the base branch from feature to main November 8, 2023 15:19
@Mytherin Mytherin changed the base branch from main to feature November 8, 2023 15:19
@Mytherin Mytherin merged commit aad21f3 into duckdb:feature Nov 8, 2023
@Mytherin
Copy link
Collaborator

Mytherin commented Nov 8, 2023

Thanks! LGTM

@hawkfish hawkfish deleted the merge-sort-trees branch November 8, 2023 16:55
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request Dec 11, 2023
Merge pull request duckdb/duckdb#9392 from lnkuiper/parquet_encryption
Merge pull request duckdb/duckdb#9461 from hawkfish/merge-sort-trees
Merge pull request duckdb/duckdb#8788 from alnkesq/capi_create_enum_type
Merge pull request duckdb/duckdb#9513 from Tmonster/5614-database-invalidated
Merge pull request duckdb/duckdb#9622 from Mytherin/typescoping
Merge pull request duckdb/duckdb#9615 from hawkfish/strptime-infinity
Merge pull request duckdb/duckdb#9627 from Mytherin/attachifnotexists
Merge pull request duckdb/duckdb#9648 from samansmink/add-keep-alive-toggle
Merge pull request duckdb/duckdb#9638 from taniabogatsch/bench-refactor
Merge pull request duckdb/duckdb#9651 from Mytherin/getenv
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants