Improve the speed of table sample systems #12631
Conversation
Thanks for the PR! Instead of having a different sampling method for this, I think a cleaner design would be to implement pushdown of the system sampling operator into the scans instead. This PR also doesn't have any tests yet - what happens if this is used on e.g. a Parquet scan?
Hi Mark, we are happy to make changes. Could you be more specific about what the design idea means?
Is it about implementing a pushdown method for the sampling operator, similar to the filter pushdown?
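(For context, the rewrite being discussed is analogous to filter pushdown: rather than a standalone sampling operator consuming every chunk its child scan produces, the sample options are handed to the scan so it can skip whole chunks. The sketch below is purely illustrative and uses made-up stand-in types, not DuckDB's actual optimizer classes, to show the general shape of such a rule.)

```cpp
// Illustrative sketch only: simplified stand-ins for plan operators,
// not DuckDB's real LogicalSample/LogicalGet classes.
#include <memory>
#include <optional>
#include <vector>

struct SampleOptions {
	double percentage; // e.g. 1.0 for a 1% system sample
};

struct PlanNode {
	virtual ~PlanNode() = default;
	std::vector<std::unique_ptr<PlanNode>> children;
};

struct ScanNode : PlanNode {
	// When set, the scan itself decides which chunks/row groups to read.
	std::optional<SampleOptions> pushed_down_sample;
};

struct SampleNode : PlanNode {
	SampleOptions options;
};

// Push a system sample that sits directly on top of a scan into the scan,
// then remove the now-redundant sample operator from the plan.
std::unique_ptr<PlanNode> PushdownSample(std::unique_ptr<PlanNode> op) {
	if (auto *sample = dynamic_cast<SampleNode *>(op.get())) {
		if (sample->children.size() == 1) {
			if (auto *scan = dynamic_cast<ScanNode *>(sample->children[0].get())) {
				scan->pushed_down_sample = sample->options;
				return std::move(sample->children[0]);
			}
		}
	}
	return op;
}
```

Once the scan owns the sample options, it can decide per chunk (or per row group, including for formats such as Parquet) whether to read the data at all, instead of materializing it and discarding most of it afterwards.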
sync upstream
sync upstream
@Mytherin Could you take a look at our current implementation? We revised the logic as you suggested. We have a failing check on the wrapping test, which I think is not related to our changes.
Thanks for the PR!
Just one small comment from my side
@@ -184,6 +185,9 @@ InsertionOrderPreservingMap<string> PhysicalTableScan::ParamsToString() const {
		}
		result["Filters"] = filters_info;
	}
	if (extra_info.sample_options) {
nit: Can you add this to LogicalGet::ParamsToString() as well?
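(A minimal sketch of what that nit might translate to, mirroring the PhysicalTableScan change quoted above; whether LogicalGet exposes the same extra_info/sample_options members, and how the sample is rendered, are assumptions here rather than facts about the PR.)

```cpp
// Hypothetical: analogous block inside LogicalGet::ParamsToString(),
// so EXPLAIN shows the pushed-down sample on the logical plan as well.
// The member names below are assumed, not verified against the PR.
if (extra_info.sample_options) {
	result["Sample Method"] = "System: " + extra_info.sample_options->sample_size.ToString() + "%";
}
```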
@Mytherin @TomBurdge Could you please take a look at our implementation?
@Mytherin We are happy to make further adjustments as needed. Could you please share your thoughts on how the current implementation looks?
Thanks for the fixes! This looks good from my side (aside from the minor merge conflict). @Tmonster perhaps you can have a final look?
Looks good from my side
Thanks!
This PR implements a separate sampler that skips chunks during the full scan for table sample queries.
We observed that the original table sample system method is sometimes very slow, because DuckDB always performs a full scan of the whole table during the first sequential scan, so we implemented a new "chunk" method that skips chunks entirely.
Our experiments show that with this sampler we get up to a 100x speed-up (TPC-H SF1) for table sample aggregation queries.
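(Not part of the PR, but a minimal, self-contained example of the kind of table sample aggregation query this targets, using the DuckDB C++ client API; the table contents, sample percentage, and exact TABLESAMPLE spelling are illustrative assumptions.)

```cpp
#include "duckdb.hpp"

int main() {
	duckdb::DuckDB db(nullptr);
	duckdb::Connection con(db);

	// A throwaway table standing in for e.g. TPC-H lineitem.
	con.Query("CREATE TABLE lineitem AS "
	          "SELECT range AS l_orderkey, random() AS l_quantity FROM range(1000000)");

	// With system sampling pushed into the scan, whole chunks/row groups that fall
	// outside the sample can be skipped instead of being read and then discarded.
	auto result = con.Query("SELECT avg(l_quantity) FROM lineitem TABLESAMPLE system(1%)");
	result->Print();
	return 0;
}
```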