Improve the speed of table sample systems #12631
Conversation
Thanks for the PR! Instead of having a different sampling method for this, I think a cleaner design would be to implement pushdown of the system sampling operator into the scans instead. This PR also doesn't have any tests yet - what happens if this is used on e.g. a Parquet scan?
Hi Mark, we are happy to make changes. Could you be more specific about what the design idea means?
Is it about implementing a pushdown method for the sampling operator, similar to the filter pushdown?
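(For context, the rewrite being discussed is analogous to filter pushdown: rather than a standalone sampling operator consuming every chunk its child scan produces, the sample options are handed to the scan so it can skip whole chunks. The sketch below is purely illustrative and uses made-up stand-in types, not DuckDB's actual optimizer classes, to show the general shape of such a rule.)

```cpp
// Illustrative sketch only: simplified stand-ins for plan operators,
// not DuckDB's real LogicalSample/LogicalGet classes.
#include <memory>
#include <optional>
#include <vector>

struct SampleOptions {
	double percentage; // e.g. 1.0 for a 1% system sample
};

struct PlanNode {
	virtual ~PlanNode() = default;
	std::vector<std::unique_ptr<PlanNode>> children;
};

struct ScanNode : PlanNode {
	// When set, the scan itself decides which chunks/row groups to read.
	std::optional<SampleOptions> pushed_down_sample;
};

struct SampleNode : PlanNode {
	SampleOptions options;
};

// Push a system sample that sits directly on top of a scan into the scan,
// then remove the now-redundant sample operator from the plan.
std::unique_ptr<PlanNode> PushdownSample(std::unique_ptr<PlanNode> op) {
	if (auto *sample = dynamic_cast<SampleNode *>(op.get())) {
		if (sample->children.size() == 1) {
			if (auto *scan = dynamic_cast<ScanNode *>(sample->children[0].get())) {
				scan->pushed_down_sample = sample->options;
				return std::move(sample->children[0]);
			}
		}
	}
	return op;
}
```

Once the scan owns the sample options, it can decide per chunk (or per row group, including for formats such as Parquet) whether to read the data at all, instead of materializing it and discarding most of it afterwards.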
sync upstream
sync upstream
@Mytherin Could you take a look at our current implementation? We revised the logic as you suggested. We have a failing check on the wrapping test, which I think is not related to our changes.
Thanks for the PR!
Just one small comment from my side
@@ -184,6 +185,9 @@ InsertionOrderPreservingMap<string> PhysicalTableScan::ParamsToString() const {
		}
		result["Filters"] = filters_info;
	}
	if (extra_info.sample_options) {
nit: Can you add this to LogicalGet::ParamsToString() as well?
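(A minimal sketch of what that nit might translate to, mirroring the PhysicalTableScan change quoted above; whether LogicalGet exposes the same extra_info/sample_options members, and how the sample is rendered, are assumptions here rather than facts about the PR.)

```cpp
// Hypothetical: analogous block inside LogicalGet::ParamsToString(),
// so EXPLAIN shows the pushed-down sample on the logical plan as well.
// The member names below are assumed, not verified against the PR.
if (extra_info.sample_options) {
	result["Sample Method"] = "System: " + extra_info.sample_options->sample_size.ToString() + "%";
}
```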
@Mytherin @TomBurdge Could you please take a look at our implementation?
@Mytherin We are happy to make further adjustments as needed. Could you please share your thoughts on how the current implementation looks?
Thanks for the fixes! This looks good from my side (aside from the minor merge conflict). @Tmonster perhaps you can have a final look?
Looks good from my side
Thanks!
This PR implements a separate sampler that skips chunks during the full scan for table sample queries.
We observed that the original table sample system method is sometimes very slow, because DuckDB always performs a full scan of the whole table during the first sequential scan, so we implemented a new "chunk" method that skips chunks entirely.
Our experiments show that with this sampler we get up to a 100x speed-up (TPC-H SF1) for table sample aggregation queries.
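(Not part of the PR, but a minimal, self-contained example of the kind of table sample aggregation query this targets, using the DuckDB C++ client API; the table contents, sample percentage, and exact TABLESAMPLE spelling are illustrative assumptions.)

```cpp
#include "duckdb.hpp"

int main() {
	duckdb::DuckDB db(nullptr);
	duckdb::Connection con(db);

	// A throwaway table standing in for e.g. TPC-H lineitem.
	con.Query("CREATE TABLE lineitem AS "
	          "SELECT range AS l_orderkey, random() AS l_quantity FROM range(1000000)");

	// With system sampling pushed into the scan, whole chunks/row groups that fall
	// outside the sample can be skipped instead of being read and then discarded.
	auto result = con.Query("SELECT avg(l_quantity) FROM lineitem TABLESAMPLE system(1%)");
	result->Print();
	return 0;
}
```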