Skip to content

Conversation

Tmonster
Copy link
Contributor

Adds a new benchmark directory to our regression scripts to test ingestion performance. Some questions I still have

  • Should we also test ingestion TPCH/TPCDS from csv?
  • Should we test ingesting larger scale factors during nightlies?

These things can also be added in a future PR.

@github-actions github-actions bot marked this pull request as draft January 26, 2024 15:25
@Tmonster Tmonster marked this pull request as ready for review January 29, 2024 09:02
@github-actions github-actions bot marked this pull request as draft January 29, 2024 09:05
@Tmonster Tmonster marked this pull request as ready for review January 30, 2024 13:33
@github-actions github-actions bot marked this pull request as draft January 30, 2024 13:46
@Tmonster Tmonster marked this pull request as ready for review February 1, 2024 11:00
Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! I'm not so certain if there's much of a benefit to benchmarking all of the tables. Many of these tables are tiny (e.g. nation has 25 rows?). Furthermore having lots of benchmarks means the chance of false negatives is higher which generates noise in the CI.

Can we perhaps limit this to the largest tables, and try to go for more of a variety of benchmarks? e.g. reading from Parquet, reading from CSV, copying from a separate table, running in-memory, running with persistent storage, etc.

@github-actions github-actions bot marked this pull request as draft February 2, 2024 10:39
@Tmonster Tmonster marked this pull request as ready for review February 6, 2024 09:30
@github-actions github-actions bot marked this pull request as draft February 6, 2024 13:34
@Tmonster
Copy link
Contributor Author

Tmonster commented Feb 7, 2024

The now benchmarks the following matrix

Format: CSV, PARQUET, NATIVE
Instance type: In memory, persistent

Getting In-memory native storage to work is not super straightforward in the current benchmark framework. I saw two options,

  1. Open an in memory connection and generate the data for each benchmark
    Or
  2. add a command to the interpreted benchmark script so that allows you to do the following
    1. require the presence of a db file,
    2. If the db file doesn't exist, use the provided load script to create the data and export it (if necessary)
    3. destroy the old connection and create a new in memory connection
    Then with another command
    4. Use another load command to import the data from the required db into the new in memory connection

(2) seems confusing and potentially error prone. I implemented a cache {db_file} no_reconnect function to do steps 1-3. This works for benchmarking ingestion of csv and parquet files since you don't need to data natively.

@Tmonster Tmonster marked this pull request as ready for review February 7, 2024 15:58
@Tmonster
Copy link
Contributor Author

Tmonster commented Feb 7, 2024

This fails regression now because the current duckdb doesn't have the new cache {db_file} no_connect command

@Mytherin
Copy link
Collaborator

Thanks for the updates! Looks good - could we perhaps just remove the ingestion benchmarks that aren't run (i.e. leaving only lineitem, store_sales, inventory and orders)? I think adding the small tables - even if they are not run - adds a lot of noise to when you want to run these benchmarks and I can't really see a scenario in which we want to run them.

@Tmonster
Copy link
Contributor Author

Thanks for the updates! Looks good - could we perhaps just remove the ingestion benchmarks that aren't run (i.e. leaving only lineitem, store_sales, inventory and orders)? I think adding the small tables - even if they are not run - adds a lot of noise to when you want to run these benchmarks and I can't really see a scenario in which we want to run them.

sure 👍

@github-actions github-actions bot marked this pull request as draft February 12, 2024 07:40
@Tmonster Tmonster marked this pull request as ready for review February 12, 2024 11:35
@Mytherin Mytherin merged commit a72a6cc into duckdb:main Feb 15, 2024
@Mytherin
Copy link
Collaborator

Thanks!

krlmlr added a commit to duckdb/duckdb-r that referenced this pull request Mar 15, 2024
Merge pull request duckdb/duckdb#10658 from hannes/csvpathlength
Merge pull request duckdb/duckdb#10756 from Mytherin/preserveinsertionordermemory
Merge pull request duckdb/duckdb#10746 from samansmink/enable-azure-autoload
Merge pull request duckdb/duckdb#10747 from maiadegraaf/list_reverse_bug
Merge pull request duckdb/duckdb#10748 from taniabogatsch/capi-tests
Merge pull request duckdb/duckdb#10739 from peterboncz/pb/immmedate_mode_only_in_non_autocommit
Merge pull request duckdb/duckdb#10688 from Tmonster/union_exclude
Merge pull request duckdb/duckdb#10710 from samansmink/comment-on-column
Merge pull request duckdb/duckdb#10725 from hawkfish/fuzzer-preceding-frame
Merge pull request duckdb/duckdb#10723 from hawkfish/fuzzer-null-timestamp
Merge pull request duckdb/duckdb#10436 from taniabogatsch/map-fixes
Merge pull request duckdb/duckdb#10587 from kryonix/main
Merge pull request duckdb/duckdb#10738 from TinyTinni/fix-assert-in-iscntrl
Merge pull request duckdb/duckdb#10708 from carlopi/ci_fixes
Merge pull request duckdb/duckdb#10726 from hawkfish/fuzzer-to-weeks
Merge pull request duckdb/duckdb#10727 from hawkfish/fuzzer-window-bind
Merge pull request duckdb/duckdb#10733 from TinyTinni/remove-static-string
Merge pull request duckdb/duckdb#10715 from Tishj/python_tpch_regression_rework
Merge pull request duckdb/duckdb#10728 from hawkfish/fuzzer-argminmax-decimal
Merge pull request duckdb/duckdb#10717 from carlopi/fix_extension_deploy
Merge pull request duckdb/duckdb#10694 from Mytherin/castquerylocation
Merge pull request duckdb/duckdb#10448 from peteraisher/feature/use-assertThrows-for-jdbc-tests
Merge pull request duckdb/duckdb#10691 from Mytherin/issue10685
Merge pull request duckdb/duckdb#10684 from Mytherin/distincton
Merge pull request duckdb/duckdb#9539 from Tishj/timestamp_unit_to_tz
Merge pull request duckdb/duckdb#10341 from Tmonster/tpch_ingestion_benchmark
Merge pull request duckdb/duckdb#10689 from Mytherin/juliaversion
Merge pull request duckdb/duckdb#10669 from Mytherin/skippedtests
Merge pull request duckdb/duckdb#10679 from Tishj/reenable_window_rows_overflow
Merge pull request duckdb/duckdb#10672 from carlopi/wasm_extensions_ci
Merge pull request duckdb/duckdb#10660 from szarnyasg/update-storage-info-for-v0100
Merge pull request duckdb/duckdb#10643 from bleskes/duck_transaction_o11y
Merge pull request duckdb/duckdb#10654 from carlopi/fix_10548
Merge pull request duckdb/duckdb#10650 from hannes/noprintf
Merge pull request duckdb/duckdb#10649 from Mytherin/explicitenumnumbers
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants