Add Taxi Dataset Benchmark #14197

pdet · 2024-10-02T14:26:26Z

This PR adds the benchmarks described in the Billion NYC Taxi Rides Redshift and Taxi Benchmarks blogs.

A description of how to run the benchmark and how the files are generated is available in benchmark/taxi/README.md. All files are available for download; the links are listed in benchmark/taxi/files.txt, and the files can be downloaded by executing:

cd ./benchmark/taxi/
./download.sh
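
For readers who want to see what a list-driven download step involves, here is a minimal Python sketch. It is not the contents of download.sh. It assumes that files.txt holds one URL per line, and the function names `parse_file_list` and `plan_downloads` are hypothetical helpers introduced only for illustration.

```python
from pathlib import Path
from urllib.parse import urlparse


def parse_file_list(text):
    """Return the non-empty, non-comment lines of a URL list.

    Assumption: one URL per line, with optional '#' comment lines.
    """
    urls = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls


def plan_downloads(urls, target_dir):
    """Pair each URL with a local path, skipping files already on disk."""
    target = Path(target_dir)
    plan = []
    for url in urls:
        # Use the last path component of the URL as the local file name.
        name = Path(urlparse(url).path).name
        dest = target / name
        if not dest.exists():
            plan.append((url, dest))
    return plan
```

Each `(url, dest)` pair could then be fetched with `urllib.request.urlretrieve(url, dest)`. Skipping files that already exist makes an interrupted download cheap to resume, which matters for a data set of this size.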

This benchmark consists of 92 gzip-compressed CSV files, totaling around 50GB of data and approximately 1.8 billion rows. The benchmark queries are inspired by the Taxi Benchmark blog post, with the results ordered to guarantee consistency. This allows us to check results and ensure that the benchmark is accurate.
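
To illustrate why ordering the output makes result checking possible, here is a small sketch. It uses Python's built-in sqlite3 module as a stand-in engine, not DuckDB, and the `trips` schema, the sample rows, and the query are hypothetical examples rather than the actual benchmark queries.

```python
import sqlite3

# Hypothetical aggregate query; ORDER BY pins the row order so the
# output can be compared against a stored expected result.
QUERY = (
    "SELECT cab_type, count(*) AS trips "
    "FROM trips GROUP BY cab_type ORDER BY cab_type"
)


def run_ordered(rows):
    """Load rows into an in-memory table and return the ordered aggregate."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE trips (cab_type TEXT, passenger_count INTEGER)")
    con.executemany("INSERT INTO trips VALUES (?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


ROWS = [("yellow", 1), ("green", 2), ("yellow", 3)]
EXPECTED = [("green", 1), ("yellow", 2)]

# The same expected result holds regardless of insertion order.
assert run_ordered(ROWS) == EXPECTED
assert run_ordered(list(reversed(ROWS))) == EXPECTED
```

Without the ORDER BY clause, a grouped result may come back in any order, especially under parallel execution, so a direct comparison against expected output could fail even when the answer is correct.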

The addition of this benchmark is also directly related to issues #14111 and #12453, as this process originated as an in-depth attempt to reproduce these issues.

I've also added multiple tests with different methods for loading these CSV files (see tests in test/sql/copy/csv/taxi/), but unfortunately, I was not able to reproduce any of the aforementioned issues. However, we now have an easier way to check if they might be reproducible in the future under specific scenarios.

@marklit, I followed the steps described in your blog posts. Could you verify that these files are the same as the ones you have? If not, I'm happy to adapt the generation process or the queries to attempt to reproduce the issues you mentioned. :-)
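
One way such a comparison could be done is sketched below. It hashes the decompressed content of each file, because two gzip files holding identical CSV data can still differ byte for byte (the gzip header stores a timestamp and, optionally, a file name). The helper name `content_digest` is hypothetical, and this is only one possible approach, not a procedure taken from the PR.

```python
import gzip
import hashlib


def content_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a gzip file's decompressed content."""
    digest = hashlib.sha256()
    with gzip.open(path, "rb") as handle:
        # Read in chunks so multi-gigabyte files never sit fully in memory.
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()
```

Running this over each of the files on both sides and comparing the digests would show whether the underlying CSV data matches, independent of how the files were compressed.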

For reviewing this PR, it would be helpful if the reviewer attempted to run the benchmark and reported whether anything goes wrong or whether the descriptions should be updated.

Mytherin · 2024-10-07T11:00:54Z

Thanks!

pdet added 19 commits September 28, 2024 14:57

Bunch of tests trying to reproduce the taxi dataset import issue

293d810

Try different numbers of threads

7e76300

more on the tests

9be5f0e

Add result checking

6d71d0f

add it as benchmark

5e31255

Lets see if CI can handle this

6ef1f8d

some small adjustments

839f2b9

Lets verify results on the benchmark as well

d47c23a

Initial readme

774e33f

Add remaining results, skip taxi_dataset.test_slow

1681e53

Missing copy statement

e7d85d5

make bash easy

61b57e1

break download to a separate step, to speed things up

6b270ff

wip

c477478

Update readome

1b6af67

update loader

31f15a7

remove require

5dea823

Format

03b54a9

Update README.md

4e2c380

Mytherin changed the base branch from main to feature October 7, 2024 11:00

Mytherin merged commit dfcaee3 into duckdb:feature Oct 7, 2024
40 checks passed

pdet deleted the taxi_dataset branch November 27, 2024 12:33

renovate bot mentioned this pull request Feb 23, 2025

fix(deps): update all-minor-patch elsbrock/hetzner-radar#118

Merged

1 task
