Skip to content

Conversation

pdet
Copy link
Contributor

@pdet pdet commented Oct 2, 2024

This PR adds the benchmarks described in the Billion NYC Taxi Rides Redshift and Taxi Benchmarks blogs.

A description of how to run the benchmark and how the files are generated is available in benchmark/taxi/README.md. All files are available for download, the links are in the benchmark/taxi/files.txt. and can easily be downloaded by executing:

cd ./benchmark/taxi/
./download.sh

This benchmark consists of 92 compressed, gzipped CSV files, totaling around 50GB of data and approximately 1.8 billion rows. The benchmark queries are inspired by the Taxi Benchmark blog post, with the output results being ordered to guarantee consistency. This allows us to perform result-checking and ensure that the benchmark is accurate.

The addition of this benchmark is also directly related to issues #14111 and #12453, as this process originated as an in-depth attempt to reproduce these issues.

I've also added multiple tests with different methods for loading these CSV files (see tests in test/sql/copy/csv/taxi/), but unfortunately, I was not able to reproduce any of the aforementioned issues. However, we now have an easier way to check if they might be reproducible in the future under specific scenarios.

@marklit, I followed the steps described in your blog posts. Could you verify that these files are the same as the ones you have? If not, I'm happy to adapt the generation process or the queries to attempt to reproduce the issues you mentioned. :-)

For reviewing this PR, it would be interesting if the reviewer attempted to run the benchmark, and report if something goes wrong, or if the descriptions should be updated.

@Mytherin Mytherin changed the base branch from main to feature October 7, 2024 11:00
@Mytherin Mytherin merged commit dfcaee3 into duckdb:feature Oct 7, 2024
40 checks passed
@Mytherin
Copy link
Collaborator

Mytherin commented Oct 7, 2024

Thanks!

@pdet pdet deleted the taxi_dataset branch November 27, 2024 12:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants