Skip to content

Conversation

pdet
Copy link
Contributor

@pdet pdet commented Oct 9, 2024

NYC Taxi Dataset Regeneration

Basically, I noticed that generating the dataset by following the exact steps described in this article would not work because the TLC has anonymized the latitude and longitude information when converting the datasets to Parquet files.

To address this, I obtained an older version of the data and properly regenerated the dataset.

For example, in a data snippet from this pull request (PR):

649084905,VTS,2012-08-31 22:00:00,2012-08-31 22:07:00,0,1,-73.993908,40.741383000000006,-73.989915,40.75273800000001,1,1.32,6.1,0.5,0.5,0,0,0,0,7.1,CSH,0,0101000020E6100000E6CE4C309C7F52C0BA675DA3E55E4440,0101000020E610000078B471C45A7F52C06D3A02B859604440,yellow,0.00,0.0,0.0,91,69,4.70,142,54,1,Manhattan,005400,1005400,I,MN13,Hudson Yards-Chelsea-Flatiron-Union Square,3807,132,109,1,Manhattan,010900,1010900,I,MN17,Midtown-Midtown South,3807

we see precise longitude and latitude data points: -73.993908,40.741383000000006,-73.989915,40.75273800000001, along with a PostGIS Geometry hex blob created from this longitude and latitude information: 0101000020E6100000E6CE4C309C7F52C0BA675DA3E55E4440,0101000020E610000078B471C45A7F52C06D3A02B859604440 (generated using ST_SetSRID(ST_Point(longitude, latitude), 4326)).

Since latitude and longitude information is essential to this schema, I’ve limited the dataset to data up to mid-2016, which is the last period where this information was available.

The final dataset consists of 65 files with a total size of approximately 1.8 GB.
In the end this benchmark also has a few more rides than the one from the billion row blogpost, because in the blogpost the uber rides were excluded and the trips were capped until the end of 2015.

@Mytherin Mytherin merged commit 8d7f1ae into duckdb:feature Oct 10, 2024
39 of 40 checks passed
@Mytherin
Copy link
Collaborator

Thanks!

@pdet pdet deleted the taxi_2 branch November 27, 2024 12:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants