NYC Taxi Dataset Regeneration
I noticed that generating the dataset by following the exact steps described in this article no longer works, because the TLC anonymized the latitude and longitude information when converting the datasets to Parquet files.
To address this, I obtained an older version of the data and properly regenerated the dataset.
For example, in a data snippet from this pull request (PR), we see precise longitude and latitude data points:
-73.993908,40.741383000000006,-73.989915,40.75273800000001
along with the PostGIS Geometry hex blobs created from this longitude and latitude information:
0101000020E6100000E6CE4C309C7F52C0BA675DA3E55E4440,0101000020E610000078B471C45A7F52C06D3A02B859604440
(generated using ST_SetSRID(ST_Point(longitude, latitude), 4326)).
Since latitude and longitude information is essential to this schema, I’ve limited the dataset to data up to mid-2016, which is the last period where this information was available.
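To make the relationship between the coordinates and the hex blobs easier to check, here is a minimal Python sketch that packs and unpacks a 2D point in the little-endian EWKB layout that these blobs appear to follow (byte-order flag, geometry type with the SRID flag set, SRID, then X and Y as doubles). The helper names point_to_ewkb_hex and ewkb_hex_to_point are hypothetical and not part of this PR; the dataset itself was generated with PostGIS, not with this code.

```python
import struct

# Assumed layout (little-endian EWKB for a 2D point with an SRID):
#   1 byte  byte-order flag (1 = little-endian)
#   4 bytes geometry type, 1 = Point, OR-ed with the SRID flag 0x20000000
#   4 bytes SRID
#   8 bytes X (longitude) as IEEE 754 double
#   8 bytes Y (latitude) as IEEE 754 double
EWKB_SRID_FLAG = 0x20000000
WKB_POINT = 1


def point_to_ewkb_hex(longitude, latitude, srid=4326):
    """Hypothetical helper: encode a point as an uppercase EWKB hex string."""
    header = struct.pack("<BII", 1, WKB_POINT | EWKB_SRID_FLAG, srid)
    body = struct.pack("<dd", longitude, latitude)
    return (header + body).hex().upper()


def ewkb_hex_to_point(blob_hex):
    """Hypothetical helper: decode an EWKB hex point into (srid, lon, lat)."""
    raw = bytes.fromhex(blob_hex)
    byte_order, geom_type, srid = struct.unpack_from("<BII", raw, 0)
    if byte_order != 1:
        raise ValueError("only little-endian blobs are handled in this sketch")
    if geom_type != (WKB_POINT | EWKB_SRID_FLAG):
        raise ValueError("not a 2D point with an SRID")
    longitude, latitude = struct.unpack_from("<dd", raw, 9)
    return srid, longitude, latitude


if __name__ == "__main__":
    # First blob from the snippet above (the pickup point).
    pickup = "0101000020E6100000E6CE4C309C7F52C0BA675DA3E55E4440"
    print(ewkb_hex_to_point(pickup))
```

Decoding the first blob from the snippet should give SRID 4326 and coordinates matching the first longitude/latitude pair, which is a quick way to spot-check rows in the regenerated files without a PostGIS instance.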
The final dataset consists of 65 files with a total size of approximately 1.8 GB.
In the end, this benchmark also has a few more rides than the one from the billion-row blog post, because the blog post excluded Uber rides and capped the trips at the end of 2015.