With the exponential growth in urban transportation data, analyzing and modeling taxi trip records can offer valuable insights for transit planning, policy-making, and operational optimization. This project focuses on a large-scale analysis of the NYC Yellow Taxi Trip dataset, leveraging both relational (PostgreSQL) and document-based (MongoDB) database paradigms. It encompasses end-to-end data handling โ from ER modeling, schema normalization, and performance tuning to data mining techniques such as frequent itemset and association rule mining โ to uncover actionable trends in trip behavior, payment preferences, and spatial-temporal patterns.
The primary goal of this project is to design, implement, and compare relational and document-oriented data models for the NYC Taxi dataset in order to:
-
Construct normalized schemas and enforce referential integrity for OLAP-style analytical queries.
-
Design an efficient document model for flexible, real-time data access.
-
Evaluate query performance and indexing strategies across both systems.
-
Apply data mining techniques (Apriori, association rule mining) to extract frequent co-occurrence patterns between trip zones, payment types, and time-based factors.
-
Derive meaningful insights for urban mobility, such as trip hotspots, fare distribution, and temporal usage trends, while showcasing the trade-offs between different database architectures.
- Source: NYC Taxi & Limousine Commission (TLC)
- Sample Size: 44M+ trip records (reduced to ~1M for feasibility)
- Features: Pickup/Dropoff times, locations, fare, tip, payment type, passenger count, rate code, vendor ID, etc.
The ER model represents the high-level conceptual design of the dataset. It captures the key entities (Trip
, Vendor
, Location
, Payment
, RateCode
, and Time
) and their relationships.
Trip
is the central entity linked to all others via FKs.Time
is modeled as a weak entity for flexible date-time analysis.- Lookup/reference tables like
Payment
,Vendor
,RateCode
promote normalization and consistency.
- Normalization up to 3NF or BCNF
- Efficient for OLAP-style queries
- Avoid redundancy and anomalies
Trip(ID, VendorID, PaymentID, RateCodeID, PickupLocationID, DropoffLocationID, PassengerCount, TripDistance, FareAmount, TipAmount, TotalAmount, ...)
Time(TripID, PickupDate, PickupTime, DropoffDate, DropoffTime, DayOfWeek, IsWeekend)
Location(ID, Borough, Zone)
Vendor(ID, Description)
Payment(ID, Description)
RateCode(ID, Description)
- Surrogate
ID
keys for dimension tables - Foreign key relationships with
ON DELETE CASCADE
- Referential integrity enforced for joins and aggregations
๐ SQL Schema: Phase-1/Data_Import/table_creation.sql
- Store raw CSV files in
Phase-1/Data_Import/raw_data/
- Run:
python3 DataReader/load_from_kaggle.py
- Execute SQL:
psql -f DataReader/table_creation.sql
The document model embeds key information within a trip
collection, where each document includes vendor, payment, location, and time data.
{
"trip_id": 12345,
"vendor": "CMT",
"pickup": {
"datetime": "2023-02-01T10:30:00",
"location": {
"borough": "Manhattan",
"zone": "Midtown Center"
}
},
"dropoff": {
"datetime": "2023-02-01T10:50:00",
"location": {
"borough": "Brooklyn",
"zone": "Downtown Brooklyn"
}
},
"fare": 20.5,
"tip": 4.0,
"payment_type": "Credit Card"
}
๐ MongoDB Script: Phase-2/load_to_mongo.py
- Top 5 zones by average fare
- Average trip duration by day of week
- Most common payment types
- Aggregations using
$group
,$project
,$unwind
- Geospatial queries on pickup zones
๐ SQL Queries: Phase-2/final_queries.sql
- Indexes on
PickupLocationID
,DropoffLocationID
,PickupDate
,PaymentID
- B-Tree and composite indexes improved query times by ~40%
- Compound indexes on
pickup.datetime
+pickup.location.zone
- Performance monitored via
explain()
andAtlas profiler
๐ Index creation: Phase-2/indexes.sql
- Ensured 3NF/BCNF: e.g.,
Trip โ PaymentID
,PaymentID โ PaymentType
โTrip โ PaymentType
- Avoided transitive and partial dependencies
- No derived or multivalued attributes in base schema
๐ FD Discovery: Phase-2/get_functional_dependencies.py
- Removed nulls and outliers (
fare_amount <= 0
, unrealistic trip distances) - Consolidated payment types (
Credit Card
,CC
โCredit Card
) - Added flags like
IsWeekend
,RushHour
for analysis
๐ Cleaning Script: Phase-3/clean_data.py
Used Apriori to identify co-occurrence patterns between zones and payment types.
- Example:
{Zone=Midtown, Payment=Credit Card} โ Frequent set
- Minimum support threshold: 0.03
๐ Preprocess: Phase-3/preprocess.py
๐ Mining: Phase-3/itemset_mining.py
๐ Mined Rules: [Phase-3/[rules_2.txt, rules_3.txt, rules_4.txt]
]
Generated rules such as:
If PickupZone=Midtown โ likely DropoffZone=Downtown Brooklyn (confidence=0.72)
If PaymentType=Cash โ shorter trip distance (confidence=0.61)
๐ Notebook: Phase-3/association_rules.py
Feature | Relational (PostgreSQL) | Document (MongoDB) |
---|---|---|
Normalization | Fully normalized | Embedded & denormalized |
Joins | Native support with FK | Avoided by embedding |
Query Flexibility | High for complex, multi-table queries | Great for hierarchical, nested data |
Performance | Better for structured batch analysis | Better for single document retrieval |
Storage | More efficient due to normalization | Redundant fields, higher space usage |
Use Case Fit | Ideal for analytical OLAP queries | Suitable for real-time app backends |
python3 DataReader/load_from_kaggle.py psql -f DataReader/table_creation.sql
python3 Phase-3/clean_data.py
python3 Phase-3/preprocess.py
python3 Phase-3/itemset_mining.py
python3 Phase-3/association_rules.py
โ Readme.md
โ ER_Diagram.png
โ Relational_Model.png
โ
โโโโDataReader/
โ โโโโload_from_kaggle.py
โ โโโโtable_creation.sql
โโโโPhase-1/
โ โโโโData_Import/
โ โโโโload_from_kaggle.py
โ โโโโtable_creation.sql
โโโโphase-2/
โ โโโโget_functional_dependencies.py
โ โโโโfinal_queries.sql
โ โโโโindexes.sql
โโโโPhase-3/
โโโโclean_data.py
โโโโpreprocess.py
โโโโitemset_mining.py
โโโโassociation_rules.py