Skip to content

ashwin1596/bigData

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

12 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

TaxiTrip Data Mining: Scalable Pattern Discovery on 44M Trips

Overview

With the exponential growth in urban transportation data, analyzing and modeling taxi trip records can offer valuable insights for transit planning, policy-making, and operational optimization. This project focuses on a large-scale analysis of the NYC Yellow Taxi Trip dataset, leveraging both relational (PostgreSQL) and document-based (MongoDB) database paradigms. It encompasses end-to-end data handling โ€” from ER modeling, schema normalization, and performance tuning to data mining techniques such as frequent itemset and association rule mining โ€” to uncover actionable trends in trip behavior, payment preferences, and spatial-temporal patterns.


Objective

The primary goal of this project is to design, implement, and compare relational and document-oriented data models for the NYC Taxi dataset in order to:

  • Construct normalized schemas and enforce referential integrity for OLAP-style analytical queries.

  • Design an efficient document model for flexible, real-time data access.

  • Evaluate query performance and indexing strategies across both systems.

  • Apply data mining techniques (Apriori, association rule mining) to extract frequent co-occurrence patterns between trip zones, payment types, and time-based factors.

  • Derive meaningful insights for urban mobility, such as trip hotspots, fare distribution, and temporal usage trends, while showcasing the trade-offs between different database architectures.


Dataset Source & Description

  • Source: NYC Taxi & Limousine Commission (TLC)
  • Sample Size: 44M+ trip records (reduced to ~1M for feasibility)
  • Features: Pickup/Dropoff times, locations, fare, tip, payment type, passenger count, rate code, vendor ID, etc.

Entity-Relationship (ER) Model

The ER model represents the high-level conceptual design of the dataset. It captures the key entities (Trip, Vendor, Location, Payment, RateCode, and Time) and their relationships.

Highlights:

  • Trip is the central entity linked to all others via FKs.
  • Time is modeled as a weak entity for flexible date-time analysis.
  • Lookup/reference tables like Payment, Vendor, RateCode promote normalization and consistency.

ER Diagram:
ER Diagram


Relational Database Schema Design

Key Design Goals:

  • Normalization up to 3NF or BCNF
  • Efficient for OLAP-style queries
  • Avoid redundancy and anomalies

Tables & Attributes:

  • Trip(ID, VendorID, PaymentID, RateCodeID, PickupLocationID, DropoffLocationID, PassengerCount, TripDistance, FareAmount, TipAmount, TotalAmount, ...)
  • Time(TripID, PickupDate, PickupTime, DropoffDate, DropoffTime, DayOfWeek, IsWeekend)
  • Location(ID, Borough, Zone)
  • Vendor(ID, Description)
  • Payment(ID, Description)
  • RateCode(ID, Description)

Keys & Constraints:

  • Surrogate ID keys for dimension tables
  • Foreign key relationships with ON DELETE CASCADE
  • Referential integrity enforced for joins and aggregations

Relational model:
R Diagram

๐Ÿ“‚ SQL Schema: Phase-1/Data_Import/table_creation.sql


Data Loading Instructions (SQL)

  1. Store raw CSV files in Phase-1/Data_Import/raw_data/
  2. Run:
    python3 DataReader/load_from_kaggle.py
    
  3. Execute SQL:
    psql -f DataReader/table_creation.sql
    

๐Ÿ“„ Document-Oriented Model (MongoDB)

The document model embeds key information within a trip collection, where each document includes vendor, payment, location, and time data.

Sample Document:

{
  "trip_id": 12345,
  "vendor": "CMT",
  "pickup": {
    "datetime": "2023-02-01T10:30:00",
    "location": {
      "borough": "Manhattan",
      "zone": "Midtown Center"
    }
  },
  "dropoff": {
    "datetime": "2023-02-01T10:50:00",
    "location": {
      "borough": "Brooklyn",
      "zone": "Downtown Brooklyn"
    }
  },
  "fare": 20.5,
  "tip": 4.0,
  "payment_type": "Credit Card"
}

๐Ÿ“‚ MongoDB Script: Phase-2/load_to_mongo.py


Query Examples & Performance Tuning

PostgreSQL:

  • Top 5 zones by average fare
  • Average trip duration by day of week
  • Most common payment types

MongoDB:

  • Aggregations using $group, $project, $unwind
  • Geospatial queries on pickup zones

๐Ÿ“‚ SQL Queries: Phase-2/final_queries.sql


Indexing Strategy & Performance Benchmarks

PostgreSQL:

  • Indexes on PickupLocationID, DropoffLocationID, PickupDate, PaymentID
  • B-Tree and composite indexes improved query times by ~40%

MongoDB:

  • Compound indexes on pickup.datetime + pickup.location.zone
  • Performance monitored via explain() and Atlas profiler

๐Ÿ“‚ Index creation: Phase-2/indexes.sql


Functional Dependencies & Normalization

  • Ensured 3NF/BCNF: e.g.,
    Trip โ†’ PaymentID, PaymentID โ†’ PaymentType โ‡’ Trip โ†’ PaymentType
  • Avoided transitive and partial dependencies
  • No derived or multivalued attributes in base schema

๐Ÿ“‚ FD Discovery: Phase-2/get_functional_dependencies.py


Data Cleaning

  • Removed nulls and outliers (fare_amount <= 0, unrealistic trip distances)
  • Consolidated payment types (Credit Card, CC โ†’ Credit Card)
  • Added flags like IsWeekend, RushHour for analysis

๐Ÿ“‚ Cleaning Script: Phase-3/clean_data.py


Frequent Itemset Mining

Used Apriori to identify co-occurrence patterns between zones and payment types.

  • Example: {Zone=Midtown, Payment=Credit Card} โ†’ Frequent set
  • Minimum support threshold: 0.03

๐Ÿ“‚ Preprocess: Phase-3/preprocess.py ๐Ÿ“‚ Mining: Phase-3/itemset_mining.py

๐Ÿ“‚ Mined Rules: [Phase-3/[rules_2.txt, rules_3.txt, rules_4.txt]]


Association Rule Mining

Generated rules such as:

  • If PickupZone=Midtown โ†’ likely DropoffZone=Downtown Brooklyn (confidence=0.72)
  • If PaymentType=Cash โ†’ shorter trip distance (confidence=0.61)

๐Ÿ“‚ Notebook: Phase-3/association_rules.py


Relational vs Document Database Justification

Feature Relational (PostgreSQL) Document (MongoDB)
Normalization Fully normalized Embedded & denormalized
Joins Native support with FK Avoided by embedding
Query Flexibility High for complex, multi-table queries Great for hierarchical, nested data
Performance Better for structured batch analysis Better for single document retrieval
Storage More efficient due to normalization Redundant fields, higher space usage
Use Case Fit Ideal for analytical OLAP queries Suitable for real-time app backends

Execution Steps & Folder Structure

Load data

python3 DataReader/load_from_kaggle.py psql -f DataReader/table_creation.sql

Clean data

python3 Phase-3/clean_data.py

Preprocess for mining

python3 Phase-3/preprocess.py

Frequent itemsets

python3 Phase-3/itemset_mining.py

Association rules

python3 Phase-3/association_rules.py


๐Ÿ“‚ Folder Structure

โ”‚   Readme.md
โ”‚   ER_Diagram.png
โ”‚   Relational_Model.png
โ”‚
โ”œโ”€โ”€โ”€DataReader/
โ”‚   โ”œโ”€โ”€โ”€load_from_kaggle.py
โ”‚   โ””โ”€โ”€โ”€table_creation.sql
โ”œโ”€โ”€โ”€Phase-1/
โ”‚   โ””โ”€โ”€โ”€Data_Import/
โ”‚       โ”œโ”€โ”€โ”€load_from_kaggle.py
โ”‚       โ””โ”€โ”€โ”€table_creation.sql
โ”œโ”€โ”€โ”€phase-2/
โ”‚   โ”œโ”€โ”€โ”€get_functional_dependencies.py
โ”‚   โ”œโ”€โ”€โ”€final_queries.sql
โ”‚   โ””โ”€โ”€โ”€indexes.sql
โ””โ”€โ”€โ”€Phase-3/
    โ”œโ”€โ”€โ”€clean_data.py
    โ”œโ”€โ”€โ”€preprocess.py
    โ”œโ”€โ”€โ”€itemset_mining.py
    โ””โ”€โ”€โ”€association_rules.py

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •  

Languages