Amazon Reviews ETL Pipeline

Introduction


This project builds an ETL pipeline that turns raw Amazon review data into a self-service analytics platform. It ingests a 34.7 million-record JSON dump of reviews and products, structures it for efficient querying, and delivers dynamic Power BI dashboards, giving stakeholders on-demand views of brand performance and sentiment trends with enterprise-grade reliability and maintainability.

Process

  • Data Collection and Cleaning
    • Ingested a 34.7 M-record JSON dump of Amazon reviews (89.5 K products) into HDFS.
    • Employed jq and Apache Pig scripts to normalize JSON, handle missing values, and remove duplicates, laying a clean foundation for downstream processing.

  • Exploratory Data Analysis & Structuring
    • Queried a 2.4 M-review sample to surface the top brands by review share (0.84% each) and uncover sentiment trends over time.
    • Designed Hive schemas and populated structured tables (fact tables for reviews, dimension tables for products, dates, and sentiments) to optimize query performance.

  • ETL Orchestration & BI Integration
    • Built an SSIS package to extract from Hive, transform as needed, and load into SQL Server.
    • Developed Power BI reports featuring KPI cards, donut charts, treemaps, and trend lines, enabling interactive drill-downs by brand, product category, and time period.

  • MLOps-Inspired Automation & Monitoring
    • Version-controlled all scripts and configurations in Git, and parameterized SSIS packages for flexibility across environments.
    • Scheduled automated data refresh pipelines to maintain up-to-date dashboards, with alerting on job failures and SLA breaches.

  • Validation & Iteration
    • Regularly compared dashboard metrics against source data to ensure accuracy.
    • Incorporated stakeholder feedback to refine visualizations and add new filters (e.g., sentiment polarity, review length).
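
The cleaning, structuring, and validation steps above can be illustrated with a minimal Python sketch. This is not the project's code (the project used jq, Apache Pig, Hive, and SSIS); it only mirrors the same logic on a few in-memory records. The field names (review_id, asin, brand, rating, text), the sample data, and the policy of dropping reviews with no rating are all assumptions made for illustration.

```python
import json
from collections import Counter

# Hypothetical raw JSON lines; the field names are assumptions, not the real schema.
RAW_LINES = [
    '{"review_id": "r1", "asin": "A1", "brand": " Acme ", "rating": 5, "text": "Great"}',
    '{"review_id": "r1", "asin": "A1", "brand": " Acme ", "rating": 5, "text": "Great"}',
    '{"review_id": "r2", "asin": "A2", "brand": null, "rating": "4", "text": "Fine"}',
    '{"review_id": "r3", "asin": "A1", "brand": "Acme", "rating": null, "text": ""}',
]


def normalize(line):
    """Parse one JSON line and coerce its fields to consistent types."""
    rec = json.loads(line)
    rating = rec.get("rating")
    return {
        "review_id": rec["review_id"],
        "asin": rec["asin"],
        # Missing brands become "unknown"; whitespace and case are normalized.
        "brand": (rec.get("brand") or "unknown").strip().lower(),
        "rating": int(rating) if rating is not None else None,
        "text": rec.get("text") or "",
    }


def clean(lines):
    """Normalize records, drop those with no rating, and remove duplicate review_ids."""
    seen, cleaned = set(), []
    for line in lines:
        rec = normalize(line)
        if rec["rating"] is None or rec["review_id"] in seen:
            continue
        seen.add(rec["review_id"])
        cleaned.append(rec)
    return cleaned


def to_star_schema(records):
    """Split cleaned records into a product dimension and a review fact table."""
    dim_product, fact_review = {}, []
    for rec in records:
        dim_product.setdefault(rec["asin"], {"asin": rec["asin"], "brand": rec["brand"]})
        fact_review.append(
            {"review_id": rec["review_id"], "asin": rec["asin"], "rating": rec["rating"]}
        )
    return dim_product, fact_review


def brand_share(dim_product, fact_review):
    """Compute each brand's share of reviews by joining facts to the dimension."""
    counts = Counter(dim_product[row["asin"]]["brand"] for row in fact_review)
    total = sum(counts.values())
    return {brand: n / total for brand, n in counts.items()}


cleaned = clean(RAW_LINES)
dim_product, fact_review = to_star_schema(cleaned)
shares = brand_share(dim_product, fact_review)

# Validation: the fact table must account for every cleaned source record.
assert len(fact_review) == len(cleaned)
print(shares)
```

Keeping brand on the product dimension rather than on each review row is the same design choice the Hive fact and dimension tables reflect: per-brand metrics are computed by joining the fact table to the dimension, and the final row-count check is a small stand-in for comparing dashboard metrics against source data.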

Outcome

  • Actionable Insights: Enabled identification of underperforming brands and emerging sentiment shifts, informing timely marketing adjustments.

  • Efficiency Gains: Reduced data-prep and reporting lead time by 70%, from multi-hour manual processes to an automated end-to-end flow.

  • Scalable, Self-Service Analytics: Empowered product managers and marketers with on-demand dashboards, eliminating bottlenecks and democratizing data access across the organization.

Do you have a project idea you would like to discuss?
