Category:
Data Engineering
Introduction
This project builds a robust ETL pipeline that transforms raw Amazon review data into a self-service analytics platform. It ingests a 34.7-million-record JSON dump of reviews and products, structures it for efficient querying, and delivers dynamic Power BI dashboards, giving stakeholders on-demand views of brand performance and sentiment trends, all with enterprise-grade reliability and maintainability.
Process
Data Collection and Cleaning
• Ingested a 34.7 M-record JSON dump of Amazon reviews (89.5 K products) into HDFS.
• Employed jq and Apache Pig scripts to normalize JSON, handle missing values, and remove duplicates, laying a clean foundation for downstream processing.
Exploratory Data Analysis & Structuring
• Queried a 2.4 M-review sample to surface the top brands' share (0.84% each) and uncover sentiment trends over time.
• Designed Hive schemas and populated structured tables (fact tables for reviews, dimension tables for products, dates, and sentiments) to optimize query performance.
ETL Orchestration & BI Integration
• Built an SSIS package to extract from Hive, transform as needed, and load into SQL Server.
• Developed Power BI reports featuring KPI cards, donut charts, treemaps, and trend lines, enabling interactive drill-downs by brand, product category, and time period.
MLOps-Inspired Automation & Monitoring
• Version-controlled all scripts and configurations in Git, and parameterized SSIS packages for flexibility across environments.
• Scheduled automated data refresh pipelines to maintain up-to-date dashboards, with alerting on job failures and SLA breaches.
Validation & Iteration
• Regularly compared dashboard metrics against source data to ensure accuracy.
• Incorporated stakeholder feedback to refine visualizations and add new filters (e.g., sentiment polarity, review length).
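The accuracy check described under Validation & Iteration could look roughly like the sketch below. It assumes the compared metrics are per-brand review counts and average ratings; the field names (brand, rating), the function names, and the sample brands are illustrative assumptions, not the project's actual schema or code.

```python
from collections import defaultdict
from math import isclose


def aggregate_source(reviews):
    """Compute per-brand review count and mean rating from raw review rows."""
    totals = defaultdict(lambda: {"count": 0, "rating_sum": 0.0})
    for row in reviews:
        bucket = totals[row["brand"]]  # "brand" and "rating" are assumed field names
        bucket["count"] += 1
        bucket["rating_sum"] += row["rating"]
    return {
        brand: {"count": t["count"], "avg_rating": t["rating_sum"] / t["count"]}
        for brand, t in totals.items()
    }


def reconcile(source_metrics, dashboard_metrics, rel_tol=1e-6):
    """Return a list of human-readable mismatches between the two metric sets."""
    issues = []
    for brand in sorted(set(source_metrics) | set(dashboard_metrics)):
        src = source_metrics.get(brand)
        dash = dashboard_metrics.get(brand)
        if src is None or dash is None:
            missing_side = "source" if src is None else "dashboard"
            issues.append(f"{brand}: missing in {missing_side}")
            continue
        if src["count"] != dash["count"]:
            issues.append(f"{brand}: count {src['count']} != {dash['count']}")
        # Averages are floats, so compare with a tolerance instead of ==.
        if not isclose(src["avg_rating"], dash["avg_rating"], rel_tol=rel_tol):
            issues.append(
                f"{brand}: avg_rating {src['avg_rating']:.4f} != {dash['avg_rating']:.4f}"
            )
    return issues


if __name__ == "__main__":
    # Hypothetical sample rows standing in for source data.
    sample_reviews = [
        {"brand": "BrandA", "rating": 5},
        {"brand": "BrandA", "rating": 3},
        {"brand": "BrandB", "rating": 4},
    ]
    # Hypothetical dashboard export; BrandB's average is deliberately off.
    dashboard = {
        "BrandA": {"count": 2, "avg_rating": 4.0},
        "BrandB": {"count": 1, "avg_rating": 3.5},
    }
    for issue in reconcile(aggregate_source(sample_reviews), dashboard):
        print(issue)
```

An empty list from reconcile means the dashboard figures agree with the source for every brand; any entries could feed the same alerting used for job failures.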
Outcome
Actionable Insights: Enabled identification of underperforming brands and emerging sentiment shifts, informing timely marketing adjustments.
Efficiency Gains: Reduced data-prep and reporting lead time by 70%, from multi-hour manual processes to an automated end-to-end flow.
Scalable, Self-Service Analytics: Empowered product managers and marketers with on-demand dashboards, eliminating bottlenecks and democratizing data access across the organization.