Python Automation Showcase

Automated Data Pipeline

Python-Powered ETL: From Raw Data to Executive Reports

Processing time: 4.2s
Records processed: 12,847
Output formats: 3
Automation rate: 100%

Pipeline Flow
1. Raw CSV (input data)
2. Python (pandas ETL)
3. Clean Output (validated)
4. Visualization (matplotlib)
5. PDF Report (executive)

1. Raw CSV Input

⚠️ Source file contains inconsistent formatting, missing values, and duplicate entries that must be cleaned before analysis.
date,region,product,units_sold,revenue,sales_rep
2024-01-15,Northeast,Widget A,150,4500.00,J. Smith
2024-01-15,Southeast,Widget B,,NaN,R. Jones
2024/01/16,northeast,Widget A,200,6000.00,J. Smith
2024-01-16,West, Widget C,75,3750.00,
2024-01-17,Northeast,widget a,-10,-300.00,M. Lee
01/18/2024,Southeast,Widget B,320,9600,R. Jones

Issue                     Column               Rows Affected  Severity
Missing values            units_sold, revenue  23             Medium
Inconsistent date format  date                 1,247          Medium
Case inconsistencies      region, product      312            Low
Negative values           units_sold, revenue  8              Medium
Missing sales rep         sales_rep            45             Low
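
Counts like the ones in the table above could come from a small audit pass that runs before any cleaning. The following is a minimal, dependency-free sketch run against the six sample rows shown earlier. The function name `audit_rows` and the detection rules are assumptions for illustration, not the pipeline's actual checks.

```python
import csv
import io
import re

# The six sample rows shown above, used here as stand-in data.
SAMPLE = """date,region,product,units_sold,revenue,sales_rep
2024-01-15,Northeast,Widget A,150,4500.00,J. Smith
2024-01-15,Southeast,Widget B,,NaN,R. Jones
2024/01/16,northeast,Widget A,200,6000.00,J. Smith
2024-01-16,West, Widget C,75,3750.00,
2024-01-17,Northeast,widget a,-10,-300.00,M. Lee
01/18/2024,Southeast,Widget B,320,9600,R. Jones
"""

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_missing(value):
    """Treat empty strings and the literal 'NaN' as missing."""
    return value.strip() in ("", "NaN")


def is_negative(value):
    """True for a present numeric value below zero."""
    return not is_missing(value) and float(value) < 0


def audit_rows(rows):
    """Count rows affected by each issue type (hypothetical rules)."""
    counts = {
        "missing_values": 0,
        "non_iso_date": 0,
        "case_inconsistency": 0,
        "negative_values": 0,
        "missing_sales_rep": 0,
    }
    for row in rows:
        if is_missing(row["units_sold"]) or is_missing(row["revenue"]):
            counts["missing_values"] += 1
        if not ISO_DATE.fullmatch(row["date"].strip()):
            counts["non_iso_date"] += 1
        # A value is inconsistent if title-casing would change it.
        if any(row[c].strip() != row[c].strip().title() for c in ("region", "product")):
            counts["case_inconsistency"] += 1
        if is_negative(row["units_sold"]) or is_negative(row["revenue"]):
            counts["negative_values"] += 1
        if is_missing(row["sales_rep"]):
            counts["missing_sales_rep"] += 1
    return counts


counts = audit_rows(csv.DictReader(io.StringIO(SAMPLE)))
for issue, n in counts.items():
    print(f"{issue}: {n}")
```

Running the same function over the full file before and after cleaning would give a direct way to confirm that each issue count drops to zero.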
2. Python Processing (pandas)

pipeline.py

import pandas as pd

# Stage 2a: Load raw data.
# The date column mixes ISO, slash-separated, and US-style values, so it is
# parsed value by value after loading (format="mixed" requires pandas >= 2.0).
df = pd.read_csv("sales_raw.csv")
df["date"] = pd.to_datetime(df["date"], format="mixed")

# Stage 2b: Clean missing values
df = df.dropna(subset=["units_sold", "revenue"])
df["sales_rep"] = df["sales_rep"].fillna("Unassigned")

# Stage 2c: Normalize and convert types
df["region"] = df["region"].str.strip().str.title()
df["product"] = df["product"].str.strip().str.title()
df["units_sold"] = df["units_sold"].astype(int)

# Stage 2d: Filter invalid records
df = df[df["units_sold"] > 0]
df = df[df["revenue"] > 0]

# Stage 2e: Load rep metadata and merge
reps = pd.read_csv("sales_reps.csv")
df = df.merge(reps, on="sales_rep", how="left")

# Stage 2f: Aggregate with groupby
summary = df.groupby(["region", "product"]).agg(
    total_units=("units_sold", "sum"),
    total_revenue=("revenue", "sum"),
    transaction_count=("units_sold", "count"),
).reset_index()

# Average price per unit. Taking the mean of revenue would give revenue per
# transaction instead, which does not match the Avg Price column in the output.
summary.insert(4, "avg_price", summary["total_revenue"] / summary["total_units"])

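
Stage 2a depends on the date column being parsed even though the source mixes three layouts. One way to make that explicit is to try a fixed list of formats and fail loudly on anything else. This is a sketch using only the standard library. `parse_sale_date` and the format list are assumptions based on the sample rows, and the sketch reads `01/18/2024` as month-first.

```python
from datetime import date, datetime

# Layouts seen in the sample rows; extend if the real file contains others.
KNOWN_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%m/%d/%Y")


def parse_sale_date(text):
    """Return a date for any known layout, or raise ValueError."""
    cleaned = text.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


for raw in ("2024-01-15", "2024/01/16", "01/18/2024"):
    print(raw, "->", parse_sale_date(raw).isoformat())
```

With pandas, a function like this could be applied to the column with `Series.map`. An explicit format list surfaces an unexpected layout as an error rather than leaving the parser to guess.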
Processing Log
PASS Loaded 12,847 rows from CSV
WARN Dropped 23 rows (missing values)
PASS Normalized 312 case inconsistencies
WARN Removed 8 negative-value records
PASS Merged rep metadata (98.2% match)
PASS Grouped into 7 region-product combos
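
The match percentage reported for the merge can be computed by checking how many cleaned rows have a `sales_rep` that appears in the metadata file. The sketch below is illustrative only: the rows and the set of known reps are made-up stand-ins, since the contents of `sales_reps.csv` are not shown, and `match_rate` is a hypothetical helper.

```python
def match_rate(rows, known_reps):
    """Percentage of rows whose sales_rep appears in known_reps."""
    if not rows:
        return 0.0
    matched = sum(1 for row in rows if row["sales_rep"] in known_reps)
    return 100.0 * matched / len(rows)


# Hypothetical stand-in data; the real lookup would come from sales_reps.csv.
known_reps = {"J. Smith", "R. Jones"}
rows = [
    {"sales_rep": "J. Smith"},
    {"sales_rep": "R. Jones"},
    {"sales_rep": "J. Smith"},
    {"sales_rep": "Unassigned"},  # filled in by stage 2b, so it has no metadata
]

print(f"PASS Merged rep metadata ({match_rate(rows, known_reps):.1f}% match)")
```

In pandas, passing `indicator=True` to `DataFrame.merge` adds a `_merge` column that supports the same calculation without a separate lookup.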
Data Quality
Input rows: 12,847
Rows dropped: 31
Output rows: 12,816
Completeness: 99.8%
Processing time: 1.8s
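
The data quality figures are related by simple arithmetic, which a pipeline can check at the end of the stage. A minimal sketch follows, using the numbers from the processing log. It assumes that "Completeness" means output rows divided by input rows, which the page does not state.

```python
# Figures taken from the processing log above.
input_rows = 12_847
dropped_missing = 23
dropped_negative = 8

rows_dropped = dropped_missing + dropped_negative
output_rows = input_rows - rows_dropped

# Assumption: completeness = share of input rows that survive cleaning.
completeness = 100.0 * output_rows / input_rows

print(f"Rows dropped: {rows_dropped}")
print(f"Output rows:  {output_rows:,}")
print(f"Completeness: {completeness:.1f}%")
```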
3. Clean Output

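output

# Export cleaned data
summary.to_csv("sales_clean.csv", index=False)
summary.to_excel("sales_clean.xlsx", index=False)  # .xlsx output requires openpyxl
# JSON is the third export format named below; this filename is assumed
# to follow the pattern of the other two exports.
summary.to_json("sales_clean.json", orient="records")
print(summary.to_string(index=False))
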
Region     Product   Total Units  Total Revenue  Avg Price  Transactions
Northeast  Widget A  4,280        $128,400       $30.00     856
Northeast  Widget B  2,150        $86,000        $40.00     430
Southeast  Widget A  3,640        $109,200       $30.00     728
Southeast  Widget B  3,920        $156,800       $40.00     784
West       Widget A  1,890        $56,700        $30.00     378
West       Widget B  1,420        $56,800        $40.00     284
West       Widget C  2,310        $115,500       $50.00     462

All data validated: no nulls, no duplicates, consistent formatting. Exported to CSV, Excel, and JSON.
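
The summary table can be cross-checked against the totals quoted in the report preview. The sketch below hard-codes the seven rows of the table and recomputes the average price per unit, the grand totals, and each region's share of revenue. The tuple layout and variable names are chosen for illustration.

```python
from collections import defaultdict

# (region, product, total_units, total_revenue, transactions) from the table.
ROWS = [
    ("Northeast", "Widget A", 4280, 128_400, 856),
    ("Northeast", "Widget B", 2150, 86_000, 430),
    ("Southeast", "Widget A", 3640, 109_200, 728),
    ("Southeast", "Widget B", 3920, 156_800, 784),
    ("West", "Widget A", 1890, 56_700, 378),
    ("West", "Widget B", 1420, 56_800, 284),
    ("West", "Widget C", 2310, 115_500, 462),
]

# Average price is revenue per unit, not revenue per transaction.
avg_prices = {(r, p): rev / units for r, p, units, rev, _ in ROWS}

total_units = sum(units for _, _, units, _, _ in ROWS)
total_revenue = sum(rev for _, _, _, rev, _ in ROWS)
total_transactions = sum(tx for *_, tx in ROWS)

revenue_by_region = defaultdict(int)
for region, _, _, rev, _ in ROWS:
    revenue_by_region[region] += rev

print(f"Total: ${total_revenue:,} revenue, {total_units:,} units, "
      f"{total_transactions:,} transactions")
ranked = sorted(revenue_by_region.items(), key=lambda kv: kv[1], reverse=True)
for region, rev in ranked:
    print(f"{region}: ${rev:,} ({100 * rev / total_revenue:.1f}%)")
```

A check of this kind is cheap to run at the end of the export step and catches a summary whose rows no longer add up to the headline figures.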
4. Visualization (matplotlib)

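charts.py

import os

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script runs without a display
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import pandas as pd

# summary is not defined in this file, so reload the stage 3 export
summary = pd.read_csv("sales_clean.csv")

COLORS = ["#6366f1", "#0ea5e9", "#10b981"]

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Chart 1: Revenue by region and product (grouped bar chart)
pivot = summary.pivot(index="region", columns="product", values="total_revenue")
pivot.plot(kind="bar", ax=axes[0], color=COLORS)
axes[0].set_title("Revenue by Region & Product", fontweight="bold")
axes[0].yaxis.set_major_formatter(mticker.StrMethodFormatter("${x:,.0f}"))

# Chart 2: Units sold distribution by region (pie chart)
summary.groupby("region")["total_units"].sum().plot(
    kind="pie", ax=axes[1], autopct="%1.1f%%", colors=COLORS
)
axes[1].set_ylabel("")  # hide the default "total_units" axis label

plt.tight_layout()
os.makedirs("charts", exist_ok=True)  # savefig does not create missing directories
plt.savefig("charts/revenue_analysis.png", dpi=150)
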
[Chart: Revenue by Region & Product. Grouped bars for NE, SE, and West; series Widget A, Widget B, Widget C.]
5. PDF Report Generation

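report.py

import os
from datetime import datetime

import pandas as pd
from fpdf import FPDF


class SalesReport(FPDF):
    def header(self):
        self.set_font("Helvetica", "B", 16)
        self.cell(0, 10, "Q1 2024 Sales Report", ln=True)
        self.set_font("Helvetica", "", 9)
        # ln=True moves to the next line so the page body does not start beside the header
        self.cell(0, 6, f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M')}", ln=True)
        self.ln(4)

    def add_summary_table(self, data):
        # Split the printable width evenly; six fixed 32 mm columns would exceed it
        col_width = (self.w - self.l_margin - self.r_margin) / len(data.columns)
        self.set_font("Helvetica", "B", 10)
        for col in data.columns:
            self.cell(col_width, 8, col, border=1)
        self.ln()
        self.set_font("Helvetica", "", 9)
        for _, row in data.iterrows():
            for val in row:
                self.cell(col_width, 7, str(val), border=1)
            self.ln()


# Generate the report from the stage 3 export (summary is not defined in this file)
summary = pd.read_csv("sales_clean.csv")
pdf = SalesReport()
pdf.add_page()
pdf.add_summary_table(summary)
pdf.image("charts/revenue_analysis.png", x=10, w=190)
os.makedirs("reports", exist_ok=True)
pdf.output("reports/Q1_2024_Sales_Report.pdf")
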
Report Preview

Q1 2024 Sales Report

Generated: 2024-04-01 09:15 | Ironshore Analytics LLC

Executive Summary

Total revenue was $709,400 across 3 regions and 3 product lines, a 12.3% increase over Q4 2023.

Revenue: $709,400 | Units: 19,610 | Transactions: 3,922 | Regions: 3

Regional Performance

The Southeast region leads with $266,000 in revenue (37.5%), followed by West at $229,000 (32.3%) and Northeast at $214,400 (30.2%).

Attached Charts

1. Revenue by Region & Product (Bar)
2. Units Sold Distribution (Pie)

Automated report — generated by Python pipeline in 4.2s — 3 pages
📋 PDF report saved to reports/Q1_2024_Sales_Report.pdf — ready for executive distribution. Runs unattended via cron at 6:00 AM daily.
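
For an unattended daily run, the three scripts need a single entry point that stops at the first failure, so that a broken stage never produces a stale report. The following is a rough sketch under several assumptions: the runner file name `run_pipeline.py`, the stage order, and the crontab line in the comment are illustrative and not taken from the actual deployment.

```python
import subprocess
import sys
import time

# Assumed stage order, using the script names shown on this page.
STAGES = [
    ("etl", [sys.executable, "pipeline.py"]),
    ("charts", [sys.executable, "charts.py"]),
    ("report", [sys.executable, "report.py"]),
]

# Hypothetical crontab entry for a 6:00 AM daily run (paths are placeholders):
#   0 6 * * * cd /path/to/project && python3 run_pipeline.py >> pipeline.log 2>&1


def run_stages(stages):
    """Run each command in order; return True only if all exit with code 0."""
    started = time.perf_counter()
    for name, command in stages:
        result = subprocess.run(command)
        if result.returncode != 0:
            print(f"FAIL {name} exited with code {result.returncode}")
            return False
        print(f"PASS {name}")
    print(f"Finished in {time.perf_counter() - started:.1f}s")
    return True
```

A cron wrapper would typically end with `sys.exit(0 if run_stages(STAGES) else 1)`, so a failed run returns a non-zero exit code that monitoring can pick up.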
