Project Overview
This project implements a robust data pipeline to extract data from the **Open Brewery DB API**, transform it, and persist it into a structured data lake. The core objective is to demonstrate skills in API consumption, data transformation, orchestration with **Apache Airflow**, and big data processing with **Apache Spark**, following the **Medallion Architecture**.
🔗 **API Integration**: Fetches real-time data from external sources.
💨 **Spark Transformation**: Leverages PySpark for efficient, large-scale data processing.
🏛️ **Medallion Architecture**: Organizes data into Bronze, Silver, and Gold layers for quality and governance.
Data Lake Architecture: The Medallion Model
This project uses the Medallion Architecture to progressively refine data as it moves through three layers:
- **Bronze Layer**: Raw, immutable data.
- **Silver Layer**: Cleaned & structured data.
- **Gold Layer**: Aggregated, business-ready data.
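As a mental model, the three layers can be pictured as sibling directories in the data lake. The sketch below is purely illustrative; the root path and directory names are assumptions, and the actual locations depend on how the DAG and Spark jobs are configured.

```python
# Hypothetical data lake layout (paths are assumptions, not the project's actual config).
from pathlib import Path

DATA_LAKE_ROOT = Path("data_lake")

BRONZE_DIR = DATA_LAKE_ROOT / "bronze"   # raw API responses, stored as received (e.g. JSON)
SILVER_DIR = DATA_LAKE_ROOT / "silver"   # cleaned, structured records (e.g. Parquet)
GOLD_DIR = DATA_LAKE_ROOT / "gold"       # aggregated, business-ready tables
```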
Setup Guide
Follow these steps to set up and run the project locally.
Ensure you have **Linux** and **Visual Studio Code** (or your preferred IDE) installed.
Python 3.9 & Virtual Environment
Install Python 3.9 and the `venv` module:
sudo apt update
sudo apt install python3.9 python3.9-venv
Create and activate your virtual environment (from project root):
python3.9 -m venv venv
source venv/bin/activate
Install Python dependencies for Airflow and custom operators:
# requirements.txt
# Core Airflow and Providers
apache-airflow[postgres,celery,redis]==2.3.2
apache-airflow-providers-apache-spark==4.0.0
apache-airflow-providers-http==2.1.2
# Testing
pytest==8.4.1
Then install them:
pip install -r requirements.txt
Java Development Kit (JDK)
Spark 3.1.3 requires Java 8 or 11. Check your version and install if needed:
java -version
sudo apt install openjdk-11-jdk # or openjdk-8-jdk
Download and extract Spark 3.1.3 (pre-built for Hadoop 3.2):
wget https://archive.apache.org/dist/spark/spark-3.1.3/spark-3.1.3-bin-hadoop3.2.tgz
tar -xvzf spark-3.1.3-bin-hadoop3.2.tgz
Move the extracted folder (`spark-3.1.3-bin-hadoop3.2`) to your project root.
For Spark scripts, create `scripts/requirements_spark.txt`:
# scripts/requirements_spark.txt
pyspark==3.1.3
At runtime, PySpark itself is provided by the Spark distribution invoked via `spark-submit` (or is pre-installed in the Spark environment); this file mainly documents the expected version.
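To confirm the Spark installation works before wiring it into Airflow, you can submit a minimal PySpark job directly. This is an optional check, not part of the pipeline, and the file name is arbitrary.

```python
# smoke_test_spark.py -- optional sanity check for the local Spark installation.
# Run with: ./spark-3.1.3-bin-hadoop3.2/bin/spark-submit smoke_test_spark.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("smoke_test").getOrCreate()

# Build a tiny DataFrame and show it to prove the session starts and executes jobs.
df = spark.createDataFrame([("hello", 1), ("spark", 2)], ["word", "n"])
df.show()

spark.stop()
```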
Define Airflow and Python versions:
export AIRFLOW_VERSION=2.3.2
export PYTHON_VERSION=3.9
export CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
Install Apache Airflow (ensure virtual environment is active):
pip install "apache-airflow[postgres,celery,redis]==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"
Set the Airflow and Spark home directories from the project root (add the resolved absolute paths to `~/.bashrc` for persistence):
export AIRFLOW_HOME=$(pwd)/airflow_pipeline
export SPARK_HOME=$(pwd)/spark-3.1.3-bin-hadoop3.2/
Initialize Airflow Database and create an admin user:
airflow db init
airflow users create \
--username admin \
--firstname Admin \
--lastname User \
--role Admin \
--email admin@example.com \
--password admin
Optional: Configure Webserver for public access (for development only):
# Create or edit airflow_pipeline/webserver_config.py
# Add the following line:
AUTH_ROLE_PUBLIC = 'Admin'
Configure these connections in the Airflow UI (Admin > Connections):
`brewery_default` (HTTP Connection)
- Connection ID: `brewery_default`
- Connection Type: `HTTP`
- Host: `https://api.openbrewerydb.org`
`spark_default` (Spark Connection)
- Connection ID: `spark_default`
- Connection Type: `Spark`
- Host: `local`
- Extra (JSON):
{"spark_home": "/home/enduser/Desktop/Apache Airflow/spark-3.1.3-bin-hadoop3.2"}
1. Start Airflow in standalone mode (with `$AIRFLOW_HOME` exported):
airflow standalone
2. Access the Airflow UI at `http://localhost:8080`.
3. Locate `BreweryDAG` in the DAGs list and toggle it ON.
4. Trigger the DAG manually or wait for its scheduled run.
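For orientation, the sketch below shows how a DAG like `BreweryDAG` could chain the extraction and Spark steps using the connections configured above. It is an illustration only, not the repository's actual DAG: the task ids, schedule, endpoint, and script paths are assumptions.

```python
# dags/brewery_dag.py -- illustrative sketch, not the repository's actual DAG.
from datetime import datetime

from airflow import DAG
from airflow.providers.http.operators.http import SimpleHttpOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="BreweryDAG",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Bronze: call the Open Brewery DB API through the `brewery_default` HTTP connection.
    extract = SimpleHttpOperator(
        task_id="extract_breweries",
        http_conn_id="brewery_default",
        endpoint="/v1/breweries",   # endpoint path is an assumption
        method="GET",
        log_response=True,
    )

    # Silver: clean and structure the raw data via the `spark_default` connection.
    transform = SparkSubmitOperator(
        task_id="transform_breweries",
        conn_id="spark_default",
        application="scripts/silver_transformation.py",  # hypothetical script path
    )

    # Gold: aggregate the curated data into business-ready tables.
    aggregate = SparkSubmitOperator(
        task_id="aggregate_breweries",
        conn_id="spark_default",
        application="scripts/gold_aggregation.py",  # hypothetical script path
    )

    extract >> transform >> aggregate
```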
Pipeline Explorer: Understanding the Flow
This section details the individual scripts and components that make up the data pipeline. Each step is outlined below, followed by an illustrative sketch of what that stage typically looks like.
1. Data Extraction (Bronze Layer) 📥: raw brewery data is pulled from the Open Brewery DB API and persisted as-is (first sketch below).
2. Data Transformation (Silver Layer) ⚙️: the raw data is cleaned and structured with PySpark (second sketch below).
3. Data Aggregation (Gold Layer) 📊: the curated data is aggregated into business-ready views (third sketch below).
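The repository's own extraction script is the source of truth; as a hedged sketch of what Bronze ingestion of this API typically looks like (the output path, file naming, and page size are assumptions):

```python
# Illustrative Bronze-layer extraction: page through the Open Brewery DB API
# and persist each response unchanged. Paths and page size are assumptions.
import json
from pathlib import Path

import requests

API_URL = "https://api.openbrewerydb.org/v1/breweries"  # endpoint path is an assumption
BRONZE_DIR = Path("data_lake/bronze")
BRONZE_DIR.mkdir(parents=True, exist_ok=True)

page, per_page = 1, 200
while True:
    resp = requests.get(API_URL, params={"page": page, "per_page": per_page}, timeout=30)
    resp.raise_for_status()
    breweries = resp.json()
    if not breweries:  # an empty page means everything has been fetched
        break
    # Write as JSON Lines (one record per line) so Spark can read it directly.
    out_file = BRONZE_DIR / f"breweries_page_{page:04d}.jsonl"
    out_file.write_text("\n".join(json.dumps(b) for b in breweries))
    page += 1
```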
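A Silver-layer job then reads the raw files and enforces structure. The sketch below is illustrative only; the column selection, typing, and paths are assumptions rather than the project's actual logic.

```python
# Illustrative Silver-layer job: read raw Bronze records, clean and structure them,
# and write partitioned Parquet. Columns and paths are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("breweries_silver").getOrCreate()

raw = spark.read.json("data_lake/bronze/*.jsonl")

silver = (
    raw.select(
        "id", "name", "brewery_type", "city", "state", "country",
        F.col("longitude").cast("double"),
        F.col("latitude").cast("double"),
    )
    .dropDuplicates(["id"])
    .filter(F.col("brewery_type").isNotNull())
)

# Partitioning by country keeps downstream reads cheap for per-country analysis.
silver.write.mode("overwrite").partitionBy("country").parquet("data_lake/silver/breweries")

spark.stop()
```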
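Finally, a Gold-layer job aggregates the curated data into the kind of table the insights section below refers to (brewery counts by type and country). Again, paths and output names are assumptions:

```python
# Illustrative Gold-layer job: aggregate Silver data into brewery counts
# per type and country. Paths and output names are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("breweries_gold").getOrCreate()

silver = spark.read.parquet("data_lake/silver/breweries")

gold = (
    silver.groupBy("country", "brewery_type")
    .agg(F.count("id").alias("brewery_count"))
    .orderBy("country", "brewery_type")
)

gold.write.mode("overwrite").parquet("data_lake/gold/breweries_by_type_and_country")

spark.stop()
```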
Data Insights: Brewery Counts by Type & Country
This section provides a sample visualization of the aggregated data available in the Gold layer. The data represents a hypothetical count of breweries by type and country, demonstrating the kind of insights derived from the pipeline.
This chart is a representation of data from the Gold layer. Actual data may vary based on API updates and pipeline runs.