Ohpen Data Engineer Case Study (2026) — Implementation Pack

This project contains a structured, interview-ready submission for the Ohpen case study:

Financial Data Pipeline and Data Lake Optimization

Project Structure

ohpen-case-2026/
├── README.md                    # This file
├── .gitignore                   # Git ignore rules
├── Makefile                     # Test automation
├── docs/                        # All documentation
│   ├── EXECUTIVE_SUMMARY.md
│   ├── BUSINESS_CASE_SUMMARY.md
│   ├── SUBMISSION_GUIDE.md
│   └── ... (other docs)
├── config/                      # Setup & configuration
├── docker/                      # Docker test files
│   ├── docker-compose.test.yml
│   └── Dockerfile.test
├── dist/                        # Build artifacts & archives
└── tasks/                       # Task implementations
    ├── 01_data_ingestion_transformation/  # Python ETL
    ├── 02_data_lake_architecture_design/  # Architecture
    ├── 03_sql/                  # SQL solution
    ├── 04_devops_cicd/          # CI/CD workflow
    └── 05_communication_documentation/  # Stakeholder comms

Task Structure

  • tasks/01_data_ingestion_transformation/ — Python ETL (CSV in S3 → validate → Parquet partitions)
  • tasks/02_data_lake_architecture_design/ — architecture + schema evolution
  • tasks/03_sql/ — SQL solution for month-end balance history
  • tasks/04_devops_cicd/ — CI/CD workflow + IaC artifacts list + Terraform stubs
  • tasks/05_communication_documentation/ — stakeholder email + one-page technical doc
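
The Task 1 flow (CSV → validate → Parquet partitions, with rejected rows quarantined) can be pictured with a minimal, standard-library-only sketch. It is an illustration, not the code in ingest_transactions.py: the column names, the validation rules, and the year=/month= partition layout are all assumptions, and writing Parquet to S3 is left out.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime
from decimal import Decimal, InvalidOperation

# Hypothetical column names; the real schema is defined in the Task 1 code.
REQUIRED = ("transaction_id", "account_id", "amount", "timestamp")


def validate_row(row):
    """Return (ok, reason). A row passes if every required field is present and parses."""
    for field in REQUIRED:
        if not (row.get(field) or "").strip():
            return False, f"missing {field}"
    try:
        Decimal(row["amount"])
    except InvalidOperation:
        return False, "amount is not numeric"
    try:
        datetime.fromisoformat(row["timestamp"])
    except ValueError:
        return False, "timestamp is not parseable"
    return True, ""


def split_and_partition(csv_text):
    """Group valid rows by an assumed year=/month= key; collect invalid rows separately."""
    partitions = defaultdict(list)
    quarantine = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        ok, reason = validate_row(row)
        if not ok:
            # Keep the original row and record why it was rejected.
            quarantine.append({**row, "error": reason})
            continue
        ts = datetime.fromisoformat(row["timestamp"])
        partitions[f"year={ts.year}/month={ts.month:02d}"].append(row)
    return dict(partitions), quarantine
```

In a real run, each partition key would presumably map to a Parquet file under the output prefix and the quarantine list to a file under the quarantine prefix.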

Overview diagram

flowchart TD
    Start[Repo] --> Tasks[tasks/]
    Start --> Infra[infra/local]
    Tasks --> T1[01_data_ingestion_transformation]
    Tasks --> T2[02_data_lake_architecture_design]
    Tasks --> T3[03_sql]
    Tasks --> T4[04_devops_cicd]
    Tasks --> T5[05_communication_documentation]
    Infra --> LocalTest[Local_Testing_Infra]
    LocalTest --> RunETL[Run_ETL_Script]
    RunETL --> Outputs[Processed_Parquet_And_Quarantine]

    style Start fill:#424242,color:#fff
    style Tasks fill:#424242,color:#fff
    style Infra fill:#424242,color:#fff
    style T1 fill:#7b1fa2,color:#fff
    style T2 fill:#1976d2,color:#fff
    style T3 fill:#388e3c,color:#fff
    style T4 fill:#5c6bc0,color:#fff
    style T5 fill:#5c6bc0,color:#fff
    style LocalTest fill:#66bb6a,color:#111
    style RunETL fill:#7b1fa2,color:#fff
    style Outputs fill:#388e3c,stroke:#7b1fa2,stroke-width:2px,color:#fff

Quick start (local demo for Task 1)

From the project root:

cd /path/to/ohpen-case-2026

# (Optional) Create a virtual environment and install dependencies

python3 -m venv .venv
source .venv/bin/activate
pip install -r tasks/01_data_ingestion_transformation/requirements.txt

# Run the ETL: read the CSV from S3, validate it, and write Parquet back to S3.
# Set AWS credentials and S3 bucket names via environment variables or CLI arguments first.

python3 tasks/01_data_ingestion_transformation/src/etl/ingest_transactions.py \
  --input-bucket ohpen-raw \
  --input-key transactions/transactions.csv \
  --output-bucket ohpen-processed \
  --output-prefix processed/transactions \
  --quarantine-bucket ohpen-quarantine \
  --quarantine-prefix quarantine/transactions
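
For orientation, here is a minimal sketch of how a script might accept the flags used in the command above, with bucket names falling back to environment variables. The flag names come from the command; the environment variable names (INPUT_BUCKET, OUTPUT_BUCKET, QUARANTINE_BUCKET) and the helper functions are hypothetical and may not match the actual script.

```python
import argparse
import os


def _env_or_required(env_name):
    """Use the environment variable as the default if set; otherwise require the flag."""
    value = os.environ.get(env_name)
    return {"default": value} if value else {"required": True}


def build_parser():
    parser = argparse.ArgumentParser(
        description="Ingest a transactions CSV from S3 and write validated Parquet."
    )
    # Flag names mirror the quick-start command; env var names are assumptions.
    parser.add_argument("--input-bucket", **_env_or_required("INPUT_BUCKET"))
    parser.add_argument("--input-key", required=True)
    parser.add_argument("--output-bucket", **_env_or_required("OUTPUT_BUCKET"))
    parser.add_argument("--output-prefix", required=True)
    parser.add_argument("--quarantine-bucket", **_env_or_required("QUARANTINE_BUCKET"))
    parser.add_argument("--quarantine-prefix", required=True)
    return parser
```

With this shape, a bucket flag passed on the command line always wins over the environment variable, and a bucket supplied by neither produces a usage error instead of a late failure.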

Deliverables checklist

  • Task 1: tasks/01_data_ingestion_transformation/src/etl/ingest_transactions.py + edge cases/assumptions in tasks/01_data_ingestion_transformation/ASSUMPTIONS_AND_EDGE_CASES.md
  • Task 2: tasks/02_data_lake_architecture_design/architecture.md + diagram.mmd
  • Task 3: tasks/03_sql/balance_history_2024_q1.sql
  • Task 4: tasks/04_devops_cicd/ (workflow + artifacts + terraform)
  • Task 5: tasks/05_communication_documentation/ (email + one-pager)
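
As a rough illustration of what "month-end balance history" in Task 3 could involve, the sketch below computes a running balance per account at the end of each month. It is not a translation of balance_history_2024_q1.sql: the input shape (account, date, amount), a starting balance of zero, and the treatment of months without activity (carry the previous balance forward) are all assumptions.

```python
from collections import defaultdict
from datetime import date
from decimal import Decimal


def month_end_balances(transactions, months):
    """Return {(account_id, year, month): balance at month end}.

    transactions: iterable of (account_id, date, Decimal amount) tuples (assumed shape).
    months: (year, month) pairs in ascending order; transactions outside them are ignored.
    """
    net = defaultdict(Decimal)  # net movement per account per month
    accounts = set()
    for account_id, tx_date, amount in transactions:
        net[(account_id, tx_date.year, tx_date.month)] += amount
        accounts.add(account_id)

    result = {}
    for account_id in sorted(accounts):
        running = Decimal("0")  # assumed starting balance
        for year, month in months:
            # Months with no activity carry the previous balance forward.
            running += net.get((account_id, year, month), Decimal("0"))
            result[(account_id, year, month)] = running
    return result
```

Emitting a row for every account and month, including months with no transactions, is the part that usually needs care in SQL as well, since a plain GROUP BY drops empty months.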