# Ohpen Data Engineer Case Study (2026) — Implementation Pack

This project contains a structured, interview-ready submission for the Ohpen case study:
**Financial Data Pipeline and Data Lake Optimization**
## Project Structure

```
ohpen-case-2026/
├── README.md                                 # This file
├── .gitignore                                # Git ignore rules
├── Makefile                                  # Test automation
├── docs/                                     # All documentation
│   ├── EXECUTIVE_SUMMARY.md
│   ├── BUSINESS_CASE_SUMMARY.md
│   ├── SUBMISSION_GUIDE.md
│   └── ... (other docs)
├── config/                                   # Setup & configuration
├── docker/                                   # Docker test files
│   ├── docker-compose.test.yml
│   └── Dockerfile.test
├── dist/                                     # Build artifacts & archives
└── tasks/                                    # Task implementations
    ├── 01_data_ingestion_transformation/     # Python ETL
    ├── 02_data_lake_architecture_design/     # Architecture
    ├── 03_sql/                               # SQL solution
    ├── 04_devops_cicd/                       # CI/CD workflow
    └── 05_communication_documentation/       # Stakeholder comms
```
## Task Structure

- `tasks/01_data_ingestion_transformation/` — Python ETL (CSV in S3 → validate → Parquet partitions)
- `tasks/02_data_lake_architecture_design/` — architecture + schema evolution
- `tasks/03_sql/` — SQL solution for month-end balance history
- `tasks/04_devops_cicd/` — CI/CD workflow + IaC artifacts list + Terraform stubs
- `tasks/05_communication_documentation/` — stakeholder email + one-page technical doc
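The Task 1 flow (CSV in S3 → validate → Parquet partitions) can be sketched roughly as below. This is a minimal illustration only: the column names (`transaction_id`, `account_id`, `amount`, `transaction_date`) and validation rules are assumptions for the sketch, not the actual schema or logic of `ingest_transactions.py`.

```python
# Illustrative validate-and-partition step. Column names and rules
# are assumptions, not the real Task 1 schema.
import pandas as pd

def split_valid_invalid(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Return (valid_rows, quarantined_rows) based on simple checks:
    non-null IDs, numeric amount, parseable date."""
    mask = (
        df["transaction_id"].notna()
        & df["account_id"].notna()
        & pd.to_numeric(df["amount"], errors="coerce").notna()
        & pd.to_datetime(df["transaction_date"], errors="coerce").notna()
    )
    return df[mask].copy(), df[~mask].copy()

def write_partitioned(valid: pd.DataFrame, out_dir: str) -> None:
    """Write valid rows as Parquet, partitioned by transaction date."""
    valid = valid.assign(
        dt=pd.to_datetime(valid["transaction_date"]).dt.strftime("%Y-%m-%d")
    )
    valid.to_parquet(out_dir, partition_cols=["dt"])
```

Rows failing any check would land in the quarantine prefix instead of the processed output, which mirrors the quarantine bucket/prefix arguments shown in the quick start below.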
## Overview diagram

```mermaid
flowchart TD
    Start[Repo] --> Tasks[tasks/]
    Start --> Infra[infra/local]
    Tasks --> T1[01_data_ingestion_transformation]
    Tasks --> T2[02_data_lake_architecture_design]
    Tasks --> T3[03_sql]
    Tasks --> T4[04_devops_cicd]
    Tasks --> T5[05_communication_documentation]
    Infra --> LocalTest[Local_Testing_Infra]
    LocalTest --> RunETL[Run_ETL_Script]
    RunETL --> Outputs[Processed_Parquet_And_Quarantine]
    style Start fill:#424242,color:#fff
    style Tasks fill:#424242,color:#fff
    style Infra fill:#424242,color:#fff
    style T1 fill:#7b1fa2,color:#fff
    style T2 fill:#1976d2,color:#fff
    style T3 fill:#388e3c,color:#fff
    style T4 fill:#5c6bc0,color:#fff
    style T5 fill:#5c6bc0,color:#fff
    style LocalTest fill:#66bb6a,color:#111
    style RunETL fill:#7b1fa2,color:#fff
    style Outputs fill:#388e3c,stroke:#7b1fa2,stroke-width:2px,color:#fff
```
## Quick start (local demo for Task 1)

From the project root:

```bash
cd /home/stephen/projects/ohpen-case-2026

# (Optional) create a venv and install deps
python3 -m venv .venv
source .venv/bin/activate
pip install -r tasks/01_data_ingestion_transformation/requirements.txt

# Run the ETL: S3 -> validated Parquet -> S3
# Set AWS credentials and S3 bucket names as environment variables or CLI args
python3 tasks/01_data_ingestion_transformation/src/etl/ingest_transactions.py \
  --input-bucket ohpen-raw \
  --input-key transactions/transactions.csv \
  --output-bucket ohpen-processed \
  --output-prefix processed/transactions \
  --quarantine-bucket ohpen-quarantine \
  --quarantine-prefix quarantine/transactions
```
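For readers wiring up a similar CLI, an `argparse` skeleton matching the flags in the invocation above might look like this. It is a sketch under the assumption that all six flags are required; the real script's argument handling may differ.

```python
# Sketch of a CLI parser mirroring the flags shown in the quick start.
# Assumes all flags are required; the actual script may allow env vars.
import argparse

def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(
        description="Ingest transactions CSV from S3 into validated Parquet"
    )
    p.add_argument("--input-bucket", required=True)
    p.add_argument("--input-key", required=True)
    p.add_argument("--output-bucket", required=True)
    p.add_argument("--output-prefix", required=True)
    p.add_argument("--quarantine-bucket", required=True)
    p.add_argument("--quarantine-prefix", required=True)
    return p
```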
## Deliverables checklist

- Task 1: `tasks/01_data_ingestion_transformation/src/etl/ingest_transactions.py` + edge cases/assumptions in `tasks/01_data_ingestion_transformation/ASSUMPTIONS_AND_EDGE_CASES.md`
- Task 2: `tasks/02_data_lake_architecture_design/architecture.md` + `diagram.mmd`
- Task 3: `tasks/03_sql/balance_history_2024_q1.sql`
- Task 4: `tasks/04_devops_cicd/` (workflow + artifacts + Terraform)
- Task 5: `tasks/05_communication_documentation/` (email + one-pager)