# Executive Summary (Ohpen Data Engineer Case Study)

## Overview

### Important Note on Scope and Assumptions
⚠️ Scope Disclaimer: This solution is designed based on the case study requirements and my interpretation of the problem statement. I have made assumptions that may underestimate the current scope of operations at Ohpen. The architecture and implementation presented here should be viewed as a starting point that would need refinement based on:
- Actual transaction volumes and data growth patterns
- Real-world compliance and regulatory requirements
- Existing infrastructure and operational constraints
- Team size, skills, and operational maturity
- Business priorities and budget constraints
This solution demonstrates architectural thinking and technical capabilities, but would require significant collaboration with the Ohpen team to align with production realities.
## Summary
This submission implements a finance-grade, history-safe data pipeline for transaction analytics using a cloud-native architecture built on AWS object storage (S3) and serverless compute and query services (AWS Glue, Amazon Athena):
- Task 1: Python ETL implementations (Pandas and PySpark) read CSV from S3, validate the data, write partitioned Parquet, quarantine invalid rows, and manage condemned data (rows exceeding max attempts or exact duplicates) with 7-year retention for compliance. Optional TransactionID deduplication via Silver-layer scanning is included; a minimal sketch of the validate/quarantine flow follows this list.
- Task 2: A lake architecture (Raw/Processed/Aggregates + Quarantine + Condemned) designed for auditability and scalable Athena analytics, with an Iceberg-first schema evolution protocol.
- Task 3: A performant SQL query for month-end balance history (Q1 2024) designed for large tables (100M rows) with partition pruning in mind.
- Task 4: CI/CD design and artifacts to validate changes, provision infrastructure (Terraform), and deploy safely with backfill support and monitoring. EventBridge schedules daily runs (2 AM UTC), Step Functions orchestrates the pipeline with automatic retries (3 attempts, exponential backoff), and Glue executes the ETL job.
- Task 5: A stakeholder-facing status email and a one-page technical summary.
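To make the Task 1 flow concrete, the sketch below shows the core validate / quarantine / partitioned-write steps in the Pandas variant. It is a simplified illustration: the bucket names, validation rules, and any column other than `TransactionID` and `TransactionTimestamp` are assumptions, and the actual logic (including condemned-data handling and bounded retries) lives in `ingest_transactions.py`.

```python
# Simplified sketch of the Task 1 validate -> quarantine -> partitioned-write flow
# (Pandas variant). Bucket names, validation rules, and columns other than
# TransactionID / TransactionTimestamp are illustrative assumptions.
import pandas as pd

RAW_URI = "s3://example-raw/transactions/ingest_date=2024-01-31/transactions.csv"   # hypothetical
PROCESSED_URI = "s3://example-processed/transactions/"                              # hypothetical
QUARANTINE_URI = "s3://example-quarantine/transactions/run_id=20240131T020000Z/"    # hypothetical

df = pd.read_csv(RAW_URI)

# Minimal validation: the event timestamp must parse and TransactionID must be present.
df["TransactionTimestamp"] = pd.to_datetime(df["TransactionTimestamp"], errors="coerce")
valid_mask = df["TransactionTimestamp"].notna() & df["TransactionID"].notna()
valid, invalid = df[valid_mask].copy(), df[~valid_mask].copy()

# Invalid rows are quarantined (isolated per run) for inspection and bounded retries.
if not invalid.empty:
    invalid.to_parquet(QUARANTINE_URI + "invalid.parquet", index=False)

# Valid rows are partitioned by business/event time and written as Parquet.
valid["year"] = valid["TransactionTimestamp"].dt.year
valid["month"] = valid["TransactionTimestamp"].dt.month
valid.to_parquet(PROCESSED_URI, partition_cols=["year", "month"], index=False)
```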
## Assumptions
Note: The assumptions below are based on the case study requirements. These may underestimate the actual scope and complexity of Ohpen's current operations, and would need validation with the team.
- Batch-first ingestion: Streaming is not required by the case; it can be added upstream if latency requirements change.
- History matters: Raw data is treated as immutable; curated outputs are reproducible via run isolation (`run_id`) and deterministic transforms.
- Negative amounts are allowed: withdrawals/refunds are valid; fraud/outlier detection is out of scope.
- Time semantics: ingestion time (`ingest_date`) is used for raw landing; business/event time (`TransactionTimestamp`) drives processed partitions (see the partition-layout sketch after this list).
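A minimal sketch of how these conventions might translate into S3 key layout. The bucket names and exact prefix structure are assumptions for illustration only; the real layout is defined in the Task 2 architecture document.

```python
# Illustrative sketch of the time-semantics and run-isolation conventions above.
# Bucket names and the exact prefix layout are assumptions for this example only.
from datetime import datetime, timezone

now = datetime.now(timezone.utc)
run_id = now.strftime("%Y%m%dT%H%M%SZ")

# Raw landing is keyed by ingestion time (ingest_date) and never rewritten.
raw_key = f"s3://example-raw/transactions/ingest_date={now.date().isoformat()}/transactions.csv"

# Processed output is keyed by business/event time (TransactionTimestamp) and isolated per run.
event_ts = datetime(2024, 1, 31, 14, 5, tzinfo=timezone.utc)  # example event time
processed_key = (
    "s3://example-processed/transactions/"
    f"year={event_ts.year}/month={event_ts.month:02d}/run_id={run_id}/part-00000.parquet"
)

print(raw_key)
print(processed_key)
```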
## Trade-offs (intentional)
- Fast validation over full platform realism: SQL correctness is validated with DuckDB tests for speed (a minimal test sketch follows this list); production would add engine-specific integration tests.
- ETL engine options: Both Pandas (for development/testing) and PySpark (recommended for production) implementations are provided. PySpark offers distributed processing and vectorized operations, and is optimized for AWS Glue execution.
- Template config: We ship a `config.yaml` template to demonstrate code/config separation, but keep runtime wiring out of scope for the case.
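For illustration, a DuckDB-based correctness test for the month-end balance logic could look like the sketch below. The table layout, sample rows, and query are deliberately simplified assumptions, not the production query, which lives in `tasks/03_sql/balance_history_2024_q1.sql`.

```python
# Minimal sketch of a DuckDB test for month-end balance logic. Table layout, column
# names, and the query are simplified assumptions; the real query is in
# tasks/03_sql/balance_history_2024_q1.sql.
import duckdb

con = duckdb.connect()
con.execute("""
    CREATE TABLE transactions AS
    SELECT * FROM (VALUES
        (1, 'ACC-1', TIMESTAMP '2024-01-10 09:00:00', 100.00),
        (2, 'ACC-1', TIMESTAMP '2024-01-25 12:00:00', -40.00),
        (3, 'ACC-1', TIMESTAMP '2024-02-03 08:30:00',  10.00)
    ) AS t(TransactionID, AccountID, TransactionTimestamp, Amount)
""")

# Month-end balance = cumulative sum of amounts through the end of each month.
rows = con.execute("""
    WITH monthly AS (
        SELECT AccountID,
               date_trunc('month', TransactionTimestamp) AS month_start,
               SUM(Amount) AS month_total
        FROM transactions
        WHERE TransactionTimestamp < TIMESTAMP '2024-04-01'
        GROUP BY 1, 2
    )
    SELECT AccountID, month_start,
           SUM(month_total) OVER (PARTITION BY AccountID ORDER BY month_start) AS month_end_balance
    FROM monthly
    ORDER BY AccountID, month_start
""").fetchall()

assert rows[0][2] == 60.00   # Jan 2024: 100 - 40
assert rows[1][2] == 70.00   # Feb 2024: 60 + 10
```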
## Scope boundaries (what we intentionally did not build)
- No real-time trading / millisecond SLAs.
- TransactionID deduplication is implemented via Silver-layer scanning (an optional feature that prevents duplicate processing across runs), but there is no full enterprise deduplication strategy; that would require a defined business key and idempotency contract. A minimal sketch of the Silver-layer scan follows this list.
- No FX conversion or valuation framework (requires rates, cutoffs, rounding policy).
- No full governance stack implementation (Lake Formation policies, lineage tooling), but the design supports it.
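As a rough illustration of the Silver-layer scan mentioned above, the PySpark sketch below drops incoming rows whose `TransactionID` already exists in the processed (Silver) data via a left anti join. The paths, partition columns, and exact scanning strategy are assumptions; the real implementation is in the Task 1 ETL code.

```python
# Hedged sketch of optional TransactionID deduplication against the Silver layer.
# Paths and the scanning strategy are illustrative assumptions only.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("silver-dedup-sketch").getOrCreate()

incoming = spark.read.parquet("s3://example-staging/transactions/run_id=20240131T020000Z/")  # hypothetical
silver = spark.read.parquet("s3://example-processed/transactions/")                          # hypothetical

# Keep only TransactionIDs not already present in the Silver layer (left anti join),
# so re-running a batch does not produce duplicate rows downstream.
seen_ids = silver.select("TransactionID").distinct()
new_rows = incoming.join(seen_ids, on="TransactionID", how="left_anti")

# Partition by business/event time before appending to the Silver layer.
new_rows = (new_rows
            .withColumn("year", F.year("TransactionTimestamp"))
            .withColumn("month", F.month("TransactionTimestamp")))
new_rows.write.mode("append").partitionBy("year", "month").parquet(
    "s3://example-processed/transactions/"
)
```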
## Where to find deliverables

- Task 1 ETL (Pandas): `tasks/01_data_ingestion_transformation/src/etl/ingest_transactions.py`
- Task 1 ETL (PySpark): `tasks/01_data_ingestion_transformation/src/etl/ingest_transactions_spark.py` (recommended for production)
- Task 1 assumptions/edge cases: `tasks/01_data_ingestion_transformation/ASSUMPTIONS_AND_EDGE_CASES.md`
- Task 2 architecture: `tasks/02_data_lake_architecture_design/architecture.md`
- Task 3 SQL: `tasks/03_sql/balance_history_2024_q1.sql`
- Task 4 CI/CD: `tasks/04_devops_cicd/cicd_workflow.md` (+ `tasks/04_devops_cicd/.github/workflows/ci.yml`)
- Task 4 Infrastructure: `tasks/04_devops_cicd/infra/terraform/main.tf` (includes EventBridge, Step Functions, Glue)
- Task 5 docs: `tasks/05_communication_documentation/`
- Interview handout: `HANDOUT.md`
- Testing guide: `TESTING.md`