Data Lake Architecture Design¶
Important Note on Scope¶
⚠️ Scope Disclaimer: This architecture design is based on the case study requirements and my interpretation of the problem statement. I have made assumptions that may underestimate the current scope of operations at Ohpen. The architecture presented here should be viewed as a starting point that would require:
- Validation against actual data volumes and growth patterns
- Alignment with existing infrastructure and operational practices
- Refinement based on real-world compliance and regulatory requirements
- Collaboration with the Ohpen team to understand production constraints
This design demonstrates architectural thinking and best practices, but would need significant refinement for production deployment.
1. Lake Architecture Overview¶
The lake follows a Bronze / Silver / Gold medallion architecture model.
| Layer | Purpose | Mutability | Primary Consumers | Implementation Status |
|---|---|---|---|---|
| Bronze | Raw audit trail | Immutable | Platform engineers, audit | ✅ Implemented (Task 1) |
| Silver | Validated analytics | Append-only | Analysts, data scientists | ✅ Implemented (Task 1) |
| Gold | Business contracts | Versioned | Finance, BI, stakeholders | Task 2: Structure Design, Task 3: SQL Aggregation |
Note:
- Task 1 (ETL Pipeline): Implements Bronze → Silver transformation only (does not write Gold layer files)
- Task 2 (Architecture): Designs Gold layer structure (folder organization, governance, ownership)
- Task 3 (SQL): Demonstrates Silver → Gold aggregation patterns via SQL queries (shows how to create Gold data)
2. Folder Structure (S3)¶
We adopt a structured Data Lake organization to support scalability, reproducibility, and data quality.
Naming Convention Examples¶
s3://data-lake-bucket/
├── bronze/                                      # Bronze Layer (Raw, Immutable)
│   └── mortgages/
│       └── transactions/
│           └── ingest_date=2026-01-21/          # Partition by arrival time
│               └── run_id=20260121T120000Z/     # Run isolation for idempotency
│                   └── file.csv.gz              # Original source file (verbatim copy)
│
├── silver/                                      # Silver Layer (Validated & Enriched)
│   └── mortgages/
│       └── transactions/
│           └── year=2026/                       # Business-time partition (optimized for queries)
│               └── month=01/
│                   └── schema_v=v1/             # Schema version for evolution
│                       └── run_id=20260121T120000Z/  # Write-once run isolation
│                           ├── _SUCCESS              # Atomic completion marker
│                           └── part-00000.parquet    # Optimized storage format
│
├── quarantine/                                  # Invalid Data (Audit Trail)
│   └── mortgages/
│       └── transactions/
│           ├── ingest_date=2026-01-21/
│           │   └── run_id=20260121T120000Z/
│           │       └── invalid_rows.parquet     # Contains original data + validation_error column
│           └── condemned/                       # Condemned Data (No More Retries)
│               └── ingest_date=2026-01-21/
│                   └── run_id=20260121T120000Z/
│                       └── condemned_rows.parquet  # Max attempts exceeded or exact duplicates
│
└── gold/                                        # Gold Layer (Business Views)
    └── finance/
        └── account_balances/
            └── schema_v=v1/
                ├── _LATEST.json                 # Authority pointer to current run
                ├── run_id=20260121T120000Z/     # Historical run
                │   └── as_of_month=2024-03/
                │       └── part-00000.parquet
                └── current/                     # Stable prefix for SQL access
                    └── as_of_month=2024-03/
                        └── part-00000.parquet
Key Path Patterns¶
- Bronze: bronze/{domain}/{dataset}/ingest_date={YYYY-MM-DD}/run_id={...}/file.csv.gz
- Silver: silver/{domain}/{dataset}/year={YYYY}/month={MM}/schema_v={vN}/run_id={...}/part-*.parquet
- Quarantine: quarantine/{domain}/{dataset}/ingest_date={YYYY-MM-DD}/run_id={...}/invalid_rows.parquet
- Condemned: quarantine/{domain}/{dataset}/condemned/ingest_date={YYYY-MM-DD}/run_id={...}/condemned_rows.parquet
- Gold: gold/{domain}/{dataset}/schema_v={vN}/run_id={...}/as_of_month={YYYY-MM}/part-*.parquet
- Gold (current): gold/{domain}/{dataset}/schema_v={vN}/current/{partition}/part-*.parquet
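For illustration, a minimal Python sketch of how these key patterns could be generated. The helper names (new_run_id, bronze_key, and so on) and the bucket value are hypothetical and not taken from the Task 1 pipeline code:

```python
from datetime import datetime, timezone

# Hypothetical helpers for building the key patterns listed above.
# Function names and the example values are illustrative only.

def new_run_id() -> str:
    """Timestamp-based run_id, e.g. 20260121T120000Z."""
    return datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")

def bronze_key(domain: str, dataset: str, ingest_date: str, run_id: str, filename: str) -> str:
    return f"bronze/{domain}/{dataset}/ingest_date={ingest_date}/run_id={run_id}/{filename}"

def silver_prefix(domain: str, dataset: str, year: int, month: int, schema_v: str, run_id: str) -> str:
    return (f"silver/{domain}/{dataset}/year={year}/month={month:02d}/"
            f"schema_v={schema_v}/run_id={run_id}/")

def gold_prefix(domain: str, dataset: str, schema_v: str, run_id: str, as_of_month: str) -> str:
    return (f"gold/{domain}/{dataset}/schema_v={schema_v}/"
            f"run_id={run_id}/as_of_month={as_of_month}/")

run_id = new_run_id()
print(bronze_key("mortgages", "transactions", "2026-01-21", run_id, "file.csv.gz"))
print(silver_prefix("mortgages", "transactions", 2026, 1, "v1", run_id))
print(gold_prefix("finance", "account_balances", "v1", run_id, "2024-03"))
```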
Key Design Choices¶
- run_id Isolation: Every ETL run writes to a unique path. This prevents partial overwrites and allows safe backfills (reprocessing old months without immediately deleting current data).
- Quarantine Layer: We never silently drop data. Bad rows (schema violations, invalid currencies) are preserved for audit and debugging with metadata enrichment (row_hash, attempt_count, retry_history).
- Condemned Layer: Rows exceeding the maximum attempt count (≥3) or exact duplicates are moved to the condemned layer for permanent retention (no automatic retries). Retention is 7 years minimum for compliance/audit. Human approval is required before deletion (see HUMAN_VALIDATION_POLICY.md).
- Loop Prevention: TransactionID deduplication (Silver layer scan), metadata enrichment (row_hash, attempt_count), duplicate detection in quarantine history, and attempt-limit enforcement prevent infinite retry loops and ensure transaction-level idempotency.
- Circuit Breaker: The pipeline halts automatically when more than 100 occurrences of the same error appear within a 1-hour window, requiring human intervention.
- Partitioning:
  - Bronze is partitioned by ingest_date (when the data arrived).
  - Silver is partitioned by year/month (business time, derived from TransactionTimestamp) for query performance.
  - Gold is partitioned by business dimensions (e.g., as_of_month) for reporting.
- Schema Versioning: All Silver and Gold paths include schema_v={version} to support schema evolution without breaking consumers.
- Object store reality (operational safety):
  - Treat bucket listings/partition discovery as eventually consistent across engines/caches and S3-compatible implementations.
  - Avoid “rename as commit” patterns (there is no true atomic rename for directories).
  - Use write-then-publish: write to a run-isolated path and publish a _SUCCESS marker (and optionally a manifest) as the commit signal.
3. Safe Publishing Pattern & Authority¶
Write-Then-Publish Pattern¶
Each dataset run follows this pattern:
- Writes data files to a run_id-scoped prefix.
- Writes a manifest.json describing the output (optional but recommended).
- Writes a _SUCCESS marker once complete.
Consumers only read runs with _SUCCESS markers present.
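As a minimal sketch of this pattern, assuming a bucket name and a Silver run prefix chosen for illustration (the real job writes the Parquet data files via Spark, not boto3): the producer publishes the manifest and _SUCCESS marker last, and the consumer checks for the marker before reading.

```python
import json
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "data-lake-bucket"  # illustrative bucket name
run_prefix = "silver/mortgages/transactions/year=2026/month=01/schema_v=v1/run_id=20260121T120000Z/"

def publish_run(row_count: int) -> None:
    # 1. Data files are assumed to already exist under run_prefix.
    # 2. Optional manifest describing the output.
    manifest = {"row_count": row_count, "format": "parquet"}
    s3.put_object(Bucket=BUCKET, Key=run_prefix + "manifest.json",
                  Body=json.dumps(manifest).encode("utf-8"))
    # 3. The _SUCCESS marker is the commit signal, written last.
    s3.put_object(Bucket=BUCKET, Key=run_prefix + "_SUCCESS", Body=b"")

def run_is_complete(prefix: str) -> bool:
    """Consumers only read runs where the _SUCCESS marker exists."""
    try:
        s3.head_object(Bucket=BUCKET, Key=prefix + "_SUCCESS")
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] in ("404", "NoSuchKey"):
            return False
        raise
```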
Failure Mode Analysis - _SUCCESS Marker¶
- If the _SUCCESS marker is missing: Consumers cannot distinguish complete runs from partial/failed runs, leading to incomplete queries or query failures. Partial data may be read, causing incorrect analytics results.
- Impact: Data quality issues, incorrect reporting, potential business decisions based on incomplete data.
- Mitigation: ETL jobs only write _SUCCESS after all data writes complete successfully. Consumers implement strict checks to ignore runs without _SUCCESS.
Authority & "Current" Data¶
Two mechanisms are used to manage authoritative datasets:
_LATEST.json (Control Plane)¶
A small control-plane object pointing to the authoritative run.
Location: gold/{domain}/{dataset}/schema_v={vN}/_LATEST.json
Content:
{
  "run_id": "20260121T120000Z",
  "schema_version": "v1",
  "published_at": "2026-01-21T12:00:00Z",
  "partition_path": "gold/finance/account_balances/schema_v=v1/run_id=20260121T120000Z"
}
Used for:
- Promotion: Update _LATEST.json to point to a new run after validation (see the sketch below).
- Rollback: Point _LATEST.json back to a previous run if issues are discovered.
- Reproducibility: Track which run is currently authoritative.
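A minimal promotion/rollback sketch using boto3; the bucket name is an assumption, and the fields mirror the example _LATEST.json shown above. A single put_object replaces the pointer in one call, so readers see either the old or the new pointer, never a partial one.

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "data-lake-bucket"  # illustrative
LATEST_KEY = "gold/finance/account_balances/schema_v=v1/_LATEST.json"

def promote(run_id: str, published_at: str) -> None:
    """Point the authority file at a validated run."""
    pointer = {
        "run_id": run_id,
        "schema_version": "v1",
        "published_at": published_at,
        "partition_path": f"gold/finance/account_balances/schema_v=v1/run_id={run_id}",
    }
    s3.put_object(Bucket=BUCKET, Key=LATEST_KEY,
                  Body=json.dumps(pointer, indent=2).encode("utf-8"),
                  ContentType="application/json")

def current_run() -> dict:
    """Read the authoritative pointer."""
    obj = s3.get_object(Bucket=BUCKET, Key=LATEST_KEY)
    return json.loads(obj["Body"].read())

# Rollback is the same operation pointed at a previously validated run, e.g.:
# promote("20260114T120000Z", "2026-01-21T14:00:00Z")
```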
Failure Mode Analysis - _LATEST.json Pointer¶
- If _LATEST.json is missing: No authoritative pointer to the current dataset version exists, making promotion/rollback impossible and requiring manual discovery of the latest run. Automated workflows cannot determine which dataset version to use.
- Impact: Inability to promote new datasets, roll back to previous versions, or automate dataset version management. Manual intervention is required for all dataset updates.
- Mitigation: _LATEST.json is updated atomically after validation. Promotion workflows verify that _LATEST.json exists before proceeding. Fallback mechanisms can query _SUCCESS markers to discover the latest run if needed (see the sketch below).
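A sketch of that fallback: if _LATEST.json is lost, the most recent completed run can be rediscovered by scanning run_id prefixes for _SUCCESS markers. The bucket and dataset prefix are illustrative assumptions.

```python
from typing import Optional

import boto3

s3 = boto3.client("s3")
BUCKET = "data-lake-bucket"  # illustrative
DATASET_PREFIX = "gold/finance/account_balances/schema_v=v1/"

def latest_successful_run() -> Optional[str]:
    """Return the newest run_id that has a _SUCCESS marker, or None if none exist."""
    successful = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=DATASET_PREFIX):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/_SUCCESS") and "run_id=" in key:
                successful.append(key.split("run_id=")[1].split("/")[0])
    # run_ids are timestamp-based, so the lexicographic maximum is the newest run.
    return max(successful) if successful else None
```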
current/ Prefix (Stable SQL Access)¶
A stable location for Amazon Athena queries via AWS Glue Data Catalog tables.
Pattern: gold/{domain}/{dataset}/schema_v={vN}/current/{partition}/part-*.parquet
Mechanism:
- A new run writes to a run_id={...}/ path.
- After validation, copy/symlink to the current/ prefix.
- AWS Glue Data Catalog tables point to current/ for stable SQL semantics in Amazon Athena.
Benefits:
- Stable SQL semantics for analysts (no need to update table locations).
- Full lineage and rollback via run history (all run_id paths preserved).
- Separation of concerns: current/ for queries, run_id/ for audit.
Example Flow:
1. Write: gold/.../schema_v=v1/run_id=20260121T120000Z/as_of_month=2024-03/part-00000.parquet
2. Validate: Check _SUCCESS, row counts, schema compliance
3. Publish: Copy to gold/.../schema_v=v1/current/as_of_month=2024-03/part-00000.parquet (see the sketch below)
4. Update: Write _LATEST.json pointing to run_id=20260121T120000Z
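The publish step (step 3) amounts to copying every object from the validated run prefix to the same relative key under current/. A boto3 sketch, with illustrative bucket and prefixes:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "data-lake-bucket"  # illustrative

def publish_to_current(run_prefix: str, current_prefix: str) -> None:
    """Copy every object under run_prefix to the same relative key under current_prefix."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=run_prefix):
        for obj in page.get("Contents", []):
            relative = obj["Key"][len(run_prefix):]
            s3.copy_object(
                Bucket=BUCKET,
                Key=current_prefix + relative,
                CopySource={"Bucket": BUCKET, "Key": obj["Key"]},
            )

publish_to_current(
    "gold/finance/account_balances/schema_v=v1/run_id=20260121T120000Z/",
    "gold/finance/account_balances/schema_v=v1/current/",
)
```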
4. Schema Evolution Strategy¶
As business requirements change (e.g., adding TransactionType), our architecture must adapt without downtime.
Schema is a contract (not just columns)¶
In financial domains (mortgages today, investments long-term), schema changes are integration changes:
- Producers and consumers evolve independently (different deploy cycles).
- Data is often retained for years (audit + replay), so consumers must handle old shapes indefinitely.
- The biggest risk is not “parsing fails”, but semantic drift (a field keeps the same name but changes meaning).
This means we manage schema evolution on two layers:
- Lake/analytics schema evolution (tables in Glue/Athena/Iceberg).
- Integration/event contract evolution (APIs/events between services). The same compatibility discipline applies even if payloads are JSON/Avro/Protobuf.
Compatibility policy (recommended default)¶
- Backward-compatible reads: updated consumers can read both old and new data.
- Forward-compatible writes where possible: updated producers may emit additional fields without breaking older consumers.
- No “dangerous defaults”: avoid defaults that change interpretation (finance examples: default interest rate, default risk score, default LTV). Prefer NULL/missing and an explicit “unknown”.
Allowed changes (safe / zero-downtime)¶
- Add a new nullable/optional field (e.g., TransactionType) and treat missing values as "unknown" (see the sketch below).
- Deprecate before remove: keep the field, stop populating it, document the replacement, and remove it only after all consumers have migrated.
- Add new derived views instead of changing semantics: if a definition changes (e.g., the definition of "balance"), publish a new view/table/metric name.
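A PySpark sketch of the first rule, assuming a Silver transactions run written before TransactionType was introduced; the path and session setup are illustrative only.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("optional-field-demo").getOrCreate()

# A run written before TransactionType existed (path is illustrative).
df = spark.read.parquet(
    "s3://data-lake-bucket/silver/mortgages/transactions/year=2025/month=12/schema_v=v1/"
)

# Align older data with the newer shape: add the column as null if it is absent...
if "TransactionType" not in df.columns:
    df = df.withColumn("TransactionType", F.lit(None).cast("string"))

# ...and treat missing values as an explicit "unknown" rather than a dangerous default.
df = df.withColumn("TransactionType", F.coalesce(F.col("TransactionType"), F.lit("unknown")))
```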
Breaking changes (playbooks)¶
Some changes are fundamentally breaking (semantics or security), and require explicit playbooks:
- Rolling migration via flags (dual-read / dual-logic)
  - Add a new field plus a version/flag (e.g., encryption_version, calculation_version).
  - Consumers implement both paths; producers start emitting the new version gradually.
  - Works when keeping old data is acceptable and you can support both versions for a period.
- Versioned streams/tables (v2) + migration
  - Create a ..._v2 topic/table/view.
  - Backfill/migrate old records into v2 (e.g., re-encrypt, re-calculate) when old data must not remain.
  - Switch consumers to v2, then producers, then retire v1.
  - This is the safer option for security-driven changes (e.g., compromised encryption) or meaning changes where “dual interpretation” is risky.
Replayability constraints (long retention)¶
If events/records can be replayed years later:
- New fields must remain optional; consumers must tolerate missing fields.
- Renames are treated as breaking unless you keep the old field as an alias during a transition.
- Any change that requires “all historical rows updated” implies either:
- an Iceberg snapshot/backfill plan, or
- a versioned v2 dataset.
Current Implementation: Parquet-only¶
Current Status: The ETL pipeline uses Parquet-only format. This is sufficient for current scale and requirements.
Parquet Schema Evolution Strategy:
- Additive Changes Only: New columns are added to the end of the schema.
- Schema Registry: The Glue Data Catalog is the source of truth.
- Backfill on Demand: If old data needs the new column populated, AWS Step Functions triggers an AWS Glue reprocessing job (reading raw, writing to a new run_id in processed) and updates the AWS Glue Data Catalog pointer.
Future Enhancement: Iceberg Tables (Optional)¶
Note: Iceberg is a future consideration, not currently implemented.
For larger scale (10M+ transactions/month), Apache Iceberg tables via AWS Glue could provide:
- Advanced schema evolution (add/rename/reorder columns without rewriting files)
- Hidden partitioning (automatic partition management)
- Time travel queries (query historical snapshots)
Current Approach: Parquet-only with schema versioning (schema_v=v1, schema_v=v2) provides sufficient schema evolution for current needs.
Type Change Example¶
Changing a column type (e.g., amount from string to decimal(18,2)) requires a new schema version and backfill:
- Create schema_v=v2: Update the Glue Catalog table definition with the new type under the schema_v=v2 path.
- Backfill Historical Data: Reprocess all historical partitions to convert amount from string to decimal, writing to schema_v=v2/run_id=BACKFILL_YYYYMMDD/ (sketched below).
- Update Consumers: Migrate all consumers to query schema_v=v2 tables.
- Deprecate v1: After the migration period, mark schema_v=v1 as deprecated (but preserve it for audit).
- Why Breaking: Type changes are breaking because existing queries expecting string will fail on decimal, and the data semantics change (e.g., string "100.50" vs decimal 100.50).
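A hedged PySpark sketch of the backfill step: cast amount from string to decimal(18,2) and write under the new schema version and a backfill run_id. Paths are illustrative; a real job would also pin currency handling and rounding rules.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("amount-type-backfill").getOrCreate()

# Read one historical partition written under schema_v=v1 (path is illustrative).
v1 = spark.read.parquet(
    "s3://data-lake-bucket/silver/mortgages/transactions/year=2024/month=03/schema_v=v1/"
)

# Breaking change: amount moves from string to decimal(18,2) under schema_v=v2.
v2 = v1.withColumn("amount", F.col("amount").cast("decimal(18,2)"))

# Write-once: the backfill lands in its own run_id path; nothing under v1 is touched.
v2.write.parquet(
    "s3://data-lake-bucket/silver/mortgages/transactions/"
    "year=2024/month=03/schema_v=v2/run_id=BACKFILL_20260128/"
)
```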
Domain notes: mortgages and investments¶
- Prefer immutable facts + effective dating: rather than overwriting a contract attribute, append a new record/event with effective_from (and optionally effective_to).
- Treat amendments as new facts: interest rate resets, refinancing, contract changes, and corporate actions should be modeled as new events/types/tables rather than redefining old fields.
- Money & precision are part of the contract: keep currency explicit; avoid changing precision/rounding rules without a versioned definition (and often a v2 dataset).
Handling schema drift (runtime != documented)¶
- Fail fast or quarantine non-conforming records (never silently coerce).
- Emit drift metrics (unknown fields, missing required fields, new enum values) and trigger a review.
- For lake ingestion, quarantine is the safety valve; for integrations/events, schema validation at the boundary is the goal.
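An illustrative PySpark sketch of this drift handling at ingestion. The expected column set, currency whitelist, and error label are assumptions for the example, not the pipeline's actual validation rules.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("drift-check").getOrCreate()

# Assumed contract for the example.
EXPECTED = {"TransactionID", "AccountID", "Amount", "Currency", "TransactionTimestamp"}

df = spark.read.option("header", True).csv(
    "s3://data-lake-bucket/bronze/mortgages/transactions/"
)

unknown_fields = set(df.columns) - EXPECTED
missing_fields = EXPECTED - set(df.columns)
if missing_fields:
    # Required columns absent for the whole file: fail fast instead of silently coercing.
    raise ValueError(f"Schema drift: missing required fields {missing_fields}")

# Row-level checks: non-conforming rows go to quarantine with an error label.
valid_currency = F.col("Currency").isin("EUR", "USD", "GBP")
invalid = df.filter(~valid_currency | F.col("Currency").isNull()) \
            .withColumn("validation_error", F.lit("invalid_currency"))
valid = df.filter(valid_currency)

# Drift metrics (e.g., unknown field names, quarantined row count) would be emitted to monitoring here.
print({"unknown_fields": sorted(unknown_fields), "quarantined_rows": invalid.count()})
```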
5. Backfills & Reprocessing¶
- Backfills recompute only affected partitions (e.g., reprocess year=2024/month=03).
- New outputs are written under a new run_id (e.g., run_id=20260128_BACKFILL).
- Promotion occurs by updating _LATEST.json and current/ after validation.
- No historical data is overwritten; all run_id paths are preserved for audit.
Example Backfill Flow:
1. Identify issue: Data quality problem in March 2024
2. Reprocess: Read bronze/.../ingest_date=2024-03-*/..., write to silver/.../schema_v=v1/run_id=20260128_BACKFILL/year=2024/month=03/...
3. Validate: Check _SUCCESS, row counts, quality metrics (see the sketch below)
4. Promote: Update _LATEST.json, copy to current/
5. Notify: Alert consumers of updated dataset
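A small boto3 sketch of the validation step before promotion, assuming the optional manifest.json from section 3 carries a row_count and sits next to the _SUCCESS marker; the bucket, prefix, and threshold are illustrative.

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "data-lake-bucket"  # illustrative
BACKFILL_PREFIX = ("silver/mortgages/transactions/schema_v=v1/"
                   "run_id=20260128_BACKFILL/year=2024/month=03/")

def backfill_is_promotable(expected_min_rows: int) -> bool:
    """Promote only if the backfill run is complete and its row count looks sane."""
    listing = s3.list_objects_v2(Bucket=BUCKET, Prefix=BACKFILL_PREFIX)
    keys = {obj["Key"] for obj in listing.get("Contents", [])}
    if BACKFILL_PREFIX + "_SUCCESS" not in keys:
        return False  # incomplete run: never promote
    manifest = json.loads(
        s3.get_object(Bucket=BUCKET, Key=BACKFILL_PREFIX + "manifest.json")["Body"].read()
    )
    return manifest.get("row_count", 0) >= expected_min_rows

if backfill_is_promotable(expected_min_rows=1):
    print("Safe to update _LATEST.json and copy to current/")
```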
6. Ownership & Governance (Task 2)¶
| Layer | Owner | Steward | Responsibility | Implementation Status |
|---|---|---|---|---|
| Bronze | Data Platform Team | Ingestion Lead | Immutability, raw data preservation, ingestion reliability | ✅ Implemented (Task 1) |
| Silver | Domain Teams | Domain Analyst | Validation logic, schema evolution, data quality | ✅ Implemented (Task 1) |
| Gold | Business (Finance) | Finance Controller | Business contracts, reporting accuracy, stakeholder communication | Task 2: Structure Design, Task 3: SQL Aggregation |
Note:
- Task 1 (ETL Pipeline): Implements Bronze → Silver transformation only (does not write Gold layer files)
- Task 2 (Architecture): Designs Gold layer structure and governance (folder organization, ownership model)
- Task 3 (SQL): Demonstrates Silver → Gold aggregation patterns via SQL queries (shows how to create Gold data)
Rules¶
- Bronze is immutable and restricted: Only platform team can write; no direct business user access. ✅ Implemented (Task 1)
- Silver schema changes require domain + platform review: Domain team proposes, platform team implements and validates. ✅ Implemented (Task 1)
- Gold changes require explicit approval and versioning: The Finance Controller approves schema changes; all changes are versioned via schema_v. (Task 2: Structure Design, Task 3: SQL Aggregation Pattern)
Governance Workflow:
- Schema change request → Domain Analyst (Silver) or Finance Controller (Gold).
- Review → Platform team assesses technical feasibility.
- Approval → Domain/Business owner approves.
- Implementation → Platform team implements with a new schema_v.
- Validation → Quality checks, backfill if needed.
- Promotion → Update _LATEST.json and current/ after validation (Gold layer only; not implemented in Task 1).
Note: Promotion workflow (_LATEST.json, current/ prefix) is designed for Gold layer in Task 2 (Architecture). Task 3 (SQL) demonstrates how to create Gold data via aggregation. Task 1 (ETL Pipeline) only implements Bronze → Silver transformation and does not write Gold layer files.
7. Architecture Diagram¶
Note: This diagram shows the complete architecture design (Task 2).
- Task 1 (ETL Pipeline): Implements Bronze → Silver transformation only (✅ Implemented)
- Task 2 (Architecture): Designs Gold layer structure (📋 Design Only)
- Task 3 (SQL): Demonstrates Silver → Gold aggregation patterns via SQL queries
flowchart TD
    subgraph source [Source]
        Bronze[Bronze_Layer<br/>S3 Raw<br/>✅ Task 1]
    end
    subgraph lake [Data_Lake]
        Bronze --> Glue[AWS Glue Job<br/>PySpark ETL<br/>✅ Task 1]
        Glue -->|"Valid_Data"| Silver[Silver_Layer<br/>S3 Processed<br/>✅ Task 1]
        Glue -->|"Invalid_Data"| Quarantine[Quarantine<br/>S3 Quarantine<br/>✅ Task 1]
        Silver --> Gold[Gold_Layer<br/>S3 Aggregates<br/>📋 Task 2 Design]
        Gold --> Latest[🕐 _LATEST.json<br/>📋 Task 2 Design]
        Gold --> Current[🕐 current/<br/>📋 Task 2 Design]
    end
    subgraph analytics [Analytics]
        Catalog[Glue Data Catalog<br/>Table Metadata<br/>✅ Task 1] --> Athena[Amazon Athena<br/>SQL Query Engine<br/>✅ Task 1]
        Athena --> BI[BI_Consumers]
    end
    Silver -.->|"Registers Partitions"| Catalog
    Current -.->|"Registers Partitions"| Catalog
    style Bronze fill:#f57c00,color:#fff
    style Silver fill:#388e3c,stroke:#7b1fa2,stroke-width:2px,color:#fff
    style Gold fill:#1976d2,color:#fff
    style Quarantine fill:#d32f2f,color:#fff
    style Glue fill:#7b1fa2,color:#fff
    style Latest fill:#757575,color:#fff
    style Current fill:#757575,color:#fff
    style Catalog fill:#424242,color:#fff
    style Athena fill:#388e3c,color:#fff
Athena & Glue Separation¶
- Amazon Athena is a query engine that executes SQL queries. It reads Parquet files from S3 via Glue Catalog metadata. SQL queries execute on Athena compute infrastructure, not directly on S3 storage.
- AWS Glue Data Catalog stores table schemas, partition metadata, and table locations. Glue does not execute queries; it provides metadata that Athena uses to locate and read Parquet files from S3.
- Separation of Concerns: Glue = metadata/catalog layer (where tables are defined), Athena = query execution engine (where SQL runs). This separation allows multiple query engines (Athena, Spark, Presto) to use the same Glue Catalog metadata.
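A small boto3 sketch of this separation: Glue answers "where is the table and what is its schema", while Athena executes the SQL. The database, table, column, and query-results location are assumptions for the example.

```python
import boto3

glue = boto3.client("glue")
athena = boto3.client("athena")

# Metadata/catalog layer: look up the table definition (no data is read here).
table = glue.get_table(DatabaseName="finance", Name="account_balances")
print(table["Table"]["StorageDescriptor"]["Location"])  # e.g. an s3://.../gold/.../current/ path

# Query execution layer: Athena reads the Parquet files the catalog points to.
query = athena.start_query_execution(
    QueryString=(
        "SELECT as_of_month, SUM(balance) AS total_balance "
        "FROM account_balances GROUP BY as_of_month"
    ),
    QueryExecutionContext={"Database": "finance"},
    ResultConfiguration={"OutputLocation": "s3://data-lake-bucket/athena-results/"},
)
print(query["QueryExecutionId"])
```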
Note: the diagram source is also available in diagram.mmd.
8. Failure Mode Analysis¶
This section explicitly documents what breaks if critical components are removed or fail, demonstrating system resilience understanding.
Component Failure Scenarios¶
If _SUCCESS Marker is Missing¶
Impact: Consumers cannot identify complete runs, leading to incomplete queries or query failures.
- Symptom: Partial data may be read, causing incorrect analytics results
- Root Cause: ETL job failed before writing _SUCCESS, or the marker was accidentally deleted
- Detection: Consumers check for _SUCCESS before reading; monitoring alerts on missing markers
- Mitigation: ETL jobs only write _SUCCESS after all data writes complete successfully. Consumers implement strict checks to ignore runs without _SUCCESS. Failed runs are automatically retried.
If _LATEST.json is Missing¶
Impact: No authoritative pointer to current dataset, making promotion/rollback impossible.
- Symptom: Automated workflows cannot determine which dataset version to use; manual discovery required
- Root Cause: Promotion workflow failed, or file was accidentally deleted
- Detection: Promotion workflows verify that _LATEST.json exists before proceeding
- Mitigation: _LATEST.json is updated atomically after validation. Fallback mechanisms can query _SUCCESS markers to discover the latest run if needed. Version control tracks _LATEST.json changes.
If Glue Data Catalog is Missing¶
Impact: Athena cannot locate tables, all SQL queries fail.
- Symptom: All Athena queries return "Table not found" errors
- Root Cause: Glue Catalog table definition deleted, or table location points to non-existent path
- Detection: Query failures trigger immediate alerts; catalog health checks monitor table existence
- Mitigation: Catalog definitions are version-controlled in Terraform. Automated tests verify catalog tables exist before deployment. Backup/restore procedures for catalog metadata.
If Quarantine Layer is Removed¶
Impact: Invalid data is silently dropped, audit trail lost.
- Symptom: Data quality issues go undetected; no way to investigate or recover invalid records
- Root Cause: Quarantine write failures, or quarantine layer access denied
- Detection: Monitoring alerts on quarantine write failures; data quality metrics track expected vs actual row counts
- Mitigation: Quarantine writes are mandatory and fail the ETL job if they fail. Quarantine layer has separate IAM policies to prevent accidental deletion. Regular audits verify quarantine data retention.
If Bronze Layer is Overwritten (Immutability Violation)¶
Impact: Historical audit trail lost; backfills become impossible.
- Symptom: Cannot reproduce historical reports; audit compliance failures
- Root Cause: Accidental overwrite, or immutability controls bypassed
- Detection: S3 versioning alerts on overwrites; CloudTrail logs track all write operations
- Mitigation: Bronze layer uses S3 versioning and object lock. IAM policies prevent overwrites (only append allowed). Regular backups of critical partitions.
If run_id Isolation is Removed¶
Impact: Reruns overwrite previous outputs, causing data loss and making backfills unsafe.
- Symptom: Historical run outputs disappear; cannot rollback to previous versions
- Root Cause: ETL job writes to a fixed path instead of a run_id-scoped path
- Detection: Monitoring detects duplicate run_id paths or missing run isolation
- Mitigation: run_id generation is mandatory and timestamp-based (unique per run). ETL code enforces run_id in all output paths. Code reviews verify run isolation patterns.
If Schema Versioning is Removed¶
Impact: Schema changes break existing consumers; no backward compatibility.
- Symptom: Existing queries fail after schema changes; data type mismatches
- Root Cause: Schema changes applied without versioning; consumers not migrated
- Detection: Query failures after schema changes; schema drift monitoring
- Mitigation: All schema changes require a new schema_v version. Migration playbooks ensure consumers are updated before old versions are deprecated. A schema registry tracks all versions.
System Resilience Principles¶
- Fail-Safe Defaults: System fails in a safe state (no partial data published) rather than silently corrupting data.
- Defense in Depth: Multiple layers of validation (ETL validation, catalog checks, consumer checks) prevent single points of failure.
- Auditability: All operations are logged (CloudTrail, CloudWatch) enabling post-incident analysis.
- Immutability: Critical layers (Bronze) are append-only, preventing accidental data loss.
- Idempotency: Reruns are safe; each run writes to unique paths preventing overwrites.