Data Lake Architecture Design¶
Important Note on Scope¶
⚠️ Scope Disclaimer: This architecture design is based on the case study requirements and my interpretation of the problem statement. I have made assumptions that may underestimate the current scope of operations at Ohpen. The architecture presented here should be viewed as a starting point that would require:
- Validation against actual data volumes and growth patterns
- Alignment with existing infrastructure and operational practices
- Refinement based on real-world compliance and regulatory requirements
- Collaboration with the Ohpen team to understand production constraints
This design demonstrates architectural thinking and best practices, but would need significant refinement for production deployment.
1. Lake Architecture Overview¶
The lake follows a Bronze / Silver / Gold medallion architecture model.
| Layer | Purpose | Mutability | Primary Consumers | Implementation Status |
|---|---|---|---|---|
| Bronze | Raw audit trail | Immutable | Platform engineers, audit | ✅ Implemented (Task 1) |
| Silver | Validated analytics | Append-only | Analysts, data scientists | ✅ Implemented (Task 1) |
| Gold | Business contracts | Versioned | Finance, BI, stakeholders | Task 2: Structure Design, Task 3: SQL Aggregation |
Note:
- Task 1 (ETL Pipeline): Implements Bronze → Silver transformation only (does not write Gold layer files)
- Task 2 (Architecture): Designs Gold layer structure (folder organization, governance, ownership)
- Task 3 (SQL): Demonstrates Silver → Gold aggregation patterns via SQL queries (shows how to create Gold data)
2. Folder Structure (S3)¶
We adopt a structured Data Lake organization to support scalability, reproducibility, and data quality.
Naming Convention Examples¶
s3://data-lake-bucket/
├── bronze/                                      # Bronze Layer (Raw, Immutable)
│   └── mortgages/
│       └── transactions/
│           └── ingest_date=2026-01-21/          # Partition by arrival time
│               └── run_id=20260121T120000Z/     # Run isolation for idempotency
│                   └── file.csv.gz              # Original source file (verbatim copy)
│
├── silver/                                      # Silver Layer (Validated & Enriched)
│   └── mortgages/
│       └── transactions/
│           └── year=2026/                       # Business-time partition (optimized for queries)
│               └── month=01/
│                   └── schema_v=v1/             # Schema version for evolution
│                       └── run_id=20260121T120000Z/  # Write-once run isolation
│                           ├── _SUCCESS              # Atomic completion marker
│                           └── part-00000.parquet    # Optimized storage format
│
├── quarantine/                                  # Invalid Data (Audit Trail)
│   └── mortgages/
│       └── transactions/
│           ├── ingest_date=2026-01-21/
│           │   └── run_id=20260121T120000Z/
│           │       └── invalid_rows.parquet     # Contains original data + validation_error column
│           └── condemned/                       # Condemned Data (No More Retries)
│               └── ingest_date=2026-01-21/
│                   └── run_id=20260121T120000Z/
│                       └── condemned_rows.parquet  # Max attempts exceeded or exact duplicates
│
└── gold/                                        # Gold Layer (Business Views)
    └── finance/
        └── account_balances/
            └── schema_v=v1/
                ├── _LATEST.json                 # Authority pointer to current run
                ├── run_id=20260121T120000Z/     # Historical run
                │   └── as_of_month=2024-03/
                │       └── part-00000.parquet
                └── current/                     # Stable prefix for SQL access
                    └── as_of_month=2024-03/
                        └── part-00000.parquet
Key Path Patterns¶
- Bronze: bronze/{domain}/{dataset}/ingest_date={YYYY-MM-DD}/run_id={...}/file.csv.gz
- Silver: silver/{domain}/{dataset}/year={YYYY}/month={MM}/schema_v={vN}/run_id={...}/part-*.parquet
- Quarantine: quarantine/{domain}/{dataset}/ingest_date={YYYY-MM-DD}/run_id={...}/invalid_rows.parquet
- Condemned: quarantine/{domain}/{dataset}/condemned/ingest_date={YYYY-MM-DD}/run_id={...}/condemned_rows.parquet
- Gold: gold/{domain}/{dataset}/schema_v={vN}/run_id={...}/as_of_month={YYYY-MM}/part-*.parquet
- Gold (current): gold/{domain}/{dataset}/schema_v={vN}/current/{partition}/part-*.parquet
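For illustration, a minimal Python sketch of how these key patterns could be generated. The helper names (new_run_id, bronze_key, and so on) and the bucket value are hypothetical and not taken from the Task 1 pipeline code:

```python
from datetime import datetime, timezone

# Hypothetical helpers for building the key patterns listed above.
# Function names and the example values are illustrative only.

def new_run_id() -> str:
    """Timestamp-based run_id, e.g. 20260121T120000Z."""
    return datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")

def bronze_key(domain: str, dataset: str, ingest_date: str, run_id: str, filename: str) -> str:
    return f"bronze/{domain}/{dataset}/ingest_date={ingest_date}/run_id={run_id}/{filename}"

def silver_prefix(domain: str, dataset: str, year: int, month: int, schema_v: str, run_id: str) -> str:
    return (f"silver/{domain}/{dataset}/year={year}/month={month:02d}/"
            f"schema_v={schema_v}/run_id={run_id}/")

def gold_prefix(domain: str, dataset: str, schema_v: str, run_id: str, as_of_month: str) -> str:
    return (f"gold/{domain}/{dataset}/schema_v={schema_v}/"
            f"run_id={run_id}/as_of_month={as_of_month}/")

run_id = new_run_id()
print(bronze_key("mortgages", "transactions", "2026-01-21", run_id, "file.csv.gz"))
print(silver_prefix("mortgages", "transactions", 2026, 1, "v1", run_id))
print(gold_prefix("finance", "account_balances", "v1", run_id, "2024-03"))
```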
Key Design Choices¶
- run_id Isolation: Every ETL run writes to a unique path. This prevents partial overwrites and allows safe backfills (reprocessing old months without immediately deleting current data).
- Quarantine Layer: We never silently drop data. Bad rows (schema violations, invalid currencies) are preserved for audit and debugging with metadata enrichment (row_hash, attempt_count, retry_history).
- Condemned Layer: Rows exceeding the maximum attempt count (≥3) or exact duplicates are moved to the condemned layer for permanent retention (no automatic retries). Retention is 7 years minimum for compliance/audit. Human approval is required before deletion (see HUMAN_VALIDATION_POLICY.md).
- Loop Prevention: TransactionID deduplication (Silver layer scan), metadata enrichment (row_hash, attempt_count), duplicate detection in quarantine history, and attempt-limit enforcement prevent infinite retry loops and ensure transaction-level idempotency.
- Circuit Breaker: The pipeline halts automatically when more than 100 occurrences of the same error appear within a 1-hour window, requiring human intervention.
- Partitioning:
  - Bronze is partitioned by ingest_date (when the data arrived).
  - Silver is partitioned by year/month (business time, derived from TransactionTimestamp) for query performance.
  - Gold is partitioned by business dimensions (e.g., as_of_month) for reporting.
- Schema Versioning: All Silver and Gold paths include schema_v={version} to support schema evolution without breaking consumers.
- Object store reality (operational safety):
  - Treat bucket listings/partition discovery as eventually consistent across engines/caches and S3-compatible implementations.
  - Avoid “rename as commit” patterns (there is no true atomic rename for directories).
  - Use write-then-publish: write to a run-isolated path and publish a _SUCCESS marker (and optionally a manifest) as the commit signal.
3. Safe Publishing Pattern & Authority¶
Write-Then-Publish Pattern¶
Each dataset run follows this pattern:
- Writes data files to a run_id-scoped prefix.
- Writes a manifest.json describing the output (optional but recommended).
- Writes a _SUCCESS marker once complete.
Consumers only read runs with _SUCCESS markers present.
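As a minimal sketch of this pattern, assuming a bucket name and a Silver run prefix chosen for illustration (the real job writes the Parquet data files via Spark, not boto3): the producer publishes the manifest and _SUCCESS marker last, and the consumer checks for the marker before reading.

```python
import json
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "data-lake-bucket"  # illustrative bucket name
run_prefix = "silver/mortgages/transactions/year=2026/month=01/schema_v=v1/run_id=20260121T120000Z/"

def publish_run(row_count: int) -> None:
    # 1. Data files are assumed to already exist under run_prefix.
    # 2. Optional manifest describing the output.
    manifest = {"row_count": row_count, "format": "parquet"}
    s3.put_object(Bucket=BUCKET, Key=run_prefix + "manifest.json",
                  Body=json.dumps(manifest).encode("utf-8"))
    # 3. The _SUCCESS marker is the commit signal, written last.
    s3.put_object(Bucket=BUCKET, Key=run_prefix + "_SUCCESS", Body=b"")

def run_is_complete(prefix: str) -> bool:
    """Consumers only read runs where the _SUCCESS marker exists."""
    try:
        s3.head_object(Bucket=BUCKET, Key=prefix + "_SUCCESS")
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] in ("404", "NoSuchKey"):
            return False
        raise
```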
Failure Mode Analysis - _SUCCESS Marker¶
- If the _SUCCESS marker is missing: Consumers cannot distinguish complete runs from partial/failed runs, leading to incomplete queries or query failures. Partial data may be read, causing incorrect analytics results.
- Impact: Data quality issues, incorrect reporting, potential business decisions based on incomplete data.
- Mitigation: ETL jobs only write _SUCCESS after all data writes complete successfully. Consumers implement strict checks to ignore runs without _SUCCESS.
Authority & "Current" Data¶
Two mechanisms are used to manage authoritative datasets:
_LATEST.json (Control Plane)¶
A small control-plane object pointing to the authoritative run.
Location: gold/{domain}/{dataset}/schema_v={vN}/_LATEST.json
Content:
{
  "run_id": "20260121T120000Z",
  "schema_version": "v1",
  "published_at": "2026-01-21T12:00:00Z",
  "partition_path": "gold/finance/account_balances/schema_v=v1/run_id=20260121T120000Z"
}
Used for:
- Promotion: Update _LATEST.json to point to a new run after validation (see the sketch below).
- Rollback: Point _LATEST.json back to a previous run if issues are discovered.
- Reproducibility: Track which run is currently authoritative.
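A minimal promotion/rollback sketch using boto3; the bucket name is an assumption, and the fields mirror the example _LATEST.json shown above. A single put_object replaces the pointer in one call, so readers see either the old or the new pointer, never a partial one.

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "data-lake-bucket"  # illustrative
LATEST_KEY = "gold/finance/account_balances/schema_v=v1/_LATEST.json"

def promote(run_id: str, published_at: str) -> None:
    """Point the authority file at a validated run."""
    pointer = {
        "run_id": run_id,
        "schema_version": "v1",
        "published_at": published_at,
        "partition_path": f"gold/finance/account_balances/schema_v=v1/run_id={run_id}",
    }
    s3.put_object(Bucket=BUCKET, Key=LATEST_KEY,
                  Body=json.dumps(pointer, indent=2).encode("utf-8"),
                  ContentType="application/json")

def current_run() -> dict:
    """Read the authoritative pointer."""
    obj = s3.get_object(Bucket=BUCKET, Key=LATEST_KEY)
    return json.loads(obj["Body"].read())

# Rollback is the same operation pointed at a previously validated run, e.g.:
# promote("20260114T120000Z", "2026-01-21T14:00:00Z")
```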
Failure Mode Analysis - _LATEST.json Pointer¶
- If _LATEST.json is missing: No authoritative pointer to the current dataset version exists, making promotion/rollback impossible and requiring manual discovery of the latest run. Automated workflows cannot determine which dataset version to use.
- Impact: Inability to promote new datasets, roll back to previous versions, or automate dataset version management. Manual intervention is required for all dataset updates.
- Mitigation: _LATEST.json is updated atomically after validation. Promotion workflows verify that _LATEST.json exists before proceeding. Fallback mechanisms can query _SUCCESS markers to discover the latest run if needed (see the sketch below).
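A sketch of that fallback: if _LATEST.json is lost, the most recent completed run can be rediscovered by scanning run_id prefixes for _SUCCESS markers. The bucket and dataset prefix are illustrative assumptions.

```python
from typing import Optional

import boto3

s3 = boto3.client("s3")
BUCKET = "data-lake-bucket"  # illustrative
DATASET_PREFIX = "gold/finance/account_balances/schema_v=v1/"

def latest_successful_run() -> Optional[str]:
    """Return the newest run_id that has a _SUCCESS marker, or None if none exist."""
    successful = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=DATASET_PREFIX):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/_SUCCESS") and "run_id=" in key:
                successful.append(key.split("run_id=")[1].split("/")[0])
    # run_ids are timestamp-based, so the lexicographic maximum is the newest run.
    return max(successful) if successful else None
```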
current/ Prefix (Stable SQL Access)¶
A stable location for Amazon Athena queries via AWS Glue Data Catalog tables.
Pattern: gold/{domain}/{dataset}/schema_v={vN}/current/{partition}/part-*.parquet
Mechanism:
- A new run writes to a run_id={...}/ path.
- After validation, copy/symlink to the current/ prefix.
- AWS Glue Data Catalog tables point to current/ for stable SQL semantics in Amazon Athena.
Benefits:
- Stable SQL semantics for analysts (no need to update table locations).
- Full lineage and rollback via run history (all run_id paths preserved).
- Separation of concerns: current/ for queries, run_id/ for audit.
Example Flow:
1. Write: gold/.../schema_v=v1/run_id=20260121T120000Z/as_of_month=2024-03/part-00000.parquet
2. Validate: Check _SUCCESS, row counts, schema compliance
3. Publish: Copy to gold/.../schema_v=v1/current/as_of_month=2024-03/part-00000.parquet (see the sketch below)
4. Update: Write _LATEST.json pointing to run_id=20260121T120000Z
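The publish step (step 3) amounts to copying every object from the validated run prefix to the same relative key under current/. A boto3 sketch, with illustrative bucket and prefixes:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "data-lake-bucket"  # illustrative

def publish_to_current(run_prefix: str, current_prefix: str) -> None:
    """Copy every object under run_prefix to the same relative key under current_prefix."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=run_prefix):
        for obj in page.get("Contents", []):
            relative = obj["Key"][len(run_prefix):]
            s3.copy_object(
                Bucket=BUCKET,
                Key=current_prefix + relative,
                CopySource={"Bucket": BUCKET, "Key": obj["Key"]},
            )

publish_to_current(
    "gold/finance/account_balances/schema_v=v1/run_id=20260121T120000Z/",
    "gold/finance/account_balances/schema_v=v1/current/",
)
```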
4. Schema Evolution Strategy¶
As business requirements change (e.g., adding TransactionType), our architecture must adapt without downtime.
Schema is a contract (not just columns)¶
In financial domains (mortgages today, investments long-term), schema changes are integration changes:
- Producers and consumers evolve independently (different deploy cycles).
- Data is often retained for years (audit + replay), so consumers must handle old shapes indefinitely.
- The biggest risk is not “parsing fails”, but semantic drift (a field keeps the same name but changes meaning).
This means we manage schema evolution on two layers:
- Lake/analytics schema evolution (tables in Glue/Athena/Iceberg).
- Integration/event contract evolution (APIs/events between services). The same compatibility discipline applies even if payloads are JSON/Avro/Protobuf.
Compatibility policy (recommended default)¶
- Backward-compatible reads: updated consumers can read both old and new data.
- Forward-compatible writes where possible: updated producers may emit additional fields without breaking older consumers.
- No “dangerous defaults”: avoid defaults that change interpretation (finance examples: default interest rate, default risk score, default LTV). Prefer NULL/missing and an explicit “unknown”.
Allowed changes (safe / zero-downtime)¶
- Add a new nullable/optional field (e.g., TransactionType) and treat missing values as "unknown" (see the sketch below).
- Deprecate before remove: keep the field, stop populating it, document the replacement, and remove it only after all consumers have migrated.
- Add new derived views instead of changing semantics: if a definition changes (e.g., the definition of "balance"), publish a new view/table/metric name.
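A PySpark sketch of the first rule, assuming a Silver transactions run written before TransactionType was introduced; the path and session setup are illustrative only.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("optional-field-demo").getOrCreate()

# A run written before TransactionType existed (path is illustrative).
df = spark.read.parquet(
    "s3://data-lake-bucket/silver/mortgages/transactions/year=2025/month=12/schema_v=v1/"
)

# Align older data with the newer shape: add the column as null if it is absent...
if "TransactionType" not in df.columns:
    df = df.withColumn("TransactionType", F.lit(None).cast("string"))

# ...and treat missing values as an explicit "unknown" rather than a dangerous default.
df = df.withColumn("TransactionType", F.coalesce(F.col("TransactionType"), F.lit("unknown")))
```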
Breaking changes (playbooks)¶
Some changes are fundamentally breaking (semantics or security), and require explicit playbooks:
- Rolling migration via flags (dual-read / dual-logic)
  - Add a new field plus a version/flag (e.g., encryption_version, calculation_version).
  - Consumers implement both paths; producers start emitting the new version gradually.
  - Works when keeping old data is acceptable and you can support both versions for a period.
- Versioned streams/tables (v2) + migration
  - Create a ..._v2 topic/table/view.
  - Backfill/migrate old records into v2 (e.g., re-encrypt, re-calculate) when old data must not remain.
  - Switch consumers to v2, then producers, then retire v1.
  - This is the safer option for security-driven changes (e.g., compromised encryption) or meaning changes where “dual interpretation” is risky.
Replayability constraints (long retention)¶
If events/records can be replayed years later:
- New fields must remain optional; consumers must tolerate missing fields.
- Renames are treated as breaking unless you keep the old field as an alias during a transition.
- Any change that requires “all historical rows updated” implies either:
- an Iceberg snapshot/backfill plan, or
- a versioned v2 dataset.
Current Implementation: Parquet-only¶
Current Status: The ETL pipeline uses Parquet-only format. This is sufficient for current scale and requirements.
Parquet Schema Evolution Strategy:
- Additive Changes Only: New columns are added to the end of the schema.
- Schema Registry: The Glue Data Catalog is the source of truth.
- Backfill on Demand: If old data needs the new column populated, AWS Step Functions triggers an AWS Glue reprocessing job (reading raw, writing to a new run_id in processed) and updates the AWS Glue Data Catalog pointer.
Future Enhancement: Iceberg Tables (Optional)¶
Note: Iceberg is a future consideration, not currently implemented.
For larger scale (10M+ transactions/month), Apache Iceberg tables via AWS Glue could provide:
- Advanced schema evolution (add/rename/reorder columns without rewriting files)
- Hidden partitioning (automatic partition management)
- Time travel queries (query historical snapshots)
Current Approach: Parquet-only with schema versioning (schema_v=v1, schema_v=v2) provides sufficient schema evolution for current needs.
Type Change Example¶
Changing a column type (e.g., amount from string to decimal(18,2)) requires a new schema version and backfill:
- Create schema_v=v2: Update the Glue Catalog table definition with the new type under the schema_v=v2 path.
- Backfill Historical Data: Reprocess all historical partitions to convert amount from string to decimal, writing to schema_v=v2/run_id=BACKFILL_YYYYMMDD/ (sketched below).
- Update Consumers: Migrate all consumers to query schema_v=v2 tables.
- Deprecate v1: After the migration period, mark schema_v=v1 as deprecated (but preserve it for audit).
- Why Breaking: Type changes are breaking because existing queries expecting string will fail on decimal, and the data semantics change (e.g., string "100.50" vs decimal 100.50).
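A hedged PySpark sketch of the backfill step: cast amount from string to decimal(18,2) and write under the new schema version and a backfill run_id. Paths are illustrative; a real job would also pin currency handling and rounding rules.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("amount-type-backfill").getOrCreate()

# Read one historical partition written under schema_v=v1 (path is illustrative).
v1 = spark.read.parquet(
    "s3://data-lake-bucket/silver/mortgages/transactions/year=2024/month=03/schema_v=v1/"
)

# Breaking change: amount moves from string to decimal(18,2) under schema_v=v2.
v2 = v1.withColumn("amount", F.col("amount").cast("decimal(18,2)"))

# Write-once: the backfill lands in its own run_id path; nothing under v1 is touched.
v2.write.parquet(
    "s3://data-lake-bucket/silver/mortgages/transactions/"
    "year=2024/month=03/schema_v=v2/run_id=BACKFILL_20260128/"
)
```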
Domain notes: mortgages and investments¶
- Prefer immutable facts + effective dating: rather than overwriting a contract attribute, append a new record/event with effective_from (and optionally effective_to).
- Treat amendments as new facts: interest rate resets, refinancing, contract changes, and corporate actions should be modeled as new events/types/tables rather than redefining old fields.
- Money & precision are part of the contract: keep currency explicit; avoid changing precision/rounding rules without a versioned definition (and often a v2 dataset).
Handling schema drift (runtime != documented)¶
- Fail fast or quarantine non-conforming records (never silently coerce).
- Emit drift metrics (unknown fields, missing required fields, new enum values) and trigger a review.
- For lake ingestion, quarantine is the safety valve; for integrations/events, schema validation at the boundary is the goal.
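An illustrative PySpark sketch of this drift handling at ingestion. The expected column set, currency whitelist, and error label are assumptions for the example, not the pipeline's actual validation rules.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("drift-check").getOrCreate()

# Assumed contract for the example.
EXPECTED = {"TransactionID", "AccountID", "Amount", "Currency", "TransactionTimestamp"}

df = spark.read.option("header", True).csv(
    "s3://data-lake-bucket/bronze/mortgages/transactions/"
)

unknown_fields = set(df.columns) - EXPECTED
missing_fields = EXPECTED - set(df.columns)
if missing_fields:
    # Required columns absent for the whole file: fail fast instead of silently coercing.
    raise ValueError(f"Schema drift: missing required fields {missing_fields}")

# Row-level checks: non-conforming rows go to quarantine with an error label.
valid_currency = F.col("Currency").isin("EUR", "USD", "GBP")
invalid = df.filter(~valid_currency | F.col("Currency").isNull()) \
            .withColumn("validation_error", F.lit("invalid_currency"))
valid = df.filter(valid_currency)

# Drift metrics (e.g., unknown field names, quarantined row count) would be emitted to monitoring here.
print({"unknown_fields": sorted(unknown_fields), "quarantined_rows": invalid.count()})
```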
5. Backfills & Reprocessing¶
- Backfills recompute only affected partitions (e.g., reprocess year=2024/month=03).
- New outputs are written under a new run_id (e.g., run_id=20260128_BACKFILL).
- Promotion occurs by updating _LATEST.json and current/ after validation.
- No historical data is overwritten; all run_id paths are preserved for audit.
Example Backfill Flow:
1. Identify issue: Data quality problem in March 2024
2. Reprocess: Read bronze/.../ingest_date=2024-03-*/..., write to silver/.../schema_v=v1/run_id=20260128_BACKFILL/year=2024/month=03/...
3. Validate: Check _SUCCESS, row counts, quality metrics (see the sketch below)
4. Promote: Update _LATEST.json, copy to current/
5. Notify: Alert consumers of updated dataset
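A small boto3 sketch of the validation step before promotion, assuming the optional manifest.json from section 3 carries a row_count and sits next to the _SUCCESS marker; the bucket, prefix, and threshold are illustrative.

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "data-lake-bucket"  # illustrative
BACKFILL_PREFIX = ("silver/mortgages/transactions/schema_v=v1/"
                   "run_id=20260128_BACKFILL/year=2024/month=03/")

def backfill_is_promotable(expected_min_rows: int) -> bool:
    """Promote only if the backfill run is complete and its row count looks sane."""
    listing = s3.list_objects_v2(Bucket=BUCKET, Prefix=BACKFILL_PREFIX)
    keys = {obj["Key"] for obj in listing.get("Contents", [])}
    if BACKFILL_PREFIX + "_SUCCESS" not in keys:
        return False  # incomplete run: never promote
    manifest = json.loads(
        s3.get_object(Bucket=BUCKET, Key=BACKFILL_PREFIX + "manifest.json")["Body"].read()
    )
    return manifest.get("row_count", 0) >= expected_min_rows

if backfill_is_promotable(expected_min_rows=1):
    print("Safe to update _LATEST.json and copy to current/")
```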
6. Ownership & Governance (Task 2)¶
| Layer | Owner | Steward | Responsibility | Implementation Status |
|---|---|---|---|---|
| Bronze | Data Platform Team | Ingestion Lead | Immutability, raw data preservation, ingestion reliability | ✅ Implemented (Task 1) |
| Silver | Domain Teams | Domain Analyst | Validation logic, schema evolution, data quality | ✅ Implemented (Task 1) |
| Gold | Business (Finance) | Finance Controller | Business contracts, reporting accuracy, stakeholder communication | Task 2: Structure Design, Task 3: SQL Aggregation |
Note:
- Task 1 (ETL Pipeline): Implements Bronze → Silver transformation only (does not write Gold layer files)
- Task 2 (Architecture): Designs Gold layer structure and governance (folder organization, ownership model)
- Task 3 (SQL): Demonstrates Silver → Gold aggregation patterns via SQL queries (shows how to create Gold data)
Rules¶
- Bronze is immutable and restricted: Only platform team can write; no direct business user access. ✅ Implemented (Task 1)
- Silver schema changes require domain + platform review: Domain team proposes, platform team implements and validates. ✅ Implemented (Task 1)
- Gold changes require explicit approval and versioning: The Finance Controller approves schema changes; all changes are versioned via schema_v. (Task 2: Structure Design, Task 3: SQL Aggregation Pattern)
Governance Workflow:
- Schema change request → Domain Analyst (Silver) or Finance Controller (Gold).
- Review → Platform team assesses technical feasibility.
- Approval → Domain/Business owner approves.
- Implementation → Platform team implements with a new schema_v.
- Validation → Quality checks, backfill if needed.
- Promotion → Update _LATEST.json and current/ after validation (Gold layer only; not implemented in Task 1).
Note: Promotion workflow (_LATEST.json, current/ prefix) is designed for Gold layer in Task 2 (Architecture). Task 3 (SQL) demonstrates how to create Gold data via aggregation. Task 1 (ETL Pipeline) only implements Bronze → Silver transformation and does not write Gold layer files.
7. Architecture Diagram¶
Note: This diagram shows the complete architecture design (Task 2).
- Task 1 (ETL Pipeline): Implements Bronze → Silver transformation only (✅ Implemented)
- Task 2 (Architecture): Designs Gold layer structure (📋 Design Only)
- Task 3 (SQL): Demonstrates Silver → Gold aggregation patterns via SQL queries
flowchart TD
    subgraph source [Source]
        Bronze[Bronze_Layer<br/>S3 Raw<br/>✅ Task 1]
    end
    subgraph lake [Data_Lake]
        Bronze --> Glue[AWS Glue Job<br/>PySpark ETL<br/>✅ Task 1]
        Glue -->|"Valid_Data"| Silver[Silver_Layer<br/>S3 Processed<br/>✅ Task 1]
        Glue -->|"Invalid_Data"| Quarantine[Quarantine<br/>S3 Quarantine<br/>✅ Task 1]
        Silver --> Gold[Gold_Layer<br/>S3 Aggregates<br/>📋 Task 2 Design]
        Gold --> Latest[🕐 _LATEST.json<br/>📋 Task 2 Design]
        Gold --> Current[🕐 current/<br/>📋 Task 2 Design]
    end
    subgraph analytics [Analytics]
        Catalog[Glue Data Catalog<br/>Table Metadata<br/>✅ Task 1] --> Athena[Amazon Athena<br/>SQL Query Engine<br/>✅ Task 1]
        Athena --> BI[BI_Consumers]
    end
    Silver -.->|"Registers Partitions"| Catalog
    Current -.->|"Registers Partitions"| Catalog
    style Bronze fill:#f57c00,color:#fff
    style Silver fill:#388e3c,stroke:#7b1fa2,stroke-width:2px,color:#fff
    style Gold fill:#1976d2,color:#fff
    style Quarantine fill:#d32f2f,color:#fff
    style Glue fill:#7b1fa2,color:#fff
    style Latest fill:#757575,color:#fff
    style Current fill:#757575,color:#fff
    style Catalog fill:#424242,color:#fff
    style Athena fill:#388e3c,color:#fff
Athena & Glue Separation¶
- Amazon Athena is a query engine that executes SQL queries. It reads Parquet files from S3 via Glue Catalog metadata. SQL queries execute on Athena compute infrastructure, not directly on S3 storage.
- AWS Glue Data Catalog stores table schemas, partition metadata, and table locations. Glue does not execute queries; it provides metadata that Athena uses to locate and read Parquet files from S3.
- Separation of Concerns: Glue = metadata/catalog layer (where tables are defined), Athena = query execution engine (where SQL runs). This separation allows multiple query engines (Athena, Spark, Presto) to use the same Glue Catalog metadata.
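A small boto3 sketch of this separation: Glue answers "where is the table and what is its schema", while Athena executes the SQL. The database, table, column, and query-results location are assumptions for the example.

```python
import boto3

glue = boto3.client("glue")
athena = boto3.client("athena")

# Metadata/catalog layer: look up the table definition (no data is read here).
table = glue.get_table(DatabaseName="finance", Name="account_balances")
print(table["Table"]["StorageDescriptor"]["Location"])  # e.g. an s3://.../gold/.../current/ path

# Query execution layer: Athena reads the Parquet files the catalog points to.
query = athena.start_query_execution(
    QueryString=(
        "SELECT as_of_month, SUM(balance) AS total_balance "
        "FROM account_balances GROUP BY as_of_month"
    ),
    QueryExecutionContext={"Database": "finance"},
    ResultConfiguration={"OutputLocation": "s3://data-lake-bucket/athena-results/"},
)
print(query["QueryExecutionId"])
```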
Note: the diagram source is also available in diagram.mmd.
8. Failure Mode Analysis¶
This section explicitly documents what breaks if critical components are removed or fail, demonstrating system resilience understanding.
Component Failure Scenarios¶
If _SUCCESS Marker is Missing¶
Impact: Consumers cannot identify complete runs, leading to incomplete queries or query failures.
- Symptom: Partial data may be read, causing incorrect analytics results
- Root Cause: ETL job failed before writing _SUCCESS, or the marker was accidentally deleted
- Detection: Consumers check for _SUCCESS before reading; monitoring alerts on missing markers
- Mitigation: ETL jobs only write _SUCCESS after all data writes complete successfully. Consumers implement strict checks to ignore runs without _SUCCESS. Failed runs are automatically retried.
If _LATEST.json is Missing¶
Impact: No authoritative pointer to current dataset, making promotion/rollback impossible.
- Symptom: Automated workflows cannot determine which dataset version to use; manual discovery required
- Root Cause: Promotion workflow failed, or file was accidentally deleted
- Detection: Promotion workflows verify that _LATEST.json exists before proceeding
- Mitigation: _LATEST.json is updated atomically after validation. Fallback mechanisms can query _SUCCESS markers to discover the latest run if needed. Version control tracks _LATEST.json changes.
If Glue Data Catalog is Missing¶
Impact: Athena cannot locate tables, all SQL queries fail.
- Symptom: All Athena queries return "Table not found" errors
- Root Cause: Glue Catalog table definition deleted, or table location points to non-existent path
- Detection: Query failures trigger immediate alerts; catalog health checks monitor table existence
- Mitigation: Catalog definitions are version-controlled in Terraform. Automated tests verify catalog tables exist before deployment. Backup/restore procedures for catalog metadata.
If Quarantine Layer is Removed¶
Impact: Invalid data is silently dropped, audit trail lost.
- Symptom: Data quality issues go undetected; no way to investigate or recover invalid records
- Root Cause: Quarantine write failures, or quarantine layer access denied
- Detection: Monitoring alerts on quarantine write failures; data quality metrics track expected vs actual row counts
- Mitigation: Quarantine writes are mandatory and fail the ETL job if they fail. Quarantine layer has separate IAM policies to prevent accidental deletion. Regular audits verify quarantine data retention.
If Bronze Layer is Overwritten (Immutability Violation)¶
Impact: Historical audit trail lost; backfills become impossible.
- Symptom: Cannot reproduce historical reports; audit compliance failures
- Root Cause: Accidental overwrite, or immutability controls bypassed
- Detection: S3 versioning alerts on overwrites; CloudTrail logs track all write operations
- Mitigation: Bronze layer uses S3 versioning and object lock. IAM policies prevent overwrites (only append allowed). Regular backups of critical partitions.
If run_id Isolation is Removed¶
Impact: Reruns overwrite previous outputs, causing data loss and making backfills unsafe.
- Symptom: Historical run outputs disappear; cannot rollback to previous versions
- Root Cause: ETL job writes to a fixed path instead of a run_id-scoped path
- Detection: Monitoring detects duplicate run_id paths or missing run isolation
- Mitigation: run_id generation is mandatory and timestamp-based (unique per run). ETL code enforces run_id in all output paths. Code reviews verify run isolation patterns.
If Schema Versioning is Removed¶
Impact: Schema changes break existing consumers; no backward compatibility.
- Symptom: Existing queries fail after schema changes; data type mismatches
- Root Cause: Schema changes applied without versioning; consumers not migrated
- Detection: Query failures after schema changes; schema drift monitoring
- Mitigation: All schema changes require a new schema_v version. Migration playbooks ensure consumers are updated before old versions are deprecated. A schema registry tracks all versions.
System Resilience Principles¶
- Fail-Safe Defaults: System fails in a safe state (no partial data published) rather than silently corrupting data.
- Defense in Depth: Multiple layers of validation (ETL validation, catalog checks, consumer checks) prevent single points of failure.
- Auditability: All operations are logged (CloudTrail, CloudWatch) enabling post-incident analysis.
- Immutability: Critical layers (Bronze) are append-only, preventing accidental data loss.
- Idempotency: Reruns are safe; each run writes to unique paths preventing overwrites.