CI/CD Workflow & Infrastructure

Important Note on Scope

⚠️ Scope Disclaimer: This CI/CD workflow design is based on the case study requirements and my interpretation of the problem. I have made assumptions that may not fully reflect the current scope of operations at Ohpen. The workflow presented here would need refinement based on:

  • Actual deployment frequency and release processes
  • Existing CI/CD infrastructure and tooling
  • Team size, skills, and operational maturity
  • Production change management and approval processes
  • Real-world monitoring and alerting requirements

This design demonstrates DevOps thinking and best practices, but would require collaboration with the Ohpen team for production implementation.


1. Pipeline Design (GitHub Actions)

We implement a "History-Safe" CI/CD process. Since we support backfills and reprocessing, our deployment pipeline must support versioning and safe rollouts.

```mermaid
flowchart LR
    Dev[Developer] --> PR[Pull_Request]
    PR --> CI[CI_LintAndTests]
    CI --> Build[Build_VersionedArtifact]
    Build --> Plan[Terraform_Plan]
    Plan --> Apply[Terraform_Apply]
    Apply --> Deploy[Update_GlueJob_Or_RunConfig]
    Deploy --> Run[EventBridge → Step Functions → Glue Job]
    Run --> Logs[〰️ CloudWatch_Logs_And_Metrics]
    Logs --> Alerts[Alerts_On_FailureOrQuarantineSpike]
    style Dev fill:#1976d2,color:#fff
    style PR fill:#5c6bc0,color:#fff
    style CI fill:#66bb6a,color:#111
    style Build fill:#7b1fa2,color:#fff
    style Plan fill:#424242,color:#fff
    style Apply fill:#424242,color:#fff
    style Deploy fill:#424242,color:#fff
    style Run fill:#5c6bc0,color:#fff
    style Logs fill:#00897b,color:#fff
    style Alerts fill:#ffa000,color:#111
```

Workflow Stages

  1. Validation (CI): Runs on every Pull Request.
    • ruff linting (code style).
    • pytest unit tests (partition logic, null handling, quarantine checks).
  2. Artifact Build:
    • Packages the Python ETL code.
    • Tags artifact with Git SHA (e.g., etl-v1.0.0-a1b2c3d.zip).
  3. Deployment (CD):
    • Uploads artifact to S3 (Code Bucket).
    • Terraform plan & apply to update AWS infrastructure (Glue Jobs, IAM, Buckets).
    • Updates the Glue Job to point to the new artifact (see the deployment sketch below).
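
As an illustration, the artifact build and Glue update might look like the boto3 sketch below. This is a minimal sketch, not the repository's actual deploy script: the bucket name, Glue job name, and role ARN are assumptions.

```python
"""Hedged sketch of the CD step: package the ETL code, upload a versioned artifact,
and repoint the Glue job. Bucket, job name, and role ARN are illustrative assumptions."""
import shutil
import subprocess
import boto3

CODE_BUCKET = "ohpen-code-artifacts"       # assumed bucket name
GLUE_JOB_NAME = "transactions-silver-etl"  # assumed Glue job name

def build_and_deploy() -> str:
    # Tag the artifact with the Git SHA so every deployment is traceable to a commit.
    sha = subprocess.check_output(["git", "rev-parse", "--short", "HEAD"], text=True).strip()
    zip_path = shutil.make_archive(f"etl-{sha}", "zip", "src/etl")  # produces etl-<sha>.zip

    s3 = boto3.client("s3")
    s3.upload_file("src/etl/ingest_transactions.py", CODE_BUCKET, f"artifacts/{sha}/ingest_transactions.py")
    s3.upload_file(zip_path, CODE_BUCKET, f"artifacts/{sha}/etl-{sha}.zip")

    # Repoint the Glue job; UpdateJob resets unspecified fields, so the role is repeated here.
    boto3.client("glue").update_job(
        JobName=GLUE_JOB_NAME,
        JobUpdate={
            "Role": "arn:aws:iam::123456789012:role/etl-job-role",  # placeholder ARN
            "Command": {
                "Name": "glueetl",
                "ScriptLocation": f"s3://{CODE_BUCKET}/artifacts/{sha}/ingest_transactions.py",
            },
            "DefaultArguments": {
                "--extra-py-files": f"s3://{CODE_BUCKET}/artifacts/{sha}/etl-{sha}.zip",
            },
        },
    )
    return sha
```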

Backfill Safety Checks

Our tests specifically cover "history safety"; a minimal pytest sketch follows the list:

  • Determinism: Rerunning the same input produces the exact same counts.
  • Partitioning: Timestamps strictly map to the correct year=YYYY/month=MM folder.
  • Quarantine: Invalid rows are never silently dropped; they must appear in quarantine.
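
A minimal pytest sketch of these checks is shown below, assuming the ETL exposes a `transform(df)` function returning `(valid, quarantined)` DataFrames and that the valid output carries `year`/`month` partition columns. The function name, signature, and column names are assumptions for illustration, not confirmed from the repository.

```python
"""Hedged pytest sketch of the history-safety checks.
Assumes transform(df) -> (valid_df, quarantined_df); names are illustrative."""
import pandas as pd
import pytest

from etl.ingest_transactions import transform  # assumed import path

@pytest.fixture
def sample_input() -> pd.DataFrame:
    return pd.DataFrame(
        {
            "transaction_id": ["t1", "t2", "t3"],
            "amount": [10.0, None, 25.5],  # t2 is expected to be quarantined (null amount)
            "timestamp": pd.to_datetime(
                ["2026-01-21T12:00:00Z", "2026-01-21T13:00:00Z", "2026-02-01T09:30:00Z"]
            ),
        }
    )

def test_determinism(sample_input):
    # Rerunning the same input must produce identical counts.
    first_valid, first_quar = transform(sample_input.copy())
    second_valid, second_quar = transform(sample_input.copy())
    assert len(first_valid) == len(second_valid)
    assert len(first_quar) == len(second_quar)

def test_partitioning(sample_input):
    # Timestamps must map strictly to year=YYYY/month=MM partitions.
    valid, _ = transform(sample_input)
    assert (valid["year"] == valid["timestamp"].dt.year).all()
    assert (valid["month"] == valid["timestamp"].dt.month).all()

def test_quarantine_not_dropped(sample_input):
    # Invalid rows are never silently dropped: every input row is accounted for.
    valid, quarantined = transform(sample_input)
    assert len(valid) + len(quarantined) == len(sample_input)
```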

Failure Handling

Critical Rule: Failed runs do not update _LATEST.json or current/ prefix.

Failure Scenarios & Behavior:

  1. ETL Job Failure (No Output):
    • Job exits with non-zero status.
    • No _SUCCESS marker is written.
    • No data files are written (or partial files are ignored).
    • _LATEST.json remains unchanged (points to previous successful run).
    • current/ prefix remains unchanged (stable for SQL queries).
    • Action: Alert triggers, platform team investigates, job can be rerun safely.
  2. Partial Write (Job Crashes Mid-Execution):
    • Some data files may be written to the run_id={...}/ path.
    • _SUCCESS marker is missing (incomplete run).
    • Consumers ignore the run (only read runs with _SUCCESS).
    • _LATEST.json and current/ remain unchanged.
    • Action: Platform team can safely rerun (new run_id), or clean up partial files.
  3. Validation Failure (Data Quality Issues):
    • Job completes but validation checks fail (e.g., quarantine rate > threshold).
    • _SUCCESS marker may or may not be written (depends on the validation stage).
    • If validation fails before _SUCCESS, the run is treated as failed.
    • _LATEST.json and current/ remain unchanged.
    • Action: Data quality team reviews quarantine, fixes source data, reruns.
  4. Circuit Breaker Triggered:
    • Pipeline halts automatically when more than 100 identical errors occur within 1 hour (see the sketch after these scenarios).
    • Job exits with RuntimeError (non-zero status).
    • No _SUCCESS marker is written.
    • No data files are written (or partial files are ignored).
    • _LATEST.json and current/ remain unchanged.
    • Action: Platform team investigates root cause, fixes the systemic issue, manually restarts the pipeline.
  5. Schema Validation Failure:
    • Schema drift detected (unexpected columns, type mismatches).
    • Job fails fast before writing any output.
    • _LATEST.json and current/ remain unchanged.
    • Action: Schema registry updated, job rerun with new schema version.
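
To illustrate the circuit-breaker rule, a minimal sketch of the halt logic is shown below. The threshold (>100 identical errors) and window (1 hour) come from the scenario above; the class and method names are assumptions.

```python
"""Hedged sketch of the circuit breaker: halt when >100 identical errors occur within 1 hour."""
from collections import defaultdict, deque
from datetime import datetime, timedelta, timezone

class CircuitBreaker:
    def __init__(self, threshold: int = 100, window: timedelta = timedelta(hours=1)):
        self.threshold = threshold
        self.window = window
        self._errors: dict[str, deque] = defaultdict(deque)

    def record(self, error_type: str) -> None:
        """Record one error and raise once the same error type exceeds the threshold."""
        now = datetime.now(timezone.utc)
        timestamps = self._errors[error_type]
        timestamps.append(now)
        # Drop occurrences that fall outside the rolling 1-hour window.
        while timestamps and now - timestamps[0] > self.window:
            timestamps.popleft()
        if len(timestamps) > self.threshold:
            # Non-zero exit: the run fails, no _SUCCESS marker is written,
            # and _LATEST.json / current/ stay unchanged.
            raise RuntimeError(
                f"Circuit breaker triggered: {len(timestamps)} '{error_type}' errors within 1 hour"
            )
```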

Safe Rerun Behavior:

  • Each rerun uses a new run_id (timestamp-based).
  • Previous failed runs remain in storage (audit trail).
  • Only successful runs (with _SUCCESS) are considered for promotion.
  • Promotion to _LATEST.json and current/ is a separate, explicit step after validation (see the run_id sketch below).
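
A sketch of the rerun-safety helpers is shown below: a timestamp-based run_id per execution, and promotion candidates limited to runs carrying a _SUCCESS marker. Bucket and prefix names are illustrative assumptions.

```python
"""Hedged sketch of rerun-safe helpers: fresh run_id per execution, and only runs
with a _SUCCESS marker considered for promotion. Bucket/prefix names are assumptions."""
from datetime import datetime, timezone
import boto3

def new_run_id() -> str:
    # Timestamp-based run_id, e.g. 20260121T120000Z; each rerun gets a fresh one.
    return datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")

def successful_runs(bucket: str, prefix: str) -> list:
    """Return run_id prefixes that contain a _SUCCESS marker (promotion candidates only)."""
    s3 = boto3.client("s3")
    runs = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
        for common in page.get("CommonPrefixes", []):
            run_prefix = common["Prefix"]  # e.g. silver/.../run_id=20260121T120000Z/
            marker = s3.list_objects_v2(Bucket=bucket, Prefix=run_prefix + "_SUCCESS", MaxKeys=1)
            if marker.get("KeyCount", 0) > 0:
                runs.append(run_prefix)
    return runs
```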

Promotion Workflow

⚠️ IMPORTANT: For financial data compliance, human approval is required before promoting Silver layer data to production. See HUMAN_VALIDATION_POLICY.md for details.


  1. ETL writes to: silver/.../schema_v=v1/run_id=20260121T120000Z/... (isolated run_id path).
  2. ETL writes the _SUCCESS marker with metrics.
  3. CloudWatch alarm triggers: "New Run Available for Review" (P2 - 4 hours).
  4. Human Review Required:
    • Domain Analyst reviews quality metrics (quarantine rate, volume, schema).
    • Business validation (sample data review).
    • Technical validation (Platform Team).
  5. If approved: update the Glue Catalog → promote to production consumption.
  6. If rejected: leave the run in its isolated path, notify the Platform Team to investigate.
  7. Audit: All approvals logged with timestamp, approver, metrics, and reason (see the audit sketch below).

Note: _LATEST.json and current/ prefix are for Gold layer (Task 2 design), not Silver layer (Task 1).
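
One way the audit step might be recorded is sketched below. The DynamoDB table name and field set are assumptions; the approval process itself is defined in HUMAN_VALIDATION_POLICY.md.

```python
"""Hedged sketch of the promotion audit record. Table name and fields are illustrative
assumptions; the approval policy lives in HUMAN_VALIDATION_POLICY.md."""
from datetime import datetime, timezone
import boto3

AUDIT_TABLE = "silver-promotion-audit"  # assumed table name

def log_promotion_decision(run_id: str, approver: str, approved: bool,
                           metrics: dict, reason: str) -> None:
    table = boto3.resource("dynamodb").Table(AUDIT_TABLE)
    table.put_item(
        Item={
            "run_id": run_id,
            "decided_at": datetime.now(timezone.utc).isoformat(),
            "approver": approver,
            "decision": "approved" if approved else "rejected",
            "metrics": {k: str(v) for k, v in metrics.items()},  # stringified for DynamoDB
            "reason": reason,
        }
    )
```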

2. Infrastructure as Code (Terraform)

We use Terraform to manage the Data Lake lifecycle.

Key Resources

  • S3 Buckets: raw, processed, quarantine, code-artifacts.
    • Policy: Block public access, enable versioning (critical for recovery).
  • IAM Roles: Least-privilege role for the ETL job (Read Raw, Write Processed/Quarantine).
  • Prefix-Scoped Permissions: IAM policies are scoped to S3 prefixes for fine-grained access control (see the policy sketch after this list):
    • ETL Job Role: s3://bucket/bronze/* (read), s3://bucket/silver/* (write), s3://bucket/quarantine/* (write)
    • Platform Team: s3://bucket/bronze/*, s3://bucket/silver/*, s3://bucket/quarantine/* (read/write)
    • Domain Teams: s3://bucket/silver/{domain}/* (write), s3://bucket/gold/{domain}/* (read)
    • Business/Analysts: s3://bucket/gold/* (read-only via Athena)
    • Compliance: s3://bucket/bronze/*, s3://bucket/quarantine/* (read-only for audit)
  • AWS Glue Job: Defines the Python Shell or Spark job, injected with the S3 path to the script.
  • AWS Step Functions: ✅ Implemented - Orchestrates scheduled ETL runs with automatic retry and error handling
  • EventBridge: ✅ Implemented - Schedules daily ETL runs (configurable cron expression)
  • AWS Lambda: ❌ Not Implemented - Step Functions provides better orchestration; Lambda not needed
  • Monitoring: CloudWatch Alarms.
    • Alarm: QuarantineRows > 0 (investigate data quality issues).
    • Alarm: JobFailure.
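
To illustrate the prefix scoping for the ETL job role, a sketch of the policy document it might receive is shown below. The bucket ARN is a placeholder and the exact action list would come from the Terraform module.

```python
"""Hedged sketch of the prefix-scoped IAM policy for the ETL job role.
Bucket ARN and action lists are illustrative; the real policy is defined in Terraform."""
import json

BUCKET = "arn:aws:s3:::ohpen-datalake"  # placeholder bucket ARN

etl_job_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # Read-only on the Bronze (raw) prefix.
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [BUCKET, f"{BUCKET}/bronze/*"],
        },
        {   # Write access limited to the Silver and quarantine prefixes.
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": [f"{BUCKET}/silver/*", f"{BUCKET}/quarantine/*"],
        },
    ],
}

print(json.dumps(etl_job_policy, indent=2))
```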

Step Functions Orchestration

What Events Are Scheduled:

  • Daily ETL Runs: Process new transaction CSV files from the Bronze layer.
    • Default schedule: daily at 2 AM UTC (cron(0 2 * * ? *)).
    • Alternative: monthly on the 1st day (cron(0 2 1 * ? *)).
    • Schedule is configurable via the EventBridge rule.

What Step Functions Does:

  1. RunETL State:
    • Invokes the AWS Glue Spark job synchronously.
    • Automatically retries on transient failures (Glue throttling, service exceptions).
    • Max 3 retry attempts with exponential backoff.
  2. ValidateOutput State:
    • Checks for the _SUCCESS marker in the Silver layer output.
    • Verifies the ETL job completed successfully.
    • Retries if the marker is not found (handles eventual consistency).
  3. Error Handling:
    • Catches all failures and transitions to the HandleFailure state.
    • Publishes failure metrics to CloudWatch.
    • Logs execution details for debugging.

Benefits:

  • Automatic Retry: Handles transient Glue failures automatically
  • Visual Monitoring: Step Functions console shows execution flow
  • Error Recovery: Built-in error handling and state management
  • Cost: ~$0.01/month for daily runs (negligible)

EventBridge Integration:

  • EventBridge rule triggers Step Functions state machine on schedule
  • No Lambda needed - direct EventBridge → Step Functions integration
  • Configurable cron expression for different schedules (see the scheduling sketch below).
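
A sketch of wiring the schedule directly to the state machine with boto3 follows; in practice this lives in Terraform, and the rule name, role ARN, and state machine ARN below are placeholders.

```python
"""Hedged sketch of the EventBridge -> Step Functions wiring (normally defined in Terraform).
Rule name, role ARN, and state machine ARN are placeholders."""
import boto3

events = boto3.client("events")

# Daily at 2 AM UTC, matching the default schedule above.
events.put_rule(
    Name="daily-transactions-etl",
    ScheduleExpression="cron(0 2 * * ? *)",
    State="ENABLED",
)

# Direct EventBridge -> Step Functions target; no Lambda in between.
events.put_targets(
    Rule="daily-transactions-etl",
    Targets=[
        {
            "Id": "etl-state-machine",
            "Arn": "arn:aws:states:eu-west-1:123456789012:stateMachine:transactions-etl",  # placeholder
            "RoleArn": "arn:aws:iam::123456789012:role/eventbridge-invoke-sfn",            # placeholder
        }
    ],
)
```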

3. Deployment Artifacts

List of files required to deploy this solution:

| Artifact | Description |
| --- | --- |
| tasks/01_data_ingestion_transformation/src/etl/ingest_transactions.py | The main ETL logic. |
| tasks/01_data_ingestion_transformation/requirements.txt | Python dependencies (pandas, pyarrow, boto3). |
| tasks/04_devops_cicd/infra/terraform/main.tf | Infrastructure definition. |
| tasks/04_devops_cicd/.github/workflows/ci.yml | Automation pipeline definition. |
| tasks/01_data_ingestion_transformation/config.yaml | Runtime config template (bucket names, prefixes, allow-lists). The ETL currently uses CLI args/env vars; this file is a template for config-driven runs (a loader sketch follows this table). |
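
As an illustration of how config-driven runs might consume this template, a loader sketch is shown below. The key names (buckets, prefixes, allowed_currencies, threshold) are assumptions about the template's contents, not taken from the actual config.yaml.

```python
"""Hedged sketch of a config.yaml loader for config-driven runs.
The key names shown here are assumptions about the template's contents."""
from dataclasses import dataclass
import yaml  # PyYAML, assumed available alongside the listed dependencies

@dataclass
class EtlConfig:
    bronze_bucket: str
    silver_bucket: str
    quarantine_bucket: str
    bronze_prefix: str
    silver_prefix: str
    allowed_currencies: list
    quarantine_rate_threshold: float = 0.01  # default 1%, configurable per dataset

def load_config(path: str = "config.yaml") -> EtlConfig:
    with open(path, "r", encoding="utf-8") as fh:
        raw = yaml.safe_load(fh)
    return EtlConfig(**raw)
```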

4. Operational Monitoring

To ensure reliability, we emit structured logs (JSON) that CloudWatch Logs Insights can query (a minimal emitter sketch follows the metric lists below):

Volume Metrics:

  • run_id: Trace ID for the execution.
  • input_rows: Total rows read from Bronze layer.
  • valid_rows_count: Rows successfully validated and written to Silver layer.
  • quarantined_rows_count: Rows quarantined due to validation failures.
  • condemned_rows_count: Rows auto-condemned (duplicates or max attempts).

Quality Metrics:

  • quarantine_rate: Percentage of rows quarantined.
  • validation_failure_rate: Quality metric.
  • error_type_distribution: Breakdown by error type (SCHEMA_ERROR, NULL_VALUE_ERROR, etc.).

Loop Prevention Metrics:

  • avg_attempt_count: Average attempt_count across all processed rows.
  • duplicate_detection_rate: Percentage of rows flagged as exact duplicates.
  • auto_condemnation_rate: Percentage of rows auto-condemned.
  • circuit_breaker_triggers: Count of circuit breaker activations.

Performance Metrics:

  • rows_processed_per_run: Throughput metric.
  • duration_seconds: ETL execution time.
  • missing_partitions: Completeness metric.
  • runtime_anomalies: Performance metric.
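
A minimal emitter sketch is shown below, assuming the job writes one JSON document per run to its log stream (which Glue forwards to CloudWatch Logs); the helper name and exact field subset are illustrative.

```python
"""Hedged sketch of the structured run-summary log, emitted as a single JSON line so
CloudWatch Logs Insights can query the fields directly. Field names match the lists above."""
import json
import logging
import time

logger = logging.getLogger("etl.metrics")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def emit_run_metrics(run_id: str, input_rows: int, valid_rows: int,
                     quarantined_rows: int, condemned_rows: int,
                     started_at: float) -> None:
    payload = {
        "run_id": run_id,
        "input_rows": input_rows,
        "valid_rows_count": valid_rows,
        "quarantined_rows_count": quarantined_rows,
        "condemned_rows_count": condemned_rows,
        "quarantine_rate": quarantined_rows / input_rows if input_rows else 0.0,
        "duration_seconds": round(time.time() - started_at, 2),
    }
    # One JSON object per line keeps Logs Insights queries simple, e.g.:
    #   fields run_id, quarantine_rate | filter quarantine_rate > 0.01
    logger.info(json.dumps(payload))
```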

Monitoring Alerts

  • Infrastructure Alerts (Data Platform Team):
    • Job failure (non-zero exit, missing _SUCCESS).
    • Missing partitions (expected partitions not present).
    • Runtime anomalies (unusually long execution time).
    • Circuit breaker triggered (>100 identical errors/hour → automatic pipeline halt, P1 - Immediate).
  • Data Quality Alerts (Data Quality Team):
    • Quarantine rate spike (quarantined_rows / input_rows > 1%, P2 - 4 hours; see the alarm sketch below).
    • Validation failure (schema drift, type mismatches, P2 - 2 hours).
    • High attempt_count (avg > 1.5, P2 - 4 hours).
    • Auto-condemnation spike (rate > 0.5%, P2 - 4 hours).
  • Business Metric Alerts (Domain Teams / Business):
    • Volume anomaly (too few or too many rows vs baseline, P3 - 8 hours).
    • SLA breach (data freshness, availability, P1 - 1 hour).
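
One of these alerts, the quarantine-rate spike, might be wired up as follows. The metric namespace, alarm name, and SNS topic ARN are placeholders, and the real definition would live in Terraform.

```python
"""Hedged sketch of the quarantine-rate alarm (normally defined in Terraform).
Namespace, alarm name, and SNS topic ARN are placeholders."""
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="silver-quarantine-rate-spike",
    Namespace="DataPlatform/ETL",    # assumed custom metric namespace
    MetricName="quarantine_rate",
    Statistic="Maximum",
    Period=3600,                     # evaluate hourly
    EvaluationPeriods=1,
    Threshold=0.01,                  # quarantined_rows / input_rows > 1%
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:data-quality-alerts"],  # placeholder
    AlarmDescription="P2 (4h): quarantine rate above 1% for the latest ETL run.",
)
```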

5. Ownership & Governance (Task 4)

5.1 Ownership Matrix

| Aspect | Owner | Steward | Execution | Responsibility |
| --- | --- | --- | --- | --- |
| Pipeline Infrastructure | Data Platform Team | Platform Lead | Data Platform Team | CI/CD pipeline reliability, deployment automation, AWS Step Functions/EventBridge job scheduling, run isolation mechanisms |
| CI/CD Automation | Data Platform Team | Platform Lead | Data Platform Team | GitHub Actions workflows, Terraform IaC, artifact versioning, AWS Glue job deployment automation |
| AWS Infrastructure | Data Platform Team | Platform Lead | Data Platform Team | S3 buckets, Glue jobs, IAM policies, CloudWatch infrastructure, resource provisioning |
| Infrastructure Monitoring | Data Platform Team | Platform Lead | Data Platform Team | Job failure detection, system health monitoring, infrastructure alerting, run completeness checks |
| Validation Rules | Domain Teams (Silver) / Business (Gold) | Domain Analyst / Finance Controller | Data Platform Team | Validation logic definition, business rule specification, quality threshold configuration |
| Data Quality Monitoring | Data Quality Team | Data Quality Lead | Data Quality Team | Quarantine review, quality metric interpretation, quality threshold monitoring, source data issue triage |
| Quarantine Resolution | Data Quality Team / Domain Teams | Data Quality Lead | Data Quality Team / Domain Teams | Invalid row investigation, source data correction, quarantine workflow management |
| Silver Layer Schema | Domain Teams | Domain Analyst | Data Platform Team | Schema change approval, schema version management, validation logic evolution |
| Gold Layer Schema | Business (Finance) | Finance Controller | Data Platform Team | Business contract definition, reporting schema approval, stakeholder communication |
| Schema Implementation | Data Platform Team | Platform Lead | Data Platform Team | Technical implementation of approved schema changes, schema versioning, backward compatibility |
| Dataset Ownership Metadata | Domain Teams (Silver) / Business (Gold) | Domain Analyst / Finance Controller | Data Platform Team | Business context, dataset purpose, consumer requirements, ownership assignment |
| Run Lineage & Audit Logs | Data Platform Team | Platform Lead | Data Platform Team | Technical lineage tracking, infrastructure audit logs, run metadata capture |
| Backfill Execution | Data Platform Team | Platform Lead | Data Platform Team | Technical reprocessing, partition-level backfills, run isolation for backfills |
| Backfill Approval | Domain Teams (Silver) / Business (Gold) | Domain Analyst / Finance Controller | N/A | Business approval for reprocessing, data correction requests, historical data updates |
| Promotion to Current | Domain Teams (Silver) / Business (Gold) | Domain Analyst / Finance Controller | Data Platform Team | Validation approval, promotion decision, _LATEST.json and current/ updates |

5.2 Contact Information

| Role | Contact Method | Primary Contact | Escalation Contact |
| --- | --- | --- | --- |
| Platform Lead | Email: [email protected]; Slack: #data-platform | Platform Lead | Infrastructure Manager |
| Domain Analyst | Email: [email protected]; Slack: #data-domain | Domain Analyst | Domain Manager |
| Finance Controller | Email: [email protected]; Slack: #finance-data | Finance Controller | Finance Director |
| Data Quality Lead | Email: [email protected]; Slack: #data-quality | Data Quality Lead | Data Quality Manager |
| On-Call Engineer | PagerDuty: data-platform-oncall; Slack: @oncall-data-platform | On-Call Rotation | Platform Lead |

Note: Replace placeholder email addresses and Slack channels with actual organizational contacts. Contact information should be maintained in a centralized directory (e.g., company wiki, directory service) and kept up-to-date.

5.3 Alert Ownership & Escalation

| Alert Type | Primary Owner | Escalation Path | Response Time |
| --- | --- | --- | --- |
| Job Failure (non-zero exit, missing _SUCCESS) | Data Platform Team | Platform Lead → On-call Engineer | Immediate (P1) |
| Infrastructure Errors (S3 access failures, Glue job crashes) | Data Platform Team | Platform Lead → Infrastructure Team | Immediate (P1) |
| Circuit Breaker Triggered (>100 identical errors/hour, pipeline halted) | Data Platform Team | Platform Lead → On-call Engineer | Immediate (P1) |
| Quarantine Rate Spike (quarantined_rows / input_rows > 1%) | Data Quality Team | Data Quality Lead → Domain Teams | 4 hours (P2) |
| Validation Failure (schema drift, type mismatches) | Data Quality Team | Data Quality Lead → Domain Teams → Platform Team | 2 hours (P2) |
| High Attempt Count (avg attempt_count > 1.5) | Data Quality Team | Data Quality Lead → Domain Teams | 4 hours (P2) |
| Auto-Condemnation Spike (auto-condemnation rate > 0.5%) | Data Quality Team | Data Quality Lead → Domain Teams | 4 hours (P2) |
| Volume Anomaly (too few/many rows vs baseline) | Domain Teams | Domain Analyst → Data Quality Team | 8 hours (P3) |
| Missing Partitions (expected partitions not present) | Data Platform Team | Platform Lead → Domain Teams | 4 hours (P2) |
| Runtime Anomalies (unusually long execution time) | Data Platform Team | Platform Lead → Infrastructure Team | 4 hours (P2) |
| SLA Breach (data freshness, availability) | Domain Teams / Business | Domain Analyst / Finance Controller → Platform Team | 1 hour (P1) |

5.4 Operational Responsibilities Matrix

| Responsibility Type | Bronze Layer | Silver Layer | Gold Layer |
| --- | --- | --- | --- |
| Ownership (who decides) | Data Platform Team | Domain Teams | Business (Finance) |
| Stewardship (who maintains) | Ingestion Lead | Domain Analyst | Finance Controller |
| Execution (who implements) | Data Platform Team | Data Platform Team | Data Platform Team |
| Consumption (who uses) | Platform engineers, audit | Analysts, data scientists | Finance, BI, stakeholders |
| Change Approval | Platform Lead | Domain Analyst + Platform review | Finance Controller + Platform review |
| Quality Monitoring | Platform Team (ingestion reliability) | Data Quality Team (validation rules) | Business (reporting accuracy) |

5.5 Governance Workflows

Schema Change Workflow

```mermaid
flowchart TD
    Request[Schema_Change_Request] --> Layer{Which_Layer?}
    Layer -->|Silver| Domain[Domain_Analyst_Review]
    Layer -->|Gold| Business[Finance_Controller_Review]
    Domain --> PlatformReview[Platform_Team_Technical_Feasibility]
    Business --> PlatformReview
    PlatformReview -->|Feasible| Approval[Owner_Approval]
    PlatformReview -->|Not_Feasible| Reject[Request_Rejected_or_Modified]
    Approval --> Implementation[Platform_Team_Implementation]
    Implementation --> Versioning[New_Schema_Version_Created]
    Versioning --> Validation[Quality_Checks_and_Backfill]
    Validation -->|Pass| Promotion[Update_LATEST_and_Current]
    Validation -->|Fail| Fix[Fix_Issues_and_Revalidate]
    Fix --> Validation
```

Data Quality Issue Resolution Workflow

```mermaid
flowchart TD
    Alert[Quality_Alert_Triggered] --> Triage[Data_Quality_Team_Triage]
    Triage --> Investigation[Investigate_Quarantine_Data]
    Investigation --> SourceIssue{Source_Data_Issue?}
    SourceIssue -->|Yes| SourceFix[Domain_Team_Fixes_Source]
    SourceIssue -->|No| ValidationIssue{Validation_Rule_Issue?}
    ValidationIssue -->|Yes| RuleUpdate[Domain_Team_Updates_Rules]
    ValidationIssue -->|No| PlatformIssue[Platform_Team_Investigates]
    SourceFix --> Backfill[Backfill_Approval_Request]
    RuleUpdate --> Backfill
    Backfill --> Approval[Domain_Business_Approval]
    Approval --> Reprocess[Platform_Team_Reprocesses]
    Reprocess --> Validate[Quality_Validation]
    Validate -->|Pass| Promote[Promote_to_Current]
    Validate -->|Fail| Fix[Fix_and_Revalidate]
```

Backfill Approval Workflow

```mermaid
flowchart TD
    Request[Backfill_Request] --> Layer{Which_Layer?}
    Layer -->|Silver| DomainApproval[Domain_Analyst_Approval]
    Layer -->|Gold| BusinessApproval[Finance_Controller_Approval]
    DomainApproval --> PlatformReview[Platform_Team_Technical_Assessment]
    BusinessApproval --> PlatformReview
    PlatformReview --> Schedule[Schedule_Backfill]
    Schedule --> Execute[Platform_Team_Executes]
    Execute --> Validate[Quality_Validation]
    Validate -->|Pass| Promote[Promote_to_Current]
    Validate -->|Fail| Notify[Notify_Requestor]
```

5.6 Governance Rules

Infrastructure & Platform Rules

  • Platform team owns all pipeline reliability, deployment automation, and infrastructure provisioning.
  • All infrastructure changes must go through Terraform IaC and CI/CD pipeline.
  • Failed runs never update _LATEST.json or current/ (explicit promotion only).
  • Run isolation via run_id is mandatory for all ETL executions.
  • Human approval required before promoting Silver layer data to production (see HUMAN_VALIDATION_POLICY.md).
  • Human approval required before deleting condemned data (see HUMAN_VALIDATION_POLICY.md).

Data Quality Rules

  • Data Quality Team owns quarantine review and quality metric interpretation.
  • Quarantine rate thresholds are configurable per dataset (default: 1%).
  • All invalid rows must be preserved in quarantine (never silently dropped).
  • Quality alerts require Data Quality Team triage before escalation.

Schema Governance Rules

  • Silver layer schema changes require Domain Team approval + Platform Team implementation.
  • Gold layer schema changes require Business/Finance approval + Platform Team implementation.
  • All schema changes must be versioned via schema_v for backward compatibility.
  • Schema changes require quality validation and backfill if needed.

Backfill & Reprocessing Rules

  • Technical execution of backfills is owned by Platform Team.
  • Business approval for backfills is required from Domain Teams (Silver) or Business (Gold).
  • All backfills write to new run_id paths (no overwrites).
  • Promotion to current/ requires explicit validation and approval (Gold layer only - Task 2 design).
  • Human approval required before promoting Silver layer data to production (see HUMAN_VALIDATION_POLICY.md).

Monitoring & Alerting Rules

  • Infrastructure alerts (P1) route to Platform Team for immediate response.
  • Data quality alerts (P2) route to Data Quality Team for investigation.
  • Business metric alerts (P3) route to Domain Teams for review.
  • Alert ownership is clearly defined with escalation paths documented.

Alignment with Task 2 Architecture

  • Bronze layer ownership aligns with Task 2: Platform Team (immutability, ingestion).
  • Silver layer ownership aligns with Task 2: Domain Teams (validation, schema).
  • Gold layer ownership aligns with Task 2: Business/Finance (contracts, reporting).
  • Governance workflows are consistent across Task 2 and Task 4 documentation.