# Testing & Validation Guide

This document describes the comprehensive test suite for validating the Python, SQL, and shell-script code in the Ohpen case study implementation.

## Overview

```mermaid
flowchart TD
    CI[CI_Pipeline] --> PythonTests[Python_Validation]
    CI --> SQLTests[SQL_Validation]
    PythonTests --> Ruff[Ruff_Linting]
    PythonTests --> UnitTests[Unit_Tests]
    PythonTests --> IntegrationTests[Integration_Tests]
    SQLTests --> SQLFluff[SQLFluff_Linting]
    SQLTests --> SQLUnitTests[DuckDB_Unit_Tests]
    Ruff --> Pass1[Syntax_And_Style_OK]
    UnitTests --> Pass2[Validation_Logic_OK]
    IntegrationTests --> Pass3[End_to_End_OK]
    SQLFluff --> Pass4[SQL_Syntax_OK]
    SQLUnitTests --> Pass5[Query_Results_OK]
    style CI fill:#5c6bc0,color:#fff
    style PythonTests fill:#66bb6a,color:#111
    style SQLTests fill:#66bb6a,color:#111
    style Ruff fill:#66bb6a,color:#111
    style UnitTests fill:#66bb6a,color:#111
    style IntegrationTests fill:#66bb6a,color:#111
    style SQLFluff fill:#66bb6a,color:#111
    style SQLUnitTests fill:#66bb6a,color:#111
    style Pass1 fill:#2e7d32,color:#fff
    style Pass2 fill:#2e7d32,color:#fff
    style Pass3 fill:#2e7d32,color:#fff
    style Pass4 fill:#2e7d32,color:#fff
    style Pass5 fill:#2e7d32,color:#fff
```

## Python Tests (Task 1 - ETL)

### Location

- tasks/01_data_ingestion_transformation/tests/

### Test Files

1. test_etl.py - Unit tests for validation logic (see the sketch below)
   - Valid data handling
   - Invalid currency detection
   - Null value handling
   - Malformed timestamp detection
   - Partition column generation
   - Run ID in quarantine
   - Empty DataFrame edge case
   - All rows quarantined edge case
   - Deterministic behavior
2. test_integration.py - End-to-end integration tests
   - Full ETL workflow with temp files
   - Parquet partition structure validation
   - Success marker validation
   - Quarantine data validation
   - Deterministic output verification
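
To show the shape these unit tests take, here is a self-contained sketch of the invalid-currency case. The inline `split_by_currency` helper and the accepted-currency set are illustrative stand-ins; the real validation logic lives in src/ and its API may differ.

```python
import pandas as pd

ACCEPTED_CURRENCIES = {"EUR", "USD", "GBP"}  # illustrative set, not the real config


def split_by_currency(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    # Stand-in for the real validation logic in src/: route rows with an
    # unknown currency code to quarantine.
    mask = df["currency"].isin(ACCEPTED_CURRENCIES)
    return df[mask], df[~mask]


def test_invalid_currency_is_quarantined():
    df = pd.DataFrame({"transaction_id": ["t1", "t2"], "currency": ["EUR", "XXX"]})
    valid, quarantined = split_by_currency(df)

    # The bad-currency row should land in quarantine, not in the valid output.
    assert list(valid["transaction_id"]) == ["t1"]
    assert list(quarantined["transaction_id"]) == ["t2"]
```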

### Running Python Tests

```bash
cd /home/stephen/projects/ohpen-case-2026/tasks/01_data_ingestion_transformation

# Install dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt

# Run linting
ruff check src/ tests/

# Run all tests
export PYTHONPATH=$PYTHONPATH:$(pwd)
pytest tests/ -v

# Run specific test file
pytest tests/test_etl.py -v
pytest tests/test_integration.py -v
```
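
As an alternative to exporting `PYTHONPATH` by hand, a `conftest.py` at the task root can make src/ importable for pytest. This is a sketch under the assumption that no such file exists yet:

```python
# conftest.py (at the task root) - lets pytest import from src/ without
# requiring an exported PYTHONPATH.
import sys
from pathlib import Path

sys.path.insert(0, str(Path(__file__).resolve().parent))
```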

## Script Tests (Workspace & Task Scripts)

### Location

- tests/scripts/ - Tests for workspace-level scripts
- tasks/05_communication_documentation/tests/ - Tests for task-level scripts

### Test Files

1. tests/scripts/test_create_submission_bundles.py - Tests for the submission bundle creation script
   - Script existence and permissions
   - Bash syntax validation (see the sketch below)
   - Directory structure creation
   - File copying logic
   - ZIP file generation
   - Error handling
2. tasks/05_communication_documentation/tests/test_compile_deliverables.py - Tests for the PDF compilation script
   - Script syntax validation
   - File reference checks
   - Pandoc integration
   - Error handling
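
To illustrate the bash syntax check mentioned above, here is a minimal pytest sketch using `bash -n`, which parses a script without executing it. The script path is an assumption for illustration; adjust it to wherever the bundle script actually lives.

```python
import subprocess
from pathlib import Path

# Assumed location of the bundle script - adjust to the real path.
SCRIPT = Path("scripts/create_submission_bundles.sh")


def test_bash_syntax_is_valid():
    # `bash -n` parses the script without running it, surfacing syntax errors.
    result = subprocess.run(["bash", "-n", str(SCRIPT)], capture_output=True, text=True)
    assert result.returncode == 0, result.stderr
```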

### Running Script Tests

```bash
# From project root
cd /home/stephen/projects/ohpen-case-2026

# Run all script tests
pytest tests/scripts/ tasks/05_communication_documentation/tests/test_compile_deliverables.py -v

# Or use the Makefile target
make test-scripts

# Run a specific test
pytest tests/scripts/test_create_submission_bundles.py -v
```

## SQL Tests (Task 3 - Balance Query)

### Location

- tasks/03_sql/tests/

### Test Files

1. test_balance_query.py - DuckDB-based SQL tests (see the sketch below)
   - SQL syntax validation
   - Expected results verification
   - Missing months handling (NULL values)
   - Last transaction of month selection
2. test_data.sql - Sample data from Appendix A
3. expected_output.csv - Expected query results
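
The general shape of these tests: load the sample data into an in-memory DuckDB, run the query under test, and compare the result to the expected CSV. A sketch, using the file names listed above (run from tasks/03_sql; dtype differences between DuckDB and pandas are tolerated here for simplicity):

```python
import duckdb
import pandas as pd


def test_balance_query_matches_expected_output():
    con = duckdb.connect(":memory:")

    # Load the Appendix A sample data, then run the query under test.
    con.execute(open("tests/test_data.sql").read())
    actual = con.execute(open("balance_history_2024_q1.sql").read()).df()

    expected = pd.read_csv("tests/expected_output.csv")
    pd.testing.assert_frame_equal(
        actual.reset_index(drop=True),
        expected.reset_index(drop=True),
        check_dtype=False,  # CSV round-trips lose exact dtypes
    )
```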

### Running SQL Tests

```bash
cd /home/stephen/projects/ohpen-case-2026/tasks/03_sql

# Install dependencies
pip install -r requirements.txt

# Run SQL linting
sqlfluff lint balance_history_2024_q1.sql --dialect postgres

# Run SQL unit tests
pytest tests/ -v
```

## CI/CD Integration

The GitHub Actions workflow (.github/workflows/ci.yml) automatically runs all tests:

### Jobs

1. python-validation
   - Ruff linting on source and test files
   - All pytest unit and integration tests
2. sql-validation
   - SQLFluff linting on SQL queries
   - DuckDB-based SQL unit tests

### Triggering CI

```bash
git add .
git commit -m "Add comprehensive test suite"
git push origin master
```

## Test Coverage

### Python ETL Coverage

- ✅ Validation logic (nulls, currency, timestamp)
- ✅ Partition generation (year/month extraction)
- ✅ Quarantine mechanism
- ✅ Success marker creation
- ✅ Edge cases (empty, all invalid)
- ✅ Deterministic behavior (backfill safety)
- ✅ End-to-end workflow
- ✅ Metadata enrichment (row_hash, attempt_count)
- ✅ Loop prevention (duplicate detection, attempt limits)
- ✅ Circuit breaker logic
- ✅ Condemned layer handling
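
The determinism and row_hash items above rest on the same idea: a stable, order-independent serialization of each row hashes to the same value on every run, which is what makes backfills reproducible. A self-contained sketch of the idea (the hashing scheme in src/ may differ):

```python
import hashlib

import pandas as pd


def row_hash(row: pd.Series) -> str:
    # Sort keys so the serialization (and therefore the hash) is stable
    # regardless of column order.
    payload = "|".join(f"{k}={row[k]}" for k in sorted(row.index))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def test_row_hash_is_deterministic():
    df = pd.DataFrame({"transaction_id": ["t1"], "amount": [10.0], "currency": ["EUR"]})
    assert df.apply(row_hash, axis=1).equals(df.apply(row_hash, axis=1))
```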

### Script Coverage

- ✅ Bash script syntax validation
- ✅ Script existence and permissions
- ✅ Directory creation logic
- ✅ File copying operations
- ✅ Error handling
- ✅ Output validation

### SQL Query Coverage

- ✅ Syntax validation (PostgreSQL/ANSI dialect)
- ✅ Expected output verification
- ✅ NULL handling for missing months
- ✅ Last transaction selection logic
- ✅ Cross join (account × month spine)
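
The cross-join item is the crux of the NULL handling: the query builds a month spine and cross joins it with the set of accounts, so every (account, month) pair exists even when no transaction occurred. A minimal DuckDB sketch of the pattern, with illustrative names rather than the real query's:

```python
import duckdb

con = duckdb.connect(":memory:")

# Month spine (Jan-Mar 2024) cross joined with accounts: every pair exists,
# even for months in which an account had no transactions.
rows = con.execute("""
    WITH months AS (
        SELECT generate_series AS month_start
        FROM generate_series(TIMESTAMP '2024-01-01', TIMESTAMP '2024-03-01', INTERVAL 1 MONTH)
    ),
    accounts AS (
        SELECT UNNEST(['A1', 'A2']) AS account_id
    )
    SELECT account_id, month_start
    FROM accounts
    CROSS JOIN months
    ORDER BY account_id, month_start
""").fetchall()

assert len(rows) == 6  # 2 accounts x 3 months
```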

## Testing Strategy Decisions

### Why DuckDB instead of real PostgreSQL/Aurora

- **Speed**: DuckDB runs in-memory, tests complete in milliseconds
- **Portability**: No external dependencies, works in any CI environment
- **Sufficient**: Validates SQL syntax and logic (the case study requirement)
- **Pragmatic**: Avoids over-engineering for a case study submission

### Why Temp Files instead of Testcontainers for Python

For this case study, we use temporary directories + mocking instead of Testcontainers (Docker-in-Docker with real S3).

#### Rationale

- **Fast**: Tests run in <2 seconds vs 10+ seconds with container startup
- **Simple**: Fewer dependencies, easier to debug
- **Validates core logic**: Partition generation, validation, quarantine handling
- **Interview scope**: Demonstrates testing best practices without over-engineering
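
Concretely, pytest's built-in `tmp_path` fixture gives each test an isolated directory with automatic cleanup, which is all the integration tests need to validate partition layout. A minimal sketch (column names are illustrative):

```python
import pandas as pd


def test_partition_layout(tmp_path):
    # tmp_path is a per-test pathlib.Path provided by pytest.
    df = pd.DataFrame({"amount": [10.0], "year": [2024], "month": [1]})
    out = tmp_path / "lake"

    # Hive-style partitioning via pandas/pyarrow.
    df.to_parquet(out, partition_cols=["year", "month"])

    assert (out / "year=2024" / "month=1").exists()
```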

#### When to use Testcontainers (production context)

- Testing S3-specific features (multipart uploads, versioning, lifecycle policies)
- Integration tests requiring exact AWS SDK behavior
- End-to-end workflows with Glue, Lambda, Step Functions
- Testing against multiple database engines simultaneously

**Trade-off acknowledged**: Our tests validate transformation logic but not S3 API edge cases. For production deployment at Ohpen, we would add Testcontainers-based integration tests as a second layer.

## Benefits for Interview

This test suite demonstrates:

1. **Production Readiness**: Comprehensive validation before deployment
2. **Data Quality**: Automated checks prevent data corruption
3. **Backfill Safety**: Determinism tests ensure reproducible outputs
4. **Best Practices**: Linting, unit tests, integration tests, CI/CD
5. **Documentation**: Tests serve as executable examples
6. **Pragmatic Engineering**: Chose the right tool for the scope (DuckDB, temp files) while knowing when to upgrade (Testcontainers for production)

## Quick Validation

### Option 1: Docker (Recommended)

Run all tests in an isolated Docker environment:

```bash
cd /home/stephen/projects/ohpen-case-2026

# Run all tests (Python + SQL)

make test

# Or run specific tests

make test-python   # Python ETL tests only
make test-sql      # SQL query tests only
make test-scripts  # Script tests only (runs locally, not in Docker)

```

### Option 2: Local Environment

If you prefer to run tests locally:

```bash
cd /home/stephen/projects/ohpen-case-2026

# Python tests

cd tasks/01_data_ingestion_transformation
pip install -r requirements.txt -r requirements-dev.txt
ruff check src/ tests/
pytest tests/ -v

# SQL tests

cd ../03_sql
pip install -r requirements.txt
sqlfluff lint balance_history_2024_q1.sql --dialect postgres
pytest tests/ -v

# Script tests

cd ../..
pip install pytest
pytest tests/scripts/ tasks/05_communication_documentation/tests/test_compile_deliverables.py -v

```

Expected output: All tests passing ✅