# Testing & Validation Guide
This document describes the test suite that validates the Python, SQL, and shell-script code in the Ohpen case study implementation.
## Overview

```mermaid
flowchart TD
    CI[CI_Pipeline] --> PythonTests[Python_Validation]
    CI --> SQLTests[SQL_Validation]
    PythonTests --> Ruff[Ruff_Linting]
    PythonTests --> UnitTests[Unit_Tests]
    PythonTests --> IntegrationTests[Integration_Tests]
    SQLTests --> SQLFluff[SQLFluff_Linting]
    SQLTests --> SQLUnitTests[DuckDB_Unit_Tests]
    Ruff --> Pass1[Syntax_And_Style_OK]
    UnitTests --> Pass2[Validation_Logic_OK]
    IntegrationTests --> Pass3[End_to_End_OK]
    SQLFluff --> Pass4[SQL_Syntax_OK]
    SQLUnitTests --> Pass5[Query_Results_OK]
    style CI fill:#5c6bc0,color:#fff
    style PythonTests fill:#66bb6a,color:#111
    style SQLTests fill:#66bb6a,color:#111
    style Ruff fill:#66bb6a,color:#111
    style UnitTests fill:#66bb6a,color:#111
    style IntegrationTests fill:#66bb6a,color:#111
    style SQLFluff fill:#66bb6a,color:#111
    style SQLUnitTests fill:#66bb6a,color:#111
    style Pass1 fill:#2e7d32,color:#fff
    style Pass2 fill:#2e7d32,color:#fff
    style Pass3 fill:#2e7d32,color:#fff
    style Pass4 fill:#2e7d32,color:#fff
    style Pass5 fill:#2e7d32,color:#fff
```
## Python Tests (Task 1 - ETL)

### Location

`tasks/01_data_ingestion_transformation/tests/`

### Test Files
- `test_etl.py` - Unit tests for validation logic (see the sketch after this list)
    - Valid data handling
    - Invalid currency detection
    - Null value handling
    - Malformed timestamp detection
    - Partition column generation
    - Run ID in quarantine
    - Empty DataFrame edge case
    - All rows quarantined edge case
    - Deterministic behavior
- `test_integration.py` - End-to-end integration tests
    - Full ETL workflow with temp files
    - Parquet partition structure validation
    - Success marker validation
    - Quarantine data validation
    - Deterministic output verification
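These unit tests follow the standard pytest pattern: build a small in-memory DataFrame, push it through the validation step, and assert on how rows split between the valid output and quarantine. A minimal sketch of the invalid-currency case, assuming a hypothetical `validate_transactions` helper and column names (the real function lives in `src/` and may differ):

```python
import pandas as pd

# Hypothetical import: the actual module and function names in src/ may differ.
from src.etl import validate_transactions


def test_invalid_currency_is_quarantined():
    # One valid EUR row and one row with an unknown currency code.
    df = pd.DataFrame(
        {
            "transaction_id": ["t1", "t2"],
            "amount": [100.0, 50.0],
            "currency": ["EUR", "XXX"],
            "timestamp": ["2024-01-15T10:00:00Z", "2024-01-16T11:00:00Z"],
        }
    )

    valid, quarantined = validate_transactions(df)

    # The bad-currency row should land in quarantine, not in the valid output.
    assert list(valid["transaction_id"]) == ["t1"]
    assert list(quarantined["transaction_id"]) == ["t2"]
```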
### Running Python Tests

```bash
cd /home/stephen/projects/ohpen-case-2026/tasks/01_data_ingestion_transformation

# Install dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt

# Run linting
ruff check src/ tests/

# Run all tests
export PYTHONPATH=$PYTHONPATH:$(pwd)
pytest tests/ -v

# Run specific test file
pytest tests/test_etl.py -v
pytest tests/test_integration.py -v
```
## Script Tests (Workspace & Task Scripts)

### Location

- `tests/scripts/` - Tests for workspace-level scripts
- `tasks/05_communication_documentation/tests/` - Tests for task-level scripts

### Test Files
- `tests/scripts/test_create_submission_bundles.py` - Tests for the submission bundle creation script (see the sketch after this list)
    - Script existence and permissions
    - Bash syntax validation
    - Directory structure creation
    - File copying logic
    - ZIP file generation
    - Error handling
- `tasks/05_communication_documentation/tests/test_compile_deliverables.py` - Tests for the PDF compilation script
    - Script syntax validation
    - File reference checks
    - Pandoc integration
    - Error handling
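Bash syntax validation here boils down to `bash -n`, which parses a script without executing it. A minimal sketch of the existence, permission, and syntax checks, using a hypothetical script path (the real tests may organize this differently):

```python
import subprocess
from pathlib import Path

# Hypothetical path: point this at the script under test.
SCRIPT = Path("scripts/create_submission_bundles.sh")


def test_script_exists_and_is_executable():
    assert SCRIPT.is_file()
    # On POSIX systems, verify the owner-execute bit is set.
    assert SCRIPT.stat().st_mode & 0o100


def test_script_has_valid_bash_syntax():
    # `bash -n` parses without running; a non-zero exit code means a syntax error.
    result = subprocess.run(
        ["bash", "-n", str(SCRIPT)], capture_output=True, text=True
    )
    assert result.returncode == 0, result.stderr
```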
### Running Script Tests

```bash
# From project root
cd /home/stephen/projects/ohpen-case-2026

# Run all script tests
pytest tests/scripts/ tasks/05_communication_documentation/tests/test_compile_deliverables.py -v

# Or use Makefile
make test-scripts

# Run specific test
pytest tests/scripts/test_create_submission_bundles.py -v
```

## SQL Tests (Task 3 - Balance Query)
### Location

`tasks/03_sql/tests/`

### Test Files
- `test_balance_query.py` - DuckDB-based SQL tests (see the sketch after this list)
    - SQL syntax validation
    - Expected results verification
    - Missing months handling (NULL values)
    - Last transaction of month selection
- `test_data.sql` - Sample data from Appendix A
- `expected_output.csv` - Expected query results
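The DuckDB tests load the sample data, run the query, and compare the result against `expected_output.csv`. The sketch below shows the pattern with a toy inlined query rather than the real `balance_history_2024_q1.sql`; the table and column names are assumptions. It exercises the same cross-join month spine and NULL-for-missing-months behavior the tests cover:

```python
import duckdb


def test_missing_months_yield_null_balances():
    con = duckdb.connect(":memory:")
    con.execute(
        """
        CREATE TABLE transactions (
            account_id INTEGER,
            balance DECIMAL(10, 2),
            tx_date DATE
        )
        """
    )
    # Account 1 has activity in January only; February and March
    # should still appear in the spine, with NULL balances.
    con.execute("INSERT INTO transactions VALUES (1, 250.00, DATE '2024-01-20')")

    rows = con.execute(
        """
        WITH months AS (
            SELECT UNNEST([1, 2, 3]) AS month
        ),
        spine AS (
            -- Cross join: every account paired with every month.
            SELECT a.account_id, m.month
            FROM (SELECT DISTINCT account_id FROM transactions) AS a
            CROSS JOIN months AS m
        )
        SELECT s.account_id, s.month, t.balance
        FROM spine AS s
        LEFT JOIN transactions AS t
            ON t.account_id = s.account_id
            AND EXTRACT(month FROM t.tx_date) = s.month
        ORDER BY s.account_id, s.month
        """
    ).fetchall()

    assert rows == [(1, 1, 250.00), (1, 2, None), (1, 3, None)]
```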
### Running SQL Tests

```bash
cd /home/stephen/projects/ohpen-case-2026/tasks/03_sql

# Install dependencies
pip install -r requirements.txt

# Run SQL linting
sqlfluff lint balance_history_2024_q1.sql --dialect postgres

# Run SQL unit tests
pytest tests/ -v
```
## CI/CD Integration

The GitHub Actions workflow (`.github/workflows/ci.yml`) automatically runs all tests:

### Jobs
- `python-validation`
    - Ruff linting on source and test files
    - All pytest unit and integration tests
- `sql-validation`
    - SQLFluff linting on SQL queries
    - DuckDB-based SQL unit tests
### Triggering CI

```bash
git add .
git commit -m "Add comprehensive test suite"
git push origin master
```
## Test Coverage
### Python ETL Coverage
- ✅ Validation logic (nulls, currency, timestamp)
- ✅ Partition generation (year/month extraction)
- ✅ Quarantine mechanism
- ✅ Success marker creation
- ✅ Edge cases (empty, all invalid)
- ✅ Deterministic behavior (backfill safety; see the sketch after this list)
- ✅ End-to-end workflow
- ✅ Metadata enrichment (row_hash, attempt_count)
- ✅ Loop prevention (duplicate detection, attempt limits)
- ✅ Circuit breaker logic
- ✅ Condemned layer handling
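Determinism (and therefore backfill safety) is checked by running the same input through the pipeline twice and comparing the outputs byte for byte. A minimal sketch, assuming a hypothetical `run_etl(input_path, output_dir)` entry point and input fixture:

```python
import hashlib
from pathlib import Path

# Hypothetical entry point and fixture path: the real ETL API lives in src/.
from src.etl import run_etl

INPUT_CSV = "tests/data/transactions.csv"


def digest_tree(root: Path) -> dict[str, str]:
    # Map each output file's relative path to its SHA-256 digest.
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def test_rerun_produces_identical_output(tmp_path):
    out_a, out_b = tmp_path / "a", tmp_path / "b"
    run_etl(INPUT_CSV, out_a)
    run_etl(INPUT_CSV, out_b)
    # Identical digests mean a backfill re-run rewrites exactly the same bytes.
    assert digest_tree(out_a) == digest_tree(out_b)
```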
### Script Coverage
- ✅ Bash script syntax validation
- ✅ Script existence and permissions
- ✅ Directory creation logic
- ✅ File copying operations
- ✅ Error handling
- ✅ Output validation
### SQL Query Coverage
- ✅ Syntax validation (PostgreSQL/ANSI dialect)
- ✅ Expected output verification
- ✅ NULL handling for missing months
- ✅ Last transaction selection logic
- ✅ Cross join (account × month spine)
## Testing Strategy Decisions
### Why DuckDB instead of real PostgreSQL/Aurora
- **Speed**: DuckDB runs in-memory, tests complete in milliseconds
- **Portability**: No external dependencies, works in any CI environment
- **Sufficient**: Validates SQL syntax and logic (the case study requirement)
- **Pragmatic**: Avoids over-engineering for a case study submission
### Why Temp Files instead of Testcontainers for Python
For this case study, we use temporary directories plus mocking instead of Testcontainers (Docker containers running a real S3-compatible service).
### Rationale
- **Fast**: Tests run in <2 seconds vs 10+ seconds with container startup
- **Simple**: Fewer dependencies, easier to debug
- **Validates core logic**: Partition generation, validation, quarantine handling
- **Interview scope**: Demonstrates testing best practices without over-engineering
### When to use Testcontainers (production context)
- Testing S3-specific features (multipart uploads, versioning, lifecycle policies)
- Integration tests requiring exact AWS SDK behavior
- End-to-end workflows with Glue, Lambda, Step Functions
- Testing against multiple database engines simultaneously
**Trade-off acknowledged**: Our tests validate transformation logic but not S3 API edge cases. For production deployment at Ohpen, we would add Testcontainers-based integration tests as a second layer.
## Benefits for Interview
This test suite demonstrates:
1. **Production Readiness**: Comprehensive validation before deployment
2. **Data Quality**: Automated checks prevent data corruption
3. **Backfill Safety**: Determinism tests ensure reproducible outputs
4. **Best Practices**: Linting, unit tests, integration tests, CI/CD
5. **Documentation**: Tests serve as executable examples
6. **Pragmatic Engineering**: Chose the right tool for the scope (DuckDB, temp files) while knowing when to upgrade (Testcontainers for production)
## Quick Validation
### Option 1: Docker (Recommended)
Run all tests in an isolated Docker environment:
```bash
cd /home/stephen/projects/ohpen-case-2026
# Run all tests (Python + SQL)
make test
# Or run specific tests
make test-python # Python ETL tests only
make test-sql # SQL query tests only
make test-scripts # Script tests only (runs locally, not in Docker)
```
### Option 2: Local Environment
If you prefer to run tests locally:
```bash
cd /home/stephen/projects/ohpen-case-2026
# Python tests
cd tasks/01_data_ingestion_transformation
pip install -r requirements.txt -r requirements-dev.txt
ruff check src/ tests/
pytest tests/ -v
# SQL tests
cd ../03_sql
pip install -r requirements.txt
sqlfluff lint balance_history_2024_q1.sql --dialect postgres
pytest tests/ -v
# Script tests
cd ../..
pip install pytest
pytest tests/scripts/ tasks/05_communication_documentation/tests/test_compile_deliverables.py -v
```

Expected output: All tests passing ✅