SQL Query Breakdown: From Inner to Outer

This document breaks down the balance_history_2024_q1.sql query from the most inner table working outwards, explaining each CTE and the final SELECT statement.


Query Structure Overview

The query uses a nested CTE (Common Table Expression) approach, building from simple to complex:

flowchart TD Base[transactions
Base Table
100M rows] --> Filter[WHERE Filter
tx_date Q1 2024
Partition Pruning] Filter --> CTE1[CTE 1: months
Hardcoded Values
3 rows] Filter --> CTE2[CTE 2: unique_accounts
DISTINCT account_id
N unique accounts] CTE1 --> CTE3[CTE 3: account_months
CROSS JOIN
N × 3 rows] CTE2 --> CTE3 Filter --> CTE4[CTE 4: monthly_last_tx
Window Function ROW_NUMBER
Partition by account+month
Order by tx_date DESC] CTE3 --> Final[Final SELECT
LEFT JOIN
account_months × monthly_last_tx] CTE4 --> Final Final --> Output[Output
Account × Month × Balance
NULL for missing months] style Base fill:#f57c00,color:#fff style Filter fill:#7b1fa2,color:#fff style CTE1 fill:#388e3c,color:#fff style CTE2 fill:#388e3c,color:#fff style CTE3 fill:#1976d2,color:#fff style CTE4 fill:#ffa000,color:#111 style Final fill:#d32f2f,color:#fff style Output fill:#2e7d32,color:#fff

Textual Flow:

transactions (base table)
    ↓
months (CTE 1 - innermost)
    ↓
unique_accounts (CTE 2)
    ↓
account_months (CTE 3)
    ↓
monthly_last_tx (CTE 4)
    ↓
Final SELECT (outermost)

Breakdown: Inner to Outer

1. Base Table: transactions (Innermost - Data Source)

FROM transactions
WHERE tx_date >= '2024-01-01' AND tx_date < '2024-04-01'
flowchart LR Table[transactions table
100M rows
Partitioned by tx_date] --> Filter[WHERE Filter
tx_date >= '2024-01-01'
AND tx_date < '2024-04-01'] Filter --> Prune[Partition Pruning
Only scan Q1 2024 partitions] Prune --> Output[Filtered Rows
Q1 2024 transactions only] style Table fill:#f57c00,color:#fff style Filter fill:#7b1fa2,color:#fff style Prune fill:#388e3c,color:#fff style Output fill:#1976d2,color:#fff

What it does:

  • Reads from the transactions table (as defined in Appendix A)
  • Filters to Q1 2024 (January, February, March)
  • Partition pruning: If table is partitioned by tx_date, only scans relevant partitions

Columns used:

  • account_id - Account identifier
  • new_balance - Balance after transaction (per Appendix A spec)
  • tx_date - Transaction date
  • id - Transaction ID (for tie-breaking)

2. CTE 1: months (First CTE - Month Spine)

WITH months AS (
    SELECT CAST(date_column AS DATE) AS month_end
    FROM (
        VALUES
        ('2024-01-31'),
        ('2024-02-29'),
        ('2024-03-31')
    ) AS t (date_column)
)
flowchart LR Values[VALUES Clause
'2024-01-31'
'2024-02-29'
'2024-03-31'] --> Cast[CAST to DATE
String → DATE type] Cast --> Output[months CTE
3 rows
month_end column] style Values fill:#388e3c,color:#fff style Cast fill:#7b1fa2,color:#fff style Output fill:#1976d2,color:#fff

What it does:

  • Creates a hardcoded list of month-end dates for Q1 2024
  • Purpose: Defines the reporting period explicitly
  • Result: 3 rows (one per month)

Why month-end dates?

  • January 31, February 29 (2024 is leap year), March 31
  • These represent the "as-of" dates for month-end reporting

3. CTE 2: unique_accounts (Second CTE - Account Universe)

unique_accounts AS (
    SELECT DISTINCT account_id FROM transactions
    -- WHERE tx_date >= '2024-01-01' -- Optimization: filter partition scan if possible
)
flowchart LR Input[transactions table
Multiple rows per account] --> Distinct[DISTINCT account_id
Remove duplicates] Distinct --> Hash[Hash/Deduplication
Group by account_id] Hash --> Output[unique_accounts CTE
N rows
One per unique account] style Input fill:#f57c00,color:#fff style Distinct fill:#7b1fa2,color:#fff style Hash fill:#388e3c,color:#fff style Output fill:#1976d2,color:#fff

What it does:

  • Extracts all unique account IDs from the transactions table
  • Purpose: Defines the "universe" of accounts to report on
  • Result: One row per unique account

Note: The commented WHERE clause would optimize by filtering during the DISTINCT operation, but for clarity we filter in the next CTE.


4. CTE 3: account_months (Third CTE - Complete Spine)

account_months AS (
    SELECT
        a.account_id,
        m.month_end
    FROM unique_accounts AS a
    CROSS JOIN months AS m
)
flowchart LR Accounts[unique_accounts
N accounts
ACC001, ACC002, ACC003...] --> Cross[CROSS JOIN
Cartesian Product] Months[months
3 months
2024-01-31, 2024-02-29, 2024-03-31] --> Cross Cross --> Output[account_months CTE
N × 3 rows
Complete spine] Output --> Example[Example:
ACC001 × Jan, Feb, Mar
ACC002 × Jan, Feb, Mar
ACC003 × Jan, Feb, Mar] style Accounts fill:#388e3c,color:#fff style Months fill:#388e3c,color:#fff style Cross fill:#7b1fa2,color:#fff style Output fill:#1976d2,color:#fff style Example fill:#ffa000,color:#111

What it does:

  • CROSS JOIN: Creates Cartesian product of all accounts × all months
  • Purpose: Generates a complete "spine" of (account, month) combinations
  • Result: If there are N accounts and 3 months, produces N×3 rows

Why CROSS JOIN?

  • Ensures every account appears for every month
  • Months with no transactions will show NULL balance (explicit gaps)
  • Without this, accounts with no activity in a month would be missing from results

Example output:

account_id | month_end
-----------|----------
ACC001     | 2024-01-31
ACC001     | 2024-02-29
ACC001     | 2024-03-31
ACC002     | 2024-01-31
ACC002     | 2024-02-29
ACC002     | 2024-03-31
...

5. CTE 4: monthly_last_tx (Fourth CTE - Last Transaction per Month)

monthly_last_tx AS (
    SELECT
        account_id,
        new_balance,
        DATE_TRUNC('month', tx_date) AS tx_month_start,
        ROW_NUMBER() OVER (
            PARTITION BY account_id, DATE_TRUNC('month', tx_date)
            ORDER BY tx_date DESC, id DESC
        ) AS rn
    FROM transactions
    WHERE tx_date >= '2024-01-01' AND tx_date < '2024-04-01'
)
flowchart TD Input[transactions
Q1 2024 filtered] --> Partition[PARTITION BY
account_id + month] Partition --> Sort[ORDER BY
tx_date DESC, id DESC] Sort --> Window[ROW_NUMBER
Assign rn = 1, 2, 3...] Window --> Filter[Filter rn = 1
Keep only last transaction] Filter --> Output[monthly_last_tx CTE
One row per
account+month with tx] Partition --> P1[Partition 1:
ACC001 + Jan] Partition --> P2[Partition 2:
ACC001 + Mar] Partition --> P3[Partition 3:
ACC002 + Feb] P1 --> S1[Sort: 2024-01-22, 2024-01-15] S1 --> W1[rn: 1, 2] W1 --> F1[Keep rn=1] style Input fill:#f57c00,color:#fff style Partition fill:#7b1fa2,color:#fff style Sort fill:#388e3c,color:#fff style Window fill:#ffa000,color:#111 style Filter fill:#d32f2f,color:#fff style Output fill:#1976d2,color:#fff

What it does:

  • Window Function: ROW_NUMBER() partitions by account and month
  • Ordering: tx_date DESC, id DESC - most recent transaction first
  • Purpose: Identifies the last transaction for each account in each month
  • Filter: Only rows where rn = 1 are the last transactions

How it works:

  1. Partitions transactions by (account_id, month)
  2. Within each partition, orders by tx_date DESC (newest first)
  3. Assigns row number 1 to the most recent transaction
  4. If multiple transactions on same day, uses id DESC as tie-breaker

Example:

Account ACC001 in January:
- Transaction on 2024-01-15: new_balance = 150.75, rn = 2
- Transaction on 2024-01-22: new_balance = 326.00, rn = 1 ← Last transaction

Result: One row per (account, month) combination that has transactions


6. Final SELECT (Outermost - Final Report)

SELECT
    am.account_id AS acct,
    tx.new_balance AS balance,
    TO_CHAR(am.month_end, 'YYYY-MM') AS month
FROM account_months AS am
LEFT JOIN monthly_last_tx AS tx
    ON
        am.account_id = tx.account_id
        AND DATE_TRUNC('month', am.month_end) = tx.tx_month_start
        AND tx.rn = 1
ORDER BY am.account_id, am.month_end;
flowchart TD Spine[account_months
Complete spine
N × 3 rows] --> Join[LEFT JOIN] LastTx[monthly_last_tx
Last transactions
Only accounts+months with tx] --> Join Join --> Match{Match?
account_id + month} Match -->|Yes| HasBalance[Include row
with new_balance] Match -->|No| NullBalance[Include row
balance = NULL] HasBalance --> Format[TO_CHAR month_end
Format as 'YYYY-MM'] NullBalance --> Format Format --> Sort[ORDER BY
account_id, month_end] Sort --> Output[Final Result
Account × Month × Balance
NULL for missing months] style Spine fill:#1976d2,color:#fff style LastTx fill:#ffa000,color:#111 style Join fill:#7b1fa2,color:#fff style Match fill:#388e3c,color:#fff style HasBalance fill:#2e7d32,color:#fff style NullBalance fill:#d32f2f,color:#fff style Format fill:#5c6bc0,color:#fff style Sort fill:#424242,color:#fff style Output fill:#2e7d32,color:#fff

What it does:

  • LEFT JOIN: Joins the complete spine (account_months) with the last transactions (monthly_last_tx)
  • Join conditions:
  • account_id matches
  • Month matches (using DATE_TRUNC to align month-end with month-start)
  • Only last transaction (rn = 1)
  • Result: One row per (account, month) combination
  • If transaction exists: shows new_balance
  • If no transaction: shows NULL (from LEFT JOIN)

Why LEFT JOIN?

  • Preserves all rows from account_months (the spine)
  • Accounts with no transactions in a month get NULL balance
  • Without LEFT JOIN, those months would be missing from results

Final output format:

acct   | balance | month
-------|---------|--------
ACC001 | 326.00  | 2024-01
ACC001 | NULL    | 2024-02
ACC001 | NULL    | 2024-03
ACC002 | NULL    | 2024-01
ACC002 | -25.50  | 2024-02
ACC002 | -125.50 | 2024-03
...

Execution Order (How SQL Engine Processes It)

sequenceDiagram participant Engine as SQL Engine participant Partitions as Partitioned Storage participant Memory as Working Memory Note over Engine,Memory: Query Execution Sequence Engine->>Partitions: 1. Scan transactions table
(Partition pruning: Q1 2024 only) Partitions-->>Engine: Filtered rows (Q1 2024) Engine->>Memory: 2. Create months CTE
(Hardcoded: 3 rows) Note over Memory: Instant (no I/O) Engine->>Memory: 3. Create unique_accounts CTE
(DISTINCT on filtered rows) Note over Memory: Requires scan + hash Engine->>Memory: 4. Create account_months CTE
(CROSS JOIN: N × 3 rows) Note over Memory: Fast (small result set) Engine->>Memory: 5. Create monthly_last_tx CTE
(Window function: ROW_NUMBER) Note over Memory: Most expensive:
Full scan + sort + partition Engine->>Memory: 6. Final SELECT
(LEFT JOIN spine × last_tx) Memory-->>Engine: Final result set Note over Engine: Performance: CTE 4 is bottleneck
for 100M rows (window function)

Step-by-Step:

  1. Base table scan: transactions table (with partition pruning)
  2. CTE 1: months - Hardcoded values (instant)
  3. CTE 2: unique_accounts - DISTINCT on transactions (requires scan)
  4. CTE 3: account_months - CROSS JOIN (fast, small result set)
  5. CTE 4: monthly_last_tx - Window function on transactions (requires full scan + sort)
  6. Final SELECT: LEFT JOIN between spine and last transactions

Performance note: For 100M rows, CTE 4 (window function) is the most expensive operation, but it's necessary to identify the last transaction efficiently.


Key Design Patterns

1. Spine Generation Pattern

  • Problem: Need to show all account×month combinations, even when no data exists
  • Solution: CROSS JOIN to create complete spine, then LEFT JOIN with data
  • Result: Gaps are explicit (NULL) rather than missing rows

2. Window Function Pattern

  • Problem: Need last transaction per account per month efficiently
  • Solution: ROW_NUMBER() with partitioning and ordering
  • Result: Single pass over data (O(n log n) vs O(n²) for nested loops)

3. Partition Pruning Pattern

  • Problem: 100M row table, but only need Q1 2024
  • Solution: WHERE clause on partitioned column (tx_date)
  • Result: Only scans relevant partitions (Jan, Feb, Mar 2024)

Why This Approach?

Alternative 1: Simple GROUP BY (Wrong)

SELECT account_id, MAX(tx_date), new_balance
FROM transactions
GROUP BY account_id, MONTH(tx_date)

Problem: Doesn't show months with no transactions (missing rows)

Alternative 2: SUM(amount) (Wrong per spec)

SELECT account_id, SUM(amount) as balance
FROM transactions
GROUP BY account_id, MONTH(tx_date)

Problem: Spec says to use new_balance from last transaction, not sum amounts

Our Approach: Spine + Window Function (Correct)

  • ✅ Shows all account×month combinations (spine)
  • ✅ Uses new_balance from last transaction (per spec)
  • ✅ Efficient for 100M rows (window function, partition pruning)
  • ✅ Handles NULL months correctly

Gold Layer Aggregation

This query demonstrates the Silver → Gold aggregation pattern (as designed in Task 2 Architecture). After running this query, the result can be stored in the Gold layer as a pre-computed aggregation for efficient reporting.

Storing Results in Gold Layer

The query output represents month-end balance snapshots that can be persisted:

-- Example: Store query results in Gold layer
CREATE TABLE gold.monthly_balances AS
SELECT
    am.account_id AS acct,
    tx.new_balance AS balance,
    TO_CHAR(am.month_end, 'YYYY-MM') AS month,
    CURRENT_TIMESTAMP AS as_of_timestamp
FROM account_months AS am
LEFT JOIN monthly_last_tx AS tx
    ON am.account_id = tx.account_id
    AND DATE_TRUNC('month', am.month_end) = tx.tx_month_start
    AND tx.rn = 1
ORDER BY am.account_id, am.month_end;

Benefits of Gold Layer Storage

  • Performance: Pre-computed aggregations enable O(log n) queries instead of O(n log n)
  • Cost: Reduces Athena query costs (scan less data)
  • Freshness: Can be refreshed monthly after month-end close
  • Governance: Gold layer owned by Business/Finance (per Task 2 architecture)
  • Stability: Stable schema and business contracts for reporting

Monthly Refresh Pattern

For monthly reporting, this query would run:

  1. After month-end close (e.g., first day of next month)
  2. Compute month-end balances from Silver layer (this query)
  3. Store results in Gold layer (s3://aggregated/monthly_balances/year=YYYY/month=MM/)
  4. Update Glue Data Catalog for Athena queries

This pattern transforms ad-hoc Silver layer queries into efficient Gold layer lookups for business reporting.


See Also

  • Full SQL: balance_history_2024_q1.sql
  • Pseudocode: SQL_PSEUDOCODE.md
  • Diagram: SQL_DIAGRAM.md
  • Test Data: tests/test_data.sql
  • Expected Output: tests/expected_output.csv