SQL Query Breakdown: From Inner to Outer¶
This document breaks down the balance_history_2024_q1.sql query from the most inner table working outwards, explaining each CTE and the final SELECT statement.
Query Structure Overview¶
The query uses a nested CTE (Common Table Expression) approach, building from simple to complex:
Base Table
100M rows] --> Filter[WHERE Filter
tx_date Q1 2024
Partition Pruning] Filter --> CTE1[CTE 1: months
Hardcoded Values
3 rows] Filter --> CTE2[CTE 2: unique_accounts
DISTINCT account_id
N unique accounts] CTE1 --> CTE3[CTE 3: account_months
CROSS JOIN
N × 3 rows] CTE2 --> CTE3 Filter --> CTE4[CTE 4: monthly_last_tx
Window Function ROW_NUMBER
Partition by account+month
Order by tx_date DESC] CTE3 --> Final[Final SELECT
LEFT JOIN
account_months × monthly_last_tx] CTE4 --> Final Final --> Output[Output
Account × Month × Balance
NULL for missing months] style Base fill:#f57c00,color:#fff style Filter fill:#7b1fa2,color:#fff style CTE1 fill:#388e3c,color:#fff style CTE2 fill:#388e3c,color:#fff style CTE3 fill:#1976d2,color:#fff style CTE4 fill:#ffa000,color:#111 style Final fill:#d32f2f,color:#fff style Output fill:#2e7d32,color:#fff
Textual Flow:
transactions (base table)
↓
months (CTE 1 - innermost)
↓
unique_accounts (CTE 2)
↓
account_months (CTE 3)
↓
monthly_last_tx (CTE 4)
↓
Final SELECT (outermost) Breakdown: Inner to Outer¶
1. Base Table: transactions (Innermost - Data Source)¶
FROM transactions
WHERE tx_date >= '2024-01-01' AND tx_date < '2024-04-01' 100M rows
Partitioned by tx_date] --> Filter[WHERE Filter
tx_date >= '2024-01-01'
AND tx_date < '2024-04-01'] Filter --> Prune[Partition Pruning
Only scan Q1 2024 partitions] Prune --> Output[Filtered Rows
Q1 2024 transactions only] style Table fill:#f57c00,color:#fff style Filter fill:#7b1fa2,color:#fff style Prune fill:#388e3c,color:#fff style Output fill:#1976d2,color:#fff
What it does:
- Reads from the
transactionstable (as defined in Appendix A) - Filters to Q1 2024 (January, February, March)
- Partition pruning: If table is partitioned by
tx_date, only scans relevant partitions
Columns used:
account_id- Account identifiernew_balance- Balance after transaction (per Appendix A spec)tx_date- Transaction dateid- Transaction ID (for tie-breaking)
2. CTE 1: months (First CTE - Month Spine)¶
WITH months AS (
SELECT CAST(date_column AS DATE) AS month_end
FROM (
VALUES
('2024-01-31'),
('2024-02-29'),
('2024-03-31')
) AS t (date_column)
) '2024-01-31'
'2024-02-29'
'2024-03-31'] --> Cast[CAST to DATE
String → DATE type] Cast --> Output[months CTE
3 rows
month_end column] style Values fill:#388e3c,color:#fff style Cast fill:#7b1fa2,color:#fff style Output fill:#1976d2,color:#fff
What it does:
- Creates a hardcoded list of month-end dates for Q1 2024
- Purpose: Defines the reporting period explicitly
- Result: 3 rows (one per month)
Why month-end dates?
- January 31, February 29 (2024 is leap year), March 31
- These represent the "as-of" dates for month-end reporting
3. CTE 2: unique_accounts (Second CTE - Account Universe)¶
unique_accounts AS (
SELECT DISTINCT account_id FROM transactions
-- WHERE tx_date >= '2024-01-01' -- Optimization: filter partition scan if possible
) Multiple rows per account] --> Distinct[DISTINCT account_id
Remove duplicates] Distinct --> Hash[Hash/Deduplication
Group by account_id] Hash --> Output[unique_accounts CTE
N rows
One per unique account] style Input fill:#f57c00,color:#fff style Distinct fill:#7b1fa2,color:#fff style Hash fill:#388e3c,color:#fff style Output fill:#1976d2,color:#fff
What it does:
- Extracts all unique account IDs from the transactions table
- Purpose: Defines the "universe" of accounts to report on
- Result: One row per unique account
Note: The commented WHERE clause would optimize by filtering during the DISTINCT operation, but for clarity we filter in the next CTE.
4. CTE 3: account_months (Third CTE - Complete Spine)¶
account_months AS (
SELECT
a.account_id,
m.month_end
FROM unique_accounts AS a
CROSS JOIN months AS m
) N accounts
ACC001, ACC002, ACC003...] --> Cross[CROSS JOIN
Cartesian Product] Months[months
3 months
2024-01-31, 2024-02-29, 2024-03-31] --> Cross Cross --> Output[account_months CTE
N × 3 rows
Complete spine] Output --> Example[Example:
ACC001 × Jan, Feb, Mar
ACC002 × Jan, Feb, Mar
ACC003 × Jan, Feb, Mar] style Accounts fill:#388e3c,color:#fff style Months fill:#388e3c,color:#fff style Cross fill:#7b1fa2,color:#fff style Output fill:#1976d2,color:#fff style Example fill:#ffa000,color:#111
What it does:
- CROSS JOIN: Creates Cartesian product of all accounts × all months
- Purpose: Generates a complete "spine" of (account, month) combinations
- Result: If there are N accounts and 3 months, produces N×3 rows
Why CROSS JOIN?
- Ensures every account appears for every month
- Months with no transactions will show NULL balance (explicit gaps)
- Without this, accounts with no activity in a month would be missing from results
Example output:
account_id | month_end
-----------|----------
ACC001 | 2024-01-31
ACC001 | 2024-02-29
ACC001 | 2024-03-31
ACC002 | 2024-01-31
ACC002 | 2024-02-29
ACC002 | 2024-03-31
... 5. CTE 4: monthly_last_tx (Fourth CTE - Last Transaction per Month)¶
monthly_last_tx AS (
SELECT
account_id,
new_balance,
DATE_TRUNC('month', tx_date) AS tx_month_start,
ROW_NUMBER() OVER (
PARTITION BY account_id, DATE_TRUNC('month', tx_date)
ORDER BY tx_date DESC, id DESC
) AS rn
FROM transactions
WHERE tx_date >= '2024-01-01' AND tx_date < '2024-04-01'
) Q1 2024 filtered] --> Partition[PARTITION BY
account_id + month] Partition --> Sort[ORDER BY
tx_date DESC, id DESC] Sort --> Window[ROW_NUMBER
Assign rn = 1, 2, 3...] Window --> Filter[Filter rn = 1
Keep only last transaction] Filter --> Output[monthly_last_tx CTE
One row per
account+month with tx] Partition --> P1[Partition 1:
ACC001 + Jan] Partition --> P2[Partition 2:
ACC001 + Mar] Partition --> P3[Partition 3:
ACC002 + Feb] P1 --> S1[Sort: 2024-01-22, 2024-01-15] S1 --> W1[rn: 1, 2] W1 --> F1[Keep rn=1] style Input fill:#f57c00,color:#fff style Partition fill:#7b1fa2,color:#fff style Sort fill:#388e3c,color:#fff style Window fill:#ffa000,color:#111 style Filter fill:#d32f2f,color:#fff style Output fill:#1976d2,color:#fff
What it does:
- Window Function:
ROW_NUMBER()partitions by account and month - Ordering:
tx_date DESC, id DESC- most recent transaction first - Purpose: Identifies the last transaction for each account in each month
- Filter: Only rows where
rn = 1are the last transactions
How it works:
- Partitions transactions by
(account_id, month) - Within each partition, orders by
tx_date DESC(newest first) - Assigns row number 1 to the most recent transaction
- If multiple transactions on same day, uses
id DESCas tie-breaker
Example:
Account ACC001 in January:
- Transaction on 2024-01-15: new_balance = 150.75, rn = 2
- Transaction on 2024-01-22: new_balance = 326.00, rn = 1 ← Last transaction Result: One row per (account, month) combination that has transactions
6. Final SELECT (Outermost - Final Report)¶
SELECT
am.account_id AS acct,
tx.new_balance AS balance,
TO_CHAR(am.month_end, 'YYYY-MM') AS month
FROM account_months AS am
LEFT JOIN monthly_last_tx AS tx
ON
am.account_id = tx.account_id
AND DATE_TRUNC('month', am.month_end) = tx.tx_month_start
AND tx.rn = 1
ORDER BY am.account_id, am.month_end; Complete spine
N × 3 rows] --> Join[LEFT JOIN] LastTx[monthly_last_tx
Last transactions
Only accounts+months with tx] --> Join Join --> Match{Match?
account_id + month} Match -->|Yes| HasBalance[Include row
with new_balance] Match -->|No| NullBalance[Include row
balance = NULL] HasBalance --> Format[TO_CHAR month_end
Format as 'YYYY-MM'] NullBalance --> Format Format --> Sort[ORDER BY
account_id, month_end] Sort --> Output[Final Result
Account × Month × Balance
NULL for missing months] style Spine fill:#1976d2,color:#fff style LastTx fill:#ffa000,color:#111 style Join fill:#7b1fa2,color:#fff style Match fill:#388e3c,color:#fff style HasBalance fill:#2e7d32,color:#fff style NullBalance fill:#d32f2f,color:#fff style Format fill:#5c6bc0,color:#fff style Sort fill:#424242,color:#fff style Output fill:#2e7d32,color:#fff
What it does:
- LEFT JOIN: Joins the complete spine (
account_months) with the last transactions (monthly_last_tx) - Join conditions:
account_idmatches- Month matches (using
DATE_TRUNCto align month-end with month-start) - Only last transaction (
rn = 1) - Result: One row per (account, month) combination
- If transaction exists: shows
new_balance - If no transaction: shows
NULL(from LEFT JOIN)
Why LEFT JOIN?
- Preserves all rows from
account_months(the spine) - Accounts with no transactions in a month get NULL balance
- Without LEFT JOIN, those months would be missing from results
Final output format:
acct | balance | month
-------|---------|--------
ACC001 | 326.00 | 2024-01
ACC001 | NULL | 2024-02
ACC001 | NULL | 2024-03
ACC002 | NULL | 2024-01
ACC002 | -25.50 | 2024-02
ACC002 | -125.50 | 2024-03
... Execution Order (How SQL Engine Processes It)¶
(Partition pruning: Q1 2024 only) Partitions-->>Engine: Filtered rows (Q1 2024) Engine->>Memory: 2. Create months CTE
(Hardcoded: 3 rows) Note over Memory: Instant (no I/O) Engine->>Memory: 3. Create unique_accounts CTE
(DISTINCT on filtered rows) Note over Memory: Requires scan + hash Engine->>Memory: 4. Create account_months CTE
(CROSS JOIN: N × 3 rows) Note over Memory: Fast (small result set) Engine->>Memory: 5. Create monthly_last_tx CTE
(Window function: ROW_NUMBER) Note over Memory: Most expensive:
Full scan + sort + partition Engine->>Memory: 6. Final SELECT
(LEFT JOIN spine × last_tx) Memory-->>Engine: Final result set Note over Engine: Performance: CTE 4 is bottleneck
for 100M rows (window function)
Step-by-Step:
- Base table scan:
transactionstable (with partition pruning) - CTE 1:
months- Hardcoded values (instant) - CTE 2:
unique_accounts- DISTINCT on transactions (requires scan) - CTE 3:
account_months- CROSS JOIN (fast, small result set) - CTE 4:
monthly_last_tx- Window function on transactions (requires full scan + sort) - Final SELECT: LEFT JOIN between spine and last transactions
Performance note: For 100M rows, CTE 4 (window function) is the most expensive operation, but it's necessary to identify the last transaction efficiently.
Key Design Patterns¶
1. Spine Generation Pattern¶
- Problem: Need to show all account×month combinations, even when no data exists
- Solution: CROSS JOIN to create complete spine, then LEFT JOIN with data
- Result: Gaps are explicit (NULL) rather than missing rows
2. Window Function Pattern¶
- Problem: Need last transaction per account per month efficiently
- Solution:
ROW_NUMBER()with partitioning and ordering - Result: Single pass over data (O(n log n) vs O(n²) for nested loops)
3. Partition Pruning Pattern¶
- Problem: 100M row table, but only need Q1 2024
- Solution: WHERE clause on partitioned column (
tx_date) - Result: Only scans relevant partitions (Jan, Feb, Mar 2024)
Why This Approach?¶
Alternative 1: Simple GROUP BY (Wrong)¶
SELECT account_id, MAX(tx_date), new_balance
FROM transactions
GROUP BY account_id, MONTH(tx_date) Problem: Doesn't show months with no transactions (missing rows)
Alternative 2: SUM(amount) (Wrong per spec)¶
SELECT account_id, SUM(amount) as balance
FROM transactions
GROUP BY account_id, MONTH(tx_date) Problem: Spec says to use new_balance from last transaction, not sum amounts
Our Approach: Spine + Window Function (Correct)¶
- ✅ Shows all account×month combinations (spine)
- ✅ Uses
new_balancefrom last transaction (per spec) - ✅ Efficient for 100M rows (window function, partition pruning)
- ✅ Handles NULL months correctly
Gold Layer Aggregation¶
This query demonstrates the Silver → Gold aggregation pattern (as designed in Task 2 Architecture). After running this query, the result can be stored in the Gold layer as a pre-computed aggregation for efficient reporting.
Storing Results in Gold Layer¶
The query output represents month-end balance snapshots that can be persisted:
-- Example: Store query results in Gold layer
CREATE TABLE gold.monthly_balances AS
SELECT
am.account_id AS acct,
tx.new_balance AS balance,
TO_CHAR(am.month_end, 'YYYY-MM') AS month,
CURRENT_TIMESTAMP AS as_of_timestamp
FROM account_months AS am
LEFT JOIN monthly_last_tx AS tx
ON am.account_id = tx.account_id
AND DATE_TRUNC('month', am.month_end) = tx.tx_month_start
AND tx.rn = 1
ORDER BY am.account_id, am.month_end; Benefits of Gold Layer Storage¶
- Performance: Pre-computed aggregations enable O(log n) queries instead of O(n log n)
- Cost: Reduces Athena query costs (scan less data)
- Freshness: Can be refreshed monthly after month-end close
- Governance: Gold layer owned by Business/Finance (per Task 2 architecture)
- Stability: Stable schema and business contracts for reporting
Monthly Refresh Pattern¶
For monthly reporting, this query would run:
- After month-end close (e.g., first day of next month)
- Compute month-end balances from Silver layer (this query)
- Store results in Gold layer (
s3://aggregated/monthly_balances/year=YYYY/month=MM/) - Update Glue Data Catalog for Athena queries
This pattern transforms ad-hoc Silver layer queries into efficient Gold layer lookups for business reporting.
See Also¶
- Full SQL:
balance_history_2024_q1.sql - Pseudocode:
SQL_PSEUDOCODE.md - Diagram:
SQL_DIAGRAM.md - Test Data:
tests/test_data.sql - Expected Output:
tests/expected_output.csv