Appendix B: Full SQL Implementation Code

This appendix contains the complete SQL implementation for generating month-end balance history.

File: balance_history_2024_q1.sql


/*
 Task 3: Month-End Balance History

 Goal: Show account balance history at the end of each month for Jan-Mar 2024.
 Constraint: 100M records (requires efficient indexing/partitioning logic).
 History Logic: Month-end balance is the last `new_balance` observed within a month (ordered by `tx_date`, tie-broken by `id`).
 This implementation keeps months with no transactions as `NULL` (gaps are explicit via the account×month spine).
 If business requirements expect a carry-forward "as-of" balance, extend the final select with a forward-fill of the last known balance.
 Time semantics note: this query uses `tx_date` as the ordering/reporting timestamp; if the domain distinguishes transaction vs posting date,
 month-end reporting should use the posting/cutoff timestamp.

 Engine Note: syntax is standard ANSI/Presto SQL.
*/

-- 1. Generate the spine of all (Account, Month) combinations to ensure gaps are visible
WITH months AS (
    SELECT CAST(date_column AS DATE) AS month_end
    FROM (
        VALUES
        ('2024-01-31'),
        ('2024-02-29'),
        ('2024-03-31')
    ) AS t (date_column)
),

unique_accounts AS (
    SELECT DISTINCT account_id FROM transactions
    -- WHERE tx_date >= '2024-01-01' -- Optimization: filter partition scan if possible
),

account_months AS (
    SELECT
        a.account_id,
        m.month_end
    FROM unique_accounts AS a
    CROSS JOIN months AS m
),

-- 2. Find the last transaction for each account in each month
-- Efficient strategy: Window function partitioning by account/month
monthly_last_tx AS (
    SELECT
        account_id,
        new_balance,
        DATE_TRUNC('month', tx_date) AS tx_month_start,
        ROW_NUMBER() OVER (
            PARTITION BY account_id, DATE_TRUNC('month', tx_date)
            ORDER BY tx_date DESC, id DESC -- id tie-breaker ensures deterministic result
        ) AS rn
    FROM transactions
    WHERE tx_date >= '2024-01-01' AND tx_date < '2024-04-01'
)

-- 3. Join spine with data to produce final report
SELECT
    am.account_id AS acct,
    tx.new_balance AS balance, -- Formatting string may vary by dialect (Presto: format_datetime)
    TO_CHAR(am.month_end, 'YYYY-MM') AS month
FROM account_months AS am
LEFT JOIN monthly_last_tx AS tx
    ON
        am.account_id = tx.account_id
        AND DATE_TRUNC('month', am.month_end) = tx.tx_month_start
        AND tx.rn = 1
ORDER BY am.account_id, am.month_end;

/*
Performance Considerations for 100M rows:

1. Partitioning: The 'transactions' table should be partitioned by tx_date (month/day).

   This allows the query engine to prune partitions outside Jan-Mar 2024 immediately.

2. Clustering/Indexing: Data should be sorted/clustered by 'account_id' within partitions.

   This makes the 'DISTINCT account_id' and window function operations much faster (avoids heavy shuffle).
*/

/*
Query Explanation:

This query produces month-end balance history for Q1 2024 (January, February, March).

1. Month-end balance definition:
   - The balance at the end of a month is defined as the last transaction's `new_balance`

     observed within that month (ordered by `tx_date` descending, tie-broken by `id`).

2. Account×month spine:
   - The query generates a complete spine of all (account, month) combinations using

     CROSS JOIN to ensure gaps are visible.

   - Accounts with no activity in a month return NULL for balance (explicit gaps).

3. Window function approach:
   - Uses ROW_NUMBER() partitioned by account and month to identify the last transaction.
   - Efficient for large datasets (100M rows) as it avoids multiple scans.

4. Partition pruning:
   - The WHERE clause filters to Jan-Mar 2024, enabling partition pruning on `tx_date`.
   - This minimizes data scanned and improves query performance.

5. Engine portability:
   - Uses standard ANSI/Presto SQL syntax.
   - Compatible with Athena, BigQuery, Snowflake, and other modern SQL engines.
   - Avoids engine-specific features for maximum portability.

Ownership & Governance (Task 3):

| Aspect | Owner | Responsibility |
| --- | --- | --- |
| **Gold Tables** | Business (Finance) | Analysts query Gold tables only for reporting |
| **Silver Tables** | Domain Teams | Technical users may query Silver for ad-hoc analysis |
| **Bronze Layer**  Data Platform Team  Not queried for analytics (raw audit trail only)

Rules:

- Analysts query Gold tables only (business contracts, stable schemas).
- Silver tables used by technical users for ad-hoc analysis and debugging.
- Bronze is not queried for analytics (immutable raw data, platform team access only).

*/

See Also

  • Pseudocode: SQL_PSEUDOCODE.md
  • Architecture Diagram: SQL_DIAGRAM.md
  • Schema Definition: schema.sql
  • Test Data: tests/test_data.sql