Window Functions and CTEs in SQL

GROUP BY collapses rows into one per group. That's often what you want — but not always. Sometimes you need each row to stay intact while still knowing its rank within a group, its value relative to the previous period, or a running cumulative total. Window functions do exactly that: they compute across a defined set of rows without reducing them.

CTEs (WITH clauses) complement window functions by letting you name an intermediate result and reference it like a table. The two features together make complex analytical queries readable instead of nested beyond recognition. This project covers both.

Step 1 — Build the dataset. Monthly revenue by region and product — the kind of data that lives in any data warehouse:

import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')

conn.executescript("""
CREATE TABLE monthly_sales (
  id       INTEGER PRIMARY KEY,
  month    TEXT,
  region   TEXT,
  product  TEXT,
  revenue  REAL
);

INSERT INTO monthly_sales VALUES
  (1,  '2026-01', 'East',  'Pro',     14200),
  (2,  '2026-01', 'West',  'Pro',     11800),
  (3,  '2026-01', 'North', 'Starter',  4900),
  (4,  '2026-02', 'East',  'Pro',     15600),
  (5,  '2026-02', 'West',  'Pro',     10200),
  (6,  '2026-02', 'North', 'Starter',  5300),
  (7,  '2026-03', 'East',  'Pro',     17100),
  (8,  '2026-03', 'West',  'Pro',     13400),
  (9,  '2026-03', 'North', 'Starter',  6100),
  (10, '2026-04', 'East',  'Pro',     16500),
  (11, '2026-04', 'West',  'Pro',     14800),
  (12, '2026-04', 'North', 'Starter',  5750);
""")
conn.commit()
print("Ready.")

Step 2 — Rank within groups. RANK() and ROW_NUMBER() compute a position for each row within a partition. The OVER (PARTITION BY ... ORDER BY ...) clause defines what "within" means:

# Rank each region by revenue within each month
df = pd.read_sql_query("""
  SELECT
    month,
    region,
    revenue,
    RANK() OVER (PARTITION BY month ORDER BY revenue DESC) AS rank_in_month
  FROM monthly_sales
  ORDER BY month, rank_in_month
""", conn)
print(df)

# ROW_NUMBER() always produces unique ranks (no ties); RANK() can repeat
df = pd.read_sql_query("""
  SELECT
    month,
    region,
    revenue,
    ROW_NUMBER() OVER (PARTITION BY region ORDER BY month) AS period_num
  FROM monthly_sales
  ORDER BY region, period_num
""", conn)
print(df)

Step 3 — Period-over-period comparison with LAG(). LAG() looks back one row in the window to pull the previous period's value — without a self-join:

# Month-over-month revenue change per region
df = pd.read_sql_query("""
  SELECT
    region,
    month,
    revenue,
    LAG(revenue) OVER (PARTITION BY region ORDER BY month) AS prev_month,
    ROUND(
      (revenue - LAG(revenue) OVER (PARTITION BY region ORDER BY month))
      / LAG(revenue) OVER (PARTITION BY region ORDER BY month) * 100,
    1) AS pct_change
  FROM monthly_sales
  ORDER BY region, month
""", conn)
print(df.to_string())

The first row per region will have NULL for prev_month — there's no prior period to look back to. That's expected and correct.

Step 4 — Running totals and CTEs. A CTE names an intermediate result so you can reference it cleanly. Here, a CTE computes the running total, then the outer query filters it:

# Running cumulative revenue per region, then filter to latest snapshot
df = pd.read_sql_query("""
  WITH running AS (
    SELECT
      region,
      month,
      revenue,
      SUM(revenue) OVER (
        PARTITION BY region
        ORDER BY month
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
      ) AS cumulative_revenue
    FROM monthly_sales
  )
  SELECT *
  FROM running
  ORDER BY region, month
""", conn)
print(df.to_string())

# CTE on CTE: rank regions by their total 4-month revenue
df = pd.read_sql_query("""
  WITH totals AS (
    SELECT region, SUM(revenue) AS total_rev
    FROM monthly_sales
    GROUP BY region
  ),
  ranked AS (
    SELECT region, total_rev,
           RANK() OVER (ORDER BY total_rev DESC) AS overall_rank
    FROM totals
  )
  SELECT * FROM ranked
""", conn)
print(df)

conn.close()

CTEs are not a performance feature — they're a readability feature. The database engine can often optimize them the same way it would a subquery. Use them when nesting gets deep enough to obscure what the query is actually doing.

Window functions feel abstract until you need to answer "how does this month compare to last month, per region?" without writing a mess of subqueries. Then they feel essential.

Ready for the deep end? Browse our full resource library →