Most analysts reach for pd.read_csv() out of habit. Load the file, then query it. DuckDB flips that — you write SQL against the file itself, and DuckDB handles the rest. For large files, this isn't just convenient. It's meaningfully faster, and it uses a fraction of the memory.
DuckDB is an in-process analytical database. There's nothing to install beyond the Python package, no server, no config. It runs inside your Colab session and disappears when the notebook closes.
Step 1 — Install and connect. One pip install, two imports, one connection:
!pip install duckdb
import duckdb
import pandas as pd
# Connect to an in-memory DuckDB instance
con = duckdb.connect()
Step 2 — Create a sample CSV to query. In Colab we write the file to /tmp, then query it directly:
# Build a sample sales CSV and write it to disk
data = {
    'date': ['2026-01-03', '2026-01-07', '2026-01-12', '2026-01-18', '2026-01-22',
             '2026-02-01', '2026-02-09', '2026-02-14', '2026-02-20', '2026-03-05',
             '2026-03-11', '2026-03-18'],
    'region': ['East', 'West', 'North', 'East', 'West', 'North',
               'East', 'West', 'North', 'East', 'West', 'North'],
    'product': ['Pro', 'Starter', 'Pro', 'Starter', 'Pro', 'Pro',
                'Starter', 'Pro', 'Starter', 'Pro', 'Starter', 'Pro'],
    'revenue': [1200, 450, 1800, 390, 1650, 1100, 420, 1750, 380, 1350, 470, 1500],
    'units': [1, 2, 1, 2, 1, 1, 2, 1, 2, 1, 2, 1]
}
df = pd.DataFrame(data)
df.to_csv('/tmp/sales.csv', index=False)
print(f"Wrote {len(df)} rows to /tmp/sales.csv")
Step 3 — Query the file directly. The path goes straight into SQL — no read_csv(), no DataFrame first:
# Query the CSV file directly — DuckDB reads it on the fly
result = con.execute("""
    SELECT *
    FROM '/tmp/sales.csv'
    ORDER BY date
""").df()
print(result)
# Aggregate by region without loading into pandas first
result = con.execute("""
    SELECT
        region,
        COUNT(*) AS orders,
        SUM(revenue) AS total_revenue,
        ROUND(AVG(revenue), 0) AS avg_order
    FROM '/tmp/sales.csv'
    GROUP BY region
    ORDER BY total_revenue DESC
""").df()
print(result)
The .df() at the end converts the DuckDB result to a pandas DataFrame — handy when you need the result for further Python work. If you don't need pandas, you can call .fetchall() instead and get plain Python tuples.
Step 4 — Use window functions and filters, same as any SQL. DuckDB supports a rich SQL dialect, including CTEs, window functions, and QUALIFY:
# Rank products by revenue within each region
result = con.execute("""
    SELECT
        region,
        product,
        SUM(revenue) AS revenue,
        RANK() OVER (PARTITION BY region ORDER BY SUM(revenue) DESC) AS revenue_rank
    FROM '/tmp/sales.csv'
    GROUP BY region, product
    QUALIFY revenue_rank = 1
""").df()
print(result)
con.close()
QUALIFY filters on window-function results directly, with no wrapping subquery. It isn't part of standard SQL (DuckDB shares it with dialects like Snowflake and BigQuery), and it's one of several places where DuckDB's SQL is noticeably cleaner than SQLite's.
The real payoff comes when your CSV has a million rows. DuckDB streams the file and keeps only what the query needs in memory; pandas materializes the entire file as a DataFrame before you can filter a single row. At scale, that difference is not small.
Want to go deeper? Browse our full resource library →