Most business analysis questions boil down to the same handful of operations: group the data, aggregate it, sort it, and present it clearly. Pandas handles all of that. This project walks through a representative sales analysis from a raw DataFrame to a formatted summary — the kind of thing you'd produce before a weekly review meeting.

No API key needed. No credentials. Just pandas, which Colab has pre-installed.


Build a sample dataset inline — this represents the kind of raw export you'd get from a CRM or order management system:

import pandas as pd

data = {
    "order_id":   [1,2,3,4,5,6,7,8,9,10],
    "rep":        ["Alice","Bob","Alice","Carol","Bob","Carol","Alice","Bob","Carol","Alice"],
    "region":     ["East","West","East","North","West","North","East","West","North","East"],
    "product":    ["Pro","Starter","Pro","Pro","Starter","Pro","Starter","Pro","Starter","Pro"],
    "revenue":    [1200,450,1800,1500,380,1100,420,1650,390,1350],
    "closed_date":["2026-01-05","2026-01-08","2026-01-15","2026-01-20","2026-02-02",
                   "2026-02-10","2026-02-18","2026-03-01","2026-03-12","2026-03-22"],
}

df = pd.DataFrame(data)
df["closed_date"] = pd.to_datetime(df["closed_date"])
df["month"] = df["closed_date"].dt.to_period("M")

With the data structured, the analysis is a series of groupby operations. Revenue by rep, win rate by product, and monthly trend are the three most common asks in a sales review:

# Revenue and deal count by rep — sorted by revenue
by_rep = (df.groupby("rep")
            .agg(total_revenue=("revenue","sum"), deals=("order_id","count"))
            .sort_values("total_revenue", ascending=False))
print("── Revenue by Rep ──")
print(by_rep)

# Average deal size by product
by_product = df.groupby("product")["revenue"].mean().round(0)
print("\n── Avg Deal Size by Product ──")
print(by_product)

# Monthly revenue trend
by_month = df.groupby("month")["revenue"].sum()
print("\n── Monthly Revenue ──")
print(by_month)

The .agg() pattern with named outputs keeps the result readable and avoids column renaming after the fact. Name your aggregations at the point of creation — it saves time every time someone else reads your code.


One more useful slice: ranking reps within their region. This is the kind of comparison that takes a meeting to surface manually but a few lines to compute:

df["region_rank"] = (df.groupby("region")["revenue"]
                       .rank(ascending=False, method="min")
                       .astype(int))

print(df[["rep","region","revenue","region_rank"]]
        .sort_values(["region","region_rank"]))

The analysis itself is fast. What takes time is knowing which question to ask — and whether the answer is actually telling you something useful about the business.


Want to go deeper? Browse our full resource library →