Ask Your Dataset Questions in Plain English with Gemini

Before you write a single query or build a chart, there's a question-asking phase: what's in this data? What looks unusual? What should I investigate first? Traditionally that means df.describe(), a few groupbys, some eyeballing. It works. It's also slow when you're looking at unfamiliar data for the first time.

This project pipes a dataset directly into Gemini and asks it to do that first-pass analysis for you. The goal isn't to replace real analysis — it's to get a faster starting point so you spend your time on the interesting questions, not the obvious ones.

Setup — get your Gemini API key. Google AI Studio provides a free API key with generous rate limits. Go to aistudio.google.com, create a key, and store it in Colab Secrets:

# In Colab: open the Secrets panel (🔑 icon in the left sidebar)
# Add your key with the name GEMINI_API_KEY, then enable notebook access.

!pip install google-generativeai

from google.colab import userdata
import google.generativeai as genai
import pandas as pd

genai.configure(api_key=userdata.get('GEMINI_API_KEY'))
model = genai.GenerativeModel('gemini-1.5-flash')

Step 1 — Load a dataset. We'll use a realistic e-commerce orders table:

# Sample dataset: 20 orders with some interesting patterns baked in
data = {
    'order_id':   range(1001, 1021),
    'customer':   ['Alice','Bob','Carol','Alice','Dave','Eve','Bob','Carol','Frank','Alice',
                   'Dave','Eve','Bob','Grace','Alice','Frank','Carol','Dave','Grace','Eve'],
    'product':    ['Pro','Starter','Pro','Pro','Starter','Pro','Starter','Pro','Starter','Pro',
                   'Pro','Starter','Pro','Starter','Pro','Pro','Starter','Pro','Starter','Pro'],
    'amount':     [1200,450,1800,1350,380,1100,420,1750,390,1200,
                   1650,470,1300,410,1800,1500,430,1100,395,1250],
    'region':     ['East','West','North','East','West','North','West','North','East','East',
                   'West','North','West','South','East','North','West','West','South','North'],
    'status':     ['complete','complete','complete','complete','refunded','complete','complete','complete','complete','complete',
                   'complete','complete','pending','complete','complete','complete','refunded','complete','complete','complete'],
    'month':      ['Jan','Jan','Jan','Feb','Feb','Feb','Feb','Mar','Mar','Mar',
                   'Mar','Mar','Apr','Apr','Apr','Apr','May','May','May','Jun']
}

df = pd.DataFrame(data)
print(df.shape)
print(df.head())

Step 2 — Send the data to Gemini and ask for a first-pass analysis. Convert the DataFrame to a readable string and include it in the prompt:

# Convert dataset to string for the prompt
data_str = df.to_string(index=False)

response = model.generate_content(f"""
You are a data analyst doing a first-pass exploration of a new dataset.

Here is the dataset:
{data_str}

Answer the following questions:
1. Which customer has the highest total spend? How much?
2. Which product generates more revenue overall?
3. Is there any region that stands out as unusually high or low?
4. Are there any orders or patterns that look worth investigating?

Be specific and cite numbers from the data.
""")

print(response.text)
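The prompt above is written by hand. If you expect to reuse it with other tables or question lists, one option is to wrap it in a small helper. The sketch below is an illustration, not part of the original notebook: `build_prompt` and its `max_rows` parameter are hypothetical names. It also assumes CSV is an acceptable stand-in for `to_string()`, and that sending only the first rows of a large table is fine for a first pass.

```python
import pandas as pd

def build_prompt(df: pd.DataFrame, questions: list[str], max_rows: int = 200) -> str:
    """Assemble a first-pass analysis prompt from a DataFrame and a list of questions."""
    # Cap the rows so a large table does not overflow the prompt;
    # head() keeps the first max_rows rows in their original order.
    sample = df.head(max_rows)
    # CSV keeps column boundaries unambiguous and avoids padding whitespace.
    data_str = sample.to_csv(index=False)
    # Number the questions the same way the hand-written prompt does.
    numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(questions, start=1))
    # Tell the model when it is only seeing part of the table.
    note = ""
    if len(df) > max_rows:
        note = f"(Showing the first {max_rows} of {len(df)} rows.)\n"
    return (
        "You are a data analyst doing a first-pass exploration of a new dataset.\n\n"
        f"Here is the dataset as CSV:\n{note}{data_str}\n"
        f"Answer the following questions:\n{numbered}\n\n"
        "Be specific and cite numbers from the data."
    )
```

With the setup from this notebook, a call would look like `model.generate_content(build_prompt(df, ["Which customer has the highest total spend?"]))`. Truncating with `head()` biases the sample toward the start of the table, so for ordered data a random sample may be the better choice.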

Step 3 — Ask follow-up questions. Each generate_content call is independent, so the model does not remember the dataset from the previous request. Include the data again in every follow-up prompt:

# Ask a targeted follow-up
response2 = model.generate_content(f"""
Using the same dataset:
{data_str}

Which customers have made more than 2 purchases?
For each one, what is their average order value and most common product?
""")

print(response2.text)

# Ask for an anomaly check
response3 = model.generate_content(f"""
Using the same dataset:
{data_str}

Flag any orders that look anomalous — unusually high or low amounts,
unexpected status values, or any other pattern worth a closer look.
Explain your reasoning for each flag.
""")

print(response3.text)
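Each follow-up above pastes the table into a fresh prompt. An alternative is a chat session: the google-generativeai SDK provides `model.start_chat()` and `chat.send_message()`, and the session keeps earlier turns in its history. The sketch below assumes that interface; `ask_with_context` is a hypothetical helper name, and it only requires an object whose `send_message(text)` returns something with a `.text` attribute.

```python
def ask_with_context(chat, data_str, questions):
    """Send the dataset once, then ask each follow-up in the same chat session.

    `chat` is assumed to be a session created with `model.start_chat()`.
    Returns a dict mapping each question to the model's text reply.
    """
    # First turn: hand over the data so it becomes part of the chat history.
    chat.send_message(
        "Here is a dataset. Use it to answer the questions that follow.\n\n" + data_str
    )
    answers = {}
    for question in questions:
        # Later turns rely on the history instead of pasting the table again.
        answers[question] = chat.send_message(question).text
    return answers

# Usage in the notebook (requires the configured `model` and `data_str` from above):
# chat = model.start_chat()
# answers = ask_with_context(chat, data_str, [
#     "Which customers have made more than 2 purchases?",
#     "Flag any orders that look anomalous and explain why.",
# ])
# for question, answer in answers.items():
#     print(question, "\n", answer, "\n")
```

This mainly tidies the code. To my understanding the session still sends its accumulated history with every turn, so the table continues to count toward the tokens of each request.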

Gemini won't replace groupby for precise calculations, and it can hallucinate numbers on complex aggregations — always verify specific figures against the raw data. What it's genuinely useful for is the first five minutes: getting oriented, surfacing what to look at next, and asking questions you didn't know to ask.
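Checking the model's figures takes only a few lines of pandas. The sketch below recomputes the numbers behind the Step 2 and Step 3 questions so they can be compared with the model's answers. `verify_first_pass` is a hypothetical helper name, and it assumes the column names used in this notebook (`customer`, `product`, `amount`, `region`, `status`).

```python
import pandas as pd

def verify_first_pass(df: pd.DataFrame) -> dict:
    """Compute the figures the prompts asked about, for checking the model's answers."""
    # Total spend per customer, largest first.
    spend = df.groupby("customer")["amount"].sum().sort_values(ascending=False)
    # Revenue per product.
    revenue = df.groupby("product")["amount"].sum()
    # Order count, total and mean amount per region, to spot outlier regions.
    region = df.groupby("region")["amount"].agg(["count", "sum", "mean"])
    # Customers with more than 2 purchases.
    order_counts = df["customer"].value_counts()
    repeat = order_counts[order_counts > 2].index
    return {
        "top_customer": (spend.index[0], int(spend.iloc[0])),
        "revenue_by_product": revenue.to_dict(),
        "region_summary": region,
        "repeat_customers": sorted(repeat),
        # Refunded or pending orders are natural candidates for a closer look.
        "non_complete_orders": df[df["status"] != "complete"],
    }
```

Calling `verify_first_pass(df)` on the orders table returns the values to compare against `response.text`. Any disagreement is worth resolving in favor of pandas.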
