A friend of mine sells kente on Saturdays at Kaneshie market. Last month she sat down with a notebook full of sales — about three hundred entries scribbled across two weeks — and asked me to help her make sense of it. We sat on her veranda with a cup of tea and a pen. Within ten minutes I noticed something: every question she had about that notebook was really one of three questions in a coat.
She would ask "how did the long pieces do?" and we'd cross out every row that wasn't a long piece. She'd ask "what sold first each day?" and we'd put the day's rows in time order. She'd ask "which colour brought the most money in?" and we'd sort her rows into colour piles and add up each pile. Filter. Sort. Group. Over and over.
That afternoon is what I want you to remember when pandas starts to feel like a wall of method names. Almost every analysis you'll ever do is one of those three moves, or two of them stacked, or all three in a row.
The three questions, plainly
When you sit down with a table — a CSV, a spreadsheet, a notebook of sales — you are almost always asking one of these:
- Which rows match a condition? (filter)
- In what order should I look at them? (sort)
- How do these rows split into groups, and what's true of each group? (group)
That's it. Once you can hear which question you're really asking, the pandas code more or less writes itself.
Filtering — the weird-looking line that isn't weird
The first time I saw this line of pandas I thought someone had made a typo:
```python
df[df["gpa"] >= 3.5]
```

Why is `df` written twice? What is going on inside those brackets?
Here's the way to read it. The inner part — `df["gpa"] >= 3.5` — doesn't pick rows. It produces a column the same length as your table, full of True and False: True where the GPA is at least 3.5, False everywhere else. Then you hand that column of trues and falses back to `df[...]`, and pandas keeps the rows where the answer was True.
It's the same thing a teacher does with a pile of scripts — go through the stack once, marking each one pass or fail, then pull out only the ones marked pass. The marking is one pass. The pulling out is another. Pandas just lets you do both in a single line.
```python
passing = df[df["score"] >= 50]
passing.head()
```

If df had a hundred rows and forty of them scored 50 or above, `passing` now has forty rows. The other sixty are not deleted — they're simply not in the new table.
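If you want to watch the two steps happen, here is a minimal sketch on a made-up four-row table (the names and scores are invented for illustration):

```python
import pandas as pd

# A tiny stand-in table, just to make the two-step filter visible.
df = pd.DataFrame({
    "name": ["Ama", "Kofi", "Esi", "Yaw"],
    "score": [62, 47, 80, 35],
})

# Step one: mark each row True or False. This is the boolean mask.
mask = df["score"] >= 50
print(mask.tolist())             # [True, False, True, False]

# Step two: hand the mask back to df[...] to keep only the True rows.
passing = df[mask]
print(passing["name"].tolist())  # ['Ama', 'Esi']
```

Printing the mask on its own is a good habit while you're learning: it makes the "mark first, pull out second" rhythm visible instead of hidden inside one line.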
Sorting — putting the forms in order
Sorting is the most boring of the three and the one I use most. Whenever a colleague hands me a messy file, my first instinct is to put it in some kind of order before I do anything else. Earliest deadline at the top. Highest GPA first. Most recent date last. It's the data equivalent of sitting on the floor with a stack of application forms and putting them in date order before you start reading.
```python
df.sort_values("deadline")
```

One thing trips everyone the first time: this does not change df. It hands you back a new table in sorted order. If you want to keep the sorted version, give it a name:
```python
df_sorted = df.sort_values("deadline")
```

I have lost more time to forgetting this than I'd like to admit.
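Here is a small sketch of that gotcha on an invented three-row table. It also shows `ascending=False`, the switch for highest-first order:

```python
import pandas as pd

# Made-up application rows, purely for illustration.
df = pd.DataFrame({
    "name": ["Esi", "Ama", "Kofi"],
    "gpa": [3.2, 3.9, 2.8],
})

# Highest GPA first. This returns a NEW table.
by_gpa = df.sort_values("gpa", ascending=False)
print(by_gpa["name"].tolist())   # ['Ama', 'Esi', 'Kofi']

# The original is untouched: df still has Esi first.
print(df["name"].tolist())       # ['Esi', 'Ama', 'Kofi']
```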
Grouping — the market trader's piles
Back to my friend at Kaneshie. When she wanted to know which colour brought in the most money, what she did physically was sort her rows into piles — one for each colour — and add up each pile. That's a groupby.
Before you even reach for groupby, there's a smaller cousin worth knowing. If you just want a quick count of how many of each kind you have, value_counts is the fastest way:
```python
df["country"].value_counts()
```

That single line tells you how many rows belong to each country, biggest count first. Nine times out of ten that's the first thing I want to see when I open a new dataset.
When you want a calculation per group — not just a count — that's where groupby earns its keep:
```python
df.groupby("country")["gpa"].mean()
```

Read it left to right: split the rows into piles by country, then for each pile take the average GPA. You get back one number per country. That's the trader's pile-and-total move, written in one line.
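To see the pile-and-total move run end to end, here is a toy version of the notebook itself (the colours and amounts are made up). Chaining `.sort_values(ascending=False)` onto the result answers her actual question, which colour brought in the most money, in one pass:

```python
import pandas as pd

# Invented sales rows standing in for the notebook.
sales = pd.DataFrame({
    "colour": ["gold", "blue", "gold", "red", "blue", "gold"],
    "cedis":  [120,    80,     90,    60,    110,    70],
})

# Split into piles by colour, total each pile, biggest pile first.
totals = sales.groupby("colour")["cedis"].sum().sort_values(ascending=False)
print(totals)
# gold    280
# blue    190
# red      60
```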
The mental model — filter, sort, group, look
The thing I most want you to take away is that real analysis is almost never one of these moves on its own. It's usually a small stack of them, applied in order, with a look at the result in between.
Say you want the five countries with the most high-GPA students. Here's the whole thing:
```python
high = df[df["gpa"] >= 3.5]
counts = high["country"].value_counts()
top5 = counts.head(5)
print(top5)
```

Three working lines, plus a print to look. Filter, then group-and-count, then trim (the sort comes free, because value_counts already puts the biggest count first). Each line does one thing. Each line is something you could explain to my friend on the veranda without using the word pandas.
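Here is the same stack made runnable with a small invented table, so you can run each line and look at what comes back before writing the next:

```python
import pandas as pd

# Made-up student rows, just so the stack runs end to end.
df = pd.DataFrame({
    "country": ["Ghana", "Ghana", "Nigeria", "Kenya", "Ghana", "Nigeria"],
    "gpa":     [3.8,     3.1,     3.6,       3.9,     3.7,     2.9],
})

high = df[df["gpa"] >= 3.5]              # filter: keep high-GPA rows
counts = high["country"].value_counts()  # count per country, biggest first
top5 = counts.head(5)                    # trim to the top five
print(top5)                              # Ghana leads with 2 high-GPA students
```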
That's the rhythm I want you to develop. Don't try to write one clever line that does everything at once. Ask yourself which of the three questions you're asking, write that one line, look at what came back, then ask the next question. The cleverness comes from knowing which question to ask, not from cramming them all together.
Take this with you
Next time you open a dataset and feel that small panic of where do I even start, slow down and ask which of the three questions you're really asking. Which rows? In what order? Split into what groups? Name the question first, in plain English, and the line of pandas will follow. The library is large, but the work it does for you most days is small, and it is almost always one of these three.