Lesson 03.02 · ~28 min read

Filtering, sorting, grouping

Boolean filtering, sort_values, value_counts, mean, and a light touch of groupby.

By Kelvin Amoaba · Updated May 12, 2026 · Prerequisite: 03.01

What we're doing today

By the end of this session, you will take a table of data, pull out only the rows you care about, put those rows in a useful order, and then count or average them by category. Three moves: filter, sort, group. They are the three questions you will end up asking every dataset you ever touch — which rows matter, what order should they be in, and what does each category look like?

We will keep the examples simple — a small CSV of products from a corner shop. The same moves work for any table: students in a class, patients in a clinic, transactions in a bank statement.

Setting up the table

We will use pandas, the library Python people reach for whenever they have a table of data. Think of pandas as a spreadsheet you drive with code instead of with the mouse. A table in pandas is called a DataFrame — the technical word for "a table with named columns." Whenever I say DataFrame, picture a spreadsheet.

Load it like this at the top of your notebook:

import pandas as pd
df = pd.read_csv("products.csv")

The variable df now holds our table. Imagine it looks something like this:

   name        category    price   stock   region
0  Milo        drinks      35.0    120     Accra
1  Indomie     food         5.0    400     Kumasi
2  Sprite      drinks      10.0     80     Accra
3  Rice        food        90.0     50     Tamale
4  Pringles    snacks      45.0     30     Accra

Five rows, five columns. We will keep coming back to this picture.

Filtering — pulling out the rows you want

Imagine a teacher with a stack of fifty exam scripts who only wants to look at the ones that passed. She flips through, glances at the score, drops each script into the "keep" pile or puts it aside. By the end she has a smaller stack — only the passing ones.

That is filtering. In pandas, you do it with one line that looks strange the first time you see it:

expensive = df[df["price"] >= 50]

Let me unpack that, because every piece matters.

The inner part — df["price"] >= 50 — does not give you back a smaller table. It gives you a column of True and False values, one for each row in df. The technical name for this is a boolean mask. That just means a column of True/False values pandas uses to decide which rows to keep. For our five-row table, that mask would look like this:

0    False    (Milo, 35.0 — not >= 50)
1    False    (Indomie, 5.0)
2    False    (Sprite, 10.0)
3     True    (Rice, 90.0)
4    False    (Pringles, 45.0)

Now look at the outer part — df[ ... ]. When you hand a boolean mask back to df, pandas walks down the list, keeps every row where the mask says True, and drops every row where it says False. So you get back a new table with only the rows where the price was at least 50. In our case, just Rice.
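
If the one-liner feels cramped, you can build the mask first, store it under its own name, and hand it over in a second step. The name is_expensive below is just a label I picked; pandas does not care what you call it:

is_expensive = df["price"] >= 50    # the True/False column, one value per row
expensive = df[is_expensive]        # keep only the rows marked True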

That is the whole trick. Once you see filtering as "build a True/False column, hand it back to df," every filter you ever write is the same shape:

df[df["category"] == "drinks"]      # only drinks
df[df["stock"] < 100]               # things running low
df[df["region"] == "Accra"]         # only Accra

Notice that == (two equals signs) is how you ask "is this equal to that?" A single = means "store this in that label" — different job.
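
A tiny illustration of the two jobs side by side (cutoff is just a name I made up):

cutoff = 30                          # single = : store 30 under the label cutoff
cheap = df[df["price"] <= cutoff]    # <=, ==, and friends: ask a question about each row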

Combining conditions — and, or

Sometimes one rule is not enough. You want drinks that are also cheap. You want things in Accra or Kumasi. Two rules, joined.

In pandas, and is written & and or is written | (the straight vertical bar, usually Shift + Backslash). The rule that catches everyone: wrap each condition in its own pair of round brackets. Without them, Python reads & and | before the comparisons around them, so the line comes apart in the wrong order, and the error message you get is not friendly.

# drinks AND cheap
df[(df["category"] == "drinks") & (df["price"] < 20)]
 
# Accra OR Kumasi
df[(df["region"] == "Accra") | (df["region"] == "Kumasi")]

If you have a long list of values to match against, .isin() is cleaner than chaining ten |s. You hand it a Python list:

target_regions = ["Accra", "Kumasi", "Tamale"]
df[df["region"].isin(target_regions)]

Reads almost like English: keep the rows where region is in this list.

Sorting — putting rows in order

A market trader at the end of the day lays out her receipts on the counter and shuffles them — sometimes by time, sometimes by amount, sometimes by product. The receipts do not change; only the order does.

That is sorting. In pandas you use sort_values:

df.sort_values("price")                          # cheapest first
df.sort_values("price", ascending=False)         # most expensive first
df.sort_values("name")                           # A to Z by name

Now here is the thing that trips up almost everyone the first time. sort_values returns a brand new DataFrame. It does not change the original df. Watch:

df.sort_values("price")
df.head()    # still in the original order — wait, what?

The first line did produce a sorted table, but only as a new copy that pandas handed back. We never caught that copy in a name, so it disappeared. The original df was never touched. If you want to keep the sorted version, store it in a label, the same way you store anything else:

df_sorted = df.sort_values("price")
df_sorted.head()

Now df_sorted is your sorted table. df is still the original. Both exist.
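
If you decide you only ever want the sorted order, you can also store the result back under the same name. From that line on, df simply points at the new sorted table:

df = df.sort_values("price")    # replace df with the sorted copy
df.head()                       # now cheapest first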

You can sort by more than one column at a time. Useful when the first column has ties — say two products both cost 35 and you want a tiebreaker by name:

df.sort_values(["price", "name"])

Pandas sorts by price first, and where prices match, it falls back to sorting by name.
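
You can also give each column its own direction by handing ascending a list with one entry per column. A small sketch: most expensive first, then A to Z within each price tie:

df.sort_values(["price", "name"], ascending=[False, True])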

Counting categories

Before the heavy machinery, a quick win. You will constantly want to ask how many of each kind do I have? — how many drinks, how many snacks, how many in each region. The fastest way is value_counts:

df["category"].value_counts()

That hands you back something like:

food       2
drinks     2
snacks     1

Sorted most common to least. The fastest way to get a feel for any column that holds categories.
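
If you would rather see shares than raw counts, value_counts also takes normalize=True and hands back fractions of the whole instead:

df["category"].value_counts(normalize=True)    # 0.4 food, 0.4 drinks, 0.2 snacks for our table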

Groupby — split, calculate, combine

This is where pandas really starts paying for itself.

Picture the same market trader at the end of the day with a pile of sales slips. She wants to know how much she made on drinks, on food, on snacks. So she sorts the slips into three small stacks — one per category — and adds up each stack on its own.

That is exactly what groupby does. groupby means splitting the table into stacks based on a column, then doing one calculation per stack. The classic shape:

df.groupby("category")["price"].mean()

Read it left to right: group the rows by category, then for each category, take the mean of price. You get back one row per category, with the average price for that category:

category
drinks    22.5
food      47.5
snacks    45.0

mean() is the average. Swap in other calculations the same way: .sum() for the total, .min() for the smallest, .max() for the largest, .count() for how many rows were in the stack.
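
For example, the same shape answers "how much stock does each category hold in total?" Only the column and the calculation change. With our five-row table:

df.groupby("category")["stock"].sum()
# drinks    200
# food      450
# snacks     30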

Want more than one calculation at once? Hand a list to .agg():

df.groupby("category")["price"].agg(["mean", "min", "max", "count"])

You get a small table — one row per category, four columns of statistics. That single line replaces a frightening amount of by-hand counting.

You can group by two columns at once. That gives you one row for every combination:

df.groupby(["region", "category"]).size()

.size() counts the rows in each stack. You get back Accra-drinks: 2, Accra-snacks: 1, Kumasi-food: 1, and so on. Useful when you want to see where things cluster.
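
One small tidy-up worth knowing: a groupby result comes back with the grouping columns as its index, not as ordinary columns. If you want a flat table again, say to filter or sort it further, chain .reset_index(). The column name "count" below is just what I chose to call the counts:

counts = df.groupby(["region", "category"]).size().reset_index(name="count")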

Stacking the three moves

A real analysis is almost always filter, then group, then sort. Build it up one step at a time, with a clear name for each piece. Do not try to write one clever line that does everything — you will not read it tomorrow.

Worked example. The question: which regions hold the most drink stock?

drinks_only = df[df["category"] == "drinks"]
by_region = drinks_only.groupby("region")["stock"].sum()
ranked = by_region.sort_values(ascending=False)
print(ranked)

Three steps. Each one readable on its own. Each one stored in a label so you can print it and check it before moving on. That is the pattern.

Your exercise

Use the products.csv from above — or better, a small table from something you actually care about (your phone bills, a list of books you have read, attendance at a club you run). Answer these four questions in pandas. Save your code.

  1. How many products in the table are priced above 30? (Single number — use len(...) on the filtered table, or chain .shape[0].)
  2. What is the average price across all rows? Then: does the average differ between categories?
  3. Which two regions hold the most stock overall? Show the region name and the total stock.
  4. Pick one category. List its products, sorted by price from cheapest to most expensive. Print at least the name, region, and price.

For each question, write the pandas code and then write one sentence in plain English describing what the result tells you. The sentence is the point — the code is only how you got there.

Common slip-ups

These are the same mistakes everyone makes the first week. When you hit one, recognise it from this list and fix it in ten seconds.

  • Forgetting parentheses around each condition. df[df["a"] == 1 & df["b"] == 2] fails with a confusing error. Write df[(df["a"] == 1) & (df["b"] == 2)]. One pair of round brackets around each side of the & or |.
  • Writing and / or instead of & / |. The plain English words look right but do not work on pandas columns. Use the symbols.
  • Ignoring the return value of sort_values. It does not change df. It hands you a new sorted table and walks away. Store the return value: df_sorted = df.sort_values(...).
  • Grouping on a messy column. If your column has "Accra", "accra", and " Accra " mixed in, pandas treats them as three different regions and your counts are wrong. Run df["region"].unique() first to see what is in there.
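
For that last slip-up, a quick clean-up before you group usually sorts it out: trim the stray spaces and settle on one capitalisation. A sketch, to adapt to however your own column is messy:

df["region"] = df["region"].str.strip().str.title()    # " accra " and "ACCRA" both become "Accra"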

Filter, sort, group. Three moves. Run the exercise and bring your one-sentence findings to the next session.
