Lesson 03.01 · ~30 min read

CSVs and a friendly intro to pandas

What a CSV is. Reading one with the standard library, then with pandas. head() and column access.

By Kelvin Amoaba · Updated May 12, 2026

What we're doing today

By the end of this session, you will open a file full of rows and columns on your laptop, hand it to Python, and ask Python questions about it. You will pull out one column and find its average. You will do in two lines what would take twenty lines of for loops with only what we learned in Week 1.

This is the week the course starts feeling like real work. Sit with it.

A file full of rows and commas

The kind of file we're working with today is called a CSV. The letters stand for "comma-separated values," and that name is the whole story. A CSV is a plain text file where each line is a row and commas split the columns. Open it in Notepad — or any text editor on your laptop — and you'll see what I mean.

Here is a small one — a teacher's class roster:

name,age,class,gpa
Adwoa,19,SHS3,3.4
Kwame,20,SHS3,3.1
Yaa,18,SHS2,3.7
Kojo,19,SHS3,2.9
Akosua,20,SHS3,3.6

That is the entire file. No colours, no formulas, no bold headers. Just text.

The first line — name,age,class,gpa — is the header. Those are the column names. Every line after that is one record: one student. The commas are the walls between cells. Where Excel would draw a vertical line, a CSV uses a comma.

Open that same file in Excel or Google Sheets and you'll see a normal-looking spreadsheet — Excel reads the commas and draws the grid for you. But the file on disk is just text. That's the trick: a CSV is something a human, Excel, and Python can all read. It travels well. That is why almost every dataset you'll ever download — from the Ghana Statistical Service, the World Bank, a school's records office — comes as a CSV.

Picture a market sales book at the end of the day. Each line is one sale: what was sold, the price, who bought it. A CSV is the same book, typed out with commas between the columns.

Reading it the hard way first

Before I show you the easy tool, let me show you why we need it. You already know how to open a file in Python. You could read this CSV line by line, split each line on the commas, and build a list yourself:

with open("students.csv") as f:
    lines = f.readlines()
 
header = lines[0].strip().split(",")
for line in lines[1:]:
    cells = line.strip().split(",")
    print(cells)

It works. For five rows, fine. But the moment somebody asks "what is the average GPA in SHS3?" you'll write a loop, an if, a running total, a counter, and a division at the end. Every time, a chance you'll get the logic slightly wrong.
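To make that cost concrete, here is a rough sketch of the hand-written version for the SHS3 question. It keeps the roster in a string called ROSTER (a name made up for this sketch) so it runs without a file on disk, and it assumes no cell contains a comma of its own.

```python
# The roster from this lesson, held in a string so the sketch runs without a file.
ROSTER = """name,age,class,gpa
Adwoa,19,SHS3,3.4
Kwame,20,SHS3,3.1
Yaa,18,SHS2,3.7
Kojo,19,SHS3,2.9
Akosua,20,SHS3,3.6
"""

lines = ROSTER.splitlines()
header = lines[0].split(",")

# Find which position holds each column we care about.
class_col = header.index("class")
gpa_col = header.index("gpa")

total = 0.0   # running total of GPAs
count = 0     # how many SHS3 students we have seen

for line in lines[1:]:
    cells = line.split(",")
    if cells[class_col] == "SHS3":
        total += float(cells[gpa_col])  # cells are text, so convert first
        count += 1

shs3_average = total / count
print(round(shs3_average, 2))  # 3.25
```

Notice how many small decisions there are: which position is the GPA, converting text to a number, remembering the counter. Each one is a place to slip.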

When the same job keeps coming up, somebody eventually writes a tool that does it well. For tables of data, that tool is pandas.

What pandas is, and why everyone uses it

A library is a bundle of pre-written code somebody else built that you can borrow. Python on its own knows how to print, loop, open files. A library adds new abilities on top. You import it once at the top of your file, and from then on you can use whatever it provides.

pandas is the library most people reach for when they want to handle table-shaped data in Python. Researchers, journalists, data analysts at banks — all of them. By the end of this week, you too.

In Google Colab, pandas is already installed. On your own laptop, run this once in your terminal:

pip install pandas

pip is Python's tool for fetching libraries off the internet. Run it once per library per machine and the library is on your computer for good.
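If you want to confirm the install worked, one quick check (a small sketch, not a required step) is to import the library and print its version number:

```python
import pandas as pd

# If this line runs without an error, pandas is installed.
installed_version = pd.__version__
print(installed_version)
```

The exact number you see will depend on when you installed it.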

Loading the file in two lines

Here is the full version of what took us six lines of code before:

import pandas as pd
 
df = pd.read_csv("students.csv")

Two lines. Let me unpack both, because every piece matters.

import pandas as pd does two things. It brings the pandas library into your program, and it gives it a shorter nickname — pd — so you don't have to type the full word every time. Every pandas tutorial on Earth uses pd. Don't fight the convention; use pd and your code will look like everyone else's.

pd.read_csv("students.csv") tells pandas to open the file students.csv and read the whole thing into memory. A bare file name like this is looked up in the folder Python is running from, which is normally the folder your program or notebook is in. If the file lives somewhere else, give the full path, something like /Users/kelvin/Downloads/students.csv on a Mac.

What pandas hands back, we stored in a jar called df. The letters stand for DataFrame, which is pandas's word for a table. The simplest way to think about a DataFrame: it's a spreadsheet you can program. It has rows. It has columns with names. You can ask it questions and do maths on a whole column at once. An Excel sheet with a brain attached.

Why df? Same reason we call the library pd — convention. Every example you'll ever read calls it df. Use the same name.
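If you don't have students.csv on disk yet, here is a small sketch you can run anywhere. It assumes the roster is typed into a string (ROSTER is a name made up for this sketch) and wraps it in io.StringIO, which lets pd.read_csv treat the text as if it were a file.

```python
import io

import pandas as pd

# The roster from this lesson as plain text.
ROSTER = """name,age,class,gpa
Adwoa,19,SHS3,3.4
Kwame,20,SHS3,3.1
Yaa,18,SHS2,3.7
Kojo,19,SHS3,2.9
Akosua,20,SHS3,3.6
"""

# io.StringIO makes the string behave like an open file.
df = pd.read_csv(io.StringIO(ROSTER))

print(type(df).__name__)  # DataFrame
print(df)                 # the whole table, with the row numbers on the left
```

With a real file you would pass the file name instead, exactly as in the two-line version above.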

The first four things you check on any new file

When somebody hands you a CSV, there are four things you check before you do anything else. Treat it as a routine — like an attendance register at the start of class. You don't start teaching until you've called the names.

df.head()
df.shape
df.columns
df["gpa"]

df.head() shows the first five rows. The brackets at the end are not optional: they tell Python to actually run the thing. If you write df.head without brackets, Python doesn't run anything; it prints a description of the function that starts with <bound method. If you want more rows, put a number inside: df.head(10) gives ten. In a notebook, df.head() renders as a proper table with column headers and aligned rows.

df.shape tells you how big the file is. It hands back two numbers: rows and columns. For our students file, (5, 4) — five rows, four columns. Notice no brackets after shape — it's a fact about the table, not an action. If the file had 50,000 students, shape would tell you immediately, before you tried to print everything and froze your laptop.

df.columns lists the column names. Read them. Some will be obvious — name, age. Others will be cryptic short forms whoever made the file invented in a hurry. If you don't know what a column is, you can't answer questions about it.

df["gpa"] pulls out a single column. You pass the column's name in square brackets, with quotes, exactly as it appears in the header. The result is the entire column, top to bottom. Column names are case-sensitive: ask for GPA when the header says gpa and pandas raises a KeyError.
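Here is the four-step routine as one runnable sketch on the five-student roster. As an assumption for this sketch, the data is loaded from a string (ROSTER) rather than from students.csv, so it runs without any file.

```python
import io

import pandas as pd

ROSTER = """name,age,class,gpa
Adwoa,19,SHS3,3.4
Kwame,20,SHS3,3.1
Yaa,18,SHS2,3.7
Kojo,19,SHS3,2.9
Akosua,20,SHS3,3.6
"""
df = pd.read_csv(io.StringIO(ROSTER))

# 1. Look at the first few rows (here three instead of the default five).
first_rows = df.head(3)
print(first_rows)

# 2. How big is it? Rows first, then columns.
print(df.shape)            # (5, 4)

# 3. What are the columns called?
print(list(df.columns))    # ['name', 'age', 'class', 'gpa']

# 4. Pull out one column by its exact name.
gpa_column = df["gpa"]
print(gpa_column)
```

In a plain script you need print() around each check; in a notebook, the last line of a cell is shown automatically.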

The shift you need to make in your head

This is the most important part of today's lesson, so slow down here.

In Weeks 1 and 2, when you wanted to do something to every item in a list, you wrote a loop:

# students is a list of dictionaries, one per student,
# for example {"name": "Adwoa", "gpa": 3.4}
total = 0
for student in students:
    total = total + student["gpa"]
average = total / len(students)
print(average)

That works. It will always work. But pandas wants you to think differently. You stop saying "loop over the rows and do this thing to each one" and start saying "do this thing to the whole column." Watch:

df["gpa"].mean()

One line. The average GPA across every student in the file. No loop, no counter. You picked the column and asked it for its mean. pandas does the loop for you, fast, in code that has been polished for years.
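To convince yourself the two approaches agree, here is a sketch that computes the average both ways on the same roster. The ROSTER string and the variable names loop_average and column_average are made up for this comparison.

```python
import io

import pandas as pd

ROSTER = """name,age,class,gpa
Adwoa,19,SHS3,3.4
Kwame,20,SHS3,3.1
Yaa,18,SHS2,3.7
Kojo,19,SHS3,2.9
Akosua,20,SHS3,3.6
"""
df = pd.read_csv(io.StringIO(ROSTER))

# The Week 1 way: turn the table into a list of dictionaries and loop.
students = df.to_dict("records")
total = 0
for student in students:
    total = total + student["gpa"]
loop_average = total / len(students)

# The pandas way: ask the column directly.
column_average = df["gpa"].mean()

print(round(loop_average, 2))    # 3.34
print(round(column_average, 2))  # 3.34
```

Same answer, but the second version has far fewer places to make a mistake.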

The same shift works for other questions:

df["gpa"].max()        # highest GPA in the file
df["gpa"].min()        # lowest
df["age"].mean()       # average age
df["class"].unique()   # the distinct classes — SHS2, SHS3

Each of these would have been a loop in Week 1. Once you've made the shift from "loop over rows" to "do this to the column," the rest of the library starts making sense.
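Here is a runnable sketch of those four questions against the five-student roster, again loading from a string (ROSTER, a name made up for this sketch) instead of a file.

```python
import io

import pandas as pd

ROSTER = """name,age,class,gpa
Adwoa,19,SHS3,3.4
Kwame,20,SHS3,3.1
Yaa,18,SHS2,3.7
Kojo,19,SHS3,2.9
Akosua,20,SHS3,3.6
"""
df = pd.read_csv(io.StringIO(ROSTER))

highest_gpa = df["gpa"].max()
lowest_gpa = df["gpa"].min()
average_age = df["age"].mean()
classes = list(df["class"].unique())  # in order of first appearance

print(highest_gpa)   # 3.7
print(lowest_gpa)    # 2.9
print(average_age)   # 19.2
print(classes)       # ['SHS3', 'SHS2']
```

One detail worth knowing: unique() lists values in the order they first appear in the file, not in alphabetical order.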

A bigger example to make it real

Imagine instead of five students, the file has 800. A teacher walks up with the laptop and asks three questions:

  1. How many students are in the file?
  2. What's the average GPA?
  3. What classes are represented?

In Week 1, that would have been a serious morning of code. With pandas:

import pandas as pd
 
df = pd.read_csv("students.csv")
 
print(df.shape)             # (800, 4) — 800 students, 4 columns
print(df["gpa"].mean())     # average GPA
print(df["class"].unique()) # the distinct classes

Three answers, three lines, on a file you didn't even have to open in Excel. That is what pandas buys you.

Your exercise

Grab any CSV you can find. The course materials folder has a students.csv, or download one yourself — a list of products from a shop, a sales log, a market vendor register, anything in table shape. If you have nothing, type six rows into a text editor, save it as students.csv, and use that.

Then write a program that does five things:

  1. Imports pandas as pd.
  2. Loads the file into a DataFrame called df.
  3. Prints df.shape so you know how big it is.
  4. Prints df.columns so you know the column names.
  5. Picks one numeric column and prints its .mean().

That's it. Don't try to filter, sort, or group anything yet — that's the next lesson, where this gets really useful.

Common slip-ups

These four catch every beginner. When one of them gets you, you'll recognise it from this list and fix it in twenty seconds.

  • File not found. pd.read_csv("students.csv") looks for the file in the folder Python is running from, which is normally the folder your program or notebook is in. If pandas can't find it, the error will say FileNotFoundError. Either move the file into the right folder, or give pandas the full path: pd.read_csv("/Users/kelvin/Downloads/students.csv").
  • Wrong column name. Column names are case-sensitive and space-sensitive. df["GPA"] is not the same as df["gpa"]. df["class "] with a trailing space is not the same as df["class"]. If pandas raises a KeyError, print df.columns and copy the name exactly.
  • Forgetting the brackets. df.head prints information about the function. df.head() actually runs it. Same with .mean, .unique, anything that ends in a verb — always with brackets.
  • Mixing up the index. Every DataFrame shows an extra column of numbers on the left called the index, which by default numbers your rows from 0. It's not one of your data columns. df["name"] gives you the names; df[0] raises a KeyError, because there is no column literally named 0. More on the index next lesson.
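If you'd like to see these errors safely before they surprise you, here is a small sketch. The messy header (a capitalised GPA and a trailing space after class) and the file name no_such_file_here.csv are invented for the demonstration, and the clean-up step is one possible fix, not the only one.

```python
import io

import pandas as pd

# A made-up file with a messy header: "class " has a trailing space, "GPA" is capitalised.
MESSY = """name,age,class ,GPA
Adwoa,19,SHS3,3.4
Kwame,20,SHS3,3.1
"""
df = pd.read_csv(io.StringIO(MESSY))

# Wrong column name: the header says GPA, so "gpa" is not found.
try:
    df["gpa"]
except KeyError as err:
    print("KeyError:", err)

# Mixing up the index: 0 is a row label, not a column name.
try:
    df[0]
except KeyError as err:
    print("KeyError:", err)

# Printing the names as a list makes the stray space visible.
messy_names = list(df.columns)
print(messy_names)  # ['name', 'age', 'class ', 'GPA']

# One possible clean-up: strip spaces and lower-case every column name.
df.columns = df.columns.str.strip().str.lower()
clean_names = list(df.columns)
print(clean_names)  # ['name', 'age', 'class', 'gpa']

# File not found: the name below is deliberately one that does not exist.
try:
    pd.read_csv("no_such_file_here.csv")
except FileNotFoundError as err:
    print("FileNotFoundError:", err)
```

Reading the error's name first (KeyError, FileNotFoundError) usually tells you which of the four slip-ups you've hit.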

That is the whole session. Load a file, check its shape, read a column, find an average. Bring your code to the next session and we'll start asking the file harder questions.
