Introducing pandas

Last updated on 2025-07-03 | Edit this page

Overview

Questions

What data will we be working with in this lesson?
What is pandas?
Why use pandas for data analysis?
How do we read and write data using pandas?

Objectives

Learn about the dataset we’ll be working with
Look at the benefits of using pandas to analyze data
Import data from a CSV into a pandas dataframe
Learn how pandas handles different types of data
Write a dataframe to a CSV

Dataset description

The dataset description was taken from Data Management with SQL for Ecologists (CC-BY-4.0)

The data we will be using is a time-series for a small mammal community in southern Arizona. This is part of a project studying the effects of rodents and ants on the plant community that has been running for almost 40 years. The rodents are sampled on a series of 24 plots, with different experimental manipulations controlling which rodents are allowed to access which plots. Each row represents a single observation of an animal, including its species, weight, and hindfoot length.

This is a real dataset that has been used in over 100 publications. We’ve simplified it for the workshop, but you can download the full dataset and work with it using exactly the same tools we’ll learn about today.

Answering questions using data

Let’s look at some of the cleaned spreadsheets we downloaded during Setup to complete this challenge. Over the course of this lesson, we’ll be working with the following three files:

surveys.csv
species.csv
plots.csv

Discussion

Open each of these csv files and explore them. What information is contained in each file? Specifically, if we had the following research questions:

How has the hindfoot length and weight of Dipodomys species changed over time?
What is the average weight of each species, per year?
What information can I learn about Dipodomys species in the 2000s, over time?

What would we need to answer these questions? Which files have the data we need? What operations would we need to perform if we were doing these analyses by hand?

Hint: We can view CSV files by clicking on them in the left sidebar.

In order to answer the questions described above, we’ll need to do the following basic data operations:

select subsets of the data (rows and columns)
group subsets of data
do math and other calculations
combine data across spreadsheets
plot data to identify patterns

In addition, we don’t want to do this manually! Instead of searching for the right pieces of data ourselves, or clicking between spreadsheets, or manually sorting columns, we want to make the computer do the work.

In particular, we want to use a tool where it’s easy to repeat our analysis in case our data changes. We also want to do all this searching without actually modifying our source data.

Why use pandas?

The Python Data Analysis Library, or pandas, is a Python package used to work with dataframes. A dataframe is a representation of tabular data similar to a well-organized spreadsheet, consisting of rows (representing records) and columns (representing fields or variables). Tables are a very common format for representing scientific data and should be very familiar to anyone taking this course.

Pandas offers the same advantages as any well-written package: It creates a common codebase for working on a single task, in this case, analyzing data. Some benefits of this approach include:

Reliability: Provides flexible, well-tested methods for reading, querying, aggregating, grouping, and plotting data
Repeatability: Repeat the same analyses when data is added or changed
Speed: Faster in many cases than coding our own functions in Python
Reproducibility: Document and share code in narrative form using tools like Jupyter notebooks
Community: Access a large, active community for help when we run into problems

Importing data using pandas

Unlike the modules in the Python Standard Library discussed in the previous lesson, pandas is not part of the typical Python installation. Once it has been installed, however, it can be accessed using the same import command used to import built-in modules. By convention, pandas is imported using the alias “pd”. We can assign the alias using the as keyword:

PYTHON

import pandas as pd

Using aliases

Why use an alias? We will refer to pandas many, many times when writing a script, so it’s useful to abbreviate the name and save some keystrokes. But it’s also a matter of consistency with the larger community. Many of the core scientific Python packages, including pandas, recommend a specific alias, so most code shared online will use those aliases as well.

Now that pandas has been imported, we can access the function we need to load data from a CSV, pd.read_csv(). The function call has three parts:

The name (or in this case, alias) of the object that defines the function. This can be a module, package, or any other object. It can also be omitted in some cases (for example, when using a function built into Python).
The name of the method we’d like to use
A set of parentheses that tells the function to run.

Many functions include parameters that allow the user to modify the behavior of the function. Parameters may be positional or named. In Python, data passed to positional parameters are called arguments (often abbreviated as args), and data passed to named parameters are called keyword arguments (often abbreviated as kwargs). In either case, the arguments are included inside the parentheses used to call the function.

Below, we will pass a single argument to pd.read_csv(): A string that tells the function where to find the surveys.csv file.

PYTHON

pd.read_csv("data/surveys.csv")

OUTPUT

	record_id	month	day	year	plot_id	species_id	sex	hindfoot_length	weight
0	1	7	16	1977	2	NL	M	32.0	NaN
1	2	7	16	1977	3	NL	M	33.0	NaN
2	3	7	16	1977	2	DM	F	37.0	NaN
3	4	7	16	1977	7	DM	M	36.0	NaN
4	5	7	16	1977	3	DM	M	35.0	NaN
…	…	…	…	…	…	…	…	…	…
35544	35545	12	31	2002	15	AH	NaN	NaN	NaN
35545	35546	12	31	2002	15	AH	NaN	NaN	NaN
35546	35547	12	31	2002	10	RM	F	15.0	14.0
35547	35548	12	31	2002	7	DO	M	36.0	51.0
35548	35549	12	31	2002	5	NaN	NaN	NaN	NaN

35549 rows × 9 columns

Here are a few things to observe about how the dataframe is structured:

By default, the notebook displays the first and last five rows of the dataframe
Each row represents a record
Each column represents a field
The unlabeled column on the far left is called the row label
pandas has done a lot of work behind the scenes when reading the data, including:
- Assigning the row index as the row label
- Assigning each column a data type based on its contents
- Assigning certain cells the value NaN, which stands for “not a number” and is used to designate null values in the dataset. Here, those cells represent blank cells in the spreadsheet.

Much of this behavior can be controlled when the spreadsheet is first read by using keyword arguments. For example, to force pd.read_csv() to use the existing record_id column as the row label, use the index_col keyword argument:

PYTHON

pd.read_csv("data/surveys.csv", index_col="record_id")

OUTPUT

	month	day	year	plot_id	species_id	sex	hindfoot_length	weight
record_id
1	7	16	1977	2	NL	M	32.0	NaN
2	7	16	1977	3	NL	M	33.0	NaN
3	7	16	1977	2	DM	F	37.0	NaN
4	7	16	1977	7	DM	M	36.0	NaN
5	7	16	1977	3	DM	M	35.0	NaN
…	…	…	…	…	…	…	…	…
35545	12	31	2002	15	AH	NaN	NaN	NaN
35546	12	31	2002	15	AH	NaN	NaN	NaN
35547	12	31	2002	10	RM	F	15.0	14.0
35548	12	31	2002	7	DO	M	36.0	51.0
35549	12	31	2002	5	NaN	NaN	NaN	NaN

35549 rows × 8 columns

Now record_id appears on the far left, indicating that pandas is using that column as the row label.

Challenge

Run help() on the pd.read_csv() method (note that you should omit the trailing parentheses). Alternatively, take a look at the much prettier documentation provided by the pandas developers. Based on that information, answer the following questions:

What parameter would you use to control how null values are interpreted?
What values are interpreted as NaN by default?

Show me the solution to challenge 1

na_values

Show me the solution to challenge 2

In addition to empty cells, the following values are interpreted as NaN, or null, by pandas. These strings may look familiar from programs like Excel.

#N/A
#N/A N/A
#NA
-1.#IND
-1.#QNAN
-NaN
-nan
1.#IND
1.#QNAN
<NA>
N/A
NA
NULL
NaN
n/a
nan
null

Assigning to variables

We can assign the dataframe to a variable so we don’t need to load it every time we want to access it. As we saw in the previous lesson, we use a single equals sign to assign an object to a variable. The variable name should be short and descriptive. By convention, variable names in Python use snake_case (that is, lower case with individual words separated by underscores).

PYTHON

surveys = pd.read_csv("data/surveys.csv")
surveys

OUTPUT

	record_id	month	day	year	plot_id	species_id	sex	hindfoot_length	weight
0	1	7	16	1977	2	NL	M	32.0	NaN
1	2	7	16	1977	3	NL	M	33.0	NaN
2	3	7	16	1977	2	DM	F	37.0	NaN
3	4	7	16	1977	7	DM	M	36.0	NaN
4	5	7	16	1977	3	DM	M	35.0	NaN
…	…	…	…	…	…	…	…	…	…
35544	35545	12	31	2002	15	AH	NaN	NaN	NaN
35545	35546	12	31	2002	15	AH	NaN	NaN	NaN
35546	35547	12	31	2002	10	RM	F	15.0	14.0
35547	35548	12	31	2002	7	DO	M	36.0	51.0
35548	35549	12	31	2002	5	NaN	NaN	NaN	NaN

35549 rows × 9 columns

Understanding data types in pandas

We can use the info() method to see how pandas interpreted each column in the dataset. This method gives use the name, count, and data type of each column and provides some information about the dataset as a whole (for example, memory usage, which is helpful to know when working with large datasets):

PYTHON

surveys.info()

OUTPUT

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35549 entries, 0 to 35548
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   record_id        35549 non-null  int64
 1   month            35549 non-null  int64
 2   day              35549 non-null  int64
 3   year             35549 non-null  int64
 4   plot_id          35549 non-null  int64
 5   species_id       34786 non-null  object
 6   sex              33038 non-null  object
 7   hindfoot_length  31438 non-null  float64
 8   weight           32283 non-null  float64
dtypes: float64(2), int64(5), object(2)
memory usage: 2.4+ MB

Note that the data types used by pandas look a little different than the ones used by Python itself. See the table below for the rough equivalents.

Data types in pandas

Data Type	Description	Similar To
object	Character string or mixed	str
int64	Integer numerical	int
float64	Approximate numerical	float
bool	Stores True or False values	bool
datetime64	Stores date and time values	datetime.datetime

Saving a dataframe

When analyzing a dataset, we’ll often want to save our work to a file. The to_csv() method can be used to write a dataframe to a CSV, which is a text file which places each row on a separate line and separates individual cells with commas. We will include a positional argument with the path to which we want to save the file. We will also include the index keyword argument. Setting that parameter to False tells pandas not to include the row label when writing the CSV.

PYTHON

surveys.to_csv("data/surveys_mod.csv", index=False)

This is far from the only option for writing data—pandas supports a variety of file types for both reading and writing. Try searching for “read_” in the pandas API reference to see other supported formats.

Key Points

This lesson uses real data from a decades-long survey of rodents in Arizona
pandas is a data analysis package that allows users to read, manipulate, and view tabular data using Python
pandas represents data as a dataframe consisting of rows (records) and columns (fields or variables)
We can read a dataframe from CSV using the pd.read_csv() function and write a dataframe to CSV using the to_csv() method
The behavior of a function can be modified by including arguments and keyword arguments when the function is called
pandas uses its own classes to represent text, numbers, booleans, and datetimes