Introducing pandas
Last updated on 2025-07-03
Overview
Questions
- What data will we be working with in this lesson?
- What is pandas?
- Why use pandas for data analysis?
- How do we read and write data using pandas?
Objectives
- Learn about the dataset we’ll be working with
- Look at the benefits of using pandas to analyze data
- Import data from a CSV into a pandas dataframe
- Learn how pandas handles different types of data
- Write a dataframe to a CSV
Dataset description
The dataset description was taken from Data Management with SQL for Ecologists (CC-BY-4.0)
The data we will be using is a time-series for a small mammal community in southern Arizona. This is part of a project studying the effects of rodents and ants on the plant community that has been running for almost 40 years. The rodents are sampled on a series of 24 plots, with different experimental manipulations controlling which rodents are allowed to access which plots. Each row represents a single observation of an animal, including its species, weight, and hindfoot length.
This is a real dataset that has been used in over 100 publications. We’ve simplified it for the workshop, but you can download the full dataset and work with it using exactly the same tools we’ll learn about today.
Answering questions using data
Let’s look at some of the cleaned spreadsheets we downloaded during Setup to complete this challenge. Over the course of this lesson, we’ll be working with the following three files:
- surveys.csv
- species.csv
- plots.csv
Discussion
Open each of these CSV files and explore them. What information is contained in each file? Specifically, if we had the following research questions:
- How has the hindfoot length and weight of Dipodomys species changed over time?
- What is the average weight of each species, per year?
- What information can I learn about Dipodomys species in the 2000s, over time?
What would we need to answer these questions? Which files have the data we need? What operations would we need to perform if we were doing these analyses by hand?
Hint: We can view CSV files by clicking on them in the left sidebar.
In order to answer the questions described above, we’ll need to do the following basic data operations:
- select subsets of the data (rows and columns)
- group subsets of data
- do math and other calculations
- combine data across spreadsheets
- plot data to identify patterns
In addition, we don’t want to do this manually! Instead of searching for the right pieces of data ourselves, or clicking between spreadsheets, or manually sorting columns, we want to make the computer do the work.
In particular, we want to use a tool where it’s easy to repeat our analysis in case our data changes. We also want to do all this searching without actually modifying our source data.
Why use pandas?
The Python Data Analysis Library, or pandas, is a Python package used to work with dataframes. A dataframe is a representation of tabular data similar to a well-organized spreadsheet, consisting of rows (representing records) and columns (representing fields or variables). Tables are a very common format for representing scientific data and should be very familiar to anyone taking this course.
Pandas offers the same advantages as any well-written package: It creates a common codebase for working on a single task, in this case, analyzing data. Some benefits of this approach include:
- Reliability: Provides flexible, well-tested methods for reading, querying, aggregating, grouping, and plotting data
- Repeatability: Repeat the same analyses when data is added or changed
- Speed: Faster in many cases than coding our own functions in Python
- Reproducibility: Document and share code in narrative form using tools like Jupyter notebooks
- Community: Access a large, active community for help when we run into problems
Importing data using pandas
Unlike the modules in the Python Standard Library discussed in the previous lesson, pandas is not part of the typical Python installation. Once it has been installed, however, it can be accessed using the same import command used to import built-in modules. By convention, pandas is imported using the alias "pd". We can assign the alias using the as keyword:
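In code, the conventional import looks like this:

```python
# Import pandas using its conventional alias, "pd"
import pandas as pd
```
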
Using aliases
Why use an alias? We will refer to pandas many, many times when writing a script, so it’s useful to abbreviate the name and save some keystrokes. But it’s also a matter of consistency with the larger community. Many of the core scientific Python packages, including pandas, recommend a specific alias, so most code shared online will use those aliases as well.
Now that pandas has been imported, we can access the function we need to load data from a CSV, pd.read_csv(). The function call has three parts:
- The name (or in this case, alias) of the object that defines the function. This can be a module, package, or any other object. It can also be omitted in some cases (for example, when using a function built into Python).
- The name of the method we’d like to use
- A set of parentheses that tells the function to run.
Many functions include parameters that allow the user to modify the behavior of the function. Parameters may be positional or named. In Python, data passed to positional parameters are called arguments (often abbreviated as args), and data passed to named parameters are called keyword arguments (often abbreviated as kwargs). In either case, the arguments are included inside the parentheses used to call the function.
Below, we will pass a single argument to pd.read_csv(): a string that tells the function where to find the surveys.csv file.
OUTPUT
 | record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight
---|---|---|---|---|---|---|---|---|---
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN
… | … | … | … | … | … | … | … | … | …
35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN
35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN
35549 rows × 9 columns
Here are a few things to observe about how the dataframe is structured:
- By default, the notebook displays the first and last five rows of the dataframe
- Each row represents a record
- Each column represents a field
- The unlabeled column on the far left is called the row label
- pandas has done a lot of work behind the scenes when reading the data, including:
  - Assigning the row index as the row label
  - Assigning each column a data type based on its contents
  - Assigning certain cells the value NaN, which stands for “not a number” and is used to designate null values in the dataset. Here, those cells represent blank cells in the spreadsheet.
Much of this behavior can be controlled when the spreadsheet is first read by using keyword arguments. For example, to force pd.read_csv() to use the existing record_id column as the row label, use the index_col keyword argument:
OUTPUT
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight
---|---|---|---|---|---|---|---|---
1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN
2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN
3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN
4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN
5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN
… | … | … | … | … | … | … | … | …
35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN
35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN
35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0
35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0
35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN
35549 rows × 8 columns
Now record_id appears on the far left, indicating that pandas is using that column as the row label.
Challenge
Run help() on the pd.read_csv() function (note that you should omit the trailing parentheses). Alternatively, take a look at the much prettier documentation provided by the pandas developers. Based on that information, answer the following questions:
- What parameter would you use to control how null values are interpreted?
- What values are interpreted as NaN by default?
The na_values keyword argument controls how null values are interpreted when reading a file.

In addition to empty cells, the following values are interpreted as NaN, or null, by pandas by default. These strings may look familiar from programs like Excel:
- #N/A
- #N/A N/A
- #NA
- -1.#IND
- -1.#QNAN
- -NaN
- -nan
- 1.#IND
- 1.#QNAN
- <NA>
- N/A
- NA
- NULL
- NaN
- n/a
- nan
- null
Assigning to variables
We can assign the dataframe to a variable so we don’t need to load it every time we want to access it. As we saw in the previous lesson, we use a single equals sign to assign an object to a variable. The variable name should be short and descriptive. By convention, variable names in Python use snake_case (that is, lower case with individual words separated by underscores).
OUTPUT
 | record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight
---|---|---|---|---|---|---|---|---|---
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN
… | … | … | … | … | … | … | … | … | …
35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN
35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN
35549 rows × 9 columns
Understanding data types in pandas
We can use the info() method to see how pandas interpreted each column in the dataset. This method gives us the name, count, and data type of each column and provides some information about the dataset as a whole (for example, memory usage, which is helpful to know when working with large datasets):
OUTPUT
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35549 entries, 0 to 35548
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 record_id 35549 non-null int64
1 month 35549 non-null int64
2 day 35549 non-null int64
3 year 35549 non-null int64
4 plot_id 35549 non-null int64
5 species_id 34786 non-null object
6 sex 33038 non-null object
7 hindfoot_length 31438 non-null float64
8 weight 32283 non-null float64
dtypes: float64(2), int64(5), object(2)
memory usage: 2.4+ MB
Note that the data types used by pandas look a little different from the ones used by Python itself. See the table below for the rough equivalents.

pandas data type | Python data type
---|---
object | str
int64 | int
float64 | float
bool | bool
datetime64 | datetime.datetime
Saving a dataframe
When analyzing a dataset, we’ll often want to save our work to a file. The to_csv() method can be used to write a dataframe to a CSV, which is a text file that places each row on a separate line and separates individual cells with commas. We will include a positional argument with the path to which we want to save the file. We will also include the index keyword argument. Setting that parameter to False tells pandas not to include the row label when writing the CSV.
This is far from the only option for writing data—pandas supports a variety of file types for both reading and writing. Try searching for “read_” in the pandas API reference to see other supported formats.
Key Points
- This lesson uses real data from a decades-long survey of rodents in Arizona
- pandas is a data analysis package that allows users to read, manipulate, and view tabular data using Python
- pandas represents data as a dataframe consisting of rows (records) and columns (fields or variables)
- We can read a dataframe from a CSV using the pd.read_csv() function and write a dataframe to a CSV using the to_csv() method
- The behavior of a function can be modified by including arguments and keyword arguments when the function is called
- pandas uses its own classes to represent text, numbers, booleans, and datetimes