Accessing and Filtering Data
Last updated on 2025-07-15 | Edit this page
Overview
Questions
- How can we look at individual rows and columns in a dataframe?
- How can we look at subsets of the dataset?
Objectives
- Access individual rows and columns
- Access multiple columns at once using a list
- Filter the dataframe based on the data it contains
- Sort the dataframe
In the previous lesson, we saw how to load a dataframe from a file. Now we’ll look at how to access the data within that dataframe. We’ll begin by importing pandas and reading our CSV, as we did in the previous lesson:
OUTPUT
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN |
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN |
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
… | … | … | … | … | … | … | … | … | … |
35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 |
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN |
35549 rows × 9 columns
Getting a column
We can get a single column from the dataframe using square brackets.
Square brackets are used in Python to access objects inside a container
like a list
, dict
, or DataFrame
.
To get a column, we pass the name of the column inside a set brackets
appended to the dataframe. For example, tp retrieve the year, use:
OUTPUT
0 1977
1 1977
2 1977
3 1977
4 1977
...
35544 2002
35545 2002
35546 2002
35547 2002
35548 2002
Name: year, Length: 35549, dtype: int64
A single column is returned as a Series
, which is why
the output for this cell is formatted differently than the other cells
in this lesson. The main difference between a Series
and a
DataFrame
is that a DataFrame
can have
multiple columns. The two classes share many but not all attributes and
methods.
Note also that this is a copy of data from the original dataframe. Changing values in the series will have no effect on the original dataframe. Most operations on a DataFrame or Series return a copy of the original object.
Getting unique values
We can get the list of unique values within a column using the
unique()
method:
OUTPUT
array(['NL', 'DM', 'PF', 'PE', 'DS', 'PP', 'SH', 'OT', 'DO', 'OX', 'SS',
'OL', 'RM', nan, 'SA', 'PM', 'AH', 'DX', 'AB', 'CB', 'CM', 'CQ',
'RF', 'PC', 'PG', 'PH', 'PU', 'CV', 'UR', 'UP', 'ZL', 'UL', 'CS',
'SC', 'BA', 'SF', 'RO', 'AS', 'SO', 'PI', 'ST', 'CU', 'SU', 'RX',
'PB', 'PL', 'PX', 'CT', 'US'], dtype=object)
Getting multiple columns
It is also possible to retrieve more than one column at a time. To do
some, we’ll use the built-in list
data type.
Python uses list
to store sequences, that is, ordered
lists of objects. Any type of Python object can be stored in a list:
strings, integers, floats, even other lists or collections. Once
created, a list can be modified by appending, inserting, or deleting
items. We will not be going into detail about these operations here, but
as always you can learn more about how to use a list using
help()
or the Python
docs.
You can create a list using square brackets. Let’s create a list of the three columns in our dataframe that together give the date of an observation:
When we pass this list to the survey dataframe using square brackets, as we did above, we get a copy of the dataframe containing just those columns. Note that, because we asked for more than one column, pandas returns a dataframe:
OUTPUT
year | month | day | |
---|---|---|---|
0 | 1977 | 7 | 16 |
1 | 1977 | 7 | 16 |
2 | 1977 | 7 | 16 |
3 | 1977 | 7 | 16 |
4 | 1977 | 7 | 16 |
… | … | … | … |
35544 | 2002 | 12 | 31 |
35545 | 2002 | 12 | 31 |
35546 | 2002 | 12 | 31 |
35547 | 2002 | 12 | 31 |
35548 | 2002 | 12 | 31 |
35549 rows × 3 columns
Suppose we want to get the unique values for multiple columns. The
unique()
method only works on a Series
, that
is, a single column. Instead, we can use the
drop_duplicates()
method on a copy of the dataframe with
the columns we’re interested in. Like any well-named method,
drop_duplicates()
does exactly what the name implies: It
returns a copy of the dataframe with all duplicate rows removed.
OUTPUT
plot_id | species_id | |
---|---|---|
0 | 2 | NL |
1 | 3 | NL |
2 | 2 | DM |
3 | 7 | DM |
4 | 3 | DM |
… | … | … |
35474 | 22 | SF |
35511 | 11 | US |
35519 | 9 | SF |
35527 | 13 | US |
35543 | 15 | US |
612 rows × 2 columns
Getting one or more rows
Pandas provides a variety of ways to view rows within a dataframe. We
can get the rows at the beginning of the dataframe using
head()
:
OUTPUT
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN |
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN |
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
By default, this methods returns the first five rows. We can provide a number inside the parentheses if we want to view a different number of rows:
OUTPUT
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN |
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN |
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
5 | 6 | 7 | 16 | 1977 | 1 | PF | M | 14.0 | NaN |
6 | 7 | 7 | 16 | 1977 | 2 | PE | F | NaN | NaN |
7 | 8 | 7 | 16 | 1977 | 1 | DM | M | 37.0 | NaN |
8 | 9 | 7 | 16 | 1977 | 1 | DM | F | 34.0 | NaN |
9 | 10 | 7 | 16 | 1977 | 6 | PF | F | 20.0 | NaN |
The tail()
method is similar, except it returns rows
from the end of the table:
OUTPUT
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 |
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN |
Or we can use sample()
to return a random row from
anywhere in the dataframe:
OUTPUT
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
25936 | 25937 | 5 | 11 | 1997 | 15 | OT | F | 20.0 | 25.0 |
If you’re following along, you may notice that the output of this
method on your screen differs from what’s shown here. That’s exactly
what we’d expect to see. Remember, sample()
is returnning a
random row—it would be more surprising if the outputs were the same!
Slicing the dataframe
The head()
, tail()
, and
sample()
methods are useful for getting a feel for how our
data is structured, but we may also want to look at specific rows. One
way to do so is to extract rows based on where they appear in the
dataframe. We can use square brackets to extract these slices.
A slice is a subset of the dataframe starting at one row and ending at
another. To get a slice, we pass the starting and ending indexes to the
square brackets as start:end
:
OUTPUT
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
There are three things to be aware of when slicing a dataframe:
- Row indexes are zero-based. The first row has an index of
0, not
- When slicing, the slice includes start but not the end index. In this case, that means the slice includes rows 2, 3, and 4 but not 5.
- The row label can be different from the row index. They happen to be the same here, but don’t count on that being true.
Core Python types like list
and tuple
use
the same conventions, as do most Python pacakges that work with
sequences.
Filtering data
It is often more useful to subset a dataframe based on the data itself. Pandas provides a variety of ways to filter a dataframe in this way. For example, suppose we want to look at a specific species in the surveys dataframe. We can view the rows matching a given species using the same square brackets we used above to select specific columns and rows. Here, however, instead of using a value or list of values, we will use a conditional expression.
A conditional expression is a statement that evaluates as either True or False. They often make use of inequality operators, for example:
-
==
for equals -
!=
for does not equal -
>
for greater than -
>=
for greater than or equal to -
<
for less than -
<=
for less than or equal to
Examples of conditional statements include:
-
"a" == "b"
evaluates False -
"a" != b"
evaluates True -
3 > 4
evaluates False
Note that, when comparing strings, evaluations are case sensitive:
-
"a" == "A"
evaluates False
= for assignment, == for equality
Remember that, in Python, a single equal sign is used to assign values to variables. We’ve already used the assignment operator in this lesson, for example, when we created a new column in the dataframe.
To limit the dataframe to rows matching the species “DM”, we will
again use square brackets. This time, instead of passing a string or a
number, we will include the conditional
surveys["species_id"] == "DM"
inside the square
brackets:
OUTPUT
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
7 | 8 | 7 | 16 | 1977 | 1 | DM | M | 37.0 | NaN |
8 | 9 | 7 | 16 | 1977 | 1 | DM | F | 34.0 | NaN |
… | … | … | … | … | … | … | … | … | … |
35532 | 35533 | 12 | 31 | 2002 | 14 | DM | F | 36.0 | 48.0 |
35533 | 35534 | 12 | 31 | 2002 | 14 | DM | M | 37.0 | 56.0 |
35534 | 35535 | 12 | 31 | 2002 | 14 | DM | M | 37.0 | 53.0 |
35535 | 35536 | 12 | 31 | 2002 | 14 | DM | F | 35.0 | 42.0 |
35536 | 35537 | 12 | 31 | 2002 | 14 | DM | F | 36.0 | 46.0 |
10596 rows × 9 columns
Other comparisons can be used in the same way. To limit our results to observations made in or after 2000, use:
OUTPUT
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
30158 | 30159 | 1 | 8 | 2000 | 1 | PP | F | 22.0 | 17.0 |
30159 | 30160 | 1 | 8 | 2000 | 1 | DO | M | 35.0 | 53.0 |
30160 | 30161 | 1 | 8 | 2000 | 1 | PP | F | 21.0 | 17.0 |
30161 | 30162 | 1 | 8 | 2000 | 1 | DM | M | 36.0 | 50.0 |
30162 | 30163 | 1 | 8 | 2000 | 1 | PP | M | 20.0 | 16.0 |
… | … | … | … | … | … | … | … | … | … |
35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 |
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN |
5391 rows × 9 columns
As when we selected columns above, each filtering operation returns a copy of the dataframe.
Using complex filters
When analyzing data, we will often need to filter on multiple columns at one time. In pandas, we can combine conditionals using bitwise operators. These work like the terms AND and OR in many search interfaces:
-
&
: True if conditions on both sides of the operator are True (and) -
|
: True if a condition on either side is True (or)
To return all observations of DM in or after 2000, we can combine the
two conditionals we used previously into a single operation. Note that,
when joining conditionals using &
or |
, we
must wrap each individual condition in parentheses. If we omit the
parentheses, pandas will not perform the comparisons in the expected
order.
OUTPUT
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
30161 | 30162 | 1 | 8 | 2000 | 1 | DM | M | 36.0 | 50.0 |
30178 | 30179 | 1 | 8 | 2000 | 12 | DM | M | 36.0 | 60.0 |
30195 | 30196 | 1 | 8 | 2000 | 17 | DM | M | 37.0 | 52.0 |
30196 | 30197 | 1 | 8 | 2000 | 17 | DM | F | 34.0 | 43.0 |
30209 | 30210 | 1 | 8 | 2000 | 22 | DM | M | 38.0 | 56.0 |
… | … | … | … | … | … | … | … | … | … |
35532 | 35533 | 12 | 31 | 2002 | 14 | DM | F | 36.0 | 48.0 |
35533 | 35534 | 12 | 31 | 2002 | 14 | DM | M | 37.0 | 56.0 |
35534 | 35535 | 12 | 31 | 2002 | 14 | DM | M | 37.0 | 53.0 |
35535 | 35536 | 12 | 31 | 2002 | 14 | DM | F | 35.0 | 42.0 |
35536 | 35537 | 12 | 31 | 2002 | 14 | DM | F | 36.0 | 46.0 |
847 rows × 9 columns
We can also use methods to filter the dataframe. For example,
isin()
can be used to match a list of values. Methods can
be combined with other conditionals as above. The example below returns
rows from 2000 or later with either “DM”, “DO”, or “DS” in the
species_id column:
OUTPUT
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
30159 | 30160 | 1 | 8 | 2000 | 1 | DO | M | 35.0 | 53.0 |
30161 | 30162 | 1 | 8 | 2000 | 1 | DM | M | 36.0 | 50.0 |
30166 | 30167 | 1 | 8 | 2000 | 1 | DO | M | 35.0 | 41.0 |
30170 | 30171 | 1 | 8 | 2000 | 2 | DO | M | 36.0 | 52.0 |
30172 | 30173 | 1 | 8 | 2000 | 2 | DO | F | 35.0 | 54.0 |
… | … | … | … | … | … | … | … | … | … |
35533 | 35534 | 12 | 31 | 2002 | 14 | DM | M | 37.0 | 56.0 |
35534 | 35535 | 12 | 31 | 2002 | 14 | DM | M | 37.0 | 53.0 |
35535 | 35536 | 12 | 31 | 2002 | 14 | DM | F | 35.0 | 42.0 |
35536 | 35537 | 12 | 31 | 2002 | 14 | DM | F | 36.0 | 46.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 |
1268 rows × 9 columns
Sorting data
We can sort a dataframe using the sort_values()
method.
To sort by weight, we’ll pass the name of that column to the
sort_values()
:
OUTPUT
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
28125 | 28126 | 6 | 28 | 1998 | 15 | PF | M | NaN | 4.0 |
217 | 218 | 9 | 13 | 1977 | 1 | PF | M | 13.0 | 4.0 |
9822 | 9823 | 1 | 19 | 1985 | 23 | RM | M | 16.0 | 4.0 |
9852 | 9853 | 1 | 19 | 1985 | 17 | RM | M | 16.0 | 4.0 |
9936 | 9937 | 2 | 16 | 1985 | 21 | RM | M | 16.0 | 4.0 |
… | … | … | … | … | … | … | … | … | … |
35530 | 35531 | 12 | 31 | 2002 | 13 | PB | F | 27.0 | NaN |
35543 | 35544 | 12 | 31 | 2002 | 15 | US | NaN | NaN | NaN |
35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN |
35549 rows × 9 columns
By default, rows are sorted in ascending order (smallest to largest). We can reorder them from largest to smallest using the ascending keyword argument:
OUTPUT
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
33048 | 33049 | 11 | 17 | 2001 | 12 | NL | M | 33.0 | 280.0 |
12870 | 12871 | 5 | 28 | 1987 | 2 | NL | M | 32.0 | 278.0 |
15458 | 15459 | 1 | 11 | 1989 | 9 | NL | M | 36.0 | 275.0 |
2132 | 2133 | 10 | 25 | 1979 | 2 | NL | F | 33.0 | 274.0 |
12728 | 12729 | 4 | 26 | 1987 | 2 | NL | M | 32.0 | 270.0 |
… | … | … | … | … | … | … | … | … | … |
35530 | 35531 | 12 | 31 | 2002 | 13 | PB | F | 27.0 | NaN |
35543 | 35544 | 12 | 31 | 2002 | 15 | US | NaN | NaN | NaN |
35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN |
35549 rows × 9 columns
We can sort on multiple fields at once by passing a list of column names. We can control how each column sorts by passing a list with the same number of values (that is, one value per column) to the ascending keyword. The cell below sorts the results first by species_id (largest to smallest), then by weight (smallest to largest):
OUTPUT
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
14249 | 14250 | 3 | 20 | 1988 | 18 | ZL | NaN | NaN | NaN |
14350 | 14351 | 4 | 17 | 1988 | 23 | ZL | NaN | NaN | NaN |
35511 | 35512 | 12 | 31 | 2002 | 11 | US | NaN | NaN | NaN |
35512 | 35513 | 12 | 31 | 2002 | 11 | US | NaN | NaN | NaN |
35527 | 35528 | 12 | 31 | 2002 | 13 | US | NaN | NaN | NaN |
… | … | … | … | … | … | … | … | … | … |
34756 | 34757 | 9 | 10 | 2002 | 23 | NaN | NaN | NaN | NaN |
34969 | 34970 | 10 | 6 | 2002 | 10 | NaN | NaN | NaN | NaN |
35187 | 35188 | 11 | 10 | 2002 | 10 | NaN | NaN | NaN | NaN |
35384 | 35385 | 12 | 8 | 2002 | 10 | NaN | NaN | NaN | NaN |
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN |
35549 rows × 9 columns
As with the dataframe methods above, sort_values()
returns a copy of the original dataframe and leaves the original
untouched.
Challenge
Write a query that returns year, species_id, and weight from the surveys table, sorted with the largest weights at the top.
PYTHON
# Create a subset containing only year, species_id, and weight
subset = surveys[["year", "species_id", "weight"]]
# Sort the subset by weight
subset.sort_values("weight", ascending=False)
OUTPUT
year | species_id | weight | |
---|---|---|---|
33048 | 2001 | NL | 280.0 |
12870 | 1987 | NL | 278.0 |
15458 | 1989 | NL | 275.0 |
2132 | 1979 | NL | 274.0 |
12728 | 1987 | NL | 270.0 |
… | … | … | … |
35530 | 2002 | PB | NaN |
35543 | 2002 | US | NaN |
35544 | 2002 | AH | NaN |
35545 | 2002 | AH | NaN |
35548 | 2002 | NaN | NaN |
35549 rows × 3 columns
Showing data on a plot
We will discuss data visualization using plotly in depth in lesson 6 but will introduce some fundamental concepts as we go along. Like pandas, plotly is an external package installed separately from Python itself. It can be used to create interactive, highly customizable plots based on pandas dataframes using just a few lines of code. For example, to create a scatter plot of the weight and hindfoot length, we need only to import plotly:
Once plotly is loaded, we can create a scatter plot using the
px.scatter()
method. We include the dataframe as the first
argument, then x and y keyword arguments to select the columns we want
to show on our scatter plot:
OUTPUT