Accessing and Filtering Data

Last updated on 2025-07-15 | Edit this page

Overview

Questions

How can we look at individual rows and columns in a dataframe?
How can we look at subsets of the dataset?

Objectives

Access individual rows and columns
Access multiple columns at once using a list
Filter the dataframe based on the data it contains
Sort the dataframe

In the previous lesson, we saw how to load a dataframe from a file. Now we’ll look at how to access the data within that dataframe. We’ll begin by importing pandas and reading our CSV, as we did in the previous lesson:

PYTHON

import pandas as pd

surveys = pd.read_csv("data/surveys.csv")
surveys

OUTPUT

	record_id	month	day	year	plot_id	species_id	sex	hindfoot_length	weight
0	1	7	16	1977	2	NL	M	32.0	NaN
1	2	7	16	1977	3	NL	M	33.0	NaN
2	3	7	16	1977	2	DM	F	37.0	NaN
3	4	7	16	1977	7	DM	M	36.0	NaN
4	5	7	16	1977	3	DM	M	35.0	NaN
…	…	…	…	…	…	…	…	…	…
35544	35545	12	31	2002	15	AH	NaN	NaN	NaN
35545	35546	12	31	2002	15	AH	NaN	NaN	NaN
35546	35547	12	31	2002	10	RM	F	15.0	14.0
35547	35548	12	31	2002	7	DO	M	36.0	51.0
35548	35549	12	31	2002	5	NaN	NaN	NaN	NaN

35549 rows × 9 columns

Getting a column

We can get a single column from the dataframe using square brackets. Square brackets are used in Python to access objects inside a container like a list, dict, or DataFrame. To get a column, we pass the name of the column inside a set brackets appended to the dataframe. For example, tp retrieve the year, use:

PYTHON

surveys["year"]

OUTPUT

0        1977
1        1977
2        1977
3        1977
4        1977
         ...
35544    2002
35545    2002
35546    2002
35547    2002
35548    2002
Name: year, Length: 35549, dtype: int64

A single column is returned as a Series, which is why the output for this cell is formatted differently than the other cells in this lesson. The main difference between a Series and a DataFrame is that a DataFrame can have multiple columns. The two classes share many but not all attributes and methods.

Note also that this is a copy of data from the original dataframe. Changing values in the series will have no effect on the original dataframe. Most operations on a DataFrame or Series return a copy of the original object.

Getting unique values

We can get the list of unique values within a column using the unique() method:

PYTHON

surveys["species_id"].unique()

OUTPUT

array(['NL', 'DM', 'PF', 'PE', 'DS', 'PP', 'SH', 'OT', 'DO', 'OX', 'SS',
       'OL', 'RM', nan, 'SA', 'PM', 'AH', 'DX', 'AB', 'CB', 'CM', 'CQ',
       'RF', 'PC', 'PG', 'PH', 'PU', 'CV', 'UR', 'UP', 'ZL', 'UL', 'CS',
       'SC', 'BA', 'SF', 'RO', 'AS', 'SO', 'PI', 'ST', 'CU', 'SU', 'RX',
       'PB', 'PL', 'PX', 'CT', 'US'], dtype=object)

Getting multiple columns

It is also possible to retrieve more than one column at a time. To do some, we’ll use the built-in list data type.

Python uses list to store sequences, that is, ordered lists of objects. Any type of Python object can be stored in a list: strings, integers, floats, even other lists or collections. Once created, a list can be modified by appending, inserting, or deleting items. We will not be going into detail about these operations here, but as always you can learn more about how to use a list using help() or the Python docs.

You can create a list using square brackets. Let’s create a list of the three columns in our dataframe that together give the date of an observation:

PYTHON

cols = ["year", "month", "day"]

When we pass this list to the survey dataframe using square brackets, as we did above, we get a copy of the dataframe containing just those columns. Note that, because we asked for more than one column, pandas returns a dataframe:

PYTHON

surveys[cols]

OUTPUT

	year	month	day
0	1977	7	16
1	1977	7	16
2	1977	7	16
3	1977	7	16
4	1977	7	16
…	…	…	…
35544	2002	12	31
35545	2002	12	31
35546	2002	12	31
35547	2002	12	31
35548	2002	12	31

35549 rows × 3 columns

Suppose we want to get the unique values for multiple columns. The unique() method only works on a Series, that is, a single column. Instead, we can use the drop_duplicates() method on a copy of the dataframe with the columns we’re interested in. Like any well-named method, drop_duplicates() does exactly what the name implies: It returns a copy of the dataframe with all duplicate rows removed.

PYTHON

surveys[["plot_id", "species_id"]].drop_duplicates()

OUTPUT

	plot_id	species_id
0	2	NL
1	3	NL
2	2	DM
3	7	DM
4	3	DM
…	…	…
35474	22	SF
35511	11	US
35519	9	SF
35527	13	US
35543	15	US

612 rows × 2 columns

Getting one or more rows

Pandas provides a variety of ways to view rows within a dataframe. We can get the rows at the beginning of the dataframe using head():

PYTHON

surveys.head()

OUTPUT

	record_id	month	day	year	plot_id	species_id	sex	hindfoot_length	weight
0	1	7	16	1977	2	NL	M	32.0	NaN
1	2	7	16	1977	3	NL	M	33.0	NaN
2	3	7	16	1977	2	DM	F	37.0	NaN
3	4	7	16	1977	7	DM	M	36.0	NaN
4	5	7	16	1977	3	DM	M	35.0	NaN

By default, this methods returns the first five rows. We can provide a number inside the parentheses if we want to view a different number of rows:

PYTHON

surveys.head(10)

OUTPUT

	record_id	month	day	year	plot_id	species_id	sex	hindfoot_length	weight
0	1	7	16	1977	2	NL	M	32.0	NaN
1	2	7	16	1977	3	NL	M	33.0	NaN
2	3	7	16	1977	2	DM	F	37.0	NaN
3	4	7	16	1977	7	DM	M	36.0	NaN
4	5	7	16	1977	3	DM	M	35.0	NaN
5	6	7	16	1977	1	PF	M	14.0	NaN
6	7	7	16	1977	2	PE	F	NaN	NaN
7	8	7	16	1977	1	DM	M	37.0	NaN
8	9	7	16	1977	1	DM	F	34.0	NaN
9	10	7	16	1977	6	PF	F	20.0	NaN

The tail() method is similar, except it returns rows from the end of the table:

PYTHON

surveys.tail()

OUTPUT

	record_id	month	day	year	plot_id	species_id	sex	hindfoot_length	weight
35544	35545	12	31	2002	15	AH	NaN	NaN	NaN
35545	35546	12	31	2002	15	AH	NaN	NaN	NaN
35546	35547	12	31	2002	10	RM	F	15.0	14.0
35547	35548	12	31	2002	7	DO	M	36.0	51.0
35548	35549	12	31	2002	5	NaN	NaN	NaN	NaN

Or we can use sample() to return a random row from anywhere in the dataframe:

PYTHON

surveys.sample()

OUTPUT

	record_id	month	day	year	plot_id	species_id	sex	hindfoot_length	weight
25936	25937	5	11	1997	15	OT	F	20.0	25.0

If you’re following along, you may notice that the output of this method on your screen differs from what’s shown here. That’s exactly what we’d expect to see. Remember, sample() is returnning a random row—it would be more surprising if the outputs were the same!

Slicing the dataframe

The head(), tail(), and sample() methods are useful for getting a feel for how our data is structured, but we may also want to look at specific rows. One way to do so is to extract rows based on where they appear in the dataframe. We can use square brackets to extract these slices. A slice is a subset of the dataframe starting at one row and ending at another. To get a slice, we pass the starting and ending indexes to the square brackets as start:end:

PYTHON

surveys[2:5]

OUTPUT

	record_id	month	day	year	plot_id	species_id	sex	hindfoot_length	weight
2	3	7	16	1977	2	DM	F	37.0	NaN
3	4	7	16	1977	7	DM	M	36.0	NaN
4	5	7	16	1977	3	DM	M	35.0	NaN

There are three things to be aware of when slicing a dataframe:

Row indexes are zero-based. The first row has an index of 0, not
When slicing, the slice includes start but not the end index. In this case, that means the slice includes rows 2, 3, and 4 but not 5.
The row label can be different from the row index. They happen to be the same here, but don’t count on that being true.

Core Python types like list and tuple use the same conventions, as do most Python pacakges that work with sequences.

Filtering data

It is often more useful to subset a dataframe based on the data itself. Pandas provides a variety of ways to filter a dataframe in this way. For example, suppose we want to look at a specific species in the surveys dataframe. We can view the rows matching a given species using the same square brackets we used above to select specific columns and rows. Here, however, instead of using a value or list of values, we will use a conditional expression.

A conditional expression is a statement that evaluates as either True or False. They often make use of inequality operators, for example:

== for equals
!= for does not equal
> for greater than
>= for greater than or equal to
< for less than
<= for less than or equal to

Examples of conditional statements include:

"a" == "b" evaluates False
"a" != b" evaluates True
3 > 4 evaluates False

Note that, when comparing strings, evaluations are case sensitive:

"a" == "A" evaluates False

= for assignment, == for equality

Remember that, in Python, a single equal sign is used to assign values to variables. We’ve already used the assignment operator in this lesson, for example, when we created a new column in the dataframe.

To limit the dataframe to rows matching the species “DM”, we will again use square brackets. This time, instead of passing a string or a number, we will include the conditional surveys["species_id"] == "DM" inside the square brackets:

PYTHON

surveys[surveys["species_id"] == "DM"]

OUTPUT

	record_id	month	day	year	plot_id	species_id	sex	hindfoot_length	weight
2	3	7	16	1977	2	DM	F	37.0	NaN
3	4	7	16	1977	7	DM	M	36.0	NaN
4	5	7	16	1977	3	DM	M	35.0	NaN
7	8	7	16	1977	1	DM	M	37.0	NaN
8	9	7	16	1977	1	DM	F	34.0	NaN
…	…	…	…	…	…	…	…	…	…
35532	35533	12	31	2002	14	DM	F	36.0	48.0
35533	35534	12	31	2002	14	DM	M	37.0	56.0
35534	35535	12	31	2002	14	DM	M	37.0	53.0
35535	35536	12	31	2002	14	DM	F	35.0	42.0
35536	35537	12	31	2002	14	DM	F	36.0	46.0

10596 rows × 9 columns

Other comparisons can be used in the same way. To limit our results to observations made in or after 2000, use:

PYTHON

surveys[surveys["year"] >= 2000]

OUTPUT

	record_id	month	day	year	plot_id	species_id	sex	hindfoot_length	weight
30158	30159	1	8	2000	1	PP	F	22.0	17.0
30159	30160	1	8	2000	1	DO	M	35.0	53.0
30160	30161	1	8	2000	1	PP	F	21.0	17.0
30161	30162	1	8	2000	1	DM	M	36.0	50.0
30162	30163	1	8	2000	1	PP	M	20.0	16.0
…	…	…	…	…	…	…	…	…	…
35544	35545	12	31	2002	15	AH	NaN	NaN	NaN
35545	35546	12	31	2002	15	AH	NaN	NaN	NaN
35546	35547	12	31	2002	10	RM	F	15.0	14.0
35547	35548	12	31	2002	7	DO	M	36.0	51.0
35548	35549	12	31	2002	5	NaN	NaN	NaN	NaN

5391 rows × 9 columns

As when we selected columns above, each filtering operation returns a copy of the dataframe.

Using complex filters

When analyzing data, we will often need to filter on multiple columns at one time. In pandas, we can combine conditionals using bitwise operators. These work like the terms AND and OR in many search interfaces:

&: True if conditions on both sides of the operator are True (and)
|: True if a condition on either side is True (or)

To return all observations of DM in or after 2000, we can combine the two conditionals we used previously into a single operation. Note that, when joining conditionals using & or |, we must wrap each individual condition in parentheses. If we omit the parentheses, pandas will not perform the comparisons in the expected order.

PYTHON

surveys[(surveys["species_id"] == "DM") & (surveys["year"] >= 2000)]

OUTPUT

	record_id	month	day	year	plot_id	species_id	sex	hindfoot_length	weight
30161	30162	1	8	2000	1	DM	M	36.0	50.0
30178	30179	1	8	2000	12	DM	M	36.0	60.0
30195	30196	1	8	2000	17	DM	M	37.0	52.0
30196	30197	1	8	2000	17	DM	F	34.0	43.0
30209	30210	1	8	2000	22	DM	M	38.0	56.0
…	…	…	…	…	…	…	…	…	…
35532	35533	12	31	2002	14	DM	F	36.0	48.0
35533	35534	12	31	2002	14	DM	M	37.0	56.0
35534	35535	12	31	2002	14	DM	M	37.0	53.0
35535	35536	12	31	2002	14	DM	F	35.0	42.0
35536	35537	12	31	2002	14	DM	F	36.0	46.0

847 rows × 9 columns

We can also use methods to filter the dataframe. For example, isin() can be used to match a list of values. Methods can be combined with other conditionals as above. The example below returns rows from 2000 or later with either “DM”, “DO”, or “DS” in the species_id column:

PYTHON

surveys[
    surveys["species_id"].isin(["DM", "DO", "DS"]) & (surveys["year"] >= 2000)
]

OUTPUT

	record_id	month	day	year	plot_id	species_id	sex	hindfoot_length	weight
30159	30160	1	8	2000	1	DO	M	35.0	53.0
30161	30162	1	8	2000	1	DM	M	36.0	50.0
30166	30167	1	8	2000	1	DO	M	35.0	41.0
30170	30171	1	8	2000	2	DO	M	36.0	52.0
30172	30173	1	8	2000	2	DO	F	35.0	54.0
…	…	…	…	…	…	…	…	…	…
35533	35534	12	31	2002	14	DM	M	37.0	56.0
35534	35535	12	31	2002	14	DM	M	37.0	53.0
35535	35536	12	31	2002	14	DM	F	35.0	42.0
35536	35537	12	31	2002	14	DM	F	36.0	46.0
35547	35548	12	31	2002	7	DO	M	36.0	51.0

1268 rows × 9 columns

Sorting data

We can sort a dataframe using the sort_values() method. To sort by weight, we’ll pass the name of that column to the sort_values():

PYTHON

surveys.sort_values("weight")

OUTPUT

	record_id	month	day	year	plot_id	species_id	sex	hindfoot_length	weight
28125	28126	6	28	1998	15	PF	M	NaN	4.0
217	218	9	13	1977	1	PF	M	13.0	4.0
9822	9823	1	19	1985	23	RM	M	16.0	4.0
9852	9853	1	19	1985	17	RM	M	16.0	4.0
9936	9937	2	16	1985	21	RM	M	16.0	4.0
…	…	…	…	…	…	…	…	…	…
35530	35531	12	31	2002	13	PB	F	27.0	NaN
35543	35544	12	31	2002	15	US	NaN	NaN	NaN
35544	35545	12	31	2002	15	AH	NaN	NaN	NaN
35545	35546	12	31	2002	15	AH	NaN	NaN	NaN
35548	35549	12	31	2002	5	NaN	NaN	NaN	NaN

35549 rows × 9 columns

By default, rows are sorted in ascending order (smallest to largest). We can reorder them from largest to smallest using the ascending keyword argument:

PYTHON

surveys.sort_values("weight", ascending=False)

OUTPUT

	record_id	month	day	year	plot_id	species_id	sex	hindfoot_length	weight
33048	33049	11	17	2001	12	NL	M	33.0	280.0
12870	12871	5	28	1987	2	NL	M	32.0	278.0
15458	15459	1	11	1989	9	NL	M	36.0	275.0
2132	2133	10	25	1979	2	NL	F	33.0	274.0
12728	12729	4	26	1987	2	NL	M	32.0	270.0
…	…	…	…	…	…	…	…	…	…
35530	35531	12	31	2002	13	PB	F	27.0	NaN
35543	35544	12	31	2002	15	US	NaN	NaN	NaN
35544	35545	12	31	2002	15	AH	NaN	NaN	NaN
35545	35546	12	31	2002	15	AH	NaN	NaN	NaN
35548	35549	12	31	2002	5	NaN	NaN	NaN	NaN

35549 rows × 9 columns

We can sort on multiple fields at once by passing a list of column names. We can control how each column sorts by passing a list with the same number of values (that is, one value per column) to the ascending keyword. The cell below sorts the results first by species_id (largest to smallest), then by weight (smallest to largest):

PYTHON

surveys.sort_values(["species_id", "weight"], ascending=[False, True])

OUTPUT

	record_id	month	day	year	plot_id	species_id	sex	hindfoot_length	weight
14249	14250	3	20	1988	18	ZL	NaN	NaN	NaN
14350	14351	4	17	1988	23	ZL	NaN	NaN	NaN
35511	35512	12	31	2002	11	US	NaN	NaN	NaN
35512	35513	12	31	2002	11	US	NaN	NaN	NaN
35527	35528	12	31	2002	13	US	NaN	NaN	NaN
…	…	…	…	…	…	…	…	…	…
34756	34757	9	10	2002	23	NaN	NaN	NaN	NaN
34969	34970	10	6	2002	10	NaN	NaN	NaN	NaN
35187	35188	11	10	2002	10	NaN	NaN	NaN	NaN
35384	35385	12	8	2002	10	NaN	NaN	NaN	NaN
35548	35549	12	31	2002	5	NaN	NaN	NaN	NaN

35549 rows × 9 columns

As with the dataframe methods above, sort_values() returns a copy of the original dataframe and leaves the original untouched.

Challenge

Write a query that returns year, species_id, and weight from the surveys table, sorted with the largest weights at the top.

Show me the solution

PYTHON

# Create a subset containing only year, species_id, and weight
subset = surveys[["year", "species_id", "weight"]]

# Sort the subset by weight
subset.sort_values("weight", ascending=False)

OUTPUT

	year	species_id	weight
33048	2001	NL	280.0
12870	1987	NL	278.0
15458	1989	NL	275.0
2132	1979	NL	274.0
12728	1987	NL	270.0
…	…	…	…
35530	2002	PB	NaN
35543	2002	US	NaN
35544	2002	AH	NaN
35545	2002	AH	NaN
35548	2002	NaN	NaN

35549 rows × 3 columns

Showing data on a plot

We will discuss data visualization using plotly in depth in lesson 6 but will introduce some fundamental concepts as we go along. Like pandas, plotly is an external package installed separately from Python itself. It can be used to create interactive, highly customizable plots based on pandas dataframes using just a few lines of code. For example, to create a scatter plot of the weight and hindfoot length, we need only to import plotly:

PYTHON

import plotly.express as px

Once plotly is loaded, we can create a scatter plot using the px.scatter() method. We include the dataframe as the first argument, then x and y keyword arguments to select the columns we want to show on our scatter plot:

PYTHON

px.scatter(surveys, x="weight", y="hindfoot_length")

OUTPUT

This simple plot is limited in what it can tell us about the observations in the dataset. We will return to this scatter plot in later lessons to see how we can improve it to better understand the survey data.

Key Points

Use square brackets to access rows, columns, and specific cells
Sort data and get unique values in a dataframe using methods provided by pandas
By default, most dataframe operations return a copy of the original data
Scatter plots can be used to visualize how two parameters in a dataset covary