Accessing and Filtering Data

Last updated on 2025-07-15 | Edit this page

Estimated time: 60 minutes

Overview

Questions

  • How can we look at individual rows and columns in a dataframe?
  • How can we look at subsets of the dataset?

Objectives

  • Access individual rows and columns
  • Access multiple columns at once using a list
  • Filter the dataframe based on the data it contains
  • Sort the dataframe

In the previous lesson, we saw how to load a dataframe from a file. Now we’ll look at how to access the data within that dataframe. We’ll begin by importing pandas and reading our CSV, as we did in the previous lesson:

PYTHON

import pandas as pd

surveys = pd.read_csv("data/surveys.csv")
surveys

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length weight
0 1 7 16 1977 2 NL M 32.0 NaN
1 2 7 16 1977 3 NL M 33.0 NaN
2 3 7 16 1977 2 DM F 37.0 NaN
3 4 7 16 1977 7 DM M 36.0 NaN
4 5 7 16 1977 3 DM M 35.0 NaN
35544 35545 12 31 2002 15 AH NaN NaN NaN
35545 35546 12 31 2002 15 AH NaN NaN NaN
35546 35547 12 31 2002 10 RM F 15.0 14.0
35547 35548 12 31 2002 7 DO M 36.0 51.0
35548 35549 12 31 2002 5 NaN NaN NaN NaN

35549 rows × 9 columns

Getting a column


We can get a single column from the dataframe using square brackets. Square brackets are used in Python to access objects inside a container like a list, dict, or DataFrame. To get a column, we pass the name of the column inside a set brackets appended to the dataframe. For example, tp retrieve the year, use:

PYTHON

surveys["year"]

OUTPUT

0        1977
1        1977
2        1977
3        1977
4        1977
         ...
35544    2002
35545    2002
35546    2002
35547    2002
35548    2002
Name: year, Length: 35549, dtype: int64

A single column is returned as a Series, which is why the output for this cell is formatted differently than the other cells in this lesson. The main difference between a Series and a DataFrame is that a DataFrame can have multiple columns. The two classes share many but not all attributes and methods.

Note also that this is a copy of data from the original dataframe. Changing values in the series will have no effect on the original dataframe. Most operations on a DataFrame or Series return a copy of the original object.

Getting unique values


We can get the list of unique values within a column using the unique() method:

PYTHON

surveys["species_id"].unique()

OUTPUT

array(['NL', 'DM', 'PF', 'PE', 'DS', 'PP', 'SH', 'OT', 'DO', 'OX', 'SS',
       'OL', 'RM', nan, 'SA', 'PM', 'AH', 'DX', 'AB', 'CB', 'CM', 'CQ',
       'RF', 'PC', 'PG', 'PH', 'PU', 'CV', 'UR', 'UP', 'ZL', 'UL', 'CS',
       'SC', 'BA', 'SF', 'RO', 'AS', 'SO', 'PI', 'ST', 'CU', 'SU', 'RX',
       'PB', 'PL', 'PX', 'CT', 'US'], dtype=object)

Getting multiple columns


It is also possible to retrieve more than one column at a time. To do some, we’ll use the built-in list data type.

Python uses list to store sequences, that is, ordered lists of objects. Any type of Python object can be stored in a list: strings, integers, floats, even other lists or collections. Once created, a list can be modified by appending, inserting, or deleting items. We will not be going into detail about these operations here, but as always you can learn more about how to use a list using help() or the Python docs.

You can create a list using square brackets. Let’s create a list of the three columns in our dataframe that together give the date of an observation:

PYTHON

cols = ["year", "month", "day"]

When we pass this list to the survey dataframe using square brackets, as we did above, we get a copy of the dataframe containing just those columns. Note that, because we asked for more than one column, pandas returns a dataframe:

PYTHON

surveys[cols]

OUTPUT

year month day
0 1977 7 16
1 1977 7 16
2 1977 7 16
3 1977 7 16
4 1977 7 16
35544 2002 12 31
35545 2002 12 31
35546 2002 12 31
35547 2002 12 31
35548 2002 12 31

35549 rows × 3 columns

Suppose we want to get the unique values for multiple columns. The unique() method only works on a Series, that is, a single column. Instead, we can use the drop_duplicates() method on a copy of the dataframe with the columns we’re interested in. Like any well-named method, drop_duplicates() does exactly what the name implies: It returns a copy of the dataframe with all duplicate rows removed.

PYTHON

surveys[["plot_id", "species_id"]].drop_duplicates()

OUTPUT

plot_id species_id
0 2 NL
1 3 NL
2 2 DM
3 7 DM
4 3 DM
35474 22 SF
35511 11 US
35519 9 SF
35527 13 US
35543 15 US

612 rows × 2 columns

Getting one or more rows


Pandas provides a variety of ways to view rows within a dataframe. We can get the rows at the beginning of the dataframe using head():

PYTHON

surveys.head()

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length weight
0 1 7 16 1977 2 NL M 32.0 NaN
1 2 7 16 1977 3 NL M 33.0 NaN
2 3 7 16 1977 2 DM F 37.0 NaN
3 4 7 16 1977 7 DM M 36.0 NaN
4 5 7 16 1977 3 DM M 35.0 NaN

By default, this methods returns the first five rows. We can provide a number inside the parentheses if we want to view a different number of rows:

PYTHON

surveys.head(10)

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length weight
0 1 7 16 1977 2 NL M 32.0 NaN
1 2 7 16 1977 3 NL M 33.0 NaN
2 3 7 16 1977 2 DM F 37.0 NaN
3 4 7 16 1977 7 DM M 36.0 NaN
4 5 7 16 1977 3 DM M 35.0 NaN
5 6 7 16 1977 1 PF M 14.0 NaN
6 7 7 16 1977 2 PE F NaN NaN
7 8 7 16 1977 1 DM M 37.0 NaN
8 9 7 16 1977 1 DM F 34.0 NaN
9 10 7 16 1977 6 PF F 20.0 NaN

The tail() method is similar, except it returns rows from the end of the table:

PYTHON

surveys.tail()

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length weight
35544 35545 12 31 2002 15 AH NaN NaN NaN
35545 35546 12 31 2002 15 AH NaN NaN NaN
35546 35547 12 31 2002 10 RM F 15.0 14.0
35547 35548 12 31 2002 7 DO M 36.0 51.0
35548 35549 12 31 2002 5 NaN NaN NaN NaN

Or we can use sample() to return a random row from anywhere in the dataframe:

PYTHON

surveys.sample()

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length weight
25936 25937 5 11 1997 15 OT F 20.0 25.0

If you’re following along, you may notice that the output of this method on your screen differs from what’s shown here. That’s exactly what we’d expect to see. Remember, sample() is returnning a random row—it would be more surprising if the outputs were the same!

Slicing the dataframe


The head(), tail(), and sample() methods are useful for getting a feel for how our data is structured, but we may also want to look at specific rows. One way to do so is to extract rows based on where they appear in the dataframe. We can use square brackets to extract these slices. A slice is a subset of the dataframe starting at one row and ending at another. To get a slice, we pass the starting and ending indexes to the square brackets as start:end:

PYTHON

surveys[2:5]

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length weight
2 3 7 16 1977 2 DM F 37.0 NaN
3 4 7 16 1977 7 DM M 36.0 NaN
4 5 7 16 1977 3 DM M 35.0 NaN

There are three things to be aware of when slicing a dataframe:

  • Row indexes are zero-based. The first row has an index of 0, not
  • When slicing, the slice includes start but not the end index. In this case, that means the slice includes rows 2, 3, and 4 but not 5.
  • The row label can be different from the row index. They happen to be the same here, but don’t count on that being true.

Core Python types like list and tuple use the same conventions, as do most Python pacakges that work with sequences.

Filtering data


It is often more useful to subset a dataframe based on the data itself. Pandas provides a variety of ways to filter a dataframe in this way. For example, suppose we want to look at a specific species in the surveys dataframe. We can view the rows matching a given species using the same square brackets we used above to select specific columns and rows. Here, however, instead of using a value or list of values, we will use a conditional expression.

A conditional expression is a statement that evaluates as either True or False. They often make use of inequality operators, for example:

  • == for equals
  • != for does not equal
  • > for greater than
  • >= for greater than or equal to
  • < for less than
  • <= for less than or equal to

Examples of conditional statements include:

  • "a" == "b" evaluates False
  • "a" != b" evaluates True
  • 3 > 4 evaluates False

Note that, when comparing strings, evaluations are case sensitive:

  • "a" == "A" evaluates False

= for assignment, == for equality

Remember that, in Python, a single equal sign is used to assign values to variables. We’ve already used the assignment operator in this lesson, for example, when we created a new column in the dataframe.

To limit the dataframe to rows matching the species “DM”, we will again use square brackets. This time, instead of passing a string or a number, we will include the conditional surveys["species_id"] == "DM" inside the square brackets:

PYTHON

surveys[surveys["species_id"] == "DM"]

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length weight
2 3 7 16 1977 2 DM F 37.0 NaN
3 4 7 16 1977 7 DM M 36.0 NaN
4 5 7 16 1977 3 DM M 35.0 NaN
7 8 7 16 1977 1 DM M 37.0 NaN
8 9 7 16 1977 1 DM F 34.0 NaN
35532 35533 12 31 2002 14 DM F 36.0 48.0
35533 35534 12 31 2002 14 DM M 37.0 56.0
35534 35535 12 31 2002 14 DM M 37.0 53.0
35535 35536 12 31 2002 14 DM F 35.0 42.0
35536 35537 12 31 2002 14 DM F 36.0 46.0

10596 rows × 9 columns

Other comparisons can be used in the same way. To limit our results to observations made in or after 2000, use:

PYTHON

surveys[surveys["year"] >= 2000]

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length weight
30158 30159 1 8 2000 1 PP F 22.0 17.0
30159 30160 1 8 2000 1 DO M 35.0 53.0
30160 30161 1 8 2000 1 PP F 21.0 17.0
30161 30162 1 8 2000 1 DM M 36.0 50.0
30162 30163 1 8 2000 1 PP M 20.0 16.0
35544 35545 12 31 2002 15 AH NaN NaN NaN
35545 35546 12 31 2002 15 AH NaN NaN NaN
35546 35547 12 31 2002 10 RM F 15.0 14.0
35547 35548 12 31 2002 7 DO M 36.0 51.0
35548 35549 12 31 2002 5 NaN NaN NaN NaN

5391 rows × 9 columns

As when we selected columns above, each filtering operation returns a copy of the dataframe.

Using complex filters


When analyzing data, we will often need to filter on multiple columns at one time. In pandas, we can combine conditionals using bitwise operators. These work like the terms AND and OR in many search interfaces:

  • &: True if conditions on both sides of the operator are True (and)
  • |: True if a condition on either side is True (or)

To return all observations of DM in or after 2000, we can combine the two conditionals we used previously into a single operation. Note that, when joining conditionals using & or |, we must wrap each individual condition in parentheses. If we omit the parentheses, pandas will not perform the comparisons in the expected order.

PYTHON

surveys[(surveys["species_id"] == "DM") & (surveys["year"] >= 2000)]

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length weight
30161 30162 1 8 2000 1 DM M 36.0 50.0
30178 30179 1 8 2000 12 DM M 36.0 60.0
30195 30196 1 8 2000 17 DM M 37.0 52.0
30196 30197 1 8 2000 17 DM F 34.0 43.0
30209 30210 1 8 2000 22 DM M 38.0 56.0
35532 35533 12 31 2002 14 DM F 36.0 48.0
35533 35534 12 31 2002 14 DM M 37.0 56.0
35534 35535 12 31 2002 14 DM M 37.0 53.0
35535 35536 12 31 2002 14 DM F 35.0 42.0
35536 35537 12 31 2002 14 DM F 36.0 46.0

847 rows × 9 columns

We can also use methods to filter the dataframe. For example, isin() can be used to match a list of values. Methods can be combined with other conditionals as above. The example below returns rows from 2000 or later with either “DM”, “DO”, or “DS” in the species_id column:

PYTHON

surveys[
    surveys["species_id"].isin(["DM", "DO", "DS"]) & (surveys["year"] >= 2000)
]

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length weight
30159 30160 1 8 2000 1 DO M 35.0 53.0
30161 30162 1 8 2000 1 DM M 36.0 50.0
30166 30167 1 8 2000 1 DO M 35.0 41.0
30170 30171 1 8 2000 2 DO M 36.0 52.0
30172 30173 1 8 2000 2 DO F 35.0 54.0
35533 35534 12 31 2002 14 DM M 37.0 56.0
35534 35535 12 31 2002 14 DM M 37.0 53.0
35535 35536 12 31 2002 14 DM F 35.0 42.0
35536 35537 12 31 2002 14 DM F 36.0 46.0
35547 35548 12 31 2002 7 DO M 36.0 51.0

1268 rows × 9 columns

Sorting data


We can sort a dataframe using the sort_values() method. To sort by weight, we’ll pass the name of that column to the sort_values():

PYTHON

surveys.sort_values("weight")

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length weight
28125 28126 6 28 1998 15 PF M NaN 4.0
217 218 9 13 1977 1 PF M 13.0 4.0
9822 9823 1 19 1985 23 RM M 16.0 4.0
9852 9853 1 19 1985 17 RM M 16.0 4.0
9936 9937 2 16 1985 21 RM M 16.0 4.0
35530 35531 12 31 2002 13 PB F 27.0 NaN
35543 35544 12 31 2002 15 US NaN NaN NaN
35544 35545 12 31 2002 15 AH NaN NaN NaN
35545 35546 12 31 2002 15 AH NaN NaN NaN
35548 35549 12 31 2002 5 NaN NaN NaN NaN

35549 rows × 9 columns

By default, rows are sorted in ascending order (smallest to largest). We can reorder them from largest to smallest using the ascending keyword argument:

PYTHON

surveys.sort_values("weight", ascending=False)

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length weight
33048 33049 11 17 2001 12 NL M 33.0 280.0
12870 12871 5 28 1987 2 NL M 32.0 278.0
15458 15459 1 11 1989 9 NL M 36.0 275.0
2132 2133 10 25 1979 2 NL F 33.0 274.0
12728 12729 4 26 1987 2 NL M 32.0 270.0
35530 35531 12 31 2002 13 PB F 27.0 NaN
35543 35544 12 31 2002 15 US NaN NaN NaN
35544 35545 12 31 2002 15 AH NaN NaN NaN
35545 35546 12 31 2002 15 AH NaN NaN NaN
35548 35549 12 31 2002 5 NaN NaN NaN NaN

35549 rows × 9 columns

We can sort on multiple fields at once by passing a list of column names. We can control how each column sorts by passing a list with the same number of values (that is, one value per column) to the ascending keyword. The cell below sorts the results first by species_id (largest to smallest), then by weight (smallest to largest):

PYTHON

surveys.sort_values(["species_id", "weight"], ascending=[False, True])

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length weight
14249 14250 3 20 1988 18 ZL NaN NaN NaN
14350 14351 4 17 1988 23 ZL NaN NaN NaN
35511 35512 12 31 2002 11 US NaN NaN NaN
35512 35513 12 31 2002 11 US NaN NaN NaN
35527 35528 12 31 2002 13 US NaN NaN NaN
34756 34757 9 10 2002 23 NaN NaN NaN NaN
34969 34970 10 6 2002 10 NaN NaN NaN NaN
35187 35188 11 10 2002 10 NaN NaN NaN NaN
35384 35385 12 8 2002 10 NaN NaN NaN NaN
35548 35549 12 31 2002 5 NaN NaN NaN NaN

35549 rows × 9 columns

As with the dataframe methods above, sort_values() returns a copy of the original dataframe and leaves the original untouched.

Challenge

Write a query that returns year, species_id, and weight from the surveys table, sorted with the largest weights at the top.

PYTHON

# Create a subset containing only year, species_id, and weight
subset = surveys[["year", "species_id", "weight"]]

# Sort the subset by weight
subset.sort_values("weight", ascending=False)

OUTPUT

year species_id weight
33048 2001 NL 280.0
12870 1987 NL 278.0
15458 1989 NL 275.0
2132 1979 NL 274.0
12728 1987 NL 270.0
35530 2002 PB NaN
35543 2002 US NaN
35544 2002 AH NaN
35545 2002 AH NaN
35548 2002 NaN NaN

35549 rows × 3 columns

Showing data on a plot


We will discuss data visualization using plotly in depth in lesson 6 but will introduce some fundamental concepts as we go along. Like pandas, plotly is an external package installed separately from Python itself. It can be used to create interactive, highly customizable plots based on pandas dataframes using just a few lines of code. For example, to create a scatter plot of the weight and hindfoot length, we need only to import plotly:

PYTHON

import plotly.express as px

Once plotly is loaded, we can create a scatter plot using the px.scatter() method. We include the dataframe as the first argument, then x and y keyword arguments to select the columns we want to show on our scatter plot:

PYTHON

px.scatter(surveys, x="weight", y="hindfoot_length")

OUTPUT

This simple plot is limited in what it can tell us about the observations in the dataset. We will return to this scatter plot in later lessons to see how we can improve it to better understand the survey data.

Key Points

  • Use square brackets to access rows, columns, and specific cells
  • Sort data and get unique values in a dataframe using methods provided by pandas
  • By default, most dataframe operations return a copy of the original data
  • Scatter plots can be used to visualize how two parameters in a dataset covary