Accessing Data in a Dataframe
Last updated on 2022-07-18
Estimated time: 60 minutes
Overview
Questions
- How can we look at individual rows and columns in a dataframe?
- How can we perform calculations?
- How can we modify the table and data?
Objectives
- Access individual rows and columns
- Access multiple columns at once using a list
- Perform calculations like addition and subtraction
- Rename columns using a dictionary
- Access rows containing specific data
- Sort the data returned by a query
- Modify data using loc
We’ll begin by importing pandas and reading our CSV, as we did in the previous lesson:
PYTHON
import pandas as pd
surveys = pd.read_csv("data/surveys.csv")
surveys
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN |
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN |
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
… | … | … | … | … | … | … | … | … | … |
35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 |
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN |
35549 rows × 9 columns
We will now look at how to access rows and columns in the dataframe.
Getting columns
We can get the values from a single column by passing a string inside square brackets to the dataframe object. For example, to look at the year column, use:
PYTHON
surveys["year"]
OUTPUT
0 1977
1 1977
2 1977
3 1977
4 1977
...
35544 2002
35545 2002
35546 2002
35547 2002
35548 2002
Name: year, Length: 35549, dtype: int64
A single column is returned as a Series, which is why the output for this cell is formatted differently than the other cells in this lesson. The main difference between a Series and a DataFrame is that a DataFrame can have multiple columns. The two classes share many but not all attributes and methods. Note also that this is a copy of the data from the original dataframe: changing values in the series will have no effect on the original dataframe.
Using lists to get more than one column at a time
Python uses the built-in list data type to store sequences, that is, an ordered list of objects. Any type of Python object can be stored in a list: strings, integers, floats, even other lists or collections. Once created, a list can be modified by appending, inserting, or deleting items. We will not be going into detail about these operations here, but as always you can learn more about how to use a list using help() or the Python docs.
Create a list using square brackets. Let’s create a list of the three columns in our dataframe that together give the date of an observation:
PYTHON
cols = ["year", "month", "day"]
When we pass this list to the survey dataframe using square brackets, as we did above, we retrieve a copy of the dataframe containing just those columns. Note that, when we get more than one column, pandas returns a dataframe:
PYTHON
surveys[cols]
year | month | day | |
---|---|---|---|
0 | 1977 | 7 | 16 |
1 | 1977 | 7 | 16 |
2 | 1977 | 7 | 16 |
3 | 1977 | 7 | 16 |
4 | 1977 | 7 | 16 |
… | … | … | … |
35544 | 2002 | 12 | 31 |
35545 | 2002 | 12 | 31 |
35546 | 2002 | 12 | 31 |
35547 | 2002 | 12 | 31 |
35548 | 2002 | 12 | 31 |
35549 rows × 3 columns
Getting rows
We can get the rows at the beginning of the table using the head() method:
PYTHON
surveys.head()
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN |
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN |
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
By default, this method returns the first five rows. We can provide a number inside the parentheses if we need a specific number of rows:
PYTHON
surveys.head(20)
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN |
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN |
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
5 | 6 | 7 | 16 | 1977 | 1 | PF | M | 14.0 | NaN |
6 | 7 | 7 | 16 | 1977 | 2 | PE | F | NaN | NaN |
7 | 8 | 7 | 16 | 1977 | 1 | DM | M | 37.0 | NaN |
8 | 9 | 7 | 16 | 1977 | 1 | DM | F | 34.0 | NaN |
9 | 10 | 7 | 16 | 1977 | 6 | PF | F | 20.0 | NaN |
10 | 11 | 7 | 16 | 1977 | 5 | DS | F | 53.0 | NaN |
11 | 12 | 7 | 16 | 1977 | 7 | DM | M | 38.0 | NaN |
12 | 13 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
13 | 14 | 7 | 16 | 1977 | 8 | DM | NaN | NaN | NaN |
14 | 15 | 7 | 16 | 1977 | 6 | DM | F | 36.0 | NaN |
15 | 16 | 7 | 16 | 1977 | 4 | DM | F | 36.0 | NaN |
16 | 17 | 7 | 16 | 1977 | 3 | DS | F | 48.0 | NaN |
17 | 18 | 7 | 16 | 1977 | 2 | PP | M | 22.0 | NaN |
18 | 19 | 7 | 16 | 1977 | 4 | PF | NaN | NaN | NaN |
19 | 20 | 7 | 17 | 1977 | 11 | DS | F | 48.0 | NaN |
The tail() method is similar, except it returns rows from the end of the table:
PYTHON
surveys.tail()
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 |
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN |
The head() and tail() methods are useful for getting a feel for how our data is structured, but we’ll also want to be able to look at specific rows. As when we selected columns above, we can use square brackets to extract slices from the dataframe. A slice is a subset of the dataframe starting at one row and ending at another. To get a slice, we pass the starting and ending indexes to the square brackets as start:end:
PYTHON
surveys[2:5]
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
There are three things to be aware of when slicing a dataframe:
- Row indexes are zero-based. That is, the first row has an index of 0, not 1.
- When slicing, the slice includes the start index but not the end index. In this case, that means the slice includes rows 2, 3, and 4 but not row 5.
- The row label can be different from the row index. They happen to be the same here, but don’t count on that being true.
Core Python types like list and tuple use the same conventions, as do most libraries that deal with sequences.
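The shared convention means we can check our intuition about slices on a plain list before applying it to a dataframe. A minimal sketch with made-up data:

```python
import pandas as pd

# Zero-based, end-exclusive slicing works the same for lists and dataframes
letters = ["a", "b", "c", "d", "e", "f"]
print(letters[2:5])  # ['c', 'd', 'e'] -- items 2, 3, and 4, but not 5

df = pd.DataFrame({"x": letters})
sliced = df[2:5]
print(len(sliced))   # 3
```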
Getting unique values
Recall that we can use square brackets to return a single column from a dataframe. We can get the list of unique values within that column using the unique() method:
PYTHON
surveys["species_id"].unique()
OUTPUT
array(['NL', 'DM', 'PF', 'PE', 'DS', 'PP', 'SH', 'OT', 'DO', 'OX', 'SS',
'OL', 'RM', nan, 'SA', 'PM', 'AH', 'DX', 'AB', 'CB', 'CM', 'CQ',
'RF', 'PC', 'PG', 'PH', 'PU', 'CV', 'UR', 'UP', 'ZL', 'UL', 'CS',
'SC', 'BA', 'SF', 'RO', 'AS', 'SO', 'PI', 'ST', 'CU', 'SU', 'RX',
'PB', 'PL', 'PX', 'CT', 'US'], dtype=object)
To do the same across multiple columns, we can use the drop_duplicates() method on a copy of the dataframe containing only the columns we’re interested in. Like any well-named method, drop_duplicates() does exactly what the name implies: it returns a copy of the dataframe with all duplicate rows removed.
PYTHON
surveys[["plot_id", "species_id"]].drop_duplicates()
plot_id | species_id | |
---|---|---|
0 | 2 | NL |
1 | 3 | NL |
2 | 2 | DM |
3 | 7 | DM |
4 | 3 | DM |
… | … | … |
35474 | 22 | SF |
35511 | 11 | US |
35519 | 9 | SF |
35527 | 13 | US |
35543 | 15 | US |
612 rows × 2 columns
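The difference between the two approaches can be sketched with a toy dataframe (the column names here are shortened stand-ins for the survey columns):

```python
import pandas as pd

df = pd.DataFrame({
    "plot": [1, 1, 2, 2],
    "species": ["NL", "NL", "NL", "DM"],
})

# unique() works on a single column and returns a NumPy array
print(df["species"].unique())  # ['NL' 'DM']

# drop_duplicates() considers the selected columns together,
# so (1, NL), (2, NL), and (2, DM) are all distinct
pairs = df[["plot", "species"]].drop_duplicates()
print(len(pairs))  # 3
```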
Calculating values
The survey dataset includes two columns, hindfoot_length and weight, that are stored as numbers and represent measurements. We may want to perform calculations using numbers like these in our own data. We can do so using Python’s built-in mathematical operators, including:

- x + y for addition
- x - y for subtraction
- x * y for multiplication
- x / y for division
- x % y for calculating remainders
- x ** y for exponents
To make the examples in this section a little more useful, we’re going to remove all rows that contain null values using the dropna()
method. This will filter out any rows that don’t have a valid hindfoot_length or weight, as well as those that have a null value in any other cell. (This is an inelegant solution to the problem of missing data. We’ll talk about more nuanced solutions later in the lesson.)
PYTHON
surveys_nona = surveys.dropna().copy()
surveys_nona
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
62 | 63 | 8 | 19 | 1977 | 3 | DM | M | 35.0 | 40.0 |
63 | 64 | 8 | 19 | 1977 | 7 | DM | M | 37.0 | 48.0 |
64 | 65 | 8 | 19 | 1977 | 4 | DM | F | 34.0 | 29.0 |
65 | 66 | 8 | 19 | 1977 | 4 | DM | F | 35.0 | 46.0 |
66 | 67 | 8 | 19 | 1977 | 7 | DM | M | 35.0 | 36.0 |
… | … | … | … | … | … | … | … | … | … |
35540 | 35541 | 12 | 31 | 2002 | 15 | PB | F | 24.0 | 31.0 |
35541 | 35542 | 12 | 31 | 2002 | 15 | PB | F | 26.0 | 29.0 |
35542 | 35543 | 12 | 31 | 2002 | 15 | PB | F | 27.0 | 34.0 |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 |
30676 rows × 9 columns
Suppose we want to convert the weight column from grams to milligrams. To do so, we can multiply that column by 1000:
PYTHON
surveys_nona["weight"] * 1000
OUTPUT
62 40000.0
63 48000.0
64 29000.0
65 46000.0
66 36000.0
...
35540 31000.0
35541 29000.0
35542 34000.0
35546 14000.0
35547 51000.0
Name: weight, Length: 30676, dtype: float64
To convert it to kilograms, we can divide by 1000:
PYTHON
surveys_nona["weight"] / 1000
OUTPUT
62 0.040
63 0.048
64 0.029
65 0.046
66 0.036
...
35540 0.031
35541 0.029
35542 0.034
35546 0.014
35547 0.051
Name: weight, Length: 30676, dtype: float64
Note that calculations do not modify the original dataset. If we want to retain the result, we have to assign it to a new column:
PYTHON
surveys_nona["weight_mg"] = surveys_nona["weight"] * 1000
surveys_nona
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | weight_mg | |
---|---|---|---|---|---|---|---|---|---|---|
62 | 63 | 8 | 19 | 1977 | 3 | DM | M | 35.0 | 40.0 | 40000.0 |
63 | 64 | 8 | 19 | 1977 | 7 | DM | M | 37.0 | 48.0 | 48000.0 |
64 | 65 | 8 | 19 | 1977 | 4 | DM | F | 34.0 | 29.0 | 29000.0 |
65 | 66 | 8 | 19 | 1977 | 4 | DM | F | 35.0 | 46.0 | 46000.0 |
66 | 67 | 8 | 19 | 1977 | 7 | DM | M | 35.0 | 36.0 | 36000.0 |
… | … | … | … | … | … | … | … | … | … | … |
35540 | 35541 | 12 | 31 | 2002 | 15 | PB | F | 24.0 | 31.0 | 31000.0 |
35541 | 35542 | 12 | 31 | 2002 | 15 | PB | F | 26.0 | 29.0 | 29000.0 |
35542 | 35543 | 12 | 31 | 2002 | 15 | PB | F | 27.0 | 34.0 | 34000.0 |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 | 14000.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 | 51000.0 |
30676 rows × 10 columns
We can also add, subtract, multiply, and divide columns, as in the following (admittedly nonsensical) calculation, which adds together the hindfoot_length and weight columns:
PYTHON
surveys_nona["hindfoot_length"] + surveys_nona["weight"]
OUTPUT
62 75.0
63 85.0
64 63.0
65 81.0
66 71.0
...
35540 55.0
35541 55.0
35542 61.0
35546 29.0
35547 87.0
Length: 30676, dtype: float64
Renaming columns
The hindfoot_length and weight columns don’t specify a unit, which may get confusing if we want to perform unit conversions like the one above. Fortunately, dataframes allow us to rename existing columns using the rename() method.

The rename() method uses a dictionary (or dict) to map between the old and new column names. As with list above, the dict data type is built into Python, so we don’t need to import anything to use it. A dict maps keys to values. We can create one using curly braces:
PYTHON
dct = {"key1": "val1", "key2": "val2"}
dct
OUTPUT
{'key1': 'val1', 'key2': 'val2'}
Dictionaries are a useful and highly flexible data type. As with list above, we’ll be giving them short shrift here, but you can learn more about them at the Python docs.

Here we’ll use a dict to specify how we want to rename our columns. The keys will be the current column names and the values the new column names. Note that we explicitly assign the result of rename() to the original variable: by default, rename() returns a copy of the original dataframe instead of modifying the original dataframe.
PYTHON
# Create a dict that maps from the old to the new column name
cols = {
    "hindfoot_length": "hindfoot_length_mm",
    "weight": "weight_g",
}

# Assign the result of the rename method back to surveys_nona
surveys_nona = surveys_nona.rename(columns=cols)

# View the dataframe with the new column names
surveys_nona
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length_mm | weight_g | weight_mg | |
---|---|---|---|---|---|---|---|---|---|---|
62 | 63 | 8 | 19 | 1977 | 3 | DM | M | 35.0 | 40.0 | 40000.0 |
63 | 64 | 8 | 19 | 1977 | 7 | DM | M | 37.0 | 48.0 | 48000.0 |
64 | 65 | 8 | 19 | 1977 | 4 | DM | F | 34.0 | 29.0 | 29000.0 |
65 | 66 | 8 | 19 | 1977 | 4 | DM | F | 35.0 | 46.0 | 46000.0 |
66 | 67 | 8 | 19 | 1977 | 7 | DM | M | 35.0 | 36.0 | 36000.0 |
… | … | … | … | … | … | … | … | … | … | … |
35540 | 35541 | 12 | 31 | 2002 | 15 | PB | F | 24.0 | 31.0 | 31000.0 |
35541 | 35542 | 12 | 31 | 2002 | 15 | PB | F | 26.0 | 29.0 | 29000.0 |
35542 | 35543 | 12 | 31 | 2002 | 15 | PB | F | 27.0 | 34.0 | 34000.0 |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 | 14000.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 | 51000.0 |
30676 rows × 10 columns
Challenge
Create a dataframe that returns the year, month, day, species_id and weight in mg.
PYTHON
# Assign the weight in milligrams to the weight_mg column
surveys_nona["weight_mg"] = surveys_nona["weight_g"] * 1000

# Display a copy of surveys with only the desired columns
surveys_nona[["year", "month", "day", "species_id", "weight_mg"]]
year | month | day | species_id | weight_mg | |
---|---|---|---|---|---|
62 | 1977 | 8 | 19 | DM | 40000.0 |
63 | 1977 | 8 | 19 | DM | 48000.0 |
64 | 1977 | 8 | 19 | DM | 29000.0 |
65 | 1977 | 8 | 19 | DM | 46000.0 |
66 | 1977 | 8 | 19 | DM | 36000.0 |
… | … | … | … | … | … |
35540 | 2002 | 12 | 31 | PB | 31000.0 |
35541 | 2002 | 12 | 31 | PB | 29000.0 |
35542 | 2002 | 12 | 31 | PB | 34000.0 |
35546 | 2002 | 12 | 31 | RM | 14000.0 |
35547 | 2002 | 12 | 31 | DO | 51000.0 |
30676 rows × 5 columns
Filtering data
pandas provides a variety of ways to filter a dataframe. For example, suppose we want to look at a specific species in the surveys_nona dataframe. We can view the rows matching a given species using the same square brackets we used above to select specific columns and rows. Here, however, instead of passing a value or list of values, we will pass a conditional expression.
A conditional expression is a statement that evaluates as either True or False. Conditional expressions often make use of inequality operators, for example:

- == for equals
- != for does not equal
- > for greater than
- >= for greater than or equal to
- < for less than
- <= for less than or equal to
Examples of conditional statements include:

- "a" == "b" evaluates False
- "a" != "b" evaluates True
- 3 > 4 evaluates False

Note that, when comparing strings, evaluations are case sensitive:

- "a" == "A" evaluates False
= for assignment, == for equality
Remember that, in Python, a single equal sign is used to assign values to variables. We’ve already used the assignment operator in this lesson, for example, when we created a new column in the dataframe.
To limit the dataframe to rows matching the species “DM”, include the conditional surveys_nona["species_id"] == "DM" inside the square brackets:
PYTHON
surveys_nona[surveys_nona["species_id"] == "DM"]
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length_mm | weight_g | weight_mg | |
---|---|---|---|---|---|---|---|---|---|---|
62 | 63 | 8 | 19 | 1977 | 3 | DM | M | 35.0 | 40.0 | 40000.0 |
63 | 64 | 8 | 19 | 1977 | 7 | DM | M | 37.0 | 48.0 | 48000.0 |
64 | 65 | 8 | 19 | 1977 | 4 | DM | F | 34.0 | 29.0 | 29000.0 |
65 | 66 | 8 | 19 | 1977 | 4 | DM | F | 35.0 | 46.0 | 46000.0 |
66 | 67 | 8 | 19 | 1977 | 7 | DM | M | 35.0 | 36.0 | 36000.0 |
… | … | … | … | … | … | … | … | … | … | … |
35532 | 35533 | 12 | 31 | 2002 | 14 | DM | F | 36.0 | 48.0 | 48000.0 |
35533 | 35534 | 12 | 31 | 2002 | 14 | DM | M | 37.0 | 56.0 | 56000.0 |
35534 | 35535 | 12 | 31 | 2002 | 14 | DM | M | 37.0 | 53.0 | 53000.0 |
35535 | 35536 | 12 | 31 | 2002 | 14 | DM | F | 35.0 | 42.0 | 42000.0 |
35536 | 35537 | 12 | 31 | 2002 | 14 | DM | F | 36.0 | 46.0 | 46000.0 |
9727 rows × 10 columns
To limit our results to observations made in or after 2000, use:
PYTHON
surveys_nona[surveys_nona["year"] >= 2000]
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length_mm | weight_g | weight_mg | |
---|---|---|---|---|---|---|---|---|---|---|
30158 | 30159 | 1 | 8 | 2000 | 1 | PP | F | 22.0 | 17.0 | 17000.0 |
30159 | 30160 | 1 | 8 | 2000 | 1 | DO | M | 35.0 | 53.0 | 53000.0 |
30160 | 30161 | 1 | 8 | 2000 | 1 | PP | F | 21.0 | 17.0 | 17000.0 |
30161 | 30162 | 1 | 8 | 2000 | 1 | DM | M | 36.0 | 50.0 | 50000.0 |
30162 | 30163 | 1 | 8 | 2000 | 1 | PP | M | 20.0 | 16.0 | 16000.0 |
… | … | … | … | … | … | … | … | … | … | … |
35540 | 35541 | 12 | 31 | 2002 | 15 | PB | F | 24.0 | 31.0 | 31000.0 |
35541 | 35542 | 12 | 31 | 2002 | 15 | PB | F | 26.0 | 29.0 | 29000.0 |
35542 | 35543 | 12 | 31 | 2002 | 15 | PB | F | 27.0 | 34.0 | 34000.0 |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 | 14000.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 | 51000.0 |
4947 rows × 10 columns
As with rename() above, each filtering operation returns a copy of the dataframe. We will look at how to make changes to the original dataframe at the end of this lesson.
Building more complex queries
We can combine conditionals using what are called bitwise operators:

- &: True if the conditions on both sides of the operator are True (and)
- |: True if a condition on either side is True (or)
To return all observations of DM in or after 2000, we can combine the two conditionals we used previously. Note that, when joining conditionals using & or |, we must wrap each individual condition in parentheses. If we omit the parentheses, pandas will not perform the comparisons in the expected order.
PYTHON
surveys_nona[(surveys_nona["species_id"] == "DM") & (surveys_nona["year"] >= 2000)]
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length_mm | weight_g | weight_mg | |
---|---|---|---|---|---|---|---|---|---|---|
30161 | 30162 | 1 | 8 | 2000 | 1 | DM | M | 36.0 | 50.0 | 50000.0 |
30178 | 30179 | 1 | 8 | 2000 | 12 | DM | M | 36.0 | 60.0 | 60000.0 |
30195 | 30196 | 1 | 8 | 2000 | 17 | DM | M | 37.0 | 52.0 | 52000.0 |
30196 | 30197 | 1 | 8 | 2000 | 17 | DM | F | 34.0 | 43.0 | 43000.0 |
30209 | 30210 | 1 | 8 | 2000 | 22 | DM | M | 38.0 | 56.0 | 56000.0 |
… | … | … | … | … | … | … | … | … | … | … |
35532 | 35533 | 12 | 31 | 2002 | 14 | DM | F | 36.0 | 48.0 | 48000.0 |
35533 | 35534 | 12 | 31 | 2002 | 14 | DM | M | 37.0 | 56.0 | 56000.0 |
35534 | 35535 | 12 | 31 | 2002 | 14 | DM | M | 37.0 | 53.0 | 53000.0 |
35535 | 35536 | 12 | 31 | 2002 | 14 | DM | F | 35.0 | 42.0 | 42000.0 |
35536 | 35537 | 12 | 31 | 2002 | 14 | DM | F | 36.0 | 46.0 | 46000.0 |
823 rows × 10 columns
Some column methods can also be used for filtering. One example is isin(), which is used to match a list of values. This method can be combined with other conditionals as above. The example below returns rows from 2000 or later with either “DM”, “DO”, or “DS” in the species_id column:
PYTHON
surveys_nona[
    surveys_nona["species_id"].isin(["DM", "DO", "DS"]) & (surveys_nona["year"] >= 2000)
]
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length_mm | weight_g | weight_mg | |
---|---|---|---|---|---|---|---|---|---|---|
30159 | 30160 | 1 | 8 | 2000 | 1 | DO | M | 35.0 | 53.0 | 53000.0 |
30161 | 30162 | 1 | 8 | 2000 | 1 | DM | M | 36.0 | 50.0 | 50000.0 |
30166 | 30167 | 1 | 8 | 2000 | 1 | DO | M | 35.0 | 41.0 | 41000.0 |
30170 | 30171 | 1 | 8 | 2000 | 2 | DO | M | 36.0 | 52.0 | 52000.0 |
30172 | 30173 | 1 | 8 | 2000 | 2 | DO | F | 35.0 | 54.0 | 54000.0 |
… | … | … | … | … | … | … | … | … | … | … |
35533 | 35534 | 12 | 31 | 2002 | 14 | DM | M | 37.0 | 56.0 | 56000.0 |
35534 | 35535 | 12 | 31 | 2002 | 14 | DM | M | 37.0 | 53.0 | 53000.0 |
35535 | 35536 | 12 | 31 | 2002 | 14 | DM | F | 35.0 | 42.0 | 42000.0 |
35536 | 35537 | 12 | 31 | 2002 | 14 | DM | F | 36.0 | 46.0 | 46000.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 | 51000.0 |
1229 rows × 10 columns
Sorting data
We can sort a dataframe using the sort_values() method. For this example, we’ll work from the subset defined above. First we need to assign that subset to a variable:
PYTHON
results = surveys_nona[
    surveys_nona["species_id"].isin(["DM", "DO", "DS"]) & (surveys_nona["year"] >= 2000)
]
Now we’ll sort the results by weight_g. To do so, pass that column name as an argument (that is, inside the trailing parentheses) to the sort_values() method:
PYTHON
results.sort_values("weight_g")
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length_mm | weight_g | weight_mg | |
---|---|---|---|---|---|---|---|---|---|---|
30614 | 30615 | 4 | 30 | 2000 | 12 | DM | M | 34.0 | 18.0 | 18000.0 |
34790 | 34791 | 10 | 5 | 2002 | 17 | DM | M | 34.0 | 20.0 | 20000.0 |
32650 | 32651 | 9 | 22 | 2001 | 17 | DM | F | 33.0 | 21.0 | 21000.0 |
32782 | 32783 | 9 | 23 | 2001 | 14 | DM | M | 34.0 | 21.0 | 21000.0 |
32968 | 32969 | 10 | 14 | 2001 | 9 | DM | M | 32.0 | 22.0 | 22000.0 |
… | … | … | … | … | … | … | … | … | … | … |
35159 | 35160 | 11 | 10 | 2002 | 13 | DO | F | 36.0 | 65.0 | 65000.0 |
30289 | 30290 | 2 | 5 | 2000 | 1 | DO | F | 36.0 | 65.0 | 65000.0 |
31942 | 31943 | 4 | 21 | 2001 | 1 | DO | F | 37.0 | 68.0 | 68000.0 |
32041 | 32042 | 5 | 26 | 2001 | 1 | DO | F | 37.0 | 68.0 | 68000.0 |
31955 | 31956 | 4 | 21 | 2001 | 24 | DO | F | 36.0 | 76.0 | 76000.0 |
1229 rows × 10 columns
By default, rows are sorted in ascending order (smallest to largest). We can modify this behavior using the ascending keyword argument:
PYTHON
results.sort_values("weight_g", ascending=False)
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length_mm | weight_g | weight_mg | |
---|---|---|---|---|---|---|---|---|---|---|
31955 | 31956 | 4 | 21 | 2001 | 24 | DO | F | 36.0 | 76.0 | 76000.0 |
32041 | 32042 | 5 | 26 | 2001 | 1 | DO | F | 37.0 | 68.0 | 68000.0 |
31942 | 31943 | 4 | 21 | 2001 | 1 | DO | F | 37.0 | 68.0 | 68000.0 |
35159 | 35160 | 11 | 10 | 2002 | 13 | DO | F | 36.0 | 65.0 | 65000.0 |
30289 | 30290 | 2 | 5 | 2000 | 1 | DO | F | 36.0 | 65.0 | 65000.0 |
… | … | … | … | … | … | … | … | … | … | … |
32957 | 32958 | 10 | 14 | 2001 | 9 | DM | F | 33.0 | 22.0 | 22000.0 |
32650 | 32651 | 9 | 22 | 2001 | 17 | DM | F | 33.0 | 21.0 | 21000.0 |
32782 | 32783 | 9 | 23 | 2001 | 14 | DM | M | 34.0 | 21.0 | 21000.0 |
34790 | 34791 | 10 | 5 | 2002 | 17 | DM | M | 34.0 | 20.0 | 20000.0 |
30614 | 30615 | 4 | 30 | 2000 | 12 | DM | M | 34.0 | 18.0 | 18000.0 |
1229 rows × 10 columns
We can sort on multiple fields at once by passing a list of column names. We can control how each column sorts by passing a list with the same number of values (that is, one value per column) to the ascending keyword. The cell below sorts the results first by species_id (largest to smallest), then by weight (smallest to largest):
PYTHON
results.sort_values(["species_id", "weight_g"], ascending=[False, True])
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length_mm | weight_g | weight_mg | |
---|---|---|---|---|---|---|---|---|---|---|
30687 | 30688 | 5 | 1 | 2000 | 8 | DO | M | 35.0 | 23.0 | 23000.0 |
30768 | 30769 | 6 | 3 | 2000 | 17 | DO | M | 35.0 | 24.0 | 24000.0 |
33339 | 33340 | 1 | 12 | 2002 | 12 | DO | M | 34.0 | 24.0 | 24000.0 |
34082 | 34083 | 5 | 16 | 2002 | 6 | DO | F | 33.0 | 24.0 | 24000.0 |
34096 | 34097 | 5 | 16 | 2002 | 6 | DO | M | 35.0 | 24.0 | 24000.0 |
… | … | … | … | … | … | … | … | … | … | … |
31858 | 31859 | 3 | 24 | 2001 | 2 | DM | F | 36.0 | 60.0 | 60000.0 |
32909 | 32910 | 10 | 14 | 2001 | 4 | DM | F | 37.0 | 60.0 | 60000.0 |
34604 | 34605 | 7 | 14 | 2002 | 9 | DM | F | 36.0 | 60.0 | 60000.0 |
34675 | 34676 | 9 | 8 | 2002 | 12 | DM | F | 36.0 | 60.0 | 60000.0 |
34126 | 34127 | 5 | 16 | 2002 | 9 | DM | F | 35.0 | 64.0 | 64000.0 |
1229 rows × 10 columns
As with the dataframe methods above, sort_values() returns a copy of the original dataframe and leaves the original untouched.
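This copy behavior can be confirmed on a small, made-up dataframe:

```python
import pandas as pd

df = pd.DataFrame({"weight": [48.0, 18.0, 40.0]})

# sort_values() returns a sorted copy; the original order is preserved
sorted_df = df.sort_values("weight")
print(list(sorted_df["weight"]))  # [18.0, 40.0, 48.0]
print(list(df["weight"]))         # [48.0, 18.0, 40.0]
```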
Challenge
Write a query that returns year, species_id, and weight in kg from the surveys_nona table, sorted with the largest weights at the top.
PYTHON
# Create a new column with weight in kg
surveys_nona["weight_kg"] = surveys_nona["weight_g"] / 1000

# Create a subset containing only year, species_id, and weight_kg
subset = surveys_nona[["year", "species_id", "weight_kg"]]

# Sort the subset by weight_kg
subset.sort_values("weight_kg", ascending=False)
year | species_id | weight_kg | |
---|---|---|---|
33048 | 2001 | NL | 0.280 |
12870 | 1987 | NL | 0.278 |
15458 | 1989 | NL | 0.275 |
2132 | 1979 | NL | 0.274 |
12728 | 1987 | NL | 0.270 |
… | … | … | … |
29905 | 1999 | PP | 0.004 |
9908 | 1985 | RM | 0.004 |
10438 | 1985 | RM | 0.004 |
8735 | 1983 | RM | 0.004 |
7083 | 1982 | PF | 0.004 |
30676 rows × 3 columns
Modifying data
We’ve already shown how to modify an existing dataframe by adding a new column. What if we want to modify existing cells instead? As we’ve seen, this can be a little tricky in pandas because most of its methods return a copy of the original dataframe. For example, we can get subsets of a dataframe using square brackets. The cell below returns the species_id column for rows 2 through 5:
PYTHON
surveys[2:6]["species_id"]
OUTPUT
2 DM
3 DM
4 DM
5 PF
Name: species_id, dtype: object
But trying to set new values using this syntax may not work as expected. Say we want to set the species_id column to a new value, “FD”. Try running the code in the cell below:
PYTHON
surveys[2:6]["species_id"] = "FD"
OUTPUT
C:\...\1234567890.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
surveys[2:6]["species_id"] = "FD"
You should have received a SettingWithCopyWarning after running that cell. This warning tells us that the data in the original dataframe has not been modified. This is because the square brackets returned a copy, meaning that any changes will be reflected in the copy, not the original. We can verify that the original dataframe has not been changed by displaying the rows that would have been modified:
PYTHON
surveys[2:6]
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
5 | 6 | 7 | 16 | 1977 | 1 | PF | M | 14.0 | NaN |
Using loc to modify existing cells
One way to modify existing data in pandas is to use the loc
attribute. This attribute allows you to extract and modify cells in a DataFrame using the following syntax: df.loc[row_indexer, col_indexer]
.
The row_indexer argument is used to select one or more rows. It can be:

- A row label (i.e., the bold column on the far left). For example, 0 returns the row with label 0.
- A slice including multiple rows:
  - : returns all rows
  - :2 returns all rows from the beginning of the dataframe to the row labeled 2, inclusive
  - 2: returns all rows from the row labeled 2 to the end of the dataframe, inclusive
  - 2:5 returns all rows between those labeled 2 and 5, inclusive
- A conditional, as in the examples above.
The col_indexer argument is used to select one or more columns. It will typically be a list of column names and can be omitted, in which case all columns will be returned.
Row labels and indexes
The row labels in this dataframe happen to be numeric and aligned exactly to the row’s index (so for the first row both the index and the label are 0, for the second row both the index and the label are 1, etc.) This is often but not always true in pandas. For example, if we used record_id as the label, the row labels would be one-based and the row indexes would be zero-based.
loc slicing behavior
Slices using loc are inclusive: rows matching both the start and end values are included in the returned slice. This differs from list slices, where the start but not the end value is included in the slice. loc works this way because it looks at the row label, not the row index.
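The two slicing conventions can be compared side by side using a toy dataframe whose labels happen to match its positions:

```python
import pandas as pd

df = pd.DataFrame({"x": list("abcdef")})  # row labels 0-5 match positions

# Square brackets slice by position: the end is excluded (3 rows)
print(len(df[2:5]))      # 3

# loc slices by label: the end is included (4 rows)
print(len(df.loc[2:5]))  # 4
```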
We’ll be making some changes to our data, so let’s work from a copy instead of modifying the original. Create a copy using the copy() method:
PYTHON
surveys_copy = surveys.copy()
To select a subset of rows and columns using loc, use:
PYTHON
surveys_copy.loc[2:5, "species_id"]
OUTPUT
2 DM
3 DM
4 DM
5 PF
Name: species_id, dtype: object
Unlike the methods earlier in the lesson, this is a view, not a copy, of the data in the surveys_copy dataframe. That means that the object returned by loc is live and can be used to change the original dataframe. We can now assign a new value to the species_id column in the matching rows of the original dataframe:
PYTHON
surveys_copy.loc[2:5, "species_id"] = "FD"
We can see that these changes are reflected in the original surveys_copy object:
PYTHON
surveys_copy.loc[1:6, "species_id"]
OUTPUT
1 NL
2 FD
3 FD
4 FD
5 FD
6 PE
Name: species_id, dtype: object
Slicing with iloc
pandas provides another indexer, iloc
, that allows us to select and modify data using row and column indexes instead of labels. Learn more in the pandas documentation.
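A quick sketch of the difference between loc and iloc, using a made-up dataframe whose labels deliberately do not match its positions:

```python
import pandas as pd

# Labels are one-based here, so label and position disagree
df = pd.DataFrame({"species": ["NL", "DM", "PF"]}, index=[1, 2, 3])

print(df.loc[1, "species"])   # 'NL' -- row with *label* 1
print(df.iloc[1]["species"])  # 'DM' -- row at *position* 1
```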
Key Points
- Use square brackets to access rows, columns, and specific cells
- Use operators like +, -, and / to perform arithmetic on rows and columns
- Store the results of calculations in a dataframe by adding a new column or overwriting an existing column
- Sort data, rename columns, and get unique values in a dataframe using methods provided by pandas
- By default, most dataframe operations return a copy of the original data