Accessing Data in a Dataframe
Last updated on 2022-07-18
Estimated time: 60 minutes
Overview
Questions
- How can we look at individual rows and columns in a dataframe?
- How can we perform calculations?
- How can we modify the table and data?
Objectives
- Access individual rows and columns
- Access multiple columns at once using a list
- Perform calculations like addition and subtraction
- Rename columns using a dictionary
- Access rows containing specific data
- Sort the data returned by a query
- Modify data using loc
We’ll begin by importing pandas and reading our CSV, as we did in the previous lesson:
PYTHON
import pandas as pd
surveys = pd.read_csv("data/surveys.csv")
surveys
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN |
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN |
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
… | … | … | … | … | … | … | … | … | … |
35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 |
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN |
35549 rows × 9 columns
We will now look at how to access rows and columns in the dataframe.
Getting columns
We can get the values from a single column by passing a string inside square brackets to the dataframe object. For example, to look at the year column, use:
PYTHON
surveys["year"]
OUTPUT
0 1977
1 1977
2 1977
3 1977
4 1977
...
35544 2002
35545 2002
35546 2002
35547 2002
35548 2002
Name: year, Length: 35549, dtype: int64
A single column is returned as a Series, which is why the output for this cell is formatted differently than the other cells in this lesson. The main difference between a Series and a DataFrame is that a DataFrame can have multiple columns. The two classes share many but not all attributes and methods. Note also that this is a copy of the data from the original dataframe: changing values in the series will have no effect on the original dataframe.
Using lists to get more than one column at a time
Python uses the built-in list data type to store sequences, that is, an ordered list of objects. Any type of Python object can be stored in a list: strings, integers, floats, even other lists or collections. Once created, a list can be modified by appending, inserting, or deleting items. We will not be going into detail about these operations here, but as always you can learn more about how to use a list using help() or the Python docs.
Create a list using square brackets. Let’s create a list of the three columns in our dataframe that together give the date of an observation:
PYTHON
cols = ["year", "month", "day"]
When we pass this list to the survey dataframe using square brackets, as we did above, we retrieve a copy of the dataframe containing just those columns. Note that, when we get more than one column, pandas returns a dataframe:
PYTHON
surveys[cols]
year | month | day | |
---|---|---|---|
0 | 1977 | 7 | 16 |
1 | 1977 | 7 | 16 |
2 | 1977 | 7 | 16 |
3 | 1977 | 7 | 16 |
4 | 1977 | 7 | 16 |
… | … | … | … |
35544 | 2002 | 12 | 31 |
35545 | 2002 | 12 | 31 |
35546 | 2002 | 12 | 31 |
35547 | 2002 | 12 | 31 |
35548 | 2002 | 12 | 31 |
35549 rows × 3 columns
Getting rows
We can get the rows at the beginning of the table using the head() method:
PYTHON
surveys.head()
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN |
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN |
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
By default, this method returns the first five rows. We can provide a number inside the parentheses if we need a specific number of rows:
PYTHON
surveys.head(20)
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN |
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN |
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
5 | 6 | 7 | 16 | 1977 | 1 | PF | M | 14.0 | NaN |
6 | 7 | 7 | 16 | 1977 | 2 | PE | F | NaN | NaN |
7 | 8 | 7 | 16 | 1977 | 1 | DM | M | 37.0 | NaN |
8 | 9 | 7 | 16 | 1977 | 1 | DM | F | 34.0 | NaN |
9 | 10 | 7 | 16 | 1977 | 6 | PF | F | 20.0 | NaN |
10 | 11 | 7 | 16 | 1977 | 5 | DS | F | 53.0 | NaN |
11 | 12 | 7 | 16 | 1977 | 7 | DM | M | 38.0 | NaN |
12 | 13 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
13 | 14 | 7 | 16 | 1977 | 8 | DM | NaN | NaN | NaN |
14 | 15 | 7 | 16 | 1977 | 6 | DM | F | 36.0 | NaN |
15 | 16 | 7 | 16 | 1977 | 4 | DM | F | 36.0 | NaN |
16 | 17 | 7 | 16 | 1977 | 3 | DS | F | 48.0 | NaN |
17 | 18 | 7 | 16 | 1977 | 2 | PP | M | 22.0 | NaN |
18 | 19 | 7 | 16 | 1977 | 4 | PF | NaN | NaN | NaN |
19 | 20 | 7 | 17 | 1977 | 11 | DS | F | 48.0 | NaN |
The tail() method is similar, except it returns rows from the end of the table:
PYTHON
surveys.tail()
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 |
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN |
The head() and tail() methods are useful for getting a feel for how our data is structured, but we’ll also want to be able to look at specific rows. As when we selected columns above, we can use square brackets to extract slices from the dataframe. A slice is a subset of the dataframe starting at one row and ending at another. To get a slice, we pass the starting and ending indexes to the square brackets as start:end:
PYTHON
surveys[2:5]
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
There are three things to be aware of when slicing a dataframe:
- Row indexes are zero-based. That is, the first row has an index of 0, not 1.
- When slicing, the slice includes the start index but not the end index. In this case, that means the slice includes rows 2, 3, and 4 but not row 5.
- The row label can be different from the row index. They happen to be the same here, but don’t count on that being true.
Core Python types like list and tuple use the same conventions, as do most libraries that deal with sequences.
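The shared convention means we can check our intuition about slices on a plain list before applying it to a dataframe. A minimal sketch with made-up data:

```python
import pandas as pd

# Zero-based, end-exclusive slicing works the same for lists and dataframes
letters = ["a", "b", "c", "d", "e", "f"]
print(letters[2:5])  # ['c', 'd', 'e'] -- items 2, 3, and 4, but not 5

df = pd.DataFrame({"x": letters})
sliced = df[2:5]
print(len(sliced))   # 3
```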
Getting unique values
Recall that we can use square brackets to return a single column from a dataframe. We can get the list of unique values within that column using the unique() method:
PYTHON
surveys["species_id"].unique()
OUTPUT
array(['NL', 'DM', 'PF', 'PE', 'DS', 'PP', 'SH', 'OT', 'DO', 'OX', 'SS',
'OL', 'RM', nan, 'SA', 'PM', 'AH', 'DX', 'AB', 'CB', 'CM', 'CQ',
'RF', 'PC', 'PG', 'PH', 'PU', 'CV', 'UR', 'UP', 'ZL', 'UL', 'CS',
'SC', 'BA', 'SF', 'RO', 'AS', 'SO', 'PI', 'ST', 'CU', 'SU', 'RX',
'PB', 'PL', 'PX', 'CT', 'US'], dtype=object)
To do the same across multiple columns, we can use the drop_duplicates() method on a copy of the dataframe containing only the columns we’re interested in. Like any well-named method, drop_duplicates() does exactly what the name implies: it returns a copy of the dataframe with all duplicate rows removed.
PYTHON
surveys[["plot_id", "species_id"]].drop_duplicates()
plot_id | species_id | |
---|---|---|
0 | 2 | NL |
1 | 3 | NL |
2 | 2 | DM |
3 | 7 | DM |
4 | 3 | DM |
… | … | … |
35474 | 22 | SF |
35511 | 11 | US |
35519 | 9 | SF |
35527 | 13 | US |
35543 | 15 | US |
612 rows × 2 columns
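The difference between the two approaches can be sketched with a toy dataframe (the column names here are shortened stand-ins for the survey columns):

```python
import pandas as pd

df = pd.DataFrame({
    "plot": [1, 1, 2, 2],
    "species": ["NL", "NL", "NL", "DM"],
})

# unique() works on a single column and returns a NumPy array
print(df["species"].unique())  # ['NL' 'DM']

# drop_duplicates() considers the selected columns together,
# so (1, NL), (2, NL), and (2, DM) are all distinct
pairs = df[["plot", "species"]].drop_duplicates()
print(len(pairs))  # 3
```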
Calculating values
The survey dataset includes two columns, hindfoot_length and weight, that are stored as numbers and represent measurements. We may want to perform calculations using numbers like these in our own data. We can do so using Python’s built-in mathematical operators, including:

- x + y for addition
- x - y for subtraction
- x * y for multiplication
- x / y for division
- x % y for calculating remainders
- x ** y for exponents
To make the examples in this section a little more useful, we’re going to remove all rows that contain null values using the dropna()
method. This will filter out any rows that don’t have a valid hindfoot_length or weight, as well as those that have a null value in any other cell. (This is an inelegant solution to the problem of missing data. We’ll talk about more nuanced solutions later in the lesson.)
PYTHON
surveys_nona = surveys.dropna().copy()
surveys_nona
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
62 | 63 | 8 | 19 | 1977 | 3 | DM | M | 35.0 | 40.0 |
63 | 64 | 8 | 19 | 1977 | 7 | DM | M | 37.0 | 48.0 |
64 | 65 | 8 | 19 | 1977 | 4 | DM | F | 34.0 | 29.0 |
65 | 66 | 8 | 19 | 1977 | 4 | DM | F | 35.0 | 46.0 |
66 | 67 | 8 | 19 | 1977 | 7 | DM | M | 35.0 | 36.0 |
… | … | … | … | … | … | … | … | … | … |
35540 | 35541 | 12 | 31 | 2002 | 15 | PB | F | 24.0 | 31.0 |
35541 | 35542 | 12 | 31 | 2002 | 15 | PB | F | 26.0 | 29.0 |
35542 | 35543 | 12 | 31 | 2002 | 15 | PB | F | 27.0 | 34.0 |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 |
30676 rows × 9 columns
Suppose we want to convert the weight column from grams to milligrams. To do so, we can multiply that column by 1000:
PYTHON
surveys_nona["weight"] * 1000
OUTPUT
62 40000.0
63 48000.0
64 29000.0
65 46000.0
66 36000.0
...
35540 31000.0
35541 29000.0
35542 34000.0
35546 14000.0
35547 51000.0
Name: weight, Length: 30676, dtype: float64
To convert it to kilograms, we can divide by 1000:
PYTHON
surveys_nona["weight"] / 1000
OUTPUT
62 0.040
63 0.048
64 0.029
65 0.046
66 0.036
...
35540 0.031
35541 0.029
35542 0.034
35546 0.014
35547 0.051
Name: weight, Length: 30676, dtype: float64
Note that calculations do not modify the original dataset. If we want to retain the result, we have to assign it to a new column:
PYTHON
surveys_nona["weight_mg"] = surveys_nona["weight"] * 1000
surveys_nona
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | weight_mg | |
---|---|---|---|---|---|---|---|---|---|---|
62 | 63 | 8 | 19 | 1977 | 3 | DM | M | 35.0 | 40.0 | 40000.0 |
63 | 64 | 8 | 19 | 1977 | 7 | DM | M | 37.0 | 48.0 | 48000.0 |
64 | 65 | 8 | 19 | 1977 | 4 | DM | F | 34.0 | 29.0 | 29000.0 |
65 | 66 | 8 | 19 | 1977 | 4 | DM | F | 35.0 | 46.0 | 46000.0 |
66 | 67 | 8 | 19 | 1977 | 7 | DM | M | 35.0 | 36.0 | 36000.0 |
… | … | … | … | … | … | … | … | … | … | … |
35540 | 35541 | 12 | 31 | 2002 | 15 | PB | F | 24.0 | 31.0 | 31000.0 |
35541 | 35542 | 12 | 31 | 2002 | 15 | PB | F | 26.0 | 29.0 | 29000.0 |
35542 | 35543 | 12 | 31 | 2002 | 15 | PB | F | 27.0 | 34.0 | 34000.0 |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 | 14000.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 | 51000.0 |
30676 rows × 10 columns
We can also add, subtract, multiply, and divide columns, as in the following (admittedly nonsensical) calculation, which adds together the hindfoot_length and weight columns:
PYTHON
surveys_nona["hindfoot_length"] + surveys_nona["weight"]
OUTPUT
62 75.0
63 85.0
64 63.0
65 81.0
66 71.0
...
35540 55.0
35541 55.0
35542 61.0
35546 29.0
35547 87.0
Length: 30676, dtype: float64
Renaming columns
The hindfoot_length and weight columns don’t specify a unit, which may get confusing if we want to perform unit conversions like the one above. Fortunately, dataframes allow us to rename existing columns using the rename() method.

The rename() method uses a dictionary (or dict) to map between the old and new column names. As with list above, the dict data type is built into Python, so we don’t need to import anything to use it. A dict maps keys to values. We can create one using curly braces:
PYTHON
dct = {"key1": "val1", "key2": "val2"}
dct
OUTPUT
{'key1': 'val1', 'key2': 'val2'}
Dictionaries are a useful and highly flexible data type. As with list above, we’ll be giving them short shrift here, but you can learn more about them at the Python docs.

Here we’ll use a dict to specify how we want to rename our columns. The keys will be the current column names and the values the new column names. Note that we explicitly assign the result of rename() to the original variable: by default, rename() returns a copy of the original dataframe instead of modifying the original dataframe.
PYTHON
# Create a dict that maps from the old to the new column name
cols = {
    "hindfoot_length": "hindfoot_length_mm",
    "weight": "weight_g",
}

# Assign the result of the rename method back to surveys_nona
surveys_nona = surveys_nona.rename(columns=cols)

# View the dataframe with the new column names
surveys_nona
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length_mm | weight_g | weight_mg | |
---|---|---|---|---|---|---|---|---|---|---|
62 | 63 | 8 | 19 | 1977 | 3 | DM | M | 35.0 | 40.0 | 40000.0 |
63 | 64 | 8 | 19 | 1977 | 7 | DM | M | 37.0 | 48.0 | 48000.0 |
64 | 65 | 8 | 19 | 1977 | 4 | DM | F | 34.0 | 29.0 | 29000.0 |
65 | 66 | 8 | 19 | 1977 | 4 | DM | F | 35.0 | 46.0 | 46000.0 |
66 | 67 | 8 | 19 | 1977 | 7 | DM | M | 35.0 | 36.0 | 36000.0 |
… | … | … | … | … | … | … | … | … | … | … |
35540 | 35541 | 12 | 31 | 2002 | 15 | PB | F | 24.0 | 31.0 | 31000.0 |
35541 | 35542 | 12 | 31 | 2002 | 15 | PB | F | 26.0 | 29.0 | 29000.0 |
35542 | 35543 | 12 | 31 | 2002 | 15 | PB | F | 27.0 | 34.0 | 34000.0 |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 | 14000.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 | 51000.0 |
30676 rows × 10 columns
Challenge
Create a dataframe that returns the year, month, day, species_id and weight in mg.
PYTHON
# Assign the weight in milligrams to the weight_mg column
surveys_nona["weight_mg"] = surveys_nona["weight_g"] * 1000

# Display a copy of surveys with only the desired columns
surveys_nona[["year", "month", "day", "species_id", "weight_mg"]]
year | month | day | species_id | weight_mg | |
---|---|---|---|---|---|
62 | 1977 | 8 | 19 | DM | 40000.0 |
63 | 1977 | 8 | 19 | DM | 48000.0 |
64 | 1977 | 8 | 19 | DM | 29000.0 |
65 | 1977 | 8 | 19 | DM | 46000.0 |
66 | 1977 | 8 | 19 | DM | 36000.0 |
… | … | … | … | … | … |
35540 | 2002 | 12 | 31 | PB | 31000.0 |
35541 | 2002 | 12 | 31 | PB | 29000.0 |
35542 | 2002 | 12 | 31 | PB | 34000.0 |
35546 | 2002 | 12 | 31 | RM | 14000.0 |
35547 | 2002 | 12 | 31 | DO | 51000.0 |
30676 rows × 5 columns
Filtering data
pandas provides a variety of ways to filter a dataframe. For example, suppose we want to look at a specific species in the surveys_nona dataframe. We can view the rows matching a given species using the same square brackets we used above to select specific columns and rows. Here, however, instead of passing a value or list of values, we will pass a conditional expression.
A conditional expression is a statement that evaluates as either True or False. Conditional expressions often make use of inequality operators, for example:

- == for equals
- != for does not equal
- > for greater than
- >= for greater than or equal to
- < for less than
- <= for less than or equal to
Examples of conditional statements include:

- "a" == "b" evaluates False
- "a" != "b" evaluates True
- 3 > 4 evaluates False

Note that, when comparing strings, evaluations are case sensitive:

- "a" == "A" evaluates False
= for assignment, == for equality
Remember that, in Python, a single equal sign is used to assign values to variables. We’ve already used the assignment operator in this lesson, for example, when we created a new column in the dataframe.
To limit the dataframe to rows matching the species “DM”, include the conditional surveys_nona["species_id"] == "DM" inside the square brackets:
PYTHON
surveys_nona[surveys_nona["species_id"] == "DM"]
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length_mm | weight_g | weight_mg | |
---|---|---|---|---|---|---|---|---|---|---|
62 | 63 | 8 | 19 | 1977 | 3 | DM | M | 35.0 | 40.0 | 40000.0 |
63 | 64 | 8 | 19 | 1977 | 7 | DM | M | 37.0 | 48.0 | 48000.0 |
64 | 65 | 8 | 19 | 1977 | 4 | DM | F | 34.0 | 29.0 | 29000.0 |
65 | 66 | 8 | 19 | 1977 | 4 | DM | F | 35.0 | 46.0 | 46000.0 |
66 | 67 | 8 | 19 | 1977 | 7 | DM | M | 35.0 | 36.0 | 36000.0 |
… | … | … | … | … | … | … | … | … | … | … |
35532 | 35533 | 12 | 31 | 2002 | 14 | DM | F | 36.0 | 48.0 | 48000.0 |
35533 | 35534 | 12 | 31 | 2002 | 14 | DM | M | 37.0 | 56.0 | 56000.0 |
35534 | 35535 | 12 | 31 | 2002 | 14 | DM | M | 37.0 | 53.0 | 53000.0 |
35535 | 35536 | 12 | 31 | 2002 | 14 | DM | F | 35.0 | 42.0 | 42000.0 |
35536 | 35537 | 12 | 31 | 2002 | 14 | DM | F | 36.0 | 46.0 | 46000.0 |
9727 rows × 10 columns
To limit our results to observations made in or after 2000, use:
PYTHON
surveys_nona[surveys_nona["year"] >= 2000]
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length_mm | weight_g | weight_mg | |
---|---|---|---|---|---|---|---|---|---|---|
30158 | 30159 | 1 | 8 | 2000 | 1 | PP | F | 22.0 | 17.0 | 17000.0 |
30159 | 30160 | 1 | 8 | 2000 | 1 | DO | M | 35.0 | 53.0 | 53000.0 |
30160 | 30161 | 1 | 8 | 2000 | 1 | PP | F | 21.0 | 17.0 | 17000.0 |
30161 | 30162 | 1 | 8 | 2000 | 1 | DM | M | 36.0 | 50.0 | 50000.0 |
30162 | 30163 | 1 | 8 | 2000 | 1 | PP | M | 20.0 | 16.0 | 16000.0 |
… | … | … | … | … | … | … | … | … | … | … |
35540 | 35541 | 12 | 31 | 2002 | 15 | PB | F | 24.0 | 31.0 | 31000.0 |
35541 | 35542 | 12 | 31 | 2002 | 15 | PB | F | 26.0 | 29.0 | 29000.0 |
35542 | 35543 | 12 | 31 | 2002 | 15 | PB | F | 27.0 | 34.0 | 34000.0 |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 | 14000.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 | 51000.0 |
4947 rows × 10 columns
As with rename() above, each filtering operation returns a copy of the dataframe. We will look at how to make changes to the original dataframe at the end of this lesson.
Building more complex queries
We can combine conditionals using what are called bitwise operators:

- &: True if the conditions on both sides of the operator are True (and)
- |: True if a condition on either side is True (or)
To return all observations of DM in or after 2000, we can combine the two conditionals we used previously. Note that, when joining conditionals using & or |, we must wrap each individual condition in parentheses. If we omit the parentheses, pandas will not perform the comparisons in the expected order.
PYTHON
surveys_nona[(surveys_nona["species_id"] == "DM") & (surveys_nona["year"] >= 2000)]
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length_mm | weight_g | weight_mg | |
---|---|---|---|---|---|---|---|---|---|---|
30161 | 30162 | 1 | 8 | 2000 | 1 | DM | M | 36.0 | 50.0 | 50000.0 |
30178 | 30179 | 1 | 8 | 2000 | 12 | DM | M | 36.0 | 60.0 | 60000.0 |
30195 | 30196 | 1 | 8 | 2000 | 17 | DM | M | 37.0 | 52.0 | 52000.0 |
30196 | 30197 | 1 | 8 | 2000 | 17 | DM | F | 34.0 | 43.0 | 43000.0 |
30209 | 30210 | 1 | 8 | 2000 | 22 | DM | M | 38.0 | 56.0 | 56000.0 |
… | … | … | … | … | … | … | … | … | … | … |
35532 | 35533 | 12 | 31 | 2002 | 14 | DM | F | 36.0 | 48.0 | 48000.0 |
35533 | 35534 | 12 | 31 | 2002 | 14 | DM | M | 37.0 | 56.0 | 56000.0 |
35534 | 35535 | 12 | 31 | 2002 | 14 | DM | M | 37.0 | 53.0 | 53000.0 |
35535 | 35536 | 12 | 31 | 2002 | 14 | DM | F | 35.0 | 42.0 | 42000.0 |
35536 | 35537 | 12 | 31 | 2002 | 14 | DM | F | 36.0 | 46.0 | 46000.0 |
823 rows × 10 columns
Some column methods can also be used for filtering. One example is isin(), which is used to match a list of values. This method can be combined with other conditionals as above. The example below returns rows from 2000 or later with either “DM”, “DO”, or “DS” in the species_id column:
PYTHON
surveys_nona[
    surveys_nona["species_id"].isin(["DM", "DO", "DS"]) & (surveys_nona["year"] >= 2000)
]
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length_mm | weight_g | weight_mg | |
---|---|---|---|---|---|---|---|---|---|---|
30159 | 30160 | 1 | 8 | 2000 | 1 | DO | M | 35.0 | 53.0 | 53000.0 |
30161 | 30162 | 1 | 8 | 2000 | 1 | DM | M | 36.0 | 50.0 | 50000.0 |
30166 | 30167 | 1 | 8 | 2000 | 1 | DO | M | 35.0 | 41.0 | 41000.0 |
30170 | 30171 | 1 | 8 | 2000 | 2 | DO | M | 36.0 | 52.0 | 52000.0 |
30172 | 30173 | 1 | 8 | 2000 | 2 | DO | F | 35.0 | 54.0 | 54000.0 |
… | … | … | … | … | … | … | … | … | … | … |
35533 | 35534 | 12 | 31 | 2002 | 14 | DM | M | 37.0 | 56.0 | 56000.0 |
35534 | 35535 | 12 | 31 | 2002 | 14 | DM | M | 37.0 | 53.0 | 53000.0 |
35535 | 35536 | 12 | 31 | 2002 | 14 | DM | F | 35.0 | 42.0 | 42000.0 |
35536 | 35537 | 12 | 31 | 2002 | 14 | DM | F | 36.0 | 46.0 | 46000.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 | 51000.0 |
1229 rows × 10 columns
Sorting data
We can sort a dataframe using the sort_values() method. For this example, we’ll work from the subset defined above. First we need to assign that subset to a variable:
PYTHON
results = surveys_nona[
    surveys_nona["species_id"].isin(["DM", "DO", "DS"]) & (surveys_nona["year"] >= 2000)
]
Now we’ll sort the results by weight_g. To do so, pass that column name as an argument (that is, inside the trailing parentheses) to the sort_values() method:
PYTHON
results.sort_values("weight_g")
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length_mm | weight_g | weight_mg | |
---|---|---|---|---|---|---|---|---|---|---|
30614 | 30615 | 4 | 30 | 2000 | 12 | DM | M | 34.0 | 18.0 | 18000.0 |
34790 | 34791 | 10 | 5 | 2002 | 17 | DM | M | 34.0 | 20.0 | 20000.0 |
32650 | 32651 | 9 | 22 | 2001 | 17 | DM | F | 33.0 | 21.0 | 21000.0 |
32782 | 32783 | 9 | 23 | 2001 | 14 | DM | M | 34.0 | 21.0 | 21000.0 |
32968 | 32969 | 10 | 14 | 2001 | 9 | DM | M | 32.0 | 22.0 | 22000.0 |
… | … | … | … | … | … | … | … | … | … | … |
35159 | 35160 | 11 | 10 | 2002 | 13 | DO | F | 36.0 | 65.0 | 65000.0 |
30289 | 30290 | 2 | 5 | 2000 | 1 | DO | F | 36.0 | 65.0 | 65000.0 |
31942 | 31943 | 4 | 21 | 2001 | 1 | DO | F | 37.0 | 68.0 | 68000.0 |
32041 | 32042 | 5 | 26 | 2001 | 1 | DO | F | 37.0 | 68.0 | 68000.0 |
31955 | 31956 | 4 | 21 | 2001 | 24 | DO | F | 36.0 | 76.0 | 76000.0 |
1229 rows × 10 columns
By default, rows are sorted in ascending order (smallest to largest). We can modify this behavior using the ascending keyword argument:
PYTHON
results.sort_values("weight_g", ascending=False)
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length_mm | weight_g | weight_mg | |
---|---|---|---|---|---|---|---|---|---|---|
31955 | 31956 | 4 | 21 | 2001 | 24 | DO | F | 36.0 | 76.0 | 76000.0 |
32041 | 32042 | 5 | 26 | 2001 | 1 | DO | F | 37.0 | 68.0 | 68000.0 |
31942 | 31943 | 4 | 21 | 2001 | 1 | DO | F | 37.0 | 68.0 | 68000.0 |
35159 | 35160 | 11 | 10 | 2002 | 13 | DO | F | 36.0 | 65.0 | 65000.0 |
30289 | 30290 | 2 | 5 | 2000 | 1 | DO | F | 36.0 | 65.0 | 65000.0 |
… | … | … | … | … | … | … | … | … | … | … |
32957 | 32958 | 10 | 14 | 2001 | 9 | DM | F | 33.0 | 22.0 | 22000.0 |
32650 | 32651 | 9 | 22 | 2001 | 17 | DM | F | 33.0 | 21.0 | 21000.0 |
32782 | 32783 | 9 | 23 | 2001 | 14 | DM | M | 34.0 | 21.0 | 21000.0 |
34790 | 34791 | 10 | 5 | 2002 | 17 | DM | M | 34.0 | 20.0 | 20000.0 |
30614 | 30615 | 4 | 30 | 2000 | 12 | DM | M | 34.0 | 18.0 | 18000.0 |
1229 rows × 10 columns
We can sort on multiple fields at once by passing a list of column names. We can control how each column sorts by passing a list with the same number of values (that is, one value per column) to the ascending keyword. The cell below sorts the results first by species_id (largest to smallest), then by weight (smallest to largest):
PYTHON
results.sort_values(["species_id", "weight_g"], ascending=[False, True])
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length_mm | weight_g | weight_mg | |
---|---|---|---|---|---|---|---|---|---|---|
30687 | 30688 | 5 | 1 | 2000 | 8 | DO | M | 35.0 | 23.0 | 23000.0 |
30768 | 30769 | 6 | 3 | 2000 | 17 | DO | M | 35.0 | 24.0 | 24000.0 |
33339 | 33340 | 1 | 12 | 2002 | 12 | DO | M | 34.0 | 24.0 | 24000.0 |
34082 | 34083 | 5 | 16 | 2002 | 6 | DO | F | 33.0 | 24.0 | 24000.0 |
34096 | 34097 | 5 | 16 | 2002 | 6 | DO | M | 35.0 | 24.0 | 24000.0 |
… | … | … | … | … | … | … | … | … | … | … |
31858 | 31859 | 3 | 24 | 2001 | 2 | DM | F | 36.0 | 60.0 | 60000.0 |
32909 | 32910 | 10 | 14 | 2001 | 4 | DM | F | 37.0 | 60.0 | 60000.0 |
34604 | 34605 | 7 | 14 | 2002 | 9 | DM | F | 36.0 | 60.0 | 60000.0 |
34675 | 34676 | 9 | 8 | 2002 | 12 | DM | F | 36.0 | 60.0 | 60000.0 |
34126 | 34127 | 5 | 16 | 2002 | 9 | DM | F | 35.0 | 64.0 | 64000.0 |
1229 rows × 10 columns
As with the dataframe methods above, sort_values() returns a copy of the original dataframe and leaves the original untouched.
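This copy behavior can be confirmed on a small, made-up dataframe:

```python
import pandas as pd

df = pd.DataFrame({"weight": [48.0, 18.0, 40.0]})

# sort_values() returns a sorted copy; the original order is preserved
sorted_df = df.sort_values("weight")
print(list(sorted_df["weight"]))  # [18.0, 40.0, 48.0]
print(list(df["weight"]))         # [48.0, 18.0, 40.0]
```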
Challenge
Write a query that returns year, species_id, and weight in kg from the surveys_nona table, sorted with the largest weights at the top.
PYTHON
# Create a new column with weight in kg
surveys_nona["weight_kg"] = surveys_nona["weight_g"] / 1000

# Create a subset containing only year, species_id, and weight_kg
subset = surveys_nona[["year", "species_id", "weight_kg"]]

# Sort the subset by weight_kg
subset.sort_values("weight_kg", ascending=False)
year | species_id | weight_kg | |
---|---|---|---|
33048 | 2001 | NL | 0.280 |
12870 | 1987 | NL | 0.278 |
15458 | 1989 | NL | 0.275 |
2132 | 1979 | NL | 0.274 |
12728 | 1987 | NL | 0.270 |
… | … | … | … |
29905 | 1999 | PP | 0.004 |
9908 | 1985 | RM | 0.004 |
10438 | 1985 | RM | 0.004 |
8735 | 1983 | RM | 0.004 |
7083 | 1982 | PF | 0.004 |
30676 rows × 3 columns
Modifying data
We’ve already shown how to modify an existing dataframe by adding a new column. What if we want to modify existing cells instead? As we’ve seen, this can be a little tricky in pandas because most of its methods return a copy of the original dataframe. For example, we can get subsets of a dataframe using square brackets. The cell below returns the species_id column for rows 2 through 5:
PYTHON
surveys[2:6]["species_id"]
OUTPUT
2 DM
3 DM
4 DM
5 PF
Name: species_id, dtype: object
But trying to set new values using this syntax may not work as expected. Say we want to set the species_id column to a new value, “FD”. Try running the code in the cell below:
PYTHON
surveys[2:6]["species_id"] = "FD"
OUTPUT
C:\...\1234567890.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
surveys[2:6]["species_id"] = "FD"
You should have received a SettingWithCopyWarning after running that cell. This warning tells us that the data in the original dataframe has not been modified. This is because the square brackets returned a copy, meaning that any changes will be reflected in the copy, not the original. We can verify that the original dataframe has not been changed by displaying the rows that would have been modified:
PYTHON
surveys[2:6]
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
5 | 6 | 7 | 16 | 1977 | 1 | PF | M | 14.0 | NaN |
Using loc to modify existing cells
One way to modify existing data in pandas is to use the loc
attribute. This attribute allows you to extract and modify cells in a DataFrame using the following syntax: df.loc[row_indexer, col_indexer]
.
The row_indexer argument is used to select one or more rows. It can be:

- A row label (i.e., the bold column on the far left). For example, 0 returns the row with label 0.
- A slice including multiple rows:
  - : returns all rows
  - :2 returns all rows from the beginning of the dataframe to the row labeled 2, inclusive
  - 2: returns all rows from the row labeled 2 to the end of the dataframe, inclusive
  - 2:5 returns all rows between those labeled 2 and 5, inclusive
- A conditional, as in the examples above.
The col_indexer argument is used to select one or more columns. It will typically be a list of column names and can be omitted, in which case all columns will be returned.
Row labels and indexes
The row labels in this dataframe happen to be numeric and aligned exactly to the row’s index (so for the first row both the index and the label are 0, for the second row both the index and the label are 1, etc.) This is often but not always true in pandas. For example, if we used record_id as the label, the row labels would be one-based and the row indexes would be zero-based.
loc slicing behavior
Slices using loc are inclusive: rows matching both the start and end values are included in the returned slice. This differs from list slices, where the start but not the end value is included in the slice. loc works this way because it looks at the row label, not the row index.
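The two slicing conventions can be compared side by side using a toy dataframe whose labels happen to match its positions:

```python
import pandas as pd

df = pd.DataFrame({"x": list("abcdef")})  # row labels 0-5 match positions

# Square brackets slice by position: the end is excluded (3 rows)
print(len(df[2:5]))      # 3

# loc slices by label: the end is included (4 rows)
print(len(df.loc[2:5]))  # 4
```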
We’ll be making some changes to our data, so let’s work from a copy instead of modifying the original. Create a copy using the copy() method:
PYTHON
surveys_copy = surveys.copy()
To select a subset of rows and columns using loc, use:
PYTHON
surveys_copy.loc[2:5, "species_id"]
OUTPUT
2 DM
3 DM
4 DM
5 PF
Name: species_id, dtype: object
Unlike the methods earlier in the lesson, this is a view, not a copy, of the data in the surveys_copy dataframe. That means that the object returned by loc is live and can be used to change the original dataframe. We can now assign a new value to the species_id column in the matching rows of the original dataframe:
PYTHON
surveys_copy.loc[2:5, "species_id"] = "FD"
We can see that these changes are reflected in the original surveys_copy object:
PYTHON
surveys_copy.loc[1:6, "species_id"]
OUTPUT
1 NL
2 FD
3 FD
4 FD
5 FD
6 PE
Name: species_id, dtype: object
Slicing with iloc
pandas provides another indexer, iloc
, that allows us to select and modify data using row and column indexes instead of labels. Learn more in the pandas documentation.
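A quick sketch of the difference between loc and iloc, using a made-up dataframe whose labels deliberately do not match its positions:

```python
import pandas as pd

# Labels are one-based here, so label and position disagree
df = pd.DataFrame({"species": ["NL", "DM", "PF"]}, index=[1, 2, 3])

print(df.loc[1, "species"])   # 'NL' -- row with *label* 1
print(df.iloc[1]["species"])  # 'DM' -- row at *position* 1
```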
Key Points
- Use square brackets to access rows, columns, and specific cells
- Use operators like +, -, and / to perform arithmetic on rows and columns
- Store the results of calculations in a dataframe by adding a new column or overwriting an existing column
- Sort data, rename columns, and get unique values in a dataframe using methods provided by pandas
- By default, most dataframe operations return a copy of the original data