Accessing Data in a Dataframe

Last updated on 2022-07-18

Estimated time: 60 minutes

Overview

Questions

  • How can we look at individual rows and columns in a dataframe?
  • How can we perform calculations?
  • How can we modify the table and data?

Objectives

  • Access individual rows and columns
  • Access multiple columns at once using a list
  • Perform calculations like addition and subtraction
  • Rename columns using a dictionary
  • Access rows containing specific data
  • Sort the data returned by a query
  • Modify data using loc

We’ll begin by importing pandas and reading our CSV, as we did in the previous lesson:

PYTHON

import pandas as pd

surveys = pd.read_csv("data/surveys.csv")
surveys

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length weight
0 1 7 16 1977 2 NL M 32.0 NaN
1 2 7 16 1977 3 NL M 33.0 NaN
2 3 7 16 1977 2 DM F 37.0 NaN
3 4 7 16 1977 7 DM M 36.0 NaN
4 5 7 16 1977 3 DM M 35.0 NaN
...
35544 35545 12 31 2002 15 AH NaN NaN NaN
35545 35546 12 31 2002 15 AH NaN NaN NaN
35546 35547 12 31 2002 10 RM F 15.0 14.0
35547 35548 12 31 2002 7 DO M 36.0 51.0
35548 35549 12 31 2002 5 NaN NaN NaN NaN

35549 rows × 9 columns

We will now look at how to access rows and columns in the dataframe.

Getting columns


We can get the values from a single column by passing a string inside square brackets to the dataframe object. For example, to look at the year column, use:

PYTHON

surveys["year"]

OUTPUT

0        1977
1        1977
2        1977
3        1977
4        1977
         ...
35544    2002
35545    2002
35546    2002
35547    2002
35548    2002
Name: year, Length: 35549, dtype: int64

A single column is returned as a Series, which is why the output for this cell is formatted differently than the other cells in this lesson. The main difference between a Series and a DataFrame is that a DataFrame can have multiple columns. The two classes share many but not all attributes and methods. Note also that this is a copy of data from the original dataframe. Changing values in the series will have no effect on the original dataframe.
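
One quick way to see the difference is to check each object's type with Python's built-in type() function. This short example is only an illustration:

PYTHON

# The full table is a DataFrame...
print(type(surveys))
# ...but a single column is a Series
print(type(surveys["year"]))

OUTPUT

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>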

Using lists to get more than one column at a time

Python uses the built-in list data type to store sequences, that is, an ordered list of objects. Any type of Python object can be stored in a list: strings, integers, floats, even other lists or collections. Once created, a list can be modified by appending, inserting, or deleting items. We will not be going into detail about these operations here, but as always you can learn more about how to use a list using help() or the Python docs.
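
For example, here is a small, self-contained illustration of appending, inserting, and removing items (the fruits list is just a toy example):

PYTHON

fruits = ["apple", "banana"]

fruits.append("cherry")       # add an item to the end
fruits.insert(0, "apricot")   # insert an item at a given position
fruits.remove("banana")       # remove an item by value

fruits

OUTPUT

['apricot', 'apple', 'cherry']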

Create a list using square brackets. Let’s create a list of the three columns in our dataframe that together give the date of an observation:

PYTHON

cols = ["year", "month", "day"]

When we pass this list to the surveys dataframe using square brackets, as we did above, we retrieve a copy of the dataframe containing just those columns. Note that, when we get more than one column, pandas returns a dataframe:

PYTHON

surveys[cols]

OUTPUT

year month day
0 1977 7 16
1 1977 7 16
2 1977 7 16
3 1977 7 16
4 1977 7 16
...
35544 2002 12 31
35545 2002 12 31
35546 2002 12 31
35547 2002 12 31
35548 2002 12 31

35549 rows × 3 columns

Getting rows


We can get the rows at the beginning of the table using the head method:

PYTHON

surveys.head()

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length weight
0 1 7 16 1977 2 NL M 32.0 NaN
1 2 7 16 1977 3 NL M 33.0 NaN
2 3 7 16 1977 2 DM F 37.0 NaN
3 4 7 16 1977 7 DM M 36.0 NaN
4 5 7 16 1977 3 DM M 35.0 NaN

By default, this method returns the first five rows. We can provide a number inside the parentheses if we need a specific number of rows:

PYTHON

surveys.head(20)

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length weight
0 1 7 16 1977 2 NL M 32.0 NaN
1 2 7 16 1977 3 NL M 33.0 NaN
2 3 7 16 1977 2 DM F 37.0 NaN
3 4 7 16 1977 7 DM M 36.0 NaN
4 5 7 16 1977 3 DM M 35.0 NaN
5 6 7 16 1977 1 PF M 14.0 NaN
6 7 7 16 1977 2 PE F NaN NaN
7 8 7 16 1977 1 DM M 37.0 NaN
8 9 7 16 1977 1 DM F 34.0 NaN
9 10 7 16 1977 6 PF F 20.0 NaN
10 11 7 16 1977 5 DS F 53.0 NaN
11 12 7 16 1977 7 DM M 38.0 NaN
12 13 7 16 1977 3 DM M 35.0 NaN
13 14 7 16 1977 8 DM NaN NaN NaN
14 15 7 16 1977 6 DM F 36.0 NaN
15 16 7 16 1977 4 DM F 36.0 NaN
16 17 7 16 1977 3 DS F 48.0 NaN
17 18 7 16 1977 2 PP M 22.0 NaN
18 19 7 16 1977 4 PF NaN NaN NaN
19 20 7 17 1977 11 DS F 48.0 NaN

The tail() method is similar, except it returns rows from the end of the table:

PYTHON

surveys.tail()

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length weight
35544 35545 12 31 2002 15 AH NaN NaN NaN
35545 35546 12 31 2002 15 AH NaN NaN NaN
35546 35547 12 31 2002 10 RM F 15.0 14.0
35547 35548 12 31 2002 7 DO M 36.0 51.0
35548 35549 12 31 2002 5 NaN NaN NaN NaN

The head() and tail() methods are useful for getting a feel for how our data is structured, but we’ll also want to be able to look at specific rows. As when we selected columns above, we can use square brackets to extract slices from the dataframe. A slice is a subset of the dataframe starting at one row and ending at another. To get a slice, we pass the starting and ending indexes to the square brackets as start:end:

PYTHON

surveys[2:5]

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length weight
2 3 7 16 1977 2 DM F 37.0 NaN
3 4 7 16 1977 7 DM M 36.0 NaN
4 5 7 16 1977 3 DM M 35.0 NaN

There are three things to be aware of when slicing a dataframe:

  • Row indexes are zero-based. That is, the first row has an index of 0, not 1.
  • When slicing, the slice includes the start index but not the end index. In this case, that means the slice includes rows 2, 3, and 4 but not row 5.
  • The row label can be different from the row index. They happen to be the same here, but don’t count on that being true.

Core Python types like list and tuple use the same conventions, as do most libraries that deal with sequences.
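
For example, slicing a plain Python list follows the same start:end behavior. This is only a comparison, not part of the surveys data:

PYTHON

letters = ["a", "b", "c", "d", "e", "f"]

# Indexes are zero-based and the end index is excluded,
# so this returns the items at indexes 2, 3, and 4
letters[2:5]

OUTPUT

['c', 'd', 'e']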

Getting unique values


Recall that we can use square brackets to return a single column from a dataframe. We can get the list of unique values within that column using the unique() method:

PYTHON

surveys["species_id"].unique()

OUTPUT

array(['NL', 'DM', 'PF', 'PE', 'DS', 'PP', 'SH', 'OT', 'DO', 'OX', 'SS',
       'OL', 'RM', nan, 'SA', 'PM', 'AH', 'DX', 'AB', 'CB', 'CM', 'CQ',
       'RF', 'PC', 'PG', 'PH', 'PU', 'CV', 'UR', 'UP', 'ZL', 'UL', 'CS',
       'SC', 'BA', 'SF', 'RO', 'AS', 'SO', 'PI', 'ST', 'CU', 'SU', 'RX',
       'PB', 'PL', 'PX', 'CT', 'US'], dtype=object)

To do the same across multiple columns, we can use the drop_duplicates() method on a copy of the dataframe containing only the columns we’re interested in. Like any well-named method, drop_duplicates() does exactly what the name implies: It returns a copy of the dataframe with all duplicate rows removed.

PYTHON

surveys[["plot_id", "species_id"]].drop_duplicates()

OUTPUT

plot_id species_id
0 2 NL
1 3 NL
2 2 DM
3 7 DM
4 3 DM
...
35474 22 SF
35511 11 US
35519 9 SF
35527 13 US
35543 15 US

612 rows × 2 columns

Calculating values


The survey dataset includes two columns, hindfoot_length and weight, that are stored as numbers and represent measurements. We may want to perform calculations using numbers like these in our own data. We can do so using Python’s built-in mathematical operators, including:

  • x + y for addition
  • x - y for subtraction
  • x * y for multiplication
  • x / y for division
  • x % y for calculating remainders
  • x ** y for exponents
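
For example, the remainder operator gives a quick way to pull the two-digit year out of the year column. This is just an illustration of the operators above:

PYTHON

# Divide by 100 and keep the remainder, e.g. 1977 -> 77
surveys["year"] % 100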

To make the examples in this section a little more useful, we’re going to remove all rows that contain null values using the dropna() method. This will filter out any rows that don’t have a valid hindfoot_length or weight, as well as those that have a null value in any other cell. (This is an inelegant solution to the problem of missing data. We’ll talk about more nuanced solutions later in the lesson.)

PYTHON

surveys_nona = surveys.dropna().copy()
surveys_nona

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length weight
62 63 8 19 1977 3 DM M 35.0 40.0
63 64 8 19 1977 7 DM M 37.0 48.0
64 65 8 19 1977 4 DM F 34.0 29.0
65 66 8 19 1977 4 DM F 35.0 46.0
66 67 8 19 1977 7 DM M 35.0 36.0
...
35540 35541 12 31 2002 15 PB F 24.0 31.0
35541 35542 12 31 2002 15 PB F 26.0 29.0
35542 35543 12 31 2002 15 PB F 27.0 34.0
35546 35547 12 31 2002 10 RM F 15.0 14.0
35547 35548 12 31 2002 7 DO M 36.0 51.0

30676 rows × 9 columns

Suppose we want to convert the weight column from grams to milligrams. To do so, we can multiply that column by 1000:

PYTHON

surveys_nona["weight"] * 1000

OUTPUT

62       40000.0
63       48000.0
64       29000.0
65       46000.0
66       36000.0
          ...
35540    31000.0
35541    29000.0
35542    34000.0
35546    14000.0
35547    51000.0
Name: weight, Length: 30676, dtype: float64

To convert it to kilograms, we can divide by 1000:

PYTHON

surveys_nona["weight"] / 1000

OUTPUT

62       0.040
63       0.048
64       0.029
65       0.046
66       0.036
         ...
35540    0.031
35541    0.029
35542    0.034
35546    0.014
35547    0.051
Name: weight, Length: 30676, dtype: float64

Note that calculations do not modify the original dataset. If we want to retain the result, we have to assign it to a new column:

PYTHON

surveys_nona["weight_mg"] = surveys_nona["weight"] * 1000
surveys_nona

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length weight weight_mg
62 63 8 19 1977 3 DM M 35.0 40.0 40000.0
63 64 8 19 1977 7 DM M 37.0 48.0 48000.0
64 65 8 19 1977 4 DM F 34.0 29.0 29000.0
65 66 8 19 1977 4 DM F 35.0 46.0 46000.0
66 67 8 19 1977 7 DM M 35.0 36.0 36000.0
...
35540 35541 12 31 2002 15 PB F 24.0 31.0 31000.0
35541 35542 12 31 2002 15 PB F 26.0 29.0 29000.0
35542 35543 12 31 2002 15 PB F 27.0 34.0 34000.0
35546 35547 12 31 2002 10 RM F 15.0 14.0 14000.0
35547 35548 12 31 2002 7 DO M 36.0 51.0 51000.0

30676 rows × 10 columns

We can also add, subtract, multiply, and divide columns, as in the following (admittedly nonsensical) calculation, which adds together the hindfoot_length and weight columns:

PYTHON

surveys_nona["hindfoot_length"] + surveys_nona["weight"]

OUTPUT

62       75.0
63       85.0
64       63.0
65       81.0
66       71.0
         ...
35540    55.0
35541    55.0
35542    61.0
35546    29.0
35547    87.0
Length: 30676, dtype: float64

Renaming columns


The hindfoot_length and weight columns don’t specify a unit, which may get confusing if we want to perform unit conversions like the one above. Fortunately, dataframes allow us to rename existing columns using the rename() method.

The rename() method uses a dictionary (or dict) to map between the old and new column names. As with list above, the dict data type is built into Python–we don’t need to import anything to use it. A dict maps keys to values. We can create one using curly braces:

PYTHON

dct = {"key1": "val1", "key2": "val2"}
dct

OUTPUT

{'key1': 'val1', 'key2': 'val2'}

Dictionaries are a useful and highly flexible data type. As with list above, we’ll be giving them short shrift here, but you can learn more about them at the Python docs.
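
For example, we can look up a value by its key, add a new key, or list all of the keys. Continuing with the toy dct dictionary from above:

PYTHON

dct["key1"]           # look up a value by its key

dct["key3"] = "val3"  # add a new key-value pair

list(dct.keys())      # list all of the keys

OUTPUT

['key1', 'key2', 'key3']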

Here we'll use a dict to specify how we want to rename our columns. The keys will be the current column names and the values the new column names. Note that we explicitly assign the result of rename() back to the original variable: by default, rename() returns a copy and leaves the original dataframe unchanged.

PYTHON

# Create a dict that maps from the old to the new column name
cols = {
    "hindfoot_length": "hindfoot_length_mm",
    "weight": "weight_g",
}

# Assign the result of the rename method back to surveys_nona
surveys_nona = surveys_nona.rename(columns=cols)

# View the dataframe with the new column names
surveys_nona

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length_mm weight_g weight_mg
62 63 8 19 1977 3 DM M 35.0 40.0 40000.0
63 64 8 19 1977 7 DM M 37.0 48.0 48000.0
64 65 8 19 1977 4 DM F 34.0 29.0 29000.0
65 66 8 19 1977 4 DM F 35.0 46.0 46000.0
66 67 8 19 1977 7 DM M 35.0 36.0 36000.0
...
35540 35541 12 31 2002 15 PB F 24.0 31.0 31000.0
35541 35542 12 31 2002 15 PB F 26.0 29.0 29000.0
35542 35543 12 31 2002 15 PB F 27.0 34.0 34000.0
35546 35547 12 31 2002 10 RM F 15.0 14.0 14000.0
35547 35548 12 31 2002 7 DO M 36.0 51.0 51000.0

30676 rows × 10 columns

Challenge

Create a dataframe that returns the year, month, day, species_id and weight in mg.

PYTHON

# Assign the weight in milligrams to the weight_mg column
surveys_nona["weight_mg"] = surveys_nona["weight_g"] * 1000

# Display a copy of survey with only the desired columns
surveys_nona[["year", "month", "day", "species_id", "weight_mg"]]

OUTPUT

year month day species_id weight_mg
62 1977 8 19 DM 40000.0
63 1977 8 19 DM 48000.0
64 1977 8 19 DM 29000.0
65 1977 8 19 DM 46000.0
66 1977 8 19 DM 36000.0
...
35540 2002 12 31 PB 31000.0
35541 2002 12 31 PB 29000.0
35542 2002 12 31 PB 34000.0
35546 2002 12 31 RM 14000.0
35547 2002 12 31 DO 51000.0

30676 rows × 5 columns

Filtering data


pandas provides a variety of ways to filter a dataframe. For example, suppose we want to look at a specific species in the surveys_nona dataframe. We can view the rows matching a given species using the same square brackets we used above to select specific columns and rows. Here, however, instead of passing a value or list of values, we will pass a conditional expression.

A conditional expression is a statement that evaluates to either True or False. Conditional expressions often make use of comparison operators, for example:

  • == for equals
  • != for does not equal
  • > for greater than
  • >= for greater than or equal to
  • < for less than
  • <= for less than or equal to

Examples of conditional statements include:

  • "a" == "b" evaluates to False
  • "a" != "b" evaluates to True
  • 3 > 4 evaluates to False

Note that, when comparing strings, evaluations are case sensitive:

  • "a" == "A" evaluates to False

= for assignment, == for equality

Remember that, in Python, a single equal sign is used to assign values to variables. We’ve already used the assignment operator in this lesson, for example, when we created a new column in the dataframe.
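
Before using a conditional to filter the dataframe, it can help to see what one evaluates to on its own. Applied to a column, a conditional returns a Series with one True or False value per row:

PYTHON

surveys_nona["species_id"] == "DM"

OUTPUT

62        True
63        True
64        True
65        True
66        True
         ...
35540    False
35541    False
35542    False
35546    False
35547    False
Name: species_id, Length: 30676, dtype: bool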

To limit the dataframe to rows matching the species “DM”, include the conditional surveys_nona["species_id"] == "DM" inside the square brackets:

PYTHON

surveys_nona[surveys_nona["species_id"] == "DM"]

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length_mm weight_g weight_mg
62 63 8 19 1977 3 DM M 35.0 40.0 40000.0
63 64 8 19 1977 7 DM M 37.0 48.0 48000.0
64 65 8 19 1977 4 DM F 34.0 29.0 29000.0
65 66 8 19 1977 4 DM F 35.0 46.0 46000.0
66 67 8 19 1977 7 DM M 35.0 36.0 36000.0
...
35532 35533 12 31 2002 14 DM F 36.0 48.0 48000.0
35533 35534 12 31 2002 14 DM M 37.0 56.0 56000.0
35534 35535 12 31 2002 14 DM M 37.0 53.0 53000.0
35535 35536 12 31 2002 14 DM F 35.0 42.0 42000.0
35536 35537 12 31 2002 14 DM F 36.0 46.0 46000.0

9727 rows × 10 columns

To limit our results to observations made in or after 2000, use:

PYTHON

surveys_nona[surveys_nona["year"] >= 2000]

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length_mm weight_g weight_mg
30158 30159 1 8 2000 1 PP F 22.0 17.0 17000.0
30159 30160 1 8 2000 1 DO M 35.0 53.0 53000.0
30160 30161 1 8 2000 1 PP F 21.0 17.0 17000.0
30161 30162 1 8 2000 1 DM M 36.0 50.0 50000.0
30162 30163 1 8 2000 1 PP M 20.0 16.0 16000.0
...
35540 35541 12 31 2002 15 PB F 24.0 31.0 31000.0
35541 35542 12 31 2002 15 PB F 26.0 29.0 29000.0
35542 35543 12 31 2002 15 PB F 27.0 34.0 34000.0
35546 35547 12 31 2002 10 RM F 15.0 14.0 14000.0
35547 35548 12 31 2002 7 DO M 36.0 51.0 51000.0

4947 rows × 10 columns

As with rename() above, each filtering operation returns a copy of the dataframe. We will look at how to make changes to the original dataframe at the end of this lesson.

Building more complex queries


We can combine conditionals using what are called bitwise operators:

  • &: True if conditions on both sides of the operator are True (and)
  • |: True if a condition on either side is True (or)

To return all observations of DM in or after 2000, we can combine the two conditionals we used previously. Note that, when joining conditionals using & or |, we must wrap each individual condition in parentheses. If we omit the parentheses, Python applies & before the comparisons (it has higher precedence), which produces an error or an unexpected result.

PYTHON

surveys_nona[(surveys_nona["species_id"] == "DM") & (surveys_nona["year"] >= 2000)]

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length_mm weight_g weight_mg
30161 30162 1 8 2000 1 DM M 36.0 50.0 50000.0
30178 30179 1 8 2000 12 DM M 36.0 60.0 60000.0
30195 30196 1 8 2000 17 DM M 37.0 52.0 52000.0
30196 30197 1 8 2000 17 DM F 34.0 43.0 43000.0
30209 30210 1 8 2000 22 DM M 38.0 56.0 56000.0
...
35532 35533 12 31 2002 14 DM F 36.0 48.0 48000.0
35533 35534 12 31 2002 14 DM M 37.0 56.0 56000.0
35534 35535 12 31 2002 14 DM M 37.0 53.0 53000.0
35535 35536 12 31 2002 14 DM F 35.0 42.0 42000.0
35536 35537 12 31 2002 14 DM F 36.0 46.0 46000.0

823 rows × 10 columns
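
The | operator works the same way. For example, to get observations of DM from either the first or the last year of the study (1977 or 2002):

PYTHON

surveys_nona[
    (surveys_nona["species_id"] == "DM")
    & ((surveys_nona["year"] == 1977) | (surveys_nona["year"] == 2002))
]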

Some column methods can also be used for filtering. One example is isin(), which is used to match a list of values. This method can be combined with other conditionals as above. The example below returns rows from 2000 or later with either “DM”, “DO”, or “DS” in the species_id column:

PYTHON

surveys_nona[
    surveys_nona["species_id"].isin(["DM", "DO", "DS"]) & (surveys_nona["year"] >= 2000)
]

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length_mm weight_g weight_mg
30159 30160 1 8 2000 1 DO M 35.0 53.0 53000.0
30161 30162 1 8 2000 1 DM M 36.0 50.0 50000.0
30166 30167 1 8 2000 1 DO M 35.0 41.0 41000.0
30170 30171 1 8 2000 2 DO M 36.0 52.0 52000.0
30172 30173 1 8 2000 2 DO F 35.0 54.0 54000.0
...
35533 35534 12 31 2002 14 DM M 37.0 56.0 56000.0
35534 35535 12 31 2002 14 DM M 37.0 53.0 53000.0
35535 35536 12 31 2002 14 DM F 35.0 42.0 42000.0
35536 35537 12 31 2002 14 DM F 36.0 46.0 46000.0
35547 35548 12 31 2002 7 DO M 36.0 51.0 51000.0

1229 rows × 10 columns

Sorting data


We can sort a dataframe using the sort_values() method. For this example, we’ll work from the subset defined above. First we need to assign that subset to a variable:

PYTHON

results = surveys_nona[
    surveys_nona["species_id"].isin(["DM", "DO", "DS"]) & (surveys_nona["year"] >= 2000)
]

Now we’ll sort the results by weight_g. To do so, pass that column name as an argument (that is, inside the trailing parentheses) to the sort_values() method:

PYTHON

results.sort_values("weight_g")

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length_mm weight_g weight_mg
30614 30615 4 30 2000 12 DM M 34.0 18.0 18000.0
34790 34791 10 5 2002 17 DM M 34.0 20.0 20000.0
32650 32651 9 22 2001 17 DM F 33.0 21.0 21000.0
32782 32783 9 23 2001 14 DM M 34.0 21.0 21000.0
32968 32969 10 14 2001 9 DM M 32.0 22.0 22000.0
...
35159 35160 11 10 2002 13 DO F 36.0 65.0 65000.0
30289 30290 2 5 2000 1 DO F 36.0 65.0 65000.0
31942 31943 4 21 2001 1 DO F 37.0 68.0 68000.0
32041 32042 5 26 2001 1 DO F 37.0 68.0 68000.0
31955 31956 4 21 2001 24 DO F 36.0 76.0 76000.0

1229 rows × 10 columns

By default, rows are sorted in ascending order (smallest to largest). We can modify this behavior using the ascending keyword argument:

PYTHON

results.sort_values("weight_g", ascending=False)

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length_mm weight_g weight_mg
31955 31956 4 21 2001 24 DO F 36.0 76.0 76000.0
32041 32042 5 26 2001 1 DO F 37.0 68.0 68000.0
31942 31943 4 21 2001 1 DO F 37.0 68.0 68000.0
35159 35160 11 10 2002 13 DO F 36.0 65.0 65000.0
30289 30290 2 5 2000 1 DO F 36.0 65.0 65000.0
...
32957 32958 10 14 2001 9 DM F 33.0 22.0 22000.0
32650 32651 9 22 2001 17 DM F 33.0 21.0 21000.0
32782 32783 9 23 2001 14 DM M 34.0 21.0 21000.0
34790 34791 10 5 2002 17 DM M 34.0 20.0 20000.0
30614 30615 4 30 2000 12 DM M 34.0 18.0 18000.0

1229 rows × 10 columns

We can sort on multiple fields at once by passing a list of column names. We can control how each column sorts by passing a list with the same number of values (that is, one value per column) to the ascending keyword. The cell below sorts the results first by species_id (largest to smallest), then by weight_g (smallest to largest):

PYTHON

results.sort_values(["species_id", "weight_g"], ascending=[False, True])

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length_mm weight_g weight_mg
30687 30688 5 1 2000 8 DO M 35.0 23.0 23000.0
30768 30769 6 3 2000 17 DO M 35.0 24.0 24000.0
33339 33340 1 12 2002 12 DO M 34.0 24.0 24000.0
34082 34083 5 16 2002 6 DO F 33.0 24.0 24000.0
34096 34097 5 16 2002 6 DO M 35.0 24.0 24000.0
...
31858 31859 3 24 2001 2 DM F 36.0 60.0 60000.0
32909 32910 10 14 2001 4 DM F 37.0 60.0 60000.0
34604 34605 7 14 2002 9 DM F 36.0 60.0 60000.0
34675 34676 9 8 2002 12 DM F 36.0 60.0 60000.0
34126 34127 5 16 2002 9 DM F 35.0 64.0 64000.0

1229 rows × 10 columns

As with the dataframe methods above, sort_values() returns a copy of the original dataframe and leaves the original untouched.

Challenge

Write a query that returns year, species_id, and weight in kg from the surveys_nona table, sorted with the largest weights at the top.

PYTHON

# Create a new column with weight in kg
surveys_nona["weight_kg"] = surveys_nona["weight_g"] / 1000

# Create a subset containing only year, species_id, and weight_kg
subset = surveys_nona[["year", "species_id", "weight_kg"]]

# Sort the subset by weight_kg
subset.sort_values("weight_kg", ascending=False)

OUTPUT

year species_id weight_kg
33048 2001 NL 0.280
12870 1987 NL 0.278
15458 1989 NL 0.275
2132 1979 NL 0.274
12728 1987 NL 0.270
...
29905 1999 PP 0.004
9908 1985 RM 0.004
10438 1985 RM 0.004
8735 1983 RM 0.004
7083 1982 PF 0.004

30676 rows × 3 columns

Modifying data


We've already shown how to modify an existing dataframe by adding a new column. What if we want to modify existing cells instead? As we've seen, this can be a little tricky in pandas because most of its methods return a copy of the original dataframe. For example, we can get subsets of a dataframe using square brackets. The cell below returns the species_id column for rows 2 through 5:

PYTHON

surveys[2:6]["species_id"]

OUTPUT

2    DM
3    DM
4    DM
5    PF
Name: species_id, dtype: object

But trying to set new values using this syntax may not work as expected. Say we want to set the species_id column in those rows to a new value, "FD". Try running the code in the cell below:

PYTHON

surveys[2:6]["species_id"] = "FD"

OUTPUT

C:\...\1234567890.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  surveys[2:6]["species_id"] = "FD"

You should have received a SettingWithCopyWarning warning after running that cell. This warning tells us that the data in the original dataframe has not been modified. This is because the square brackets returned a copy, meaning that any changes will be reflected in the copy, not the original. We can verify that the original dataframe has not been changed by displaying the rows that would have been modified:

PYTHON

surveys[2:6]

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length weight
2 3 7 16 1977 2 DM F 37.0 NaN
3 4 7 16 1977 7 DM M 36.0 NaN
4 5 7 16 1977 3 DM M 35.0 NaN
5 6 7 16 1977 1 PF M 14.0 NaN

Using loc to modify existing cells

One way to modify existing data in pandas is to use the loc attribute. This attribute allows you to extract and modify cells in a DataFrame using the following syntax: df.loc[row_indexer, col_indexer].

The row_indexer argument is used to select one or more rows. It can be:

  • A row label (i.e., the bold column on the far left):
    • 0 returns the row with label 0
  • A slice including multiple rows:
    • : returns all rows
    • :2 returns all rows from the beginning of the dataframe to the row labeled 2, inclusive
    • 2: returns all rows from the row labeled 2 to the end of the dataframe, inclusive
    • 2:5 returns all rows between those labeled 2 and 5, inclusive
  • A conditional, as in the examples above.

The col_indexer argument is used to select one or more columns. It is typically a single column name or a list of column names, but it can also be omitted, in which case all columns are returned.
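
For example, we can pair a conditional row indexer with a list of column names. The cell below only reads values; it does not modify anything:

PYTHON

# Rows from 1977, showing only the species_id and hindfoot_length columns
surveys.loc[surveys["year"] == 1977, ["species_id", "hindfoot_length"]]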

Row labels and indexes

The row labels in this dataframe happen to be numeric and aligned exactly to the row’s index (so for the first row both the index and the label are 0, for the second row both the index and the label are 1, etc.) This is often but not always true in pandas. For example, if we used record_id as the label, the row labels would be one-based and the row indexes would be zero-based.

loc slicing behavior

Slices using loc are inclusive: rows matching both the start and end values are included in the returned slice. This differs from list slices, where the start value is included but the end value is not. loc works this way because it is looking at the row label, not the row index.

We’ll be making some changes to our data, so let’s work from a copy instead of modifying the original. Create a copy using the copy() method:

PYTHON

surveys_copy = surveys.copy()

To select a subset of rows and columns using loc, use:

PYTHON

surveys_copy.loc[2:5, "species_id"]

OUTPUT

2    DM
3    DM
4    DM
5    PF
Name: species_id, dtype: object

Unlike the methods earlier in the lesson, this is a view, not a copy, of the data in the surveys_copy dataframe. That means that the object returned by loc is live and can be used to change the original dataframe. We can now assign a new value to the species_id column in the matching rows of the original dataframe:

PYTHON

surveys_copy.loc[2:5, "species_id"] = "FD"

We can see that these changes are reflected in the original surveys_copy object:

PYTHON

surveys_copy.loc[1:6, "species_id"]

OUTPUT

1    NL
2    FD
3    FD
4    FD
5    FD
6    PE
Name: species_id, dtype: object

Slicing with iloc

pandas provides another indexer, iloc, that allows us to select and modify data using row and column indexes instead of labels. Learn more in the pandas documentation.
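
As a quick sketch of the difference: iloc works with zero-based positions and, like list slices, excludes the end of a slice. For example:

PYTHON

# Rows at positions 2 through 5 and the first two columns (record_id and month)
surveys.iloc[2:6, :2]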

Key Points

  • Use square brackets to access rows, columns, and specific cells
  • Use operators like +, -, and / to perform arithmetic on rows and columns
  • Store the results of calculations in a dataframe by adding a new column or overwriting an existing column
  • Sort data, rename columns, and get unique values in a dataframe using methods provided by pandas
  • By default, most dataframe operations return a copy of the original data