Combining Dataframes
Last updated on 2022-07-18 | Edit this page
Overview
Questions
- How do we combine data from multiple sources using pandas?
- How do we add data to an existing dataframe?
- How do we split and combine data columns?
Objectives
- Use
pd.merge()
to add species info to the survey dataset - Use
pd.concat()
to add additional rows the dataset - Use string methods to combine, split, and modify text columns using the
str
accessor
Dataframes can be used to organize and group data by common characteristics. Often, we need to combine elements from separate dataframes into one for analysis and visualization. A merge (or join) allows use to combine two dataframes using values common to each. Likewise, we may need to append data collected under different circumstances. In this chapter, we will show how to merge, concatenate, and split data using pandas.
Merging dataframes
The survey dataframe we’ve been using throughout this lesson has a column called species_id. We used this column in the previous lesson to calculate summary statistics about observations of each species. But the species_id is just a two-letter code—what does each code stand for? To find out, we’ll now load both the survey dataset and a second dataset containing more detailed information about the various species observed. Read the second dataframe from a file called species.csv:
PYTHON
import pandas as pd
= pd.read_csv("data/surveys.csv")
surveys = pd.read_csv("data/species.csv")
species
species
species_id | genus | species | taxa | |
---|---|---|---|---|
0 | AB | Amphispiza | bilineata | Bird |
1 | AH | Ammospermophilus | harrisi | Rodent |
2 | AS | Ammodramus | savannarum | Bird |
3 | BA | Baiomys | taylori | Rodent |
4 | CB | Campylorhynchus | brunneicapillus | Bird |
5 | CM | Calamospiza | melanocorys | Bird |
6 | CQ | Callipepla | squamata | Bird |
7 | CS | Crotalus | scutalatus | Reptile |
8 | CT | Cnemidophorus | tigris | Reptile |
9 | CU | Cnemidophorus | uniparens | Reptile |
10 | CV | Crotalus | viridis | Reptile |
11 | DM | Dipodomys | merriami | Rodent |
12 | DO | Dipodomys | ordii | Rodent |
13 | DS | Dipodomys | spectabilis | Rodent |
14 | DX | Dipodomys | sp. | Rodent |
15 | EO | Eumeces | obsoletus | Reptile |
16 | GS | Gambelia | silus | Reptile |
17 | NL | Neotoma | albigula | Rodent |
18 | NX | Neotoma | sp. | Rodent |
19 | OL | Onychomys | leucogaster | Rodent |
20 | OT | Onychomys | torridus | Rodent |
21 | OX | Onychomys | sp. | Rodent |
22 | PB | Chaetodipus | baileyi | Rodent |
23 | PC | Pipilo | chlorurus | Bird |
24 | PE | Peromyscus | eremicus | Rodent |
25 | PF | Perognathus | flavus | Rodent |
26 | PG | Pooecetes | gramineus | Bird |
27 | PH | Perognathus | hispidus | Rodent |
28 | PI | Chaetodipus | intermedius | Rodent |
29 | PL | Peromyscus | leucopus | Rodent |
30 | PM | Peromyscus | maniculatus | Rodent |
31 | PP | Chaetodipus | penicillatus | Rodent |
32 | PU | Pipilo | fuscus | Bird |
33 | PX | Chaetodipus | sp. | Rodent |
34 | RF | Reithrodontomys | fulvescens | Rodent |
35 | RM | Reithrodontomys | megalotis | Rodent |
36 | RO | Reithrodontomys | montanus | Rodent |
37 | RX | Reithrodontomys | sp. | Rodent |
38 | SA | Sylvilagus | audubonii | Rabbit |
39 | SB | Spizella | breweri | Bird |
40 | SC | Sceloporus | clarki | Reptile |
41 | SF | Sigmodon | fulviventer | Rodent |
42 | SH | Sigmodon | hispidus | Rodent |
43 | SO | Sigmodon | ochrognathus | Rodent |
44 | SS | Spermophilus | spilosoma | Rodent |
45 | ST | Spermophilus | tereticaudus | Rodent |
46 | SU | Sceloporus | undulatus | Reptile |
47 | SX | Sigmodon | sp. | Rodent |
48 | UL | Lizard | sp. | Reptile |
49 | UP | Pipilo | sp. | Bird |
50 | UR | Rodent | sp. | Rodent |
51 | US | Sparrow | sp. | Bird |
52 | ZL | Zonotrichia | leucophrys | Bird |
53 | ZM | Zenaida | macroura | Bird |
We can see that the species dataframe includes a genus, species, and taxon for each species_id. This is much more useful than the species_id included in the original dataframe–how can we add that data to our surveys dataframe? Adding it by hand would be tedious and error prone. Fortunately, pandas provides the pd.merge()
function to join two dataframes.
Managing repetitive data
Why store species data in a separate table in the first place? Species information is repetitive: Every observation of the same species has the same genus, species, and taxa. Storing it in the original survey table would require including that data in every record, increasing the complexity of the table and creating the possibility of errors. Storing that data in a separate table means we only have to enter and validate it once. A tool like pandas then allows us to access that data when we need it.
To merge the surveys and species dataframes, use:
PYTHON
= pd.merge(surveys, species)
merged merged
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN | Neotoma | albigula | Rodent |
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN | Neotoma | albigula | Rodent |
2 | 22 | 7 | 17 | 1977 | 15 | NL | F | 31.0 | NaN | Neotoma | albigula | Rodent |
3 | 38 | 7 | 17 | 1977 | 17 | NL | M | 33.0 | NaN | Neotoma | albigula | Rodent |
4 | 72 | 8 | 19 | 1977 | 2 | NL | M | 31.0 | NaN | Neotoma | albigula | Rodent |
… | … | … | … | … | … | … | … | … | … | … | … | … |
34781 | 28988 | 12 | 23 | 1998 | 6 | CT | NaN | NaN | NaN | Cnemidophorus | tigris | Reptile |
34782 | 35512 | 12 | 31 | 2002 | 11 | US | NaN | NaN | NaN | Sparrow | sp. | Bird |
34783 | 35513 | 12 | 31 | 2002 | 11 | US | NaN | NaN | NaN | Sparrow | sp. | Bird |
34784 | 35528 | 12 | 31 | 2002 | 13 | US | NaN | NaN | NaN | Sparrow | sp. | Bird |
34785 | 35544 | 12 | 31 | 2002 | 15 | US | NaN | NaN | NaN | Sparrow | sp. | Bird |
34786 rows × 12 columns
Following the merge, the genus, species, and taxa columns have all been added to the survey dataframe. We can now use those columns to filter and summarize our data.
Joins
The pd.merge()
method is equivalent to the JOIN operation in SQL
Challenge
Filter the merged dataframe to show the genus, the species name, and the weight for every individual captured at the site
PYTHON
"genus", "species", "weight"]] merged[[
genus | species | weight | |
---|---|---|---|
0 | Neotoma | albigula | NaN |
1 | Neotoma | albigula | NaN |
2 | Neotoma | albigula | NaN |
3 | Neotoma | albigula | NaN |
4 | Neotoma | albigula | NaN |
… | … | … | … |
34781 | Cnemidophorus | tigris | NaN |
34782 | Sparrow | sp. | NaN |
34783 | Sparrow | sp. | NaN |
34784 | Sparrow | sp. | NaN |
34785 | Sparrow | sp. | NaN |
34786 rows × 3 columns
In the example above, we didn’t provide any information about how we wanted to merge the dataframes together, so pandas used its default arguments to make an educated guess. It looked at the columns in each of the dataframes, then merged them based on the columns that appear in both. Here, the only shared name is species_id, so that’s the column pandas used to merge. For more complex tables, we may want to specify the columns are used for merging. We can do so by passing one or more column names using the on keyword argument:
PYTHON
="species_id") pd.merge(surveys, species, on
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN | Neotoma | albigula | Rodent |
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN | Neotoma | albigula | Rodent |
2 | 22 | 7 | 17 | 1977 | 15 | NL | F | 31.0 | NaN | Neotoma | albigula | Rodent |
3 | 38 | 7 | 17 | 1977 | 17 | NL | M | 33.0 | NaN | Neotoma | albigula | Rodent |
4 | 72 | 8 | 19 | 1977 | 2 | NL | M | 31.0 | NaN | Neotoma | albigula | Rodent |
… | … | … | … | … | … | … | … | … | … | … | … | … |
34781 | 28988 | 12 | 23 | 1998 | 6 | CT | NaN | NaN | NaN | Cnemidophorus | tigris | Reptile |
34782 | 35512 | 12 | 31 | 2002 | 11 | US | NaN | NaN | NaN | Sparrow | sp. | Bird |
34783 | 35513 | 12 | 31 | 2002 | 11 | US | NaN | NaN | NaN | Sparrow | sp. | Bird |
34784 | 35528 | 12 | 31 | 2002 | 13 | US | NaN | NaN | NaN | Sparrow | sp. | Bird |
34785 | 35544 | 12 | 31 | 2002 | 15 | US | NaN | NaN | NaN | Sparrow | sp. | Bird |
34786 rows × 12 columns
Challenge
Compare the number of rows in the original and merged survey dataframes. How do they differ? Why do you think that might be?
Hint: Use pd.unique()
method to look at the species_id column in each dataframe.
PYTHON
"species_id"].sort_values()) pd.unique(surveys[
OUTPUT
array(['AB', 'AH', 'AS', 'BA', 'CB', 'CM', 'CQ', 'CS', 'CT', 'CU', 'CV',
'DM', 'DO', 'DS', 'DX', 'NL', 'OL', 'OT', 'OX', 'PB', 'PC', 'PE',
'PF', 'PG', 'PH', 'PI', 'PL', 'PM', 'PP', 'PU', 'PX', 'RF', 'RM',
'RO', 'RX', 'SA', 'SC', 'SF', 'SH', 'SO', 'SS', 'ST', 'SU', 'UL',
'UP', 'UR', 'US', 'ZL', nan], dtype=object)
PYTHON
"species_id"].sort_values()) pd.unique(species[
OUTPUT
array(['AB', 'AH', 'AS', 'BA', 'CB', 'CM', 'CQ', 'CS', 'CT', 'CU', 'CV',
'DM', 'DO', 'DS', 'DX', 'EO', 'GS', 'NL', 'NX', 'OL', 'OT', 'OX',
'PB', 'PC', 'PE', 'PF', 'PG', 'PH', 'PI', 'PL', 'PM', 'PP', 'PU',
'PX', 'RF', 'RM', 'RO', 'RX', 'SA', 'SB', 'SC', 'SF', 'SH', 'SO',
'SS', 'ST', 'SU', 'SX', 'UL', 'UP', 'UR', 'US', 'ZL', 'ZM'],
dtype=object)
Some records in the surveys dataframe do not specify a species. By default, only records with a value that occurs in both the surveys and species dataframes appear in the merged dataframe, so rows without a species_id are excluded.
The built-in set
type can be used to quickly assess differences like this:
PYTHON
set(surveys["species_id"]) - set(species["species_id"])
OUTPUT
{nan}
In practice, the values in the columns used to join two dataframes may not align exactly. Above, the surveys dataframe contains a few hundred rows where species_id is NaN. These are the rows that were dropped when the dataframes were merged.
By default, pd.merge()
performs an inner join. This means that a row will only appear if the value in the shared column appears in both of the datasets being merged. In this case, that means that survey observations that don’t have a species_id or have a species_id that does not appear in the species dataframe will be dropped.
This is not always desirable behavior. Fortunately, pandas supports additional types of merges:
- Inner: Include all rows with common values in the join column. This is the default behavior.
- Left: Include all rows from the left dataframe. Columns from the right dataframe are populated if a common value exists and set to NaN if not.
- Right: Include all rows from the right dataframe. Columns from the left dataframe are populated if a common value exists and set to NaN if not.
- Outer: Includes all rows from both dataframes
We want to keep all of our observations, so let’s do a left join instead. To specify the type of merge, we use the how keyword argument:
PYTHON
="left") pd.merge(surveys, species, how
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN | Neotoma | albigula | Rodent |
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN | Neotoma | albigula | Rodent |
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN | Dipodomys | merriami | Rodent |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN | Dipodomys | merriami | Rodent |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN | Dipodomys | merriami | Rodent |
… | … | … | … | … | … | … | … | … | … | … | … | … |
35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN | Ammospermophilus | harrisi | Rodent |
35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN | Ammospermophilus | harrisi | Rodent |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 | Reithrodontomys | megalotis | Rodent |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 | Dipodomys | ordii | Rodent |
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
35549 rows × 12 columns
Now all 35,549 rows appear in the merged dataframe.
Appending rows to a dataframe
Merges address the case where information about the same set of observations is spread across multiple files. What about when the observations themselves are split into more than one file? For a survey like the one we’ve been looking at in this lesson, we might get a new file once a year with the same columns but a completely new set of observations. How can we add those new observations to our dataframe?
We’ll simulate this operation by splitting data from two different years, 2001 and 2002, into separate dataframes. We can do this using conditionals, as we saw in lesson 3:
PYTHON
= surveys[surveys["year"] == 2001].copy()
surveys_2001 surveys_2001
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
31710 | 31711 | 1 | 21 | 2001 | 1 | PB | F | 26.0 | 25.0 |
31711 | 31712 | 1 | 21 | 2001 | 1 | DM | M | 37.0 | 43.0 |
31712 | 31713 | 1 | 21 | 2001 | 1 | PB | M | 29.0 | 44.0 |
31713 | 31714 | 1 | 21 | 2001 | 1 | DO | M | 34.0 | 53.0 |
31714 | 31715 | 1 | 21 | 2001 | 2 | OT | M | 20.0 | 27.0 |
… | … | … | … | … | … | … | … | … | … |
33315 | 33316 | 12 | 16 | 2001 | 11 | NaN | NaN | NaN | NaN |
33316 | 33317 | 12 | 16 | 2001 | 13 | NaN | NaN | NaN | NaN |
33317 | 33318 | 12 | 16 | 2001 | 14 | NaN | NaN | NaN | NaN |
33318 | 33319 | 12 | 16 | 2001 | 15 | NaN | NaN | NaN | NaN |
33319 | 33320 | 12 | 16 | 2001 | 16 | NaN | NaN | NaN | NaN |
1610 rows × 9 columns
PYTHON
= surveys[surveys["year"] == 2002].copy()
surveys_2002 surveys_2002
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
33320 | 33321 | 1 | 12 | 2002 | 1 | DM | M | 38.0 | 44.0 |
33321 | 33322 | 1 | 12 | 2002 | 1 | DO | M | 37.0 | 58.0 |
33322 | 33323 | 1 | 12 | 2002 | 1 | PB | M | 28.0 | 45.0 |
33323 | 33324 | 1 | 12 | 2002 | 1 | AB | NaN | NaN | NaN |
33324 | 33325 | 1 | 12 | 2002 | 1 | DO | M | 35.0 | 29.0 |
… | … | … | … | … | … | … | … | … | … |
35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 |
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN |
2229 rows × 9 columns
We now have two different dataframes with the same columns but different data, one with 1,610 rows, the other with 2,229 rows. We can combine them into a new dataframe using pd.concat()
, which stacks the dataframes vertically (that is, it appends records from the 2002 dataset to the 2001 dataset). This method accepts a list and will concatenate each item moving from left to right. We’re only combining two dataframes here but could do more if needed.
PYTHON
= pd.concat([surveys_2001, surveys_2002])
surveys_2001_2002 surveys_2001_2002
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
31710 | 31711 | 1 | 21 | 2001 | 1 | PB | F | 26.0 | 25.0 |
31711 | 31712 | 1 | 21 | 2001 | 1 | DM | M | 37.0 | 43.0 |
31712 | 31713 | 1 | 21 | 2001 | 1 | PB | M | 29.0 | 44.0 |
31713 | 31714 | 1 | 21 | 2001 | 1 | DO | M | 34.0 | 53.0 |
31714 | 31715 | 1 | 21 | 2001 | 2 | OT | M | 20.0 | 27.0 |
… | … | … | … | … | … | … | … | … | … |
35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 |
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN |
3839 rows × 9 columns
The combined dataframe includes all rows from both dataframes.
In some cases, the exact columns may change from year to year even within the same project. For example, researchers may decide to add an additional column to track a new piece of data or to provide a quality check. If a column is present in only one dataset, you can still concatenate the datasets. Any column that does not appear in a given dataset will be set to NaN for those rows in the combined dataframe.
To illustrate this, we’ll add a validated column to the 2002 survey, then re-run pd.concat()
:
PYTHON
"validated"] = True
surveys_2002[
= pd.concat([surveys_2001, surveys_2002])
surveys_2001_2002 surveys_2001_2002
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | validated | |
---|---|---|---|---|---|---|---|---|---|---|
31710 | 31711 | 1 | 21 | 2001 | 1 | PB | F | 26.0 | 25.0 | NaN |
31711 | 31712 | 1 | 21 | 2001 | 1 | DM | M | 37.0 | 43.0 | NaN |
31712 | 31713 | 1 | 21 | 2001 | 1 | PB | M | 29.0 | 44.0 | NaN |
31713 | 31714 | 1 | 21 | 2001 | 1 | DO | M | 34.0 | 53.0 | NaN |
31714 | 31715 | 1 | 21 | 2001 | 2 | OT | M | 20.0 | 27.0 | NaN |
… | … | … | … | … | … | … | … | … | … | … |
35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN | True |
35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN | True |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 | True |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 | True |
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN | True |
3839 rows × 10 columns
As expected, the validated column has a value of NaN for the 2001 data in the combined dataframe.
Joining and splitting columns
Sometimes we’d like to combine values from two or more columns into a single column. For example, we might want to refer to the species in each record by both its genus and species names. In Python, we use the +
operator to concatenate (or join) strings, and pandas works the same way:
PYTHON
"genus_species"] = species["genus"] + " " + species["species"]
species["genus_species"] species[
OUTPUT
0 Amphispiza bilineata
1 Ammospermophilus harrisi
2 Ammodramus savannarum
3 Baiomys taylori
4 Campylorhynchus brunneicapillus
5 Calamospiza melanocorys
6 Callipepla squamata
7 Crotalus scutalatus
8 Cnemidophorus tigris
9 Cnemidophorus uniparens
10 Crotalus viridis
11 Dipodomys merriami
12 Dipodomys ordii
13 Dipodomys spectabilis
14 Dipodomys sp.
15 Eumeces obsoletus
16 Gambelia silus
17 Neotoma albigula
18 Neotoma sp.
19 Onychomys leucogaster
20 Onychomys torridus
21 Onychomys sp.
22 Chaetodipus baileyi
23 Pipilo chlorurus
24 Peromyscus eremicus
25 Perognathus flavus
26 Pooecetes gramineus
27 Perognathus hispidus
28 Chaetodipus intermedius
29 Peromyscus leucopus
30 Peromyscus maniculatus
31 Chaetodipus penicillatus
32 Pipilo fuscus
33 Chaetodipus sp.
34 Reithrodontomys fulvescens
35 Reithrodontomys megalotis
36 Reithrodontomys montanus
37 Reithrodontomys sp.
38 Sylvilagus audubonii
39 Spizella breweri
40 Sceloporus clarki
41 Sigmodon fulviventer
42 Sigmodon hispidus
43 Sigmodon ochrognathus
44 Spermophilus spilosoma
45 Spermophilus tereticaudus
46 Sceloporus undulatus
47 Sigmodon sp.
48 Lizard sp.
49 Pipilo sp.
50 Rodent sp.
51 Sparrow sp.
52 Zonotrichia leucophrys
53 Zenaida macroura
Name: genus_species, dtype: object
Note that the +
operator is also used to add numeric columns. In Python, the same operator can be used to perform different operations for different data types (but keep reading for an important caveat.)
Another common need is to join or split dates. In the ecology dataset, the date is split across year, month, and day columns. However, pandas has a special data type, datetime64
, for representing dates that is useful for plotting, comparing, and resampling time series data. To make use of that functionality, we can concatenate the date columns and convert them to a datetime object. For clarity, we’ll use an unambiguous date format: YYYY-MM-DD.
We need to perform an additional step before combining the date columns. Year, month, and day are all stored as integers in our dataframe (specifically, they use the int64
data type). If we try to concatenate them as they are, we’ll receive an error. This is because the +
operator only works when each object has a similar type, that is, all objects are either the same type or can be coerced to a common type (like int
and float
, which are distinct types that generally play well together). In this case, the columns are integers and the hyphens are strings. pandas cannot determine exactly how the user wants to combine the values, so it gives an error.
To resolve the error, we can use the astype()
method to convert each column to a string before combining the columns:
PYTHON
= surveys["year"].astype(str)
year = surveys["month"].astype(str)
month = surveys["day"].astype(str)
day
"date"] = year + "-" + month + "-" + day
surveys["date"] surveys[
OUTPUT
0 1977-7-16
1 1977-7-16
2 1977-7-16
3 1977-7-16
4 1977-7-16
...
35544 2002-12-31
35545 2002-12-31
35546 2002-12-31
35547 2002-12-31
35548 2002-12-31
Name: date, Length: 35549, dtype: object
Note that some of the dates look a little funny because single-digit days and months do not include a leading zero. For example, in the first row we have 1977-7-16 instead of 1977-07-16. This is usually not a big deal but can be neatened up using the str
accessor.
In pandas, an accessor is an attribute that provides additional functionality to an object. Here, the str
accessor allows us to access many of the methods from the built-in str
data type, including zfill()
, which pads a string to a given length by adding zeroes to the start of the string:
PYTHON
= "2"
text 3) text.zfill(
OUTPUT
'002'
Using the str
accessor, we can use that method to zero-pad the data in a Series
:
PYTHON
# Pad month and day to two characters
= month.str.zfill(2)
month = day.str.zfill(2)
day
"date"] = year + "-" + month + "-" + day
surveys["date"] surveys[
OUTPUT
0 1977-07-16
1 1977-07-16
2 1977-07-16
3 1977-07-16
4 1977-07-16
...
35544 2002-12-31
35545 2002-12-31
35546 2002-12-31
35547 2002-12-31
35548 2002-12-31
Name: date, Length: 35549, dtype: object
The month and date values in date are now padded to a length of two, allowing us to create a well-formed YYYY-MM-DD date string. Other string methods, like upper()
and lower()
, can be used in the same way.
Before we convert the date column to a datetime, we’re going to use the date string to show the opposite operation: Splitting a value stored in one column into multiple columns. One way to do this in pandas is to use str.split()
, which splits each value in a series based on a delimiter, a character used as a boundary between parts of a string. Here, a hyphen is used to delimit the year, month, and date in each date. By splitting the column on a hyphen, we can extract each of those components into its own column. We also pass True
to the expand keyword argument, which makes the str.split()
method return a dataframe:
PYTHON
"date"].str.split("-", expand=True) surveys[
0 | 1 | 2 | |
---|---|---|---|
0 | 1977 | 07 | 16 |
1 | 1977 | 07 | 16 |
2 | 1977 | 07 | 16 |
3 | 1977 | 07 | 16 |
4 | 1977 | 07 | 16 |
… | … | … | … |
35544 | 2002 | 12 | 31 |
35545 | 2002 | 12 | 31 |
35546 | 2002 | 12 | 31 |
35547 | 2002 | 12 | 31 |
35548 | 2002 | 12 | 31 |
35549 rows × 3 columns
Now let’s go ahead and convert our date column into a datetime object using pd.to_datetime()
:
PYTHON
"date"] = pd.to_datetime(surveys["date"])
surveys["date"] surveys[
OUTPUT
0 1977-07-16
1 1977-07-16
2 1977-07-16
3 1977-07-16
4 1977-07-16
...
35544 2002-12-31
35545 2002-12-31
35546 2002-12-31
35547 2002-12-31
35548 2002-12-31
Name: date, Length: 35549, dtype: datetime64[ns]
Challenge
pandas can help us ask specific questions which we want to answer about our data. The real skill is to know how to translate our scientific questions into a sensible approach (and subsequently visualize and interpret our results).
Try using pandas to answer the following questions.
- How many specimens of each sex are there for each year, including those whose sex is unknown?
- What is the average weight of each taxa?
- What are the minimum, maximum and average weight for each species of Rodent?
- What is the average hindfoot length for male and female rodent of each species? Is there a male/female difference?
- What is the average weight of each rodent species over the course of the years? Is there any noticeable trend for any of the species?
How many specimens of each sex are there for each year, including those whose sex is unknown?
PYTHON
# Fill in NaN values in sex with U
"sex"] = surveys["sex"].fillna("U")
surveys[
# Count records by sex
= surveys.groupby(["year", "sex"])["record_id"].count()
result
# Use the option_context context manager to show all rows
with pd.option_context("display.max_rows", None):
print(result)
OUTPUT
year sex
1977 F 204
M 214
U 85
1978 F 503
M 433
U 112
1979 F 327
M 324
U 68
1980 F 605
M 727
U 83
1981 F 631
M 745
U 96
1982 F 823
M 1027
U 128
1983 F 771
M 797
U 105
1984 F 445
M 443
U 93
1985 F 636
M 716
U 86
1986 F 414
M 455
U 73
1987 F 677
M 862
U 132
1988 F 604
M 737
U 128
1989 F 678
M 780
U 111
1990 F 581
M 636
U 94
1991 F 606
M 637
U 104
1992 F 443
M 477
U 118
1993 F 285
M 380
U 85
1994 F 243
M 327
U 98
1995 F 574
M 534
U 114
1996 F 725
M 910
U 71
1997 F 1071
M 1357
U 65
1998 F 717
M 798
U 95
1999 F 545
M 530
U 60
2000 F 690
M 779
U 83
2001 F 743
M 744
U 123
2002 F 1149
M 979
U 101
Name: record_id, dtype: int64
What is the average weight of each taxa?
PYTHON
# Create the merged dataframe
= pd.merge(surveys, species, how="left")
merged
# Group by taxa
= merged.groupby("taxa")
grouped
# Calculate the min, max, and mean weight
"weight"].mean() grouped[
OUTPUT
taxa
Bird NaN
Rabbit NaN
Reptile NaN
Rodent 42.672428
Name: weight, dtype: float64
What are the minimum, maximum and average weight for each species of Rodent?
PYTHON
# Create the merged dataframe
= pd.merge(surveys, species, how="left")
merged
# Limit merged dataframe to rodents
= merged[merged["taxa"] == "Rodent"]
rodents
# Group rodents by species
= rodents.groupby("species_id")
grouped
# Calculate the min, max, and mean weight
"weight": ["min", "max", "mean"]}) grouped.agg({
weight | |||
---|---|---|---|
min | max | mean | |
species_id | |||
AH | NaN | NaN | NaN |
BA | 6.0 | 18.0 | 8.600000 |
DM | 10.0 | 66.0 | 43.157864 |
DO | 12.0 | 76.0 | 48.870523 |
DS | 12.0 | 190.0 | 120.130546 |
DX | NaN | NaN | NaN |
NL | 30.0 | 280.0 | 159.245660 |
OL | 10.0 | 56.0 | 31.575258 |
OT | 5.0 | 46.0 | 24.230556 |
OX | 18.0 | 24.0 | 21.000000 |
PB | 12.0 | 55.0 | 31.735943 |
PE | 8.0 | 40.0 | 21.586508 |
PF | 4.0 | 25.0 | 7.923127 |
PH | 18.0 | 48.0 | 31.064516 |
PI | 17.0 | 21.0 | 19.250000 |
PL | 8.0 | 27.0 | 19.138889 |
PM | 7.0 | 49.0 | 21.364155 |
PP | 4.0 | 74.0 | 17.173942 |
PX | 18.0 | 20.0 | 19.000000 |
RF | 9.0 | 20.0 | 13.386667 |
RM | 4.0 | 29.0 | 10.585010 |
RO | 8.0 | 13.0 | 10.250000 |
RX | 11.0 | 20.0 | 15.500000 |
SF | 24.0 | 199.0 | 58.878049 |
SH | 16.0 | 140.0 | 73.148936 |
SO | 15.0 | 105.0 | 55.414634 |
SS | 57.0 | 130.0 | 93.500000 |
ST | NaN | NaN | NaN |
UR | NaN | NaN | NaN |
What is the average hindfoot length for male and female rodent of each species? Is there a male/female difference?
PYTHON
# Create the merged dataframe
= pd.merge(surveys, species, how="left")
merged
# Limit merged dataframe to rodents
= merged[merged["taxa"] == "Rodent"]
rodents
# Group rodents by species and sex
= rodents.groupby(["species_id", "sex"])
grouped
# Calculate the mean hindfoot length, plus count and standard deviation
# to better assess the question
with pd.option_context("display.max_rows", None):
print(grouped["hindfoot_length"].agg(["count", "mean", "std"]))
OUTPUT
count mean std
species_id sex
AH M 0 NaN NaN
U 2 33.000000 2.828427
BA F 31 13.161290 1.529636
M 14 12.642857 2.097880
U 0 NaN NaN
DM F 4302 35.712692 1.433067
M 5658 36.188229 1.455396
U 12 35.583333 0.996205
DO F 1244 35.486334 1.726414
M 1640 35.698780 1.611656
U 3 36.000000 0.000000
DS F 1046 49.583174 1.973573
M 1083 50.301939 2.128120
U 3 50.000000 1.732051
DX U 0 NaN NaN
NL F 620 32.024194 1.669181
M 452 32.577434 1.903603
U 2 52.000000 25.455844
OL F 438 20.287671 1.101022
M 480 20.758333 1.651195
U 2 20.000000 0.000000
OT F 1019 20.281649 1.519412
M 1115 20.257399 1.087354
U 5 19.600000 1.140175
OX F 3 19.666667 1.154701
M 5 18.800000 3.346640
U 0 NaN NaN
PB F 1645 25.882675 0.928842
M 1213 26.431987 1.444130
U 6 26.166667 0.983192
PE F 544 20.272059 1.156084
M 666 20.132132 1.177721
U 2 20.500000 0.707107
PF F 721 15.550624 1.381627
M 770 15.620779 1.142208
U 2 13.000000 4.242641
PH F 20 25.900000 1.293709
M 11 25.545455 1.752920
PI M 8 22.500000 0.534522
U 1 20.000000 NaN
PL F 16 19.750000 2.720294
M 19 20.263158 0.933459
U 1 20.000000 NaN
PM F 361 20.279778 1.162782
M 483 20.530021 1.275621
U 3 21.333333 4.041452
PP F 1574 21.676620 1.316529
M 1444 21.838643 1.201006
U 9 20.888889 2.027588
PX F 1 19.000000 NaN
M 1 20.000000 NaN
U 0 NaN NaN
RF F 55 17.527273 0.813191
M 18 17.500000 0.985184
RM F 1133 16.391880 1.123946
M 1296 16.491512 1.134202
U 13 15.846154 2.444250
RO F 4 14.750000 0.957427
M 4 16.000000 1.154701
RX M 2 18.500000 2.121320
SF F 16 27.500000 2.683282
M 23 26.347826 3.524056
U 2 24.500000 0.707107
SH F 71 29.028169 2.298893
M 60 27.983333 2.801281
U 0 NaN NaN
SO F 30 25.633333 4.139660
M 11 25.727273 1.954017
SS F 0 NaN NaN
M 0 NaN NaN
U 0 NaN NaN
ST U 0 NaN NaN
UR U 0 NaN NaN
What is the average weight of each rodent species over the course of the years? Is there any noticeable trend for any of the species?
PYTHON
# Create the merged dataframe
= pd.merge(surveys, species, how="left")
merged
# Limit merged dataframe to rodents
= merged[merged["taxa"] == "Rodent"]
rodents
# Group rodents by species and year
= rodents.groupby(["species_id", "year"])
grouped
# Calculate the mean weight by year
= grouped["weight"].mean()
result
# Use the option_context context manager to show all rows
with pd.option_context("display.max_rows", None):
print(result)
OUTPUT
species_id year
AH 1978 NaN
1982 NaN
1983 NaN
1984 NaN
1985 NaN
1986 NaN
1987 NaN
1988 NaN
1989 NaN
1990 NaN
1991 NaN
1992 NaN
1993 NaN
1994 NaN
1995 NaN
1996 NaN
1997 NaN
1998 NaN
1999 NaN
2000 NaN
2001 NaN
2002 NaN
BA 1989 7.000000
1990 8.000000
1991 9.240000
1992 7.833333
DM 1977 41.141304
1978 40.795455
1979 43.507317
1980 44.136082
1981 44.032907
1982 41.975042
1983 40.944551
1984 40.765306
1985 41.507645
1986 43.361596
1987 43.232609
1988 43.397790
1989 44.349206
1990 41.769912
1991 43.148338
1992 43.877966
1993 43.373984
1994 42.373288
1995 44.806147
1996 44.506173
1997 44.708551
1998 43.131403
1999 43.945402
2000 43.126638
2001 45.442177
2002 46.168317
DO 1977 42.666667
1978 45.000000
1979 45.862069
1980 48.058824
1981 49.111111
1982 47.919643
1983 47.150602
1984 48.415094
1985 47.956731
1986 49.372727
1987 50.087379
1988 51.463768
1989 51.025641
1990 48.512048
1991 49.695652
1992 48.367816
1993 48.461538
1994 47.750000
1995 49.592593
1996 47.234940
1997 48.177866
1998 49.731183
1999 48.012048
2000 49.224719
2001 52.233766
2002 49.514403
DS 1977 121.437500
1978 115.198653
1979 111.796954
1980 120.725664
1981 125.818444
1982 115.647929
1983 122.033088
1984 124.082192
1985 124.326316
1986 128.071429
1987 126.383838
1988 129.490566
1989 121.896552
1990 121.187500
1991 113.000000
1992 112.352941
1993 105.722222
1994 106.625000
1997 111.000000
1998 116.000000
1999 120.714286
DX 1980 NaN
1983 NaN
1986 NaN
1988 NaN
1991 NaN
1992 NaN
1993 NaN
1994 NaN
1995 NaN
1996 NaN
1998 NaN
1999 NaN
2000 NaN
2001 NaN
2002 NaN
NL 1977 NaN
1978 184.656250
1979 138.000000
1980 158.982143
1981 165.903226
1982 160.613208
1983 156.691489
1984 150.095238
1985 148.829268
1986 159.724138
1987 158.840000
1988 163.104167
1989 151.278689
1990 154.275862
1991 148.785714
1992 139.000000
1993 127.391304
1994 186.875000
1995 155.833333
1996 148.000000
1997 150.688889
1998 159.466667
1999 182.200000
2000 179.307692
2001 167.851064
2002 182.159091
OL 1977 21.666667
1978 31.027027
1979 33.717391
1980 33.107143
1981 33.296296
1982 35.590361
1983 34.584615
1984 32.550725
1985 32.108696
1986 30.880952
1987 30.779221
1988 30.306122
1989 31.947368
1990 31.200000
1991 28.171429
1992 28.454545
1993 27.545455
1994 21.900000
1995 27.296296
1996 27.769231
1997 33.625000
2002 25.428571
OT 1977 21.000000
1978 23.220000
1979 23.075758
1980 24.083333
1981 24.415385
1982 24.925532
1983 24.697674
1984 22.416667
1985 22.064516
1986 21.500000
1987 23.069767
1988 24.070588
1989 24.663366
1990 23.675325
1991 24.588235
1992 24.928571
1993 22.363636
1994 25.285714
1995 24.868421
1996 23.453704
1997 24.785156
1998 24.675676
1999 25.723810
2000 25.303448
2001 23.707792
2002 23.833333
OX 1977 21.333333
1978 NaN
1982 24.000000
1984 18.000000
1989 20.000000
1993 NaN
PB 1995 34.000000
1996 32.578947
1997 31.085603
1998 30.082237
1999 31.710037
2000 30.878899
2001 32.798851
2002 32.359447
PE 1977 19.500000
1978 20.428571
1979 20.529412
1980 22.450000
1981 20.558140
1982 21.173077
1983 20.864865
1984 20.210526
1985 20.360000
1986 22.520000
1987 21.625899
1988 22.625668
1989 21.935673
1990 21.671233
1991 21.435484
1992 22.750000
1993 19.230769
1994 18.000000
1995 21.230769
1996 22.027778
1997 20.929825
1998 20.304348
1999 25.428571
2000 21.615385
2001 20.400000
2002 21.719298
PF 1977 7.173913
1978 7.088235
1979 7.529412
1980 7.455556
1981 7.152542
1982 6.918919
1983 6.833333
1984 7.428571
1985 NaN
1986 8.000000
1987 7.500000
1988 8.000000
1989 7.230769
1990 7.117647
1991 7.827586
1992 8.767857
1993 8.238806
1994 7.855072
1995 8.780645
1996 8.393846
1997 8.448649
1998 8.720000
2001 8.538462
2002 7.388889
PH 1984 28.000000
1985 32.666667
1986 39.000000
1988 33.000000
1989 30.500000
1990 31.142857
1992 31.142857
1995 35.333333
1997 22.000000
PI 1992 18.000000
1995 NaN
2001 18.000000
2002 20.000000
PL 1995 22.333333
1996 21.166667
1997 18.789474
1998 16.714286
2000 21.000000
PM 1978 20.666667
1979 23.666667
1980 21.000000
1981 19.933333
1982 21.391304
1983 22.023810
1984 19.545455
1985 19.000000
1986 21.696970
1987 22.297710
1988 21.759259
1989 20.222222
1995 27.375000
1996 20.224490
1997 21.126394
1998 20.591398
1999 22.523810
2000 20.500000
2001 26.666667
2002 19.000000
PP 1977 15.333333
1978 14.869565
1979 15.105263
1980 14.176471
1981 13.950000
1982 16.125000
1983 15.468750
1984 15.307692
1985 15.764706
1986 16.750000
1987 17.840909
1988 18.280000
1989 17.409091
1990 16.029412
1991 17.904762
1992 17.479675
1993 17.954023
1994 17.585714
1995 16.844444
1996 18.095563
1997 18.175000
1998 16.266990
1999 16.521212
2000 16.788618
2001 17.752896
2002 17.018617
PX 1996 NaN
1997 19.000000
1998 NaN
2000 NaN
RF 1982 11.500000
1988 13.818182
1989 13.346939
1990 12.916667
1997 20.000000
RM 1977 10.000000
1978 7.500000
1979 8.333333
1980 10.227273
1981 11.178571
1982 10.436242
1983 9.872000
1984 11.152542
1985 8.371429
1986 10.750000
1987 10.657609
1988 10.322115
1989 10.411311
1990 10.305677
1991 10.498305
1992 10.904762
1993 10.747475
1994 10.675000
1995 12.653846
1996 11.455556
1997 11.230769
1998 13.100000
1999 10.555556
2000 11.400000
2001 11.333333
2002 10.000000
RO 1991 11.000000
2002 10.142857
RX 1995 20.000000
1997 11.000000
SF 1989 54.800000
1990 52.611111
1991 93.166667
1992 43.000000
1993 44.000000
2002 62.166667
SH 1977 NaN
1978 89.000000
1982 79.000000
1983 NaN
1986 55.000000
1987 73.800000
1988 72.806452
1989 76.345455
1990 76.888889
1991 63.000000
1997 50.857143
1999 54.000000
2000 73.375000
2001 79.900000
2002 64.666667
SO 1991 53.909091
1992 55.263158
1993 55.600000
1994 62.333333
1997 54.666667
SS 1977 NaN
1978 130.000000
1979 NaN
1980 NaN
1981 57.000000
1982 NaN
1983 NaN
1984 NaN
1985 NaN
1986 NaN
1987 NaN
1988 NaN
1989 NaN
1990 NaN
1991 NaN
1992 NaN
1993 NaN
1994 NaN
1995 NaN
1996 NaN
1997 NaN
1998 NaN
1999 NaN
2000 NaN
2001 NaN
2002 NaN
ST 1993 NaN
UR 1987 NaN
1988 NaN
1991 NaN
1992 NaN
1994 NaN
2001 NaN
2002 NaN
Name: weight, dtype: float64
Key Points
- Combine two dataframes on one or more common values using
pd.merge()
- Append rows from one dataframe to another using
pd.concat()
- Combine multiple text columns into one using the
+
operator - Use the
str
accessor to use string methods likesplit()
andzfill()
on text columns - Convert date strings to datetime objects using
pd.to_datetime()