Combining Dataframes

Last updated on 2022-07-18 | Edit this page

Overview

Questions

How do we combine data from multiple sources using pandas?
How do we add data to an existing dataframe?
How do we split and combine data columns?

Objectives

Use pd.merge() to add species info to the survey dataset
Use pd.concat() to add additional rows the dataset
Use string methods to combine, split, and modify text columns using the str accessor

Dataframes can be used to organize and group data by common characteristics. Often, we need to combine elements from separate dataframes into one for analysis and visualization. A merge (or join) allows use to combine two dataframes using values common to each. Likewise, we may need to append data collected under different circumstances. In this chapter, we will show how to merge, concatenate, and split data using pandas.

Merging dataframes

The survey dataframe we’ve been using throughout this lesson has a column called species_id. We used this column in the previous lesson to calculate summary statistics about observations of each species. But the species_id is just a two-letter code—what does each code stand for? To find out, we’ll now load both the survey dataset and a second dataset containing more detailed information about the various species observed. Read the second dataframe from a file called species.csv:

PYTHON

import pandas as pd

surveys = pd.read_csv("data/surveys.csv")
species = pd.read_csv("data/species.csv")

species

	species_id	genus	species	taxa
0	AB	Amphispiza	bilineata	Bird
1	AH	Ammospermophilus	harrisi	Rodent
2	AS	Ammodramus	savannarum	Bird
3	BA	Baiomys	taylori	Rodent
4	CB	Campylorhynchus	brunneicapillus	Bird
5	CM	Calamospiza	melanocorys	Bird
6	CQ	Callipepla	squamata	Bird
7	CS	Crotalus	scutalatus	Reptile
8	CT	Cnemidophorus	tigris	Reptile
9	CU	Cnemidophorus	uniparens	Reptile
10	CV	Crotalus	viridis	Reptile
11	DM	Dipodomys	merriami	Rodent
12	DO	Dipodomys	ordii	Rodent
13	DS	Dipodomys	spectabilis	Rodent
14	DX	Dipodomys	sp.	Rodent
15	EO	Eumeces	obsoletus	Reptile
16	GS	Gambelia	silus	Reptile
17	NL	Neotoma	albigula	Rodent
18	NX	Neotoma	sp.	Rodent
19	OL	Onychomys	leucogaster	Rodent
20	OT	Onychomys	torridus	Rodent
21	OX	Onychomys	sp.	Rodent
22	PB	Chaetodipus	baileyi	Rodent
23	PC	Pipilo	chlorurus	Bird
24	PE	Peromyscus	eremicus	Rodent
25	PF	Perognathus	flavus	Rodent
26	PG	Pooecetes	gramineus	Bird
27	PH	Perognathus	hispidus	Rodent
28	PI	Chaetodipus	intermedius	Rodent
29	PL	Peromyscus	leucopus	Rodent
30	PM	Peromyscus	maniculatus	Rodent
31	PP	Chaetodipus	penicillatus	Rodent
32	PU	Pipilo	fuscus	Bird
33	PX	Chaetodipus	sp.	Rodent
34	RF	Reithrodontomys	fulvescens	Rodent
35	RM	Reithrodontomys	megalotis	Rodent
36	RO	Reithrodontomys	montanus	Rodent
37	RX	Reithrodontomys	sp.	Rodent
38	SA	Sylvilagus	audubonii	Rabbit
39	SB	Spizella	breweri	Bird
40	SC	Sceloporus	clarki	Reptile
41	SF	Sigmodon	fulviventer	Rodent
42	SH	Sigmodon	hispidus	Rodent
43	SO	Sigmodon	ochrognathus	Rodent
44	SS	Spermophilus	spilosoma	Rodent
45	ST	Spermophilus	tereticaudus	Rodent
46	SU	Sceloporus	undulatus	Reptile
47	SX	Sigmodon	sp.	Rodent
48	UL	Lizard	sp.	Reptile
49	UP	Pipilo	sp.	Bird
50	UR	Rodent	sp.	Rodent
51	US	Sparrow	sp.	Bird
52	ZL	Zonotrichia	leucophrys	Bird
53	ZM	Zenaida	macroura	Bird

We can see that the species dataframe includes a genus, species, and taxon for each species_id. This is much more useful than the species_id included in the original dataframe–how can we add that data to our surveys dataframe? Adding it by hand would be tedious and error prone. Fortunately, pandas provides the pd.merge() function to join two dataframes.

Managing repetitive data

Why store species data in a separate table in the first place? Species information is repetitive: Every observation of the same species has the same genus, species, and taxa. Storing it in the original survey table would require including that data in every record, increasing the complexity of the table and creating the possibility of errors. Storing that data in a separate table means we only have to enter and validate it once. A tool like pandas then allows us to access that data when we need it.

To merge the surveys and species dataframes, use:

PYTHON

merged = pd.merge(surveys, species)
merged

	record_id	month	day	year	plot_id	species_id	sex	hindfoot_length	weight	genus	species	taxa
0	1	7	16	1977	2	NL	M	32.0	NaN	Neotoma	albigula	Rodent
1	2	7	16	1977	3	NL	M	33.0	NaN	Neotoma	albigula	Rodent
2	22	7	17	1977	15	NL	F	31.0	NaN	Neotoma	albigula	Rodent
3	38	7	17	1977	17	NL	M	33.0	NaN	Neotoma	albigula	Rodent
4	72	8	19	1977	2	NL	M	31.0	NaN	Neotoma	albigula	Rodent
…	…	…	…	…	…	…	…	…	…	…	…	…
34781	28988	12	23	1998	6	CT	NaN	NaN	NaN	Cnemidophorus	tigris	Reptile
34782	35512	12	31	2002	11	US	NaN	NaN	NaN	Sparrow	sp.	Bird
34783	35513	12	31	2002	11	US	NaN	NaN	NaN	Sparrow	sp.	Bird
34784	35528	12	31	2002	13	US	NaN	NaN	NaN	Sparrow	sp.	Bird
34785	35544	12	31	2002	15	US	NaN	NaN	NaN	Sparrow	sp.	Bird

34786 rows × 12 columns

Following the merge, the genus, species, and taxa columns have all been added to the survey dataframe. We can now use those columns to filter and summarize our data.

Joins

The pd.merge() method is equivalent to the JOIN operation in SQL

Challenge

Filter the merged dataframe to show the genus, the species name, and the weight for every individual captured at the site

Show me the solution

PYTHON

merged[["genus", "species", "weight"]]

	genus	species	weight
0	Neotoma	albigula	NaN
1	Neotoma	albigula	NaN
2	Neotoma	albigula	NaN
3	Neotoma	albigula	NaN
4	Neotoma	albigula	NaN
…	…	…	…
34781	Cnemidophorus	tigris	NaN
34782	Sparrow	sp.	NaN
34783	Sparrow	sp.	NaN
34784	Sparrow	sp.	NaN
34785	Sparrow	sp.	NaN

34786 rows × 3 columns

In the example above, we didn’t provide any information about how we wanted to merge the dataframes together, so pandas used its default arguments to make an educated guess. It looked at the columns in each of the dataframes, then merged them based on the columns that appear in both. Here, the only shared name is species_id, so that’s the column pandas used to merge. For more complex tables, we may want to specify the columns are used for merging. We can do so by passing one or more column names using the on keyword argument:

PYTHON

pd.merge(surveys, species, on="species_id")

	record_id	month	day	year	plot_id	species_id	sex	hindfoot_length	weight	genus	species	taxa
0	1	7	16	1977	2	NL	M	32.0	NaN	Neotoma	albigula	Rodent
1	2	7	16	1977	3	NL	M	33.0	NaN	Neotoma	albigula	Rodent
2	22	7	17	1977	15	NL	F	31.0	NaN	Neotoma	albigula	Rodent
3	38	7	17	1977	17	NL	M	33.0	NaN	Neotoma	albigula	Rodent
4	72	8	19	1977	2	NL	M	31.0	NaN	Neotoma	albigula	Rodent
…	…	…	…	…	…	…	…	…	…	…	…	…
34781	28988	12	23	1998	6	CT	NaN	NaN	NaN	Cnemidophorus	tigris	Reptile
34782	35512	12	31	2002	11	US	NaN	NaN	NaN	Sparrow	sp.	Bird
34783	35513	12	31	2002	11	US	NaN	NaN	NaN	Sparrow	sp.	Bird
34784	35528	12	31	2002	13	US	NaN	NaN	NaN	Sparrow	sp.	Bird
34785	35544	12	31	2002	15	US	NaN	NaN	NaN	Sparrow	sp.	Bird

34786 rows × 12 columns

Challenge

Compare the number of rows in the original and merged survey dataframes. How do they differ? Why do you think that might be?

Hint: Use pd.unique() method to look at the species_id column in each dataframe.

Show me the solution

PYTHON

pd.unique(surveys["species_id"].sort_values())

OUTPUT

array(['AB', 'AH', 'AS', 'BA', 'CB', 'CM', 'CQ', 'CS', 'CT', 'CU', 'CV',
       'DM', 'DO', 'DS', 'DX', 'NL', 'OL', 'OT', 'OX', 'PB', 'PC', 'PE',
       'PF', 'PG', 'PH', 'PI', 'PL', 'PM', 'PP', 'PU', 'PX', 'RF', 'RM',
       'RO', 'RX', 'SA', 'SC', 'SF', 'SH', 'SO', 'SS', 'ST', 'SU', 'UL',
       'UP', 'UR', 'US', 'ZL', nan], dtype=object)

PYTHON

pd.unique(species["species_id"].sort_values())

OUTPUT

array(['AB', 'AH', 'AS', 'BA', 'CB', 'CM', 'CQ', 'CS', 'CT', 'CU', 'CV',
       'DM', 'DO', 'DS', 'DX', 'EO', 'GS', 'NL', 'NX', 'OL', 'OT', 'OX',
       'PB', 'PC', 'PE', 'PF', 'PG', 'PH', 'PI', 'PL', 'PM', 'PP', 'PU',
       'PX', 'RF', 'RM', 'RO', 'RX', 'SA', 'SB', 'SC', 'SF', 'SH', 'SO',
       'SS', 'ST', 'SU', 'SX', 'UL', 'UP', 'UR', 'US', 'ZL', 'ZM'],
      dtype=object)

Some records in the surveys dataframe do not specify a species. By default, only records with a value that occurs in both the surveys and species dataframes appear in the merged dataframe, so rows without a species_id are excluded.

The built-in set type can be used to quickly assess differences like this:

PYTHON

set(surveys["species_id"]) - set(species["species_id"])

OUTPUT

{nan}

In practice, the values in the columns used to join two dataframes may not align exactly. Above, the surveys dataframe contains a few hundred rows where species_id is NaN. These are the rows that were dropped when the dataframes were merged.

By default, pd.merge() performs an inner join. This means that a row will only appear if the value in the shared column appears in both of the datasets being merged. In this case, that means that survey observations that don’t have a species_id or have a species_id that does not appear in the species dataframe will be dropped.

This is not always desirable behavior. Fortunately, pandas supports additional types of merges:

Inner: Include all rows with common values in the join column. This is the default behavior.
Left: Include all rows from the left dataframe. Columns from the right dataframe are populated if a common value exists and set to NaN if not.
Right: Include all rows from the right dataframe. Columns from the left dataframe are populated if a common value exists and set to NaN if not.
Outer: Includes all rows from both dataframes

We want to keep all of our observations, so let’s do a left join instead. To specify the type of merge, we use the how keyword argument:

PYTHON

pd.merge(surveys, species, how="left")

	record_id	month	day	year	plot_id	species_id	sex	hindfoot_length	weight	genus	species	taxa
0	1	7	16	1977	2	NL	M	32.0	NaN	Neotoma	albigula	Rodent
1	2	7	16	1977	3	NL	M	33.0	NaN	Neotoma	albigula	Rodent
2	3	7	16	1977	2	DM	F	37.0	NaN	Dipodomys	merriami	Rodent
3	4	7	16	1977	7	DM	M	36.0	NaN	Dipodomys	merriami	Rodent
4	5	7	16	1977	3	DM	M	35.0	NaN	Dipodomys	merriami	Rodent
…	…	…	…	…	…	…	…	…	…	…	…	…
35544	35545	12	31	2002	15	AH	NaN	NaN	NaN	Ammospermophilus	harrisi	Rodent
35545	35546	12	31	2002	15	AH	NaN	NaN	NaN	Ammospermophilus	harrisi	Rodent
35546	35547	12	31	2002	10	RM	F	15.0	14.0	Reithrodontomys	megalotis	Rodent
35547	35548	12	31	2002	7	DO	M	36.0	51.0	Dipodomys	ordii	Rodent
35548	35549	12	31	2002	5	NaN	NaN	NaN	NaN	NaN	NaN	NaN

35549 rows × 12 columns

Now all 35,549 rows appear in the merged dataframe.

Appending rows to a dataframe

Merges address the case where information about the same set of observations is spread across multiple files. What about when the observations themselves are split into more than one file? For a survey like the one we’ve been looking at in this lesson, we might get a new file once a year with the same columns but a completely new set of observations. How can we add those new observations to our dataframe?

We’ll simulate this operation by splitting data from two different years, 2001 and 2002, into separate dataframes. We can do this using conditionals, as we saw in lesson 3:

PYTHON

surveys_2001 = surveys[surveys["year"] == 2001].copy()
surveys_2001

	record_id	month	day	year	plot_id	species_id	sex	hindfoot_length	weight
31710	31711	1	21	2001	1	PB	F	26.0	25.0
31711	31712	1	21	2001	1	DM	M	37.0	43.0
31712	31713	1	21	2001	1	PB	M	29.0	44.0
31713	31714	1	21	2001	1	DO	M	34.0	53.0
31714	31715	1	21	2001	2	OT	M	20.0	27.0
…	…	…	…	…	…	…	…	…	…
33315	33316	12	16	2001	11	NaN	NaN	NaN	NaN
33316	33317	12	16	2001	13	NaN	NaN	NaN	NaN
33317	33318	12	16	2001	14	NaN	NaN	NaN	NaN
33318	33319	12	16	2001	15	NaN	NaN	NaN	NaN
33319	33320	12	16	2001	16	NaN	NaN	NaN	NaN

1610 rows × 9 columns

PYTHON

surveys_2002 = surveys[surveys["year"] == 2002].copy()
surveys_2002

	record_id	month	day	year	plot_id	species_id	sex	hindfoot_length	weight
33320	33321	1	12	2002	1	DM	M	38.0	44.0
33321	33322	1	12	2002	1	DO	M	37.0	58.0
33322	33323	1	12	2002	1	PB	M	28.0	45.0
33323	33324	1	12	2002	1	AB	NaN	NaN	NaN
33324	33325	1	12	2002	1	DO	M	35.0	29.0
…	…	…	…	…	…	…	…	…	…
35544	35545	12	31	2002	15	AH	NaN	NaN	NaN
35545	35546	12	31	2002	15	AH	NaN	NaN	NaN
35546	35547	12	31	2002	10	RM	F	15.0	14.0
35547	35548	12	31	2002	7	DO	M	36.0	51.0
35548	35549	12	31	2002	5	NaN	NaN	NaN	NaN

2229 rows × 9 columns

We now have two different dataframes with the same columns but different data, one with 1,610 rows, the other with 2,229 rows. We can combine them into a new dataframe using pd.concat(), which stacks the dataframes vertically (that is, it appends records from the 2002 dataset to the 2001 dataset). This method accepts a list and will concatenate each item moving from left to right. We’re only combining two dataframes here but could do more if needed.

PYTHON

surveys_2001_2002 = pd.concat([surveys_2001, surveys_2002])
surveys_2001_2002

	record_id	month	day	year	plot_id	species_id	sex	hindfoot_length	weight
31710	31711	1	21	2001	1	PB	F	26.0	25.0
31711	31712	1	21	2001	1	DM	M	37.0	43.0
31712	31713	1	21	2001	1	PB	M	29.0	44.0
31713	31714	1	21	2001	1	DO	M	34.0	53.0
31714	31715	1	21	2001	2	OT	M	20.0	27.0
…	…	…	…	…	…	…	…	…	…
35544	35545	12	31	2002	15	AH	NaN	NaN	NaN
35545	35546	12	31	2002	15	AH	NaN	NaN	NaN
35546	35547	12	31	2002	10	RM	F	15.0	14.0
35547	35548	12	31	2002	7	DO	M	36.0	51.0
35548	35549	12	31	2002	5	NaN	NaN	NaN	NaN

3839 rows × 9 columns

The combined dataframe includes all rows from both dataframes.

In some cases, the exact columns may change from year to year even within the same project. For example, researchers may decide to add an additional column to track a new piece of data or to provide a quality check. If a column is present in only one dataset, you can still concatenate the datasets. Any column that does not appear in a given dataset will be set to NaN for those rows in the combined dataframe.

To illustrate this, we’ll add a validated column to the 2002 survey, then re-run pd.concat():

PYTHON

surveys_2002["validated"] = True

surveys_2001_2002 = pd.concat([surveys_2001, surveys_2002])
surveys_2001_2002

	record_id	month	day	year	plot_id	species_id	sex	hindfoot_length	weight	validated
31710	31711	1	21	2001	1	PB	F	26.0	25.0	NaN
31711	31712	1	21	2001	1	DM	M	37.0	43.0	NaN
31712	31713	1	21	2001	1	PB	M	29.0	44.0	NaN
31713	31714	1	21	2001	1	DO	M	34.0	53.0	NaN
31714	31715	1	21	2001	2	OT	M	20.0	27.0	NaN
…	…	…	…	…	…	…	…	…	…	…
35544	35545	12	31	2002	15	AH	NaN	NaN	NaN	True
35545	35546	12	31	2002	15	AH	NaN	NaN	NaN	True
35546	35547	12	31	2002	10	RM	F	15.0	14.0	True
35547	35548	12	31	2002	7	DO	M	36.0	51.0	True
35548	35549	12	31	2002	5	NaN	NaN	NaN	NaN	True

3839 rows × 10 columns

As expected, the validated column has a value of NaN for the 2001 data in the combined dataframe.

Joining and splitting columns

Sometimes we’d like to combine values from two or more columns into a single column. For example, we might want to refer to the species in each record by both its genus and species names. In Python, we use the + operator to concatenate (or join) strings, and pandas works the same way:

PYTHON

species["genus_species"] = species["genus"] + " " + species["species"]
species["genus_species"]

OUTPUT

0                Amphispiza bilineata
1            Ammospermophilus harrisi
2               Ammodramus savannarum
3                     Baiomys taylori
4     Campylorhynchus brunneicapillus
5             Calamospiza melanocorys
6                 Callipepla squamata
7                 Crotalus scutalatus
8                Cnemidophorus tigris
9             Cnemidophorus uniparens
10                   Crotalus viridis
11                 Dipodomys merriami
12                    Dipodomys ordii
13              Dipodomys spectabilis
14                      Dipodomys sp.
15                  Eumeces obsoletus
16                     Gambelia silus
17                   Neotoma albigula
18                        Neotoma sp.
19              Onychomys leucogaster
20                 Onychomys torridus
21                      Onychomys sp.
22                Chaetodipus baileyi
23                   Pipilo chlorurus
24                Peromyscus eremicus
25                 Perognathus flavus
26                Pooecetes gramineus
27               Perognathus hispidus
28            Chaetodipus intermedius
29                Peromyscus leucopus
30             Peromyscus maniculatus
31           Chaetodipus penicillatus
32                      Pipilo fuscus
33                    Chaetodipus sp.
34         Reithrodontomys fulvescens
35          Reithrodontomys megalotis
36           Reithrodontomys montanus
37                Reithrodontomys sp.
38               Sylvilagus audubonii
39                   Spizella breweri
40                  Sceloporus clarki
41               Sigmodon fulviventer
42                  Sigmodon hispidus
43              Sigmodon ochrognathus
44             Spermophilus spilosoma
45          Spermophilus tereticaudus
46               Sceloporus undulatus
47                       Sigmodon sp.
48                         Lizard sp.
49                         Pipilo sp.
50                         Rodent sp.
51                        Sparrow sp.
52             Zonotrichia leucophrys
53                   Zenaida macroura
Name: genus_species, dtype: object

Note that the + operator is also used to add numeric columns. In Python, the same operator can be used to perform different operations for different data types (but keep reading for an important caveat.)

Another common need is to join or split dates. In the ecology dataset, the date is split across year, month, and day columns. However, pandas has a special data type, datetime64, for representing dates that is useful for plotting, comparing, and resampling time series data. To make use of that functionality, we can concatenate the date columns and convert them to a datetime object. For clarity, we’ll use an unambiguous date format: YYYY-MM-DD.

We need to perform an additional step before combining the date columns. Year, month, and day are all stored as integers in our dataframe (specifically, they use the int64 data type). If we try to concatenate them as they are, we’ll receive an error. This is because the + operator only works when each object has a similar type, that is, all objects are either the same type or can be coerced to a common type (like int and float, which are distinct types that generally play well together). In this case, the columns are integers and the hyphens are strings. pandas cannot determine exactly how the user wants to combine the values, so it gives an error.

To resolve the error, we can use the astype() method to convert each column to a string before combining the columns:

PYTHON

year = surveys["year"].astype(str)
month = surveys["month"].astype(str)
day = surveys["day"].astype(str)

surveys["date"] = year + "-" + month + "-" + day
surveys["date"]

OUTPUT

0         1977-7-16
1         1977-7-16
2         1977-7-16
3         1977-7-16
4         1977-7-16
            ...
35544    2002-12-31
35545    2002-12-31
35546    2002-12-31
35547    2002-12-31
35548    2002-12-31
Name: date, Length: 35549, dtype: object

Note that some of the dates look a little funny because single-digit days and months do not include a leading zero. For example, in the first row we have 1977-7-16 instead of 1977-07-16. This is usually not a big deal but can be neatened up using the str accessor.

In pandas, an accessor is an attribute that provides additional functionality to an object. Here, the str accessor allows us to access many of the methods from the built-in str data type, including zfill(), which pads a string to a given length by adding zeroes to the start of the string:

PYTHON

text = "2"
text.zfill(3)

OUTPUT

'002'

Using the str accessor, we can use that method to zero-pad the data in a Series:

PYTHON

# Pad month and day to two characters
month = month.str.zfill(2)
day = day.str.zfill(2)

surveys["date"] = year + "-" + month + "-" + day
surveys["date"]

OUTPUT

0        1977-07-16
1        1977-07-16
2        1977-07-16
3        1977-07-16
4        1977-07-16
            ...
35544    2002-12-31
35545    2002-12-31
35546    2002-12-31
35547    2002-12-31
35548    2002-12-31
Name: date, Length: 35549, dtype: object

The month and date values in date are now padded to a length of two, allowing us to create a well-formed YYYY-MM-DD date string. Other string methods, like upper() and lower(), can be used in the same way.

Before we convert the date column to a datetime, we’re going to use the date string to show the opposite operation: Splitting a value stored in one column into multiple columns. One way to do this in pandas is to use str.split(), which splits each value in a series based on a delimiter, a character used as a boundary between parts of a string. Here, a hyphen is used to delimit the year, month, and date in each date. By splitting the column on a hyphen, we can extract each of those components into its own column. We also pass True to the expand keyword argument, which makes the str.split() method return a dataframe:

PYTHON

surveys["date"].str.split("-", expand=True)

	0	1	2
0	1977	07	16
1	1977	07	16
2	1977	07	16
3	1977	07	16
4	1977	07	16
…	…	…	…
35544	2002	12	31
35545	2002	12	31
35546	2002	12	31
35547	2002	12	31
35548	2002	12	31

35549 rows × 3 columns

Now let’s go ahead and convert our date column into a datetime object using pd.to_datetime():

PYTHON

surveys["date"] = pd.to_datetime(surveys["date"])
surveys["date"]

OUTPUT

0       1977-07-16
1       1977-07-16
2       1977-07-16
3       1977-07-16
4       1977-07-16
           ...
35544   2002-12-31
35545   2002-12-31
35546   2002-12-31
35547   2002-12-31
35548   2002-12-31
Name: date, Length: 35549, dtype: datetime64[ns]

Challenge

pandas can help us ask specific questions which we want to answer about our data. The real skill is to know how to translate our scientific questions into a sensible approach (and subsequently visualize and interpret our results).

Try using pandas to answer the following questions.

How many specimens of each sex are there for each year, including those whose sex is unknown?
What is the average weight of each taxa?
What are the minimum, maximum and average weight for each species of Rodent?
What is the average hindfoot length for male and female rodent of each species? Is there a male/female difference?
What is the average weight of each rodent species over the course of the years? Is there any noticeable trend for any of the species?

Show me the solution to challenge 1

How many specimens of each sex are there for each year, including those whose sex is unknown?

PYTHON

# Fill in NaN values in sex with U
surveys["sex"] = surveys["sex"].fillna("U")

# Count records by sex
result = surveys.groupby(["year", "sex"])["record_id"].count()

# Use the option_context context manager to show all rows
with pd.option_context("display.max_rows", None):
    print(result)

OUTPUT

year  sex
1977  F       204
      M       214
      U        85
1978  F       503
      M       433
      U       112
1979  F       327
      M       324
      U        68
1980  F       605
      M       727
      U        83
1981  F       631
      M       745
      U        96
1982  F       823
      M      1027
      U       128
1983  F       771
      M       797
      U       105
1984  F       445
      M       443
      U        93
1985  F       636
      M       716
      U        86
1986  F       414
      M       455
      U        73
1987  F       677
      M       862
      U       132
1988  F       604
      M       737
      U       128
1989  F       678
      M       780
      U       111
1990  F       581
      M       636
      U        94
1991  F       606
      M       637
      U       104
1992  F       443
      M       477
      U       118
1993  F       285
      M       380
      U        85
1994  F       243
      M       327
      U        98
1995  F       574
      M       534
      U       114
1996  F       725
      M       910
      U        71
1997  F      1071
      M      1357
      U        65
1998  F       717
      M       798
      U        95
1999  F       545
      M       530
      U        60
2000  F       690
      M       779
      U        83
2001  F       743
      M       744
      U       123
2002  F      1149
      M       979
      U       101
Name: record_id, dtype: int64

Show me the solution to challenge 2

What is the average weight of each taxa?

PYTHON

# Create the merged dataframe
merged = pd.merge(surveys, species, how="left")

# Group by taxa
grouped = merged.groupby("taxa")

# Calculate the min, max, and mean weight
grouped["weight"].mean()

OUTPUT

taxa
Bird             NaN
Rabbit           NaN
Reptile          NaN
Rodent     42.672428
Name: weight, dtype: float64

Show me the solution to challenge 3

What are the minimum, maximum and average weight for each species of Rodent?

PYTHON

# Create the merged dataframe
merged = pd.merge(surveys, species, how="left")

# Limit merged dataframe to rodents
rodents = merged[merged["taxa"] == "Rodent"]

# Group rodents by species
grouped = rodents.groupby("species_id")

# Calculate the min, max, and mean weight
grouped.agg({"weight": ["min", "max", "mean"]})

	weight
	min	max	mean
species_id
AH	NaN	NaN	NaN
BA	6.0	18.0	8.600000
DM	10.0	66.0	43.157864
DO	12.0	76.0	48.870523
DS	12.0	190.0	120.130546
DX	NaN	NaN	NaN
NL	30.0	280.0	159.245660
OL	10.0	56.0	31.575258
OT	5.0	46.0	24.230556
OX	18.0	24.0	21.000000
PB	12.0	55.0	31.735943
PE	8.0	40.0	21.586508
PF	4.0	25.0	7.923127
PH	18.0	48.0	31.064516
PI	17.0	21.0	19.250000
PL	8.0	27.0	19.138889
PM	7.0	49.0	21.364155
PP	4.0	74.0	17.173942
PX	18.0	20.0	19.000000
RF	9.0	20.0	13.386667
RM	4.0	29.0	10.585010
RO	8.0	13.0	10.250000
RX	11.0	20.0	15.500000
SF	24.0	199.0	58.878049
SH	16.0	140.0	73.148936
SO	15.0	105.0	55.414634
SS	57.0	130.0	93.500000
ST	NaN	NaN	NaN
UR	NaN	NaN	NaN

Show me the solution to challenge 4

What is the average hindfoot length for male and female rodent of each species? Is there a male/female difference?

PYTHON

# Create the merged dataframe
merged = pd.merge(surveys, species, how="left")

# Limit merged dataframe to rodents
rodents = merged[merged["taxa"] == "Rodent"]

# Group rodents by species and sex
grouped = rodents.groupby(["species_id", "sex"])

# Calculate the mean hindfoot length, plus count and standard deviation
# to better assess the question
with pd.option_context("display.max_rows", None):
    print(grouped["hindfoot_length"].agg(["count", "mean", "std"]))

OUTPUT

                count       mean        std
species_id sex
AH         M        0        NaN        NaN
           U        2  33.000000   2.828427
BA         F       31  13.161290   1.529636
           M       14  12.642857   2.097880
           U        0        NaN        NaN
DM         F     4302  35.712692   1.433067
           M     5658  36.188229   1.455396
           U       12  35.583333   0.996205
DO         F     1244  35.486334   1.726414
           M     1640  35.698780   1.611656
           U        3  36.000000   0.000000
DS         F     1046  49.583174   1.973573
           M     1083  50.301939   2.128120
           U        3  50.000000   1.732051
DX         U        0        NaN        NaN
NL         F      620  32.024194   1.669181
           M      452  32.577434   1.903603
           U        2  52.000000  25.455844
OL         F      438  20.287671   1.101022
           M      480  20.758333   1.651195
           U        2  20.000000   0.000000
OT         F     1019  20.281649   1.519412
           M     1115  20.257399   1.087354
           U        5  19.600000   1.140175
OX         F        3  19.666667   1.154701
           M        5  18.800000   3.346640
           U        0        NaN        NaN
PB         F     1645  25.882675   0.928842
           M     1213  26.431987   1.444130
           U        6  26.166667   0.983192
PE         F      544  20.272059   1.156084
           M      666  20.132132   1.177721
           U        2  20.500000   0.707107
PF         F      721  15.550624   1.381627
           M      770  15.620779   1.142208
           U        2  13.000000   4.242641
PH         F       20  25.900000   1.293709
           M       11  25.545455   1.752920
PI         M        8  22.500000   0.534522
           U        1  20.000000        NaN
PL         F       16  19.750000   2.720294
           M       19  20.263158   0.933459
           U        1  20.000000        NaN
PM         F      361  20.279778   1.162782
           M      483  20.530021   1.275621
           U        3  21.333333   4.041452
PP         F     1574  21.676620   1.316529
           M     1444  21.838643   1.201006
           U        9  20.888889   2.027588
PX         F        1  19.000000        NaN
           M        1  20.000000        NaN
           U        0        NaN        NaN
RF         F       55  17.527273   0.813191
           M       18  17.500000   0.985184
RM         F     1133  16.391880   1.123946
           M     1296  16.491512   1.134202
           U       13  15.846154   2.444250
RO         F        4  14.750000   0.957427
           M        4  16.000000   1.154701
RX         M        2  18.500000   2.121320
SF         F       16  27.500000   2.683282
           M       23  26.347826   3.524056
           U        2  24.500000   0.707107
SH         F       71  29.028169   2.298893
           M       60  27.983333   2.801281
           U        0        NaN        NaN
SO         F       30  25.633333   4.139660
           M       11  25.727273   1.954017
SS         F        0        NaN        NaN
           M        0        NaN        NaN
           U        0        NaN        NaN
ST         U        0        NaN        NaN
UR         U        0        NaN        NaN

Show me the solution to challenge 5

What is the average weight of each rodent species over the course of the years? Is there any noticeable trend for any of the species?

PYTHON

# Create the merged dataframe
merged = pd.merge(surveys, species, how="left")

# Limit merged dataframe to rodents
rodents = merged[merged["taxa"] == "Rodent"]

# Group rodents by species and year
grouped = rodents.groupby(["species_id", "year"])

# Calculate the mean weight by year
result = grouped["weight"].mean()

# Use the option_context context manager to show all rows
with pd.option_context("display.max_rows", None):
    print(result)

OUTPUT

species_id  year
AH          1978           NaN
            1982           NaN
            1983           NaN
            1984           NaN
            1985           NaN
            1986           NaN
            1987           NaN
            1988           NaN
            1989           NaN
            1990           NaN
            1991           NaN
            1992           NaN
            1993           NaN
            1994           NaN
            1995           NaN
            1996           NaN
            1997           NaN
            1998           NaN
            1999           NaN
            2000           NaN
            2001           NaN
            2002           NaN
BA          1989      7.000000
            1990      8.000000
            1991      9.240000
            1992      7.833333
DM          1977     41.141304
            1978     40.795455
            1979     43.507317
            1980     44.136082
            1981     44.032907
            1982     41.975042
            1983     40.944551
            1984     40.765306
            1985     41.507645
            1986     43.361596
            1987     43.232609
            1988     43.397790
            1989     44.349206
            1990     41.769912
            1991     43.148338
            1992     43.877966
            1993     43.373984
            1994     42.373288
            1995     44.806147
            1996     44.506173
            1997     44.708551
            1998     43.131403
            1999     43.945402
            2000     43.126638
            2001     45.442177
            2002     46.168317
DO          1977     42.666667
            1978     45.000000
            1979     45.862069
            1980     48.058824
            1981     49.111111
            1982     47.919643
            1983     47.150602
            1984     48.415094
            1985     47.956731
            1986     49.372727
            1987     50.087379
            1988     51.463768
            1989     51.025641
            1990     48.512048
            1991     49.695652
            1992     48.367816
            1993     48.461538
            1994     47.750000
            1995     49.592593
            1996     47.234940
            1997     48.177866
            1998     49.731183
            1999     48.012048
            2000     49.224719
            2001     52.233766
            2002     49.514403
DS          1977    121.437500
            1978    115.198653
            1979    111.796954
            1980    120.725664
            1981    125.818444
            1982    115.647929
            1983    122.033088
            1984    124.082192
            1985    124.326316
            1986    128.071429
            1987    126.383838
            1988    129.490566
            1989    121.896552
            1990    121.187500
            1991    113.000000
            1992    112.352941
            1993    105.722222
            1994    106.625000
            1997    111.000000
            1998    116.000000
            1999    120.714286
DX          1980           NaN
            1983           NaN
            1986           NaN
            1988           NaN
            1991           NaN
            1992           NaN
            1993           NaN
            1994           NaN
            1995           NaN
            1996           NaN
            1998           NaN
            1999           NaN
            2000           NaN
            2001           NaN
            2002           NaN
NL          1977           NaN
            1978    184.656250
            1979    138.000000
            1980    158.982143
            1981    165.903226
            1982    160.613208
            1983    156.691489
            1984    150.095238
            1985    148.829268
            1986    159.724138
            1987    158.840000
            1988    163.104167
            1989    151.278689
            1990    154.275862
            1991    148.785714
            1992    139.000000
            1993    127.391304
            1994    186.875000
            1995    155.833333
            1996    148.000000
            1997    150.688889
            1998    159.466667
            1999    182.200000
            2000    179.307692
            2001    167.851064
            2002    182.159091
OL          1977     21.666667
            1978     31.027027
            1979     33.717391
            1980     33.107143
            1981     33.296296
            1982     35.590361
            1983     34.584615
            1984     32.550725
            1985     32.108696
            1986     30.880952
            1987     30.779221
            1988     30.306122
            1989     31.947368
            1990     31.200000
            1991     28.171429
            1992     28.454545
            1993     27.545455
            1994     21.900000
            1995     27.296296
            1996     27.769231
            1997     33.625000
            2002     25.428571
OT          1977     21.000000
            1978     23.220000
            1979     23.075758
            1980     24.083333
            1981     24.415385
            1982     24.925532
            1983     24.697674
            1984     22.416667
            1985     22.064516
            1986     21.500000
            1987     23.069767
            1988     24.070588
            1989     24.663366
            1990     23.675325
            1991     24.588235
            1992     24.928571
            1993     22.363636
            1994     25.285714
            1995     24.868421
            1996     23.453704
            1997     24.785156
            1998     24.675676
            1999     25.723810
            2000     25.303448
            2001     23.707792
            2002     23.833333
OX          1977     21.333333
            1978           NaN
            1982     24.000000
            1984     18.000000
            1989     20.000000
            1993           NaN
PB          1995     34.000000
            1996     32.578947
            1997     31.085603
            1998     30.082237
            1999     31.710037
            2000     30.878899
            2001     32.798851
            2002     32.359447
PE          1977     19.500000
            1978     20.428571
            1979     20.529412
            1980     22.450000
            1981     20.558140
            1982     21.173077
            1983     20.864865
            1984     20.210526
            1985     20.360000
            1986     22.520000
            1987     21.625899
            1988     22.625668
            1989     21.935673
            1990     21.671233
            1991     21.435484
            1992     22.750000
            1993     19.230769
            1994     18.000000
            1995     21.230769
            1996     22.027778
            1997     20.929825
            1998     20.304348
            1999     25.428571
            2000     21.615385
            2001     20.400000
            2002     21.719298
PF          1977      7.173913
            1978      7.088235
            1979      7.529412
            1980      7.455556
            1981      7.152542
            1982      6.918919
            1983      6.833333
            1984      7.428571
            1985           NaN
            1986      8.000000
            1987      7.500000
            1988      8.000000
            1989      7.230769
            1990      7.117647
            1991      7.827586
            1992      8.767857
            1993      8.238806
            1994      7.855072
            1995      8.780645
            1996      8.393846
            1997      8.448649
            1998      8.720000
            2001      8.538462
            2002      7.388889
PH          1984     28.000000
            1985     32.666667
            1986     39.000000
            1988     33.000000
            1989     30.500000
            1990     31.142857
            1992     31.142857
            1995     35.333333
            1997     22.000000
PI          1992     18.000000
            1995           NaN
            2001     18.000000
            2002     20.000000
PL          1995     22.333333
            1996     21.166667
            1997     18.789474
            1998     16.714286
            2000     21.000000
PM          1978     20.666667
            1979     23.666667
            1980     21.000000
            1981     19.933333
            1982     21.391304
            1983     22.023810
            1984     19.545455
            1985     19.000000
            1986     21.696970
            1987     22.297710
            1988     21.759259
            1989     20.222222
            1995     27.375000
            1996     20.224490
            1997     21.126394
            1998     20.591398
            1999     22.523810
            2000     20.500000
            2001     26.666667
            2002     19.000000
PP          1977     15.333333
            1978     14.869565
            1979     15.105263
            1980     14.176471
            1981     13.950000
            1982     16.125000
            1983     15.468750
            1984     15.307692
            1985     15.764706
            1986     16.750000
            1987     17.840909
            1988     18.280000
            1989     17.409091
            1990     16.029412
            1991     17.904762
            1992     17.479675
            1993     17.954023
            1994     17.585714
            1995     16.844444
            1996     18.095563
            1997     18.175000
            1998     16.266990
            1999     16.521212
            2000     16.788618
            2001     17.752896
            2002     17.018617
PX          1996           NaN
            1997     19.000000
            1998           NaN
            2000           NaN
RF          1982     11.500000
            1988     13.818182
            1989     13.346939
            1990     12.916667
            1997     20.000000
RM          1977     10.000000
            1978      7.500000
            1979      8.333333
            1980     10.227273
            1981     11.178571
            1982     10.436242
            1983      9.872000
            1984     11.152542
            1985      8.371429
            1986     10.750000
            1987     10.657609
            1988     10.322115
            1989     10.411311
            1990     10.305677
            1991     10.498305
            1992     10.904762
            1993     10.747475
            1994     10.675000
            1995     12.653846
            1996     11.455556
            1997     11.230769
            1998     13.100000
            1999     10.555556
            2000     11.400000
            2001     11.333333
            2002     10.000000
RO          1991     11.000000
            2002     10.142857
RX          1995     20.000000
            1997     11.000000
SF          1989     54.800000
            1990     52.611111
            1991     93.166667
            1992     43.000000
            1993     44.000000
            2002     62.166667
SH          1977           NaN
            1978     89.000000
            1982     79.000000
            1983           NaN
            1986     55.000000
            1987     73.800000
            1988     72.806452
            1989     76.345455
            1990     76.888889
            1991     63.000000
            1997     50.857143
            1999     54.000000
            2000     73.375000
            2001     79.900000
            2002     64.666667
SO          1991     53.909091
            1992     55.263158
            1993     55.600000
            1994     62.333333
            1997     54.666667
SS          1977           NaN
            1978    130.000000
            1979           NaN
            1980           NaN
            1981     57.000000
            1982           NaN
            1983           NaN
            1984           NaN
            1985           NaN
            1986           NaN
            1987           NaN
            1988           NaN
            1989           NaN
            1990           NaN
            1991           NaN
            1992           NaN
            1993           NaN
            1994           NaN
            1995           NaN
            1996           NaN
            1997           NaN
            1998           NaN
            1999           NaN
            2000           NaN
            2001           NaN
            2002           NaN
ST          1993           NaN
UR          1987           NaN
            1988           NaN
            1991           NaN
            1992           NaN
            1994           NaN
            2001           NaN
            2002           NaN
Name: weight, dtype: float64

Key Points

Combine two dataframes on one or more common values using pd.merge()
Append rows from one dataframe to another using pd.concat()
Combine multiple text columns into one using the + operator
Use the str accessor to use string methods like split() and zfill() on text columns
Convert date strings to datetime objects using pd.to_datetime()