Combining Dataframes
Last updated on 2025-07-03 | Edit this page
Estimated time: 60 minutes
Overview
Questions
- How do we combine data from multiple sources using pandas?
- How do we add data to an existing dataframe?
- How do we combine columns within a single dataframe?
Objectives
- Use
pd.merge()
to add related columns to a dataframe - Use
pd.concat()
to add rows to a dataframe - Combine text columns
- Combine year, month, and day columns into a
datetime64
object
Dataframes can be used to organize and group data by common characteristics. Often, we need to combine elements from separate dataframes into one for analysis and visualization. Likewise, we may need to append data collected under different circumstances. In this chapter, we will show how to merge, concatenate, and combine data using pandas.
Merging dataframes
The survey dataframe we’ve been using throughout this lesson has a column called species_id. We used this column in the previous lesson to calculate summary statistics about observations of each species. But the species_id is just a two-letter code. Without knowing what each code represents, we can’t really understand, describe, or even properly label the observations. We need a way to go from the code to more detailed species information.
Managing repetitive data
Why store species data in a separate table in the first place? Species information is repetitive: Every observation of the same species has the same genus, species, and taxa. Storing it in the original survey table would require including that data in every record, increasing both the size of the table and the likelihood of errors. Storing that data in a separate table means we only have to enter and validate it once. A tool like pandas then allows us to access the authoratative data when we need it.
We can enhance the species data in the dataframe using an
authoratative dataset that contains the taxonomic information associated
with each species. This data is stored in the species.csv file, which
can be loaded using pd.read_csv()
:
PYTHON
import pandas as pd
surveys = pd.read_csv("data/surveys.csv")
species = pd.read_csv("data/species.csv")
species
OUTPUT
species_id | genus | species | taxa | |
---|---|---|---|---|
0 | AB | Amphispiza | bilineata | Bird |
1 | AH | Ammospermophilus | harrisi | Rodent |
2 | AS | Ammodramus | savannarum | Bird |
3 | BA | Baiomys | taylori | Rodent |
4 | CB | Campylorhynchus | brunneicapillus | Bird |
5 | CM | Calamospiza | melanocorys | Bird |
6 | CQ | Callipepla | squamata | Bird |
7 | CS | Crotalus | scutalatus | Reptile |
8 | CT | Cnemidophorus | tigris | Reptile |
9 | CU | Cnemidophorus | uniparens | Reptile |
10 | CV | Crotalus | viridis | Reptile |
11 | DM | Dipodomys | merriami | Rodent |
12 | DO | Dipodomys | ordii | Rodent |
13 | DS | Dipodomys | spectabilis | Rodent |
14 | DX | Dipodomys | sp. | Rodent |
15 | EO | Eumeces | obsoletus | Reptile |
16 | GS | Gambelia | silus | Reptile |
17 | NL | Neotoma | albigula | Rodent |
18 | NX | Neotoma | sp. | Rodent |
19 | OL | Onychomys | leucogaster | Rodent |
20 | OT | Onychomys | torridus | Rodent |
21 | OX | Onychomys | sp. | Rodent |
22 | PB | Chaetodipus | baileyi | Rodent |
23 | PC | Pipilo | chlorurus | Bird |
24 | PE | Peromyscus | eremicus | Rodent |
25 | PF | Perognathus | flavus | Rodent |
26 | PG | Pooecetes | gramineus | Bird |
27 | PH | Perognathus | hispidus | Rodent |
28 | PI | Chaetodipus | intermedius | Rodent |
29 | PL | Peromyscus | leucopus | Rodent |
30 | PM | Peromyscus | maniculatus | Rodent |
31 | PP | Chaetodipus | penicillatus | Rodent |
32 | PU | Pipilo | fuscus | Bird |
33 | PX | Chaetodipus | sp. | Rodent |
34 | RF | Reithrodontomys | fulvescens | Rodent |
35 | RM | Reithrodontomys | megalotis | Rodent |
36 | RO | Reithrodontomys | montanus | Rodent |
37 | RX | Reithrodontomys | sp. | Rodent |
38 | SA | Sylvilagus | audubonii | Rabbit |
39 | SB | Spizella | breweri | Bird |
40 | SC | Sceloporus | clarki | Reptile |
41 | SF | Sigmodon | fulviventer | Rodent |
42 | SH | Sigmodon | hispidus | Rodent |
43 | SO | Sigmodon | ochrognathus | Rodent |
44 | SS | Spermophilus | spilosoma | Rodent |
45 | ST | Spermophilus | tereticaudus | Rodent |
46 | SU | Sceloporus | undulatus | Reptile |
47 | SX | Sigmodon | sp. | Rodent |
48 | UL | Lizard | sp. | Reptile |
49 | UP | Pipilo | sp. | Bird |
50 | UR | Rodent | sp. | Rodent |
51 | US | Sparrow | sp. | Bird |
52 | ZL | Zonotrichia | leucophrys | Bird |
53 | ZM | Zenaida | macroura | Bird |
We can see that the species dataframe includes the genus, species,
and taxon for each species_id. This is much more useful than the
species_id included in the original dataframe—how can we integrate that
data into our surveys dataframe? Adding it by hand would be tedious and
error prone. Fortunately, pandas provides the pd.merge()
function to join two dataframes. We’ll assigned the new, merged
dataframe to the variable merged:
OUTPUT
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN | Neotoma | albigula | Rodent |
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN | Neotoma | albigula | Rodent |
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN | Dipodomys | merriami | Rodent |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN | Dipodomys | merriami | Rodent |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN | Dipodomys | merriami | Rodent |
… | … | … | … | … | … | … | … | … | … | … | … | … |
34781 | 35544 | 12 | 31 | 2002 | 15 | US | NaN | NaN | NaN | Sparrow | sp. | Bird |
34782 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN | Ammospermophilus | harrisi | Rodent |
34783 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN | Ammospermophilus | harrisi | Rodent |
34784 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 | Reithrodontomys | megalotis | Rodent |
34785 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 | Dipodomys | ordii | Rodent |
34786 rows × 12 columns
Following the merge, the genus, species, and taxa columns have all been added to the survey dataframe. We can now use those columns to filter and summarize our data.
Joins
The merge()
method performs the same function as joins
in SQL
Challenge
We previously described the dataset as consisting of observations of rodents. That’s mostly true, but observations of other animals are also included. Based on the merged dataframe, what other types of animals are present in the dataframe? We are interesed in the type of animal, not the Latin name.
In the example above, we didn’t provide any information about how we wanted to merge the dataframes together, so pandas made an educated guess by looking at the columns in each of the dataframes and merging them on the only column that appeared in both datasets, species_id. For more complex tables, we may need more control over the operation, for example, by specifiying the exact columns we want to merge on. We can do that by passing one or more column names using the on keyword argument:
OUTPUT
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN | Neotoma | albigula | Rodent |
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN | Neotoma | albigula | Rodent |
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN | Dipodomys | merriami | Rodent |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN | Dipodomys | merriami | Rodent |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN | Dipodomys | merriami | Rodent |
… | … | … | … | … | … | … | … | … | … | … | … | … |
34781 | 35544 | 12 | 31 | 2002 | 15 | US | NaN | NaN | NaN | Sparrow | sp. | Bird |
34782 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN | Ammospermophilus | harrisi | Rodent |
34783 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN | Ammospermophilus | harrisi | Rodent |
34784 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 | Reithrodontomys | megalotis | Rodent |
34785 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 | Dipodomys | ordii | Rodent |
34786 rows × 12 columns
It’s not necessary here, but we can also merge datasets where the common data is stored in columns with different names. To do so, we’d use the left_on (to specify the column name in the first dataframe) and right_on (to specify the column name in the second dataframe).
OUTPUT
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN | Neotoma | albigula | Rodent |
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN | Neotoma | albigula | Rodent |
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN | Dipodomys | merriami | Rodent |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN | Dipodomys | merriami | Rodent |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN | Dipodomys | merriami | Rodent |
… | … | … | … | … | … | … | … | … | … | … | … | … |
34781 | 35544 | 12 | 31 | 2002 | 15 | US | NaN | NaN | NaN | Sparrow | sp. | Bird |
34782 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN | Ammospermophilus | harrisi | Rodent |
34783 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN | Ammospermophilus | harrisi | Rodent |
34784 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 | Reithrodontomys | megalotis | Rodent |
34785 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 | Dipodomys | ordii | Rodent |
34786 rows × 12 columns
Challenge
Compare the number of rows in the original and merged survey dataframes. How do they differ? Why might that be?
Hint: Use pd.unique()
method to look at
the species_id column in each dataframe.
OUTPUT
array(['AB', 'AH', 'AS', 'BA', 'CB', 'CM', 'CQ', 'CS', 'CT', 'CU', 'CV',
'DM', 'DO', 'DS', 'DX', 'NL', 'OL', 'OT', 'OX', 'PB', 'PC', 'PE',
'PF', 'PG', 'PH', 'PI', 'PL', 'PM', 'PP', 'PU', 'PX', 'RF', 'RM',
'RO', 'RX', 'SA', 'SC', 'SF', 'SH', 'SO', 'SS', 'ST', 'SU', 'UL',
'UP', 'UR', 'US', 'ZL', nan], dtype=object)
The merged dataframe omits rows with no value in the species_id column.
The number of rows in the merged dataframe is lower than in the
original surveys dataframe. By default, merge()
performs an
inner join. This means that a row will only appear in
the merged dataframe if the value(s) in the join column(s) appear in
both dataframes. Here, observations that did not include a species_id or
that included a species_id that was not definedin the species dataframe
were dropped.
This is not always desirable behavior. Fortunately, pandas supports other kinds of merges:
- Inner: Include all rows with common values in the join columns. This is the default behavior.
- Left: Include all rows from the left dataframe. Columns from the right dataframe are populated if a common value exists and set to NaN if not.
- Right: Include all rows from the right dataframe. Columns from the left dataframe are populated if a common value exists and set to NaN if not.
- Outer: Include all rows from both dataframes
We want to keep all of the observations, so let’s do a left join instead. To specify the type of merge, we use the how keyword argument:
OUTPUT
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN | Neotoma | albigula | Rodent |
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN | Neotoma | albigula | Rodent |
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN | Dipodomys | merriami | Rodent |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN | Dipodomys | merriami | Rodent |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN | Dipodomys | merriami | Rodent |
… | … | … | … | … | … | … | … | … | … | … | … | … |
35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN | Ammospermophilus | harrisi | Rodent |
35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN | Ammospermophilus | harrisi | Rodent |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 | Reithrodontomys | megalotis | Rodent |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 | Dipodomys | ordii | Rodent |
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
35549 rows × 12 columns
Now all 35,549 rows appear in the merged dataframe.
Appending rows to a dataframe
Merges address the case where information for a set of observations is spread across multiple files. What about when the observations themselves are in more than one file? For a survey like the one we’ve been looking at in this lesson, we might get a new file once a year with the same columns but a completely new set of observations. How can we add those new observations to our dataframe?
We’ll simulate this operation by splitting data from two different years, 2001 and 2002, into separate dataframes. We can filter the dataset this using conditionals, as we saw in lesson 3:
OUTPUT
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
31710 | 31711 | 1 | 21 | 2001 | 1 | PB | F | 26.0 | 25.0 |
31711 | 31712 | 1 | 21 | 2001 | 1 | DM | M | 37.0 | 43.0 |
31712 | 31713 | 1 | 21 | 2001 | 1 | PB | M | 29.0 | 44.0 |
31713 | 31714 | 1 | 21 | 2001 | 1 | DO | M | 34.0 | 53.0 |
31714 | 31715 | 1 | 21 | 2001 | 2 | OT | M | 20.0 | 27.0 |
… | … | … | … | … | … | … | … | … | … |
33315 | 33316 | 12 | 16 | 2001 | 11 | NaN | NaN | NaN | NaN |
33316 | 33317 | 12 | 16 | 2001 | 13 | NaN | NaN | NaN | NaN |
33317 | 33318 | 12 | 16 | 2001 | 14 | NaN | NaN | NaN | NaN |
33318 | 33319 | 12 | 16 | 2001 | 15 | NaN | NaN | NaN | NaN |
33319 | 33320 | 12 | 16 | 2001 | 16 | NaN | NaN | NaN | NaN |
1610 rows × 9 columns
OUTPUT
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
33320 | 33321 | 1 | 12 | 2002 | 1 | DM | M | 38.0 | 44.0 |
33321 | 33322 | 1 | 12 | 2002 | 1 | DO | M | 37.0 | 58.0 |
33322 | 33323 | 1 | 12 | 2002 | 1 | PB | M | 28.0 | 45.0 |
33323 | 33324 | 1 | 12 | 2002 | 1 | AB | NaN | NaN | NaN |
33324 | 33325 | 1 | 12 | 2002 | 1 | DO | M | 35.0 | 29.0 |
… | … | … | … | … | … | … | … | … | … |
35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 |
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN |
2229 rows × 9 columns
We now have two different dataframes with the same columns but
different data, one with 1,610 rows, the other with 2,229 rows, for a
total of 3,839 records. We can combine them into a new dataframe using
pd.concat()
, which stacks the dataframes vertically (that
is, it adds records from one dataset to the end of another). This method
accepts a list of dataframes and works from left to right (so the
leftmost dataframe will be at the top of the new dataframe and the
rightmost will be at the bottom). We’re only combining two dataframes
here but could include more if necessary.
OUTPUT
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
31710 | 31711 | 1 | 21 | 2001 | 1 | PB | F | 26.0 | 25.0 |
31711 | 31712 | 1 | 21 | 2001 | 1 | DM | M | 37.0 | 43.0 |
31712 | 31713 | 1 | 21 | 2001 | 1 | PB | M | 29.0 | 44.0 |
31713 | 31714 | 1 | 21 | 2001 | 1 | DO | M | 34.0 | 53.0 |
31714 | 31715 | 1 | 21 | 2001 | 2 | OT | M | 20.0 | 27.0 |
… | … | … | … | … | … | … | … | … | … |
35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 |
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN |
3839 rows × 9 columns
The combined dataframe includes all rows from both dataframes.
In some cases, the exact columns may change from year to year even within the same project. For example, researchers may decide to add an additional column to track a new piece of data or to provide a quality check. If a column is present in only one dataset, you can still concatenate the datasets. Any column that does not appear in a given dataset will have the value NaN for those rows in the combined dataframe.
To illustrate this, we’ll add a column called test to the 2002
survey, then re-run pd.concat()
:
PYTHON
surveys_2002["test"] = True
surveys_2001_2002 = pd.concat([surveys_2001, surveys_2002])
surveys_2001_2002
OUTPUT
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | test | |
---|---|---|---|---|---|---|---|---|---|---|
31710 | 31711 | 1 | 21 | 2001 | 1 | PB | F | 26.0 | 25.0 | NaN |
31711 | 31712 | 1 | 21 | 2001 | 1 | DM | M | 37.0 | 43.0 | NaN |
31712 | 31713 | 1 | 21 | 2001 | 1 | PB | M | 29.0 | 44.0 | NaN |
31713 | 31714 | 1 | 21 | 2001 | 1 | DO | M | 34.0 | 53.0 | NaN |
31714 | 31715 | 1 | 21 | 2001 | 2 | OT | M | 20.0 | 27.0 | NaN |
… | … | … | … | … | … | … | … | … | … | … |
35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN | True |
35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN | True |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 | True |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 | True |
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN | True |
3839 rows × 10 columns
As expected, the test column has a value of NaN for the 2001 data in the combined dataframe.
Combining data in multiple columns
Sometimes we want to combine values from two or more columns into a
single column. For example, we might want to refer to the species in
each record by both its genus and species names. In Python, we use the
+
operator to concatenate (or join) strings, and pandas
works the same way:
PYTHON
species["genus_species"] = species["genus"] + " " + species["species"]
species["genus_species"]
OUTPUT
0 Amphispiza bilineata
1 Ammospermophilus harrisi
2 Ammodramus savannarum
3 Baiomys taylori
4 Campylorhynchus brunneicapillus
5 Calamospiza melanocorys
6 Callipepla squamata
7 Crotalus scutalatus
8 Cnemidophorus tigris
9 Cnemidophorus uniparens
10 Crotalus viridis
11 Dipodomys merriami
12 Dipodomys ordii
13 Dipodomys spectabilis
14 Dipodomys sp.
15 Eumeces obsoletus
16 Gambelia silus
17 Neotoma albigula
18 Neotoma sp.
19 Onychomys leucogaster
20 Onychomys torridus
21 Onychomys sp.
22 Chaetodipus baileyi
23 Pipilo chlorurus
24 Peromyscus eremicus
25 Perognathus flavus
26 Pooecetes gramineus
27 Perognathus hispidus
28 Chaetodipus intermedius
29 Peromyscus leucopus
30 Peromyscus maniculatus
31 Chaetodipus penicillatus
32 Pipilo fuscus
33 Chaetodipus sp.
34 Reithrodontomys fulvescens
35 Reithrodontomys megalotis
36 Reithrodontomys montanus
37 Reithrodontomys sp.
38 Sylvilagus audubonii
39 Spizella breweri
40 Sceloporus clarki
41 Sigmodon fulviventer
42 Sigmodon hispidus
43 Sigmodon ochrognathus
44 Spermophilus spilosoma
45 Spermophilus tereticaudus
46 Sceloporus undulatus
47 Sigmodon sp.
48 Lizard sp.
49 Pipilo sp.
50 Rodent sp.
51 Sparrow sp.
52 Zonotrichia leucophrys
53 Zenaida macroura
Name: genus_species, dtype: object
Note that the +
operator can also be used to add numeric
columns. In Python, the same operator can be used to perform different
operations for different data types.
Combining dates
Another common need is to join or split dates. In the ecology
dataset, the date of each observation is split across year, month, and
day columns. However, pandas has a special data type,
datetime64
, for representing dates that can be useful for
analyzing time series data. To make use of that functionality, we can
use pd.to_datetime()
method to create a datetime
object:
OUTPUT
0 1977-07-16
1 1977-07-16
2 1977-07-16
3 1977-07-16
4 1977-07-16
...
35544 2002-12-31
35545 2002-12-31
35546 2002-12-31
35547 2002-12-31
35548 2002-12-31
Name: date, Length: 35549, dtype: datetime64[ns]
Challenge
Pandas allows us to ask specific questions about our data. A key skill is knowing how to translate our scientific questions into a sensible approach (and subsequently visualize and interpret our results).
Try using pandas to answer the following questions.
- How many specimens of each sex are there for each year, including those whose sex is unknown?
- What is the average weight of each taxa?
- What are the minimum, maximum and average weight for each species of Rodent?
- What is the average hindfoot length for male and female rodent of each species? Is there a male/female difference?
- What is the average weight of each rodent species over the course of the years? Is there any noticeable trend for any of the species?
How many specimens of each sex are there for each year, including those whose sex is unknown?
PYTHON
# Fill in NaN values in sex with U
surveys["sex"] = surveys["sex"].fillna("U")
# Count records by sex
result = surveys.groupby(["year", "sex"])["record_id"].count()
# Use the option_context context manager to show all rows
with pd.option_context("display.max_rows", None):
print(result)
OUTPUT
year sex
1977 F 204
M 214
U 85
1978 F 503
M 433
U 112
1979 F 327
M 324
U 68
1980 F 605
M 727
U 83
1981 F 631
M 745
U 96
1982 F 823
M 1027
U 128
1983 F 771
M 797
U 105
1984 F 445
M 443
U 93
1985 F 636
M 716
U 86
1986 F 414
M 455
U 73
1987 F 677
M 862
U 132
1988 F 604
M 737
U 128
1989 F 678
M 780
U 111
1990 F 581
M 636
U 94
1991 F 606
M 637
U 104
1992 F 443
M 477
U 118
1993 F 285
M 380
U 85
1994 F 243
M 327
U 98
1995 F 574
M 534
U 114
1996 F 725
M 910
U 71
1997 F 1071
M 1357
U 65
1998 F 717
M 798
U 95
1999 F 545
M 530
U 60
2000 F 690
M 779
U 83
2001 F 743
M 744
U 123
2002 F 1149
M 979
U 101
Name: record_id, dtype: int64
What is the average weight of each taxa?
PYTHON
# Create the merged dataframe
merged = pd.merge(surveys, species, how="left")
# Group by taxa
grouped = merged.groupby("taxa")
# Calculate the min, max, and mean weight
grouped["weight"].mean()
OUTPUT
taxa
Bird NaN
Rabbit NaN
Reptile NaN
Rodent 42.672428
Name: weight, dtype: float64
What are the minimum, maximum and average weight for each species of Rodent?
PYTHON
# Create the merged dataframe
merged = pd.merge(surveys, species, how="left")
# Limit merged dataframe to rodents
rodents = merged[merged["taxa"] == "Rodent"]
# Group rodents by species
grouped = rodents.groupby("species_id")
# Calculate the min, max, and mean weight
grouped.agg({"weight": ["min", "max", "mean"]})
OUTPUT
weight | |||
---|---|---|---|
min | max | mean | |
species_id | |||
AH | NaN | NaN | NaN |
BA | 6.0 | 18.0 | 8.600000 |
DM | 10.0 | 66.0 | 43.157864 |
DO | 12.0 | 76.0 | 48.870523 |
DS | 12.0 | 190.0 | 120.130546 |
DX | NaN | NaN | NaN |
NL | 30.0 | 280.0 | 159.245660 |
OL | 10.0 | 56.0 | 31.575258 |
OT | 5.0 | 46.0 | 24.230556 |
OX | 18.0 | 24.0 | 21.000000 |
PB | 12.0 | 55.0 | 31.735943 |
PE | 8.0 | 40.0 | 21.586508 |
PF | 4.0 | 25.0 | 7.923127 |
PH | 18.0 | 48.0 | 31.064516 |
PI | 17.0 | 21.0 | 19.250000 |
PL | 8.0 | 27.0 | 19.138889 |
PM | 7.0 | 49.0 | 21.364155 |
PP | 4.0 | 74.0 | 17.173942 |
PX | 18.0 | 20.0 | 19.000000 |
RF | 9.0 | 20.0 | 13.386667 |
RM | 4.0 | 29.0 | 10.585010 |
RO | 8.0 | 13.0 | 10.250000 |
RX | 11.0 | 20.0 | 15.500000 |
SF | 24.0 | 199.0 | 58.878049 |
SH | 16.0 | 140.0 | 73.148936 |
SO | 15.0 | 105.0 | 55.414634 |
SS | 57.0 | 130.0 | 93.500000 |
ST | NaN | NaN | NaN |
UR | NaN | NaN | NaN |
What is the average hindfoot length for male and female rodent of each species? Is there a male/female difference?
PYTHON
# Create the merged dataframe
merged = pd.merge(surveys, species, how="left")
# Limit merged dataframe to rodents
rodents = merged[merged["taxa"] == "Rodent"]
# Group rodents by species and sex
grouped = rodents.groupby(["species_id", "sex"])
# Calculate the mean hindfoot length, plus count and standard deviation
# to better assess the question
with pd.option_context("display.max_rows", None):
print(grouped["hindfoot_length"].agg(["count", "mean", "std"]))
OUTPUT
count mean std
species_id sex
AH M 0 NaN NaN
U 2 33.000000 2.828427
BA F 31 13.161290 1.529636
M 14 12.642857 2.097880
U 0 NaN NaN
DM F 4302 35.712692 1.433067
M 5658 36.188229 1.455396
U 12 35.583333 0.996205
DO F 1244 35.486334 1.726414
M 1640 35.698780 1.611656
U 3 36.000000 0.000000
DS F 1046 49.583174 1.973573
M 1083 50.301939 2.128120
U 3 50.000000 1.732051
DX U 0 NaN NaN
NL F 620 32.024194 1.669181
M 452 32.577434 1.903603
U 2 52.000000 25.455844
OL F 438 20.287671 1.101022
M 480 20.758333 1.651195
U 2 20.000000 0.000000
OT F 1019 20.281649 1.519412
M 1115 20.257399 1.087354
U 5 19.600000 1.140175
OX F 3 19.666667 1.154701
M 5 18.800000 3.346640
U 0 NaN NaN
PB F 1645 25.882675 0.928842
M 1213 26.431987 1.444130
U 6 26.166667 0.983192
PE F 544 20.272059 1.156084
M 666 20.132132 1.177721
U 2 20.500000 0.707107
PF F 721 15.550624 1.381627
M 770 15.620779 1.142208
U 2 13.000000 4.242641
PH F 20 25.900000 1.293709
M 11 25.545455 1.752920
PI M 8 22.500000 0.534522
U 1 20.000000 NaN
PL F 16 19.750000 2.720294
M 19 20.263158 0.933459
U 1 20.000000 NaN
PM F 361 20.279778 1.162782
M 483 20.530021 1.275621
U 3 21.333333 4.041452
PP F 1574 21.676620 1.316529
M 1444 21.838643 1.201006
U 9 20.888889 2.027588
PX F 1 19.000000 NaN
M 1 20.000000 NaN
U 0 NaN NaN
RF F 55 17.527273 0.813191
M 18 17.500000 0.985184
RM F 1133 16.391880 1.123946
M 1296 16.491512 1.134202
U 13 15.846154 2.444250
RO F 4 14.750000 0.957427
M 4 16.000000 1.154701
RX M 2 18.500000 2.121320
SF F 16 27.500000 2.683282
M 23 26.347826 3.524056
U 2 24.500000 0.707107
SH F 71 29.028169 2.298893
M 60 27.983333 2.801281
U 0 NaN NaN
SO F 30 25.633333 4.139660
M 11 25.727273 1.954017
SS F 0 NaN NaN
M 0 NaN NaN
U 0 NaN NaN
ST U 0 NaN NaN
UR U 0 NaN NaN
What is the average weight of each rodent species over the course of the years? Is there any noticeable trend for any of the species?
PYTHON
# Create the merged dataframe
merged = pd.merge(surveys, species, how="left")
# Limit merged dataframe to rodents
rodents = merged[merged["taxa"] == "Rodent"]
# Group rodents by species and year
grouped = rodents.groupby(["species_id", "year"])
# Calculate the mean weight by year
result = grouped["weight"].mean()
# Use the option_context context manager to show all rows
with pd.option_context("display.max_rows", None):
print(result)
OUTPUT
species_id year
AH 1978 NaN
1982 NaN
1983 NaN
1984 NaN
1985 NaN
1986 NaN
1987 NaN
1988 NaN
1989 NaN
1990 NaN
1991 NaN
1992 NaN
1993 NaN
1994 NaN
1995 NaN
1996 NaN
1997 NaN
1998 NaN
1999 NaN
2000 NaN
2001 NaN
2002 NaN
BA 1989 7.000000
1990 8.000000
1991 9.240000
1992 7.833333
DM 1977 41.141304
1978 40.795455
1979 43.507317
1980 44.136082
1981 44.032907
1982 41.975042
1983 40.944551
1984 40.765306
1985 41.507645
1986 43.361596
1987 43.232609
1988 43.397790
1989 44.349206
1990 41.769912
1991 43.148338
1992 43.877966
1993 43.373984
1994 42.373288
1995 44.806147
1996 44.506173
1997 44.708551
1998 43.131403
1999 43.945402
2000 43.126638
2001 45.442177
2002 46.168317
DO 1977 42.666667
1978 45.000000
1979 45.862069
1980 48.058824
1981 49.111111
1982 47.919643
1983 47.150602
1984 48.415094
1985 47.956731
1986 49.372727
1987 50.087379
1988 51.463768
1989 51.025641
1990 48.512048
1991 49.695652
1992 48.367816
1993 48.461538
1994 47.750000
1995 49.592593
1996 47.234940
1997 48.177866
1998 49.731183
1999 48.012048
2000 49.224719
2001 52.233766
2002 49.514403
DS 1977 121.437500
1978 115.198653
1979 111.796954
1980 120.725664
1981 125.818444
1982 115.647929
1983 122.033088
1984 124.082192
1985 124.326316
1986 128.071429
1987 126.383838
1988 129.490566
1989 121.896552
1990 121.187500
1991 113.000000
1992 112.352941
1993 105.722222
1994 106.625000
1997 111.000000
1998 116.000000
1999 120.714286
DX 1980 NaN
1983 NaN
1986 NaN
1988 NaN
1991 NaN
1992 NaN
1993 NaN
1994 NaN
1995 NaN
1996 NaN
1998 NaN
1999 NaN
2000 NaN
2001 NaN
2002 NaN
NL 1977 NaN
1978 184.656250
1979 138.000000
1980 158.982143
1981 165.903226
1982 160.613208
1983 156.691489
1984 150.095238
1985 148.829268
1986 159.724138
1987 158.840000
1988 163.104167
1989 151.278689
1990 154.275862
1991 148.785714
1992 139.000000
1993 127.391304
1994 186.875000
1995 155.833333
1996 148.000000
1997 150.688889
1998 159.466667
1999 182.200000
2000 179.307692
2001 167.851064
2002 182.159091
OL 1977 21.666667
1978 31.027027
1979 33.717391
1980 33.107143
1981 33.296296
1982 35.590361
1983 34.584615
1984 32.550725
1985 32.108696
1986 30.880952
1987 30.779221
1988 30.306122
1989 31.947368
1990 31.200000
1991 28.171429
1992 28.454545
1993 27.545455
1994 21.900000
1995 27.296296
1996 27.769231
1997 33.625000
2002 25.428571
OT 1977 21.000000
1978 23.220000
1979 23.075758
1980 24.083333
1981 24.415385
1982 24.925532
1983 24.697674
1984 22.416667
1985 22.064516
1986 21.500000
1987 23.069767
1988 24.070588
1989 24.663366
1990 23.675325
1991 24.588235
1992 24.928571
1993 22.363636
1994 25.285714
1995 24.868421
1996 23.453704
1997 24.785156
1998 24.675676
1999 25.723810
2000 25.303448
2001 23.707792
2002 23.833333
OX 1977 21.333333
1978 NaN
1982 24.000000
1984 18.000000
1989 20.000000
1993 NaN
PB 1995 34.000000
1996 32.578947
1997 31.085603
1998 30.082237
1999 31.710037
2000 30.878899
2001 32.798851
2002 32.359447
PE 1977 19.500000
1978 20.428571
1979 20.529412
1980 22.450000
1981 20.558140
1982 21.173077
1983 20.864865
1984 20.210526
1985 20.360000
1986 22.520000
1987 21.625899
1988 22.625668
1989 21.935673
1990 21.671233
1991 21.435484
1992 22.750000
1993 19.230769
1994 18.000000
1995 21.230769
1996 22.027778
1997 20.929825
1998 20.304348
1999 25.428571
2000 21.615385
2001 20.400000
2002 21.719298
PF 1977 7.173913
1978 7.088235
1979 7.529412
1980 7.455556
1981 7.152542
1982 6.918919
1983 6.833333
1984 7.428571
1985 NaN
1986 8.000000
1987 7.500000
1988 8.000000
1989 7.230769
1990 7.117647
1991 7.827586
1992 8.767857
1993 8.238806
1994 7.855072
1995 8.780645
1996 8.393846
1997 8.448649
1998 8.720000
2001 8.538462
2002 7.388889
PH 1984 28.000000
1985 32.666667
1986 39.000000
1988 33.000000
1989 30.500000
1990 31.142857
1992 31.142857
1995 35.333333
1997 22.000000
PI 1992 18.000000
1995 NaN
2001 18.000000
2002 20.000000
PL 1995 22.333333
1996 21.166667
1997 18.789474
1998 16.714286
2000 21.000000
PM 1978 20.666667
1979 23.666667
1980 21.000000
1981 19.933333
1982 21.391304
1983 22.023810
1984 19.545455
1985 19.000000
1986 21.696970
1987 22.297710
1988 21.759259
1989 20.222222
1995 27.375000
1996 20.224490
1997 21.126394
1998 20.591398
1999 22.523810
2000 20.500000
2001 26.666667
2002 19.000000
PP 1977 15.333333
1978 14.869565
1979 15.105263
1980 14.176471
1981 13.950000
1982 16.125000
1983 15.468750
1984 15.307692
1985 15.764706
1986 16.750000
1987 17.840909
1988 18.280000
1989 17.409091
1990 16.029412
1991 17.904762
1992 17.479675
1993 17.954023
1994 17.585714
1995 16.844444
1996 18.095563
1997 18.175000
1998 16.266990
1999 16.521212
2000 16.788618
2001 17.752896
2002 17.018617
PX 1996 NaN
1997 19.000000
1998 NaN
2000 NaN
RF 1982 11.500000
1988 13.818182
1989 13.346939
1990 12.916667
1997 20.000000
RM 1977 10.000000
1978 7.500000
1979 8.333333
1980 10.227273
1981 11.178571
1982 10.436242
1983 9.872000
1984 11.152542
1985 8.371429
1986 10.750000
1987 10.657609
1988 10.322115
1989 10.411311
1990 10.305677
1991 10.498305
1992 10.904762
1993 10.747475
1994 10.675000
1995 12.653846
1996 11.455556
1997 11.230769
1998 13.100000
1999 10.555556
2000 11.400000
2001 11.333333
2002 10.000000
RO 1991 11.000000
2002 10.142857
RX 1995 20.000000
1997 11.000000
SF 1989 54.800000
1990 52.611111
1991 93.166667
1992 43.000000
1993 44.000000
2002 62.166667
SH 1977 NaN
1978 89.000000
1982 79.000000
1983 NaN
1986 55.000000
1987 73.800000
1988 72.806452
1989 76.345455
1990 76.888889
1991 63.000000
1997 50.857143
1999 54.000000
2000 73.375000
2001 79.900000
2002 64.666667
SO 1991 53.909091
1992 55.263158
1993 55.600000
1994 62.333333
1997 54.666667
SS 1977 NaN
1978 130.000000
1979 NaN
1980 NaN
1981 57.000000
1982 NaN
1983 NaN
1984 NaN
1985 NaN
1986 NaN
1987 NaN
1988 NaN
1989 NaN
1990 NaN
1991 NaN
1992 NaN
1993 NaN
1994 NaN
1995 NaN
1996 NaN
1997 NaN
1998 NaN
1999 NaN
2000 NaN
2001 NaN
2002 NaN
ST 1993 NaN
UR 1987 NaN
1988 NaN
1991 NaN
1992 NaN
1994 NaN
2001 NaN
2002 NaN
Name: weight, dtype: float64
Key Points
- Combine two dataframes on one or more common values using
pd.merge()
- Append rows from one dataframe to another using
pd.concat()
- Combine multiple text columns into one using the
+
operator - Convert date info to datetime objects using
pd.to_datetime()