Combining Dataframes

Last updated on 2022-07-18 | Edit this page

Estimated time: 60 minutes

Overview

Questions

  • How do we combine data from multiple sources using pandas?
  • How do we add data to an existing dataframe?
  • How do we split and combine data columns?

Objectives

  • Use pd.merge() to add species info to the survey dataset
  • Use pd.concat() to add additional rows the dataset
  • Use string methods to combine, split, and modify text columns using the str accessor

Dataframes can be used to organize and group data by common characteristics. Often, we need to combine elements from separate dataframes into one for analysis and visualization. A merge (or join) allows use to combine two dataframes using values common to each. Likewise, we may need to append data collected under different circumstances. In this chapter, we will show how to merge, concatenate, and split data using pandas.

Merging dataframes


The survey dataframe we’ve been using throughout this lesson has a column called species_id. We used this column in the previous lesson to calculate summary statistics about observations of each species. But the species_id is just a two-letter code—what does each code stand for? To find out, we’ll now load both the survey dataset and a second dataset containing more detailed information about the various species observed. Read the second dataframe from a file called species.csv:

PYTHON

import pandas as pd

surveys = pd.read_csv("data/surveys.csv")
species = pd.read_csv("data/species.csv")

species
species_id genus species taxa
0 AB Amphispiza bilineata Bird
1 AH Ammospermophilus harrisi Rodent
2 AS Ammodramus savannarum Bird
3 BA Baiomys taylori Rodent
4 CB Campylorhynchus brunneicapillus Bird
5 CM Calamospiza melanocorys Bird
6 CQ Callipepla squamata Bird
7 CS Crotalus scutalatus Reptile
8 CT Cnemidophorus tigris Reptile
9 CU Cnemidophorus uniparens Reptile
10 CV Crotalus viridis Reptile
11 DM Dipodomys merriami Rodent
12 DO Dipodomys ordii Rodent
13 DS Dipodomys spectabilis Rodent
14 DX Dipodomys sp. Rodent
15 EO Eumeces obsoletus Reptile
16 GS Gambelia silus Reptile
17 NL Neotoma albigula Rodent
18 NX Neotoma sp. Rodent
19 OL Onychomys leucogaster Rodent
20 OT Onychomys torridus Rodent
21 OX Onychomys sp. Rodent
22 PB Chaetodipus baileyi Rodent
23 PC Pipilo chlorurus Bird
24 PE Peromyscus eremicus Rodent
25 PF Perognathus flavus Rodent
26 PG Pooecetes gramineus Bird
27 PH Perognathus hispidus Rodent
28 PI Chaetodipus intermedius Rodent
29 PL Peromyscus leucopus Rodent
30 PM Peromyscus maniculatus Rodent
31 PP Chaetodipus penicillatus Rodent
32 PU Pipilo fuscus Bird
33 PX Chaetodipus sp. Rodent
34 RF Reithrodontomys fulvescens Rodent
35 RM Reithrodontomys megalotis Rodent
36 RO Reithrodontomys montanus Rodent
37 RX Reithrodontomys sp. Rodent
38 SA Sylvilagus audubonii Rabbit
39 SB Spizella breweri Bird
40 SC Sceloporus clarki Reptile
41 SF Sigmodon fulviventer Rodent
42 SH Sigmodon hispidus Rodent
43 SO Sigmodon ochrognathus Rodent
44 SS Spermophilus spilosoma Rodent
45 ST Spermophilus tereticaudus Rodent
46 SU Sceloporus undulatus Reptile
47 SX Sigmodon sp. Rodent
48 UL Lizard sp. Reptile
49 UP Pipilo sp. Bird
50 UR Rodent sp. Rodent
51 US Sparrow sp. Bird
52 ZL Zonotrichia leucophrys Bird
53 ZM Zenaida macroura Bird

We can see that the species dataframe includes a genus, species, and taxon for each species_id. This is much more useful than the species_id included in the original dataframe–how can we add that data to our surveys dataframe? Adding it by hand would be tedious and error prone. Fortunately, pandas provides the pd.merge() function to join two dataframes.

Managing repetitive data

Why store species data in a separate table in the first place? Species information is repetitive: Every observation of the same species has the same genus, species, and taxa. Storing it in the original survey table would require including that data in every record, increasing the complexity of the table and creating the possibility of errors. Storing that data in a separate table means we only have to enter and validate it once. A tool like pandas then allows us to access that data when we need it.

To merge the surveys and species dataframes, use:

PYTHON

merged = pd.merge(surveys, species)
merged
record_id month day year plot_id species_id sex hindfoot_length weight genus species taxa
0 1 7 16 1977 2 NL M 32.0 NaN Neotoma albigula Rodent
1 2 7 16 1977 3 NL M 33.0 NaN Neotoma albigula Rodent
2 22 7 17 1977 15 NL F 31.0 NaN Neotoma albigula Rodent
3 38 7 17 1977 17 NL M 33.0 NaN Neotoma albigula Rodent
4 72 8 19 1977 2 NL M 31.0 NaN Neotoma albigula Rodent
34781 28988 12 23 1998 6 CT NaN NaN NaN Cnemidophorus tigris Reptile
34782 35512 12 31 2002 11 US NaN NaN NaN Sparrow sp. Bird
34783 35513 12 31 2002 11 US NaN NaN NaN Sparrow sp. Bird
34784 35528 12 31 2002 13 US NaN NaN NaN Sparrow sp. Bird
34785 35544 12 31 2002 15 US NaN NaN NaN Sparrow sp. Bird

34786 rows × 12 columns

Following the merge, the genus, species, and taxa columns have all been added to the survey dataframe. We can now use those columns to filter and summarize our data.

Joins

The pd.merge() method is equivalent to the JOIN operation in SQL

Challenge

Filter the merged dataframe to show the genus, the species name, and the weight for every individual captured at the site

PYTHON

merged[["genus", "species", "weight"]]
genus species weight
0 Neotoma albigula NaN
1 Neotoma albigula NaN
2 Neotoma albigula NaN
3 Neotoma albigula NaN
4 Neotoma albigula NaN
34781 Cnemidophorus tigris NaN
34782 Sparrow sp. NaN
34783 Sparrow sp. NaN
34784 Sparrow sp. NaN
34785 Sparrow sp. NaN

34786 rows × 3 columns

In the example above, we didn’t provide any information about how we wanted to merge the dataframes together, so pandas used its default arguments to make an educated guess. It looked at the columns in each of the dataframes, then merged them based on the columns that appear in both. Here, the only shared name is species_id, so that’s the column pandas used to merge. For more complex tables, we may want to specify the columns are used for merging. We can do so by passing one or more column names using the on keyword argument:

PYTHON

pd.merge(surveys, species, on="species_id")
record_id month day year plot_id species_id sex hindfoot_length weight genus species taxa
0 1 7 16 1977 2 NL M 32.0 NaN Neotoma albigula Rodent
1 2 7 16 1977 3 NL M 33.0 NaN Neotoma albigula Rodent
2 22 7 17 1977 15 NL F 31.0 NaN Neotoma albigula Rodent
3 38 7 17 1977 17 NL M 33.0 NaN Neotoma albigula Rodent
4 72 8 19 1977 2 NL M 31.0 NaN Neotoma albigula Rodent
34781 28988 12 23 1998 6 CT NaN NaN NaN Cnemidophorus tigris Reptile
34782 35512 12 31 2002 11 US NaN NaN NaN Sparrow sp. Bird
34783 35513 12 31 2002 11 US NaN NaN NaN Sparrow sp. Bird
34784 35528 12 31 2002 13 US NaN NaN NaN Sparrow sp. Bird
34785 35544 12 31 2002 15 US NaN NaN NaN Sparrow sp. Bird

34786 rows × 12 columns

Challenge

Compare the number of rows in the original and merged survey dataframes. How do they differ? Why do you think that might be?

Hint: Use pd.unique() method to look at the species_id column in each dataframe.

PYTHON

pd.unique(surveys["species_id"].sort_values())

OUTPUT

array(['AB', 'AH', 'AS', 'BA', 'CB', 'CM', 'CQ', 'CS', 'CT', 'CU', 'CV',
       'DM', 'DO', 'DS', 'DX', 'NL', 'OL', 'OT', 'OX', 'PB', 'PC', 'PE',
       'PF', 'PG', 'PH', 'PI', 'PL', 'PM', 'PP', 'PU', 'PX', 'RF', 'RM',
       'RO', 'RX', 'SA', 'SC', 'SF', 'SH', 'SO', 'SS', 'ST', 'SU', 'UL',
       'UP', 'UR', 'US', 'ZL', nan], dtype=object)

PYTHON

pd.unique(species["species_id"].sort_values())

OUTPUT

array(['AB', 'AH', 'AS', 'BA', 'CB', 'CM', 'CQ', 'CS', 'CT', 'CU', 'CV',
       'DM', 'DO', 'DS', 'DX', 'EO', 'GS', 'NL', 'NX', 'OL', 'OT', 'OX',
       'PB', 'PC', 'PE', 'PF', 'PG', 'PH', 'PI', 'PL', 'PM', 'PP', 'PU',
       'PX', 'RF', 'RM', 'RO', 'RX', 'SA', 'SB', 'SC', 'SF', 'SH', 'SO',
       'SS', 'ST', 'SU', 'SX', 'UL', 'UP', 'UR', 'US', 'ZL', 'ZM'],
      dtype=object)

Some records in the surveys dataframe do not specify a species. By default, only records with a value that occurs in both the surveys and species dataframes appear in the merged dataframe, so rows without a species_id are excluded.

The built-in set type can be used to quickly assess differences like this:

PYTHON

set(surveys["species_id"]) - set(species["species_id"])

OUTPUT

{nan}

In practice, the values in the columns used to join two dataframes may not align exactly. Above, the surveys dataframe contains a few hundred rows where species_id is NaN. These are the rows that were dropped when the dataframes were merged.

By default, pd.merge() performs an inner join. This means that a row will only appear if the value in the shared column appears in both of the datasets being merged. In this case, that means that survey observations that don’t have a species_id or have a species_id that does not appear in the species dataframe will be dropped.

This is not always desirable behavior. Fortunately, pandas supports additional types of merges:

  • Inner: Include all rows with common values in the join column. This is the default behavior.
  • Left: Include all rows from the left dataframe. Columns from the right dataframe are populated if a common value exists and set to NaN if not.
  • Right: Include all rows from the right dataframe. Columns from the left dataframe are populated if a common value exists and set to NaN if not.
  • Outer: Includes all rows from both dataframes

We want to keep all of our observations, so let’s do a left join instead. To specify the type of merge, we use the how keyword argument:

PYTHON

pd.merge(surveys, species, how="left")
record_id month day year plot_id species_id sex hindfoot_length weight genus species taxa
0 1 7 16 1977 2 NL M 32.0 NaN Neotoma albigula Rodent
1 2 7 16 1977 3 NL M 33.0 NaN Neotoma albigula Rodent
2 3 7 16 1977 2 DM F 37.0 NaN Dipodomys merriami Rodent
3 4 7 16 1977 7 DM M 36.0 NaN Dipodomys merriami Rodent
4 5 7 16 1977 3 DM M 35.0 NaN Dipodomys merriami Rodent
35544 35545 12 31 2002 15 AH NaN NaN NaN Ammospermophilus harrisi Rodent
35545 35546 12 31 2002 15 AH NaN NaN NaN Ammospermophilus harrisi Rodent
35546 35547 12 31 2002 10 RM F 15.0 14.0 Reithrodontomys megalotis Rodent
35547 35548 12 31 2002 7 DO M 36.0 51.0 Dipodomys ordii Rodent
35548 35549 12 31 2002 5 NaN NaN NaN NaN NaN NaN NaN

35549 rows × 12 columns

Now all 35,549 rows appear in the merged dataframe.

Appending rows to a dataframe


Merges address the case where information about the same set of observations is spread across multiple files. What about when the observations themselves are split into more than one file? For a survey like the one we’ve been looking at in this lesson, we might get a new file once a year with the same columns but a completely new set of observations. How can we add those new observations to our dataframe?

We’ll simulate this operation by splitting data from two different years, 2001 and 2002, into separate dataframes. We can do this using conditionals, as we saw in lesson 3:

PYTHON

surveys_2001 = surveys[surveys["year"] == 2001].copy()
surveys_2001
record_id month day year plot_id species_id sex hindfoot_length weight
31710 31711 1 21 2001 1 PB F 26.0 25.0
31711 31712 1 21 2001 1 DM M 37.0 43.0
31712 31713 1 21 2001 1 PB M 29.0 44.0
31713 31714 1 21 2001 1 DO M 34.0 53.0
31714 31715 1 21 2001 2 OT M 20.0 27.0
33315 33316 12 16 2001 11 NaN NaN NaN NaN
33316 33317 12 16 2001 13 NaN NaN NaN NaN
33317 33318 12 16 2001 14 NaN NaN NaN NaN
33318 33319 12 16 2001 15 NaN NaN NaN NaN
33319 33320 12 16 2001 16 NaN NaN NaN NaN

1610 rows × 9 columns

PYTHON

surveys_2002 = surveys[surveys["year"] == 2002].copy()
surveys_2002
record_id month day year plot_id species_id sex hindfoot_length weight
33320 33321 1 12 2002 1 DM M 38.0 44.0
33321 33322 1 12 2002 1 DO M 37.0 58.0
33322 33323 1 12 2002 1 PB M 28.0 45.0
33323 33324 1 12 2002 1 AB NaN NaN NaN
33324 33325 1 12 2002 1 DO M 35.0 29.0
35544 35545 12 31 2002 15 AH NaN NaN NaN
35545 35546 12 31 2002 15 AH NaN NaN NaN
35546 35547 12 31 2002 10 RM F 15.0 14.0
35547 35548 12 31 2002 7 DO M 36.0 51.0
35548 35549 12 31 2002 5 NaN NaN NaN NaN

2229 rows × 9 columns

We now have two different dataframes with the same columns but different data, one with 1,610 rows, the other with 2,229 rows. We can combine them into a new dataframe using pd.concat(), which stacks the dataframes vertically (that is, it appends records from the 2002 dataset to the 2001 dataset). This method accepts a list and will concatenate each item moving from left to right. We’re only combining two dataframes here but could do more if needed.

PYTHON

surveys_2001_2002 = pd.concat([surveys_2001, surveys_2002])
surveys_2001_2002
record_id month day year plot_id species_id sex hindfoot_length weight
31710 31711 1 21 2001 1 PB F 26.0 25.0
31711 31712 1 21 2001 1 DM M 37.0 43.0
31712 31713 1 21 2001 1 PB M 29.0 44.0
31713 31714 1 21 2001 1 DO M 34.0 53.0
31714 31715 1 21 2001 2 OT M 20.0 27.0
35544 35545 12 31 2002 15 AH NaN NaN NaN
35545 35546 12 31 2002 15 AH NaN NaN NaN
35546 35547 12 31 2002 10 RM F 15.0 14.0
35547 35548 12 31 2002 7 DO M 36.0 51.0
35548 35549 12 31 2002 5 NaN NaN NaN NaN

3839 rows × 9 columns

The combined dataframe includes all rows from both dataframes.

In some cases, the exact columns may change from year to year even within the same project. For example, researchers may decide to add an additional column to track a new piece of data or to provide a quality check. If a column is present in only one dataset, you can still concatenate the datasets. Any column that does not appear in a given dataset will be set to NaN for those rows in the combined dataframe.

To illustrate this, we’ll add a validated column to the 2002 survey, then re-run pd.concat():

PYTHON

surveys_2002["validated"] = True

surveys_2001_2002 = pd.concat([surveys_2001, surveys_2002])
surveys_2001_2002
record_id month day year plot_id species_id sex hindfoot_length weight validated
31710 31711 1 21 2001 1 PB F 26.0 25.0 NaN
31711 31712 1 21 2001 1 DM M 37.0 43.0 NaN
31712 31713 1 21 2001 1 PB M 29.0 44.0 NaN
31713 31714 1 21 2001 1 DO M 34.0 53.0 NaN
31714 31715 1 21 2001 2 OT M 20.0 27.0 NaN
35544 35545 12 31 2002 15 AH NaN NaN NaN True
35545 35546 12 31 2002 15 AH NaN NaN NaN True
35546 35547 12 31 2002 10 RM F 15.0 14.0 True
35547 35548 12 31 2002 7 DO M 36.0 51.0 True
35548 35549 12 31 2002 5 NaN NaN NaN NaN True

3839 rows × 10 columns

As expected, the validated column has a value of NaN for the 2001 data in the combined dataframe.

Joining and splitting columns


Sometimes we’d like to combine values from two or more columns into a single column. For example, we might want to refer to the species in each record by both its genus and species names. In Python, we use the + operator to concatenate (or join) strings, and pandas works the same way:

PYTHON

species["genus_species"] = species["genus"] + " " + species["species"]
species["genus_species"]

OUTPUT

0                Amphispiza bilineata
1            Ammospermophilus harrisi
2               Ammodramus savannarum
3                     Baiomys taylori
4     Campylorhynchus brunneicapillus
5             Calamospiza melanocorys
6                 Callipepla squamata
7                 Crotalus scutalatus
8                Cnemidophorus tigris
9             Cnemidophorus uniparens
10                   Crotalus viridis
11                 Dipodomys merriami
12                    Dipodomys ordii
13              Dipodomys spectabilis
14                      Dipodomys sp.
15                  Eumeces obsoletus
16                     Gambelia silus
17                   Neotoma albigula
18                        Neotoma sp.
19              Onychomys leucogaster
20                 Onychomys torridus
21                      Onychomys sp.
22                Chaetodipus baileyi
23                   Pipilo chlorurus
24                Peromyscus eremicus
25                 Perognathus flavus
26                Pooecetes gramineus
27               Perognathus hispidus
28            Chaetodipus intermedius
29                Peromyscus leucopus
30             Peromyscus maniculatus
31           Chaetodipus penicillatus
32                      Pipilo fuscus
33                    Chaetodipus sp.
34         Reithrodontomys fulvescens
35          Reithrodontomys megalotis
36           Reithrodontomys montanus
37                Reithrodontomys sp.
38               Sylvilagus audubonii
39                   Spizella breweri
40                  Sceloporus clarki
41               Sigmodon fulviventer
42                  Sigmodon hispidus
43              Sigmodon ochrognathus
44             Spermophilus spilosoma
45          Spermophilus tereticaudus
46               Sceloporus undulatus
47                       Sigmodon sp.
48                         Lizard sp.
49                         Pipilo sp.
50                         Rodent sp.
51                        Sparrow sp.
52             Zonotrichia leucophrys
53                   Zenaida macroura
Name: genus_species, dtype: object

Note that the + operator is also used to add numeric columns. In Python, the same operator can be used to perform different operations for different data types (but keep reading for an important caveat.)

Another common need is to join or split dates. In the ecology dataset, the date is split across year, month, and day columns. However, pandas has a special data type, datetime64, for representing dates that is useful for plotting, comparing, and resampling time series data. To make use of that functionality, we can concatenate the date columns and convert them to a datetime object. For clarity, we’ll use an unambiguous date format: YYYY-MM-DD.

We need to perform an additional step before combining the date columns. Year, month, and day are all stored as integers in our dataframe (specifically, they use the int64 data type). If we try to concatenate them as they are, we’ll receive an error. This is because the + operator only works when each object has a similar type, that is, all objects are either the same type or can be coerced to a common type (like int and float, which are distinct types that generally play well together). In this case, the columns are integers and the hyphens are strings. pandas cannot determine exactly how the user wants to combine the values, so it gives an error.

To resolve the error, we can use the astype() method to convert each column to a string before combining the columns:

PYTHON

year = surveys["year"].astype(str)
month = surveys["month"].astype(str)
day = surveys["day"].astype(str)

surveys["date"] = year + "-" + month + "-" + day
surveys["date"]

OUTPUT

0         1977-7-16
1         1977-7-16
2         1977-7-16
3         1977-7-16
4         1977-7-16
            ...
35544    2002-12-31
35545    2002-12-31
35546    2002-12-31
35547    2002-12-31
35548    2002-12-31
Name: date, Length: 35549, dtype: object

Note that some of the dates look a little funny because single-digit days and months do not include a leading zero. For example, in the first row we have 1977-7-16 instead of 1977-07-16. This is usually not a big deal but can be neatened up using the str accessor.

In pandas, an accessor is an attribute that provides additional functionality to an object. Here, the str accessor allows us to access many of the methods from the built-in str data type, including zfill(), which pads a string to a given length by adding zeroes to the start of the string:

PYTHON

text = "2"
text.zfill(3)

OUTPUT

'002'

Using the str accessor, we can use that method to zero-pad the data in a Series:

PYTHON

# Pad month and day to two characters
month = month.str.zfill(2)
day = day.str.zfill(2)

surveys["date"] = year + "-" + month + "-" + day
surveys["date"]

OUTPUT

0        1977-07-16
1        1977-07-16
2        1977-07-16
3        1977-07-16
4        1977-07-16
            ...
35544    2002-12-31
35545    2002-12-31
35546    2002-12-31
35547    2002-12-31
35548    2002-12-31
Name: date, Length: 35549, dtype: object

The month and date values in date are now padded to a length of two, allowing us to create a well-formed YYYY-MM-DD date string. Other string methods, like upper() and lower(), can be used in the same way.

Before we convert the date column to a datetime, we’re going to use the date string to show the opposite operation: Splitting a value stored in one column into multiple columns. One way to do this in pandas is to use str.split(), which splits each value in a series based on a delimiter, a character used as a boundary between parts of a string. Here, a hyphen is used to delimit the year, month, and date in each date. By splitting the column on a hyphen, we can extract each of those components into its own column. We also pass True to the expand keyword argument, which makes the str.split() method return a dataframe:

PYTHON

surveys["date"].str.split("-", expand=True)
0 1 2
0 1977 07 16
1 1977 07 16
2 1977 07 16
3 1977 07 16
4 1977 07 16
35544 2002 12 31
35545 2002 12 31
35546 2002 12 31
35547 2002 12 31
35548 2002 12 31

35549 rows × 3 columns

Now let’s go ahead and convert our date column into a datetime object using pd.to_datetime():

PYTHON

surveys["date"] = pd.to_datetime(surveys["date"])
surveys["date"]

OUTPUT

0       1977-07-16
1       1977-07-16
2       1977-07-16
3       1977-07-16
4       1977-07-16
           ...
35544   2002-12-31
35545   2002-12-31
35546   2002-12-31
35547   2002-12-31
35548   2002-12-31
Name: date, Length: 35549, dtype: datetime64[ns]

Challenge

pandas can help us ask specific questions which we want to answer about our data. The real skill is to know how to translate our scientific questions into a sensible approach (and subsequently visualize and interpret our results).

Try using pandas to answer the following questions.

  1. How many specimens of each sex are there for each year, including those whose sex is unknown?
  2. What is the average weight of each taxa?
  3. What are the minimum, maximum and average weight for each species of Rodent?
  4. What is the average hindfoot length for male and female rodent of each species? Is there a male/female difference?
  5. What is the average weight of each rodent species over the course of the years? Is there any noticeable trend for any of the species?

How many specimens of each sex are there for each year, including those whose sex is unknown?

PYTHON

# Fill in NaN values in sex with U
surveys["sex"] = surveys["sex"].fillna("U")

# Count records by sex
result = surveys.groupby(["year", "sex"])["record_id"].count()

# Use the option_context context manager to show all rows
with pd.option_context("display.max_rows", None):
    print(result)

OUTPUT

year  sex
1977  F       204
      M       214
      U        85
1978  F       503
      M       433
      U       112
1979  F       327
      M       324
      U        68
1980  F       605
      M       727
      U        83
1981  F       631
      M       745
      U        96
1982  F       823
      M      1027
      U       128
1983  F       771
      M       797
      U       105
1984  F       445
      M       443
      U        93
1985  F       636
      M       716
      U        86
1986  F       414
      M       455
      U        73
1987  F       677
      M       862
      U       132
1988  F       604
      M       737
      U       128
1989  F       678
      M       780
      U       111
1990  F       581
      M       636
      U        94
1991  F       606
      M       637
      U       104
1992  F       443
      M       477
      U       118
1993  F       285
      M       380
      U        85
1994  F       243
      M       327
      U        98
1995  F       574
      M       534
      U       114
1996  F       725
      M       910
      U        71
1997  F      1071
      M      1357
      U        65
1998  F       717
      M       798
      U        95
1999  F       545
      M       530
      U        60
2000  F       690
      M       779
      U        83
2001  F       743
      M       744
      U       123
2002  F      1149
      M       979
      U       101
Name: record_id, dtype: int64

What is the average weight of each taxa?

PYTHON

# Create the merged dataframe
merged = pd.merge(surveys, species, how="left")

# Group by taxa
grouped = merged.groupby("taxa")

# Calculate the min, max, and mean weight
grouped["weight"].mean()

OUTPUT

taxa
Bird             NaN
Rabbit           NaN
Reptile          NaN
Rodent     42.672428
Name: weight, dtype: float64

What are the minimum, maximum and average weight for each species of Rodent?

PYTHON

# Create the merged dataframe
merged = pd.merge(surveys, species, how="left")

# Limit merged dataframe to rodents
rodents = merged[merged["taxa"] == "Rodent"]

# Group rodents by species
grouped = rodents.groupby("species_id")

# Calculate the min, max, and mean weight
grouped.agg({"weight": ["min", "max", "mean"]})
weight
min max mean
species_id
AH NaN NaN NaN
BA 6.0 18.0 8.600000
DM 10.0 66.0 43.157864
DO 12.0 76.0 48.870523
DS 12.0 190.0 120.130546
DX NaN NaN NaN
NL 30.0 280.0 159.245660
OL 10.0 56.0 31.575258
OT 5.0 46.0 24.230556
OX 18.0 24.0 21.000000
PB 12.0 55.0 31.735943
PE 8.0 40.0 21.586508
PF 4.0 25.0 7.923127
PH 18.0 48.0 31.064516
PI 17.0 21.0 19.250000
PL 8.0 27.0 19.138889
PM 7.0 49.0 21.364155
PP 4.0 74.0 17.173942
PX 18.0 20.0 19.000000
RF 9.0 20.0 13.386667
RM 4.0 29.0 10.585010
RO 8.0 13.0 10.250000
RX 11.0 20.0 15.500000
SF 24.0 199.0 58.878049
SH 16.0 140.0 73.148936
SO 15.0 105.0 55.414634
SS 57.0 130.0 93.500000
ST NaN NaN NaN
UR NaN NaN NaN

What is the average hindfoot length for male and female rodent of each species? Is there a male/female difference?

PYTHON

# Create the merged dataframe
merged = pd.merge(surveys, species, how="left")

# Limit merged dataframe to rodents
rodents = merged[merged["taxa"] == "Rodent"]

# Group rodents by species and sex
grouped = rodents.groupby(["species_id", "sex"])

# Calculate the mean hindfoot length, plus count and standard deviation
# to better assess the question
with pd.option_context("display.max_rows", None):
    print(grouped["hindfoot_length"].agg(["count", "mean", "std"]))

OUTPUT

                count       mean        std
species_id sex
AH         M        0        NaN        NaN
           U        2  33.000000   2.828427
BA         F       31  13.161290   1.529636
           M       14  12.642857   2.097880
           U        0        NaN        NaN
DM         F     4302  35.712692   1.433067
           M     5658  36.188229   1.455396
           U       12  35.583333   0.996205
DO         F     1244  35.486334   1.726414
           M     1640  35.698780   1.611656
           U        3  36.000000   0.000000
DS         F     1046  49.583174   1.973573
           M     1083  50.301939   2.128120
           U        3  50.000000   1.732051
DX         U        0        NaN        NaN
NL         F      620  32.024194   1.669181
           M      452  32.577434   1.903603
           U        2  52.000000  25.455844
OL         F      438  20.287671   1.101022
           M      480  20.758333   1.651195
           U        2  20.000000   0.000000
OT         F     1019  20.281649   1.519412
           M     1115  20.257399   1.087354
           U        5  19.600000   1.140175
OX         F        3  19.666667   1.154701
           M        5  18.800000   3.346640
           U        0        NaN        NaN
PB         F     1645  25.882675   0.928842
           M     1213  26.431987   1.444130
           U        6  26.166667   0.983192
PE         F      544  20.272059   1.156084
           M      666  20.132132   1.177721
           U        2  20.500000   0.707107
PF         F      721  15.550624   1.381627
           M      770  15.620779   1.142208
           U        2  13.000000   4.242641
PH         F       20  25.900000   1.293709
           M       11  25.545455   1.752920
PI         M        8  22.500000   0.534522
           U        1  20.000000        NaN
PL         F       16  19.750000   2.720294
           M       19  20.263158   0.933459
           U        1  20.000000        NaN
PM         F      361  20.279778   1.162782
           M      483  20.530021   1.275621
           U        3  21.333333   4.041452
PP         F     1574  21.676620   1.316529
           M     1444  21.838643   1.201006
           U        9  20.888889   2.027588
PX         F        1  19.000000        NaN
           M        1  20.000000        NaN
           U        0        NaN        NaN
RF         F       55  17.527273   0.813191
           M       18  17.500000   0.985184
RM         F     1133  16.391880   1.123946
           M     1296  16.491512   1.134202
           U       13  15.846154   2.444250
RO         F        4  14.750000   0.957427
           M        4  16.000000   1.154701
RX         M        2  18.500000   2.121320
SF         F       16  27.500000   2.683282
           M       23  26.347826   3.524056
           U        2  24.500000   0.707107
SH         F       71  29.028169   2.298893
           M       60  27.983333   2.801281
           U        0        NaN        NaN
SO         F       30  25.633333   4.139660
           M       11  25.727273   1.954017
SS         F        0        NaN        NaN
           M        0        NaN        NaN
           U        0        NaN        NaN
ST         U        0        NaN        NaN
UR         U        0        NaN        NaN

What is the average weight of each rodent species over the course of the years? Is there any noticeable trend for any of the species?

PYTHON

# Create the merged dataframe
merged = pd.merge(surveys, species, how="left")

# Limit merged dataframe to rodents
rodents = merged[merged["taxa"] == "Rodent"]

# Group rodents by species and year
grouped = rodents.groupby(["species_id", "year"])

# Calculate the mean weight by year
result = grouped["weight"].mean()

# Use the option_context context manager to show all rows
with pd.option_context("display.max_rows", None):
    print(result)

OUTPUT

species_id  year
AH          1978           NaN
            1982           NaN
            1983           NaN
            1984           NaN
            1985           NaN
            1986           NaN
            1987           NaN
            1988           NaN
            1989           NaN
            1990           NaN
            1991           NaN
            1992           NaN
            1993           NaN
            1994           NaN
            1995           NaN
            1996           NaN
            1997           NaN
            1998           NaN
            1999           NaN
            2000           NaN
            2001           NaN
            2002           NaN
BA          1989      7.000000
            1990      8.000000
            1991      9.240000
            1992      7.833333
DM          1977     41.141304
            1978     40.795455
            1979     43.507317
            1980     44.136082
            1981     44.032907
            1982     41.975042
            1983     40.944551
            1984     40.765306
            1985     41.507645
            1986     43.361596
            1987     43.232609
            1988     43.397790
            1989     44.349206
            1990     41.769912
            1991     43.148338
            1992     43.877966
            1993     43.373984
            1994     42.373288
            1995     44.806147
            1996     44.506173
            1997     44.708551
            1998     43.131403
            1999     43.945402
            2000     43.126638
            2001     45.442177
            2002     46.168317
DO          1977     42.666667
            1978     45.000000
            1979     45.862069
            1980     48.058824
            1981     49.111111
            1982     47.919643
            1983     47.150602
            1984     48.415094
            1985     47.956731
            1986     49.372727
            1987     50.087379
            1988     51.463768
            1989     51.025641
            1990     48.512048
            1991     49.695652
            1992     48.367816
            1993     48.461538
            1994     47.750000
            1995     49.592593
            1996     47.234940
            1997     48.177866
            1998     49.731183
            1999     48.012048
            2000     49.224719
            2001     52.233766
            2002     49.514403
DS          1977    121.437500
            1978    115.198653
            1979    111.796954
            1980    120.725664
            1981    125.818444
            1982    115.647929
            1983    122.033088
            1984    124.082192
            1985    124.326316
            1986    128.071429
            1987    126.383838
            1988    129.490566
            1989    121.896552
            1990    121.187500
            1991    113.000000
            1992    112.352941
            1993    105.722222
            1994    106.625000
            1997    111.000000
            1998    116.000000
            1999    120.714286
DX          1980           NaN
            1983           NaN
            1986           NaN
            1988           NaN
            1991           NaN
            1992           NaN
            1993           NaN
            1994           NaN
            1995           NaN
            1996           NaN
            1998           NaN
            1999           NaN
            2000           NaN
            2001           NaN
            2002           NaN
NL          1977           NaN
            1978    184.656250
            1979    138.000000
            1980    158.982143
            1981    165.903226
            1982    160.613208
            1983    156.691489
            1984    150.095238
            1985    148.829268
            1986    159.724138
            1987    158.840000
            1988    163.104167
            1989    151.278689
            1990    154.275862
            1991    148.785714
            1992    139.000000
            1993    127.391304
            1994    186.875000
            1995    155.833333
            1996    148.000000
            1997    150.688889
            1998    159.466667
            1999    182.200000
            2000    179.307692
            2001    167.851064
            2002    182.159091
OL          1977     21.666667
            1978     31.027027
            1979     33.717391
            1980     33.107143
            1981     33.296296
            1982     35.590361
            1983     34.584615
            1984     32.550725
            1985     32.108696
            1986     30.880952
            1987     30.779221
            1988     30.306122
            1989     31.947368
            1990     31.200000
            1991     28.171429
            1992     28.454545
            1993     27.545455
            1994     21.900000
            1995     27.296296
            1996     27.769231
            1997     33.625000
            2002     25.428571
OT          1977     21.000000
            1978     23.220000
            1979     23.075758
            1980     24.083333
            1981     24.415385
            1982     24.925532
            1983     24.697674
            1984     22.416667
            1985     22.064516
            1986     21.500000
            1987     23.069767
            1988     24.070588
            1989     24.663366
            1990     23.675325
            1991     24.588235
            1992     24.928571
            1993     22.363636
            1994     25.285714
            1995     24.868421
            1996     23.453704
            1997     24.785156
            1998     24.675676
            1999     25.723810
            2000     25.303448
            2001     23.707792
            2002     23.833333
OX          1977     21.333333
            1978           NaN
            1982     24.000000
            1984     18.000000
            1989     20.000000
            1993           NaN
PB          1995     34.000000
            1996     32.578947
            1997     31.085603
            1998     30.082237
            1999     31.710037
            2000     30.878899
            2001     32.798851
            2002     32.359447
PE          1977     19.500000
            1978     20.428571
            1979     20.529412
            1980     22.450000
            1981     20.558140
            1982     21.173077
            1983     20.864865
            1984     20.210526
            1985     20.360000
            1986     22.520000
            1987     21.625899
            1988     22.625668
            1989     21.935673
            1990     21.671233
            1991     21.435484
            1992     22.750000
            1993     19.230769
            1994     18.000000
            1995     21.230769
            1996     22.027778
            1997     20.929825
            1998     20.304348
            1999     25.428571
            2000     21.615385
            2001     20.400000
            2002     21.719298
PF          1977      7.173913
            1978      7.088235
            1979      7.529412
            1980      7.455556
            1981      7.152542
            1982      6.918919
            1983      6.833333
            1984      7.428571
            1985           NaN
            1986      8.000000
            1987      7.500000
            1988      8.000000
            1989      7.230769
            1990      7.117647
            1991      7.827586
            1992      8.767857
            1993      8.238806
            1994      7.855072
            1995      8.780645
            1996      8.393846
            1997      8.448649
            1998      8.720000
            2001      8.538462
            2002      7.388889
PH          1984     28.000000
            1985     32.666667
            1986     39.000000
            1988     33.000000
            1989     30.500000
            1990     31.142857
            1992     31.142857
            1995     35.333333
            1997     22.000000
PI          1992     18.000000
            1995           NaN
            2001     18.000000
            2002     20.000000
PL          1995     22.333333
            1996     21.166667
            1997     18.789474
            1998     16.714286
            2000     21.000000
PM          1978     20.666667
            1979     23.666667
            1980     21.000000
            1981     19.933333
            1982     21.391304
            1983     22.023810
            1984     19.545455
            1985     19.000000
            1986     21.696970
            1987     22.297710
            1988     21.759259
            1989     20.222222
            1995     27.375000
            1996     20.224490
            1997     21.126394
            1998     20.591398
            1999     22.523810
            2000     20.500000
            2001     26.666667
            2002     19.000000
PP          1977     15.333333
            1978     14.869565
            1979     15.105263
            1980     14.176471
            1981     13.950000
            1982     16.125000
            1983     15.468750
            1984     15.307692
            1985     15.764706
            1986     16.750000
            1987     17.840909
            1988     18.280000
            1989     17.409091
            1990     16.029412
            1991     17.904762
            1992     17.479675
            1993     17.954023
            1994     17.585714
            1995     16.844444
            1996     18.095563
            1997     18.175000
            1998     16.266990
            1999     16.521212
            2000     16.788618
            2001     17.752896
            2002     17.018617
PX          1996           NaN
            1997     19.000000
            1998           NaN
            2000           NaN
RF          1982     11.500000
            1988     13.818182
            1989     13.346939
            1990     12.916667
            1997     20.000000
RM          1977     10.000000
            1978      7.500000
            1979      8.333333
            1980     10.227273
            1981     11.178571
            1982     10.436242
            1983      9.872000
            1984     11.152542
            1985      8.371429
            1986     10.750000
            1987     10.657609
            1988     10.322115
            1989     10.411311
            1990     10.305677
            1991     10.498305
            1992     10.904762
            1993     10.747475
            1994     10.675000
            1995     12.653846
            1996     11.455556
            1997     11.230769
            1998     13.100000
            1999     10.555556
            2000     11.400000
            2001     11.333333
            2002     10.000000
RO          1991     11.000000
            2002     10.142857
RX          1995     20.000000
            1997     11.000000
SF          1989     54.800000
            1990     52.611111
            1991     93.166667
            1992     43.000000
            1993     44.000000
            2002     62.166667
SH          1977           NaN
            1978     89.000000
            1982     79.000000
            1983           NaN
            1986     55.000000
            1987     73.800000
            1988     72.806452
            1989     76.345455
            1990     76.888889
            1991     63.000000
            1997     50.857143
            1999     54.000000
            2000     73.375000
            2001     79.900000
            2002     64.666667
SO          1991     53.909091
            1992     55.263158
            1993     55.600000
            1994     62.333333
            1997     54.666667
SS          1977           NaN
            1978    130.000000
            1979           NaN
            1980           NaN
            1981     57.000000
            1982           NaN
            1983           NaN
            1984           NaN
            1985           NaN
            1986           NaN
            1987           NaN
            1988           NaN
            1989           NaN
            1990           NaN
            1991           NaN
            1992           NaN
            1993           NaN
            1994           NaN
            1995           NaN
            1996           NaN
            1997           NaN
            1998           NaN
            1999           NaN
            2000           NaN
            2001           NaN
            2002           NaN
ST          1993           NaN
UR          1987           NaN
            1988           NaN
            1991           NaN
            1992           NaN
            1994           NaN
            2001           NaN
            2002           NaN
Name: weight, dtype: float64

Key Points

  • Combine two dataframes on one or more common values using pd.merge()
  • Append rows from one dataframe to another using pd.concat()
  • Combine multiple text columns into one using the + operator
  • Use the str accessor to use string methods like split() and zfill() on text columns
  • Convert date strings to datetime objects using pd.to_datetime()