Content from Introducing Python


Last updated on 2025-07-03

Overview

Questions

  • What is Python?
  • How do we assign variables in Python?
  • How do we perform actions in Python?
  • How does Python handle different types of data?

Objectives

  • Introduce some Python terminology
  • Learn how to assign values to variables
  • Learn how to use functions and methods
  • Introduce the built-in Python data types and functions
  • Introduce the Python Standard Library

About Python


Python is a high-level, general-purpose language. In practical terms, that means it dispenses with many of the complexities of writing code. It is widely used and has an active community of developers, making it suitable for a large number of tasks.

A core philosophical tenet of Python is that code is read more often than it is written. Best practices for coding in Python therefore emphasize writing well-documented, consistently formatted code that follows language-wide standards. In a Pythonic world, everyone’s code would look the same.

Whitespace

One feature to be aware of going in is that Python uses whitespace (as opposed to braces) to organize blocks of code. This can be a sticking point for people used to other languages.

Getting started


Everything in Python is an object. Each object has a type, which controls what the object can do. For example, we can add integers together or append a string to a list. An object can be assigned to a variable, and a function performs an action.

Python is an object-oriented language. This means that each object includes built-in variables and functions that can be used from the object itself. We use different terms to refer to variables and functions associated with an object:

  • an attribute stores a piece of data
  • a method performs an action

Let’s look at one of the main built-in types, str, to see how this works in practice. In Python, str, short for string, is used to store and manipulate text. To get started, we’ll assign the string “hello world” to the variable text. In Python, we use a single equal sign for assignment and quotes to create a string.

PYTHON

text = "hello world"

Using quotes

Either single or double quotes can be used to create strings, but try to use them consistently! We will use double quotes in the examples here.
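For example, the following two assignments (the variable names are invented here) create equal strings, because the quote style is not stored as part of the string:

```python
single = 'hello world'   # single quotes
double = "hello world"   # double quotes

# The two objects hold exactly the same text
print(single == double)  # True
```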

Now, instead of retyping the full string every time we need it, we can access that string by referring to the variable:

PYTHON

text

OUTPUT

'hello world'

We can check the type of the variable using the type() function:

PYTHON

type(text)

OUTPUT

str

A str is also an object, which means that it includes methods and attributes that we can use. Let’s use the upper() method to convert the text to upper case. We call methods by adding a period and the name of the method to the variable:

PYTHON

text.upper()

OUTPUT

'HELLO WORLD'
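Note that methods like upper() return a new string rather than modifying the original (strings in Python cannot be changed in place). A few more common str methods, using the same text variable:

```python
text = "hello world"

# Each method returns a new string; text itself is unchanged
print(text.capitalize())                # 'Hello world'
print(text.replace("world", "python"))  # 'hello python'
print(text.split(" "))                  # ['hello', 'world']
print(text)                             # still 'hello world'
```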

After running a cell, the notebook displays the object that appears in the last line of a cell (but note that some actions, like assigning to a variable, have no output). Some cells will have more than one line, and it is often useful to display output from earlier in a cell. This can be accomplished using the built-in print() function. In the cell below, we’ll use print() to show a series of strings:

PYTHON

"hello"
print("h")
print("e")
print("l")
print("l")
print("o")
"world"

OUTPUT

h
e
l
l
o

'world'

Note that the string “hello” at the top of the cell lacks a print statement and does not appear in the output, whereas the text “world” does appear in the output because it is the last line of the cell.

Each object may contain many attributes and methods. Use the help() function on any object, including functions or methods, to show a description of the object and list the available methods.

PYTHON

help(str)

OUTPUT

Help on class str in module builtins:

class str(object)
 |  str(object='') -> str
 |  str(bytes_or_buffer[, encoding[, errors]]) -> str
 |
 |  Create a new string object from the given object. If encoding or
 |  errors is specified, then the object must expose a data buffer
 |  that will be decoded using the given encoding and error handler.
 |  Otherwise, returns the result of object.__str__() (if defined)
 |  or repr(object).
 |  encoding defaults to 'utf-8'.
 |  errors defaults to 'strict'.
 |
 |  Methods defined here:
 |
 |  __add__(self, value, /)
 |      Return self+value.
 |
 |  __contains__(self, key, /)
 |      Return bool(key in self).
 |
 |  __eq__(self, value, /)
 |      Return self==value.
 |
 |  __format__(self, format_spec, /)
 |      Return a formatted version of the string as described by format_spec.
 |
 |  __ge__(self, value, /)
 |      Return self>=value.
 |
 |  __getitem__(self, key, /)
 |      Return self[key].
 |
 |  __getnewargs__(self, /)
 |
 |  __gt__(self, value, /)
 |      Return self>value.
 |
 |  __hash__(self, /)
 |      Return hash(self).
 |
 |  __iter__(self, /)
 |      Implement iter(self).
 |
 |  __le__(self, value, /)
 |      Return self<=value.
 |
 |  __len__(self, /)
 |      Return len(self).
 |
 |  __lt__(self, value, /)
 |      Return self<value.
 |
 |  __mod__(self, value, /)
 |      Return self%value.
 |
 |  __mul__(self, value, /)
 |      Return self*value.
 |
 |  __ne__(self, value, /)
 |      Return self!=value.
 |
 |  __repr__(self, /)
 |      Return repr(self).
 |
 |  __rmod__(self, value, /)
 |      Return value%self.
 |
 |  __rmul__(self, value, /)
 |      Return value*self.
 |
 |  __sizeof__(self, /)
 |      Return the size of the string in memory, in bytes.
 |
 |  __str__(self, /)
 |      Return str(self).
 |
 |  capitalize(self, /)
 |      Return a capitalized version of the string.
 |
 |      More specifically, make the first character have upper case and the rest lower
 |      case.
 |
 |  casefold(self, /)
 |      Return a version of the string suitable for caseless comparisons.
 |
 |  center(self, width, fillchar=' ', /)
 |      Return a centered string of length width.
 |
 |      Padding is done using the specified fill character (default is a space).
 |
 |  count(self, sub[, start[, end]], /)
 |      Return the number of non-overlapping occurrences of substring sub in string S[start:end].
 |
 |      Optional arguments start and end are interpreted as in slice notation.
 |
 |  encode(self, /, encoding='utf-8', errors='strict')
 |      Encode the string using the codec registered for encoding.
 |
 |      encoding
 |        The encoding in which to encode the string.
 |      errors
 |        The error handling scheme to use for encoding errors.
 |        The default is 'strict' meaning that encoding errors raise a
 |        UnicodeEncodeError.  Other possible values are 'ignore', 'replace' and
 |        'xmlcharrefreplace' as well as any other name registered with
 |        codecs.register_error that can handle UnicodeEncodeErrors.
 |
 |  endswith(self, suffix[, start[, end]], /)
 |      Return True if the string ends with the specified suffix, False otherwise.
 |
 |      suffix
 |        A string or a tuple of strings to try.
 |      start
 |        Optional start position. Default: start of the string.
 |      end
 |        Optional stop position. Default: end of the string.
 |
 |  expandtabs(self, /, tabsize=8)
 |      Return a copy where all tab characters are expanded using spaces.
 |
 |      If tabsize is not given, a tab size of 8 characters is assumed.
 |
 |  find(self, sub[, start[, end]], /)
 |      Return the lowest index in S where substring sub is found, such that sub is contained within S[start:end].
 |
 |      Optional arguments start and end are interpreted as in slice notation.
 |      Return -1 on failure.
 |
 |  format(self, /, *args, **kwargs)
 |      Return a formatted version of the string, using substitutions from args and kwargs.
 |      The substitutions are identified by braces ('{' and '}').
 |
 |  format_map(self, mapping, /)
 |      Return a formatted version of the string, using substitutions from mapping.
 |      The substitutions are identified by braces ('{' and '}').
 |
 |  index(self, sub[, start[, end]], /)
 |      Return the lowest index in S where substring sub is found, such that sub is contained within S[start:end].
 |
 |      Optional arguments start and end are interpreted as in slice notation.
 |      Raises ValueError when the substring is not found.
 |
 |  isalnum(self, /)
 |      Return True if the string is an alpha-numeric string, False otherwise.
 |
 |      A string is alpha-numeric if all characters in the string are alpha-numeric and
 |      there is at least one character in the string.
 |
 |  isalpha(self, /)
 |      Return True if the string is an alphabetic string, False otherwise.
 |
 |      A string is alphabetic if all characters in the string are alphabetic and there
 |      is at least one character in the string.
 |
 |  isascii(self, /)
 |      Return True if all characters in the string are ASCII, False otherwise.
 |
 |      ASCII characters have code points in the range U+0000-U+007F.
 |      Empty string is ASCII too.
 |
 |  isdecimal(self, /)
 |      Return True if the string is a decimal string, False otherwise.
 |
 |      A string is a decimal string if all characters in the string are decimal and
 |      there is at least one character in the string.
 |
 |  isdigit(self, /)
 |      Return True if the string is a digit string, False otherwise.
 |
 |      A string is a digit string if all characters in the string are digits and there
 |      is at least one character in the string.
 |
 |  isidentifier(self, /)
 |      Return True if the string is a valid Python identifier, False otherwise.
 |
 |      Call keyword.iskeyword(s) to test whether string s is a reserved identifier,
 |      such as "def" or "class".
 |
 |  islower(self, /)
 |      Return True if the string is a lowercase string, False otherwise.
 |
 |      A string is lowercase if all cased characters in the string are lowercase and
 |      there is at least one cased character in the string.
 |
 |  isnumeric(self, /)
 |      Return True if the string is a numeric string, False otherwise.
 |
 |      A string is numeric if all characters in the string are numeric and there is at
 |      least one character in the string.
 |
 |  isprintable(self, /)
 |      Return True if all characters in the string are printable, False otherwise.
 |
 |      A character is printable if repr() may use it in its output.
 |
 |  isspace(self, /)
 |      Return True if the string is a whitespace string, False otherwise.
 |
 |      A string is whitespace if all characters in the string are whitespace and there
 |      is at least one character in the string.
 |
 |  istitle(self, /)
 |      Return True if the string is a title-cased string, False otherwise.
 |
 |      In a title-cased string, upper- and title-case characters may only
 |      follow uncased characters and lowercase characters only cased ones.
 |
 |  isupper(self, /)
 |      Return True if the string is an uppercase string, False otherwise.
 |
 |      A string is uppercase if all cased characters in the string are uppercase and
 |      there is at least one cased character in the string.
 |
 |  join(self, iterable, /)
 |      Concatenate any number of strings.
 |
 |      The string whose method is called is inserted in between each given string.
 |      The result is returned as a new string.
 |
 |      Example: '.'.join(['ab', 'pq', 'rs']) -> 'ab.pq.rs'
 |
 |  ljust(self, width, fillchar=' ', /)
 |      Return a left-justified string of length width.
 |
 |      Padding is done using the specified fill character (default is a space).
 |
 |  lower(self, /)
 |      Return a copy of the string converted to lowercase.
 |
 |  lstrip(self, chars=None, /)
 |      Return a copy of the string with leading whitespace removed.
 |
 |      If chars is given and not None, remove characters in chars instead.
 |
 |  partition(self, sep, /)
 |      Partition the string into three parts using the given separator.
 |
 |      This will search for the separator in the string.  If the separator is found,
 |      returns a 3-tuple containing the part before the separator, the separator
 |      itself, and the part after it.
 |
 |      If the separator is not found, returns a 3-tuple containing the original string
 |      and two empty strings.
 |
 |  removeprefix(self, prefix, /)
 |      Return a str with the given prefix string removed if present.
 |
 |      If the string starts with the prefix string, return string[len(prefix):].
 |      Otherwise, return a copy of the original string.
 |
 |  removesuffix(self, suffix, /)
 |      Return a str with the given suffix string removed if present.
 |
 |      If the string ends with the suffix string and that suffix is not empty,
 |      return string[:-len(suffix)]. Otherwise, return a copy of the original
 |      string.
 |
 |  replace(self, old, new, /, count=-1)
 |      Return a copy with all occurrences of substring old replaced by new.
 |
 |        count
 |          Maximum number of occurrences to replace.
 |          -1 (the default value) means replace all occurrences.
 |
 |      If the optional argument count is given, only the first count occurrences are
 |      replaced.
 |
 |  rfind(self, sub[, start[, end]], /)
 |      Return the highest index in S where substring sub is found, such that sub is contained within S[start:end].
 |
 |      Optional arguments start and end are interpreted as in slice notation.
 |      Return -1 on failure.
 |
 |  rindex(self, sub[, start[, end]], /)
 |      Return the highest index in S where substring sub is found, such that sub is contained within S[start:end].
 |
 |      Optional arguments start and end are interpreted as in slice notation.
 |      Raises ValueError when the substring is not found.
 |
 |  rjust(self, width, fillchar=' ', /)
 |      Return a right-justified string of length width.
 |
 |      Padding is done using the specified fill character (default is a space).
 |
 |  rpartition(self, sep, /)
 |      Partition the string into three parts using the given separator.
 |
 |      This will search for the separator in the string, starting at the end. If
 |      the separator is found, returns a 3-tuple containing the part before the
 |      separator, the separator itself, and the part after it.
 |
 |      If the separator is not found, returns a 3-tuple containing two empty strings
 |      and the original string.
 |
 |  rsplit(self, /, sep=None, maxsplit=-1)
 |      Return a list of the substrings in the string, using sep as the separator string.
 |
 |        sep
 |          The separator used to split the string.
 |
 |          When set to None (the default value), will split on any whitespace
 |          character (including \n \r \t \f and spaces) and will discard
 |          empty strings from the result.
 |        maxsplit
 |          Maximum number of splits.
 |          -1 (the default value) means no limit.
 |
 |      Splitting starts at the end of the string and works to the front.
 |
 |  rstrip(self, chars=None, /)
 |      Return a copy of the string with trailing whitespace removed.
 |
 |      If chars is given and not None, remove characters in chars instead.
 |
 |  split(self, /, sep=None, maxsplit=-1)
 |      Return a list of the substrings in the string, using sep as the separator string.
 |
 |        sep
 |          The separator used to split the string.
 |
 |          When set to None (the default value), will split on any whitespace
 |          character (including \n \r \t \f and spaces) and will discard
 |          empty strings from the result.
 |        maxsplit
 |          Maximum number of splits.
 |          -1 (the default value) means no limit.
 |
 |      Splitting starts at the front of the string and works to the end.
 |
 |      Note, str.split() is mainly useful for data that has been intentionally
 |      delimited.  With natural text that includes punctuation, consider using
 |      the regular expression module.
 |
 |  splitlines(self, /, keepends=False)
 |      Return a list of the lines in the string, breaking at line boundaries.
 |
 |      Line breaks are not included in the resulting list unless keepends is given and
 |      true.
 |
 |  startswith(self, prefix[, start[, end]], /)
 |      Return True if the string starts with the specified prefix, False otherwise.
 |
 |      prefix
 |        A string or a tuple of strings to try.
 |      start
 |        Optional start position. Default: start of the string.
 |      end
 |        Optional stop position. Default: end of the string.
 |
 |  strip(self, chars=None, /)
 |      Return a copy of the string with leading and trailing whitespace removed.
 |
 |      If chars is given and not None, remove characters in chars instead.
 |
 |  swapcase(self, /)
 |      Convert uppercase characters to lowercase and lowercase characters to uppercase.
 |
 |  title(self, /)
 |      Return a version of the string where each word is titlecased.
 |
 |      More specifically, words start with uppercased characters and all remaining
 |      cased characters have lower case.
 |
 |  translate(self, table, /)
 |      Replace each character in the string using the given translation table.
 |
 |        table
 |          Translation table, which must be a mapping of Unicode ordinals to
 |          Unicode ordinals, strings, or None.
 |
 |      The table must implement lookup/indexing via __getitem__, for instance a
 |      dictionary or list.  If this operation raises LookupError, the character is
 |      left untouched.  Characters mapped to None are deleted.
 |
 |  upper(self, /)
 |      Return a copy of the string converted to uppercase.
 |
 |  zfill(self, width, /)
 |      Pad a numeric string with zeros on the left, to fill a field of the given width.
 |
 |      The string is never truncated.
 |
 |  ----------------------------------------------------------------------
 |  Static methods defined here:
 |
 |  __new__(*args, **kwargs)
 |      Create and return a new object.  See help(type) for accurate signature.
 |
 |  maketrans(x, y=<unrepresentable>, z=<unrepresentable>, /)
 |      Return a translation table usable for str.translate().
 |
 |      If there is only one argument, it must be a dictionary mapping Unicode
 |      ordinals (integers) or characters to Unicode ordinals, strings or None.
 |      Character keys will be then converted to ordinals.
 |      If there are two arguments, they must be strings of equal length, and
 |      in the resulting dictionary, each character in x will be mapped to the
 |      character at the same position in y. If there is a third argument, it
 |      must be a string, whose characters will be mapped to None in the result.

Data types

Python includes data types for representing other kinds of data, including numbers and collections. The core Python data types are introduced in the table below. We’ll talk more about some of these as we encounter them in the lesson:

Type   Definition                        Example
str    Character string                  "hello world"
int    Integer numerical                 42
float  Approximate numerical             0.406
bool   Stores True or False values       True or False
list   Sequence that can be modified     ["a", "b", "c"]
tuple  Sequence that cannot be modified  ("a", "b", "c")
dict   Mapping of keys to values         {"DC": "District of Columbia", "MD": "Maryland", "VA": "Virginia"}
set    Collection of unique values       {"1", "2", 1}
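A quick way to explore this table is to pass an example of each type to the type() function introduced earlier:

```python
# One example of each core data type; type() reports the class of an object
examples = [
    "hello world",                   # str
    42,                              # int
    0.406,                           # float
    True,                            # bool
    ["a", "b", "c"],                 # list
    ("a", "b", "c"),                 # tuple
    {"DC": "District of Columbia"},  # dict
    {"1", "2", 1},                   # set (note: the string "1" and the
]                                    # integer 1 are different values)

for obj in examples:
    print(type(obj).__name__, obj)
```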

Libraries


Python includes a number of built-in functions that are available wherever Python is installed. See the table below for some examples.

Examples of built-in functions

Name     Description                             Example
abs()    Gets the absolute value of a number     abs(-1.5)       # returns 1.5
max()    Gets the highest value in a sequence    max([1, 2, 3])  # returns 3
min()    Gets the lowest value in a sequence     min([1, 2, 3])  # returns 1
round()  Rounds a number to the nearest integer  round(5.4)      # returns 5
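These built-ins can be tried directly in a cell:

```python
# Built-in functions are available without any imports
print(abs(-1.5))       # 1.5
print(max([1, 2, 3]))  # 3
print(min([1, 2, 3]))  # 1
print(round(5.4))      # 5
```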

Python also includes a number of built-in modules. A module bundles functions and other code related to a single task or data type, simplifying the performance of common tasks. By using a common, well-tested code base, modules allow coders to work more quickly and with fewer errors.

Modules, packages, and libraries

A module is generally a single file. Collections of modules are referred to as packages or libraries.

The modules built into Python are referred to as the Python Standard Library. They can be accessed through a typical Python installation and do not require any additional downloads. A few examples are included in the table below, but as with the table of built-in functions, there are more where those came from.

Examples from the Python Standard Library

Library   Description
datetime  Read, write, and analyze dates and times
os        Create, manipulate, and get information about files and paths
random    Generate pseudo-random numbers

Documentation

The documentation for each module can be viewed on the official Python website. Documentation is an invaluable resource. It provides descriptions of what a module does and detailed information about how it can be used, often including examples.

Unlike the built-in functions, a module must be imported before it can be used. We do so using the import statement:

PYTHON

import datetime

Once imported, the module is available for use anywhere in the current session. A module is itself an object, and we must include the name of the module to access anything it defines. For example, to create a datetime.date object (that is, a date object defined by the datetime module), we include both the module and class name:

PYTHON

date = datetime.date(1970, 1, 1)

As with the built-in types, the datetime.date object includes its own suite of attributes and methods. We can, for example, use the year attribute to get the year:

PYTHON

date.year

OUTPUT

1970

Or convert the date to a string using the strftime() method and Python’s date format codes. Here, %Y corresponds to YYYY, %m to MM, and %d to DD.

PYTHON

date.strftime("%Y-%m-%d")

OUTPUT

'1970-01-01'

Like Python itself, the Python Standard Library is maintained by the Python Software Foundation. The built-in modules are limited to a relatively small set of operations expected to be useful to a broad population of users. However, Python allows users to create their own packages to perform actions that are beyond the scope of core Python. The rest of this lesson will focus on an external package called pandas.

Key Points

  • Python is a widely used language that can be used for a variety of tasks, including analyzing data
  • Python uses different data types to handle text, numbers, collections, and other kinds of data
  • Assign values to variables using the = operator
  • Use functions and methods to perform specific actions
  • Python’s functionality can be extended using packages developed by the community
  • Use the help() function and developer documentation to learn more about Python modules and packages

Content from Introducing pandas


Last updated on 2025-07-03

Overview

Questions

  • What data will we be working with in this lesson?
  • What is pandas?
  • Why use pandas for data analysis?
  • How do we read and write data using pandas?

Objectives

  • Learn about the dataset we’ll be working with
  • Look at the benefits of using pandas to analyze data
  • Import data from a CSV into a pandas dataframe
  • Learn how pandas handles different types of data
  • Write a dataframe to a CSV

Dataset description


The dataset description was taken from Data Management with SQL for Ecologists (CC-BY-4.0)

The data we will be using is a time-series for a small mammal community in southern Arizona. This is part of a project studying the effects of rodents and ants on the plant community that has been running for almost 40 years. The rodents are sampled on a series of 24 plots, with different experimental manipulations controlling which rodents are allowed to access which plots. Each row represents a single observation of an animal, including its species, weight, and hindfoot length.

This is a real dataset that has been used in over 100 publications. We’ve simplified it for the workshop, but you can download the full dataset and work with it using exactly the same tools we’ll learn about today.

Answering questions using data


Let’s look at some of the cleaned spreadsheets we downloaded during Setup to complete this challenge. Over the course of this lesson, we’ll be working with the following three files:

  • surveys.csv
  • species.csv
  • plots.csv

Discussion

Open each of these csv files and explore them. What information is contained in each file? Specifically, if we had the following research questions:

  • How has the hindfoot length and weight of Dipodomys species changed over time?
  • What is the average weight of each species, per year?
  • What information can I learn about Dipodomys species in the 2000s, over time?

What would we need to answer these questions? Which files have the data we need? What operations would we need to perform if we were doing these analyses by hand?

Hint: We can view CSV files by clicking on them in the left sidebar.

In order to answer the questions described above, we’ll need to do the following basic data operations:

  • select subsets of the data (rows and columns)
  • group subsets of data
  • do math and other calculations
  • combine data across spreadsheets
  • plot data to identify patterns

In addition, we don’t want to do this manually! Instead of searching for the right pieces of data ourselves, or clicking between spreadsheets, or manually sorting columns, we want to make the computer do the work.

In particular, we want to use a tool where it’s easy to repeat our analysis in case our data changes. We also want to do all this searching without actually modifying our source data.

Why use pandas?


The Python Data Analysis Library, or pandas, is a Python package used to work with dataframes. A dataframe is a representation of tabular data similar to a well-organized spreadsheet, consisting of rows (representing records) and columns (representing fields or variables). Tables are a very common format for representing scientific data and should be very familiar to anyone taking this course.

Pandas offers the same advantages as any well-written package: It creates a common codebase for working on a single task, in this case, analyzing data. Some benefits of this approach include:

  • Reliability: Provides flexible, well-tested methods for reading, querying, aggregating, grouping, and plotting data
  • Repeatability: Repeat the same analyses when data is added or changed
  • Speed: Faster in many cases than coding our own functions in Python
  • Reproducibility: Document and share code in narrative form using tools like Jupyter notebooks
  • Community: Access a large, active community for help when we run into problems

Importing data using pandas


Unlike the modules in the Python Standard Library discussed in the previous lesson, pandas is not part of the typical Python installation. Once it has been installed, however, it can be accessed using the same import command used to import built-in modules. By convention, pandas is imported using the alias “pd”. We can assign the alias using the as keyword:

PYTHON

import pandas as pd

Using aliases

Why use an alias? We will refer to pandas many, many times when writing a script, so it’s useful to abbreviate the name and save some keystrokes. But it’s also a matter of consistency with the larger community. Many of the core scientific Python packages, including pandas, recommend a specific alias, so most code shared online will use those aliases as well.

Now that pandas has been imported, we can access the function we need to load data from a CSV, pd.read_csv(). The function call has three parts:

  • The name (or in this case, alias) of the object that defines the function. This can be a module, package, or any other object. It can also be omitted in some cases (for example, when using a function built into Python).
  • The name of the function or method we’d like to use
  • A set of parentheses that tells the function to run

Many functions include parameters that allow the user to modify the behavior of the function. Parameters may be positional or named. In Python, data passed to positional parameters are called arguments (often abbreviated as args), and data passed to named parameters are called keyword arguments (often abbreviated as kwargs). In either case, the arguments are included inside the parentheses used to call the function.
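The distinction between the two kinds of arguments can be sketched with a small, hypothetical function (the name greet and its parameters are invented for illustration):

```python
# One positional parameter (name) and one named parameter (punctuation)
def greet(name, punctuation="!"):
    return "Hello, " + name + punctuation

print(greet("world"))                   # positional argument only
print(greet("world", punctuation="?"))  # adding a keyword argument
```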

Below, we will pass a single argument to pd.read_csv(): A string that tells the function where to find the surveys.csv file.

PYTHON

pd.read_csv("data/surveys.csv")

OUTPUT

       record_id  month  day  year  plot_id species_id  sex  hindfoot_length  weight
0              1      7   16  1977        2         NL    M             32.0     NaN
1              2      7   16  1977        3         NL    M             33.0     NaN
2              3      7   16  1977        2         DM    F             37.0     NaN
3              4      7   16  1977        7         DM    M             36.0     NaN
4              5      7   16  1977        3         DM    M             35.0     NaN
...          ...    ...  ...   ...      ...        ...  ...              ...     ...
35544      35545     12   31  2002       15         AH  NaN              NaN     NaN
35545      35546     12   31  2002       15         AH  NaN              NaN     NaN
35546      35547     12   31  2002       10         RM    F             15.0    14.0
35547      35548     12   31  2002        7         DO    M             36.0    51.0
35548      35549     12   31  2002        5        NaN  NaN              NaN     NaN

35549 rows × 9 columns

Here are a few things to observe about how the dataframe is structured:

  • By default, the notebook displays the first and last five rows of the dataframe
  • Each row represents a record
  • Each column represents a field
  • The unlabeled column on the far left is called the row label
  • pandas has done a lot of work behind the scenes when reading the data, including:
    • Assigning the row index as the row label
    • Assigning each column a data type based on its contents
    • Assigning certain cells the value NaN, which stands for “not a number” and is used to designate null values in the dataset. Here, those cells represent blank cells in the spreadsheet.

Much of this behavior can be controlled when the spreadsheet is first read by using keyword arguments. For example, to force pd.read_csv() to use the existing record_id column as the row label, use the index_col keyword argument:

PYTHON

pd.read_csv("data/surveys.csv", index_col="record_id")

OUTPUT

           month  day  year  plot_id species_id  sex  hindfoot_length  weight
record_id
1              7   16  1977        2         NL    M             32.0     NaN
2              7   16  1977        3         NL    M             33.0     NaN
3              7   16  1977        2         DM    F             37.0     NaN
4              7   16  1977        7         DM    M             36.0     NaN
5              7   16  1977        3         DM    M             35.0     NaN
...          ...  ...   ...      ...        ...  ...              ...     ...
35545         12   31  2002       15         AH  NaN              NaN     NaN
35546         12   31  2002       15         AH  NaN              NaN     NaN
35547         12   31  2002       10         RM    F             15.0    14.0
35548         12   31  2002        7         DO    M             36.0    51.0
35549         12   31  2002        5        NaN  NaN              NaN     NaN

35549 rows × 8 columns

Now record_id appears on the far left, indicating that pandas is using that column as the row label.

Challenge

Run help() on the pd.read_csv() function (note that you should omit the trailing parentheses). Alternatively, take a look at the much prettier documentation provided by the pandas developers. Based on that information, answer the following questions:

  1. What parameter would you use to control how null values are interpreted?
  2. What values are interpreted as NaN by default?

na_values

In addition to empty cells, the following values are interpreted as NaN, or null, by pandas. These strings may look familiar from programs like Excel.

  • #N/A
  • #N/A N/A
  • #NA
  • -1.#IND
  • -1.#QNAN
  • -NaN
  • -nan
  • 1.#IND
  • 1.#QNAN
  • <NA>
  • N/A
  • NA
  • NULL
  • NaN
  • n/a
  • nan
  • null
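As a hedged sketch of the na_values parameter in action (assuming pandas is installed; the inline CSV below is invented for illustration, read from an in-memory buffer instead of a file):

```python
import io

import pandas as pd

# A small inline CSV: "missing" is NOT one of pandas' default null strings,
# but "NA" is
csv_text = "record_id,weight\n1,10\n2,missing\n3,NA\n"

# na_values adds our custom string to the default list of null values
df = pd.read_csv(io.StringIO(csv_text), na_values=["missing"])

# Both "missing" and "NA" are now read as NaN
print(df["weight"].isna().sum())  # 2
```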

Assigning to variables


We can assign the dataframe to a variable so we don’t need to load it every time we want to access it. As we saw in the previous lesson, we use a single equals sign to assign an object to a variable. The variable name should be short and descriptive. By convention, variable names in Python use snake_case (that is, lower case with individual words separated by underscores).

PYTHON

surveys = pd.read_csv("data/surveys.csv")
surveys

OUTPUT

       record_id  month  day  year  plot_id species_id  sex  hindfoot_length  weight
0              1      7   16  1977        2         NL    M             32.0     NaN
1              2      7   16  1977        3         NL    M             33.0     NaN
2              3      7   16  1977        2         DM    F             37.0     NaN
3              4      7   16  1977        7         DM    M             36.0     NaN
4              5      7   16  1977        3         DM    M             35.0     NaN
...          ...    ...  ...   ...      ...        ...  ...              ...     ...
35544      35545     12   31  2002       15         AH  NaN              NaN     NaN
35545      35546     12   31  2002       15         AH  NaN              NaN     NaN
35546      35547     12   31  2002       10         RM    F             15.0    14.0
35547      35548     12   31  2002        7         DO    M             36.0    51.0
35548      35549     12   31  2002        5        NaN  NaN              NaN     NaN

35549 rows × 9 columns

Understanding data types in pandas


We can use the info() method to see how pandas interpreted each column in the dataset. This method gives us the name, count, and data type of each column and provides some information about the dataset as a whole (for example, memory usage, which is helpful to know when working with large datasets):

PYTHON

surveys.info()

OUTPUT

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35549 entries, 0 to 35548
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   record_id        35549 non-null  int64
 1   month            35549 non-null  int64
 2   day              35549 non-null  int64
 3   year             35549 non-null  int64
 4   plot_id          35549 non-null  int64
 5   species_id       34786 non-null  object
 6   sex              33038 non-null  object
 7   hindfoot_length  31438 non-null  float64
 8   weight           32283 non-null  float64
dtypes: float64(2), int64(5), object(2)
memory usage: 2.4+ MB

Note that the data types used by pandas look a little different from the ones used by Python itself. See the table below for the rough equivalents.

Data types in pandas

Data Type Description Similar To
object Character string or mixed str
int64 Integer numerical int
float64 Approximate numerical float
bool Stores True or False values bool
datetime64 Stores date and time values datetime.datetime
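To see how pandas assigns these types, we can build a small dataframe and inspect its dtypes attribute. This is a minimal sketch with toy values, not part of the survey data:

```python
import pandas as pd

# A small dataframe mixing the three most common types
df = pd.DataFrame({
    "plot_id": [1, 2],           # whole numbers become int64
    "weight": [4.0, None],       # decimals (and NaN) become float64
    "species_id": ["NL", "DM"],  # strings become object
})
print(df.dtypes)
```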

Saving a dataframe


When analyzing a dataset, we’ll often want to save our work to a file. The to_csv() method can be used to write a dataframe to a CSV, a text file that places each row on a separate line and separates individual cells with commas. We include a positional argument with the path to which we want to save the file, as well as the index keyword argument. Setting index to False tells pandas not to include the row label when writing the CSV.

PYTHON

surveys.to_csv("data/surveys_mod.csv", index=False)

This is far from the only option for writing data—pandas supports a variety of file types for both reading and writing. Try searching for “read_” in the pandas API reference to see other supported formats.
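One way to check that index=False preserves the data is to round-trip a small dataframe. This sketch writes to an in-memory StringIO buffer instead of a file on disk, using toy values:

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame({"record_id": [1, 2], "weight": [4.0, 51.0]})

# Write to an in-memory buffer, omitting the row labels
buf = StringIO()
df.to_csv(buf, index=False)

# Reading the CSV back produces an identical dataframe
buf.seek(0)
print(pd.read_csv(buf).equals(df))  # True
```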

Key Points

  • This lesson uses real data from a decades-long survey of rodents in Arizona
  • pandas is a data analysis package that allows users to read, manipulate, and view tabular data using Python
  • pandas represents data as a dataframe consisting of rows (records) and columns (fields or variables)
  • We can read a dataframe from CSV using the pd.read_csv() function and write a dataframe to CSV using the to_csv() method
  • The behavior of a function can be modified by including arguments and keyword arguments when the function is called
  • pandas uses its own classes to represent text, numbers, booleans, and datetimes

Content from Accessing and Filtering Data


Last updated on 2025-07-15 | Edit this page

Overview

Questions

  • How can we look at individual rows and columns in a dataframe?
  • How can we look at subsets of the dataset?

Objectives

  • Access individual rows and columns
  • Access multiple columns at once using a list
  • Filter the dataframe based on the data it contains
  • Sort the dataframe

In the previous lesson, we saw how to load a dataframe from a file. Now we’ll look at how to access the data within that dataframe. We’ll begin by importing pandas and reading our CSV, as we did in the previous lesson:

PYTHON

import pandas as pd

surveys = pd.read_csv("data/surveys.csv")
surveys

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length weight
0 1 7 16 1977 2 NL M 32.0 NaN
1 2 7 16 1977 3 NL M 33.0 NaN
2 3 7 16 1977 2 DM F 37.0 NaN
3 4 7 16 1977 7 DM M 36.0 NaN
4 5 7 16 1977 3 DM M 35.0 NaN
35544 35545 12 31 2002 15 AH NaN NaN NaN
35545 35546 12 31 2002 15 AH NaN NaN NaN
35546 35547 12 31 2002 10 RM F 15.0 14.0
35547 35548 12 31 2002 7 DO M 36.0 51.0
35548 35549 12 31 2002 5 NaN NaN NaN NaN

35549 rows × 9 columns

Getting a column


We can get a single column from the dataframe using square brackets. Square brackets are used in Python to access objects inside a container like a list, dict, or DataFrame. To get a column, we pass the name of the column inside a set of square brackets appended to the dataframe. For example, to retrieve the year, use:

PYTHON

surveys["year"]

OUTPUT

0        1977
1        1977
2        1977
3        1977
4        1977
         ...
35544    2002
35545    2002
35546    2002
35547    2002
35548    2002
Name: year, Length: 35549, dtype: int64

A single column is returned as a Series, which is why the output for this cell is formatted differently than the other cells in this lesson. The main difference between a Series and a DataFrame is that a DataFrame can have multiple columns. The two classes share many but not all attributes and methods.
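The difference is easy to see by checking the type of each object. A minimal sketch with a toy dataframe:

```python
import pandas as pd

df = pd.DataFrame({"year": [1977, 1978], "month": [7, 8]})

# The full table is a DataFrame...
print(type(df).__name__)          # DataFrame

# ...but a single column is a Series
print(type(df["year"]).__name__)  # Series
```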

Note also that this is a copy of data from the original dataframe. Changing values in the series will have no effect on the original dataframe. Most operations on a DataFrame or Series return a copy of the original object.

Getting unique values


We can get the list of unique values within a column using the unique() method:

PYTHON

surveys["species_id"].unique()

OUTPUT

array(['NL', 'DM', 'PF', 'PE', 'DS', 'PP', 'SH', 'OT', 'DO', 'OX', 'SS',
       'OL', 'RM', nan, 'SA', 'PM', 'AH', 'DX', 'AB', 'CB', 'CM', 'CQ',
       'RF', 'PC', 'PG', 'PH', 'PU', 'CV', 'UR', 'UP', 'ZL', 'UL', 'CS',
       'SC', 'BA', 'SF', 'RO', 'AS', 'SO', 'PI', 'ST', 'CU', 'SU', 'RX',
       'PB', 'PL', 'PX', 'CT', 'US'], dtype=object)

Getting multiple columns


It is also possible to retrieve more than one column at a time. To do so, we’ll use the built-in list data type.

Python uses list to store sequences, that is, ordered lists of objects. Any type of Python object can be stored in a list: strings, integers, floats, even other lists or collections. Once created, a list can be modified by appending, inserting, or deleting items. We will not be going into detail about these operations here, but as always you can learn more about how to use a list using help() or the Python docs.
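As a brief illustration of those operations (using hypothetical column names, not tied to the survey data):

```python
# Start with a short list of column names
cols = ["year", "month"]

cols.append("day")           # add an item to the end
cols.insert(0, "record_id")  # insert an item at a given position
print(cols)  # ['record_id', 'year', 'month', 'day']

cols.remove("record_id")     # delete an item by value
print(cols)  # ['year', 'month', 'day']
```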

You can create a list using square brackets. Let’s create a list of the three columns in our dataframe that together give the date of an observation:

PYTHON

cols = ["year", "month", "day"]

When we pass this list to the surveys dataframe using square brackets, as we did above, we get a copy of the dataframe containing just those columns. Note that, because we asked for more than one column, pandas returns a dataframe:

PYTHON

surveys[cols]

OUTPUT

year month day
0 1977 7 16
1 1977 7 16
2 1977 7 16
3 1977 7 16
4 1977 7 16
35544 2002 12 31
35545 2002 12 31
35546 2002 12 31
35547 2002 12 31
35548 2002 12 31

35549 rows × 3 columns

Suppose we want to get the unique values for multiple columns. The unique() method only works on a Series, that is, a single column. Instead, we can use the drop_duplicates() method on a copy of the dataframe with the columns we’re interested in. Like any well-named method, drop_duplicates() does exactly what the name implies: It returns a copy of the dataframe with all duplicate rows removed.

PYTHON

surveys[["plot_id", "species_id"]].drop_duplicates()

OUTPUT

plot_id species_id
0 2 NL
1 3 NL
2 2 DM
3 7 DM
4 3 DM
35474 22 SF
35511 11 US
35519 9 SF
35527 13 US
35543 15 US

612 rows × 2 columns

Getting one or more rows


Pandas provides a variety of ways to view rows within a dataframe. We can get the rows at the beginning of the dataframe using head():

PYTHON

surveys.head()

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length weight
0 1 7 16 1977 2 NL M 32.0 NaN
1 2 7 16 1977 3 NL M 33.0 NaN
2 3 7 16 1977 2 DM F 37.0 NaN
3 4 7 16 1977 7 DM M 36.0 NaN
4 5 7 16 1977 3 DM M 35.0 NaN

By default, this method returns the first five rows. We can provide a number inside the parentheses if we want to view a different number of rows:

PYTHON

surveys.head(10)

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length weight
0 1 7 16 1977 2 NL M 32.0 NaN
1 2 7 16 1977 3 NL M 33.0 NaN
2 3 7 16 1977 2 DM F 37.0 NaN
3 4 7 16 1977 7 DM M 36.0 NaN
4 5 7 16 1977 3 DM M 35.0 NaN
5 6 7 16 1977 1 PF M 14.0 NaN
6 7 7 16 1977 2 PE F NaN NaN
7 8 7 16 1977 1 DM M 37.0 NaN
8 9 7 16 1977 1 DM F 34.0 NaN
9 10 7 16 1977 6 PF F 20.0 NaN

The tail() method is similar, except it returns rows from the end of the table:

PYTHON

surveys.tail()

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length weight
35544 35545 12 31 2002 15 AH NaN NaN NaN
35545 35546 12 31 2002 15 AH NaN NaN NaN
35546 35547 12 31 2002 10 RM F 15.0 14.0
35547 35548 12 31 2002 7 DO M 36.0 51.0
35548 35549 12 31 2002 5 NaN NaN NaN NaN

Or we can use sample() to return a random row from anywhere in the dataframe:

PYTHON

surveys.sample()

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length weight
25936 25937 5 11 1997 15 OT F 20.0 25.0

If you’re following along, you may notice that the output of this method on your screen differs from what’s shown here. That’s exactly what we’d expect to see. Remember, sample() is returning a random row—it would be more surprising if the outputs were the same!

Slicing the dataframe


The head(), tail(), and sample() methods are useful for getting a feel for how our data is structured, but we may also want to look at specific rows. One way to do so is to extract rows based on where they appear in the dataframe. We can use square brackets to extract these slices. A slice is a subset of the dataframe starting at one row and ending at another. To get a slice, we pass the starting and ending indexes to the square brackets as start:end:

PYTHON

surveys[2:5]

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length weight
2 3 7 16 1977 2 DM F 37.0 NaN
3 4 7 16 1977 7 DM M 36.0 NaN
4 5 7 16 1977 3 DM M 35.0 NaN

There are three things to be aware of when slicing a dataframe:

  • Row indexes are zero-based. The first row has an index of 0, not 1.
  • When slicing, the slice includes start but not the end index. In this case, that means the slice includes rows 2, 3, and 4 but not 5.
  • The row label can be different from the row index. They happen to be the same here, but don’t count on that being true.

Core Python types like list and tuple use the same conventions, as do most Python packages that work with sequences.
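The same rules can be checked with a plain Python list, as in this quick sketch:

```python
letters = ["a", "b", "c", "d", "e"]

# Zero-based: index 2 is the third item
print(letters[2])    # c

# Slices include the start index but not the end index
print(letters[2:5])  # ['c', 'd', 'e']
```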

Filtering data


It is often more useful to subset a dataframe based on the data itself. Pandas provides a variety of ways to filter a dataframe in this way. For example, suppose we want to look at a specific species in the surveys dataframe. We can view the rows matching a given species using the same square brackets we used above to select specific columns and rows. Here, however, instead of using a value or list of values, we will use a conditional expression.

A conditional expression is a statement that evaluates as either True or False. They often make use of inequality operators, for example:

  • == for equals
  • != for does not equal
  • > for greater than
  • >= for greater than or equal to
  • < for less than
  • <= for less than or equal to

Examples of conditional statements include:

  • "a" == "b" evaluates False
  • "a" != b" evaluates True
  • 3 > 4 evaluates False

Note that, when comparing strings, evaluations are case sensitive:

  • "a" == "A" evaluates False

= for assignment, == for equality

Remember that, in Python, a single equal sign is used to assign values to variables. We’ve already used the assignment operator in this lesson, for example, when we assigned the dataframe returned by pd.read_csv() to the surveys variable.
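Under the hood, comparing a column to a value produces a boolean Series, with one True or False per row. A minimal sketch with a toy Series:

```python
import pandas as pd

species = pd.Series(["NL", "DM", "DM"])

# The comparison is applied element by element
mask = species == "DM"
print(mask.tolist())  # [False, True, True]
```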

To limit the dataframe to rows matching the species “DM”, we will again use square brackets. This time, instead of passing a string or a number, we will include the conditional surveys["species_id"] == "DM" inside the square brackets:

PYTHON

surveys[surveys["species_id"] == "DM"]

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length weight
2 3 7 16 1977 2 DM F 37.0 NaN
3 4 7 16 1977 7 DM M 36.0 NaN
4 5 7 16 1977 3 DM M 35.0 NaN
7 8 7 16 1977 1 DM M 37.0 NaN
8 9 7 16 1977 1 DM F 34.0 NaN
35532 35533 12 31 2002 14 DM F 36.0 48.0
35533 35534 12 31 2002 14 DM M 37.0 56.0
35534 35535 12 31 2002 14 DM M 37.0 53.0
35535 35536 12 31 2002 14 DM F 35.0 42.0
35536 35537 12 31 2002 14 DM F 36.0 46.0

10596 rows × 9 columns

Other comparisons can be used in the same way. To limit our results to observations made in or after 2000, use:

PYTHON

surveys[surveys["year"] >= 2000]

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length weight
30158 30159 1 8 2000 1 PP F 22.0 17.0
30159 30160 1 8 2000 1 DO M 35.0 53.0
30160 30161 1 8 2000 1 PP F 21.0 17.0
30161 30162 1 8 2000 1 DM M 36.0 50.0
30162 30163 1 8 2000 1 PP M 20.0 16.0
35544 35545 12 31 2002 15 AH NaN NaN NaN
35545 35546 12 31 2002 15 AH NaN NaN NaN
35546 35547 12 31 2002 10 RM F 15.0 14.0
35547 35548 12 31 2002 7 DO M 36.0 51.0
35548 35549 12 31 2002 5 NaN NaN NaN NaN

5391 rows × 9 columns

As when we selected columns above, each filtering operation returns a copy of the dataframe.

Using complex filters


When analyzing data, we will often need to filter on multiple columns at one time. In pandas, we can combine conditionals using bitwise operators. These work like the terms AND and OR in many search interfaces:

  • &: True if conditions on both sides of the operator are True (and)
  • |: True if a condition on either side is True (or)

To return all observations of DM in or after 2000, we can combine the two conditionals we used previously into a single operation. Note that, when joining conditionals using & or |, we must wrap each individual condition in parentheses. If we omit the parentheses, pandas will not perform the comparisons in the expected order.

PYTHON

surveys[(surveys["species_id"] == "DM") & (surveys["year"] >= 2000)]

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length weight
30161 30162 1 8 2000 1 DM M 36.0 50.0
30178 30179 1 8 2000 12 DM M 36.0 60.0
30195 30196 1 8 2000 17 DM M 37.0 52.0
30196 30197 1 8 2000 17 DM F 34.0 43.0
30209 30210 1 8 2000 22 DM M 38.0 56.0
35532 35533 12 31 2002 14 DM F 36.0 48.0
35533 35534 12 31 2002 14 DM M 37.0 56.0
35534 35535 12 31 2002 14 DM M 37.0 53.0
35535 35536 12 31 2002 14 DM F 35.0 42.0
35536 35537 12 31 2002 14 DM F 36.0 46.0

847 rows × 9 columns

We can also use methods to filter the dataframe. For example, isin() can be used to match a list of values. Methods can be combined with other conditionals as above. The example below returns rows from 2000 or later with either “DM”, “DO”, or “DS” in the species_id column:

PYTHON

surveys[
    surveys["species_id"].isin(["DM", "DO", "DS"]) & (surveys["year"] >= 2000)
]

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length weight
30159 30160 1 8 2000 1 DO M 35.0 53.0
30161 30162 1 8 2000 1 DM M 36.0 50.0
30166 30167 1 8 2000 1 DO M 35.0 41.0
30170 30171 1 8 2000 2 DO M 36.0 52.0
30172 30173 1 8 2000 2 DO F 35.0 54.0
35533 35534 12 31 2002 14 DM M 37.0 56.0
35534 35535 12 31 2002 14 DM M 37.0 53.0
35535 35536 12 31 2002 14 DM F 35.0 42.0
35536 35537 12 31 2002 14 DM F 36.0 46.0
35547 35548 12 31 2002 7 DO M 36.0 51.0

1268 rows × 9 columns

Sorting data


We can sort a dataframe using the sort_values() method. To sort by weight, we’ll pass the name of that column to sort_values():

PYTHON

surveys.sort_values("weight")

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length weight
28125 28126 6 28 1998 15 PF M NaN 4.0
217 218 9 13 1977 1 PF M 13.0 4.0
9822 9823 1 19 1985 23 RM M 16.0 4.0
9852 9853 1 19 1985 17 RM M 16.0 4.0
9936 9937 2 16 1985 21 RM M 16.0 4.0
35530 35531 12 31 2002 13 PB F 27.0 NaN
35543 35544 12 31 2002 15 US NaN NaN NaN
35544 35545 12 31 2002 15 AH NaN NaN NaN
35545 35546 12 31 2002 15 AH NaN NaN NaN
35548 35549 12 31 2002 5 NaN NaN NaN NaN

35549 rows × 9 columns

By default, rows are sorted in ascending order (smallest to largest). We can reorder them from largest to smallest using the ascending keyword argument:

PYTHON

surveys.sort_values("weight", ascending=False)

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length weight
33048 33049 11 17 2001 12 NL M 33.0 280.0
12870 12871 5 28 1987 2 NL M 32.0 278.0
15458 15459 1 11 1989 9 NL M 36.0 275.0
2132 2133 10 25 1979 2 NL F 33.0 274.0
12728 12729 4 26 1987 2 NL M 32.0 270.0
35530 35531 12 31 2002 13 PB F 27.0 NaN
35543 35544 12 31 2002 15 US NaN NaN NaN
35544 35545 12 31 2002 15 AH NaN NaN NaN
35545 35546 12 31 2002 15 AH NaN NaN NaN
35548 35549 12 31 2002 5 NaN NaN NaN NaN

35549 rows × 9 columns

We can sort on multiple fields at once by passing a list of column names. We can control how each column sorts by passing a list with the same number of values (that is, one value per column) to the ascending keyword. The cell below sorts the results first by species_id (largest to smallest), then by weight (smallest to largest):

PYTHON

surveys.sort_values(["species_id", "weight"], ascending=[False, True])

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length weight
14249 14250 3 20 1988 18 ZL NaN NaN NaN
14350 14351 4 17 1988 23 ZL NaN NaN NaN
35511 35512 12 31 2002 11 US NaN NaN NaN
35512 35513 12 31 2002 11 US NaN NaN NaN
35527 35528 12 31 2002 13 US NaN NaN NaN
34756 34757 9 10 2002 23 NaN NaN NaN NaN
34969 34970 10 6 2002 10 NaN NaN NaN NaN
35187 35188 11 10 2002 10 NaN NaN NaN NaN
35384 35385 12 8 2002 10 NaN NaN NaN NaN
35548 35549 12 31 2002 5 NaN NaN NaN NaN

35549 rows × 9 columns

As with the dataframe methods above, sort_values() returns a copy of the original dataframe and leaves the original untouched.

Challenge

Write a query that returns year, species_id, and weight from the surveys table, sorted with the largest weights at the top.

PYTHON

# Create a subset containing only year, species_id, and weight
subset = surveys[["year", "species_id", "weight"]]

# Sort the subset by weight
subset.sort_values("weight", ascending=False)

OUTPUT

year species_id weight
33048 2001 NL 280.0
12870 1987 NL 278.0
15458 1989 NL 275.0
2132 1979 NL 274.0
12728 1987 NL 270.0
35530 2002 PB NaN
35543 2002 US NaN
35544 2002 AH NaN
35545 2002 AH NaN
35548 2002 NaN NaN

35549 rows × 3 columns

Showing data on a plot


We will discuss data visualization using plotly in depth in lesson 6 but will introduce some fundamental concepts as we go along. Like pandas, plotly is an external package installed separately from Python itself. It can be used to create interactive, highly customizable plots based on pandas dataframes using just a few lines of code. To create a scatter plot of weight and hindfoot length, we first need to import plotly:

PYTHON

import plotly.express as px

Once plotly is loaded, we can create a scatter plot using the px.scatter() method. We include the dataframe as the first argument, then x and y keyword arguments to select the columns we want to show on our scatter plot:

PYTHON

px.scatter(surveys, x="weight", y="hindfoot_length")

OUTPUT

This simple plot is limited in what it can tell us about the observations in the dataset. We will return to this scatter plot in later lessons to see how we can improve it to better understand the survey data.

Key Points

  • Use square brackets to access rows, columns, and specific cells
  • Sort data and get unique values in a dataframe using methods provided by pandas
  • By default, most dataframe operations return a copy of the original data
  • Scatter plots can be used to visualize how two parameters in a dataset covary

Content from Aggregating Data


Last updated on 2025-07-15 | Edit this page

Overview

Questions

  • How do we calculate summary statistics?
  • How do we group data?
  • How do null values affect calculations?

Objectives

  • Introduce aggregation calculations in pandas
  • Introduce grouping in pandas
  • Learn about how pandas handles null values
  • Make a boxplot showing summary stats

In the previous lesson, we saw how to access and filter our dataframe. Here, we’ll use pandas to calculate summary statistics for some of that data.

As always, we’ll begin by importing pandas and reading our CSV:

PYTHON

import pandas as pd

surveys = pd.read_csv("data/surveys.csv")

Aggregating data


Aggregation allows us to describe the data in our dataframe by calculating totals (like the number of records) and statistics (like the mean value of a column). Pandas allows us to run these calculations on dataframes or on subsets of interest.

Suppose we want to know how many records are in our dataset. We’ve already seen that we can use the info() method to get high-level information about the dataset, including the number of entries. What if we just wanted the number of rows? One approach is to use the built-in function len(), which is used to count the number of items in an object (for example, the number of characters in a string or the number of items in a list). When used on a dataframe, len() returns the number of rows:

PYTHON

len(surveys)

OUTPUT

35549

Pandas provides a suite of aggregation methods that go beyond this simple case. For example, we can count the number of non-NaN values in each column using count():

PYTHON

surveys.count()

OUTPUT

record_id          35549
month              35549
day                35549
year               35549
plot_id            35549
species_id         34786
sex                33038
hindfoot_length    31438
weight             32283
dtype: int64

Or we can find out the total weight of all individuals using sum():

PYTHON

surveys["weight"].sum()

OUTPUT

np.float64(1377594.0)

Other aggregation methods supported by pandas include min(), max(), and mean(). These methods all ignore NaNs, so missing data does not affect the calculations.

Challenge

Calculate the total weight, average weight, minimum and maximum weights for all animals observed during the survey. Can you modify your code so that it outputs these values only for weights between 5 and 10 grams?

PYTHON

# Create a subset of only the animals between 5 and 10 grams
weights = surveys[(surveys["weight"] > 5) & (surveys["weight"] < 10)]["weight"]

# Display aggregation calculations using a dict
{
    "sum": weights.sum(),
    "mean": weights.mean(),
    "min": weights.min(),
    "max": weights.max(),
}

OUTPUT

{'sum': np.float64(16994.0),
 'mean': np.float64(7.91523055426176),
 'min': np.float64(6.0),
 'max': np.float64(9.0)}

To quickly generate summary statistics, we can use the describe() method instead of calling each aggregation method separately. When we use this method on a dataframe, it calculates stats for all columns with numeric data:

PYTHON

surveys.describe()

OUTPUT

record_id month day year plot_id hindfoot_length weight
count 35549.000000 35549.000000 35549.000000 35549.000000 35549.000000 31438.000000 32283.000000
mean 17775.000000 6.477847 15.991195 1990.475231 11.397001 29.287932 42.672428
std 10262.256696 3.396925 8.257366 7.493355 6.799406 9.564759 36.631259
min 1.000000 1.000000 1.000000 1977.000000 1.000000 2.000000 4.000000
25% 8888.000000 4.000000 9.000000 1984.000000 5.000000 21.000000 20.000000
50% 17775.000000 6.000000 16.000000 1990.000000 11.000000 32.000000 37.000000
75% 26662.000000 10.000000 23.000000 1997.000000 17.000000 36.000000 48.000000
max 35549.000000 12.000000 31.000000 2002.000000 24.000000 70.000000 280.000000

You can see that describe() isn’t picky: It includes the ID, year, month, and day columns in its results. Notice also that the counts differ between columns. This is because count() only counts non-NaN rows.

If desired, we can also describe a single column at a time:

PYTHON

surveys["weight"].describe()

OUTPUT

count    32283.000000
mean        42.672428
std         36.631259
min          4.000000
25%         20.000000
50%         37.000000
75%         48.000000
max        280.000000
Name: weight, dtype: float64

If we need more control over the output (for example, if we want to calculate the total weight of all animals, as in the challenge above), pandas provides the agg() method, which allows us to specify aggregation methods by column. This method requires a dict, which is a built-in data type that we have not discussed yet.

Like a list, a dict is a container that can include more than one object. Whereas a list is a sequence, a dict is a mapping consisting of keys that map to values. Because it is built in, a dict can be used in any Python application without having to import anything.

Let’s see what that looks like in practice by defining a dict that maps lowercase to uppercase letters. We use curly braces to define a dict. The parts of each item are separated by a colon, with the key on the left and the value on the right.

PYTHON

letters = {"a": "A", "b": "B", "c": "C"}

To retrieve the value for a given key, we use square brackets:

PYTHON

letters["a"]

OUTPUT

'A'

There are many other ways to interact with a dict. We can add or delete items, remap a key to a new value, or iterate over all the keys, values, or items inside. Use help() or check out the Python docs to learn more about dict.
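A few of those operations in action, using a fresh copy of the letters dict from above:

```python
letters = {"a": "A", "b": "B", "c": "C"}

letters["d"] = "D"  # add a new item
letters["a"] = "@"  # remap an existing key to a new value
del letters["b"]    # delete an item

# Iterate over the remaining key-value pairs
for key, value in letters.items():
    print(key, value)
```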

Now let’s return to calculating summary statistics using agg(). For this method, each key must be a column name and each value a list of the names of aggregation methods.

To calculate the total weight, mean weight, and mean hindfoot length of all records in the survey, we can use:

PYTHON

surveys.agg({"weight": ["sum", "mean"], "hindfoot_length": ["mean"]})

OUTPUT

weight hindfoot_length
sum 1.377594e+06 NaN
mean 4.267243e+01 29.287932

Grouping data


Up to now, we have been calculating statistics based on all records in the dataframe, but the dataset includes a variety of species, each with a characteristic size and number of observations. We may find it useful to look at subsets for individual species (or plots, years, etc.). We can do so using the groupby() method, which groups records based on the data in one or more columns. To group by species_id, use:

PYTHON

surveys.groupby("species_id")

OUTPUT

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f3cfb63d7f0>

The groupby() method returns a special object that includes the rows associated with each species_id, but we wouldn’t know that based on the output of the previous cell. Instead of a table or another view of the grouped records, the method outputs a string with the type of object (pandas.core.groupby.generic.DataFrameGroupBy) and some information about where that object is stored (0x...). Not the most useful thing in the world! We might encounter strings like this when developers determine that there is no concise, human-readable way to represent an object.

Nevertheless, we can use the DataFrameGroupBy object to calculate summary statistics for each group. In the example below, we’ll calculate the number of times each species appears in the dataset. To simplify the output, we’ll limit the count to a single column, species_id, using square brackets. (Because count() excludes NaNs, it’s usually a good practice to choose a column that does not contain missing data. Record ID fields are a good choice, but any field that is populated for every row will work.)

PYTHON

surveys.groupby("species_id")["species_id"].count()

OUTPUT

species_id
AB      303
AH      437
AS        2
BA       46
CB       50
CM       13
CQ       16
CS        1
CT        1
CU        1
CV        1
DM    10596
DO     3027
DS     2504
DX       40
NL     1252
OL     1006
OT     2249
OX       12
PB     2891
PC       39
PE     1299
PF     1597
PG        8
PH       32
PI        9
PL       36
PM      899
PP     3123
PU        5
PX        6
RF       75
RM     2609
RO        8
RX        2
SA       75
SC        1
SF       43
SH      147
SO       43
SS      248
ST        1
SU        5
UL        4
UP        8
UR       10
US        4
ZL        2
Name: species_id, dtype: int64

To group by multiple columns, we can pass a list to groupby(). To determine how many observations of each species were made in each year, we can use:

PYTHON

surveys.groupby(["species_id", "year"])["record_id"].count()

OUTPUT

species_id  year
AB          1980     5
            1981     7
            1982    34
            1983    41
            1984    12
                    ..
UR          1994     1
            2001     2
            2002     1
US          2002     4
ZL          1988     2
Name: record_id, Length: 509, dtype: int64

Challenge

Write statements to answer the following questions:

  1. How many individuals were counted in each year in total?
  2. How many were counted each year, for each different species?
  3. What was the average weight of each species in each year?
  4. How many individuals were counted for each species that was observed more than 10 times?

Can you get the answer to both 2 and 3 in a single query?

How many individuals were counted in each year in total?

PYTHON

# Individual counts per year
surveys.groupby("year")["record_id"].count()

OUTPUT

year
1977     503
1978    1048
1979     719
1980    1415
1981    1472
1982    1978
1983    1673
1984     981
1985    1438
1986     942
1987    1671
1988    1469
1989    1569
1990    1311
1991    1347
1992    1038
1993     750
1994     668
1995    1222
1996    1706
1997    2493
1998    1610
1999    1135
2000    1552
2001    1610
2002    2229
Name: record_id, dtype: int64

How many individuals were counted each year for each different species?

PYTHON

# Individual counts by species and year
result = surveys.groupby(["year", "species_id"])["record_id"].count()

# Use the option_context context manager to show all rows
with pd.option_context("display.max_rows", None):
    print(result)

OUTPUT

year  species_id
1977  DM            264
      DO             12
      DS             98
      NL             31
      OL             10
      OT             17
      OX              7
      PE              6
      PF             31
      PP              7
      RM              2
      SH              1
      SS              1
1978  AH              1
      DM            389
      DO             22
      DS            320
      NL             48
      OL             47
      OT             66
      OX              1
      PE             17
      PF             40
      PM              3
      PP             24
      RM              3
      SA              2
      SH              1
      SS              8
1979  DM            209
      DO             31
      DS            204
      NL             30
      OL             46
      OT             68
      PE             17
      PF             19
      PM              6
      PP             20
      RM              6
      SS              2
1980  AB              5
      CB              1
      CM             12
      CQ              2
      DM            493
      DO             89
      DS            346
      DX              1
      NL             57
      OL             85
      OT             96
      PE             20
      PF             95
      PM              1
      PP             17
      RM             46
      SS              9
1981  AB              7
      CB              1
      CQ              4
      DM            559
      DO            123
      DS            354
      NL             63
      OL             55
      OT             65
      PE             43
      PF             59
      PM             16
      PP             20
      RM             29
      SA             11
      SS              8
1982  AB             34
      AH              1
      CQ              8
      DM            609
      DO            117
      DS            354
      NL            111
      OL             83
      OT             94
      OX              1
      PC              2
      PE            107
      PF            152
      PM             72
      PP             32
      RF              2
      RM            154
      SA             17
      SH              2
      SS             12
1983  AB             41
      AH              2
      CB              3
      DM            528
      DO            167
      DS            280
      DX              1
      NL             98
      OL             65
      OT             86
      PC              3
      PE             38
      PF            114
      PG              1
      PM             42
      PP             32
      RM            127
      SA             10
      SH              2
      SS             12
1984  AB             12
      AH              2
      CB              1
      CQ              2
      DM            396
      DO            108
      DS             76
      NL             64
      OL             71
      OT             49
      OX              1
      PE             20
      PF              7
      PH              7
      PM             12
      PP             26
      PU              2
      RM             60
      SA             20
      SS             15
1985  AB             14
      AH             14
      CB              1
      DM            667
      DO            225
      DS             98
      NL             45
      OL             94
      OT             62
      PE             51
      PF              1
      PH              3
      PM             17
      PP             35
      RM             72
      SA              3
      SS             14
1986  AB              5
      AH             19
      CB              4
      DM            406
      DO            115
      DS             88
      DX              1
      NL             60
      OL             43
      OT             25
      PE             26
      PF              2
      PH              1
      PM             34
      PP             26
      RM             53
      SA              1
      SH              1
      SS             12
1987  AB             35
      AH             41
      CB              5
      CM              1
      CV              1
      DM            469
      DO            213
      DS            104
      NL            128
      OL             79
      OT             43
      PC              5
      PE            142
      PF              4
      PM            133
      PP             45
      RM            189
      SA              1
      SH              5
      SS             20
      UP              2
      UR              3
1988  AB             39
      AH             26
      CB              6
      DM            365
      DO            144
      DS             54
      DX              1
      NL            102
      OL             51
      OT             86
      PE            190
      PF              2
      PG              2
      PH              1
      PM             54
      PP             53
      RF             11
      RM            211
      SA              1
      SH             31
      SS             20
      UL              1
      UP              3
      UR              1
      ZL              2
1989  AB             31
      AH             30
      BA              3
      CB              1
      CS              1
      DM            321
      DO            119
      DS             33
      NL             67
      OL             57
      OT            102
      OX              1
      PC              1
      PE            177
      PF             13
      PH              2
      PM             27
      PP             45
      RF             49
      RM            398
      SA              3
      SC              1
      SF              5
      SH             55
      SS             18
      UP              2
1990  AB             27
      AH             22
      BA             11
      DM            462
      DO            171
      DS             17
      NL             29
      OL             25
      OT             77
      PC              8
      PE             75
      PF             51
      PG              4
      PH              7
      PP             34
      RF             12
      RM            231
      SF             18
      SH              9
      SS              5
      UP              1
1991  AB             15
      AH             21
      AS              1
      BA             26
      CB              2
      DM            404
      DO            122
      DS             11
      DX              4
      NL             30
      OL             36
      OT            107
      PC              6
      PE             65
      PF             62
      PP             88
      RM            307
      RO              1
      SA              2
      SF              7
      SH              1
      SO             11
      SS              2
      UR              1
1992  AB             10
      AH             16
      BA              6
      CB              7
      DM            307
      DO             94
      DS             18
      DX             12
      NL             15
      OL             45
      OT             42
      PC              9
      PE             41
      PF             57
      PH              7
      PI              1
      PP            131
      RM            158
      SF              3
      SO             20
      SS              7
      UL              1
      UR              1
1993  AB              9
      AH             15
      AS              1
      CB              5
      CU              1
      DM            253
      DO             29
      DS             18
      DX              3
      NL             24
      OL             35
      OT             22
      OX              1
      PC              1
      PE             13
      PF             70
      PP            100
      RM            103
      SF              3
      SO              6
      SS              1
      ST              1
1994  AB              7
      AH             19
      CB              1
      DM            293
      DO             25
      DS              9
      DX              2
      NL             10
      OL             21
      OT             21
      PC              1
      PE              3
      PF             73
      PP             74
      PU              2
      RM             40
      SO              3
      SS              7
      SU              1
      UL              2
      UR              1
1995  AB              4
      AH             36
      CB              2
      DM            436
      DO             58
      DX              3
      NL              8
      OL             29
      OT             38
      PB              7
      PC              1
      PE             14
      PF            158
      PH              3
      PI              1
      PL              3
      PM              8
      PP            277
      PU              1
      RM             81
      RX              1
      SS              7
      SU              4
1996  AH             25
      DM            492
      DO            174
      DX              1
      NL              7
      OL             13
      OT            108
      PB             39
      PC              1
      PE             36
      PF            330
      PL              6
      PM             50
      PP            298
      PX              2
      RM             90
      SA              1
      SS             16
1997  AH             37
      CB              3
      DM            576
      DO            253
      DS              6
      NL             48
      OL              8
      OT            258
      PB            259
      PE             57
      PF            186
      PH              1
      PL             19
      PM            271
      PP            325
      PX              2
      RF              1
      RM            158
      RX              1
      SA              1
      SH              7
      SO              3
      SS             12
1998  AB              2
      AH             33
      CB              4
      CT              1
      DM            503
      DO            111
      DS              9
      DX              2
      NL             32
      OT            164
      PB            329
      PC              1
      PE             24
      PF             26
      PL              7
      PM            103
      PP            208
      PX              1
      RM             13
      SA              2
      SS             15
1999  AH             14
      DM            348
      DO             84
      DS              7
      DX              3
      NL             20
      OT            105
      PB            272
      PE              7
      PM             44
      PP            167
      RM             28
      SH              2
      SS              7
2000  AB              3
      AH             12
      DM            233
      DO             91
      DX              4
      NL             29
      OT            154
      PB            555
      PE             15
      PG              1
      PL              1
      PM              2
      PP            381
      PX              1
      RM             15
      SH              8
      SS              4
2001  AB              1
      AH             23
      CB              3
      DM            305
      DO             81
      DX              1
      NL             48
      OT            167
      PB            538
      PE             35
      PF             27
      PI              2
      PM              3
      PP            273
      RM             15
      SH             11
      SS              5
      UR              2
2002  AB              2
      AH             28
      DM            309
      DO            249
      DX              1
      NL             48
      OL              8
      OT            127
      PB            892
      PE             60
      PF             18
      PI              5
      PM              1
      PP            385
      RM             20
      RO              7
      SF              7
      SH             11
      SS              9
      UR              1
      US              4
Name: record_id, dtype: int64

What was the average weight of each species in each year?

PYTHON

# Mean weight by species and year
result = surveys.groupby(["year", "species_id"])["weight"].mean()

# Use the option_context context manager to show all rows
with pd.option_context("display.max_rows", None):
    print(result)

OUTPUT

year  species_id
1977  DM             41.141304
      DO             42.666667
      DS            121.437500
      NL                   NaN
      OL             21.666667
      OT             21.000000
      OX             21.333333
      PE             19.500000
      PF              7.173913
      PP             15.333333
      RM             10.000000
      SH                   NaN
      SS                   NaN
1978  AH                   NaN
      DM             40.795455
      DO             45.000000
      DS            115.198653
      NL            184.656250
      OL             31.027027
      OT             23.220000
      OX                   NaN
      PE             20.428571
      PF              7.088235
      PM             20.666667
      PP             14.869565
      RM              7.500000
      SA                   NaN
      SH             89.000000
      SS            130.000000
1979  DM             43.507317
      DO             45.862069
      DS            111.796954
      NL            138.000000
      OL             33.717391
      OT             23.075758
      PE             20.529412
      PF              7.529412
      PM             23.666667
      PP             15.105263
      RM              8.333333
      SS                   NaN
1980  AB                   NaN
      CB                   NaN
      CM                   NaN
      CQ                   NaN
      DM             44.136082
      DO             48.058824
      DS            120.725664
      DX                   NaN
      NL            158.982143
      OL             33.107143
      OT             24.083333
      PE             22.450000
      PF              7.455556
      PM             21.000000
      PP             14.176471
      RM             10.227273
      SS                   NaN
1981  AB                   NaN
      CB                   NaN
      CQ                   NaN
      DM             44.032907
      DO             49.111111
      DS            125.818444
      NL            165.903226
      OL             33.296296
      OT             24.415385
      PE             20.558140
      PF              7.152542
      PM             19.933333
      PP             13.950000
      RM             11.178571
      SA                   NaN
      SS             57.000000
1982  AB                   NaN
      AH                   NaN
      CQ                   NaN
      DM             41.975042
      DO             47.919643
      DS            115.647929
      NL            160.613208
      OL             35.590361
      OT             24.925532
      OX             24.000000
      PC                   NaN
      PE             21.173077
      PF              6.918919
      PM             21.391304
      PP             16.125000
      RF             11.500000
      RM             10.436242
      SA                   NaN
      SH             79.000000
      SS                   NaN
1983  AB                   NaN
      AH                   NaN
      CB                   NaN
      DM             40.944551
      DO             47.150602
      DS            122.033088
      DX                   NaN
      NL            156.691489
      OL             34.584615
      OT             24.697674
      PC                   NaN
      PE             20.864865
      PF              6.833333
      PG                   NaN
      PM             22.023810
      PP             15.468750
      RM              9.872000
      SA                   NaN
      SH                   NaN
      SS                   NaN
1984  AB                   NaN
      AH                   NaN
      CB                   NaN
      CQ                   NaN
      DM             40.765306
      DO             48.415094
      DS            124.082192
      NL            150.095238
      OL             32.550725
      OT             22.416667
      OX             18.000000
      PE             20.210526
      PF              7.428571
      PH             28.000000
      PM             19.545455
      PP             15.307692
      PU                   NaN
      RM             11.152542
      SA                   NaN
      SS                   NaN
1985  AB                   NaN
      AH                   NaN
      CB                   NaN
      DM             41.507645
      DO             47.956731
      DS            124.326316
      NL            148.829268
      OL             32.108696
      OT             22.064516
      PE             20.360000
      PF                   NaN
      PH             32.666667
      PM             19.000000
      PP             15.764706
      RM              8.371429
      SA                   NaN
      SS                   NaN
1986  AB                   NaN
      AH                   NaN
      CB                   NaN
      DM             43.361596
      DO             49.372727
      DS            128.071429
      DX                   NaN
      NL            159.724138
      OL             30.880952
      OT             21.500000
      PE             22.520000
      PF              8.000000
      PH             39.000000
      PM             21.696970
      PP             16.750000
      RM             10.750000
      SA                   NaN
      SH             55.000000
      SS                   NaN
1987  AB                   NaN
      AH                   NaN
      CB                   NaN
      CM                   NaN
      CV                   NaN
      DM             43.232609
      DO             50.087379
      DS            126.383838
      NL            158.840000
      OL             30.779221
      OT             23.069767
      PC                   NaN
      PE             21.625899
      PF              7.500000
      PM             22.297710
      PP             17.840909
      RM             10.657609
      SA                   NaN
      SH             73.800000
      SS                   NaN
      UP                   NaN
      UR                   NaN
1988  AB                   NaN
      AH                   NaN
      CB                   NaN
      DM             43.397790
      DO             51.463768
      DS            129.490566
      DX                   NaN
      NL            163.104167
      OL             30.306122
      OT             24.070588
      PE             22.625668
      PF              8.000000
      PG                   NaN
      PH             33.000000
      PM             21.759259
      PP             18.280000
      RF             13.818182
      RM             10.322115
      SA                   NaN
      SH             72.806452
      SS                   NaN
      UL                   NaN
      UP                   NaN
      UR                   NaN
      ZL                   NaN
1989  AB                   NaN
      AH                   NaN
      BA              7.000000
      CB                   NaN
      CS                   NaN
      DM             44.349206
      DO             51.025641
      DS            121.896552
      NL            151.278689
      OL             31.947368
      OT             24.663366
      OX             20.000000
      PC                   NaN
      PE             21.935673
      PF              7.230769
      PH             30.500000
      PM             20.222222
      PP             17.409091
      RF             13.346939
      RM             10.411311
      SA                   NaN
      SC                   NaN
      SF             54.800000
      SH             76.345455
      SS                   NaN
      UP                   NaN
1990  AB                   NaN
      AH                   NaN
      BA              8.000000
      DM             41.769912
      DO             48.512048
      DS            121.187500
      NL            154.275862
      OL             31.200000
      OT             23.675325
      PC                   NaN
      PE             21.671233
      PF              7.117647
      PG                   NaN
      PH             31.142857
      PP             16.029412
      RF             12.916667
      RM             10.305677
      SF             52.611111
      SH             76.888889
      SS                   NaN
      UP                   NaN
1991  AB                   NaN
      AH                   NaN
      AS                   NaN
      BA              9.240000
      CB                   NaN
      DM             43.148338
      DO             49.695652
      DS            113.000000
      DX                   NaN
      NL            148.785714
      OL             28.171429
      OT             24.588235
      PC                   NaN
      PE             21.435484
      PF              7.827586
      PP             17.904762
      RM             10.498305
      RO             11.000000
      SA                   NaN
      SF             93.166667
      SH             63.000000
      SO             53.909091
      SS                   NaN
      UR                   NaN
1992  AB                   NaN
      AH                   NaN
      BA              7.833333
      CB                   NaN
      DM             43.877966
      DO             48.367816
      DS            112.352941
      DX                   NaN
      NL            139.000000
      OL             28.454545
      OT             24.928571
      PC                   NaN
      PE             22.750000
      PF              8.767857
      PH             31.142857
      PI             18.000000
      PP             17.479675
      RM             10.904762
      SF             43.000000
      SO             55.263158
      SS                   NaN
      UL                   NaN
      UR                   NaN
1993  AB                   NaN
      AH                   NaN
      AS                   NaN
      CB                   NaN
      CU                   NaN
      DM             43.373984
      DO             48.461538
      DS            105.722222
      DX                   NaN
      NL            127.391304
      OL             27.545455
      OT             22.363636
      OX                   NaN
      PC                   NaN
      PE             19.230769
      PF              8.238806
      PP             17.954023
      RM             10.747475
      SF             44.000000
      SO             55.600000
      SS                   NaN
      ST                   NaN
1994  AB                   NaN
      AH                   NaN
      CB                   NaN
      DM             42.373288
      DO             47.750000
      DS            106.625000
      DX                   NaN
      NL            186.875000
      OL             21.900000
      OT             25.285714
      PC                   NaN
      PE             18.000000
      PF              7.855072
      PP             17.585714
      PU                   NaN
      RM             10.675000
      SO             62.333333
      SS                   NaN
      SU                   NaN
      UL                   NaN
      UR                   NaN
1995  AB                   NaN
      AH                   NaN
      CB                   NaN
      DM             44.806147
      DO             49.592593
      DX                   NaN
      NL            155.833333
      OL             27.296296
      OT             24.868421
      PB             34.000000
      PC                   NaN
      PE             21.230769
      PF              8.780645
      PH             35.333333
      PI                   NaN
      PL             22.333333
      PM             27.375000
      PP             16.844444
      PU                   NaN
      RM             12.653846
      RX             20.000000
      SS                   NaN
      SU                   NaN
1996  AH                   NaN
      DM             44.506173
      DO             47.234940
      DX                   NaN
      NL            148.000000
      OL             27.769231
      OT             23.453704
      PB             32.578947
      PC                   NaN
      PE             22.027778
      PF              8.393846
      PL             21.166667
      PM             20.224490
      PP             18.095563
      PX                   NaN
      RM             11.455556
      SA                   NaN
      SS                   NaN
1997  AH                   NaN
      CB                   NaN
      DM             44.708551
      DO             48.177866
      DS            111.000000
      NL            150.688889
      OL             33.625000
      OT             24.785156
      PB             31.085603
      PE             20.929825
      PF              8.448649
      PH             22.000000
      PL             18.789474
      PM             21.126394
      PP             18.175000
      PX             19.000000
      RF             20.000000
      RM             11.230769
      RX             11.000000
      SA                   NaN
      SH             50.857143
      SO             54.666667
      SS                   NaN
1998  AB                   NaN
      AH                   NaN
      CB                   NaN
      CT                   NaN
      DM             43.131403
      DO             49.731183
      DS            116.000000
      DX                   NaN
      NL            159.466667
      OT             24.675676
      PB             30.082237
      PC                   NaN
      PE             20.304348
      PF              8.720000
      PL             16.714286
      PM             20.591398
      PP             16.266990
      PX                   NaN
      RM             13.100000
      SA                   NaN
      SS                   NaN
1999  AH                   NaN
      DM             43.945402
      DO             48.012048
      DS            120.714286
      DX                   NaN
      NL            182.200000
      OT             25.723810
      PB             31.710037
      PE             25.428571
      PM             22.523810
      PP             16.521212
      RM             10.555556
      SH             54.000000
      SS                   NaN
2000  AB                   NaN
      AH                   NaN
      DM             43.126638
      DO             49.224719
      DX                   NaN
      NL            179.307692
      OT             25.303448
      PB             30.878899
      PE             21.615385
      PG                   NaN
      PL             21.000000
      PM             20.500000
      PP             16.788618
      PX                   NaN
      RM             11.400000
      SH             73.375000
      SS                   NaN
2001  AB                   NaN
      AH                   NaN
      CB                   NaN
      DM             45.442177
      DO             52.233766
      DX                   NaN
      NL            167.851064
      OT             23.707792
      PB             32.798851
      PE             20.400000
      PF              8.538462
      PI             18.000000
      PM             26.666667
      PP             17.752896
      RM             11.333333
      SH             79.900000
      SS                   NaN
      UR                   NaN
2002  AB                   NaN
      AH                   NaN
      DM             46.168317
      DO             49.514403
      DX                   NaN
      NL            182.159091
      OL             25.428571
      OT             23.833333
      PB             32.359447
      PE             21.719298
      PF              7.388889
      PI             20.000000
      PM             19.000000
      PP             17.018617
      RM             10.000000
      RO             10.142857
      SF             62.166667
      SH             64.666667
      SS                   NaN
      UR                   NaN
      US                   NaN
Name: weight, dtype: float64

How many individuals were counted for each species that was observed more than 10 times?

PYTHON

# Counts by species that appear more than 10 times
species = surveys.groupby("species_id")["record_id"].count()
species[species > 10]

OUTPUT

species_id
AB      303
AH      437
BA       46
CB       50
CM       13
CQ       16
DM    10596
DO     3027
DS     2504
DX       40
NL     1252
OL     1006
OT     2249
OX       12
PB     2891
PC       39
PE     1299
PF     1597
PH       32
PL       36
PM      899
PP     3123
RF       75
RM     2609
SA       75
SF       43
SH      147
SO       43
SS      248
Name: record_id, dtype: int64

Calculate the number of individuals observed and the average weight for each species in each year.

Note that weight contains NaNs, so counts for record_id and weight differ.

PYTHON

# Counts and mean weight by species and year
surveys.groupby(["year", "species_id"]).agg(
    {"record_id": "count", "weight": ["count", "mean"]}
)

OUTPUT

                record_id weight
                    count  count        mean
year species_id
1977 DM               264    184   41.141304
     DO                12     12   42.666667
     DS                98     32  121.437500
     NL                31      0         NaN
     OL                10      3   21.666667
...                   ...    ...         ...
2002 SF                 7      6   62.166667
     SH                11      9   64.666667
     SS                 9      0         NaN
     UR                 1      0         NaN
     US                 4      0         NaN

509 rows × 3 columns

Handling missing data


As we’ve discussed, some columns in the surveys dataframe have the value NaN instead of text or numbers. NaN, short for “not a number,” is a special type of float used by pandas to represent missing data. When reading from a CSV, pandas interprets certain values as NaN. NaNs are excluded from groups and most aggregation calculations in pandas. Below, we’ll see how they can also cause issues when plotting.

Defaults

See na_values in the pd.read_csv() documentation for the values that pandas interprets as NaN by default.

When analyzing a dataset, it is critical to understand how missing data is represented. Failing to do so may introduce errors into the analysis. The ecology dataset used in this lesson uses empty cells to represent missing data, but other disciplines have different conventions. For example, some geographic datasets use -9999 to represent null values. Failing to convert such values to NaN will introduce significant errors into any calculations performed on that dataset.
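As a minimal sketch of that conversion, the na_values parameter of pd.read_csv() tells pandas which extra values to treat as NaN when reading a file. The CSV content below is made up for illustration:

```python
import io
import pandas as pd

# Hypothetical CSV where -9999 marks a missing elevation
csv_data = io.StringIO("site,elevation\nA,120\nB,-9999\nC,340\n")

# Treat -9999 as NaN on read, so it is excluded from calculations
df = pd.read_csv(csv_data, na_values=[-9999])

print(df["elevation"].mean())  # 230.0, not skewed by -9999
```

Without na_values, the -9999 would be read as a regular number and pull the mean far below the true value.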

In some cases, it can be useful to fill in cells containing NaN with a non-null value. Recall that the groupby() method excludes NaN cells. When looking at sex, the following code counts only those cells with either F (female) or M (male):

PYTHON

surveys.groupby("sex")["record_id"].count()

OUTPUT

sex
F    15690
M    17348
Name: record_id, dtype: int64

But not all records specify a sex. To include records where sex was not specified, we can use the fillna() method on the sex column. This method replaces each NaN with the first argument passed to the function call. To replace all NaN values in sex with “U”, use:

PYTHON

surveys["sex"] = surveys["sex"].fillna("U")

The aggregation calculation now includes all records in the dataframe:

PYTHON

surveys.groupby("sex")["record_id"].count()

OUTPUT

sex
F    15690
M    17348
U     2511
Name: record_id, dtype: int64

In other cases, we may want to ignore rows that contain NaNs. This can be done using dropna():

PYTHON

surveys = surveys.dropna()

This method returns a copy of the dataframe containing only those rows that have non-NaN data in every field.
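When only certain columns matter, the subset parameter of dropna() drops rows with NaN in just those columns. A small sketch with made-up data:

```python
import pandas as pd

# Hypothetical miniature dataframe with NaNs in different columns
df = pd.DataFrame({
    "species_id": ["DM", "DO", None],
    "weight": [40.0, None, 35.0],
})

# Drop rows containing NaN in any column
print(len(df.dropna()))                   # 1

# Drop rows only where weight is NaN
print(len(df.dropna(subset=["weight"])))  # 2
```

This is often preferable to a blanket dropna(), which can silently discard rows that are complete for the columns you actually need.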

Visualizing groups and statistics


Plots can be a useful way to identify patterns, commonalities, and differences within a dataset. For example, they can use color, shape, and size to highlight features of interest. Plotly allows us to style the markers on a scatter plot in much the same way as we selected x and y variables to plot at the end of the last lesson. To color the markers on a scatter plot based on species_id, we can use the color keyword argument:

PYTHON

import plotly.express as px

px.scatter(surveys, x="weight", y="hindfoot_length", color="species_id")

OUTPUT

Now we have a much more colorful plot that shows more clearly how the sizes of individual species produce some of the patterns we noted earlier on the single-color plot. When hovering over a point, we can now quickly see the exact weight, hindfoot length, and species_id (although it remains unclear exactly what each ID means).

Scatter plots help us understand how parameters within a dataset covary. Other plots, like box and violin plots, can be used to show the distribution of a single parameter. Plotly can create a box plot using almost the same syntax that we used above to create the scatter plot. Let’s make one for hindfoot length by changing the method to px.box() and changing the x variable to species_id:

PYTHON

px.box(surveys, x="species_id", y="hindfoot_length", color="species_id")

OUTPUT

By default, this plot includes boxes (which show the distribution of each species) and points (for outliers). When we hover over any of the boxes on the plot above, the tooltip now shows aggregate statistics for the species, including the minimum, Q1, median, Q3, maximum, lower fence, and upper fence values for hindfoot length.

Aesthetically, these plots could still use some work. They list species in an arbitrary order, repeat colors, and include species with no associated data. We’ll return to this plot in lesson 6 to look at how we can address those issues.

Key Points

  • Calculate individual summary statistics using dataframe methods like mean(), max(), and min()
  • Calculate multiple summary statistics at once using the dataframe methods describe() and agg()
  • Group data by one or more columns using the groupby() method
  • Failing to consider how missing data is interpreted in a dataset can introduce significant errors
  • Box plots can be used to visualize the distribution of a single parameter

Content from Combining Dataframes


Last updated on 2025-07-03 | Edit this page

Overview

Questions

  • How do we combine data from multiple sources using pandas?
  • How do we add data to an existing dataframe?
  • How do we combine columns within a single dataframe?

Objectives

  • Use pd.merge() to add related columns to a dataframe
  • Use pd.concat() to add rows to a dataframe
  • Combine text columns
  • Combine year, month, and day columns into a datetime64 object

Dataframes can be used to organize and group data by common characteristics. Often, we need to combine elements from separate dataframes into one for analysis and visualization. Likewise, we may need to append data collected under different circumstances. In this chapter, we will show how to merge, concatenate, and combine data using pandas.

Merging dataframes


The survey dataframe we’ve been using throughout this lesson has a column called species_id. We used this column in the previous lesson to calculate summary statistics about observations of each species. But the species_id is just a two-letter code. Without knowing what each code represents, we can’t really understand, describe, or even properly label the observations. We need a way to go from the code to more detailed species information.

Managing repetitive data

Why store species data in a separate table in the first place? Species information is repetitive: every observation of the same species has the same genus, species, and taxa. Storing it in the original survey table would require including that data in every record, increasing both the size of the table and the likelihood of errors. Storing that data in a separate table means we only have to enter and validate it once. A tool like pandas then allows us to access the authoritative data when we need it.

We can enhance the species data in the dataframe using an authoritative dataset that contains the taxonomic information associated with each species. This data is stored in the species.csv file, which can be loaded using pd.read_csv():

PYTHON

import pandas as pd

surveys = pd.read_csv("data/surveys.csv")
species = pd.read_csv("data/species.csv")

species

OUTPUT

species_id genus species taxa
0 AB Amphispiza bilineata Bird
1 AH Ammospermophilus harrisi Rodent
2 AS Ammodramus savannarum Bird
3 BA Baiomys taylori Rodent
4 CB Campylorhynchus brunneicapillus Bird
5 CM Calamospiza melanocorys Bird
6 CQ Callipepla squamata Bird
7 CS Crotalus scutalatus Reptile
8 CT Cnemidophorus tigris Reptile
9 CU Cnemidophorus uniparens Reptile
10 CV Crotalus viridis Reptile
11 DM Dipodomys merriami Rodent
12 DO Dipodomys ordii Rodent
13 DS Dipodomys spectabilis Rodent
14 DX Dipodomys sp. Rodent
15 EO Eumeces obsoletus Reptile
16 GS Gambelia silus Reptile
17 NL Neotoma albigula Rodent
18 NX Neotoma sp. Rodent
19 OL Onychomys leucogaster Rodent
20 OT Onychomys torridus Rodent
21 OX Onychomys sp. Rodent
22 PB Chaetodipus baileyi Rodent
23 PC Pipilo chlorurus Bird
24 PE Peromyscus eremicus Rodent
25 PF Perognathus flavus Rodent
26 PG Pooecetes gramineus Bird
27 PH Perognathus hispidus Rodent
28 PI Chaetodipus intermedius Rodent
29 PL Peromyscus leucopus Rodent
30 PM Peromyscus maniculatus Rodent
31 PP Chaetodipus penicillatus Rodent
32 PU Pipilo fuscus Bird
33 PX Chaetodipus sp. Rodent
34 RF Reithrodontomys fulvescens Rodent
35 RM Reithrodontomys megalotis Rodent
36 RO Reithrodontomys montanus Rodent
37 RX Reithrodontomys sp. Rodent
38 SA Sylvilagus audubonii Rabbit
39 SB Spizella breweri Bird
40 SC Sceloporus clarki Reptile
41 SF Sigmodon fulviventer Rodent
42 SH Sigmodon hispidus Rodent
43 SO Sigmodon ochrognathus Rodent
44 SS Spermophilus spilosoma Rodent
45 ST Spermophilus tereticaudus Rodent
46 SU Sceloporus undulatus Reptile
47 SX Sigmodon sp. Rodent
48 UL Lizard sp. Reptile
49 UP Pipilo sp. Bird
50 UR Rodent sp. Rodent
51 US Sparrow sp. Bird
52 ZL Zonotrichia leucophrys Bird
53 ZM Zenaida macroura Bird

We can see that the species dataframe includes the genus, species, and taxon for each species_id. This is much more useful than the species_id included in the original dataframe—how can we integrate that data into our surveys dataframe? Adding it by hand would be tedious and error prone. Fortunately, pandas provides the merge() method (also available as the pd.merge() function) to join two dataframes. We’ll assign the new, merged dataframe to the variable merged:

PYTHON

merged = surveys.merge(species)
merged

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length weight genus species taxa
0 1 7 16 1977 2 NL M 32.0 NaN Neotoma albigula Rodent
1 2 7 16 1977 3 NL M 33.0 NaN Neotoma albigula Rodent
2 3 7 16 1977 2 DM F 37.0 NaN Dipodomys merriami Rodent
3 4 7 16 1977 7 DM M 36.0 NaN Dipodomys merriami Rodent
4 5 7 16 1977 3 DM M 35.0 NaN Dipodomys merriami Rodent
34781 35544 12 31 2002 15 US NaN NaN NaN Sparrow sp. Bird
34782 35545 12 31 2002 15 AH NaN NaN NaN Ammospermophilus harrisi Rodent
34783 35546 12 31 2002 15 AH NaN NaN NaN Ammospermophilus harrisi Rodent
34784 35547 12 31 2002 10 RM F 15.0 14.0 Reithrodontomys megalotis Rodent
34785 35548 12 31 2002 7 DO M 36.0 51.0 Dipodomys ordii Rodent

34786 rows × 12 columns

Following the merge, the genus, species, and taxa columns have all been added to the survey dataframe. We can now use those columns to filter and summarize our data.

Joins

The merge() method performs the same function as joins in SQL.

Challenge

We previously described the dataset as consisting of observations of rodents. That’s mostly true, but observations of other animals are also included. Based on the merged dataframe, what other types of animals are present in the dataframe? We are interested in the type of animal, not the Latin name.

PYTHON

merged[merged["taxa"] != "Rodent"]["taxa"].unique()

OUTPUT

array(['Rabbit', 'Bird', 'Reptile'], dtype=object)

In the example above, we didn’t provide any information about how we wanted to merge the dataframes together, so pandas made an educated guess by looking at the columns in each of the dataframes and merging them on the only column that appeared in both datasets, species_id. For more complex tables, we may need more control over the operation, for example, by specifying the exact columns we want to merge on. We can do that by passing one or more column names using the on keyword argument:

PYTHON

surveys.merge(species, on="species_id")

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length weight genus species taxa
0 1 7 16 1977 2 NL M 32.0 NaN Neotoma albigula Rodent
1 2 7 16 1977 3 NL M 33.0 NaN Neotoma albigula Rodent
2 3 7 16 1977 2 DM F 37.0 NaN Dipodomys merriami Rodent
3 4 7 16 1977 7 DM M 36.0 NaN Dipodomys merriami Rodent
4 5 7 16 1977 3 DM M 35.0 NaN Dipodomys merriami Rodent
34781 35544 12 31 2002 15 US NaN NaN NaN Sparrow sp. Bird
34782 35545 12 31 2002 15 AH NaN NaN NaN Ammospermophilus harrisi Rodent
34783 35546 12 31 2002 15 AH NaN NaN NaN Ammospermophilus harrisi Rodent
34784 35547 12 31 2002 10 RM F 15.0 14.0 Reithrodontomys megalotis Rodent
34785 35548 12 31 2002 7 DO M 36.0 51.0 Dipodomys ordii Rodent

34786 rows × 12 columns

It’s not necessary here, but we can also merge datasets where the common data is stored in columns with different names. To do so, we’d use the left_on and right_on keyword arguments to specify the column names in the first and second dataframes, respectively.

PYTHON

surveys.merge(species, left_on="species_id", right_on="species_id")

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length weight genus species taxa
0 1 7 16 1977 2 NL M 32.0 NaN Neotoma albigula Rodent
1 2 7 16 1977 3 NL M 33.0 NaN Neotoma albigula Rodent
2 3 7 16 1977 2 DM F 37.0 NaN Dipodomys merriami Rodent
3 4 7 16 1977 7 DM M 36.0 NaN Dipodomys merriami Rodent
4 5 7 16 1977 3 DM M 35.0 NaN Dipodomys merriami Rodent
34781 35544 12 31 2002 15 US NaN NaN NaN Sparrow sp. Bird
34782 35545 12 31 2002 15 AH NaN NaN NaN Ammospermophilus harrisi Rodent
34783 35546 12 31 2002 15 AH NaN NaN NaN Ammospermophilus harrisi Rodent
34784 35547 12 31 2002 10 RM F 15.0 14.0 Reithrodontomys megalotis Rodent
34785 35548 12 31 2002 7 DO M 36.0 51.0 Dipodomys ordii Rodent

34786 rows × 12 columns

Challenge

Compare the number of rows in the original and merged survey dataframes. How do they differ? Why might that be?

Hint: Use the pd.unique() function to look at the species_id column in each dataframe.

PYTHON

pd.unique(surveys["species_id"].sort_values())

OUTPUT

array(['AB', 'AH', 'AS', 'BA', 'CB', 'CM', 'CQ', 'CS', 'CT', 'CU', 'CV',
       'DM', 'DO', 'DS', 'DX', 'NL', 'OL', 'OT', 'OX', 'PB', 'PC', 'PE',
       'PF', 'PG', 'PH', 'PI', 'PL', 'PM', 'PP', 'PU', 'PX', 'RF', 'RM',
       'RO', 'RX', 'SA', 'SC', 'SF', 'SH', 'SO', 'SS', 'ST', 'SU', 'UL',
       'UP', 'UR', 'US', 'ZL', nan], dtype=object)

The merged dataframe omits rows with no value in the species_id column.

The number of rows in the merged dataframe is lower than in the original surveys dataframe. By default, merge() performs an inner join. This means that a row will only appear in the merged dataframe if the value(s) in the join column(s) appear in both dataframes. Here, observations that did not include a species_id or that included a species_id that was not defined in the species dataframe were dropped.

This is not always desirable behavior. Fortunately, pandas supports other kinds of merges:

  • Inner: Include all rows with common values in the join columns. This is the default behavior.
  • Left: Include all rows from the left dataframe. Columns from the right dataframe are populated if a common value exists and set to NaN if not.
  • Right: Include all rows from the right dataframe. Columns from the left dataframe are populated if a common value exists and set to NaN if not.
  • Outer: Include all rows from both dataframes
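A quick way to see how the join type affects the result is to compare row counts. This sketch uses two tiny, made-up dataframes (not the survey data) that share a key column:

```python
import pandas as pd

left = pd.DataFrame({"key": ["A", "B", "C"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["B", "C", "D"], "y": [10, 20, 30]})

# Inner keeps only keys present in both dataframes (B, C)
inner = left.merge(right, how="inner")

# Left keeps every row from the left dataframe (A, B, C)
left_join = left.merge(right, how="left")

# Outer keeps every key from both dataframes (A, B, C, D)
outer = left.merge(right, how="outer")

print(len(inner), len(left_join), len(outer))
```

Rows without a match in the other dataframe get NaN in the columns that came from that dataframe.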

We want to keep all of the observations, so let’s do a left join instead. To specify the type of merge, we use the how keyword argument:

PYTHON

surveys.merge(species, how="left")

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length weight genus species taxa
0 1 7 16 1977 2 NL M 32.0 NaN Neotoma albigula Rodent
1 2 7 16 1977 3 NL M 33.0 NaN Neotoma albigula Rodent
2 3 7 16 1977 2 DM F 37.0 NaN Dipodomys merriami Rodent
3 4 7 16 1977 7 DM M 36.0 NaN Dipodomys merriami Rodent
4 5 7 16 1977 3 DM M 35.0 NaN Dipodomys merriami Rodent
35544 35545 12 31 2002 15 AH NaN NaN NaN Ammospermophilus harrisi Rodent
35545 35546 12 31 2002 15 AH NaN NaN NaN Ammospermophilus harrisi Rodent
35546 35547 12 31 2002 10 RM F 15.0 14.0 Reithrodontomys megalotis Rodent
35547 35548 12 31 2002 7 DO M 36.0 51.0 Dipodomys ordii Rodent
35548 35549 12 31 2002 5 NaN NaN NaN NaN NaN NaN NaN

35549 rows × 12 columns

Now all 35,549 rows appear in the merged dataframe.
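When using a left join, it can be useful to check which rows failed to find a match. The merge() method accepts an indicator keyword argument that adds a _merge column recording where each row came from. A minimal sketch with hypothetical stand-ins for the survey and species tables:

```python
import pandas as pd

# Hypothetical mini versions of the survey and species tables
surveys = pd.DataFrame({"record_id": [1, 2, 3], "species_id": ["NL", "XX", None]})
species = pd.DataFrame({"species_id": ["NL"], "taxa": ["Rodent"]})

# indicator=True adds a _merge column showing where each row came from:
# "both" for matched rows, "left_only" for unmatched survey rows
merged = surveys.merge(species, how="left", indicator=True)
print(merged["_merge"].value_counts())
```

Here the row with the unknown code "XX" and the row with a missing species_id both show up as "left_only", which is exactly the set of rows an inner join would have dropped.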

Appending rows to a dataframe


Merges address the case where information for a set of observations is spread across multiple files. What about when the observations themselves are in more than one file? For a survey like the one we’ve been looking at in this lesson, we might get a new file once a year with the same columns but a completely new set of observations. How can we add those new observations to our dataframe?

We’ll simulate this operation by splitting data from two different years, 2001 and 2002, into separate dataframes. We can filter the dataset using conditionals, as we saw in lesson 3:

PYTHON

surveys_2001 = surveys[surveys["year"] == 2001].copy()
surveys_2001

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length weight
31710 31711 1 21 2001 1 PB F 26.0 25.0
31711 31712 1 21 2001 1 DM M 37.0 43.0
31712 31713 1 21 2001 1 PB M 29.0 44.0
31713 31714 1 21 2001 1 DO M 34.0 53.0
31714 31715 1 21 2001 2 OT M 20.0 27.0
33315 33316 12 16 2001 11 NaN NaN NaN NaN
33316 33317 12 16 2001 13 NaN NaN NaN NaN
33317 33318 12 16 2001 14 NaN NaN NaN NaN
33318 33319 12 16 2001 15 NaN NaN NaN NaN
33319 33320 12 16 2001 16 NaN NaN NaN NaN

1610 rows × 9 columns

PYTHON

surveys_2002 = surveys[surveys["year"] == 2002].copy()
surveys_2002

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length weight
33320 33321 1 12 2002 1 DM M 38.0 44.0
33321 33322 1 12 2002 1 DO M 37.0 58.0
33322 33323 1 12 2002 1 PB M 28.0 45.0
33323 33324 1 12 2002 1 AB NaN NaN NaN
33324 33325 1 12 2002 1 DO M 35.0 29.0
35544 35545 12 31 2002 15 AH NaN NaN NaN
35545 35546 12 31 2002 15 AH NaN NaN NaN
35546 35547 12 31 2002 10 RM F 15.0 14.0
35547 35548 12 31 2002 7 DO M 36.0 51.0
35548 35549 12 31 2002 5 NaN NaN NaN NaN

2229 rows × 9 columns

We now have two different dataframes with the same columns but different data, one with 1,610 rows, the other with 2,229 rows, for a total of 3,839 records. We can combine them into a new dataframe using pd.concat(), which stacks the dataframes vertically (that is, it adds records from one dataset to the end of another). This function accepts a list of dataframes and works from left to right (so the leftmost dataframe will be at the top of the new dataframe and the rightmost will be at the bottom). We’re only combining two dataframes here but could include more if necessary.

PYTHON

surveys_2001_2002 = pd.concat([surveys_2001, surveys_2002])
surveys_2001_2002

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length weight
31710 31711 1 21 2001 1 PB F 26.0 25.0
31711 31712 1 21 2001 1 DM M 37.0 43.0
31712 31713 1 21 2001 1 PB M 29.0 44.0
31713 31714 1 21 2001 1 DO M 34.0 53.0
31714 31715 1 21 2001 2 OT M 20.0 27.0
35544 35545 12 31 2002 15 AH NaN NaN NaN
35545 35546 12 31 2002 15 AH NaN NaN NaN
35546 35547 12 31 2002 10 RM F 15.0 14.0
35547 35548 12 31 2002 7 DO M 36.0 51.0
35548 35549 12 31 2002 5 NaN NaN NaN NaN

3839 rows × 9 columns

The combined dataframe includes all rows from both dataframes.
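Note that pd.concat() keeps each dataframe's original row labels by default, so labels can repeat in the combined dataframe. Passing ignore_index=True renumbers the rows from 0 instead. A small sketch with made-up dataframes:

```python
import pandas as pd

df1 = pd.DataFrame({"year": [2001, 2001]}, index=[0, 1])
df2 = pd.DataFrame({"year": [2002, 2002]}, index=[0, 1])

# By default, each dataframe keeps its original row labels,
# so the same label can appear more than once
stacked = pd.concat([df1, df2])
print(stacked.index.tolist())

# ignore_index=True renumbers the rows from 0
renumbered = pd.concat([df1, df2], ignore_index=True)
print(renumbered.index.tolist())
```

This is why the combined survey dataframe above still shows the row labels from the original surveys dataframe.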

In some cases, the exact columns may change from year to year even within the same project. For example, researchers may decide to add an additional column to track a new piece of data or to provide a quality check. If a column is present in only one dataset, you can still concatenate the datasets. Any column that does not appear in a given dataset will have the value NaN for those rows in the combined dataframe.

To illustrate this, we’ll add a column called test to the 2002 survey, then re-run pd.concat():

PYTHON

surveys_2002["test"] = True

surveys_2001_2002 = pd.concat([surveys_2001, surveys_2002])
surveys_2001_2002

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length weight test
31710 31711 1 21 2001 1 PB F 26.0 25.0 NaN
31711 31712 1 21 2001 1 DM M 37.0 43.0 NaN
31712 31713 1 21 2001 1 PB M 29.0 44.0 NaN
31713 31714 1 21 2001 1 DO M 34.0 53.0 NaN
31714 31715 1 21 2001 2 OT M 20.0 27.0 NaN
35544 35545 12 31 2002 15 AH NaN NaN NaN True
35545 35546 12 31 2002 15 AH NaN NaN NaN True
35546 35547 12 31 2002 10 RM F 15.0 14.0 True
35547 35548 12 31 2002 7 DO M 36.0 51.0 True
35548 35549 12 31 2002 5 NaN NaN NaN NaN True

3839 rows × 10 columns

As expected, the test column has a value of NaN for the 2001 data in the combined dataframe.

Combining data in multiple columns


Sometimes we want to combine values from two or more columns into a single column. For example, we might want to refer to the species in each record by both its genus and species names. In Python, we use the + operator to concatenate (or join) strings, and pandas works the same way:

PYTHON

species["genus_species"] = species["genus"] + " " + species["species"]
species["genus_species"]

OUTPUT

0                Amphispiza bilineata
1            Ammospermophilus harrisi
2               Ammodramus savannarum
3                     Baiomys taylori
4     Campylorhynchus brunneicapillus
5             Calamospiza melanocorys
6                 Callipepla squamata
7                 Crotalus scutalatus
8                Cnemidophorus tigris
9             Cnemidophorus uniparens
10                   Crotalus viridis
11                 Dipodomys merriami
12                    Dipodomys ordii
13              Dipodomys spectabilis
14                      Dipodomys sp.
15                  Eumeces obsoletus
16                     Gambelia silus
17                   Neotoma albigula
18                        Neotoma sp.
19              Onychomys leucogaster
20                 Onychomys torridus
21                      Onychomys sp.
22                Chaetodipus baileyi
23                   Pipilo chlorurus
24                Peromyscus eremicus
25                 Perognathus flavus
26                Pooecetes gramineus
27               Perognathus hispidus
28            Chaetodipus intermedius
29                Peromyscus leucopus
30             Peromyscus maniculatus
31           Chaetodipus penicillatus
32                      Pipilo fuscus
33                    Chaetodipus sp.
34         Reithrodontomys fulvescens
35          Reithrodontomys megalotis
36           Reithrodontomys montanus
37                Reithrodontomys sp.
38               Sylvilagus audubonii
39                   Spizella breweri
40                  Sceloporus clarki
41               Sigmodon fulviventer
42                  Sigmodon hispidus
43              Sigmodon ochrognathus
44             Spermophilus spilosoma
45          Spermophilus tereticaudus
46               Sceloporus undulatus
47                       Sigmodon sp.
48                         Lizard sp.
49                         Pipilo sp.
50                         Rodent sp.
51                        Sparrow sp.
52             Zonotrichia leucophrys
53                   Zenaida macroura
Name: genus_species, dtype: object

Note that the + operator can also be used to add numeric columns. In Python, the same operator can be used to perform different operations for different data types.
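A minimal sketch, using a small made-up dataframe, of how + behaves differently for string and numeric columns:

```python
import pandas as pd

df = pd.DataFrame({
    "genus": ["Neotoma", "Dipodomys"],
    "species": ["albigula", "merriami"],
    "weight": [159.0, 43.0],
})

# With string columns, + concatenates element by element
df["genus_species"] = df["genus"] + " " + df["species"]

# With numeric columns, the same operator adds element by element
df["weight_plus_one"] = df["weight"] + 1

print(df["genus_species"].tolist())
print(df["weight_plus_one"].tolist())
```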

Combining dates


Another common need is to join or split dates. In the ecology dataset, the date of each observation is split across year, month, and day columns. However, pandas has a special data type, datetime64, for representing dates that can be useful for analyzing time series data. To make use of that functionality, we can use the pd.to_datetime() function to create a datetime column:

PYTHON

surveys["date"] = pd.to_datetime(surveys[["year", "month", "day"]])
surveys["date"]

OUTPUT

0       1977-07-16
1       1977-07-16
2       1977-07-16
3       1977-07-16
4       1977-07-16
           ...
35544   2002-12-31
35545   2002-12-31
35546   2002-12-31
35547   2002-12-31
35548   2002-12-31
Name: date, Length: 35549, dtype: datetime64[ns]
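Once a column has the datetime64 type, the .dt accessor exposes the parts of each date, and the column supports comparisons for filtering. A short sketch using a made-up dataframe with the same year, month, and day columns:

```python
import pandas as pd

df = pd.DataFrame({"year": [1977, 2002], "month": [7, 12], "day": [16, 31]})
df["date"] = pd.to_datetime(df[["year", "month", "day"]])

# The .dt accessor exposes the components of each date
print(df["date"].dt.year.tolist())
print(df["date"].dt.month.tolist())

# datetime64 columns support comparisons, which makes date filtering easy
recent = df[df["date"] > "2000-01-01"]
print(len(recent))
```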

Challenge

Pandas allows us to ask specific questions about our data. A key skill is knowing how to translate our scientific questions into a sensible approach (and subsequently visualize and interpret our results).

Try using pandas to answer the following questions.

  1. How many specimens of each sex are there for each year, including those whose sex is unknown?
  2. What is the average weight of each taxa?
  3. What are the minimum, maximum and average weight for each species of Rodent?
  4. What is the average hindfoot length for male and female rodents of each species? Is there a male/female difference?
  5. What is the average weight of each rodent species over the course of the years? Is there any noticeable trend for any of the species?

How many specimens of each sex are there for each year, including those whose sex is unknown?

PYTHON

# Fill in NaN values in sex with U
surveys["sex"] = surveys["sex"].fillna("U")

# Count records by sex
result = surveys.groupby(["year", "sex"])["record_id"].count()

# Use the option_context context manager to show all rows
with pd.option_context("display.max_rows", None):
    print(result)

OUTPUT

year  sex
1977  F       204
      M       214
      U        85
1978  F       503
      M       433
      U       112
1979  F       327
      M       324
      U        68
1980  F       605
      M       727
      U        83
1981  F       631
      M       745
      U        96
1982  F       823
      M      1027
      U       128
1983  F       771
      M       797
      U       105
1984  F       445
      M       443
      U        93
1985  F       636
      M       716
      U        86
1986  F       414
      M       455
      U        73
1987  F       677
      M       862
      U       132
1988  F       604
      M       737
      U       128
1989  F       678
      M       780
      U       111
1990  F       581
      M       636
      U        94
1991  F       606
      M       637
      U       104
1992  F       443
      M       477
      U       118
1993  F       285
      M       380
      U        85
1994  F       243
      M       327
      U        98
1995  F       574
      M       534
      U       114
1996  F       725
      M       910
      U        71
1997  F      1071
      M      1357
      U        65
1998  F       717
      M       798
      U        95
1999  F       545
      M       530
      U        60
2000  F       690
      M       779
      U        83
2001  F       743
      M       744
      U       123
2002  F      1149
      M       979
      U       101
Name: record_id, dtype: int64
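The long, stacked result above can be hard to scan. One option, sketched here with a small made-up dataframe, is to pivot the inner index level into columns with unstack(), producing one row per year and one column per sex:

```python
import pandas as pd

# Hypothetical mini survey with year and sex columns
surveys = pd.DataFrame({
    "record_id": [1, 2, 3, 4, 5],
    "year": [1977, 1977, 1977, 1978, 1978],
    "sex": ["F", "M", "U", "F", "F"],
})

# unstack() pivots the inner index level (sex) into columns;
# fill_value=0 replaces missing year/sex combinations with zero
counts = surveys.groupby(["year", "sex"])["record_id"].count().unstack(fill_value=0)
print(counts)
```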

What is the average weight of each taxa?

PYTHON

# Create the merged dataframe
merged = pd.merge(surveys, species, how="left")

# Group by taxa
grouped = merged.groupby("taxa")

# Calculate the min, max, and mean weight
grouped["weight"].mean()

OUTPUT

taxa
Bird             NaN
Rabbit           NaN
Reptile          NaN
Rodent     42.672428
Name: weight, dtype: float64

What are the minimum, maximum and average weight for each species of Rodent?

PYTHON

# Create the merged dataframe
merged = pd.merge(surveys, species, how="left")

# Limit merged dataframe to rodents
rodents = merged[merged["taxa"] == "Rodent"]

# Group rodents by species
grouped = rodents.groupby("species_id")

# Calculate the min, max, and mean weight
grouped.agg({"weight": ["min", "max", "mean"]})

OUTPUT

weight
min max mean
species_id
AH NaN NaN NaN
BA 6.0 18.0 8.600000
DM 10.0 66.0 43.157864
DO 12.0 76.0 48.870523
DS 12.0 190.0 120.130546
DX NaN NaN NaN
NL 30.0 280.0 159.245660
OL 10.0 56.0 31.575258
OT 5.0 46.0 24.230556
OX 18.0 24.0 21.000000
PB 12.0 55.0 31.735943
PE 8.0 40.0 21.586508
PF 4.0 25.0 7.923127
PH 18.0 48.0 31.064516
PI 17.0 21.0 19.250000
PL 8.0 27.0 19.138889
PM 7.0 49.0 21.364155
PP 4.0 74.0 17.173942
PX 18.0 20.0 19.000000
RF 9.0 20.0 13.386667
RM 4.0 29.0 10.585010
RO 8.0 13.0 10.250000
RX 11.0 20.0 15.500000
SF 24.0 199.0 58.878049
SH 16.0 140.0 73.148936
SO 15.0 105.0 55.414634
SS 57.0 130.0 93.500000
ST NaN NaN NaN
UR NaN NaN NaN

What is the average hindfoot length for male and female rodents of each species? Is there a male/female difference?

PYTHON

# Create the merged dataframe
merged = pd.merge(surveys, species, how="left")

# Limit merged dataframe to rodents
rodents = merged[merged["taxa"] == "Rodent"]

# Group rodents by species and sex
grouped = rodents.groupby(["species_id", "sex"])

# Calculate the mean hindfoot length, plus count and standard deviation
# to better assess the question
with pd.option_context("display.max_rows", None):
    print(grouped["hindfoot_length"].agg(["count", "mean", "std"]))

OUTPUT

                count       mean        std
species_id sex
AH         M        0        NaN        NaN
           U        2  33.000000   2.828427
BA         F       31  13.161290   1.529636
           M       14  12.642857   2.097880
           U        0        NaN        NaN
DM         F     4302  35.712692   1.433067
           M     5658  36.188229   1.455396
           U       12  35.583333   0.996205
DO         F     1244  35.486334   1.726414
           M     1640  35.698780   1.611656
           U        3  36.000000   0.000000
DS         F     1046  49.583174   1.973573
           M     1083  50.301939   2.128120
           U        3  50.000000   1.732051
DX         U        0        NaN        NaN
NL         F      620  32.024194   1.669181
           M      452  32.577434   1.903603
           U        2  52.000000  25.455844
OL         F      438  20.287671   1.101022
           M      480  20.758333   1.651195
           U        2  20.000000   0.000000
OT         F     1019  20.281649   1.519412
           M     1115  20.257399   1.087354
           U        5  19.600000   1.140175
OX         F        3  19.666667   1.154701
           M        5  18.800000   3.346640
           U        0        NaN        NaN
PB         F     1645  25.882675   0.928842
           M     1213  26.431987   1.444130
           U        6  26.166667   0.983192
PE         F      544  20.272059   1.156084
           M      666  20.132132   1.177721
           U        2  20.500000   0.707107
PF         F      721  15.550624   1.381627
           M      770  15.620779   1.142208
           U        2  13.000000   4.242641
PH         F       20  25.900000   1.293709
           M       11  25.545455   1.752920
PI         M        8  22.500000   0.534522
           U        1  20.000000        NaN
PL         F       16  19.750000   2.720294
           M       19  20.263158   0.933459
           U        1  20.000000        NaN
PM         F      361  20.279778   1.162782
           M      483  20.530021   1.275621
           U        3  21.333333   4.041452
PP         F     1574  21.676620   1.316529
           M     1444  21.838643   1.201006
           U        9  20.888889   2.027588
PX         F        1  19.000000        NaN
           M        1  20.000000        NaN
           U        0        NaN        NaN
RF         F       55  17.527273   0.813191
           M       18  17.500000   0.985184
RM         F     1133  16.391880   1.123946
           M     1296  16.491512   1.134202
           U       13  15.846154   2.444250
RO         F        4  14.750000   0.957427
           M        4  16.000000   1.154701
RX         M        2  18.500000   2.121320
SF         F       16  27.500000   2.683282
           M       23  26.347826   3.524056
           U        2  24.500000   0.707107
SH         F       71  29.028169   2.298893
           M       60  27.983333   2.801281
           U        0        NaN        NaN
SO         F       30  25.633333   4.139660
           M       11  25.727273   1.954017
SS         F        0        NaN        NaN
           M        0        NaN        NaN
           U        0        NaN        NaN
ST         U        0        NaN        NaN
UR         U        0        NaN        NaN

What is the average weight of each rodent species over the course of the years? Is there any noticeable trend for any of the species?

PYTHON

# Create the merged dataframe
merged = pd.merge(surveys, species, how="left")

# Limit merged dataframe to rodents
rodents = merged[merged["taxa"] == "Rodent"]

# Group rodents by species and year
grouped = rodents.groupby(["species_id", "year"])

# Calculate the mean weight by year
result = grouped["weight"].mean()

# Use the option_context context manager to show all rows
with pd.option_context("display.max_rows", None):
    print(result)

OUTPUT

species_id  year
AH          1978           NaN
            1982           NaN
            1983           NaN
            1984           NaN
            1985           NaN
            1986           NaN
            1987           NaN
            1988           NaN
            1989           NaN
            1990           NaN
            1991           NaN
            1992           NaN
            1993           NaN
            1994           NaN
            1995           NaN
            1996           NaN
            1997           NaN
            1998           NaN
            1999           NaN
            2000           NaN
            2001           NaN
            2002           NaN
BA          1989      7.000000
            1990      8.000000
            1991      9.240000
            1992      7.833333
DM          1977     41.141304
            1978     40.795455
            1979     43.507317
            1980     44.136082
            1981     44.032907
            1982     41.975042
            1983     40.944551
            1984     40.765306
            1985     41.507645
            1986     43.361596
            1987     43.232609
            1988     43.397790
            1989     44.349206
            1990     41.769912
            1991     43.148338
            1992     43.877966
            1993     43.373984
            1994     42.373288
            1995     44.806147
            1996     44.506173
            1997     44.708551
            1998     43.131403
            1999     43.945402
            2000     43.126638
            2001     45.442177
            2002     46.168317
DO          1977     42.666667
            1978     45.000000
            1979     45.862069
            1980     48.058824
            1981     49.111111
            1982     47.919643
            1983     47.150602
            1984     48.415094
            1985     47.956731
            1986     49.372727
            1987     50.087379
            1988     51.463768
            1989     51.025641
            1990     48.512048
            1991     49.695652
            1992     48.367816
            1993     48.461538
            1994     47.750000
            1995     49.592593
            1996     47.234940
            1997     48.177866
            1998     49.731183
            1999     48.012048
            2000     49.224719
            2001     52.233766
            2002     49.514403
DS          1977    121.437500
            1978    115.198653
            1979    111.796954
            1980    120.725664
            1981    125.818444
            1982    115.647929
            1983    122.033088
            1984    124.082192
            1985    124.326316
            1986    128.071429
            1987    126.383838
            1988    129.490566
            1989    121.896552
            1990    121.187500
            1991    113.000000
            1992    112.352941
            1993    105.722222
            1994    106.625000
            1997    111.000000
            1998    116.000000
            1999    120.714286
DX          1980           NaN
            1983           NaN
            1986           NaN
            1988           NaN
            1991           NaN
            1992           NaN
            1993           NaN
            1994           NaN
            1995           NaN
            1996           NaN
            1998           NaN
            1999           NaN
            2000           NaN
            2001           NaN
            2002           NaN
NL          1977           NaN
            1978    184.656250
            1979    138.000000
            1980    158.982143
            1981    165.903226
            1982    160.613208
            1983    156.691489
            1984    150.095238
            1985    148.829268
            1986    159.724138
            1987    158.840000
            1988    163.104167
            1989    151.278689
            1990    154.275862
            1991    148.785714
            1992    139.000000
            1993    127.391304
            1994    186.875000
            1995    155.833333
            1996    148.000000
            1997    150.688889
            1998    159.466667
            1999    182.200000
            2000    179.307692
            2001    167.851064
            2002    182.159091
OL          1977     21.666667
            1978     31.027027
            1979     33.717391
            1980     33.107143
            1981     33.296296
            1982     35.590361
            1983     34.584615
            1984     32.550725
            1985     32.108696
            1986     30.880952
            1987     30.779221
            1988     30.306122
            1989     31.947368
            1990     31.200000
            1991     28.171429
            1992     28.454545
            1993     27.545455
            1994     21.900000
            1995     27.296296
            1996     27.769231
            1997     33.625000
            2002     25.428571
OT          1977     21.000000
            1978     23.220000
            1979     23.075758
            1980     24.083333
            1981     24.415385
            1982     24.925532
            1983     24.697674
            1984     22.416667
            1985     22.064516
            1986     21.500000
            1987     23.069767
            1988     24.070588
            1989     24.663366
            1990     23.675325
            1991     24.588235
            1992     24.928571
            1993     22.363636
            1994     25.285714
            1995     24.868421
            1996     23.453704
            1997     24.785156
            1998     24.675676
            1999     25.723810
            2000     25.303448
            2001     23.707792
            2002     23.833333
OX          1977     21.333333
            1978           NaN
            1982     24.000000
            1984     18.000000
            1989     20.000000
            1993           NaN
PB          1995     34.000000
            1996     32.578947
            1997     31.085603
            1998     30.082237
            1999     31.710037
            2000     30.878899
            2001     32.798851
            2002     32.359447
PE          1977     19.500000
            1978     20.428571
            1979     20.529412
            1980     22.450000
            1981     20.558140
            1982     21.173077
            1983     20.864865
            1984     20.210526
            1985     20.360000
            1986     22.520000
            1987     21.625899
            1988     22.625668
            1989     21.935673
            1990     21.671233
            1991     21.435484
            1992     22.750000
            1993     19.230769
            1994     18.000000
            1995     21.230769
            1996     22.027778
            1997     20.929825
            1998     20.304348
            1999     25.428571
            2000     21.615385
            2001     20.400000
            2002     21.719298
PF          1977      7.173913
            1978      7.088235
            1979      7.529412
            1980      7.455556
            1981      7.152542
            1982      6.918919
            1983      6.833333
            1984      7.428571
            1985           NaN
            1986      8.000000
            1987      7.500000
            1988      8.000000
            1989      7.230769
            1990      7.117647
            1991      7.827586
            1992      8.767857
            1993      8.238806
            1994      7.855072
            1995      8.780645
            1996      8.393846
            1997      8.448649
            1998      8.720000
            2001      8.538462
            2002      7.388889
PH          1984     28.000000
            1985     32.666667
            1986     39.000000
            1988     33.000000
            1989     30.500000
            1990     31.142857
            1992     31.142857
            1995     35.333333
            1997     22.000000
PI          1992     18.000000
            1995           NaN
            2001     18.000000
            2002     20.000000
PL          1995     22.333333
            1996     21.166667
            1997     18.789474
            1998     16.714286
            2000     21.000000
PM          1978     20.666667
            1979     23.666667
            1980     21.000000
            1981     19.933333
            1982     21.391304
            1983     22.023810
            1984     19.545455
            1985     19.000000
            1986     21.696970
            1987     22.297710
            1988     21.759259
            1989     20.222222
            1995     27.375000
            1996     20.224490
            1997     21.126394
            1998     20.591398
            1999     22.523810
            2000     20.500000
            2001     26.666667
            2002     19.000000
PP          1977     15.333333
            1978     14.869565
            1979     15.105263
            1980     14.176471
            1981     13.950000
            1982     16.125000
            1983     15.468750
            1984     15.307692
            1985     15.764706
            1986     16.750000
            1987     17.840909
            1988     18.280000
            1989     17.409091
            1990     16.029412
            1991     17.904762
            1992     17.479675
            1993     17.954023
            1994     17.585714
            1995     16.844444
            1996     18.095563
            1997     18.175000
            1998     16.266990
            1999     16.521212
            2000     16.788618
            2001     17.752896
            2002     17.018617
PX          1996           NaN
            1997     19.000000
            1998           NaN
            2000           NaN
RF          1982     11.500000
            1988     13.818182
            1989     13.346939
            1990     12.916667
            1997     20.000000
RM          1977     10.000000
            1978      7.500000
            1979      8.333333
            1980     10.227273
            1981     11.178571
            1982     10.436242
            1983      9.872000
            1984     11.152542
            1985      8.371429
            1986     10.750000
            1987     10.657609
            1988     10.322115
            1989     10.411311
            1990     10.305677
            1991     10.498305
            1992     10.904762
            1993     10.747475
            1994     10.675000
            1995     12.653846
            1996     11.455556
            1997     11.230769
            1998     13.100000
            1999     10.555556
            2000     11.400000
            2001     11.333333
            2002     10.000000
RO          1991     11.000000
            2002     10.142857
RX          1995     20.000000
            1997     11.000000
SF          1989     54.800000
            1990     52.611111
            1991     93.166667
            1992     43.000000
            1993     44.000000
            2002     62.166667
SH          1977           NaN
            1978     89.000000
            1982     79.000000
            1983           NaN
            1986     55.000000
            1987     73.800000
            1988     72.806452
            1989     76.345455
            1990     76.888889
            1991     63.000000
            1997     50.857143
            1999     54.000000
            2000     73.375000
            2001     79.900000
            2002     64.666667
SO          1991     53.909091
            1992     55.263158
            1993     55.600000
            1994     62.333333
            1997     54.666667
SS          1977           NaN
            1978    130.000000
            1979           NaN
            1980           NaN
            1981     57.000000
            1982           NaN
            1983           NaN
            1984           NaN
            1985           NaN
            1986           NaN
            1987           NaN
            1988           NaN
            1989           NaN
            1990           NaN
            1991           NaN
            1992           NaN
            1993           NaN
            1994           NaN
            1995           NaN
            1996           NaN
            1997           NaN
            1998           NaN
            1999           NaN
            2000           NaN
            2001           NaN
            2002           NaN
ST          1993           NaN
UR          1987           NaN
            1988           NaN
            1991           NaN
            1992           NaN
            1994           NaN
            2001           NaN
            2002           NaN
Name: weight, dtype: float64

Key Points

  • Combine two dataframes on one or more common values using pd.merge()
  • Append rows from one dataframe to another using pd.concat()
  • Combine multiple text columns into one using the + operator
  • Convert date info to datetime objects using pd.to_datetime()
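As a compact recap of these key points, here is a sketch on a toy dataset (the column names and values are illustrative, not from the Portal data):

```python
import pandas as pd

# Two small example dataframes sharing a species_id column
obs = pd.DataFrame({"species_id": ["NL", "DM"], "weight": [150, 40]})
taxa = pd.DataFrame({"species_id": ["NL", "DM"], "genus": ["Neotoma", "Dipodomys"]})

# Combine two dataframes on a common column with pd.merge()
merged = pd.merge(obs, taxa, on="species_id")

# Append rows from one dataframe to another with pd.concat()
stacked = pd.concat([obs, obs], ignore_index=True)

# Combine text columns into one with the + operator
taxa["label"] = taxa["species_id"] + ": " + taxa["genus"]

# Convert date info to datetime objects with pd.to_datetime()
dates = pd.to_datetime(pd.DataFrame({"year": [1977], "month": [7], "day": [16]}))
```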

Content from Visualizing Data


Last updated on 2025-07-15 | Edit this page

Overview

Questions

  • How can we create different types of plots using Python?
  • How can we style plots?
  • How can we add descriptive titles to plots and axes?

Objectives

  • Review processes for reading, modifying, and combining dataframes
  • Make and customize scatter, box, and bar plots

We’ll begin by loading and preparing the data we’d like to plot. This will include operations introduced in previous lessons, including reading CSVs into dataframes, merging dataframes, sorting a dataframe, and removing records that include null values. We’ll begin by importing pandas:

PYTHON

import pandas as pd

Next we’ll load the surveys dataset using pd.read_csv():

PYTHON

surveys = pd.read_csv("data/surveys.csv")

Now we want to take a quick look at the surveys dataset. Since we’re going to be plotting data, we need to consider how we want to handle any null values in the dataset. The info() method provides an overview, including counts of non-null values, that we can use to assess the dataset.

PYTHON

surveys.info()

OUTPUT

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35549 entries, 0 to 35548
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   record_id        35549 non-null  int64
 1   month            35549 non-null  int64
 2   day              35549 non-null  int64
 3   year             35549 non-null  int64
 4   plot_id          35549 non-null  int64
 5   species_id       34786 non-null  object
 6   sex              33038 non-null  object
 7   hindfoot_length  31438 non-null  float64
 8   weight           32283 non-null  float64
dtypes: float64(2), int64(5), object(2)
memory usage: 2.4+ MB

There are 35,549 records in the table. Four columns—species_id, sex, hindfoot_length, and weight—include null values, that is, they contain fewer non-null values than there are rows in the dataframe. We can use fillna() to replace null values where it makes sense to do so. For example, some records do not specify a sex. We can fill those values in with the letter U (for unknown):

PYTHON

surveys["sex"] = surveys["sex"].fillna("U")

The other three columns that contain null values are all required for the plots we will create below. This means that we can use dropna() to drop all rows where any column is null:

PYTHON

surveys = surveys.dropna()

Now we’ll merge the main surveys dataframe with two other datasets containing additional information:

  • species.csv provides the genus and species corresponding to species_id
  • plots.csv provides the plot type corresponding to plot_id

We will read each CSV and merge it into the main dataframe:

PYTHON

species = pd.read_csv("data/species.csv")
plots = pd.read_csv("data/plots.csv")

surveys = surveys.merge(species, how="left").merge(plots, how="left")

Chaining

The previous cell performs two merges in the same line of code. Performing multiple operations on the same object in a single line of code is called chaining.
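A minimal sketch of chaining on a small, made-up dataframe (the column names here are illustrative), showing the same operations written step by step and then as a single chained expression:

```python
import pandas as pd

df = pd.DataFrame({"plot_id": [2, 1, 2], "weight": [40.0, 150.0, 38.0]})

# Without chaining: each step is assigned to an intermediate variable
grouped = df.groupby("plot_id")
means = grouped["weight"].mean()
result = means.reset_index()

# With chaining: the same three operations in one expression
chained = df.groupby("plot_id")["weight"].mean().reset_index()
```

Both produce the same result; chaining simply skips the intermediate variables.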

We now have a dataframe that includes all observations from the Portal dataset that specify a species, weight, and hindfoot length, as well as descriptive metadata about each species and plot type. The taxa column contains the general type of animal. If we look at the unique values in this column–

PYTHON

surveys["taxa"].unique()

OUTPUT

array(['Rodent'], dtype=object)

–we can see that all remaining observations are of rodents. In honor of this, we will assign our dataframe to a new variable:

PYTHON

rodents = surveys
rodents.sample(5)

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length weight genus species taxa plot_type
12369 15232 12 14 1988 11 PE M 20.0 22.0 Peromyscus eremicus Rodent Control
30117 34885 10 6 2002 4 PB F 27.0 25.0 Chaetodipus baileyi Rodent Control
2723 3815 1 31 1981 13 OL F 21.0 35.0 Onychomys leucogaster Rodent Short-term Krat Exclosure
22412 26462 7 9 1997 19 PP M 21.0 14.0 Chaetodipus penicillatus Rodent Long-term Krat Exclosure
13991 17109 1 30 1990 4 DM M 37.0 49.0 Dipodomys merriami Rodent Control

Finally, we’ll save the rodents dataframe to a file that we can load directly in the future if needed:

PYTHON

rodents.to_csv("data/rodents.csv", index=False)

Now we’re ready to plot.

Re-introducing plotly


We’ve already worked with plotly a little in previous lessons, but we haven’t provided a comprehensive introduction. Plotly is a data visualization package for Python that allows us to create customizable, interactive plots of a variety of different types. Plotly makes plots that are:

  • Customizable. Allows the appearance of plots to be extensively modified.
  • Interactive. Pan and zoom across plots, or hover over elements to get additional information about them.
  • Flexible. Many different plot types can be created, often with only a few lines of code. Because plotly uses similar syntax for each plot type, it is also easy to quickly change plot types to get a different perspective on a dataset.
  • Embeddable. Interactive plots can be embedded on websites using plotly’s JavaScript library.

Other plotting libraries

The R community has largely coalesced around ggplot2 for plotting. In contrast, the Python community has no clear consensus pick and makes use of a number of data visualization packages, including matplotlib, seaborn, and bokeh.

The functionality of these packages overlaps to a large degree, and which one to use depends in large part on personal preference.

Plotly has two main ways of making plots:

  • plotly.express provides a simplified interface for quickly building and customizing plots
  • plotly.graph_objects uses a more complex interface to provide more granular control over the contents of a plot

We will use plotly.express in this lesson.

We’ll begin by reproducing the scatter plot from the end of lesson 3, which plotted weight on the x axis and hindfoot length on the y axis. Like pandas, the developers of plotly.express have a preferred alias, px, which we will use when we import plotly.express:

PYTHON

import plotly.express as px

px.scatter(rodents, x="weight", y="hindfoot_length")

OUTPUT

Before diving into the content of this plot, let’s take a quick look at the interactive elements plotly makes available. When we hover over a plotly plot, a toolbar appears in the upper right corner. Each icon on the toolbar is a widget that allows us to interact with the plot. By default, the toolbar includes the following widgets:

  • The camera allows us to save the current view as a PNG file
  • The next four widgets are toggles that control how click-and-drag affects the plot. Only one can be active at a time.
    • The magnifying glass enables drawing a box to zoom
    • The crossing arrows enable panning
    • The dotted box enables drawing a box to select data
    • The dotted lasso enables drawing an arbitrary shape to select data
  • The plus box allows us to zoom in
  • The minus box allows us to zoom out
  • The outward-pointing arrows autoscale the plot to show all data
  • The house resets the plot to the original view

When we hover over a data point, we get a tooltip that shows information about that point.

Now let’s think about the plot itself.

Challenge

What are some limitations to the plot above? Think about how the data itself is presented as well as the general appearance of the plot.

Some issues with the plot include the following:

  • All points are the same color
  • Large number of overlapping points, making it difficult to understand how data is distributed
  • Axis labels include underscores and lack units
  • No plot title

Any others?

One issue with the plot is that many of the points in the dataframe overlap, making it difficult to get a feel for how the data is distributed. Does it cluster in places? Is it evenly distributed? We really can’t tell.

We can mitigate this issue in part by making the points semitransparent using the opacity keyword argument. We’ll use a value of 0.2, which means the points are mostly transparent.

PYTHON

px.scatter(rodents, x="weight", y="hindfoot_length", opacity=0.2)

OUTPUT

With the points now partially transparent, the places where they overlap are more obvious, and we can see several clusters. To get a better sense of what those clusters might be, let’s add some color to the plot.

At the end of lesson 4, we colored this scatter plot based on the species_id column. This revealed that several of the clusters reflected the species of rodent observed. But the use of species_id was problematic in other ways. For one, there are so many different species that colors ended up being re-used. For another, the species codes are not legible to people who are not intimately familiar with this dataset.

With those concerns in mind, let’s use genus for color instead. Genus information wasn’t available in lesson 4, but we can use it now that we’ve merged the surveys and species dataframes.

PYTHON

px.scatter(rodents, x="weight", y="hindfoot_length", color="genus", opacity=0.2)

OUTPUT

Choosing colors


One concern when making plots is to make sure they are legible to as broad an audience as possible. Clear, informative labels are one way to do this. Color is another.

Plotly makes a number of color palettes available via its color module. Because we are working with categorical data, we will use a qualitative palette, which consists of a list of discrete colors. Other palette types are also available. For example, plots showing a range of values might benefit from a sequential or diverging color scheme, which uses a continuous range of colors (for example, blue to red for a heat map).

Qualitative palettes are available under px.colors.qualitative. We can view the available palettes using the swatches() function:

PYTHON

px.colors.qualitative.swatches()

OUTPUT

In the spirit of effective communication with a wide audience, we will use px.colors.qualitative.Safe, a colorblind-safe palette. Because we will be using the same palette for the rest of the lesson, we will store the palette as a variable:

PYTHON

colors = px.colors.qualitative.Safe

In addition to simplifying the code for future plots a bit, storing the palette as a variable also allows us to quickly change the color scheme for all our plots at once if needed.

Let’s apply the safe colors to our scatter plot:

PYTHON

px.scatter(
    rodents,
    x="weight",
    y="hindfoot_length",
    color="genus",
    opacity=0.2,
    color_discrete_sequence=colors,
)

OUTPUT

Sorting data


Take a look at the legend of the plot. The genera from the dataset are all listed, but they are in no apparent order. This makes it difficult for anyone looking at the plot to quickly pick out a given genus. We can alphabetize the legend to make it more readable. To do so, we can use the category_orders keyword argument.

This argument requires a dict. Recall that a dict is a mapping of keys to values defined using curly braces. The dict passed to category_orders maps a column name from the dataframe to a list of values in the preferred order.

One approach to creating an alphabetical list would be to simply build the list manually. That would work well enough here, where we have only a handful of values, but it quickly becomes unwieldy for larger datasets. Instead, we will use the built-in sorted() function to create a sorted list of the values in the genus column. Because we will be using the same order in all of the following plots, we will store the dict with the sorted values in a variable that we can refer to whenever we need it and change if needed.

Challenge

Create the dict needed for category_orders. The dict should map the column name, genus, to an ordered list of unique values from that column.

PYTHON

vals = rodents["genus"]
unique = vals.unique()
ordered = sorted(unique)
cat_order = {"genus": ordered}

In addition to the approach used in the challenge, we can create the dict we need with a single line of code:

PYTHON

cat_order = {"genus": sorted(rodents["genus"].unique())}

That variable can now be passed to the category_orders keyword argument, producing a new version of the plot with an alphabetical legend.

PYTHON

px.scatter(
    rodents,
    x="weight",
    y="hindfoot_length",
    color="genus",
    opacity=0.2,
    color_discrete_sequence=colors,
    category_orders=cat_order,
)

OUTPUT

Note that the colors in the plot have also changed. Colors are assigned based on the same category order used by the legend.

Adding plot and axis titles


By default, plotly uses the column names from the dataframe to label the axes of a plot. Here, the axis labels are adequate but would benefit from removing the underscore and including units. We can use the labels keyword argument to assign human-readable labels to our plot.

We will turn again to a dict for this, which we will use to map the underlying column names to preferred display values:

PYTHON

labels = {
    "hindfoot_length": "Hindfoot length (mm)",
    "genus": "Genus",
    "weight": "Weight (g)"
}

We can then update the plot itself using the title and labels keyword arguments:

PYTHON

px.scatter(
    rodents,
    x="weight",
    y="hindfoot_length",
    color="genus",
    opacity=0.2,
    color_discrete_sequence=colors,
    category_orders=cat_order,
    title="Rodent size by genus",
    labels=labels,
)

OUTPUT

Create a faceted plot


Even with the semitransparent points, there is a good deal of overlap among the data on this plot, particularly in the lower left part. It may be useful to plot each genus separately to see if anything interesting is being obscured. We could do so using techniques we’ve already covered. For example, we could filter the dataframe by genus and create a separate plot for each.

Plotly provides a simpler approach called faceting. A facet is a type of filter, and a faceted plot includes a separate subplot for each unique value. In plotly, we can create a faceted plot using the facet_col keyword argument. This argument produces a separate subplot for each unique value in the specified column. The subplots are arranged in a single row.

The plots we have created so far have not specified dimensions, but we will need to consider the size of the faceted plot. Because this plot includes a number of subplots, it may appear cramped unless it is quite wide. We can use the height and width keyword arguments to set the size of the plot. Each of these arguments requires an integer specifying a size in pixels.

Let’s update our plot to facet it by genus and make it 1400 pixels wide by 400 pixels tall:

PYTHON

px.scatter(
    rodents,
    x="weight",
    y="hindfoot_length",
    color="genus",
    opacity=0.2,
    color_discrete_sequence=colors,
    category_orders=cat_order,
    title="Rodent size by genus",
    labels=labels,
    facet_col="genus",
    width=1400,
    height=400,
)

OUTPUT

Row facets

The facet_col argument produces a single row of subplots. There is another argument, facet_row, that places each subplot in a separate row instead. Column and row facets can even be combined, for example, to produce a grid of subplots with genus as columns and sex as rows.

Making a box plot


We briefly discussed box plots (also known as box-and-whisker plots) in lesson 4 as part of the discussion of summary statistics. Box plots are an effective way to visualize the distribution of data. Plotly uses the px.box() method to generate them.

Some issues raised about the plot created during the earlier lesson—including the arbitrary order in which data was plotted, the repeated colors, and the inclusion of species with no data—have already been addressed above. We can carry those fixes over to the box plot using the same approaches that we used for the scatter plots. Let’s create a box plot of hindfoot length by genus using the same color and ordering rules we defined earlier:

PYTHON

px.box(
    rodents,
    x="genus",
    y="hindfoot_length",
    color="genus",
    color_discrete_sequence=colors,
    category_orders=cat_order,
    title="Rodent hindfoot length by genus",
    labels=labels,
)

OUTPUT

We now have a box plot with colors corresponding to the scatter plots above, with an alphabetically ordered x axis and legend.

There are other aspects of the box plot that we may want to tweak. We’ll start with a concern about how the data is being represented. By default, plotly’s box plots show individual points only for outliers, that is, points that fall outside the upper and lower fences. This works well enough for normally distributed data but can obscure patterns in more complex distributions. And indeed, some genera, like Dipodomys, show a large number of outliers on the box plot and multiple clusters of data on the scatter plot. How might we update the box plot to better convey these distributions?

The px.box() method includes a keyword argument, points, that allows us to change how the underlying data is displayed. It accepts the following values:

  • "outliers" shows only the outliers (the default)
  • "all" shows all points
  • False shows no points

Let’s try updating the box plot to show all the underlying data:

PYTHON

px.box(
    rodents,
    x="genus",
    y="hindfoot_length",
    color="genus",
    color_discrete_sequence=colors,
    category_orders=cat_order,
    title="Rodent hindfoot length by genus",
    labels=labels,
    points="all",
)

OUTPUT

A point cloud is now visible to the left of each box-and-whisker. We can see that plotly has spread the points out a bit along the x axis. This process, called jitter, is necessary because otherwise the points for each category would fall in a single vertical line. We can also see that we’ve run into the same problem we had above with the scatter plot: the large number of overlapping points for each box makes it hard to see what is going on inside each point cloud.

We can again address this problem by changing the opacity of the markers. It’s a little more complicated than it was for the scatter plot, however, because the px.box() method does not allow us to set the marker opacity directly. Instead, we have to build the plot, then update the existing markers using the update_traces() method. (Plotly refers to markers, lines, and other elements as traces.)

PYTHON

fig = px.box(
    rodents,
    x="genus",
    y="hindfoot_length",
    color="genus",
    color_discrete_sequence=colors,
    category_orders=cat_order,
    title="Rodent hindfoot length by genus",
    labels=labels,
    points="all",
)
fig.update_traces(marker={"opacity": 0.1})

OUTPUT

With the points now semitransparent, it is possible to see separate populations among some genera, like Dipodomys.

Another way to examine the distribution of data is a violin plot, which represents the distribution of points as a smoothed density curve, similar to a bell curve. We can change our box plot to a violin plot by swapping px.violin() in for px.box():

PYTHON

fig = px.violin(
    rodents,
    x="genus",
    y="hindfoot_length",
    color="genus",
    color_discrete_sequence=colors,
    category_orders=cat_order,
    title="Rodent hindfoot length by genus",
    labels=labels,
    points="all",
)
fig.update_traces(marker={"opacity": 0.1})

OUTPUT

This plot makes it easier to identify complex distributions, like the bimodal distribution for Chaetodipus, that are visible but not necessarily obvious in the point clouds.

Challenge

Let’s return to a question posed all the way back in lesson 2: How has the weight of Dipodomys species changed over time? Make a plot that tries to answer this question.

There are several reasonable approaches to this question using the scatter plots and box plots covered so far in this lesson. One possibility is a faceted plot showing the mean weight of each Dipodomys species over the course of the study.

PYTHON

# Create genus_species column
rodents["genus_species"] = rodents["genus"] + " " + rodents["species"]

# Limit to Dipodomys
dipodomys = rodents[rodents["genus"] == "Dipodomys"].copy()

# Group by genus_species
grouped = dipodomys.groupby(["genus_species", "year"])["weight"].mean().reset_index()

# Create scatter plot
px.scatter(grouped, x="year", y="weight", facet_col="genus_species")

The plot generated by this code shows that the mean weights for these species oscillated over the course of the study. Overall, mean weight increased slightly for the two smaller species but decreased for the largest species (albeit with large oscillations that make determining a trend difficult).

Key Points

  • Plotly offers a wide variety of ways to build and style scatter plots
  • Use scatter plots to visualize how parameters covary
  • Use box and violin plots to visualize the distribution of a parameter