Content from Introducing Python


Last updated on 2022-07-18

Estimated time: 60 minutes

Overview

Questions

  • What is Python?
  • How do we assign variables in Python?
  • How do we perform actions in Python?
  • How does Python handle different types of data?

Objectives

  • Introduce some Python terminology
  • Learn how to assign values to variables
  • Learn how to use functions and methods
  • Introduce the built-in Python data types and functions
  • Introduce the Python Standard Library

About Python


Python is a high-level, general purpose language. In practical terms, that means it dispenses with many of the complexities of writing code. It is widely used and has an active community of developers, making it suitable for a large number of tasks.

A core philosophical tenet of Python is that code is read more often than it is written. Best practices for coding in Python therefore emphasize writing well-documented, consistently formatted code that follows language-wide standards. In a Pythonic world, everyone’s code would look the same.

One feature to be aware of going in is that Python uses whitespace (as opposed to braces) to organize blocks of code. This can be a sticking point for people used to other languages.
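For example, in the short sketch below (the variable names are invented for illustration), indentation alone marks which lines belong to the loop and to the if statement:

```python
# Indentation (four spaces by convention) defines each block of code
numbers = [1, 2, 3]
total = 0
for number in numbers:
    total = total + number  # runs once per item because it is indented under the for
if total > 5:
    print("total is greater than 5")  # runs only when the condition holds
```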

Getting started


Everything in Python is an object. Each object has a type, which controls what the object can do. For example, we can add integers together or append a string to a list. An object can be assigned to a variable, and a function performs an action.

Python is an object-oriented language. This means that each object includes built-in variables and functions that can be used from the object itself. When working from an object, we use different terms to refer to these concepts:

  • an attribute stores a piece of data
  • a method performs an action

Let’s look at one of the main built-in types, str, to see how this works in practice. In Python, str, short for string, is used to store and manipulate text. To get started, we’ll assign the string “hello world” to the variable text. In Python, we use a single equal sign for assignment and quotes to create a string.

Using quotes

Either single or double quotes can be used to create strings, but try to use them consistently! We will use double quotes in the examples here.
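For example, these two assignments produce identical strings:

```python
single = 'same text'
double = "same text"
single == double  # True: the quote style is not part of the string itself
```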

First we’ll define a variable called text. Variables can be any type of object supported by Python, but in this case we’ll use a string:

PYTHON

text = "hello world"

Now, instead of retyping the full string every time we need it, we can access that string by referring to the variable:

PYTHON

text

OUTPUT

'hello world'

We can check the type of the variable using the type() function:

PYTHON

type(text)

OUTPUT

str

We can modify the text by calling one of the methods built into the str class. We call methods by appending a period and the name of the method to the object. For example, to display the string in upper case, use upper():

PYTHON

text.upper()

OUTPUT

'HELLO WORLD'

After running a cell, Jupyter displays a representation of the object on the last line of that cell (note that some actions, like assigning an object to a variable, produce no output). Some cells contain more than one line, and it is often useful to display content produced earlier in the cell. This can be accomplished using the built-in print() function. In the cell below, we’ll use print() to show a series of strings:

PYTHON

"hello"
print("h")
print("e")
print("l")
print("l")
print("o")
"world"

OUTPUT

h
e
l
l
o

'world'

Note that the string “hello” at the top of the cell lacks a print statement and does not appear in the output, whereas the text “world” does appear in the output because it is the last line of the cell.

Each object may contain many attributes and methods. Use the help() function on any object, including functions or methods, to show a description of the object and list the available methods.

PYTHON

help(str)

OUTPUT

Help on class str in module builtins:

class str(object)
 |  str(object='') -> str
 |  str(bytes_or_buffer[, encoding[, errors]]) -> str
 |
 |  Create a new string object from the given object. If encoding or
 |  errors is specified, then the object must expose a data buffer
 |  that will be decoded using the given encoding and error handler.
 |  Otherwise, returns the result of object.__str__() (if defined)
 |  or repr(object).
 |  encoding defaults to sys.getdefaultencoding().
 |  errors defaults to 'strict'.
 |
 |  Methods defined here:
 |
 |  __add__(self, value, /)
 |      Return self+value.
 |
 |  __contains__(self, key, /)
 |      Return key in self.
 |
 |  __eq__(self, value, /)
 |      Return self==value.
 |
 |  __format__(self, format_spec, /)
 |      Return a formatted version of the string as described by format_spec.
 |
 |  __ge__(self, value, /)
 |      Return self>=value.
 |
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |
 |  __getitem__(self, key, /)
 |      Return self[key].
 |
 |  __getnewargs__(...)
 |
 |  __gt__(self, value, /)
 |      Return self>value.
 |
 |  __hash__(self, /)
 |      Return hash(self).
 |
 |  __iter__(self, /)
 |      Implement iter(self).
 |
 |  __le__(self, value, /)
 |      Return self<=value.
 |
 |  __len__(self, /)
 |      Return len(self).
 |
 |  __lt__(self, value, /)
 |      Return self<value.
 |
 |  __mod__(self, value, /)
 |      Return self%value.
 |
 |  __mul__(self, value, /)
 |      Return self*value.
 |
 |  __ne__(self, value, /)
 |      Return self!=value.
 |
 |  __repr__(self, /)
 |      Return repr(self).
 |
 |  __rmod__(self, value, /)
 |      Return value%self.
 |
 |  __rmul__(self, value, /)
 |      Return value*self.
 |
 |  __sizeof__(self, /)
 |      Return the size of the string in memory, in bytes.
 |
 |  __str__(self, /)
 |      Return str(self).
 |
 |  capitalize(self, /)
 |      Return a capitalized version of the string.
 |
 |      More specifically, make the first character have upper case and the rest lower
 |      case.
 |
 |  casefold(self, /)
 |      Return a version of the string suitable for caseless comparisons.
 |
 |  center(self, width, fillchar=' ', /)
 |      Return a centered string of length width.
 |
 |      Padding is done using the specified fill character (default is a space).
 |
 |  count(...)
 |      S.count(sub[, start[, end]]) -> int
 |
 |      Return the number of non-overlapping occurrences of substring sub in
 |      string S[start:end].  Optional arguments start and end are
 |      interpreted as in slice notation.
 |
 |  encode(self, /, encoding='utf-8', errors='strict')
 |      Encode the string using the codec registered for encoding.
 |
 |      encoding
 |        The encoding in which to encode the string.
 |      errors
 |        The error handling scheme to use for encoding errors.
 |        The default is 'strict' meaning that encoding errors raise a
 |        UnicodeEncodeError.  Other possible values are 'ignore', 'replace' and
 |        'xmlcharrefreplace' as well as any other name registered with
 |        codecs.register_error that can handle UnicodeEncodeErrors.
 |
 |  endswith(...)
 |      S.endswith(suffix[, start[, end]]) -> bool
 |
 |      Return True if S ends with the specified suffix, False otherwise.
 |      With optional start, test S beginning at that position.
 |      With optional end, stop comparing S at that position.
 |      suffix can also be a tuple of strings to try.
 |
 |  expandtabs(self, /, tabsize=8)
 |      Return a copy where all tab characters are expanded using spaces.
 |
 |      If tabsize is not given, a tab size of 8 characters is assumed.
 |
 |  find(...)
 |      S.find(sub[, start[, end]]) -> int
 |
 |      Return the lowest index in S where substring sub is found,
 |      such that sub is contained within S[start:end].  Optional
 |      arguments start and end are interpreted as in slice notation.
 |
 |      Return -1 on failure.
 |
 |  format(...)
 |      S.format(*args, **kwargs) -> str
 |
 |      Return a formatted version of S, using substitutions from args and kwargs.
 |      The substitutions are identified by braces ('{' and '}').
 |
 |  format_map(...)
 |      S.format_map(mapping) -> str
 |
 |      Return a formatted version of S, using substitutions from mapping.
 |      The substitutions are identified by braces ('{' and '}').
 |
 |  index(...)
 |      S.index(sub[, start[, end]]) -> int
 |
 |      Return the lowest index in S where substring sub is found,
 |      such that sub is contained within S[start:end].  Optional
 |      arguments start and end are interpreted as in slice notation.
 |
 |      Raises ValueError when the substring is not found.
 |
 |  isalnum(self, /)
 |      Return True if the string is an alpha-numeric string, False otherwise.
 |
 |      A string is alpha-numeric if all characters in the string are alpha-numeric and
 |      there is at least one character in the string.
 |
 |  isalpha(self, /)
 |      Return True if the string is an alphabetic string, False otherwise.
 |
 |      A string is alphabetic if all characters in the string are alphabetic and there
 |      is at least one character in the string.
 |
 |  isascii(self, /)
 |      Return True if all characters in the string are ASCII, False otherwise.
 |
 |      ASCII characters have code points in the range U+0000-U+007F.
 |      Empty string is ASCII too.
 |
 |  isdecimal(self, /)
 |      Return True if the string is a decimal string, False otherwise.
 |
 |      A string is a decimal string if all characters in the string are decimal and
 |      there is at least one character in the string.
 |
 |  isdigit(self, /)
 |      Return True if the string is a digit string, False otherwise.
 |
 |      A string is a digit string if all characters in the string are digits and there
 |      is at least one character in the string.
 |
 |  isidentifier(self, /)
 |      Return True if the string is a valid Python identifier, False otherwise.
 |
 |      Call keyword.iskeyword(s) to test whether string s is a reserved identifier,
 |      such as "def" or "class".
 |
 |  islower(self, /)
 |      Return True if the string is a lowercase string, False otherwise.
 |
 |      A string is lowercase if all cased characters in the string are lowercase and
 |      there is at least one cased character in the string.
 |
 |  isnumeric(self, /)
 |      Return True if the string is a numeric string, False otherwise.
 |
 |      A string is numeric if all characters in the string are numeric and there is at
 |      least one character in the string.
 |
 |  isprintable(self, /)
 |      Return True if the string is printable, False otherwise.
 |
 |      A string is printable if all of its characters are considered printable in
 |      repr() or if it is empty.
 |
 |  isspace(self, /)
 |      Return True if the string is a whitespace string, False otherwise.
 |
 |      A string is whitespace if all characters in the string are whitespace and there
 |      is at least one character in the string.
 |
 |  istitle(self, /)
 |      Return True if the string is a title-cased string, False otherwise.
 |
 |      In a title-cased string, upper- and title-case characters may only
 |      follow uncased characters and lowercase characters only cased ones.
 |
 |  isupper(self, /)
 |      Return True if the string is an uppercase string, False otherwise.
 |
 |      A string is uppercase if all cased characters in the string are uppercase and
 |      there is at least one cased character in the string.
 |
 |  join(self, iterable, /)
 |      Concatenate any number of strings.
 |
 |      The string whose method is called is inserted in between each given string.
 |      The result is returned as a new string.
 |
 |      Example: '.'.join(['ab', 'pq', 'rs']) -> 'ab.pq.rs'
 |
 |  ljust(self, width, fillchar=' ', /)
 |      Return a left-justified string of length width.
 |
 |      Padding is done using the specified fill character (default is a space).
 |
 |  lower(self, /)
 |      Return a copy of the string converted to lowercase.
 |
 |  lstrip(self, chars=None, /)
 |      Return a copy of the string with leading whitespace removed.
 |
 |      If chars is given and not None, remove characters in chars instead.
 |
 |  partition(self, sep, /)
 |      Partition the string into three parts using the given separator.
 |
 |      This will search for the separator in the string.  If the separator is found,
 |      returns a 3-tuple containing the part before the separator, the separator
 |      itself, and the part after it.
 |
 |      If the separator is not found, returns a 3-tuple containing the original string
 |      and two empty strings.
 |
 |  removeprefix(self, prefix, /)
 |      Return a str with the given prefix string removed if present.
 |
 |      If the string starts with the prefix string, return string[len(prefix):].
 |      Otherwise, return a copy of the original string.
 |
 |  removesuffix(self, suffix, /)
 |      Return a str with the given suffix string removed if present.
 |
 |      If the string ends with the suffix string and that suffix is not empty,
 |      return string[:-len(suffix)]. Otherwise, return a copy of the original
 |      string.
 |
 |  replace(self, old, new, count=-1, /)
 |      Return a copy with all occurrences of substring old replaced by new.
 |
 |        count
 |          Maximum number of occurrences to replace.
 |          -1 (the default value) means replace all occurrences.
 |
 |      If the optional argument count is given, only the first count occurrences are
 |      replaced.
 |
 |  rfind(...)
 |      S.rfind(sub[, start[, end]]) -> int
 |
 |      Return the highest index in S where substring sub is found,
 |      such that sub is contained within S[start:end].  Optional
 |      arguments start and end are interpreted as in slice notation.
 |
 |      Return -1 on failure.
 |
 |  rindex(...)
 |      S.rindex(sub[, start[, end]]) -> int
 |
 |      Return the highest index in S where substring sub is found,
 |      such that sub is contained within S[start:end].  Optional
 |      arguments start and end are interpreted as in slice notation.
 |
 |      Raises ValueError when the substring is not found.
 |
 |  rjust(self, width, fillchar=' ', /)
 |      Return a right-justified string of length width.
 |
 |      Padding is done using the specified fill character (default is a space).
 |
 |  rpartition(self, sep, /)
 |      Partition the string into three parts using the given separator.
 |
 |      This will search for the separator in the string, starting at the end. If
 |      the separator is found, returns a 3-tuple containing the part before the
 |      separator, the separator itself, and the part after it.
 |
 |      If the separator is not found, returns a 3-tuple containing two empty strings
 |      and the original string.
 |
 |  rsplit(self, /, sep=None, maxsplit=-1)
 |      Return a list of the words in the string, using sep as the delimiter string.
 |
 |        sep
 |          The delimiter according which to split the string.
 |          None (the default value) means split according to any whitespace,
 |          and discard empty strings from the result.
 |        maxsplit
 |          Maximum number of splits to do.
 |          -1 (the default value) means no limit.
 |
 |      Splits are done starting at the end of the string and working to the front.
 |
 |  rstrip(self, chars=None, /)
 |      Return a copy of the string with trailing whitespace removed.
 |
 |      If chars is given and not None, remove characters in chars instead.
 |
 |  split(self, /, sep=None, maxsplit=-1)
 |      Return a list of the words in the string, using sep as the delimiter string.
 |
 |      sep
 |        The delimiter according which to split the string.
 |        None (the default value) means split according to any whitespace,
 |        and discard empty strings from the result.
 |      maxsplit
 |        Maximum number of splits to do.
 |        -1 (the default value) means no limit.
 |
 |  splitlines(self, /, keepends=False)
 |      Return a list of the lines in the string, breaking at line boundaries.
 |
 |      Line breaks are not included in the resulting list unless keepends is given and
 |      true.
 |
 |  startswith(...)
 |      S.startswith(prefix[, start[, end]]) -> bool
 |
 |      Return True if S starts with the specified prefix, False otherwise.
 |      With optional start, test S beginning at that position.
 |      With optional end, stop comparing S at that position.
 |      prefix can also be a tuple of strings to try.
 |
 |  strip(self, chars=None, /)
 |      Return a copy of the string with leading and trailing whitespace removed.
 |
 |      If chars is given and not None, remove characters in chars instead.
 |
 |  swapcase(self, /)
 |      Convert uppercase characters to lowercase and lowercase characters to uppercase.
 |
 |  title(self, /)
 |      Return a version of the string where each word is titlecased.
 |
 |      More specifically, words start with uppercased characters and all remaining
 |      cased characters have lower case.
 |
 |  translate(self, table, /)
 |      Replace each character in the string using the given translation table.
 |
 |        table
 |          Translation table, which must be a mapping of Unicode ordinals to
 |          Unicode ordinals, strings, or None.
 |
 |      The table must implement lookup/indexing via __getitem__, for instance a
 |      dictionary or list.  If this operation raises LookupError, the character is
 |      left untouched.  Characters mapped to None are deleted.
 |
 |  upper(self, /)
 |      Return a copy of the string converted to uppercase.
 |
 |  zfill(self, width, /)
 |      Pad a numeric string with zeros on the left, to fill a field of the given width.
 |
 |      The string is never truncated.
 |
 |  ----------------------------------------------------------------------
 |  Static methods defined here:
 |
 |  __new__(*args, **kwargs) from builtins.type
 |      Create and return a new object.  See help(type) for accurate signature.
 |
 |  maketrans(...)
 |      Return a translation table usable for str.translate().
 |
 |      If there is only one argument, it must be a dictionary mapping Unicode
 |      ordinals (integers) or characters to Unicode ordinals, strings or None.
 |      Character keys will be then converted to ordinals.
 |      If there are two arguments, they must be strings of equal length, and
 |      in the resulting dictionary, each character in x will be mapped to the
 |      character at the same position in y. If there is a third argument, it
 |      must be a string, whose characters will be mapped to None in the result.

Data types

Python also includes data types for representing other kinds of data, including numbers and collections. The core Python data types are introduced in the table below. We’ll talk more about some of these as we encounter them in the lesson:

Type Definition Example
str Character string "hello world"
int Integer numerical 42
float Approximate numerical 0.406
bool Stores True or False values True or False
list Sequence that can be modified ["a", "b", "c"]
tuple Sequence that cannot be modified ("a", "b", "c")
dict Mapping of keys to values {"DC": "District of Columbia", "MD": "Maryland", "VA": "Virginia"}
set Collection of unique values {"1", "2", 1}
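A short sketch confirming a few of these types (note that the set example from the table has three members, because the string "1" and the integer 1 are different values):

```python
sequence = ["a", "b", "c"]                 # list: can be modified
immutable = ("a", "b", "c")                # tuple: cannot be modified
mapping = {"DC": "District of Columbia"}   # dict: maps keys to values
unique = {"1", "2", 1}                     # set: "1" and 1 are distinct members

type(sequence)  # list
len(unique)     # 3
```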

Libraries


Python includes a number of built-in functions that are available wherever Python is installed. See the table below for some examples.

Examples of built-in functions

Name Description Example
abs() Gets the absolute value of a number abs(-1.5) # returns 1.5
max() Gets the highest value in a sequence max([1, 2, 3]) # returns 3
min() Gets the lowest value in a sequence min([1, 2, 3]) # returns 1
round() Rounds a number to the nearest integer round(5.4) # returns 5
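These functions can be run directly anywhere Python is installed. One caveat worth knowing, which the table does not mention, is that round() resolves ties by rounding to the nearest even integer:

```python
abs(-1.5)       # returns 1.5
max([1, 2, 3])  # returns 3
min([1, 2, 3])  # returns 1
round(5.4)      # returns 5
round(2.5)      # returns 2: Python rounds ties to the nearest even integer
```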

Python also includes a number of built-in libraries. A library bundles functions and other code related to a single task or data type, simplifying common operations. By using a common code base, a library allows coders to work more quickly and with fewer errors.

The libraries built into Python are referred to as the Python Standard Library. They can be accessed through a typical Python installation and do not require any additional downloads. A few examples are included in the table below, but as with the table of built-in functions, there are more where those came from.

Examples from the Python Standard Library

Library Description
datetime Reads, writes, and analyzes dates and times
os Creates, manipulates, and gets information about files and paths
random Generates pseudo-random numbers
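A brief sketch of two of these libraries in action (the seed value 42 is arbitrary, and the file path is illustrative):

```python
import os
import random

random.seed(42)                # fix the seed so the result repeats
value = random.randint(1, 10)  # a pseudo-random integer from 1 to 10, inclusive

path = os.path.join("data", "surveys.csv")  # joins parts with the OS path separator
```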

Documentation

The documentation for each library in the Python Standard Library is available on the Python website. Documentation is an invaluable resource. It provides descriptions of what a library does and detailed information about how it can be used, often including examples.

Unlike the built-in functions, we must import a library before we can use it. We do so using the import statement:

PYTHON

import datetime

Once imported, the library is available for use anywhere in the current document. When using a library, we must include the name of the library to access its contents. For example, to create a datetime.date object (that is, a date object defined by the datetime library), we include both the library name and the class name:

PYTHON

date = datetime.date(1970, 1, 1)

As with the built-in types, the datetime.date object includes its own suite of attributes and methods. We can, for example, use the year attribute to get the year:

PYTHON

date.year

OUTPUT

1970

Or convert the date to a string using the strftime() method and date format codes. Here, %Y corresponds to YYYY, %m to MM, and %d to DD.

PYTHON

date.strftime("%Y-%m-%d")

OUTPUT

'1970-01-01'
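A few more examples with the same date object; the attributes and methods below are all standard on datetime.date:

```python
import datetime

date = datetime.date(1970, 1, 1)
date.month        # 1
date.isoformat()  # '1970-01-01'
date.weekday()    # 3, i.e., Thursday (Monday is 0)
```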

Like Python itself, the Python Standard Library is maintained by the Python Software Foundation. The built-in libraries are limited to a relatively small set of operations expected to be useful to a broad population of users. However, Python allows users to create their own libraries to perform actions that are beyond the scope of core Python. The rest of this lesson will focus on an external library called pandas.

Key Points

  • Python is a widely used language that can be used for a variety of tasks, including analyzing data
  • Python uses different data types to handle text, numbers, collections, and other kinds of data
  • Assign values to variables using the = operator
  • Use functions and methods to perform specific actions
  • Python’s functionality can be extended using libraries, including libraries written by members of the community that address discipline-specific needs
  • Use the help() function and developer documentation to learn more about Python and Python libraries

Content from Introducing pandas


Last updated on 2022-07-18

Estimated time: 60 minutes

Overview

Questions

  • What data will we be working with in this lesson?
  • What is pandas?
  • Why use pandas for data analysis?
  • How do we read and write data using pandas?

Objectives

  • Learn about the dataset we’ll be working with
  • Look at the benefits of using pandas to analyze data
  • Import data from a CSV into a pandas dataframe
  • Learn how pandas handles different types of data
  • Write a dataframe to a CSV

Dataset description


The dataset description was taken from Data Management with SQL for Ecologists (CC-BY-4.0)

The data we will be using is a time-series for a small mammal community in southern Arizona. This is part of a project studying the effects of rodents and ants on the plant community that has been running for almost 40 years. The rodents are sampled on a series of 24 plots, with different experimental manipulations controlling which rodents are allowed to access which plots.

This is a real dataset that has been used in over 100 publications. We’ve simplified it for the workshop, but you can download the full dataset and work with it using exactly the same tools we’ll learn about today.

Answering questions using data


Let’s look at some of the cleaned spreadsheets we downloaded during Setup to complete this challenge. We’ll need the following two files:

  • surveys.csv
  • species.csv

Challenge

Open each of these csv files and explore them. What information is contained in each file? Specifically, if we had the following research questions:

  • How has the hindfoot length and weight of Dipodomys species changed over time?
  • What is the average weight of each species, per year?
  • What information can I learn about Dipodomys species in the 2000s, over time?

What would we need to answer these questions? Which files have the data we need? What operations would we need to perform if we were doing these analyses by hand?

Hint: We can view CSV files in JupyterLab using the left sidebar. Click on the Folder icon at the top left of the sidebar to see the files, then go to the data directory to see the CSV files we’ve downloaded for this lesson.

In order to answer the questions described above, we’ll need to do the following basic data operations:

  • select subsets of the data (rows and columns)
  • group subsets of data
  • do math and other calculations
  • combine data across spreadsheets

In addition, we don’t want to do this manually! Instead of searching for the right pieces of data ourselves, or clicking between spreadsheets, or manually sorting columns, we want to make the computer do the work.

In particular, we want to use a tool where it’s easy to repeat our analysis in case our data changes. We also want to do all this searching without actually modifying our source data.

Why use pandas?


The Python Data Analysis Library, or pandas, is a Python library used to work with dataframes. A dataframe is a representation of tabular data very similar to a spreadsheet, consisting of rows (representing records) and columns (representing fields or variables). It is a very common format for representing scientific data and is likely to be very familiar to anyone taking this course.

pandas offers the same advantages as any well-written library: It creates a common codebase for working on a single task, in this case, analyzing data. Some benefits of this approach include:

  • Reliability: Provides flexible, well-tested methods for reading, querying, aggregating, grouping, and plotting data
  • Repeatability: Repeat the same analyses when data is added or changed
  • Speed: Faster in many cases than coding our own functions in Python
  • Reproducibility: Document and share code in narrative form using tools like Jupyter notebooks
  • Community: Access a large, active community for help when we run into problems

Importing data using pandas


Unlike the libraries in the Python Standard Library discussed in the previous lesson, pandas is not part of the typical Python installation. Once it has been installed, however, it can be accessed using the same import command used to import the built-in libraries. By convention, pandas is imported using the alias “pd”. We can assign the alias using the as keyword:

PYTHON

import pandas as pd

Using aliases

Why use an alias? We will refer to pandas many, many times when writing a script, so it’s useful to abbreviate the name and save some keystrokes. But it’s also a matter of consistency with the larger community. Many of the core scientific Python libraries, including pandas, recommend a specific alias, so most code shared online will use those aliases as well.

Now that pandas has been imported, we can access the function we need to load data from a CSV, pd.read_csv(). The function call has three parts:

  • The name (or in this case, alias) of the object that defines the function. This can be a library or any other object, and it can be omitted in some cases (for example, when using a function built into Python)
  • The name of the function we’d like to use
  • A set of parentheses that tells the function to run

Many functions define parameters that allow the user to modify the behavior of the function. Parameters may be positional or named. In Python, data passed to positional parameters are called arguments (often abbreviated as args), and data passed to named parameters are called keyword arguments (often abbreviated as kwargs). In either case, the arguments are included inside the parentheses used to call the function.
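A minimal sketch of this distinction, using a hypothetical function greet() defined here purely for illustration:

```python
# greet() is a hypothetical function, not part of any library
def greet(name, punctuation="!"):
    return "hello " + name + punctuation

greet("world")                   # "world" is a positional argument: 'hello world!'
greet("world", punctuation="?")  # punctuation is a keyword argument: 'hello world?'
```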

Below, we will pass a single argument to pd.read_csv(): A string that tells the function where to find the surveys.csv file.

PYTHON

pd.read_csv("data/surveys.csv")

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length weight
0 1 7 16 1977 2 NL M 32.0 NaN
1 2 7 16 1977 3 NL M 33.0 NaN
2 3 7 16 1977 2 DM F 37.0 NaN
3 4 7 16 1977 7 DM M 36.0 NaN
4 5 7 16 1977 3 DM M 35.0 NaN
35544 35545 12 31 2002 15 AH NaN NaN NaN
35545 35546 12 31 2002 15 AH NaN NaN NaN
35546 35547 12 31 2002 10 RM F 15.0 14.0
35547 35548 12 31 2002 7 DO M 36.0 51.0
35548 35549 12 31 2002 5 NaN NaN NaN NaN

35549 rows × 9 columns

Here are a few things to observe about how the dataframe is structured:

  • By default, JupyterLab displays the first and last five rows of the dataframe
  • Each row represents a record
  • Each column represents a field
  • The unlabeled column on the far left is called the row label
  • pandas has done a lot of work behind the scenes when reading the data, including:
    • Assigning the row index as the row label
    • Assigning each column a data type based on its contents
    • Assigning certain cells the value NaN, which stands for “not a number” and is used to designate null values in the dataset. Here, those cells represent blank cells in the spreadsheet.

Much of this behavior can be controlled when the spreadsheet is first read by using keyword arguments. For example, to force pd.read_csv() to use the existing record_id column as the row label, use the index_col keyword argument:

PYTHON

pd.read_csv("data/surveys.csv", index_col="record_id")

OUTPUT

month day year plot_id species_id sex hindfoot_length weight
record_id
1 7 16 1977 2 NL M 32.0 NaN
2 7 16 1977 3 NL M 33.0 NaN
3 7 16 1977 2 DM F 37.0 NaN
4 7 16 1977 7 DM M 36.0 NaN
5 7 16 1977 3 DM M 35.0 NaN
35545 12 31 2002 15 AH NaN NaN NaN
35546 12 31 2002 15 AH NaN NaN NaN
35547 12 31 2002 10 RM F 15.0 14.0
35548 12 31 2002 7 DO M 36.0 51.0
35549 12 31 2002 5 NaN NaN NaN NaN

35549 rows × 8 columns

Now record_id appears on the far left, indicating that pandas is using that column as the row label.

Challenge

Run help() on the pd.read_csv() function (note that you should omit the trailing parentheses, as in help(pd.read_csv)). Alternatively, take a look at the much prettier documentation provided by the pandas developers. Based on that information, answer the following questions:

  1. What parameter would you use to control how null values are interpreted?
  2. What values are interpreted as NaN by default?

na_values

In addition to empty cells, the following values are interpreted as NaN, or null, by pandas. These strings may look familiar from programs like Excel.

  • #N/A
  • #N/A N/A
  • #NA
  • -1.#IND
  • -1.#QNAN
  • -NaN
  • -nan
  • 1.#IND
  • 1.#QNAN
  • <NA>
  • N/A
  • NA
  • NULL
  • NaN
  • n/a
  • nan
  • null
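As a self-contained sketch (the inline CSV text below is invented, standing in for surveys.csv), the na_values keyword argument adds extra strings to that default list:

```python
import io
import pandas as pd

# "NA" and the empty cell are interpreted as NaN by default
text = "record_id,weight\n1,NA\n2,10\n3,\n"
surveys = pd.read_csv(io.StringIO(text))
surveys["weight"].isna().sum()  # 2

# na_values adds extra strings to treat as null, e.g. a -999 sentinel
surveys = pd.read_csv(io.StringIO("record_id,weight\n1,-999\n2,10\n"),
                      na_values=["-999"])
surveys["weight"].isna().sum()  # 1
```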

Assigning to variables


We can assign the dataframe to a variable to make it easier to access. As we saw in the previous lesson, we use a single equals sign to assign an object to a variable. The variable name should be short and descriptive. By convention, variable names in Python use snake_case (that is, lower case with individual words separated by underscores).

PYTHON

surveys = pd.read_csv("data/surveys.csv")
surveys

OUTPUT

record_id month day year plot_id species_id sex hindfoot_length weight
0 1 7 16 1977 2 NL M 32.0 NaN
1 2 7 16 1977 3 NL M 33.0 NaN
2 3 7 16 1977 2 DM F 37.0 NaN
3 4 7 16 1977 7 DM M 36.0 NaN
4 5 7 16 1977 3 DM M 35.0 NaN
35544 35545 12 31 2002 15 AH NaN NaN NaN
35545 35546 12 31 2002 15 AH NaN NaN NaN
35546 35547 12 31 2002 10 RM F 15.0 14.0
35547 35548 12 31 2002 7 DO M 36.0 51.0
35548 35549 12 31 2002 5 NaN NaN NaN NaN

35549 rows × 9 columns

Understanding data types in pandas


We can use the info() method to see how pandas interpreted each column in the dataset. This method gives us the name, count, and data type of each column and provides some information about the dataset as a whole (for example, memory usage, which is helpful to know when working with large datasets):

PYTHON

surveys.info()

OUTPUT

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35549 entries, 0 to 35548
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   record_id        35549 non-null  int64
 1   month            35549 non-null  int64
 2   day              35549 non-null  int64
 3   year             35549 non-null  int64
 4   plot_id          35549 non-null  int64
 5   species_id       34786 non-null  object
 6   sex              33038 non-null  object
 7   hindfoot_length  31438 non-null  float64
 8   weight           32283 non-null  float64
dtypes: float64(2), int64(5), object(2)
memory usage: 2.4+ MB

Note that the data types used by pandas look a little different from the ones used by Python itself. See the table below for the rough equivalents.

Data types in pandas

Data Type  | Description                 | Similar To
object     | Character string or mixed   | str
int64      | Integer numerical           | int
float64    | Approximate numerical       | float
bool       | Stores True or False values | bool
datetime64 | Stores date and time values | datetime.datetime
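To see these types in action, we can build a small throwaway dataframe (the column names and values here are invented) and inspect its dtypes attribute:

```python
import pandas as pd

df = pd.DataFrame({
    "count": [1, 2],       # whole numbers become int64
    "mass": [1.5, 2.5],    # decimals become float64
    "code": ["NL", "DM"],  # strings become object
})

# Show the data type pandas chose for each column
print(df.dtypes)
```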

Saving a dataframe


When analyzing a dataset, we’ll often want to save our work to a file. The to_csv() method can be used to write a dataframe to a CSV. As when we read a CSV from a file above, we need to provide a path to which to save the file. The example below also includes the index keyword argument. Setting that parameter to False tells pandas not to include the row label when writing the CSV.

PYTHON

surveys.to_csv("data/surveys_mod.csv", index=False)

This is far from the only option for writing data–pandas supports a variety of file types for both reading and writing. Try searching for “read_” in the pandas API reference to see the supported formats.
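One way to convince yourself that to_csv() and read_csv() are mirror images is a quick round trip through an in-memory buffer (used here instead of a file on disk so the sketch is self-contained; the data is made up):

```python
import io

import pandas as pd

df = pd.DataFrame({"species_id": ["NL", "DM"], "weight": [40.0, 48.0]})

# Write to an in-memory buffer instead of a file on disk
buf = io.StringIO()
df.to_csv(buf, index=False)

# Rewind the buffer and read the CSV back into a new dataframe
buf.seek(0)
roundtrip = pd.read_csv(buf)

# The round-tripped dataframe matches the original
print(roundtrip.equals(df))
```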

Key Points

  • This lesson uses real data from a decades-long survey of rodents in Arizona
  • pandas is a data analysis library that allows users to read, manipulate, and view tabular data using Python
  • pandas represents data as a dataframe consisting of rows (records) and columns (fields or variables)
  • Read a dataframe from CSV using the pd.read_csv() function
  • Write a dataframe to CSV using the to_csv() method
  • The behavior of a function can be modified by including arguments and keyword arguments when the function is called
  • pandas uses its own classes to represent text, numbers, booleans, and datetimes

Content from Accessing Data in a Dataframe


Last updated on 2022-07-18 | Edit this page

Estimated time: 60 minutes

Overview

Questions

  • How can we look at individual rows and columns in a dataframe?
  • How can we perform calculations?
  • How can we modify the table and data?

Objectives

  • Access individual rows and columns
  • Access multiple columns at once using a list
  • Perform calculations like addition and subtraction
  • Rename columns using a dictionary
  • Access rows containing specific data
  • Sort the data returned by a query
  • Modify data using loc

We’ll begin by importing pandas and reading our CSV, as we did in the previous lesson:

PYTHON

import pandas as pd

surveys = pd.read_csv("data/surveys.csv")
surveys
record_id month day year plot_id species_id sex hindfoot_length weight
0 1 7 16 1977 2 NL M 32.0 NaN
1 2 7 16 1977 3 NL M 33.0 NaN
2 3 7 16 1977 2 DM F 37.0 NaN
3 4 7 16 1977 7 DM M 36.0 NaN
4 5 7 16 1977 3 DM M 35.0 NaN
35544 35545 12 31 2002 15 AH NaN NaN NaN
35545 35546 12 31 2002 15 AH NaN NaN NaN
35546 35547 12 31 2002 10 RM F 15.0 14.0
35547 35548 12 31 2002 7 DO M 36.0 51.0
35548 35549 12 31 2002 5 NaN NaN NaN NaN

35549 rows × 9 columns

We will now look at how to access rows and columns in the dataframe.

Getting columns


We can get the values from a single column by passing a string inside square brackets to the dataframe object. For example, to look at the year column, use:

PYTHON

surveys["year"]

OUTPUT

0        1977
1        1977
2        1977
3        1977
4        1977
         ...
35544    2002
35545    2002
35546    2002
35547    2002
35548    2002
Name: year, Length: 35549, dtype: int64

A single column is returned as a Series, which is why the output for this cell is formatted differently than the other cells in this lesson. The main difference between a Series and a DataFrame is that a DataFrame can have multiple columns. The two classes share many but not all attributes and methods. Note also that this is a copy of data from the original dataframe. Changing values in the series will have no effect on the original dataframe.
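The difference between the two classes is easy to see on a small example dataframe (the values below are made up):

```python
import pandas as pd

df = pd.DataFrame({"year": [1977, 1978], "weight": [40.0, 48.0]})

# A single column name returns a Series...
print(type(df["year"]).__name__)

# ...while a list of column names returns a DataFrame
print(type(df[["year"]]).__name__)
```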

Using lists to get more than one column at a time

Python uses the built-in list data type to store sequences, that is, an ordered list of objects. Any type of Python object can be stored in a list: strings, integers, floats, even other lists or collections. Once created, a list can be modified by appending, inserting, or deleting items. We will not be going into detail about these operations here, but as always you can learn more about how to use a list using help() or the Python docs.
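For reference, the basic list operations mentioned above look like this:

```python
cols = ["year", "month"]

cols.append("day")           # add an item to the end
cols.insert(0, "record_id")  # insert an item at index 0
cols.remove("record_id")     # delete an item by value

print(cols)
```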

Create a list using square brackets. Let’s create a list of the three columns in our dataframe that together give the date of an observation:

PYTHON

cols = ["year", "month", "day"]

When we pass this list to the survey dataframe using square brackets, as we did above, we retrieve a copy of the dataframe containing just those columns. Note that, when we get more than one column, pandas returns a dataframe:

PYTHON

surveys[cols]
year month day
0 1977 7 16
1 1977 7 16
2 1977 7 16
3 1977 7 16
4 1977 7 16
35544 2002 12 31
35545 2002 12 31
35546 2002 12 31
35547 2002 12 31
35548 2002 12 31

35549 rows × 3 columns

Getting rows


We can get the rows at the beginning of the table using the head method:

PYTHON

surveys.head()
record_id month day year plot_id species_id sex hindfoot_length weight
0 1 7 16 1977 2 NL M 32.0 NaN
1 2 7 16 1977 3 NL M 33.0 NaN
2 3 7 16 1977 2 DM F 37.0 NaN
3 4 7 16 1977 7 DM M 36.0 NaN
4 5 7 16 1977 3 DM M 35.0 NaN

By default, this method returns the first five rows. We can provide a number inside the parentheses if we need a specific number of rows:

PYTHON

surveys.head(20)
record_id month day year plot_id species_id sex hindfoot_length weight
0 1 7 16 1977 2 NL M 32.0 NaN
1 2 7 16 1977 3 NL M 33.0 NaN
2 3 7 16 1977 2 DM F 37.0 NaN
3 4 7 16 1977 7 DM M 36.0 NaN
4 5 7 16 1977 3 DM M 35.0 NaN
5 6 7 16 1977 1 PF M 14.0 NaN
6 7 7 16 1977 2 PE F NaN NaN
7 8 7 16 1977 1 DM M 37.0 NaN
8 9 7 16 1977 1 DM F 34.0 NaN
9 10 7 16 1977 6 PF F 20.0 NaN
10 11 7 16 1977 5 DS F 53.0 NaN
11 12 7 16 1977 7 DM M 38.0 NaN
12 13 7 16 1977 3 DM M 35.0 NaN
13 14 7 16 1977 8 DM NaN NaN NaN
14 15 7 16 1977 6 DM F 36.0 NaN
15 16 7 16 1977 4 DM F 36.0 NaN
16 17 7 16 1977 3 DS F 48.0 NaN
17 18 7 16 1977 2 PP M 22.0 NaN
18 19 7 16 1977 4 PF NaN NaN NaN
19 20 7 17 1977 11 DS F 48.0 NaN

The tail() method is similar, except it returns rows from the end of the table:

PYTHON

surveys.tail()
record_id month day year plot_id species_id sex hindfoot_length weight
35544 35545 12 31 2002 15 AH NaN NaN NaN
35545 35546 12 31 2002 15 AH NaN NaN NaN
35546 35547 12 31 2002 10 RM F 15.0 14.0
35547 35548 12 31 2002 7 DO M 36.0 51.0
35548 35549 12 31 2002 5 NaN NaN NaN NaN

The head() and tail() methods are useful for getting a feel for how our data is structured, but we’ll also want to be able to look at specific rows. As when we selected columns above, we can use square brackets to extract slices from the dataframe. A slice is a subset of the dataframe starting at one row and ending at another. To get a slice, we pass the starting and ending indexes to the square brackets as start:end:

PYTHON

surveys[2:5]
record_id month day year plot_id species_id sex hindfoot_length weight
2 3 7 16 1977 2 DM F 37.0 NaN
3 4 7 16 1977 7 DM M 36.0 NaN
4 5 7 16 1977 3 DM M 35.0 NaN

There are three things to be aware of when slicing a dataframe:

  • Row indexes are zero-based. That is, the first row has an index of 0, not 1.
  • When slicing, the slice includes start but not the end index. In this case, that means the slice includes rows 2, 3, and 4 but not 5.
  • The row label can be different from the row index. They happen to be the same here, but don’t count on that being true.

Core Python types like list and tuple use the same conventions, as do most libraries that deal with sequences.
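The same conventions can be checked with a plain Python list:

```python
letters = ["a", "b", "c", "d", "e"]

# Indexes are zero-based, and the end index is excluded,
# so this slice contains the items at indexes 2, 3, and 4
print(letters[2:5])
```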

Getting unique values


Recall that we can use square brackets to return a single column from a dataframe. We can get the list of unique values within that column using the unique() method:

PYTHON

surveys["species_id"].unique()

OUTPUT

array(['NL', 'DM', 'PF', 'PE', 'DS', 'PP', 'SH', 'OT', 'DO', 'OX', 'SS',
       'OL', 'RM', nan, 'SA', 'PM', 'AH', 'DX', 'AB', 'CB', 'CM', 'CQ',
       'RF', 'PC', 'PG', 'PH', 'PU', 'CV', 'UR', 'UP', 'ZL', 'UL', 'CS',
       'SC', 'BA', 'SF', 'RO', 'AS', 'SO', 'PI', 'ST', 'CU', 'SU', 'RX',
       'PB', 'PL', 'PX', 'CT', 'US'], dtype=object)

To do the same across multiple columns, we can use the drop_duplicates() method on a copy of the dataframe containing only the columns we’re interested in. Like any well-named method, drop_duplicates() does exactly what the name implies: It returns a copy of the dataframe with all duplicate rows removed.

PYTHON

surveys[["plot_id", "species_id"]].drop_duplicates()
plot_id species_id
0 2 NL
1 3 NL
2 2 DM
3 7 DM
4 3 DM
35474 22 SF
35511 11 US
35519 9 SF
35527 13 US
35543 15 US

612 rows × 2 columns

Calculating values


The survey dataset includes two columns, hindfoot_length and weight, that are stored as numbers and represent measurements. We may want to perform calculations using numbers like these in our own data. We can do so using Python’s built-in mathematical operators, including:

  • x + y for addition
  • x - y for subtraction
  • x * y for multiplication
  • x / y for division
  • x % y for calculating remainders
  • x ** y for exponents
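These operators work the same on plain Python numbers as they do on columns; for example:

```python
print(7 + 2)   # addition: 9
print(7 % 2)   # remainder: 1
print(7 ** 2)  # exponent: 49
```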

To make the examples in this section a little more useful, we’re going to remove all rows that contain null values using the dropna() method. This will filter out any rows that don’t have a valid hindfoot_length or weight, as well as those that have a null value in any other cell. (This is an inelegant solution to the problem of missing data. We’ll talk about more nuanced solutions later in the lesson.)

PYTHON

surveys_nona = surveys.dropna().copy()
surveys_nona
record_id month day year plot_id species_id sex hindfoot_length weight
62 63 8 19 1977 3 DM M 35.0 40.0
63 64 8 19 1977 7 DM M 37.0 48.0
64 65 8 19 1977 4 DM F 34.0 29.0
65 66 8 19 1977 4 DM F 35.0 46.0
66 67 8 19 1977 7 DM M 35.0 36.0
35540 35541 12 31 2002 15 PB F 24.0 31.0
35541 35542 12 31 2002 15 PB F 26.0 29.0
35542 35543 12 31 2002 15 PB F 27.0 34.0
35546 35547 12 31 2002 10 RM F 15.0 14.0
35547 35548 12 31 2002 7 DO M 36.0 51.0

30676 rows × 9 columns

Suppose we want to convert the weight column from grams to milligrams. To do so, we can multiply that column by 1000:

PYTHON

surveys_nona["weight"] * 1000

OUTPUT

62       40000.0
63       48000.0
64       29000.0
65       46000.0
66       36000.0
          ...
35540    31000.0
35541    29000.0
35542    34000.0
35546    14000.0
35547    51000.0
Name: weight, Length: 30676, dtype: float64

To convert it to kilograms, we can divide by 1000:

PYTHON

surveys_nona["weight"] / 1000

OUTPUT

62       0.040
63       0.048
64       0.029
65       0.046
66       0.036
         ...
35540    0.031
35541    0.029
35542    0.034
35546    0.014
35547    0.051
Name: weight, Length: 30676, dtype: float64

Note that calculations do not modify the original dataset. If we want to retain the result, we have to assign it to a new column:

PYTHON

surveys_nona["weight_mg"] = surveys_nona["weight"] * 1000
surveys_nona
record_id month day year plot_id species_id sex hindfoot_length weight weight_mg
62 63 8 19 1977 3 DM M 35.0 40.0 40000.0
63 64 8 19 1977 7 DM M 37.0 48.0 48000.0
64 65 8 19 1977 4 DM F 34.0 29.0 29000.0
65 66 8 19 1977 4 DM F 35.0 46.0 46000.0
66 67 8 19 1977 7 DM M 35.0 36.0 36000.0
35540 35541 12 31 2002 15 PB F 24.0 31.0 31000.0
35541 35542 12 31 2002 15 PB F 26.0 29.0 29000.0
35542 35543 12 31 2002 15 PB F 27.0 34.0 34000.0
35546 35547 12 31 2002 10 RM F 15.0 14.0 14000.0
35547 35548 12 31 2002 7 DO M 36.0 51.0 51000.0

30676 rows × 10 columns

We can also add, subtract, multiply, and divide columns, as in the following (admittedly nonsensical) calculation, which adds together the hindfoot_length and weight columns:

PYTHON

surveys_nona["hindfoot_length"] + surveys_nona["weight"]

OUTPUT

62       75.0
63       85.0
64       63.0
65       81.0
66       71.0
         ...
35540    55.0
35541    55.0
35542    61.0
35546    29.0
35547    87.0
Length: 30676, dtype: float64

Renaming columns


The hindfoot_length and weight columns don’t specify a unit, which may get confusing if we want to perform unit conversions like the one above. Fortunately, dataframes allow us to rename existing columns using the rename() method.

The rename() method uses a dictionary (or dict) to map between the old and new column names. As with list above, the dict data type is built into Python–we don’t need to import anything to use it. A dict maps keys to values. We can create one using curly braces:

PYTHON

dct = {"key1": "val1", "key2": "val2"}
dct

OUTPUT

{'key1': 'val1', 'key2': 'val2'}

Dictionaries are a useful and highly flexible data type. As with list above, we’ll be giving them short shrift here, but you can learn more about them at the Python docs.
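A few common dict operations, for reference (the keys and values here are invented):

```python
dct = {"key1": "val1", "key2": "val2"}

print(dct["key1"])    # look up a value by its key
dct["key3"] = "val3"  # add a new key-value pair
print("key3" in dct)  # check whether a key exists
```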

Here we’ll use a dict to specify how we want to rename our columns. The keys will be the current column names and the values the new column names. Note that we explicitly assign the result of rename() to the original variable–by default, rename() returns a copy of the original dataframe instead of modifying the original dataframe.

PYTHON

# Create a dict that maps from the old to the new column name
cols = {
    "hindfoot_length": "hindfoot_length_mm",
    "weight": "weight_g",
}

# Assign the result of the rename method back to surveys_nona
surveys_nona = surveys_nona.rename(columns=cols)

# View the dataframe with the new column names
surveys_nona
record_id month day year plot_id species_id sex hindfoot_length_mm weight_g weight_mg
62 63 8 19 1977 3 DM M 35.0 40.0 40000.0
63 64 8 19 1977 7 DM M 37.0 48.0 48000.0
64 65 8 19 1977 4 DM F 34.0 29.0 29000.0
65 66 8 19 1977 4 DM F 35.0 46.0 46000.0
66 67 8 19 1977 7 DM M 35.0 36.0 36000.0
35540 35541 12 31 2002 15 PB F 24.0 31.0 31000.0
35541 35542 12 31 2002 15 PB F 26.0 29.0 29000.0
35542 35543 12 31 2002 15 PB F 27.0 34.0 34000.0
35546 35547 12 31 2002 10 RM F 15.0 14.0 14000.0
35547 35548 12 31 2002 7 DO M 36.0 51.0 51000.0

30676 rows × 10 columns

Challenge

Create a dataframe that returns the year, month, day, species_id and weight in mg.

PYTHON

# Assign the weight in milligrams to the weight_mg column
surveys_nona["weight_mg"] = surveys_nona["weight_g"] * 1000

# Display a copy of survey with only the desired columns
surveys_nona[["year", "month", "day", "species_id", "weight_mg"]]
year month day species_id weight_mg
62 1977 8 19 DM 40000.0
63 1977 8 19 DM 48000.0
64 1977 8 19 DM 29000.0
65 1977 8 19 DM 46000.0
66 1977 8 19 DM 36000.0
35540 2002 12 31 PB 31000.0
35541 2002 12 31 PB 29000.0
35542 2002 12 31 PB 34000.0
35546 2002 12 31 RM 14000.0
35547 2002 12 31 DO 51000.0

30676 rows × 5 columns

Filtering data


pandas provides a variety of ways to filter a dataframe. For example, suppose we want to look at a specific species in the surveys_nona dataframe. We can view the rows matching a given species using the same square brackets we used above to select specific columns and rows. Here, however, instead of passing a value or list of values, we will pass a conditional expression.

A conditional expression is a statement that evaluates as either True or False. They often make use of inequality operators, for example:

  • == for equals
  • != for does not equal
  • > for greater than
  • >= for greater than or equal to
  • < for less than
  • <= for less than or equal to

Examples of conditional statements include:

  • "a" == "b" evaluates to False
  • "a" != "b" evaluates to True
  • 3 > 4 evaluates to False

Note that, when comparing strings, evaluations are case sensitive:

  • "a" == "A" evaluates to False

= for assignment, == for equality

Remember that, in Python, a single equal sign is used to assign values to variables. We’ve already used the assignment operator in this lesson, for example, when we created a new column in the dataframe.

To limit the dataframe to rows matching the species “DM”, include the conditional surveys_nona["species_id"] == "DM" inside the square brackets:

PYTHON

surveys_nona[surveys_nona["species_id"] == "DM"]
record_id month day year plot_id species_id sex hindfoot_length_mm weight_g weight_mg
62 63 8 19 1977 3 DM M 35.0 40.0 40000.0
63 64 8 19 1977 7 DM M 37.0 48.0 48000.0
64 65 8 19 1977 4 DM F 34.0 29.0 29000.0
65 66 8 19 1977 4 DM F 35.0 46.0 46000.0
66 67 8 19 1977 7 DM M 35.0 36.0 36000.0
35532 35533 12 31 2002 14 DM F 36.0 48.0 48000.0
35533 35534 12 31 2002 14 DM M 37.0 56.0 56000.0
35534 35535 12 31 2002 14 DM M 37.0 53.0 53000.0
35535 35536 12 31 2002 14 DM F 35.0 42.0 42000.0
35536 35537 12 31 2002 14 DM F 36.0 46.0 46000.0

9727 rows × 10 columns

To limit our results to observations made in or after 2000, use:

PYTHON

surveys_nona[surveys_nona["year"] >= 2000]
record_id month day year plot_id species_id sex hindfoot_length_mm weight_g weight_mg
30158 30159 1 8 2000 1 PP F 22.0 17.0 17000.0
30159 30160 1 8 2000 1 DO M 35.0 53.0 53000.0
30160 30161 1 8 2000 1 PP F 21.0 17.0 17000.0
30161 30162 1 8 2000 1 DM M 36.0 50.0 50000.0
30162 30163 1 8 2000 1 PP M 20.0 16.0 16000.0
35540 35541 12 31 2002 15 PB F 24.0 31.0 31000.0
35541 35542 12 31 2002 15 PB F 26.0 29.0 29000.0
35542 35543 12 31 2002 15 PB F 27.0 34.0 34000.0
35546 35547 12 31 2002 10 RM F 15.0 14.0 14000.0
35547 35548 12 31 2002 7 DO M 36.0 51.0 51000.0

4947 rows × 10 columns

As with rename() above, each filtering operation returns a copy of the dataframe. We will look at how to make changes to the original dataframe at the end of this lesson.

Building more complex queries


We can combine conditionals using what are called bitwise operators:

  • &: True if conditions on both sides of the operator are True (and)
  • |: True if a condition on either side is True (or)

To return all observations of DM in or after 2000, we can combine the two conditionals we used previously. Note that, when joining conditionals using & or |, we must wrap each individual condition in parentheses. If we omit the parentheses, pandas will not perform the comparisons in the expected order.

PYTHON

surveys_nona[(surveys_nona["species_id"] == "DM") & (surveys_nona["year"] >= 2000)]
record_id month day year plot_id species_id sex hindfoot_length_mm weight_g weight_mg
30161 30162 1 8 2000 1 DM M 36.0 50.0 50000.0
30178 30179 1 8 2000 12 DM M 36.0 60.0 60000.0
30195 30196 1 8 2000 17 DM M 37.0 52.0 52000.0
30196 30197 1 8 2000 17 DM F 34.0 43.0 43000.0
30209 30210 1 8 2000 22 DM M 38.0 56.0 56000.0
35532 35533 12 31 2002 14 DM F 36.0 48.0 48000.0
35533 35534 12 31 2002 14 DM M 37.0 56.0 56000.0
35534 35535 12 31 2002 14 DM M 37.0 53.0 53000.0
35535 35536 12 31 2002 14 DM F 35.0 42.0 42000.0
35536 35537 12 31 2002 14 DM F 36.0 46.0 46000.0

823 rows × 10 columns
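To see what each side of the & contributes, here is a minimal sketch using a small Series of made-up years. Each parenthesized comparison yields a boolean Series, and & combines them element by element (without the parentheses, Python would try to apply & before the comparisons, which fails):

```python
import pandas as pd

years = pd.Series([1999, 2000, 2001, 2002])  # made-up values

# Each parenthesized comparison produces a boolean Series;
# & keeps only the positions where both are True
mask = (years >= 2000) & (years < 2002)
print(mask.tolist())  # [False, True, True, False]
```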

Some column methods can also be used for filtering. One example is isin(), which is used to match a list of values. This method can be combined with other conditionals as above. The example below returns rows from 2000 or later with either “DM”, “DO”, or “DS” in the species_id column:

PYTHON

surveys_nona[
    surveys_nona["species_id"].isin(["DM", "DO", "DS"]) & (surveys_nona["year"] >= 2000)
]
record_id month day year plot_id species_id sex hindfoot_length_mm weight_g weight_mg
30159 30160 1 8 2000 1 DO M 35.0 53.0 53000.0
30161 30162 1 8 2000 1 DM M 36.0 50.0 50000.0
30166 30167 1 8 2000 1 DO M 35.0 41.0 41000.0
30170 30171 1 8 2000 2 DO M 36.0 52.0 52000.0
30172 30173 1 8 2000 2 DO F 35.0 54.0 54000.0
35533 35534 12 31 2002 14 DM M 37.0 56.0 56000.0
35534 35535 12 31 2002 14 DM M 37.0 53.0 53000.0
35535 35536 12 31 2002 14 DM F 35.0 42.0 42000.0
35536 35537 12 31 2002 14 DM F 36.0 46.0 46000.0
35547 35548 12 31 2002 7 DO M 36.0 51.0 51000.0

1229 rows × 10 columns

Sorting data


We can sort a dataframe using the sort_values() method. For this example, we’ll work from the subset defined above. First we need to assign that subset to a variable:

PYTHON

results = surveys_nona[
    surveys_nona["species_id"].isin(["DM", "DO", "DS"]) & (surveys_nona["year"] >= 2000)
]

Now we’ll sort the results by weight_g. To do so, pass that column name as an argument (that is, inside the trailing parentheses) to the sort_values() method:

PYTHON

results.sort_values("weight_g")
record_id month day year plot_id species_id sex hindfoot_length_mm weight_g weight_mg
30614 30615 4 30 2000 12 DM M 34.0 18.0 18000.0
34790 34791 10 5 2002 17 DM M 34.0 20.0 20000.0
32650 32651 9 22 2001 17 DM F 33.0 21.0 21000.0
32782 32783 9 23 2001 14 DM M 34.0 21.0 21000.0
32968 32969 10 14 2001 9 DM M 32.0 22.0 22000.0
35159 35160 11 10 2002 13 DO F 36.0 65.0 65000.0
30289 30290 2 5 2000 1 DO F 36.0 65.0 65000.0
31942 31943 4 21 2001 1 DO F 37.0 68.0 68000.0
32041 32042 5 26 2001 1 DO F 37.0 68.0 68000.0
31955 31956 4 21 2001 24 DO F 36.0 76.0 76000.0

1229 rows × 10 columns

By default, rows are sorted in ascending order (smallest to largest). We can modify this behavior using the ascending keyword argument:

PYTHON

results.sort_values("weight_g", ascending=False)
record_id month day year plot_id species_id sex hindfoot_length_mm weight_g weight_mg
31955 31956 4 21 2001 24 DO F 36.0 76.0 76000.0
32041 32042 5 26 2001 1 DO F 37.0 68.0 68000.0
31942 31943 4 21 2001 1 DO F 37.0 68.0 68000.0
35159 35160 11 10 2002 13 DO F 36.0 65.0 65000.0
30289 30290 2 5 2000 1 DO F 36.0 65.0 65000.0
32957 32958 10 14 2001 9 DM F 33.0 22.0 22000.0
32650 32651 9 22 2001 17 DM F 33.0 21.0 21000.0
32782 32783 9 23 2001 14 DM M 34.0 21.0 21000.0
34790 34791 10 5 2002 17 DM M 34.0 20.0 20000.0
30614 30615 4 30 2000 12 DM M 34.0 18.0 18000.0

1229 rows × 10 columns

We can sort on multiple fields at once by passing a list of column names. We can control how each column sorts by passing a list with the same number of values (that is, one value per column) to the ascending keyword. The cell below sorts the results first by species_id (largest to smallest), then by weight (smallest to largest):

PYTHON

results.sort_values(["species_id", "weight_g"], ascending=[False, True])
record_id month day year plot_id species_id sex hindfoot_length_mm weight_g weight_mg
30687 30688 5 1 2000 8 DO M 35.0 23.0 23000.0
30768 30769 6 3 2000 17 DO M 35.0 24.0 24000.0
33339 33340 1 12 2002 12 DO M 34.0 24.0 24000.0
34082 34083 5 16 2002 6 DO F 33.0 24.0 24000.0
34096 34097 5 16 2002 6 DO M 35.0 24.0 24000.0
31858 31859 3 24 2001 2 DM F 36.0 60.0 60000.0
32909 32910 10 14 2001 4 DM F 37.0 60.0 60000.0
34604 34605 7 14 2002 9 DM F 36.0 60.0 60000.0
34675 34676 9 8 2002 12 DM F 36.0 60.0 60000.0
34126 34127 5 16 2002 9 DM F 35.0 64.0 64000.0

1229 rows × 10 columns

As with the dataframe methods above, sort_values() returns a copy of the original dataframe and leaves the original untouched.

Challenge

Write a query that returns year, species_id, and weight in kg from the surveys_nona table, sorted with the largest weights at the top.

PYTHON

# Create a new column with weight in kg
surveys_nona["weight_kg"] = surveys_nona["weight_g"] / 1000

# Create a subset containing only year, species_id, and weight_kg
subset = surveys_nona[["year", "species_id", "weight_kg"]]

# Sort the subset by weight_kg
subset.sort_values("weight_kg", ascending=False)
year species_id weight_kg
33048 2001 NL 0.280
12870 1987 NL 0.278
15458 1989 NL 0.275
2132 1979 NL 0.274
12728 1987 NL 0.270
29905 1999 PP 0.004
9908 1985 RM 0.004
10438 1985 RM 0.004
8735 1983 RM 0.004
7083 1982 PF 0.004

30676 rows × 3 columns

Modifying data


We’ve already shown how to modify an existing dataframe by adding a new column. What if we want to modify existing cells instead? As we’ve seen, this can be a little tricky in pandas because most of its methods return a copy of the original dataframe. For example, we can get subsets of a dataframe using square brackets. The cell below returns the species_id column for rows 2 through 5:

PYTHON

surveys[2:6]["species_id"]

OUTPUT

2    DM
3    DM
4    DM
5    PF
Name: species_id, dtype: object

But trying to set new values using this syntax may not work as expected.

Say we want to set the species_id column to a new value, “FD”. Try running the code in the cell below:

PYTHON

surveys[2:6]["species_id"] = "FD"

OUTPUT

C:\...\1234567890.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  surveys[2:6]["species_id"] = "FD"

You should have received a SettingWithCopyWarning after running that cell. This warning tells us that the data in the original dataframe has not been modified. This is because the square brackets returned a copy, meaning that any changes will be reflected in the copy, not the original. We can verify that the original dataframe has not been changed by displaying the rows that would have been modified:

PYTHON

surveys[2:6]
record_id month day year plot_id species_id sex hindfoot_length weight
2 3 7 16 1977 2 DM F 37.0 NaN
3 4 7 16 1977 7 DM M 36.0 NaN
4 5 7 16 1977 3 DM M 35.0 NaN
5 6 7 16 1977 1 PF M 14.0 NaN

Using loc to modify existing cells

One way to modify existing data in pandas is to use the loc attribute. This attribute allows you to extract and modify cells in a DataFrame using the following syntax: df.loc[row_indexer, col_indexer].

The row_indexer argument is used to select one or more rows. It can be:

  • A row label (i.e., the bold column on the far left):
    • 0 returns the row with label 0
  • A slice including multiple rows:
    • : returns all rows
    • :2 returns all rows from the beginning of the dataframe to the row labeled 2, inclusive
    • 2: returns all rows from the row labeled 2 to the end of the dataframe, inclusive
    • 2:5 returns all rows between those labeled 2 and 5, inclusive
  • A conditional, as in the examples above.

The col_indexer argument is used to select one or more columns. It will typically be a list of column names and can be omitted, in which case all columns will be returned.

Row labels and indexes

The row labels in this dataframe happen to be numeric and aligned exactly to the row’s index (so for the first row both the index and the label are 0, for the second row both the index and the label are 1, etc.) This is often but not always true in pandas. For example, if we used record_id as the label, the row labels would be one-based and the row indexes would be zero-based.

loc slicing behavior

Slices using loc are inclusive–rows matching both the start and end values are included in the returned slice. This differs from list slices, where the start but not end value is included in the slice. loc works this way because it is looking at the row label, not the row index.
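The difference is visible on a small Series whose labels happen to match its positions (as they do in the surveys dataframe; the values are made up):

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "d"])

# Positional slice: the end index is excluded
print(s[1:3].tolist())      # ['b', 'c']

# Label-based slice with loc: the end label is included
print(s.loc[1:3].tolist())  # ['b', 'c', 'd']
```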

We’ll be making some changes to our data, so let’s work from a copy instead of modifying the original. Create a copy using the copy() method:

PYTHON

surveys_copy = surveys.copy()

To select a subset of rows and columns using loc, use:

PYTHON

surveys_copy.loc[2:5, "species_id"]

OUTPUT

2    DM
3    DM
4    DM
5    PF
Name: species_id, dtype: object

Unlike the methods earlier in the lesson, this is a view, not a copy, of the data in the surveys_copy dataframe. That means that the object returned by loc is live and can be used to change the original dataframe. We can now assign a new value to the species_id column in the matching rows of the original dataframe:

PYTHON

surveys_copy.loc[2:5, "species_id"] = "FD"

We can see that these changes are reflected in the original surveys_copy object:

PYTHON

surveys_copy.loc[1:6, "species_id"]

OUTPUT

1    NL
2    FD
3    FD
4    FD
5    FD
6    PE
Name: species_id, dtype: object

Slicing with iloc

pandas provides another indexer, iloc, that allows us to select and modify data using row and column indexes instead of labels. Learn more in the pandas documentation.
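A brief sketch of iloc on a toy dataframe (the column names and values are invented): unlike loc, both indexers here are positions, not labels.

```python
import pandas as pd

df = pd.DataFrame({"plot_id": [2, 3, 2], "weight": [40.0, 48.0, 29.0]})

# Row 0, column 1 ("weight"), selected purely by position
print(df.iloc[0, 1])

# Like loc, iloc can also be used for assignment
df.iloc[0, 1] = 41.0
print(df.loc[0, "weight"])
```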

Key Points

  • Use square brackets to access rows, columns, and specific cells
  • Use operators like +, -, and / to perform arithmetic on rows and columns
  • Store the results of calculations in a dataframe by adding a new column or overwriting an existing column
  • Sort data, rename columns, and get unique values in a dataframe using methods provided by pandas
  • By default, most dataframe operations return a copy of the original data

Content from Aggregating and Grouping Data


Last updated on 2022-07-18 | Edit this page

Estimated time: 60 minutes

Overview

Questions

  • How do we calculate summary statistics?
  • How do we group data?
  • How do null values affect calculations?

Objectives

  • Introduce aggregation calculations in pandas
  • Introduce grouping in pandas
  • Learn about how pandas handles null values

As always, we’ll begin by importing pandas and reading our CSV:

PYTHON

import pandas as pd

surveys = pd.read_csv("data/surveys.csv")

Aggregating data


Aggregation allows us to describe the data in our dataframe by calculating totals (like the number of records) and statistics (like the mean value of a column). pandas allows us to run these calculations on dataframes or subsets.

Suppose we need to know how many records are in our dataset. We’ve already seen that we can use the info() method to get high-level information about the dataset, including the number of entries. What if we just wanted the number of rows? One approach is to use the built-in function len(), which is used to calculate the number of items in an object (for example, the number of characters in a string or the number of items in a list). When used on a dataframe, len() returns the number of rows:

PYTHON

len(surveys)

OUTPUT

35549

pandas provides a suite of aggregation methods that go beyond this simple case. For example, we can count the number of non-NaN values in each column using count():

PYTHON

surveys.count()

OUTPUT

record_id          35549
month              35549
day                35549
year               35549
plot_id            35549
species_id         34786
sex                33038
hindfoot_length    31438
weight             32283
dtype: int64

Or we can find out the total weight of all individuals in grams using sum():

PYTHON

surveys["weight"].sum()

OUTPUT

1377594.0

Other aggregation methods supported by pandas include min(), max(), and mean(). These methods all ignore NaNs, so missing data does not affect these calculations.
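We can verify the NaN-skipping behavior on a small Series containing a missing value (the numbers are made up; None is stored as NaN):

```python
import pandas as pd

s = pd.Series([10.0, None, 30.0])  # None is stored as NaN

# The mean is computed over the two non-null values only
print(s.mean())   # 20.0

# count() likewise ignores the missing value
print(s.count())  # 2
```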

Challenge

Calculate the total weight, average weight, minimum and maximum weights for all animals caught over the duration of the survey. Can you modify your code so that it outputs these values only for weights between 5 and 10 grams?

PYTHON

# Create a subset of only the animals between 5 and 10 grams
weights = surveys[(surveys["weight"] > 5) & (surveys["weight"] < 10)]["weight"]

# Display aggregation calculations using a dict
{
    "sum": weights.sum(),
    "mean": weights.mean(),
    "min": weights.min(),
    "max": weights.max(),
}

OUTPUT

{'sum': 16994.0, 'mean': 7.91523055426176, 'min': 6.0, 'max': 9.0}
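As a side note, the strict < and > comparisons above exclude weights of exactly 5 and 10 grams. pandas also provides Series.between() for range filters; its inclusive parameter controls whether the endpoints count as matches (passing it a string requires pandas 1.3 or later, which this sketch assumes). A minimal example with made-up values:

```python
import pandas as pd

s = pd.Series([4, 5, 6, 9, 10, 11])

# inclusive="neither" matches the strict comparisons used above,
# keeping only values strictly between 5 and 10
s[s.between(5, 10, inclusive="neither")].tolist()  # [6, 9]
```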

To quickly generate summary statistics, we can use the describe() method instead of calling each aggregation method separately. When we use this method on a dataframe, it calculates stats for all columns with numeric data:

PYTHON

surveys.describe()

OUTPUT

          record_id         month           day          year       plot_id  hindfoot_length        weight
count  35549.000000  35549.000000  35549.000000  35549.000000  35549.000000     31438.000000  32283.000000
mean   17775.000000      6.477847     15.991195   1990.475231     11.397001        29.287932     42.672428
std    10262.256696      3.396925      8.257366      7.493355      6.799406         9.564759     36.631259
min        1.000000      1.000000      1.000000   1977.000000      1.000000         2.000000      4.000000
25%     8888.000000      4.000000      9.000000   1984.000000      5.000000        21.000000     20.000000
50%    17775.000000      6.000000     16.000000   1990.000000     11.000000        32.000000     37.000000
75%    26662.000000     10.000000     23.000000   1997.000000     17.000000        36.000000     48.000000
max    35549.000000     12.000000     31.000000   2002.000000     24.000000        70.000000    280.000000

You can see that describe() isn’t picky: it includes both ID and date columns in its results. Notice also that the counts differ between columns. This is because count() only counts non-NaN rows.

If desired, we can also describe a single column at a time:

PYTHON

surveys["weight"].describe()

OUTPUT

count    32283.000000
mean        42.672428
std         36.631259
min          4.000000
25%         20.000000
50%         37.000000
75%         48.000000
max        280.000000
Name: weight, dtype: float64

If we need more control over the output (for example, if we want to calculate the total weight of all animals, as in the challenge above), pandas provides the agg() method, which allows us to specify methods by column. The argument passed to this method is a dict. Each key must be a column name and each value a list of the names of aggregation methods. To calculate the total weight, mean weight, and mean hindfoot length of all records in the survey, we can use:

PYTHON

surveys.agg({"weight": ["sum", "mean"], "hindfoot_length": ["mean"]})

OUTPUT

            weight  hindfoot_length
sum   1.377594e+06              NaN
mean  4.267243e+01        29.287932
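The same pattern works on a single column: passing a list of method names to agg() on a Series returns one value per method. A minimal sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({"weight": [10.0, 20.0, 30.0]})

# agg() with a list of method names returns a Series indexed by method
result = df["weight"].agg(["sum", "mean"])
# sum     60.0
# mean    20.0
```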

Grouping data


Now, let’s find out how many individuals were counted for each species. We do this using groupby(), which creates an object similar to a dataframe where rows are grouped by the data in one or more columns. To group by species_id, use:

PYTHON

grouped = surveys.groupby("species_id")

When we aggregate grouped data, pandas performs a separate calculation for each member of the group. In the example below, we’ll calculate the number of times each species appears in the dataset. Rather than outputting the full dataframe, we’ll limit the count to a single column. Because count() ignores NaN cells, it’s good practice to use a column that does not contain nulls. Record ID fields are a good choice because they make it clear that we are counting rows. The fields used to group the data also work.

PYTHON

grouped["species_id"].count()

OUTPUT

species_id
AB      303
AH      437
AS        2
BA       46
CB       50
CM       13
CQ       16
CS        1
CT        1
CU        1
CV        1
DM    10596
DO     3027
DS     2504
DX       40
NL     1252
OL     1006
OT     2249
OX       12
PB     2891
PC       39
PE     1299
PF     1597
PG        8
PH       32
PI        9
PL       36
PM      899
PP     3123
PU        5
PX        6
RF       75
RM     2609
RO        8
RX        2
SA       75
SC        1
SF       43
SH      147
SO       43
SS      248
ST        1
SU        5
UL        4
UP        8
UR       10
US        4
ZL        2
Name: species_id, dtype: int64
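Any aggregation method can be applied per group, not just count(). The following sketch, using a small made-up dataframe rather than the survey data, shows the mean computed separately for each group:

```python
import pandas as pd

df = pd.DataFrame({"species_id": ["AB", "AB", "DM"],
                   "weight": [10.0, 20.0, 40.0]})

# mean() is calculated separately for each species
means = df.groupby("species_id")["weight"].mean()
# AB    15.0
# DM    40.0
```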

To group by multiple columns, we can pass a list to groupby():

PYTHON

surveys.groupby(["species_id", "year"])["record_id"].count()

OUTPUT

species_id  year
AB          1980     5
            1981     7
            1982    34
            1983    41
            1984    12
                    ..
UR          1994     1
            2001     2
            2002     1
US          2002     4
ZL          1988     2
Name: record_id, Length: 509, dtype: int64
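A result grouped by two columns comes back as a Series with a two-level index, which can be hard to scan. One option is to reshape it with unstack(), which pivots the inner index level into columns. A minimal sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({"species_id": ["AB", "AB", "AB", "DM"],
                   "year": [1980, 1980, 1981, 1980],
                   "record_id": [1, 2, 3, 4]})

counts = df.groupby(["species_id", "year"])["record_id"].count()

# unstack() turns the year level into columns, giving one row per
# species; combinations with no records become NaN
wide = counts.unstack()
```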

Challenge

Write statements that return:

  1. How many individuals were counted in each year in total
  2. How many were counted each year, for each different species
  3. The average weights of each species in each year
  4. How many individuals were counted for each species that was observed more than 10 times

Can you get the answer to both 2 and 3 in a single query?

How many individuals were counted in each year in total?

PYTHON

# Individual counts per year
surveys.groupby("year")["record_id"].count()

OUTPUT

year
1977     503
1978    1048
1979     719
1980    1415
1981    1472
1982    1978
1983    1673
1984     981
1985    1438
1986     942
1987    1671
1988    1469
1989    1569
1990    1311
1991    1347
1992    1038
1993     750
1994     668
1995    1222
1996    1706
1997    2493
1998    1610
1999    1135
2000    1552
2001    1610
2002    2229
Name: record_id, dtype: int64

How many individuals were counted each year, for each different species?

PYTHON

# Individual counts by species and year
result = surveys.groupby(["year", "species_id"])["record_id"].count()

# Use the option_context context manager to show all rows
with pd.option_context("display.max_rows", None):
    print(result)

OUTPUT

year  species_id
1977  DM            264
      DO             12
      DS             98
      NL             31
      OL             10
      OT             17
      OX              7
      PE              6
      PF             31
      PP              7
      RM              2
      SH              1
      SS              1
1978  AH              1
      DM            389
      DO             22
      DS            320
      NL             48
      OL             47
      OT             66
      OX              1
      PE             17
      PF             40
      PM              3
      PP             24
      RM              3
      SA              2
      SH              1
      SS              8
1979  DM            209
      DO             31
      DS            204
      NL             30
      OL             46
      OT             68
      PE             17
      PF             19
      PM              6
      PP             20
      RM              6
      SS              2
1980  AB              5
      CB              1
      CM             12
      CQ              2
      DM            493
      DO             89
      DS            346
      DX              1
      NL             57
      OL             85
      OT             96
      PE             20
      PF             95
      PM              1
      PP             17
      RM             46
      SS              9
1981  AB              7
      CB              1
      CQ              4
      DM            559
      DO            123
      DS            354
      NL             63
      OL             55
      OT             65
      PE             43
      PF             59
      PM             16
      PP             20
      RM             29
      SA             11
      SS              8
1982  AB             34
      AH              1
      CQ              8
      DM            609
      DO            117
      DS            354
      NL            111
      OL             83
      OT             94
      OX              1
      PC              2
      PE            107
      PF            152
      PM             72
      PP             32
      RF              2
      RM            154
      SA             17
      SH              2
      SS             12
1983  AB             41
      AH              2
      CB              3
      DM            528
      DO            167
      DS            280
      DX              1
      NL             98
      OL             65
      OT             86
      PC              3
      PE             38
      PF            114
      PG              1
      PM             42
      PP             32
      RM            127
      SA             10
      SH              2
      SS             12
1984  AB             12
      AH              2
      CB              1
      CQ              2
      DM            396
      DO            108
      DS             76
      NL             64
      OL             71
      OT             49
      OX              1
      PE             20
      PF              7
      PH              7
      PM             12
      PP             26
      PU              2
      RM             60
      SA             20
      SS             15
1985  AB             14
      AH             14
      CB              1
      DM            667
      DO            225
      DS             98
      NL             45
      OL             94
      OT             62
      PE             51
      PF              1
      PH              3
      PM             17
      PP             35
      RM             72
      SA              3
      SS             14
1986  AB              5
      AH             19
      CB              4
      DM            406
      DO            115
      DS             88
      DX              1
      NL             60
      OL             43
      OT             25
      PE             26
      PF              2
      PH              1
      PM             34
      PP             26
      RM             53
      SA              1
      SH              1
      SS             12
1987  AB             35
      AH             41
      CB              5
      CM              1
      CV              1
      DM            469
      DO            213
      DS            104
      NL            128
      OL             79
      OT             43
      PC              5
      PE            142
      PF              4
      PM            133
      PP             45
      RM            189
      SA              1
      SH              5
      SS             20
      UP              2
      UR              3
1988  AB             39
      AH             26
      CB              6
      DM            365
      DO            144
      DS             54
      DX              1
      NL            102
      OL             51
      OT             86
      PE            190
      PF              2
      PG              2
      PH              1
      PM             54
      PP             53
      RF             11
      RM            211
      SA              1
      SH             31
      SS             20
      UL              1
      UP              3
      UR              1
      ZL              2
1989  AB             31
      AH             30
      BA              3
      CB              1
      CS              1
      DM            321
      DO            119
      DS             33
      NL             67
      OL             57
      OT            102
      OX              1
      PC              1
      PE            177
      PF             13
      PH              2
      PM             27
      PP             45
      RF             49
      RM            398
      SA              3
      SC              1
      SF              5
      SH             55
      SS             18
      UP              2
1990  AB             27
      AH             22
      BA             11
      DM            462
      DO            171
      DS             17
      NL             29
      OL             25
      OT             77
      PC              8
      PE             75
      PF             51
      PG              4
      PH              7
      PP             34
      RF             12
      RM            231
      SF             18
      SH              9
      SS              5
      UP              1
1991  AB             15
      AH             21
      AS              1
      BA             26
      CB              2
      DM            404
      DO            122
      DS             11
      DX              4
      NL             30
      OL             36
      OT            107
      PC              6
      PE             65
      PF             62
      PP             88
      RM            307
      RO              1
      SA              2
      SF              7
      SH              1
      SO             11
      SS              2
      UR              1
1992  AB             10
      AH             16
      BA              6
      CB              7
      DM            307
      DO             94
      DS             18
      DX             12
      NL             15
      OL             45
      OT             42
      PC              9
      PE             41
      PF             57
      PH              7
      PI              1
      PP            131
      RM            158
      SF              3
      SO             20
      SS              7
      UL              1
      UR              1
1993  AB              9
      AH             15
      AS              1
      CB              5
      CU              1
      DM            253
      DO             29
      DS             18
      DX              3
      NL             24
      OL             35
      OT             22
      OX              1
      PC              1
      PE             13
      PF             70
      PP            100
      RM            103
      SF              3
      SO              6
      SS              1
      ST              1
1994  AB              7
      AH             19
      CB              1
      DM            293
      DO             25
      DS              9
      DX              2
      NL             10
      OL             21
      OT             21
      PC              1
      PE              3
      PF             73
      PP             74
      PU              2
      RM             40
      SO              3
      SS              7
      SU              1
      UL              2
      UR              1
1995  AB              4
      AH             36
      CB              2
      DM            436
      DO             58
      DX              3
      NL              8
      OL             29
      OT             38
      PB              7
      PC              1
      PE             14
      PF            158
      PH              3
      PI              1
      PL              3
      PM              8
      PP            277
      PU              1
      RM             81
      RX              1
      SS              7
      SU              4
1996  AH             25
      DM            492
      DO            174
      DX              1
      NL              7
      OL             13
      OT            108
      PB             39
      PC              1
      PE             36
      PF            330
      PL              6
      PM             50
      PP            298
      PX              2
      RM             90
      SA              1
      SS             16
1997  AH             37
      CB              3
      DM            576
      DO            253
      DS              6
      NL             48
      OL              8
      OT            258
      PB            259
      PE             57
      PF            186
      PH              1
      PL             19
      PM            271
      PP            325
      PX              2
      RF              1
      RM            158
      RX              1
      SA              1
      SH              7
      SO              3
      SS             12
1998  AB              2
      AH             33
      CB              4
      CT              1
      DM            503
      DO            111
      DS              9
      DX              2
      NL             32
      OT            164
      PB            329
      PC              1
      PE             24
      PF             26
      PL              7
      PM            103
      PP            208
      PX              1
      RM             13
      SA              2
      SS             15
1999  AH             14
      DM            348
      DO             84
      DS              7
      DX              3
      NL             20
      OT            105
      PB            272
      PE              7
      PM             44
      PP            167
      RM             28
      SH              2
      SS              7
2000  AB              3
      AH             12
      DM            233
      DO             91
      DX              4
      NL             29
      OT            154
      PB            555
      PE             15
      PG              1
      PL              1
      PM              2
      PP            381
      PX              1
      RM             15
      SH              8
      SS              4
2001  AB              1
      AH             23
      CB              3
      DM            305
      DO             81
      DX              1
      NL             48
      OT            167
      PB            538
      PE             35
      PF             27
      PI              2
      PM              3
      PP            273
      RM             15
      SH             11
      SS              5
      UR              2
2002  AB              2
      AH             28
      DM            309
      DO            249
      DX              1
      NL             48
      OL              8
      OT            127
      PB            892
      PE             60
      PF             18
      PI              5
      PM              1
      PP            385
      RM             20
      RO              7
      SF              7
      SH             11
      SS              9
      UR              1
      US              4
Name: record_id, dtype: int64

What was the average weight of each species in each year?

PYTHON

# Mean weight by species and year
result = surveys.groupby(["year", "species_id"])["weight"].mean()

# Use the option_context context manager to show all rows
with pd.option_context("display.max_rows", None):
    print(result)

OUTPUT

year  species_id
1977  DM             41.141304
      DO             42.666667
      DS            121.437500
      NL                   NaN
      OL             21.666667
      OT             21.000000
      OX             21.333333
      PE             19.500000
      PF              7.173913
      PP             15.333333
      RM             10.000000
      SH                   NaN
      SS                   NaN
1978  AH                   NaN
      DM             40.795455
      DO             45.000000
      DS            115.198653
      NL            184.656250
      OL             31.027027
      OT             23.220000
      OX                   NaN
      PE             20.428571
      PF              7.088235
      PM             20.666667
      PP             14.869565
      RM              7.500000
      SA                   NaN
      SH             89.000000
      SS            130.000000
1979  DM             43.507317
      DO             45.862069
      DS            111.796954
      NL            138.000000
      OL             33.717391
      OT             23.075758
      PE             20.529412
      PF              7.529412
      PM             23.666667
      PP             15.105263
      RM              8.333333
      SS                   NaN
1980  AB                   NaN
      CB                   NaN
      CM                   NaN
      CQ                   NaN
      DM             44.136082
      DO             48.058824
      DS            120.725664
      DX                   NaN
      NL            158.982143
      OL             33.107143
      OT             24.083333
      PE             22.450000
      PF              7.455556
      PM             21.000000
      PP             14.176471
      RM             10.227273
      SS                   NaN
1981  AB                   NaN
      CB                   NaN
      CQ                   NaN
      DM             44.032907
      DO             49.111111
      DS            125.818444
      NL            165.903226
      OL             33.296296
      OT             24.415385
      PE             20.558140
      PF              7.152542
      PM             19.933333
      PP             13.950000
      RM             11.178571
      SA                   NaN
      SS             57.000000
1982  AB                   NaN
      AH                   NaN
      CQ                   NaN
      DM             41.975042
      DO             47.919643
      DS            115.647929
      NL            160.613208
      OL             35.590361
      OT             24.925532
      OX             24.000000
      PC                   NaN
      PE             21.173077
      PF              6.918919
      PM             21.391304
      PP             16.125000
      RF             11.500000
      RM             10.436242
      SA                   NaN
      SH             79.000000
      SS                   NaN
1983  AB                   NaN
      AH                   NaN
      CB                   NaN
      DM             40.944551
      DO             47.150602
      DS            122.033088
      DX                   NaN
      NL            156.691489
      OL             34.584615
      OT             24.697674
      PC                   NaN
      PE             20.864865
      PF              6.833333
      PG                   NaN
      PM             22.023810
      PP             15.468750
      RM              9.872000
      SA                   NaN
      SH                   NaN
      SS                   NaN
1984  AB                   NaN
      AH                   NaN
      CB                   NaN
      CQ                   NaN
      DM             40.765306
      DO             48.415094
      DS            124.082192
      NL            150.095238
      OL             32.550725
      OT             22.416667
      OX             18.000000
      PE             20.210526
      PF              7.428571
      PH             28.000000
      PM             19.545455
      PP             15.307692
      PU                   NaN
      RM             11.152542
      SA                   NaN
      SS                   NaN
1985  AB                   NaN
      AH                   NaN
      CB                   NaN
      DM             41.507645
      DO             47.956731
      DS            124.326316
      NL            148.829268
      OL             32.108696
      OT             22.064516
      PE             20.360000
      PF                   NaN
      PH             32.666667
      PM             19.000000
      PP             15.764706
      RM              8.371429
      SA                   NaN
      SS                   NaN
1986  AB                   NaN
      AH                   NaN
      CB                   NaN
      DM             43.361596
      DO             49.372727
      DS            128.071429
      DX                   NaN
      NL            159.724138
      OL             30.880952
      OT             21.500000
      PE             22.520000
      PF              8.000000
      PH             39.000000
      PM             21.696970
      PP             16.750000
      RM             10.750000
      SA                   NaN
      SH             55.000000
      SS                   NaN
1987  AB                   NaN
      AH                   NaN
      CB                   NaN
      CM                   NaN
      CV                   NaN
      DM             43.232609
      DO             50.087379
      DS            126.383838
      NL            158.840000
      OL             30.779221
      OT             23.069767
      PC                   NaN
      PE             21.625899
      PF              7.500000
      PM             22.297710
      PP             17.840909
      RM             10.657609
      SA                   NaN
      SH             73.800000
      SS                   NaN
      UP                   NaN
      UR                   NaN
1988  AB                   NaN
      AH                   NaN
      CB                   NaN
      DM             43.397790
      DO             51.463768
      DS            129.490566
      DX                   NaN
      NL            163.104167
      OL             30.306122
      OT             24.070588
      PE             22.625668
      PF              8.000000
      PG                   NaN
      PH             33.000000
      PM             21.759259
      PP             18.280000
      RF             13.818182
      RM             10.322115
      SA                   NaN
      SH             72.806452
      SS                   NaN
      UL                   NaN
      UP                   NaN
      UR                   NaN
      ZL                   NaN
1989  AB                   NaN
      AH                   NaN
      BA              7.000000
      CB                   NaN
      CS                   NaN
      DM             44.349206
      DO             51.025641
      DS            121.896552
      NL            151.278689
      OL             31.947368
      OT             24.663366
      OX             20.000000
      PC                   NaN
      PE             21.935673
      PF              7.230769
      PH             30.500000
      PM             20.222222
      PP             17.409091
      RF             13.346939
      RM             10.411311
      SA                   NaN
      SC                   NaN
      SF             54.800000
      SH             76.345455
      SS                   NaN
      UP                   NaN
1990  AB                   NaN
      AH                   NaN
      BA              8.000000
      DM             41.769912
      DO             48.512048
      DS            121.187500
      NL            154.275862
      OL             31.200000
      OT             23.675325
      PC                   NaN
      PE             21.671233
      PF              7.117647
      PG                   NaN
      PH             31.142857
      PP             16.029412
      RF             12.916667
      RM             10.305677
      SF             52.611111
      SH             76.888889
      SS                   NaN
      UP                   NaN
1991  AB                   NaN
      AH                   NaN
      AS                   NaN
      BA              9.240000
      CB                   NaN
      DM             43.148338
      DO             49.695652
      DS            113.000000
      DX                   NaN
      NL            148.785714
      OL             28.171429
      OT             24.588235
      PC                   NaN
      PE             21.435484
      PF              7.827586
      PP             17.904762
      RM             10.498305
      RO             11.000000
      SA                   NaN
      SF             93.166667
      SH             63.000000
      SO             53.909091
      SS                   NaN
      UR                   NaN
1992  AB                   NaN
      AH                   NaN
      BA              7.833333
      CB                   NaN
      DM             43.877966
      DO             48.367816
      DS            112.352941
      DX                   NaN
      NL            139.000000
      OL             28.454545
      OT             24.928571
      PC                   NaN
      PE             22.750000
      PF              8.767857
      PH             31.142857
      PI             18.000000
      PP             17.479675
      RM             10.904762
      SF             43.000000
      SO             55.263158
      SS                   NaN
      UL                   NaN
      UR                   NaN
1993  AB                   NaN
      AH                   NaN
      AS                   NaN
      CB                   NaN
      CU                   NaN
      DM             43.373984
      DO             48.461538
      DS            105.722222
      DX                   NaN
      NL            127.391304
      OL             27.545455
      OT             22.363636
      OX                   NaN
      PC                   NaN
      PE             19.230769
      PF              8.238806
      PP             17.954023
      RM             10.747475
      SF             44.000000
      SO             55.600000
      SS                   NaN
      ST                   NaN
1994  AB                   NaN
      AH                   NaN
      CB                   NaN
      DM             42.373288
      DO             47.750000
      DS            106.625000
      DX                   NaN
      NL            186.875000
      OL             21.900000
      OT             25.285714
      PC                   NaN
      PE             18.000000
      PF              7.855072
      PP             17.585714
      PU                   NaN
      RM             10.675000
      SO             62.333333
      SS                   NaN
      SU                   NaN
      UL                   NaN
      UR                   NaN
1995  AB                   NaN
      AH                   NaN
      CB                   NaN
      DM             44.806147
      DO             49.592593
      DX                   NaN
      NL            155.833333
      OL             27.296296
      OT             24.868421
      PB             34.000000
      PC                   NaN
      PE             21.230769
      PF              8.780645
      PH             35.333333
      PI                   NaN
      PL             22.333333
      PM             27.375000
      PP             16.844444
      PU                   NaN
      RM             12.653846
      RX             20.000000
      SS                   NaN
      SU                   NaN
1996  AH                   NaN
      DM             44.506173
      DO             47.234940
      DX                   NaN
      NL            148.000000
      OL             27.769231
      OT             23.453704
      PB             32.578947
      PC                   NaN
      PE             22.027778
      PF              8.393846
      PL             21.166667
      PM             20.224490
      PP             18.095563
      PX                   NaN
      RM             11.455556
      SA                   NaN
      SS                   NaN
1997  AH                   NaN
      CB                   NaN
      DM             44.708551
      DO             48.177866
      DS            111.000000
      NL            150.688889
      OL             33.625000
      OT             24.785156
      PB             31.085603
      PE             20.929825
      PF              8.448649
      PH             22.000000
      PL             18.789474
      PM             21.126394
      PP             18.175000
      PX             19.000000
      RF             20.000000
      RM             11.230769
      RX             11.000000
      SA                   NaN
      SH             50.857143
      SO             54.666667
      SS                   NaN
1998  AB                   NaN
      AH                   NaN
      CB                   NaN
      CT                   NaN
      DM             43.131403
      DO             49.731183
      DS            116.000000
      DX                   NaN
      NL            159.466667
      OT             24.675676
      PB             30.082237
      PC                   NaN
      PE             20.304348
      PF              8.720000
      PL             16.714286
      PM             20.591398
      PP             16.266990
      PX                   NaN
      RM             13.100000
      SA                   NaN
      SS                   NaN
1999  AH                   NaN
      DM             43.945402
      DO             48.012048
      DS            120.714286
      DX                   NaN
      NL            182.200000
      OT             25.723810
      PB             31.710037
      PE             25.428571
      PM             22.523810
      PP             16.521212
      RM             10.555556
      SH             54.000000
      SS                   NaN
2000  AB                   NaN
      AH                   NaN
      DM             43.126638
      DO             49.224719
      DX                   NaN
      NL            179.307692
      OT             25.303448
      PB             30.878899
      PE             21.615385
      PG                   NaN
      PL             21.000000
      PM             20.500000
      PP             16.788618
      PX                   NaN
      RM             11.400000
      SH             73.375000
      SS                   NaN
2001  AB                   NaN
      AH                   NaN
      CB                   NaN
      DM             45.442177
      DO             52.233766
      DX                   NaN
      NL            167.851064
      OT             23.707792
      PB             32.798851
      PE             20.400000
      PF              8.538462
      PI             18.000000
      PM             26.666667
      PP             17.752896
      RM             11.333333
      SH             79.900000
      SS                   NaN
      UR                   NaN
2002  AB                   NaN
      AH                   NaN
      DM             46.168317
      DO             49.514403
      DX                   NaN
      NL            182.159091
      OL             25.428571
      OT             23.833333
      PB             32.359447
      PE             21.719298
      PF              7.388889
      PI             20.000000
      PM             19.000000
      PP             17.018617
      RM             10.000000
      RO             10.142857
      SF             62.166667
      SH             64.666667
      SS                   NaN
      UR                   NaN
      US                   NaN
Name: weight, dtype: float64

How many individuals were counted for each species that was observed more than 10 times?

PYTHON

# Counts by species that appear more than 10 times
species = surveys.groupby("species_id")["record_id"].count()
species[species > 10]

OUTPUT

species_id
AB      303
AH      437
BA       46
CB       50
CM       13
CQ       16
DM    10596
DO     3027
DS     2504
DX       40
NL     1252
OL     1006
OT     2249
OX       12
PB     2891
PC       39
PE     1299
PF     1597
PH       32
PL       36
PM      899
PP     3123
RF       75
RM     2609
SA       75
SF       43
SH      147
SO       43
SS      248
Name: record_id, dtype: int64

Calculate the number of individuals observed and the average weight for each species in each year.

Note that weight contains NaNs, so counts for record_id and weight differ.

PYTHON

# Counts and mean weight by species and year
surveys.groupby(["year", "species_id"]).agg(
    {"record_id": "count", "weight": ["count", "mean"]}
)
                record_id weight
                    count  count        mean
year species_id
1977 DM               264    184   41.141304
     DO                12     12   42.666667
     DS                98     32  121.437500
     NL                31      0         NaN
     OL                10      3   21.666667
...                   ...    ...         ...
2002 SF                 7      6   62.166667
     SH                11      9   64.666667
     SS                 9      0         NaN
     UR                 1      0         NaN
     US                 4      0         NaN

509 rows × 3 columns

Handling missing data


As we’ve discussed, some columns in the surveys dataframe have the value NaN instead of text or numbers. NaN, short for “not a number,” is a special type of float used by pandas to represent missing data. When reading from a CSV, as we have done throughout this lesson, pandas interprets certain values as NaN (see na_values in the pd.read_csv() documentation for the default list). NaNs are excluded from groups and most aggregation calculations in pandas, including counts.
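To see how pandas treats NaN, we can build a small Series by hand (the values here are made up for illustration):

```python
import numpy as np
import pandas as pd

# A small Series containing a missing value
s = pd.Series([1, 2, np.nan])

# NaN forces the Series to use a float data type
print(s.dtype)    # float64

# Aggregations like count() and mean() skip NaN by default
print(s.count())  # 2
print(s.mean())   # 1.5
```

Note that the mean is calculated over the two valid values only, not all three cells.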

It is crucial to understand how missing data is represented in a dataset. Failing to do so may introduce errors into our analysis. The ecology dataset used in this lesson uses empty cells to represent missing data, but other disciplines have different conventions. For example, some geographic datasets use -9999 to represent null values. Failure to convert such values to NaN will introduce significant errors into any calculations performed on that dataset.
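As a sketch of the problem, consider a hypothetical column that uses -9999 for null values. Leaving the sentinel in place badly skews any calculation, while converting it to NaN restores the intended result (passing na_values=[-9999] to pd.read_csv() achieves the same thing at load time):

```python
import numpy as np
import pandas as pd

# Hypothetical measurements using -9999 as a null sentinel
values = pd.Series([120, -9999, 340, -9999])

# The sentinel values drag the mean far below the real data
print(values.mean())                          # -4884.5

# Converting the sentinel to NaN excludes it from the calculation
print(values.replace(-9999, np.nan).mean())   # 230.0
```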

In some cases, it can be useful to fill in cells containing NaN with a non-null value. For example, the groupby() method excludes NaN cells. When looking at sex, the following code counts only those cells with either F (female) or M (male):

PYTHON

surveys.groupby("sex")["record_id"].count()

OUTPUT

sex
F    15690
M    17348
Name: record_id, dtype: int64

But not all records specify a sex. To include records where sex was not specified, we can use the fillna() method on the sex column. This method replaces each NaN cell with the value passed as the first argument to the function call. To replace all NaN values in sex with “U”, use:

PYTHON

surveys["sex"] = surveys["sex"].fillna("U")

The grouped calculation now accounts for all records in the dataframe:

PYTHON

surveys.groupby("sex")["record_id"].count()

OUTPUT

sex
F    15690
M    17348
U     2511
Name: record_id, dtype: int64

In other cases, we may want to ignore rows that contain NaNs. This can be done using dropna():

PYTHON

surveys.dropna()
       record_id  month  day  year  plot_id species_id sex  hindfoot_length  weight
62            63      8   19  1977        3         DM   M             35.0    40.0
63            64      8   19  1977        7         DM   M             37.0    48.0
64            65      8   19  1977        4         DM   F             34.0    29.0
65            66      8   19  1977        4         DM   F             35.0    46.0
66            67      8   19  1977        7         DM   M             35.0    36.0
...          ...    ...  ...   ...      ...        ...  ..              ...     ...
35540      35541     12   31  2002       15         PB   F             24.0    31.0
35541      35542     12   31  2002       15         PB   F             26.0    29.0
35542      35543     12   31  2002       15         PB   F             27.0    34.0
35546      35547     12   31  2002       10         RM   F             15.0    14.0
35547      35548     12   31  2002        7         DO   M             36.0    51.0

30738 rows × 9 columns

This method returns a copy of the dataframe containing only those rows that have valid data in every field.
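If dropping rows with a NaN anywhere is too aggressive, dropna() also accepts a subset keyword argument that limits the check to specific columns. A minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

# A small dataframe where different columns have missing values
df = pd.DataFrame({
    "species_id": ["DM", "DM", "NL"],
    "weight": [40.0, np.nan, 160.0],
    "hindfoot_length": [36.0, 35.0, np.nan],
})

# With no arguments, any row containing a NaN is dropped
print(len(df.dropna()))                   # 1

# With subset, only NaNs in the named columns count
print(len(df.dropna(subset=["weight"])))  # 2
```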

Key Points

  • Calculate individual summary statistics using dataframe methods like mean(), max(), and min()
  • Calculate multiple summary statistics at once using the dataframe methods describe() and agg()
  • Group data by one or more columns using the groupby() method
  • pandas uses NaN to represent missing data in a dataframe
  • Failing to consider how missing data is interpreted in a dataset can introduce errors into calculations

Content from Combining Dataframes


Last updated on 2022-07-18

Estimated time: 60 minutes

Overview

Questions

  • How do we combine data from multiple sources using pandas?
  • How do we add data to an existing dataframe?
  • How do we split and combine data columns?

Objectives

  • Use pd.merge() to add species info to the survey dataset
  • Use pd.concat() to add additional rows to the dataset
  • Use string methods to combine, split, and modify text columns using the str accessor

Dataframes can be used to organize and group data by common characteristics. Often, we need to combine elements from separate dataframes into one for analysis and visualization. A merge (or join) allows us to combine two dataframes using values common to each. Likewise, we may need to append data collected under different circumstances. In this chapter, we will show how to merge, concatenate, and split data using pandas.

Merging dataframes


The survey dataframe we’ve been using throughout this lesson has a column called species_id. We used this column in the previous lesson to calculate summary statistics about observations of each species. But the species_id is just a two-letter code—what does each code stand for? To find out, we’ll now load both the survey dataset and a second dataset containing more detailed information about the various species observed. Read the second dataframe from a file called species.csv:

PYTHON

import pandas as pd

surveys = pd.read_csv("data/surveys.csv")
species = pd.read_csv("data/species.csv")

species
species_id genus species taxa
0 AB Amphispiza bilineata Bird
1 AH Ammospermophilus harrisi Rodent
2 AS Ammodramus savannarum Bird
3 BA Baiomys taylori Rodent
4 CB Campylorhynchus brunneicapillus Bird
5 CM Calamospiza melanocorys Bird
6 CQ Callipepla squamata Bird
7 CS Crotalus scutalatus Reptile
8 CT Cnemidophorus tigris Reptile
9 CU Cnemidophorus uniparens Reptile
10 CV Crotalus viridis Reptile
11 DM Dipodomys merriami Rodent
12 DO Dipodomys ordii Rodent
13 DS Dipodomys spectabilis Rodent
14 DX Dipodomys sp. Rodent
15 EO Eumeces obsoletus Reptile
16 GS Gambelia silus Reptile
17 NL Neotoma albigula Rodent
18 NX Neotoma sp. Rodent
19 OL Onychomys leucogaster Rodent
20 OT Onychomys torridus Rodent
21 OX Onychomys sp. Rodent
22 PB Chaetodipus baileyi Rodent
23 PC Pipilo chlorurus Bird
24 PE Peromyscus eremicus Rodent
25 PF Perognathus flavus Rodent
26 PG Pooecetes gramineus Bird
27 PH Perognathus hispidus Rodent
28 PI Chaetodipus intermedius Rodent
29 PL Peromyscus leucopus Rodent
30 PM Peromyscus maniculatus Rodent
31 PP Chaetodipus penicillatus Rodent
32 PU Pipilo fuscus Bird
33 PX Chaetodipus sp. Rodent
34 RF Reithrodontomys fulvescens Rodent
35 RM Reithrodontomys megalotis Rodent
36 RO Reithrodontomys montanus Rodent
37 RX Reithrodontomys sp. Rodent
38 SA Sylvilagus audubonii Rabbit
39 SB Spizella breweri Bird
40 SC Sceloporus clarki Reptile
41 SF Sigmodon fulviventer Rodent
42 SH Sigmodon hispidus Rodent
43 SO Sigmodon ochrognathus Rodent
44 SS Spermophilus spilosoma Rodent
45 ST Spermophilus tereticaudus Rodent
46 SU Sceloporus undulatus Reptile
47 SX Sigmodon sp. Rodent
48 UL Lizard sp. Reptile
49 UP Pipilo sp. Bird
50 UR Rodent sp. Rodent
51 US Sparrow sp. Bird
52 ZL Zonotrichia leucophrys Bird
53 ZM Zenaida macroura Bird

We can see that the species dataframe includes a genus, species, and taxon for each species_id. This is much more useful than the two-letter code included in the original dataframe. How can we add that data to our surveys dataframe? Adding it by hand would be tedious and error-prone. Fortunately, pandas provides the pd.merge() function to join two dataframes.

Managing repetitive data

Why store species data in a separate table in the first place? Species information is repetitive: Every observation of the same species has the same genus, species, and taxa. Storing it in the original survey table would require including that data in every record, increasing the complexity of the table and creating the possibility of errors. Storing that data in a separate table means we only have to enter and validate it once. A tool like pandas then allows us to access that data when we need it.

To merge the surveys and species dataframes, use:

PYTHON

merged = pd.merge(surveys, species)
merged
record_id month day year plot_id species_id sex hindfoot_length weight genus species taxa
0 1 7 16 1977 2 NL M 32.0 NaN Neotoma albigula Rodent
1 2 7 16 1977 3 NL M 33.0 NaN Neotoma albigula Rodent
2 22 7 17 1977 15 NL F 31.0 NaN Neotoma albigula Rodent
3 38 7 17 1977 17 NL M 33.0 NaN Neotoma albigula Rodent
4 72 8 19 1977 2 NL M 31.0 NaN Neotoma albigula Rodent
34781 28988 12 23 1998 6 CT NaN NaN NaN Cnemidophorus tigris Reptile
34782 35512 12 31 2002 11 US NaN NaN NaN Sparrow sp. Bird
34783 35513 12 31 2002 11 US NaN NaN NaN Sparrow sp. Bird
34784 35528 12 31 2002 13 US NaN NaN NaN Sparrow sp. Bird
34785 35544 12 31 2002 15 US NaN NaN NaN Sparrow sp. Bird

34786 rows × 12 columns

Following the merge, the genus, species, and taxa columns have all been added to the survey dataframe. We can now use those columns to filter and summarize our data.

Joins

The pd.merge() function is equivalent to the JOIN operation in SQL.

Challenge

Filter the merged dataframe to show the genus, the species name, and the weight for every individual captured at the site

PYTHON

merged[["genus", "species", "weight"]]
genus species weight
0 Neotoma albigula NaN
1 Neotoma albigula NaN
2 Neotoma albigula NaN
3 Neotoma albigula NaN
4 Neotoma albigula NaN
34781 Cnemidophorus tigris NaN
34782 Sparrow sp. NaN
34783 Sparrow sp. NaN
34784 Sparrow sp. NaN
34785 Sparrow sp. NaN

34786 rows × 3 columns

In the example above, we didn’t provide any information about how we wanted to merge the dataframes together, so pandas used its default arguments to make an educated guess. It looked at the columns in each of the dataframes, then merged them based on the columns that appear in both. Here, the only shared name is species_id, so that’s the column pandas used to merge. For more complex tables, we may want to specify which columns are used for merging. We can do so by passing one or more column names using the on keyword argument:

PYTHON

pd.merge(surveys, species, on="species_id")
record_id month day year plot_id species_id sex hindfoot_length weight genus species taxa
0 1 7 16 1977 2 NL M 32.0 NaN Neotoma albigula Rodent
1 2 7 16 1977 3 NL M 33.0 NaN Neotoma albigula Rodent
2 22 7 17 1977 15 NL F 31.0 NaN Neotoma albigula Rodent
3 38 7 17 1977 17 NL M 33.0 NaN Neotoma albigula Rodent
4 72 8 19 1977 2 NL M 31.0 NaN Neotoma albigula Rodent
34781 28988 12 23 1998 6 CT NaN NaN NaN Cnemidophorus tigris Reptile
34782 35512 12 31 2002 11 US NaN NaN NaN Sparrow sp. Bird
34783 35513 12 31 2002 11 US NaN NaN NaN Sparrow sp. Bird
34784 35528 12 31 2002 13 US NaN NaN NaN Sparrow sp. Bird
34785 35544 12 31 2002 15 US NaN NaN NaN Sparrow sp. Bird

34786 rows × 12 columns

Challenge

Compare the number of rows in the original and merged survey dataframes. How do they differ? Why do you think that might be?

Hint: Use the pd.unique() function to look at the species_id column in each dataframe.

PYTHON

pd.unique(surveys["species_id"].sort_values())

OUTPUT

array(['AB', 'AH', 'AS', 'BA', 'CB', 'CM', 'CQ', 'CS', 'CT', 'CU', 'CV',
       'DM', 'DO', 'DS', 'DX', 'NL', 'OL', 'OT', 'OX', 'PB', 'PC', 'PE',
       'PF', 'PG', 'PH', 'PI', 'PL', 'PM', 'PP', 'PU', 'PX', 'RF', 'RM',
       'RO', 'RX', 'SA', 'SC', 'SF', 'SH', 'SO', 'SS', 'ST', 'SU', 'UL',
       'UP', 'UR', 'US', 'ZL', nan], dtype=object)

PYTHON

pd.unique(species["species_id"].sort_values())

OUTPUT

array(['AB', 'AH', 'AS', 'BA', 'CB', 'CM', 'CQ', 'CS', 'CT', 'CU', 'CV',
       'DM', 'DO', 'DS', 'DX', 'EO', 'GS', 'NL', 'NX', 'OL', 'OT', 'OX',
       'PB', 'PC', 'PE', 'PF', 'PG', 'PH', 'PI', 'PL', 'PM', 'PP', 'PU',
       'PX', 'RF', 'RM', 'RO', 'RX', 'SA', 'SB', 'SC', 'SF', 'SH', 'SO',
       'SS', 'ST', 'SU', 'SX', 'UL', 'UP', 'UR', 'US', 'ZL', 'ZM'],
      dtype=object)

Some records in the surveys dataframe do not specify a species. By default, only records with a value that occurs in both the surveys and species dataframes appear in the merged dataframe, so rows without a species_id are excluded.

The built-in set type can be used to quickly assess differences like this:

PYTHON

set(surveys["species_id"]) - set(species["species_id"])

OUTPUT

{nan}

In practice, the values in the columns used to join two dataframes may not align exactly. Above, the surveys dataframe contains a few hundred rows where species_id is NaN. These are the rows that were dropped when the dataframes were merged.

By default, pd.merge() performs an inner join. This means that a row will only appear if the value in the shared column appears in both of the datasets being merged. In this case, that means that survey observations that don’t have a species_id or have a species_id that does not appear in the species dataframe will be dropped.

This is not always desirable behavior. Fortunately, pandas supports additional types of merges:

  • Inner: Include all rows with common values in the join column. This is the default behavior.
  • Left: Include all rows from the left dataframe. Columns from the right dataframe are populated if a common value exists and set to NaN if not.
  • Right: Include all rows from the right dataframe. Columns from the left dataframe are populated if a common value exists and set to NaN if not.
  • Outer: Includes all rows from both dataframes
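The differences between these merge types are easiest to see with two small dataframes (the keys and values here are made up for illustration):

```python
import pandas as pd

left = pd.DataFrame({"key": ["A", "B", "C"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["B", "C", "D"], "y": [4, 5, 6]})

# Inner keeps only keys found in both dataframes (B and C)
print(len(pd.merge(left, right, how="inner")))  # 2

# Left keeps every row from left; A gets NaN for y, D is dropped
print(len(pd.merge(left, right, how="left")))   # 3

# Outer keeps every key from either dataframe
print(len(pd.merge(left, right, how="outer")))  # 4
```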

We want to keep all of our observations, so let’s do a left join instead. To specify the type of merge, we use the how keyword argument:

PYTHON

pd.merge(surveys, species, how="left")
record_id month day year plot_id species_id sex hindfoot_length weight genus species taxa
0 1 7 16 1977 2 NL M 32.0 NaN Neotoma albigula Rodent
1 2 7 16 1977 3 NL M 33.0 NaN Neotoma albigula Rodent
2 3 7 16 1977 2 DM F 37.0 NaN Dipodomys merriami Rodent
3 4 7 16 1977 7 DM M 36.0 NaN Dipodomys merriami Rodent
4 5 7 16 1977 3 DM M 35.0 NaN Dipodomys merriami Rodent
35544 35545 12 31 2002 15 AH NaN NaN NaN Ammospermophilus harrisi Rodent
35545 35546 12 31 2002 15 AH NaN NaN NaN Ammospermophilus harrisi Rodent
35546 35547 12 31 2002 10 RM F 15.0 14.0 Reithrodontomys megalotis Rodent
35547 35548 12 31 2002 7 DO M 36.0 51.0 Dipodomys ordii Rodent
35548 35549 12 31 2002 5 NaN NaN NaN NaN NaN NaN NaN

35549 rows × 12 columns

Now all 35,549 rows appear in the merged dataframe.

Appending rows to a dataframe


Merges address the case where information about the same set of observations is spread across multiple files. What about when the observations themselves are split into more than one file? For a survey like the one we’ve been looking at in this lesson, we might get a new file once a year with the same columns but a completely new set of observations. How can we add those new observations to our dataframe?

We’ll simulate this operation by splitting data from two different years, 2001 and 2002, into separate dataframes. We can do this using conditionals, as we saw in lesson 3:

PYTHON

surveys_2001 = surveys[surveys["year"] == 2001].copy()
surveys_2001
record_id month day year plot_id species_id sex hindfoot_length weight
31710 31711 1 21 2001 1 PB F 26.0 25.0
31711 31712 1 21 2001 1 DM M 37.0 43.0
31712 31713 1 21 2001 1 PB M 29.0 44.0
31713 31714 1 21 2001 1 DO M 34.0 53.0
31714 31715 1 21 2001 2 OT M 20.0 27.0
33315 33316 12 16 2001 11 NaN NaN NaN NaN
33316 33317 12 16 2001 13 NaN NaN NaN NaN
33317 33318 12 16 2001 14 NaN NaN NaN NaN
33318 33319 12 16 2001 15 NaN NaN NaN NaN
33319 33320 12 16 2001 16 NaN NaN NaN NaN

1610 rows × 9 columns

PYTHON

surveys_2002 = surveys[surveys["year"] == 2002].copy()
surveys_2002
record_id month day year plot_id species_id sex hindfoot_length weight
33320 33321 1 12 2002 1 DM M 38.0 44.0
33321 33322 1 12 2002 1 DO M 37.0 58.0
33322 33323 1 12 2002 1 PB M 28.0 45.0
33323 33324 1 12 2002 1 AB NaN NaN NaN
33324 33325 1 12 2002 1 DO M 35.0 29.0
35544 35545 12 31 2002 15 AH NaN NaN NaN
35545 35546 12 31 2002 15 AH NaN NaN NaN
35546 35547 12 31 2002 10 RM F 15.0 14.0
35547 35548 12 31 2002 7 DO M 36.0 51.0
35548 35549 12 31 2002 5 NaN NaN NaN NaN

2229 rows × 9 columns

We now have two different dataframes with the same columns but different data, one with 1,610 rows, the other with 2,229 rows. We can combine them into a new dataframe using pd.concat(), which stacks the dataframes vertically (that is, it appends records from the 2002 dataset to the 2001 dataset). This function accepts a list and concatenates each item moving from left to right. We’re only combining two dataframes here but could include more if needed.

PYTHON

surveys_2001_2002 = pd.concat([surveys_2001, surveys_2002])
surveys_2001_2002
record_id month day year plot_id species_id sex hindfoot_length weight
31710 31711 1 21 2001 1 PB F 26.0 25.0
31711 31712 1 21 2001 1 DM M 37.0 43.0
31712 31713 1 21 2001 1 PB M 29.0 44.0
31713 31714 1 21 2001 1 DO M 34.0 53.0
31714 31715 1 21 2001 2 OT M 20.0 27.0
35544 35545 12 31 2002 15 AH NaN NaN NaN
35545 35546 12 31 2002 15 AH NaN NaN NaN
35546 35547 12 31 2002 10 RM F 15.0 14.0
35547 35548 12 31 2002 7 DO M 36.0 51.0
35548 35549 12 31 2002 5 NaN NaN NaN NaN

3839 rows × 9 columns

The combined dataframe includes all rows from both dataframes.

In some cases, the exact columns may change from year to year even within the same project. For example, researchers may decide to add an additional column to track a new piece of data or to provide a quality check. If a column is present in only one dataset, you can still concatenate the datasets. Any column that does not appear in a given dataset will be set to NaN for those rows in the combined dataframe.

To illustrate this, we’ll add a validated column to the 2002 survey, then re-run pd.concat():

PYTHON

surveys_2002["validated"] = True

surveys_2001_2002 = pd.concat([surveys_2001, surveys_2002])
surveys_2001_2002
record_id month day year plot_id species_id sex hindfoot_length weight validated
31710 31711 1 21 2001 1 PB F 26.0 25.0 NaN
31711 31712 1 21 2001 1 DM M 37.0 43.0 NaN
31712 31713 1 21 2001 1 PB M 29.0 44.0 NaN
31713 31714 1 21 2001 1 DO M 34.0 53.0 NaN
31714 31715 1 21 2001 2 OT M 20.0 27.0 NaN
35544 35545 12 31 2002 15 AH NaN NaN NaN True
35545 35546 12 31 2002 15 AH NaN NaN NaN True
35546 35547 12 31 2002 10 RM F 15.0 14.0 True
35547 35548 12 31 2002 7 DO M 36.0 51.0 True
35548 35549 12 31 2002 5 NaN NaN NaN NaN True

3839 rows × 10 columns

As expected, the validated column has a value of NaN for the 2001 data in the combined dataframe.

Joining and splitting columns


Sometimes we’d like to combine values from two or more columns into a single column. For example, we might want to refer to the species in each record by both its genus and species names. In Python, we use the + operator to concatenate (or join) strings, and pandas works the same way:

PYTHON

species["genus_species"] = species["genus"] + " " + species["species"]
species["genus_species"]

OUTPUT

0                Amphispiza bilineata
1            Ammospermophilus harrisi
2               Ammodramus savannarum
3                     Baiomys taylori
4     Campylorhynchus brunneicapillus
5             Calamospiza melanocorys
6                 Callipepla squamata
7                 Crotalus scutalatus
8                Cnemidophorus tigris
9             Cnemidophorus uniparens
10                   Crotalus viridis
11                 Dipodomys merriami
12                    Dipodomys ordii
13              Dipodomys spectabilis
14                      Dipodomys sp.
15                  Eumeces obsoletus
16                     Gambelia silus
17                   Neotoma albigula
18                        Neotoma sp.
19              Onychomys leucogaster
20                 Onychomys torridus
21                      Onychomys sp.
22                Chaetodipus baileyi
23                   Pipilo chlorurus
24                Peromyscus eremicus
25                 Perognathus flavus
26                Pooecetes gramineus
27               Perognathus hispidus
28            Chaetodipus intermedius
29                Peromyscus leucopus
30             Peromyscus maniculatus
31           Chaetodipus penicillatus
32                      Pipilo fuscus
33                    Chaetodipus sp.
34         Reithrodontomys fulvescens
35          Reithrodontomys megalotis
36           Reithrodontomys montanus
37                Reithrodontomys sp.
38               Sylvilagus audubonii
39                   Spizella breweri
40                  Sceloporus clarki
41               Sigmodon fulviventer
42                  Sigmodon hispidus
43              Sigmodon ochrognathus
44             Spermophilus spilosoma
45          Spermophilus tereticaudus
46               Sceloporus undulatus
47                       Sigmodon sp.
48                         Lizard sp.
49                         Pipilo sp.
50                         Rodent sp.
51                        Sparrow sp.
52             Zonotrichia leucophrys
53                   Zenaida macroura
Name: genus_species, dtype: object

Note that the + operator is also used to add numeric columns. In Python, the same operator can perform different operations for different data types (but keep reading for an important caveat).
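We can see this polymorphism with a pair of small Series (made up for illustration):

```python
import pandas as pd

nums = pd.Series([1, 2, 3])
text = pd.Series(["a", "b", "c"])

# With numbers, + performs addition
print(list(nums + nums))  # [2, 4, 6]

# With strings, + performs concatenation
print(list(text + text))  # ['aa', 'bb', 'cc']
```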

Another common need is to join or split dates. In the ecology dataset, the date is split across year, month, and day columns. However, pandas has a special data type, datetime64, for representing dates that is useful for plotting, comparing, and resampling time series data. To make use of that functionality, we can concatenate the date columns and convert them to a datetime object. For clarity, we’ll use an unambiguous date format: YYYY-MM-DD.

We need to perform an additional step before combining the date columns. Year, month, and day are all stored as integers in our dataframe (specifically, they use the int64 data type). If we try to concatenate them as they are, we’ll receive an error. This is because the + operator only works when each object has a similar type, that is, all objects are either the same type or can be coerced to a common type (like int and float, which are distinct types that generally play well together). In this case, the columns are integers and the hyphens are strings. pandas cannot determine exactly how the user wants to combine the values, so it gives an error.
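As a minimal sketch of the error, using a small made-up Series in place of the year column:

```python
import pandas as pd

# An integer Series standing in for the year column
year = pd.Series([1977, 1978])

# Mixing an integer column with a string raises a TypeError
try:
    year + "-"
except TypeError:
    print("cannot combine int64 and str")
```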

To resolve the error, we can use the astype() method to convert each column to a string before combining the columns:

PYTHON

year = surveys["year"].astype(str)
month = surveys["month"].astype(str)
day = surveys["day"].astype(str)

surveys["date"] = year + "-" + month + "-" + day
surveys["date"]

OUTPUT

0         1977-7-16
1         1977-7-16
2         1977-7-16
3         1977-7-16
4         1977-7-16
            ...
35544    2002-12-31
35545    2002-12-31
35546    2002-12-31
35547    2002-12-31
35548    2002-12-31
Name: date, Length: 35549, dtype: object

Note that some of the dates look a little funny because single-digit days and months do not include a leading zero. For example, in the first row we have 1977-7-16 instead of 1977-07-16. This is usually not a big deal but can be neatened up using the str accessor.

In pandas, an accessor is an attribute that provides additional functionality to an object. Here, the str accessor allows us to access many of the methods from the built-in str data type, including zfill(), which pads a string to a given length by adding zeroes to the start of the string:

PYTHON

text = "2"
text.zfill(3)

OUTPUT

'002'

Using the str accessor, we can use that method to zero-pad the data in a Series:

PYTHON

# Pad month and day to two characters
month = month.str.zfill(2)
day = day.str.zfill(2)

surveys["date"] = year + "-" + month + "-" + day
surveys["date"]

OUTPUT

0        1977-07-16
1        1977-07-16
2        1977-07-16
3        1977-07-16
4        1977-07-16
            ...
35544    2002-12-31
35545    2002-12-31
35546    2002-12-31
35547    2002-12-31
35548    2002-12-31
Name: date, Length: 35549, dtype: object

The month and date values in date are now padded to a length of two, allowing us to create a well-formed YYYY-MM-DD date string. Other string methods, like upper() and lower(), can be used in the same way.
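For example, upper() works through the accessor exactly as zfill() does (the codes here are made up for illustration):

```python
import pandas as pd

codes = pd.Series(["dm", "do", "nl"])

# Convert every value in the Series to uppercase
print(list(codes.str.upper()))  # ['DM', 'DO', 'NL']
```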

Before we convert the date column to a datetime, we’re going to use the date string to show the opposite operation: Splitting a value stored in one column into multiple columns. One way to do this in pandas is to use str.split(), which splits each value in a series based on a delimiter, a character used as a boundary between parts of a string. Here, a hyphen is used to delimit the year, month, and day in each date. By splitting the column on a hyphen, we can extract each of those components into its own column. We also pass True to the expand keyword argument, which makes the str.split() method return a dataframe:

PYTHON

surveys["date"].str.split("-", expand=True)
0 1 2
0 1977 07 16
1 1977 07 16
2 1977 07 16
3 1977 07 16
4 1977 07 16
35544 2002 12 31
35545 2002 12 31
35546 2002 12 31
35547 2002 12 31
35548 2002 12 31

35549 rows × 3 columns

Now let’s go ahead and convert our date column into a datetime object using pd.to_datetime():

PYTHON

surveys["date"] = pd.to_datetime(surveys["date"])
surveys["date"]

OUTPUT

0       1977-07-16
1       1977-07-16
2       1977-07-16
3       1977-07-16
4       1977-07-16
           ...
35544   2002-12-31
35545   2002-12-31
35546   2002-12-31
35547   2002-12-31
35548   2002-12-31
Name: date, Length: 35549, dtype: datetime64[ns]
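With the column stored as datetime64, the dt accessor exposes date components in much the same way the str accessor exposes string methods. A minimal sketch with two made-up dates:

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["1977-07-16", "2002-12-31"]))

# Extract components from each date
print(list(dates.dt.year))   # [1977, 2002]
print(list(dates.dt.month))  # [7, 12]
```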

Challenge

pandas can help us answer specific questions about our data. The real skill is knowing how to translate our scientific questions into a sensible approach (and subsequently visualize and interpret the results).

Try using pandas to answer the following questions.

  1. How many specimens of each sex are there for each year, including those whose sex is unknown?
  2. What is the average weight of each taxa?
  3. What are the minimum, maximum and average weight for each species of Rodent?
  4. What is the average hindfoot length for male and female rodents of each species? Is there a male/female difference?
  5. What is the average weight of each rodent species over the course of the years? Is there any noticeable trend for any of the species?

How many specimens of each sex are there for each year, including those whose sex is unknown?

PYTHON

# Fill in NaN values in sex with U
surveys["sex"] = surveys["sex"].fillna("U")

# Count records by sex
result = surveys.groupby(["year", "sex"])["record_id"].count()

# Use the option_context context manager to show all rows
with pd.option_context("display.max_rows", None):
    print(result)

OUTPUT

year  sex
1977  F       204
      M       214
      U        85
1978  F       503
      M       433
      U       112
1979  F       327
      M       324
      U        68
1980  F       605
      M       727
      U        83
1981  F       631
      M       745
      U        96
1982  F       823
      M      1027
      U       128
1983  F       771
      M       797
      U       105
1984  F       445
      M       443
      U        93
1985  F       636
      M       716
      U        86
1986  F       414
      M       455
      U        73
1987  F       677
      M       862
      U       132
1988  F       604
      M       737
      U       128
1989  F       678
      M       780
      U       111
1990  F       581
      M       636
      U        94
1991  F       606
      M       637
      U       104
1992  F       443
      M       477
      U       118
1993  F       285
      M       380
      U        85
1994  F       243
      M       327
      U        98
1995  F       574
      M       534
      U       114
1996  F       725
      M       910
      U        71
1997  F      1071
      M      1357
      U        65
1998  F       717
      M       798
      U        95
1999  F       545
      M       530
      U        60
2000  F       690
      M       779
      U        83
2001  F       743
      M       744
      U       123
2002  F      1149
      M       979
      U       101
Name: record_id, dtype: int64

What is the average weight of each taxa?

PYTHON

# Create the merged dataframe
merged = pd.merge(surveys, species, how="left")

# Group by taxa
grouped = merged.groupby("taxa")

# Calculate the mean weight
grouped["weight"].mean()

OUTPUT

taxa
Bird             NaN
Rabbit           NaN
Reptile          NaN
Rodent     42.672428
Name: weight, dtype: float64

What are the minimum, maximum and average weight for each species of Rodent?

PYTHON

# Create the merged dataframe
merged = pd.merge(surveys, species, how="left")

# Limit merged dataframe to rodents
rodents = merged[merged["taxa"] == "Rodent"]

# Group rodents by species
grouped = rodents.groupby("species_id")

# Calculate the min, max, and mean weight
grouped.agg({"weight": ["min", "max", "mean"]})
           weight
              min    max        mean
species_id
AH            NaN    NaN         NaN
BA            6.0   18.0    8.600000
DM           10.0   66.0   43.157864
DO           12.0   76.0   48.870523
DS           12.0  190.0  120.130546
DX            NaN    NaN         NaN
NL           30.0  280.0  159.245660
OL           10.0   56.0   31.575258
OT            5.0   46.0   24.230556
OX           18.0   24.0   21.000000
PB           12.0   55.0   31.735943
PE            8.0   40.0   21.586508
PF            4.0   25.0    7.923127
PH           18.0   48.0   31.064516
PI           17.0   21.0   19.250000
PL            8.0   27.0   19.138889
PM            7.0   49.0   21.364155
PP            4.0   74.0   17.173942
PX           18.0   20.0   19.000000
RF            9.0   20.0   13.386667
RM            4.0   29.0   10.585010
RO            8.0   13.0   10.250000
RX           11.0   20.0   15.500000
SF           24.0  199.0   58.878049
SH           16.0  140.0   73.148936
SO           15.0  105.0   55.414634
SS           57.0  130.0   93.500000
ST            NaN    NaN         NaN
UR            NaN    NaN         NaN

What is the average hindfoot length for male and female rodents of each species? Is there a male/female difference?

PYTHON

# Create the merged dataframe
merged = pd.merge(surveys, species, how="left")

# Limit merged dataframe to rodents
rodents = merged[merged["taxa"] == "Rodent"]

# Group rodents by species and sex
grouped = rodents.groupby(["species_id", "sex"])

# Calculate the mean hindfoot length, plus count and standard deviation
# to better assess the question
with pd.option_context("display.max_rows", None):
    print(grouped["hindfoot_length"].agg(["count", "mean", "std"]))

OUTPUT

                count       mean        std
species_id sex
AH         M        0        NaN        NaN
           U        2  33.000000   2.828427
BA         F       31  13.161290   1.529636
           M       14  12.642857   2.097880
           U        0        NaN        NaN
DM         F     4302  35.712692   1.433067
           M     5658  36.188229   1.455396
           U       12  35.583333   0.996205
DO         F     1244  35.486334   1.726414
           M     1640  35.698780   1.611656
           U        3  36.000000   0.000000
DS         F     1046  49.583174   1.973573
           M     1083  50.301939   2.128120
           U        3  50.000000   1.732051
DX         U        0        NaN        NaN
NL         F      620  32.024194   1.669181
           M      452  32.577434   1.903603
           U        2  52.000000  25.455844
OL         F      438  20.287671   1.101022
           M      480  20.758333   1.651195
           U        2  20.000000   0.000000
OT         F     1019  20.281649   1.519412
           M     1115  20.257399   1.087354
           U        5  19.600000   1.140175
OX         F        3  19.666667   1.154701
           M        5  18.800000   3.346640
           U        0        NaN        NaN
PB         F     1645  25.882675   0.928842
           M     1213  26.431987   1.444130
           U        6  26.166667   0.983192
PE         F      544  20.272059   1.156084
           M      666  20.132132   1.177721
           U        2  20.500000   0.707107
PF         F      721  15.550624   1.381627
           M      770  15.620779   1.142208
           U        2  13.000000   4.242641
PH         F       20  25.900000   1.293709
           M       11  25.545455   1.752920
PI         M        8  22.500000   0.534522
           U        1  20.000000        NaN
PL         F       16  19.750000   2.720294
           M       19  20.263158   0.933459
           U        1  20.000000        NaN
PM         F      361  20.279778   1.162782
           M      483  20.530021   1.275621
           U        3  21.333333   4.041452
PP         F     1574  21.676620   1.316529
           M     1444  21.838643   1.201006
           U        9  20.888889   2.027588
PX         F        1  19.000000        NaN
           M        1  20.000000        NaN
           U        0        NaN        NaN
RF         F       55  17.527273   0.813191
           M       18  17.500000   0.985184
RM         F     1133  16.391880   1.123946
           M     1296  16.491512   1.134202
           U       13  15.846154   2.444250
RO         F        4  14.750000   0.957427
           M        4  16.000000   1.154701
RX         M        2  18.500000   2.121320
SF         F       16  27.500000   2.683282
           M       23  26.347826   3.524056
           U        2  24.500000   0.707107
SH         F       71  29.028169   2.298893
           M       60  27.983333   2.801281
           U        0        NaN        NaN
SO         F       30  25.633333   4.139660
           M       11  25.727273   1.954017
SS         F        0        NaN        NaN
           M        0        NaN        NaN
           U        0        NaN        NaN
ST         U        0        NaN        NaN
UR         U        0        NaN        NaN
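
To eyeball the male/female difference per species, one option (a sketch with made-up values standing in for the rodents dataframe) is to unstack the sex level into columns and subtract:

```python
import pandas as pd

# Made-up hindfoot lengths for two species
rodents = pd.DataFrame({
    "species_id": ["DM", "DM", "DO", "DO"],
    "sex": ["F", "M", "F", "M"],
    "hindfoot_length": [35.7, 36.2, 35.5, 35.7],
})

# Move sex from the row index into the columns, then subtract
means = rodents.groupby(["species_id", "sex"])["hindfoot_length"].mean().unstack()
means["M - F"] = means["M"] - means["F"]
print(means)
```

A positive "M - F" column then flags species where males measure longer on average.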

What is the average weight of each rodent species over the course of the years? Is there any noticeable trend for any of the species?

PYTHON

# Create the merged dataframe
merged = pd.merge(surveys, species, how="left")

# Limit merged dataframe to rodents
rodents = merged[merged["taxa"] == "Rodent"]

# Group rodents by species and year
grouped = rodents.groupby(["species_id", "year"])

# Calculate the mean weight by year
result = grouped["weight"].mean()

# Use the option_context context manager to show all rows
with pd.option_context("display.max_rows", None):
    print(result)

OUTPUT

species_id  year
AH          1978           NaN
            1982           NaN
            1983           NaN
            1984           NaN
            1985           NaN
            1986           NaN
            1987           NaN
            1988           NaN
            1989           NaN
            1990           NaN
            1991           NaN
            1992           NaN
            1993           NaN
            1994           NaN
            1995           NaN
            1996           NaN
            1997           NaN
            1998           NaN
            1999           NaN
            2000           NaN
            2001           NaN
            2002           NaN
BA          1989      7.000000
            1990      8.000000
            1991      9.240000
            1992      7.833333
DM          1977     41.141304
            1978     40.795455
            1979     43.507317
            1980     44.136082
            1981     44.032907
            1982     41.975042
            1983     40.944551
            1984     40.765306
            1985     41.507645
            1986     43.361596
            1987     43.232609
            1988     43.397790
            1989     44.349206
            1990     41.769912
            1991     43.148338
            1992     43.877966
            1993     43.373984
            1994     42.373288
            1995     44.806147
            1996     44.506173
            1997     44.708551
            1998     43.131403
            1999     43.945402
            2000     43.126638
            2001     45.442177
            2002     46.168317
DO          1977     42.666667
            1978     45.000000
            1979     45.862069
            1980     48.058824
            1981     49.111111
            1982     47.919643
            1983     47.150602
            1984     48.415094
            1985     47.956731
            1986     49.372727
            1987     50.087379
            1988     51.463768
            1989     51.025641
            1990     48.512048
            1991     49.695652
            1992     48.367816
            1993     48.461538
            1994     47.750000
            1995     49.592593
            1996     47.234940
            1997     48.177866
            1998     49.731183
            1999     48.012048
            2000     49.224719
            2001     52.233766
            2002     49.514403
DS          1977    121.437500
            1978    115.198653
            1979    111.796954
            1980    120.725664
            1981    125.818444
            1982    115.647929
            1983    122.033088
            1984    124.082192
            1985    124.326316
            1986    128.071429
            1987    126.383838
            1988    129.490566
            1989    121.896552
            1990    121.187500
            1991    113.000000
            1992    112.352941
            1993    105.722222
            1994    106.625000
            1997    111.000000
            1998    116.000000
            1999    120.714286
DX          1980           NaN
            1983           NaN
            1986           NaN
            1988           NaN
            1991           NaN
            1992           NaN
            1993           NaN
            1994           NaN
            1995           NaN
            1996           NaN
            1998           NaN
            1999           NaN
            2000           NaN
            2001           NaN
            2002           NaN
NL          1977           NaN
            1978    184.656250
            1979    138.000000
            1980    158.982143
            1981    165.903226
            1982    160.613208
            1983    156.691489
            1984    150.095238
            1985    148.829268
            1986    159.724138
            1987    158.840000
            1988    163.104167
            1989    151.278689
            1990    154.275862
            1991    148.785714
            1992    139.000000
            1993    127.391304
            1994    186.875000
            1995    155.833333
            1996    148.000000
            1997    150.688889
            1998    159.466667
            1999    182.200000
            2000    179.307692
            2001    167.851064
            2002    182.159091
OL          1977     21.666667
            1978     31.027027
            1979     33.717391
            1980     33.107143
            1981     33.296296
            1982     35.590361
            1983     34.584615
            1984     32.550725
            1985     32.108696
            1986     30.880952
            1987     30.779221
            1988     30.306122
            1989     31.947368
            1990     31.200000
            1991     28.171429
            1992     28.454545
            1993     27.545455
            1994     21.900000
            1995     27.296296
            1996     27.769231
            1997     33.625000
            2002     25.428571
OT          1977     21.000000
            1978     23.220000
            1979     23.075758
            1980     24.083333
            1981     24.415385
            1982     24.925532
            1983     24.697674
            1984     22.416667
            1985     22.064516
            1986     21.500000
            1987     23.069767
            1988     24.070588
            1989     24.663366
            1990     23.675325
            1991     24.588235
            1992     24.928571
            1993     22.363636
            1994     25.285714
            1995     24.868421
            1996     23.453704
            1997     24.785156
            1998     24.675676
            1999     25.723810
            2000     25.303448
            2001     23.707792
            2002     23.833333
OX          1977     21.333333
            1978           NaN
            1982     24.000000
            1984     18.000000
            1989     20.000000
            1993           NaN
PB          1995     34.000000
            1996     32.578947
            1997     31.085603
            1998     30.082237
            1999     31.710037
            2000     30.878899
            2001     32.798851
            2002     32.359447
PE          1977     19.500000
            1978     20.428571
            1979     20.529412
            1980     22.450000
            1981     20.558140
            1982     21.173077
            1983     20.864865
            1984     20.210526
            1985     20.360000
            1986     22.520000
            1987     21.625899
            1988     22.625668
            1989     21.935673
            1990     21.671233
            1991     21.435484
            1992     22.750000
            1993     19.230769
            1994     18.000000
            1995     21.230769
            1996     22.027778
            1997     20.929825
            1998     20.304348
            1999     25.428571
            2000     21.615385
            2001     20.400000
            2002     21.719298
PF          1977      7.173913
            1978      7.088235
            1979      7.529412
            1980      7.455556
            1981      7.152542
            1982      6.918919
            1983      6.833333
            1984      7.428571
            1985           NaN
            1986      8.000000
            1987      7.500000
            1988      8.000000
            1989      7.230769
            1990      7.117647
            1991      7.827586
            1992      8.767857
            1993      8.238806
            1994      7.855072
            1995      8.780645
            1996      8.393846
            1997      8.448649
            1998      8.720000
            2001      8.538462
            2002      7.388889
PH          1984     28.000000
            1985     32.666667
            1986     39.000000
            1988     33.000000
            1989     30.500000
            1990     31.142857
            1992     31.142857
            1995     35.333333
            1997     22.000000
PI          1992     18.000000
            1995           NaN
            2001     18.000000
            2002     20.000000
PL          1995     22.333333
            1996     21.166667
            1997     18.789474
            1998     16.714286
            2000     21.000000
PM          1978     20.666667
            1979     23.666667
            1980     21.000000
            1981     19.933333
            1982     21.391304
            1983     22.023810
            1984     19.545455
            1985     19.000000
            1986     21.696970
            1987     22.297710
            1988     21.759259
            1989     20.222222
            1995     27.375000
            1996     20.224490
            1997     21.126394
            1998     20.591398
            1999     22.523810
            2000     20.500000
            2001     26.666667
            2002     19.000000
PP          1977     15.333333
            1978     14.869565
            1979     15.105263
            1980     14.176471
            1981     13.950000
            1982     16.125000
            1983     15.468750
            1984     15.307692
            1985     15.764706
            1986     16.750000
            1987     17.840909
            1988     18.280000
            1989     17.409091
            1990     16.029412
            1991     17.904762
            1992     17.479675
            1993     17.954023
            1994     17.585714
            1995     16.844444
            1996     18.095563
            1997     18.175000
            1998     16.266990
            1999     16.521212
            2000     16.788618
            2001     17.752896
            2002     17.018617
PX          1996           NaN
            1997     19.000000
            1998           NaN
            2000           NaN
RF          1982     11.500000
            1988     13.818182
            1989     13.346939
            1990     12.916667
            1997     20.000000
RM          1977     10.000000
            1978      7.500000
            1979      8.333333
            1980     10.227273
            1981     11.178571
            1982     10.436242
            1983      9.872000
            1984     11.152542
            1985      8.371429
            1986     10.750000
            1987     10.657609
            1988     10.322115
            1989     10.411311
            1990     10.305677
            1991     10.498305
            1992     10.904762
            1993     10.747475
            1994     10.675000
            1995     12.653846
            1996     11.455556
            1997     11.230769
            1998     13.100000
            1999     10.555556
            2000     11.400000
            2001     11.333333
            2002     10.000000
RO          1991     11.000000
            2002     10.142857
RX          1995     20.000000
            1997     11.000000
SF          1989     54.800000
            1990     52.611111
            1991     93.166667
            1992     43.000000
            1993     44.000000
            2002     62.166667
SH          1977           NaN
            1978     89.000000
            1982     79.000000
            1983           NaN
            1986     55.000000
            1987     73.800000
            1988     72.806452
            1989     76.345455
            1990     76.888889
            1991     63.000000
            1997     50.857143
            1999     54.000000
            2000     73.375000
            2001     79.900000
            2002     64.666667
SO          1991     53.909091
            1992     55.263158
            1993     55.600000
            1994     62.333333
            1997     54.666667
SS          1977           NaN
            1978    130.000000
            1979           NaN
            1980           NaN
            1981     57.000000
            1982           NaN
            1983           NaN
            1984           NaN
            1985           NaN
            1986           NaN
            1987           NaN
            1988           NaN
            1989           NaN
            1990           NaN
            1991           NaN
            1992           NaN
            1993           NaN
            1994           NaN
            1995           NaN
            1996           NaN
            1997           NaN
            1998           NaN
            1999           NaN
            2000           NaN
            2001           NaN
            2002           NaN
ST          1993           NaN
UR          1987           NaN
            1988           NaN
            1991           NaN
            1992           NaN
            1994           NaN
            2001           NaN
            2002           NaN
Name: weight, dtype: float64
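
Trends are easier to scan when years run down the rows and species across the columns. One way to reshape the result (a sketch with made-up values) is unstack():

```python
import pandas as pd

# Made-up yearly weights for two species
rodents = pd.DataFrame({
    "species_id": ["DM", "DM", "DO", "DO"],
    "year": [2001, 2002, 2001, 2002],
    "weight": [45.4, 46.2, 52.2, 49.5],
})

result = rodents.groupby(["species_id", "year"])["weight"].mean()

# Years down the rows, species across the columns: easier to scan
# for trends, and ready to pass to .plot()
by_year = result.unstack(level="species_id")
print(by_year)
```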

Key Points

  • Combine two dataframes on one or more common values using pd.merge()
  • Append rows from one dataframe to another using pd.concat()
  • Combine multiple text columns into one using the + operator
  • Use the str accessor to use string methods like split() and zfill() on text columns
  • Convert date strings to datetime objects using pd.to_datetime()

Content from Data Workflows and Automation


Last updated on 2024-07-17 | Edit this page

Estimated time: 90 minutes

Overview

Questions

  • Can I automate operations in Python?
  • What are functions and why should I use them?

Objectives

  • Describe why for loops are used in Python.
  • Employ for loops to automate data analysis.
  • Write unique filenames in Python.
  • Build reusable code in Python.
  • Write functions using conditional statements (if, then, else).

So far, we’ve used Python and the pandas library to explore and manipulate individual datasets by hand, much like we would do in a spreadsheet. The beauty of using a programming language like Python, though, comes from the ability to automate data processing through the use of loops and functions.

For loops


Loops allow us to repeat a workflow (or series of actions) a given number of times or while some condition is true. We would use a loop to automatically process data that’s stored in multiple files (daily values with one file per year, for example). Loops lighten our workload by performing repeated tasks without our direct involvement and make it less likely that we’ll introduce errors by making mistakes while processing each file by hand.

Let’s write a simple for loop that simulates what a kid might see during a visit to the zoo:

PYTHON

animals = ['lion', 'tiger', 'crocodile', 'vulture', 'hippo']
print(animals)

OUTPUT

['lion', 'tiger', 'crocodile', 'vulture', 'hippo']

PYTHON

for creature in animals:
    print(creature)

OUTPUT

lion
tiger
crocodile
vulture
hippo

The line defining the loop must start with for and end with a colon, and the body of the loop must be indented.

In this example, creature is the loop variable that takes the value of the next entry in animals every time the loop goes around. We can call the loop variable anything we like. After the loop finishes, the loop variable will still exist and will have the value of the last entry in the collection:

PYTHON

animals = ['lion', 'tiger', 'crocodile', 'vulture', 'hippo']
for creature in animals:
    pass

PYTHON

print('The loop variable is now: ' + creature)

OUTPUT

The loop variable is now: hippo

We are not asking Python to print the value of the loop variable anymore, but the for loop still runs and the value of creature changes on each pass through the loop. The statement pass in the body of the loop means “do nothing”.

Challenge - Loops

  1. What happens if we don’t include the pass statement?

  2. Rewrite the loop so that the animals are separated by commas, not new lines (Hint: You can concatenate strings using a plus sign. For example, print(string1 + string2) outputs ‘string1string2’).

  1. PYTHON

    animals = ['lion', 'tiger', 'crocodile', 'vulture', 'hippo']
    for creature in animals:

    ERROR

    IndentationError: expected an indented block
  2. Using the end argument to print:

    PYTHON

    for creature in animals:
        print(creature + ',', end='')

    OUTPUT

    lion,tiger,crocodile,vulture,hippo,

    This puts a comma on the end of the list, which is not ideal. To avoid this, we need to use an altogether different approach: string objects in Python have a join method, which can be used to concatenate items in a list with the string in between, e.g.

    PYTHON

    ', '.join(animals)

    OUTPUT

    'lion, tiger, crocodile, vulture, hippo'

Automating data processing using For Loops


The file we’ve been using so far, surveys.csv, contains 26 years of data (1977 through 2002) and is very large. We would like to separate the data for each year into a separate file.

Let’s start by making a new directory inside the folder data to store all of these files using the module os:

PYTHON

import os

os.mkdir('data/yearly_files')

The command os.mkdir is equivalent to mkdir in the shell. Just so we are sure, we can check that the new directory was created within the data folder:

PYTHON

os.listdir('data')

OUTPUT

['plots.csv',
 'portal_mammals.sqlite',
 'species.csv',
 'survey2001.csv',
 'survey2002.csv',
 'surveys.csv',
 'surveys2002_temp.csv',
 'yearly_files']

The command os.listdir is equivalent to ls in the shell.
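
One gotcha: os.mkdir raises a FileExistsError if the directory already exists, so rerunning the cell fails. A sketch of a rerun-safe alternative using os.makedirs:

```python
import os

# exist_ok=True makes the call a no-op when the directory is
# already there, so the cell can safely be rerun
os.makedirs('data/yearly_files', exist_ok=True)
os.makedirs('data/yearly_files', exist_ok=True)  # second call: no error
```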

In previous lessons, we saw how to use the library pandas to load the survey data into memory as a DataFrame, how to select a subset of the data using some criteria, and how to write the DataFrame into a CSV file. Let’s write a script that performs those three steps in sequence for the year 2002:

PYTHON

import pandas as pd

# Load the data into a DataFrame
surveys_df = pd.read_csv('data/surveys.csv')

# Select only data for the year 2002
surveys2002 = surveys_df[surveys_df.year == 2002]

# Write the new DataFrame to a CSV file
surveys2002.to_csv('data/yearly_files/surveys2002.csv')

To create yearly data files, we could repeat the last two commands over and over, once for each year of data. Repeating code is neither elegant nor practical, and is very likely to introduce errors into your code. We want to turn what we’ve just written into a loop that repeats the last two commands for every year in the dataset.

Let’s start by writing a loop that prints the names of the files we want to create - the dataset we are using covers 1977 through 2002, and we’ll create a separate file for each of those years. Listing the filenames is a good way to confirm that the loop is behaving as we expect.

We have seen that we can loop over a list of items, so we need a list of years to loop over. We can get the years in our DataFrame with:

PYTHON

surveys_df['year']

OUTPUT

0        1977
1        1977
2        1977
3        1977
         ...
35545    2002
35546    2002
35547    2002
35548    2002

but we want only unique years, which we can get using the unique method we have already seen:

PYTHON

surveys_df['year'].unique()

OUTPUT

array([1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987,
       1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998,
       1999, 2000, 2001, 2002], dtype=int64)

Putting this into our for loop we get

PYTHON

for year in surveys_df['year'].unique():
    filename = 'data/yearly_files/surveys' + str(year) + '.csv'
    print(filename)

OUTPUT

data/yearly_files/surveys1977.csv
data/yearly_files/surveys1978.csv
data/yearly_files/surveys1979.csv
data/yearly_files/surveys1980.csv
data/yearly_files/surveys1981.csv
data/yearly_files/surveys1982.csv
data/yearly_files/surveys1983.csv
data/yearly_files/surveys1984.csv
data/yearly_files/surveys1985.csv
data/yearly_files/surveys1986.csv
data/yearly_files/surveys1987.csv
data/yearly_files/surveys1988.csv
data/yearly_files/surveys1989.csv
data/yearly_files/surveys1990.csv
data/yearly_files/surveys1991.csv
data/yearly_files/surveys1992.csv
data/yearly_files/surveys1993.csv
data/yearly_files/surveys1994.csv
data/yearly_files/surveys1995.csv
data/yearly_files/surveys1996.csv
data/yearly_files/surveys1997.csv
data/yearly_files/surveys1998.csv
data/yearly_files/surveys1999.csv
data/yearly_files/surveys2000.csv
data/yearly_files/surveys2001.csv
data/yearly_files/surveys2002.csv

We can now add the rest of the steps we need to create separate text files:

PYTHON

# Load the data into a DataFrame
surveys_df = pd.read_csv('data/surveys.csv')

for year in surveys_df['year'].unique():

    # Select data for the year
    surveys_year = surveys_df[surveys_df.year == year]

    # Write the new DataFrame to a CSV file
    filename = 'data/yearly_files/surveys' + str(year) + '.csv'
    surveys_year.to_csv(filename)

Look inside the yearly_files directory and check a couple of the files you just created to confirm that everything worked as expected.
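
A quick programmatic spot-check (a sketch using a tiny made-up dataframe and a temporary directory rather than the real surveys.csv) is to read one file back and confirm it holds only that year's rows:

```python
import os
import tempfile

import pandas as pd

# Tiny stand-in for surveys_df so the sketch is self-contained
surveys_df = pd.DataFrame({
    "year": [2001, 2001, 2002, 2002],
    "weight": [45, 46, 49, 50],
})

# Write one CSV per year into a temporary directory
outdir = tempfile.mkdtemp()
for year in surveys_df["year"].unique():
    surveys_year = surveys_df[surveys_df.year == year]
    surveys_year.to_csv(os.path.join(outdir, "surveys" + str(year) + ".csv"), index=False)

# Read one file back: it should contain only that year's rows
check = pd.read_csv(os.path.join(outdir, "surveys2002.csv"))
print((check["year"] == 2002).all())  # True
```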

Writing Unique File Names


Notice that the code above created a unique filename for each year.

PYTHON

filename = 'data/yearly_files/surveys' + str(year) + '.csv'

Let’s break down the parts of this name:

  • The first part is some text that specifies the directory to store our data file in (data/yearly_files/) and the first part of the file name (surveys): 'data/yearly_files/surveys'
  • We can concatenate this with the value of a variable, in this case year, using the plus (+) sign after converting the year to a string: + str(year)
  • Then we add the file extension as another text string: + '.csv'

Notice that we use single quotes to add text strings. The variable is not surrounded by quotes. This code produces the string data/yearly_files/surveys2002.csv which contains the path to the new filename AND the file name itself.
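
An equivalent and often more readable way to build such strings (available since Python 3.6) is an f-string, which interpolates the variable and converts it to text in one step:

```python
year = 2002

# The expression inside the braces is evaluated and inserted
filename = f'data/yearly_files/surveys{year}.csv'
print(filename)  # data/yearly_files/surveys2002.csv
```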

Challenge - Modifying loops

  1. Some of the surveys you saved are missing data (they have null values that show up as NaN - Not A Number - in the DataFrames and do not show up in the text files). Modify the for loop so that the entries with null values are not included in the yearly files.

  2. Let’s say you only want to look at data from a given multiple of years. How would you modify your loop in order to generate a data file for only every 5th year, starting from 1977?

  3. Instead of splitting out the data by years, a colleague wants to do analyses each species separately. How would you write a unique CSV file for each species?

  1. PYTHON

    surveys_year = surveys_df[surveys_df.year == year].dropna()
  2. You could make the list of years manually, but why not determine the first and last year from the data itself?

    PYTHON

    n_year = 5  # use a variable so the interval is easy to change
    first_year = surveys_df['year'].min()
    last_year = surveys_df['year'].max()

    # Add 1 to last_year so that range() includes the final year
    for year in range(first_year, last_year + 1, n_year):

        # Select data for the year
        surveys_year = surveys_df[surveys_df.year == year].dropna()

        # Write the new DataFrame to a CSV file
        filename = 'data/yearly_files/surveys' + str(year) + '.csv'
        surveys_year.to_csv(filename)
  3. PYTHON

    # Create a directory for the species files first
    os.mkdir('data/species_files')

    for species in surveys_df['species_id'].dropna().unique():
        surveys_species = surveys_df[surveys_df.species_id == species]
        filename = 'data/species_files/surveys' + species + '.csv'
        surveys_species.to_csv(filename)