Content from Introducing Python
Last updated on 2022-07-18
Estimated time: 60 minutes
Overview
Questions
- What is Python?
- How do we assign variables in Python?
- How do we perform actions in Python?
- How does Python handle different types of data?
Objectives
- Introduce some Python terminology
- Learn how to assign values to variables
- Learn how to use functions and methods
- Introduce the built-in Python data types and functions
- Introduce the Python Standard Library
About Python
Python is a high-level, general purpose language. In practical terms, that means it dispenses with many of the complexities of writing code. It is widely used and has an active community of developers, making it suitable for a large number of tasks.
A core philosophical tenet of Python is that code is read more often than it is written. Best practices for coding in Python therefore emphasize writing well-documented, consistently formatted code that follows language-wide standards. In a Pythonic world, everyone’s code would look the same.
One feature to be aware of going in is that Python uses whitespace (as opposed to braces) to organize blocks of code. This can be a sticking point for people used to other languages.
Getting started
Everything in Python is an object. Each object has a type, which controls what the object can do. For example, we can add integers together or append a string to a list. An object can be assigned to a variable, and a function performs an action.
Python is an object-oriented language. This means that each object includes built-in variables and functions that can be used from the object itself. When working from an object, we use different terms to refer to these concepts:
- an attribute stores a piece of data
- a method performs an action
Let’s look at one of the main built-in types, str, to see how this works in practice. In Python, str, short for string, is used to store and manipulate text. To get started, we’ll assign the string “hello world” to the variable text. In Python, we use a single equal sign for assignment and quotes to create a string.
Using quotes
Either single or double quotes can be used to create strings, but try to use them consistently! We will use double quotes in the examples here.
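For example, both quote styles produce the same string. A minimal sketch (the greeting variable is our own, not part of the lesson data):
PYTHON
greeting = 'hello world'   # single quotes
greeting == "hello world"  # double quotes create an equal string
OUTPUT
True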
First we’ll define a variable called text. Variables can be any type of object supported by Python, but in this case we’ll use a string:
PYTHON
= "hello world" text
Now, instead of retyping the full string every time we need it, we can access that string by referring to the variable:
PYTHON
text
OUTPUT
'hello world'
We can check the type of the variable using the type() function:
PYTHON
type(text)
OUTPUT
str
We can modify the text by calling one of the methods built into the str class. We call methods by appending a period and the name of the method to the object. For example, to display the string in upper case, use upper():
PYTHON
text.upper()
OUTPUT
'HELLO WORLD'
After running a cell, Jupyter displays a representation of the object that appears in the last line of the cell (but note that some actions, like assigning an object, have no output). Some cells will have more than one line, and it is often useful to display content produced earlier in a cell. This can be accomplished using the built-in print() function. In the cell below, we’ll use print() to show a series of strings:
PYTHON
"hello"
print("h")
print("e")
print("l")
print("l")
print("o")
"world"
OUTPUT
h
e
l
l
o
'world'
Note that the string “hello” at the top of the cell lacks a print statement and does not appear in the output, whereas the text “world” does appear in the output because it is the last line of the cell.
Each object may contain many attributes and methods. Use the help() function on any object, including functions or methods, to show a description of the object and list the available methods.
PYTHON
help(str)
OUTPUT
Help on class str in module builtins:
class str(object)
| str(object='') -> str
| str(bytes_or_buffer[, encoding[, errors]]) -> str
|
| Create a new string object from the given object. If encoding or
| errors is specified, then the object must expose a data buffer
| that will be decoded using the given encoding and error handler.
| Otherwise, returns the result of object.__str__() (if defined)
| or repr(object).
| encoding defaults to sys.getdefaultencoding().
| errors defaults to 'strict'.
|
| Methods defined here:
|
| __add__(self, value, /)
| Return self+value.
|
| __contains__(self, key, /)
| Return key in self.
|
| __eq__(self, value, /)
| Return self==value.
|
| __format__(self, format_spec, /)
| Return a formatted version of the string as described by format_spec.
|
| __ge__(self, value, /)
| Return self>=value.
|
| __getattribute__(self, name, /)
| Return getattr(self, name).
|
| __getitem__(self, key, /)
| Return self[key].
|
| __getnewargs__(...)
|
| __gt__(self, value, /)
| Return self>value.
|
| __hash__(self, /)
| Return hash(self).
|
| __iter__(self, /)
| Implement iter(self).
|
| __le__(self, value, /)
| Return self<=value.
|
| __len__(self, /)
| Return len(self).
|
| __lt__(self, value, /)
| Return self<value.
|
| __mod__(self, value, /)
| Return self%value.
|
| __mul__(self, value, /)
| Return self*value.
|
| __ne__(self, value, /)
| Return self!=value.
|
| __repr__(self, /)
| Return repr(self).
|
| __rmod__(self, value, /)
| Return value%self.
|
| __rmul__(self, value, /)
| Return value*self.
|
| __sizeof__(self, /)
| Return the size of the string in memory, in bytes.
|
| __str__(self, /)
| Return str(self).
|
| capitalize(self, /)
| Return a capitalized version of the string.
|
| More specifically, make the first character have upper case and the rest lower
| case.
|
| casefold(self, /)
| Return a version of the string suitable for caseless comparisons.
|
| center(self, width, fillchar=' ', /)
| Return a centered string of length width.
|
| Padding is done using the specified fill character (default is a space).
|
| count(...)
| S.count(sub[, start[, end]]) -> int
|
| Return the number of non-overlapping occurrences of substring sub in
| string S[start:end]. Optional arguments start and end are
| interpreted as in slice notation.
|
| encode(self, /, encoding='utf-8', errors='strict')
| Encode the string using the codec registered for encoding.
|
| encoding
| The encoding in which to encode the string.
| errors
| The error handling scheme to use for encoding errors.
| The default is 'strict' meaning that encoding errors raise a
| UnicodeEncodeError. Other possible values are 'ignore', 'replace' and
| 'xmlcharrefreplace' as well as any other name registered with
| codecs.register_error that can handle UnicodeEncodeErrors.
|
| endswith(...)
| S.endswith(suffix[, start[, end]]) -> bool
|
| Return True if S ends with the specified suffix, False otherwise.
| With optional start, test S beginning at that position.
| With optional end, stop comparing S at that position.
| suffix can also be a tuple of strings to try.
|
| expandtabs(self, /, tabsize=8)
| Return a copy where all tab characters are expanded using spaces.
|
| If tabsize is not given, a tab size of 8 characters is assumed.
|
| find(...)
| S.find(sub[, start[, end]]) -> int
|
| Return the lowest index in S where substring sub is found,
| such that sub is contained within S[start:end]. Optional
| arguments start and end are interpreted as in slice notation.
|
| Return -1 on failure.
|
| format(...)
| S.format(*args, **kwargs) -> str
|
| Return a formatted version of S, using substitutions from args and kwargs.
| The substitutions are identified by braces ('{' and '}').
|
| format_map(...)
| S.format_map(mapping) -> str
|
| Return a formatted version of S, using substitutions from mapping.
| The substitutions are identified by braces ('{' and '}').
|
| index(...)
| S.index(sub[, start[, end]]) -> int
|
| Return the lowest index in S where substring sub is found,
| such that sub is contained within S[start:end]. Optional
| arguments start and end are interpreted as in slice notation.
|
| Raises ValueError when the substring is not found.
|
| isalnum(self, /)
| Return True if the string is an alpha-numeric string, False otherwise.
|
| A string is alpha-numeric if all characters in the string are alpha-numeric and
| there is at least one character in the string.
|
| isalpha(self, /)
| Return True if the string is an alphabetic string, False otherwise.
|
| A string is alphabetic if all characters in the string are alphabetic and there
| is at least one character in the string.
|
| isascii(self, /)
| Return True if all characters in the string are ASCII, False otherwise.
|
| ASCII characters have code points in the range U+0000-U+007F.
| Empty string is ASCII too.
|
| isdecimal(self, /)
| Return True if the string is a decimal string, False otherwise.
|
| A string is a decimal string if all characters in the string are decimal and
| there is at least one character in the string.
|
| isdigit(self, /)
| Return True if the string is a digit string, False otherwise.
|
| A string is a digit string if all characters in the string are digits and there
| is at least one character in the string.
|
| isidentifier(self, /)
| Return True if the string is a valid Python identifier, False otherwise.
|
| Call keyword.iskeyword(s) to test whether string s is a reserved identifier,
| such as "def" or "class".
|
| islower(self, /)
| Return True if the string is a lowercase string, False otherwise.
|
| A string is lowercase if all cased characters in the string are lowercase and
| there is at least one cased character in the string.
|
| isnumeric(self, /)
| Return True if the string is a numeric string, False otherwise.
|
| A string is numeric if all characters in the string are numeric and there is at
| least one character in the string.
|
| isprintable(self, /)
| Return True if the string is printable, False otherwise.
|
| A string is printable if all of its characters are considered printable in
| repr() or if it is empty.
|
| isspace(self, /)
| Return True if the string is a whitespace string, False otherwise.
|
| A string is whitespace if all characters in the string are whitespace and there
| is at least one character in the string.
|
| istitle(self, /)
| Return True if the string is a title-cased string, False otherwise.
|
| In a title-cased string, upper- and title-case characters may only
| follow uncased characters and lowercase characters only cased ones.
|
| isupper(self, /)
| Return True if the string is an uppercase string, False otherwise.
|
| A string is uppercase if all cased characters in the string are uppercase and
| there is at least one cased character in the string.
|
| join(self, iterable, /)
| Concatenate any number of strings.
|
| The string whose method is called is inserted in between each given string.
| The result is returned as a new string.
|
| Example: '.'.join(['ab', 'pq', 'rs']) -> 'ab.pq.rs'
|
| ljust(self, width, fillchar=' ', /)
| Return a left-justified string of length width.
|
| Padding is done using the specified fill character (default is a space).
|
| lower(self, /)
| Return a copy of the string converted to lowercase.
|
| lstrip(self, chars=None, /)
| Return a copy of the string with leading whitespace removed.
|
| If chars is given and not None, remove characters in chars instead.
|
| partition(self, sep, /)
| Partition the string into three parts using the given separator.
|
| This will search for the separator in the string. If the separator is found,
| returns a 3-tuple containing the part before the separator, the separator
| itself, and the part after it.
|
| If the separator is not found, returns a 3-tuple containing the original string
| and two empty strings.
|
| removeprefix(self, prefix, /)
| Return a str with the given prefix string removed if present.
|
| If the string starts with the prefix string, return string[len(prefix):].
| Otherwise, return a copy of the original string.
|
| removesuffix(self, suffix, /)
| Return a str with the given suffix string removed if present.
|
| If the string ends with the suffix string and that suffix is not empty,
| return string[:-len(suffix)]. Otherwise, return a copy of the original
| string.
|
| replace(self, old, new, count=-1, /)
| Return a copy with all occurrences of substring old replaced by new.
|
| count
| Maximum number of occurrences to replace.
| -1 (the default value) means replace all occurrences.
|
| If the optional argument count is given, only the first count occurrences are
| replaced.
|
| rfind(...)
| S.rfind(sub[, start[, end]]) -> int
|
| Return the highest index in S where substring sub is found,
| such that sub is contained within S[start:end]. Optional
| arguments start and end are interpreted as in slice notation.
|
| Return -1 on failure.
|
| rindex(...)
| S.rindex(sub[, start[, end]]) -> int
|
| Return the highest index in S where substring sub is found,
| such that sub is contained within S[start:end]. Optional
| arguments start and end are interpreted as in slice notation.
|
| Raises ValueError when the substring is not found.
|
| rjust(self, width, fillchar=' ', /)
| Return a right-justified string of length width.
|
| Padding is done using the specified fill character (default is a space).
|
| rpartition(self, sep, /)
| Partition the string into three parts using the given separator.
|
| This will search for the separator in the string, starting at the end. If
| the separator is found, returns a 3-tuple containing the part before the
| separator, the separator itself, and the part after it.
|
| If the separator is not found, returns a 3-tuple containing two empty strings
| and the original string.
|
| rsplit(self, /, sep=None, maxsplit=-1)
| Return a list of the words in the string, using sep as the delimiter string.
|
| sep
| The delimiter according which to split the string.
| None (the default value) means split according to any whitespace,
| and discard empty strings from the result.
| maxsplit
| Maximum number of splits to do.
| -1 (the default value) means no limit.
|
| Splits are done starting at the end of the string and working to the front.
|
| rstrip(self, chars=None, /)
| Return a copy of the string with trailing whitespace removed.
|
| If chars is given and not None, remove characters in chars instead.
|
| split(self, /, sep=None, maxsplit=-1)
| Return a list of the words in the string, using sep as the delimiter string.
|
| sep
| The delimiter according which to split the string.
| None (the default value) means split according to any whitespace,
| and discard empty strings from the result.
| maxsplit
| Maximum number of splits to do.
| -1 (the default value) means no limit.
|
| splitlines(self, /, keepends=False)
| Return a list of the lines in the string, breaking at line boundaries.
|
| Line breaks are not included in the resulting list unless keepends is given and
| true.
|
| startswith(...)
| S.startswith(prefix[, start[, end]]) -> bool
|
| Return True if S starts with the specified prefix, False otherwise.
| With optional start, test S beginning at that position.
| With optional end, stop comparing S at that position.
| prefix can also be a tuple of strings to try.
|
| strip(self, chars=None, /)
| Return a copy of the string with leading and trailing whitespace removed.
|
| If chars is given and not None, remove characters in chars instead.
|
| swapcase(self, /)
| Convert uppercase characters to lowercase and lowercase characters to uppercase.
|
| title(self, /)
| Return a version of the string where each word is titlecased.
|
| More specifically, words start with uppercased characters and all remaining
| cased characters have lower case.
|
| translate(self, table, /)
| Replace each character in the string using the given translation table.
|
| table
| Translation table, which must be a mapping of Unicode ordinals to
| Unicode ordinals, strings, or None.
|
| The table must implement lookup/indexing via __getitem__, for instance a
| dictionary or list. If this operation raises LookupError, the character is
| left untouched. Characters mapped to None are deleted.
|
| upper(self, /)
| Return a copy of the string converted to uppercase.
|
| zfill(self, width, /)
| Pad a numeric string with zeros on the left, to fill a field of the given width.
|
| The string is never truncated.
|
| ----------------------------------------------------------------------
| Static methods defined here:
|
| __new__(*args, **kwargs) from builtins.type
| Create and return a new object. See help(type) for accurate signature.
|
| maketrans(...)
| Return a translation table usable for str.translate().
|
| If there is only one argument, it must be a dictionary mapping Unicode
| ordinals (integers) or characters to Unicode ordinals, strings or None.
| Character keys will be then converted to ordinals.
| If there are two arguments, they must be strings of equal length, and
| in the resulting dictionary, each character in x will be mapped to the
| character at the same position in y. If there is a third argument, it
| must be a string, whose characters will be mapped to None in the result.
Data types
Python also includes data types for representing other types of data, including numbers or collections of data. The core Python data types are introduced in the table below. We’ll talk more about some of these as we encounter them in the lesson:
Type | Definition | Example |
---|---|---|
str | Character string | "hello world" |
int | Integer number | 42 |
float | Approximate (floating-point) number | 0.406 |
bool | Stores True or False values | True or False |
list | Sequence that can be modified | ["a", "b", "c"] |
tuple | Sequence that cannot be modified | ("a", "b", "c") |
dict | Mapping of keys to values | {"DC": "District of Columbia", "MD": "Maryland", "VA": "Virginia"} |
set | Collection of unique values | {"1", "2", 1} |
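As a quick sketch of how a few of these look in practice (the variable names here are our own), we can create objects of each type and inspect them with type():
PYTHON
count = 42        # int
fraction = 0.406  # float
states = {"DC": "District of Columbia", "MD": "Maryland"}  # dict
type(states)
OUTPUT
dict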
Libraries
Python includes a number of built-in functions that are available wherever Python is installed. See the table below for some examples.
Examples of built-in functions
Name | Description | Example |
---|---|---|
abs() | Gets the absolute value of a number | abs(-1.5) # returns 1.5 |
max() | Gets the highest value in a sequence | max([1, 2, 3]) # returns 3 |
min() | Gets the lowest value in a sequence | min([1, 2, 3]) # returns 1 |
round() | Rounds a number to the nearest integer | round(5.4) # returns 5 |
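Here is a short sketch showing these built-in functions in action, using values of our own choosing:
PYTHON
print(abs(-1.5))       # absolute value
print(max([1, 2, 3]))  # highest value
print(min([1, 2, 3]))  # lowest value
print(round(5.4))      # nearest integer
OUTPUT
1.5
3
1
5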
Python also includes a number of built-in libraries. A library bundles functions and other code related to a single task or data type, simplifying common tasks. By using a common code base, a library allows coders to work more quickly and with fewer errors.
The libraries built into Python are referred to as the Python Standard Library. They can be accessed through a typical Python installation and do not require any additional downloads. A few examples are included in the table below, but as with the table of built-in functions, there are more where those came from.
Examples from the Python Standard Library
Library | Description |
---|---|
datetime | Reads, writes, and analyzes dates and times |
os | Creates, manipulates, and gets information about files and paths |
random | Generates pseudo-random numbers |
Documentation
The documentation for each library can be found on the official Python website. Documentation is an invaluable resource. It provides descriptions of what a library does and detailed information about how it can be used, often including examples.
Unlike the built-in functions, we must import a library before we can use it. We do so using the import statement:
PYTHON
import datetime
Once imported, the library is available for use anywhere in the current document. When using a library, we must include the name of the library to access its functions. For example, to create a datetime.date object (that is, a date object defined by the datetime library), we include both the library and class name:
PYTHON
date = datetime.date(1970, 1, 1)
As with the built-in types, the datetime.date object includes its own suite of attributes and methods. We can, for example, use the year attribute to get the year:
PYTHON
date.year
OUTPUT
1970
Or convert the date to a string using the strftime() method and date format codes. Here, %Y corresponds to YYYY, %m to MM, and %d to DD.
PYTHON
"%Y-%m-%d") date.strftime(
OUTPUT
'1970-01-01'
Like Python itself, the Python Standard Library is maintained by the Python Software Foundation. The built-in libraries are limited to a relatively small set of operations expected to be useful to a broad population of users. However, Python allows users to create their own libraries to perform actions that are beyond the scope of core Python. The rest of this lesson will focus on an external library called pandas.
Key Points
- Python is a widely used language that can be used for a variety of tasks, including analyzing data
- Python uses different data types to handle text, numbers, collections, and other kinds of data
- Assign values to variables using the = operator
- Use functions and methods to perform specific actions
- Python’s functionality can be extended using libraries, including libraries written by members of the community that address discipline-specific needs
- Use the help() function and developer documentation to learn more about Python and Python libraries
Content from Introducing pandas
Last updated on 2022-07-18
Estimated time: 60 minutes
Overview
Questions
- What data will we be working with in this lesson?
- What is pandas?
- Why use pandas for data analysis?
- How do we read and write data using pandas?
Objectives
- Learn about the dataset we’ll be working with
- Look at the benefits of using pandas to analyze data
- Import data from a CSV into a pandas dataframe
- Learn how pandas handles different types of data
- Write a dataframe to a CSV
Dataset description
The dataset description was taken from Data Management with SQL for Ecologists (CC-BY-4.0)
The data we will be using is a time-series for a small mammal community in southern Arizona. This is part of a project studying the effects of rodents and ants on the plant community that has been running for almost 40 years. The rodents are sampled on a series of 24 plots, with different experimental manipulations controlling which rodents are allowed to access which plots.
This is a real dataset that has been used in over 100 publications. We’ve simplified it for the workshop, but you can download the full dataset and work with it using exactly the same tools we’ll learn about today.
Answering questions using data
Let’s look at some of the cleaned spreadsheets we downloaded during Setup to complete this challenge. We’ll need the following two files:
- surveys.csv
- species.csv
Challenge
Open each of these CSV files and explore them. What information is contained in each file? Specifically, if we had the following research questions:
- How has the hindfoot length and weight of Dipodomys species changed over time?
- What is the average weight of each species, per year?
- What information can I learn about Dipodomys species in the 2000s, over time?
What would we need to answer these questions? Which files have the data we need? What operations would we need to perform if we were doing these analyses by hand?
Hint: We can view CSV files in JupyterLab using the left sidebar. Click on the Folder icon in the top left of the sidebar to see the files, then go to the data directory to see the CSVs we’ve downloaded for this lesson.
In order to answer the questions described above, we’ll need to do the following basic data operations:
- select subsets of the data (rows and columns)
- group subsets of data
- do math and other calculations
- combine data across spreadsheets
In addition, we don’t want to do this manually! Instead of searching for the right pieces of data ourselves, or clicking between spreadsheets, or manually sorting columns, we want to make the computer do the work.
In particular, we want to use a tool where it’s easy to repeat our analysis in case our data changes. We also want to do all this searching without actually modifying our source data.
Why use pandas?
The Python Data Analysis Library, or pandas, is a Python library used to work with dataframes. A dataframe is a representation of tabular data very similar to a spreadsheet, consisting of rows (representing records) and columns (representing fields or variables). It is a very common format for representing scientific data and is likely to be very familiar to anyone taking this course.
pandas offers the same advantages as any well-written library: It creates a common codebase for working on a single task, in this case, analyzing data. Some benefits of this approach include:
- Reliability: Provides flexible, well-tested methods for reading, querying, aggregating, grouping, and plotting data
- Repeatability: Repeat the same analyses when data is added or changed
- Speed: Faster in many cases than coding our own functions in Python
- Reproducibility: Document and share code in narrative form using tools like Jupyter notebooks
- Community: Access a large, active community for help when we run into problems
Importing data using pandas
Unlike the libraries in the Python Standard Library discussed in the previous lesson, pandas is not part of the typical Python installation. Once it has been installed, however, it can be accessed using the same import command used to import the built-in libraries. By convention, pandas is imported using the alias “pd”. We can assign the alias using the as keyword:
PYTHON
import pandas as pd
Using aliases
Why use an alias? We will refer to pandas many, many times when writing a script, so it’s useful to abbreviate the name and save some keystrokes. But it’s also a matter of consistency with the larger community. Many of the core scientific Python libraries, including pandas, recommend a specific alias, so most code shared online will use those aliases as well.
Now that pandas has been imported, we can access the function we need to load data from a CSV, pd.read_csv(). The function call has three parts:
- The name (or in this case, alias) of the object that defines the function. This can be a library or any other object. It can also be omitted in some cases (for example, when using a function built into Python).
- The name of the method we’d like to use
- A set of parentheses that tells the function to run.
Many functions define parameters that allow the user to modify the behavior of the function. Parameters may be positional or named. In Python, data passed to positional parameters are called arguments (often abbreviated as args), and data passed to named parameters are called keyword arguments (often abbreviated as kwargs). In either case, the arguments are included inside the parentheses used to call the function.
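As a minimal illustration of the difference (using the built-in round() function rather than pandas), the first value below is passed as a positional argument and ndigits is passed as a keyword argument:
PYTHON
round(3.14159, ndigits=2)
OUTPUT
3.14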
Below, we will pass a single argument to pd.read_csv(): a string that tells the function where to find the surveys.csv file.
PYTHON
"data/surveys.csv") pd.read_csv(
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN |
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN |
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
… | … | … | … | … | … | … | … | … | … |
35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 |
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN |
35549 rows × 9 columns
Here are a few things to observe about how the dataframe is structured:
- By default, JupyterLab displays the first and last five rows of the dataframe
- Each row represents a record
- Each column represents a field
- The unlabeled column on the far left is called the row label
- pandas has done a lot of work behind the scenes when reading the data, including:
- Assigning the row index as the row label
- Assigning each column a data type based on its contents
- Assigning certain cells the value NaN, which stands for “not a number” and is used to designate null values in the dataset. Here, those cells represent blank cells in the spreadsheet.
Much of this behavior can be controlled when the spreadsheet is first read by using keyword arguments. For example, to force pd.read_csv() to use the existing record_id column as the row label, use the index_col keyword argument:
PYTHON
"data/surveys.csv", index_col="record_id") pd.read_csv(
month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|
record_id | ||||||||
1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN |
2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN |
3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
… | … | … | … | … | … | … | … | … |
35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 |
35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 |
35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN |
35549 rows × 8 columns
Now record_id appears on the far left, indicating that pandas is using that column as the row label.
Challenge
Run help() on the pd.read_csv() function (note that you should omit the trailing parentheses). Alternatively, take a look at the much prettier documentation provided by the pandas developers. Based on that information, answer the following questions:
- What parameter would you use to control how null values are interpreted?
- What values are interpreted as NaN by default?
The na_values parameter controls how null values are interpreted.
In addition to empty cells, the following values are interpreted as NaN, or null, by pandas. These strings may look familiar from programs like Excel.
- #N/A
- #N/A N/A
- #NA
- -1.#IND
- -1.#QNAN
- -NaN
- -nan
- 1.#IND
- 1.#QNAN
- <NA>
- N/A
- NA
- NULL
- NaN
- n/a
- nan
- null
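We can extend this list using the na_values parameter itself. For example, if our data used a nonstandard placeholder (the “missing” value below is hypothetical; surveys.csv does not actually contain it), we could tell pandas to treat it as null:
PYTHON
pd.read_csv("data/surveys.csv", na_values=["missing"])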
Assigning to variables
We can assign the dataframe to a variable to make it easier to access. As we saw in the previous lesson, we use a single equals sign to assign an object to a variable. The variable name should be short and descriptive. By convention, variable names in Python use snake_case (that is, lower case with individual words separated by underscores).
PYTHON
= pd.read_csv("data/surveys.csv")
surveys surveys
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN |
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN |
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
… | … | … | … | … | … | … | … | … | … |
35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 |
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN |
35549 rows × 9 columns
Understanding data types in pandas
We can use the info() method to see how pandas interpreted each column in the dataset. This method gives us the name, count, and data type of each column and provides some information about the dataset as a whole (for example, memory usage, which is helpful to know when working with large datasets):
PYTHON
surveys.info()
OUTPUT
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35549 entries, 0 to 35548
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 record_id 35549 non-null int64
1 month 35549 non-null int64
2 day 35549 non-null int64
3 year 35549 non-null int64
4 plot_id 35549 non-null int64
5 species_id 34786 non-null object
6 sex 33038 non-null object
7 hindfoot_length 31438 non-null float64
8 weight 32283 non-null float64
dtypes: float64(2), int64(5), object(2)
memory usage: 2.4+ MB
Note that the data types used by pandas look a little different than the ones used by Python itself. See the table below for the rough equivalents.
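pandas type | Python type | Description |
---|---|---|
object | str | Text or mixed values |
int64 | int | Integer numbers |
float64 | float | Approximate (floating-point) numbers |
bool | bool | True or False values |
datetime64 | datetime | Dates and times |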
Saving a dataframe
When analyzing a dataset, we’ll often want to save our work to a file. The to_csv() method can be used to write a dataframe to a CSV. As when we read a CSV from a file above, we need to provide a path to which to save the file. The example below also includes the index keyword argument. Setting that parameter to False tells pandas not to include the row label when writing the CSV.
PYTHON
"data/surveys_mod.csv", index=False) surveys.to_csv(
This is far from the only option for writing data: pandas supports a variety of file types for both reading and writing. Try searching for “read_” in the pandas API reference to see the supported formats.
Key Points
- This lesson uses real data from a decades-long survey of rodents in Arizona
- pandas is a data analysis library that allows users to read, manipulate, and view tabular data using Python
- pandas represents data as a dataframe consisting of rows (records) and columns (fields or variables)
- Read a dataframe from CSV using the pd.read_csv() function
- Write a dataframe to CSV using the to_csv() method
- The behavior of a function can be modified by including arguments and keyword arguments when the function is called
- pandas uses its own classes to represent text, numbers, booleans, and datetimes
Content from Accessing Data in a Dataframe
Last updated on 2022-07-18
Estimated time: 60 minutes
Overview
Questions
- How can we look at individual rows and columns in a dataframe?
- How can we perform calculations?
- How can we modify the table and data?
Objectives
- Access individual rows and columns
- Access multiple columns at once using a list
- Perform calculations like addition and subtraction
- Rename columns using a dictionary
- Access rows containing specific data
- Sort the data returned by a query
- Modify data using loc
We’ll begin by importing pandas and reading our CSV, as we did in the previous lesson:
PYTHON
import pandas as pd
= pd.read_csv("data/surveys.csv")
surveys surveys
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN |
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN |
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
… | … | … | … | … | … | … | … | … | … |
35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 |
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN |
35549 rows × 9 columns
We will now look at how to access rows and columns in the dataframe.
Getting columns
We can get the values from a single column by passing a string inside square brackets to the dataframe object. For example, to look at the year column, use:
PYTHON
"year"] surveys[
OUTPUT
0 1977
1 1977
2 1977
3 1977
4 1977
...
35544 2002
35545 2002
35546 2002
35547 2002
35548 2002
Name: year, Length: 35549, dtype: int64
A single column is returned as a Series, which is why the output for this cell is formatted differently than the other cells in this lesson. The main difference between a Series and a DataFrame is that a DataFrame can have multiple columns. The two classes share many but not all attributes and methods. Note also that this is a copy of data from the original dataframe. Changing values in the series will have no effect on the original dataframe.
Using lists to get more than one column at a time
Python uses the built-in list data type to store sequences, that is, an ordered list of objects. Any type of Python object can be stored in a list: strings, integers, floats, even other lists or collections. Once created, a list can be modified by appending, inserting, or deleting items. We will not be going into detail about these operations here, but as always you can learn more about how to use a list using help() or the Python docs.
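Here is a brief sketch of those operations on a throwaway list (the values are our own):
PYTHON
items = ["a", "b"]
items.append("c")     # add an item to the end
items.insert(0, "z")  # insert an item at an index
del items[1]          # delete the item at index 1 ("a")
items
OUTPUT
['z', 'b', 'c']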
Create a list using square brackets. Let’s create a list of the three columns in our dataframe that together give the date of an observation:
PYTHON
= ["year", "month", "day"] cols
When we pass this list to the survey dataframe using square brackets, as we did above, we retrieve a copy of the dataframe containing just those columns. Note that, when we get more than one column, pandas returns a dataframe:
PYTHON
surveys[cols]
year | month | day | |
---|---|---|---|
0 | 1977 | 7 | 16 |
1 | 1977 | 7 | 16 |
2 | 1977 | 7 | 16 |
3 | 1977 | 7 | 16 |
4 | 1977 | 7 | 16 |
… | … | … | … |
35544 | 2002 | 12 | 31 |
35545 | 2002 | 12 | 31 |
35546 | 2002 | 12 | 31 |
35547 | 2002 | 12 | 31 |
35548 | 2002 | 12 | 31 |
35549 rows × 3 columns
Getting rows
We can get the rows at the beginning of the table using the head() method:
PYTHON
surveys.head()
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN |
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN |
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
By default, this method returns the first five rows. We can provide a number inside the parentheses if we need a specific number of rows:
PYTHON
surveys.head(20)
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN |
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN |
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
5 | 6 | 7 | 16 | 1977 | 1 | PF | M | 14.0 | NaN |
6 | 7 | 7 | 16 | 1977 | 2 | PE | F | NaN | NaN |
7 | 8 | 7 | 16 | 1977 | 1 | DM | M | 37.0 | NaN |
8 | 9 | 7 | 16 | 1977 | 1 | DM | F | 34.0 | NaN |
9 | 10 | 7 | 16 | 1977 | 6 | PF | F | 20.0 | NaN |
10 | 11 | 7 | 16 | 1977 | 5 | DS | F | 53.0 | NaN |
11 | 12 | 7 | 16 | 1977 | 7 | DM | M | 38.0 | NaN |
12 | 13 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
13 | 14 | 7 | 16 | 1977 | 8 | DM | NaN | NaN | NaN |
14 | 15 | 7 | 16 | 1977 | 6 | DM | F | 36.0 | NaN |
15 | 16 | 7 | 16 | 1977 | 4 | DM | F | 36.0 | NaN |
16 | 17 | 7 | 16 | 1977 | 3 | DS | F | 48.0 | NaN |
17 | 18 | 7 | 16 | 1977 | 2 | PP | M | 22.0 | NaN |
18 | 19 | 7 | 16 | 1977 | 4 | PF | NaN | NaN | NaN |
19 | 20 | 7 | 17 | 1977 | 11 | DS | F | 48.0 | NaN |
The tail() method is similar, except it returns rows from the end of the table:
PYTHON
surveys.tail()
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 |
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN |
The head() and tail() methods are useful for getting a feel for how our data is structured, but we’ll also want to be able to look at specific rows. As when we selected columns above, we can use square brackets to extract slices from the dataframe. A slice is a subset of the dataframe starting at one row and ending at another. To get a slice, we pass the starting and ending indexes to the square brackets as start:end:
PYTHON
surveys[2:5]
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
There are three things to be aware of when slicing a dataframe:
- Row indexes are zero-based. That is, the first row has an index of 0, not 1.
- When slicing, the slice includes the start index but not the end index. In this case, that means the slice includes rows 2, 3, and 4 but not row 5.
- The row label can be different from the row index. They happen to be the same here, but don’t count on that being true.
Core Python types like list and tuple use the same conventions, as do most libraries that deal with sequences.
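For example, slicing a plain list with the same start:end notation behaves identically (the values here are our own):
PYTHON
letters = ["a", "b", "c", "d", "e"]
letters[2:5]  # includes indexes 2, 3, and 4 but not 5
OUTPUT
['c', 'd', 'e']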
Getting unique values
Recall that we can use square brackets to return a single column from a dataframe. We can get the list of unique values within that column using the unique() method:
PYTHON
"species_id"].unique() surveys[
OUTPUT
array(['NL', 'DM', 'PF', 'PE', 'DS', 'PP', 'SH', 'OT', 'DO', 'OX', 'SS',
'OL', 'RM', nan, 'SA', 'PM', 'AH', 'DX', 'AB', 'CB', 'CM', 'CQ',
'RF', 'PC', 'PG', 'PH', 'PU', 'CV', 'UR', 'UP', 'ZL', 'UL', 'CS',
'SC', 'BA', 'SF', 'RO', 'AS', 'SO', 'PI', 'ST', 'CU', 'SU', 'RX',
'PB', 'PL', 'PX', 'CT', 'US'], dtype=object)
To do the same across multiple columns, we can use the drop_duplicates() method on a copy of the dataframe containing only the columns we’re interested in. Like any well-named method, drop_duplicates() does exactly what the name implies: It returns a copy of the dataframe with all duplicate rows removed.
PYTHON
"plot_id", "species_id"]].drop_duplicates() surveys[[
plot_id | species_id | |
---|---|---|
0 | 2 | NL |
1 | 3 | NL |
2 | 2 | DM |
3 | 7 | DM |
4 | 3 | DM |
… | … | … |
35474 | 22 | SF |
35511 | 11 | US |
35519 | 9 | SF |
35527 | 13 | US |
35543 | 15 | US |
612 rows × 2 columns
Calculating values
The survey dataset includes two columns, hindfoot_length and weight, that are stored as numbers and represent measurements. We may want to perform calculations using numbers like these in our own data. We can do so using Python’s built-in mathematical operators, including:
- x + y for addition
- x - y for subtraction
- x * y for multiplication
- x / y for division
- x % y for calculating remainders
- x ** y for exponents
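These operators work on plain numbers exactly as they do on columns. A trivial sketch:
PYTHON
print(7 % 3)   # remainder of 7 divided by 3
print(2 ** 3)  # 2 raised to the power of 3
OUTPUT
1
8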
To make the examples in this section a little more useful, we’re going to remove all rows that contain null values using the dropna() method. This will filter out any rows that don’t have a valid hindfoot_length or weight, as well as those that have a null value in any other cell. (This is an inelegant solution to the problem of missing data. We’ll talk about more nuanced solutions later in the lesson.)
PYTHON
surveys_nona = surveys.dropna().copy()
surveys_nona
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
62 | 63 | 8 | 19 | 1977 | 3 | DM | M | 35.0 | 40.0 |
63 | 64 | 8 | 19 | 1977 | 7 | DM | M | 37.0 | 48.0 |
64 | 65 | 8 | 19 | 1977 | 4 | DM | F | 34.0 | 29.0 |
65 | 66 | 8 | 19 | 1977 | 4 | DM | F | 35.0 | 46.0 |
66 | 67 | 8 | 19 | 1977 | 7 | DM | M | 35.0 | 36.0 |
… | … | … | … | … | … | … | … | … | … |
35540 | 35541 | 12 | 31 | 2002 | 15 | PB | F | 24.0 | 31.0 |
35541 | 35542 | 12 | 31 | 2002 | 15 | PB | F | 26.0 | 29.0 |
35542 | 35543 | 12 | 31 | 2002 | 15 | PB | F | 27.0 | 34.0 |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 |
30676 rows × 9 columns
Suppose we want to convert the weight column from grams to milligrams. To do so, we can multiply that column by 1000:
PYTHON
"weight"] * 1000 surveys_nona[
OUTPUT
62 40000.0
63 48000.0
64 29000.0
65 46000.0
66 36000.0
...
35540 31000.0
35541 29000.0
35542 34000.0
35546 14000.0
35547 51000.0
Name: weight, Length: 30676, dtype: float64
To convert it to kilograms, we can divide by 1000:
PYTHON
"weight"] / 1000 surveys_nona[
OUTPUT
62 0.040
63 0.048
64 0.029
65 0.046
66 0.036
...
35540 0.031
35541 0.029
35542 0.034
35546 0.014
35547 0.051
Name: weight, Length: 30676, dtype: float64
Note that calculations do not modify the original dataset. If we want to retain the result, we have to assign it to a new column:
PYTHON
"weight_mg"] = surveys_nona["weight"] * 1000
surveys_nona[ surveys_nona
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | weight_mg | |
---|---|---|---|---|---|---|---|---|---|---|
62 | 63 | 8 | 19 | 1977 | 3 | DM | M | 35.0 | 40.0 | 40000.0 |
63 | 64 | 8 | 19 | 1977 | 7 | DM | M | 37.0 | 48.0 | 48000.0 |
64 | 65 | 8 | 19 | 1977 | 4 | DM | F | 34.0 | 29.0 | 29000.0 |
65 | 66 | 8 | 19 | 1977 | 4 | DM | F | 35.0 | 46.0 | 46000.0 |
66 | 67 | 8 | 19 | 1977 | 7 | DM | M | 35.0 | 36.0 | 36000.0 |
… | … | … | … | … | … | … | … | … | … | … |
35540 | 35541 | 12 | 31 | 2002 | 15 | PB | F | 24.0 | 31.0 | 31000.0 |
35541 | 35542 | 12 | 31 | 2002 | 15 | PB | F | 26.0 | 29.0 | 29000.0 |
35542 | 35543 | 12 | 31 | 2002 | 15 | PB | F | 27.0 | 34.0 | 34000.0 |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 | 14000.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 | 51000.0 |
30676 rows × 10 columns
We can also add, subtract, multiply, and divide columns, as in the following (admittedly nonsensical) calculation, which adds together the hindfoot_length and weight columns:
PYTHON
"hindfoot_length"] + surveys_nona["weight"] surveys_nona[
OUTPUT
62 75.0
63 85.0
64 63.0
65 81.0
66 71.0
...
35540 55.0
35541 55.0
35542 61.0
35546 29.0
35547 87.0
Length: 30676, dtype: float64
Renaming columns
The hindfoot_length and weight columns don’t specify a unit, which may get confusing if we want to perform unit conversions like the one above. Fortunately, dataframes allow us to rename existing columns using the rename() method.
The rename() method uses a dictionary (or dict) to map between the old and new column names. As with list above, the dict data type is built into Python, so we don’t need to import anything to use it. A dict maps keys to values. We can create one using curly braces:
PYTHON
= {"key1": "val1", "key2": "val2"}
dct dct
OUTPUT
{'key1': 'val1', 'key2': 'val2'}
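To read a value back out of the dictionary, pass its key inside square brackets:
PYTHON
dct["key1"]
OUTPUT
'val1'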
Dictionaries are a useful and highly flexible data type. As with list above, we’ll be giving them short shrift here, but you can learn more about them at the Python docs.
Here we’ll use a dict to specify how we want to rename our columns. The keys will be the current column names and the values the new column names. Note that we explicitly assign the result of rename() to the original variable; by default, rename() returns a copy of the dataframe rather than modifying the original.
PYTHON
# Create a dict that maps from the old to the new column name
cols = {
    "hindfoot_length": "hindfoot_length_mm",
    "weight": "weight_g",
}

# Assign the result of the rename method back to surveys_nona
surveys_nona = surveys_nona.rename(columns=cols)

# View the dataframe with the new column names
surveys_nona
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length_mm | weight_g | weight_mg | |
---|---|---|---|---|---|---|---|---|---|---|
62 | 63 | 8 | 19 | 1977 | 3 | DM | M | 35.0 | 40.0 | 40000.0 |
63 | 64 | 8 | 19 | 1977 | 7 | DM | M | 37.0 | 48.0 | 48000.0 |
64 | 65 | 8 | 19 | 1977 | 4 | DM | F | 34.0 | 29.0 | 29000.0 |
65 | 66 | 8 | 19 | 1977 | 4 | DM | F | 35.0 | 46.0 | 46000.0 |
66 | 67 | 8 | 19 | 1977 | 7 | DM | M | 35.0 | 36.0 | 36000.0 |
… | … | … | … | … | … | … | … | … | … | … |
35540 | 35541 | 12 | 31 | 2002 | 15 | PB | F | 24.0 | 31.0 | 31000.0 |
35541 | 35542 | 12 | 31 | 2002 | 15 | PB | F | 26.0 | 29.0 | 29000.0 |
35542 | 35543 | 12 | 31 | 2002 | 15 | PB | F | 27.0 | 34.0 | 34000.0 |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 | 14000.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 | 51000.0 |
30676 rows × 10 columns
Challenge
Create a dataframe that returns the year, month, day, species_id and weight in mg.
PYTHON
# Assign the weight in milligrams to the weight_mg column
surveys_nona["weight_mg"] = surveys_nona["weight_g"] * 1000

# Display a copy of surveys_nona with only the desired columns
surveys_nona[["year", "month", "day", "species_id", "weight_mg"]]
year | month | day | species_id | weight_mg | |
---|---|---|---|---|---|
62 | 1977 | 8 | 19 | DM | 40000.0 |
63 | 1977 | 8 | 19 | DM | 48000.0 |
64 | 1977 | 8 | 19 | DM | 29000.0 |
65 | 1977 | 8 | 19 | DM | 46000.0 |
66 | 1977 | 8 | 19 | DM | 36000.0 |
… | … | … | … | … | … |
35540 | 2002 | 12 | 31 | PB | 31000.0 |
35541 | 2002 | 12 | 31 | PB | 29000.0 |
35542 | 2002 | 12 | 31 | PB | 34000.0 |
35546 | 2002 | 12 | 31 | RM | 14000.0 |
35547 | 2002 | 12 | 31 | DO | 51000.0 |
30676 rows × 5 columns
Filtering data
pandas provides a variety of ways to filter a dataframe. For example, suppose we want to look at a specific species in the surveys_nona dataframe. We can view the rows matching a given species using the same square brackets we used above to select specific columns and rows. Here, however, instead of passing a value or list of values, we will pass a conditional expression.
A conditional expression is a statement that evaluates as either True or False. Conditional expressions often make use of comparison operators, for example:
- == for equals
- != for does not equal
- > for greater than
- >= for greater than or equal to
- < for less than
- <= for less than or equal to
Examples of conditional statements include:
- "a" == "b" evaluates False
- "a" != "b" evaluates True
- 3 > 4 evaluates False
Note that, when comparing strings, evaluations are case sensitive:
- "a" == "A" evaluates False
= for assignment, == for equality
Remember that, in Python, a single equal sign is used to assign values to variables. We’ve already used the assignment operator in this lesson, for example, when we created a new column in the dataframe.
To limit the dataframe to rows matching the species “DM”, include the conditional surveys_nona["species_id"] == "DM" inside the square brackets:
PYTHON
"species_id"] == "DM"] surveys_nona[surveys_nona[
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length_mm | weight_g | weight_mg | |
---|---|---|---|---|---|---|---|---|---|---|
62 | 63 | 8 | 19 | 1977 | 3 | DM | M | 35.0 | 40.0 | 40000.0 |
63 | 64 | 8 | 19 | 1977 | 7 | DM | M | 37.0 | 48.0 | 48000.0 |
64 | 65 | 8 | 19 | 1977 | 4 | DM | F | 34.0 | 29.0 | 29000.0 |
65 | 66 | 8 | 19 | 1977 | 4 | DM | F | 35.0 | 46.0 | 46000.0 |
66 | 67 | 8 | 19 | 1977 | 7 | DM | M | 35.0 | 36.0 | 36000.0 |
… | … | … | … | … | … | … | … | … | … | … |
35532 | 35533 | 12 | 31 | 2002 | 14 | DM | F | 36.0 | 48.0 | 48000.0 |
35533 | 35534 | 12 | 31 | 2002 | 14 | DM | M | 37.0 | 56.0 | 56000.0 |
35534 | 35535 | 12 | 31 | 2002 | 14 | DM | M | 37.0 | 53.0 | 53000.0 |
35535 | 35536 | 12 | 31 | 2002 | 14 | DM | F | 35.0 | 42.0 | 42000.0 |
35536 | 35537 | 12 | 31 | 2002 | 14 | DM | F | 36.0 | 46.0 | 46000.0 |
9727 rows × 10 columns
To limit our results to observations made in or after 2000, use:
PYTHON
"year"] >= 2000] surveys_nona[surveys_nona[
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length_mm | weight_g | weight_mg | |
---|---|---|---|---|---|---|---|---|---|---|
30158 | 30159 | 1 | 8 | 2000 | 1 | PP | F | 22.0 | 17.0 | 17000.0 |
30159 | 30160 | 1 | 8 | 2000 | 1 | DO | M | 35.0 | 53.0 | 53000.0 |
30160 | 30161 | 1 | 8 | 2000 | 1 | PP | F | 21.0 | 17.0 | 17000.0 |
30161 | 30162 | 1 | 8 | 2000 | 1 | DM | M | 36.0 | 50.0 | 50000.0 |
30162 | 30163 | 1 | 8 | 2000 | 1 | PP | M | 20.0 | 16.0 | 16000.0 |
… | … | … | … | … | … | … | … | … | … | … |
35540 | 35541 | 12 | 31 | 2002 | 15 | PB | F | 24.0 | 31.0 | 31000.0 |
35541 | 35542 | 12 | 31 | 2002 | 15 | PB | F | 26.0 | 29.0 | 29000.0 |
35542 | 35543 | 12 | 31 | 2002 | 15 | PB | F | 27.0 | 34.0 | 34000.0 |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 | 14000.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 | 51000.0 |
4947 rows × 10 columns
As with rename() above, each filtering operation returns a copy of the dataframe. We will look at how to make changes to the original dataframe at the end of this lesson.
Building more complex queries
We can combine conditionals using what are called bitwise operators:
- & : True if conditions on both sides of the operator are True (and)
- | : True if a condition on either side is True (or)
To return all observations of DM in or after 2000, we can combine the two conditionals we used previously. Note that, when joining conditionals using & or |, we must wrap each individual condition in parentheses. If we omit the parentheses, pandas will not perform the comparisons in the expected order.
PYTHON
"species_id"] == "DM") & (surveys_nona["year"] >= 2000)] surveys_nona[(surveys_nona[
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length_mm | weight_g | weight_mg | |
---|---|---|---|---|---|---|---|---|---|---|
30161 | 30162 | 1 | 8 | 2000 | 1 | DM | M | 36.0 | 50.0 | 50000.0 |
30178 | 30179 | 1 | 8 | 2000 | 12 | DM | M | 36.0 | 60.0 | 60000.0 |
30195 | 30196 | 1 | 8 | 2000 | 17 | DM | M | 37.0 | 52.0 | 52000.0 |
30196 | 30197 | 1 | 8 | 2000 | 17 | DM | F | 34.0 | 43.0 | 43000.0 |
30209 | 30210 | 1 | 8 | 2000 | 22 | DM | M | 38.0 | 56.0 | 56000.0 |
… | … | … | … | … | … | … | … | … | … | … |
35532 | 35533 | 12 | 31 | 2002 | 14 | DM | F | 36.0 | 48.0 | 48000.0 |
35533 | 35534 | 12 | 31 | 2002 | 14 | DM | M | 37.0 | 56.0 | 56000.0 |
35534 | 35535 | 12 | 31 | 2002 | 14 | DM | M | 37.0 | 53.0 | 53000.0 |
35535 | 35536 | 12 | 31 | 2002 | 14 | DM | F | 35.0 | 42.0 | 42000.0 |
35536 | 35537 | 12 | 31 | 2002 | 14 | DM | F | 36.0 | 46.0 | 46000.0 |
823 rows × 10 columns
Some column methods can also be used for filtering. One example is isin(), which is used to match a list of values. This method can be combined with other conditionals as above. The example below returns rows from 2000 or later with either “DM”, “DO”, or “DS” in the species_id column:
PYTHON
surveys_nona["species_id"].isin(["DM", "DO", "DS"]) & (surveys_nona["year"] >= 2000)
surveys_nona[ ]
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length_mm | weight_g | weight_mg | |
---|---|---|---|---|---|---|---|---|---|---|
30159 | 30160 | 1 | 8 | 2000 | 1 | DO | M | 35.0 | 53.0 | 53000.0 |
30161 | 30162 | 1 | 8 | 2000 | 1 | DM | M | 36.0 | 50.0 | 50000.0 |
30166 | 30167 | 1 | 8 | 2000 | 1 | DO | M | 35.0 | 41.0 | 41000.0 |
30170 | 30171 | 1 | 8 | 2000 | 2 | DO | M | 36.0 | 52.0 | 52000.0 |
30172 | 30173 | 1 | 8 | 2000 | 2 | DO | F | 35.0 | 54.0 | 54000.0 |
… | … | … | … | … | … | … | … | … | … | … |
35533 | 35534 | 12 | 31 | 2002 | 14 | DM | M | 37.0 | 56.0 | 56000.0 |
35534 | 35535 | 12 | 31 | 2002 | 14 | DM | M | 37.0 | 53.0 | 53000.0 |
35535 | 35536 | 12 | 31 | 2002 | 14 | DM | F | 35.0 | 42.0 | 42000.0 |
35536 | 35537 | 12 | 31 | 2002 | 14 | DM | F | 36.0 | 46.0 | 46000.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 | 51000.0 |
1229 rows × 10 columns
Sorting data
We can sort a dataframe using the sort_values() method. For this example, we’ll work from the subset defined above. First we need to assign that subset to a variable:
PYTHON
results = surveys_nona[
    surveys_nona["species_id"].isin(["DM", "DO", "DS"]) & (surveys_nona["year"] >= 2000)
]
Now we’ll sort the results by weight_g. To do so, pass that column name as an argument (that is, inside the trailing parentheses) to the sort_values() method:
PYTHON
"weight_g") results.sort_values(
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length_mm | weight_g | weight_mg | |
---|---|---|---|---|---|---|---|---|---|---|
30614 | 30615 | 4 | 30 | 2000 | 12 | DM | M | 34.0 | 18.0 | 18000.0 |
34790 | 34791 | 10 | 5 | 2002 | 17 | DM | M | 34.0 | 20.0 | 20000.0 |
32650 | 32651 | 9 | 22 | 2001 | 17 | DM | F | 33.0 | 21.0 | 21000.0 |
32782 | 32783 | 9 | 23 | 2001 | 14 | DM | M | 34.0 | 21.0 | 21000.0 |
32968 | 32969 | 10 | 14 | 2001 | 9 | DM | M | 32.0 | 22.0 | 22000.0 |
… | … | … | … | … | … | … | … | … | … | … |
35159 | 35160 | 11 | 10 | 2002 | 13 | DO | F | 36.0 | 65.0 | 65000.0 |
30289 | 30290 | 2 | 5 | 2000 | 1 | DO | F | 36.0 | 65.0 | 65000.0 |
31942 | 31943 | 4 | 21 | 2001 | 1 | DO | F | 37.0 | 68.0 | 68000.0 |
32041 | 32042 | 5 | 26 | 2001 | 1 | DO | F | 37.0 | 68.0 | 68000.0 |
31955 | 31956 | 4 | 21 | 2001 | 24 | DO | F | 36.0 | 76.0 | 76000.0 |
1229 rows × 10 columns
By default, rows are sorted in ascending order (smallest to largest). We can modify this behavior using the ascending keyword argument:
PYTHON
"weight_g", ascending=False) results.sort_values(
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length_mm | weight_g | weight_mg | |
---|---|---|---|---|---|---|---|---|---|---|
31955 | 31956 | 4 | 21 | 2001 | 24 | DO | F | 36.0 | 76.0 | 76000.0 |
32041 | 32042 | 5 | 26 | 2001 | 1 | DO | F | 37.0 | 68.0 | 68000.0 |
31942 | 31943 | 4 | 21 | 2001 | 1 | DO | F | 37.0 | 68.0 | 68000.0 |
35159 | 35160 | 11 | 10 | 2002 | 13 | DO | F | 36.0 | 65.0 | 65000.0 |
30289 | 30290 | 2 | 5 | 2000 | 1 | DO | F | 36.0 | 65.0 | 65000.0 |
… | … | … | … | … | … | … | … | … | … | … |
32957 | 32958 | 10 | 14 | 2001 | 9 | DM | F | 33.0 | 22.0 | 22000.0 |
32650 | 32651 | 9 | 22 | 2001 | 17 | DM | F | 33.0 | 21.0 | 21000.0 |
32782 | 32783 | 9 | 23 | 2001 | 14 | DM | M | 34.0 | 21.0 | 21000.0 |
34790 | 34791 | 10 | 5 | 2002 | 17 | DM | M | 34.0 | 20.0 | 20000.0 |
30614 | 30615 | 4 | 30 | 2000 | 12 | DM | M | 34.0 | 18.0 | 18000.0 |
1229 rows × 10 columns
We can sort on multiple fields at once by passing a list of column names. We can control how each column sorts by passing a list with the same number of values (that is, one value per column) to the ascending keyword. The cell below sorts the results first by species_id in descending order, then by weight_g in ascending order:
PYTHON
"species_id", "weight_g"], ascending=[False, True]) results.sort_values([
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length_mm | weight_g | weight_mg | |
---|---|---|---|---|---|---|---|---|---|---|
30687 | 30688 | 5 | 1 | 2000 | 8 | DO | M | 35.0 | 23.0 | 23000.0 |
30768 | 30769 | 6 | 3 | 2000 | 17 | DO | M | 35.0 | 24.0 | 24000.0 |
33339 | 33340 | 1 | 12 | 2002 | 12 | DO | M | 34.0 | 24.0 | 24000.0 |
34082 | 34083 | 5 | 16 | 2002 | 6 | DO | F | 33.0 | 24.0 | 24000.0 |
34096 | 34097 | 5 | 16 | 2002 | 6 | DO | M | 35.0 | 24.0 | 24000.0 |
… | … | … | … | … | … | … | … | … | … | … |
31858 | 31859 | 3 | 24 | 2001 | 2 | DM | F | 36.0 | 60.0 | 60000.0 |
32909 | 32910 | 10 | 14 | 2001 | 4 | DM | F | 37.0 | 60.0 | 60000.0 |
34604 | 34605 | 7 | 14 | 2002 | 9 | DM | F | 36.0 | 60.0 | 60000.0 |
34675 | 34676 | 9 | 8 | 2002 | 12 | DM | F | 36.0 | 60.0 | 60000.0 |
34126 | 34127 | 5 | 16 | 2002 | 9 | DM | F | 35.0 | 64.0 | 64000.0 |
1229 rows × 10 columns
As with the dataframe methods above, sort_values() returns a copy of the original dataframe and leaves the original untouched.
Challenge
Write a query that returns year, species_id, and weight in kg from the surveys_nona table, sorted with the largest weights at the top.
PYTHON
# Create a new column with weight in kg
surveys_nona["weight_kg"] = surveys_nona["weight_g"] / 1000

# Create a subset containing only year, species_id, and weight_kg
subset = surveys_nona[["year", "species_id", "weight_kg"]]

# Sort the subset by weight_kg
subset.sort_values("weight_kg", ascending=False)
year | species_id | weight_kg | |
---|---|---|---|
33048 | 2001 | NL | 0.280 |
12870 | 1987 | NL | 0.278 |
15458 | 1989 | NL | 0.275 |
2132 | 1979 | NL | 0.274 |
12728 | 1987 | NL | 0.270 |
… | … | … | … |
29905 | 1999 | PP | 0.004 |
9908 | 1985 | RM | 0.004 |
10438 | 1985 | RM | 0.004 |
8735 | 1983 | RM | 0.004 |
7083 | 1982 | PF | 0.004 |
30676 rows × 3 columns
Modifying data
We’ve already shown how to modify an existing dataframe by adding a new column. What if we want to modify existing cells instead? As we’ve seen, this can be a little tricky in pandas because most of its methods return a copy of the original dataframe. For example, we can get subsets of a dataframe using square brackets. The cell below returns the species_id column for rows 2 through 5:
PYTHON
2:6]["species_id"] surveys[
OUTPUT
2 DM
3 DM
4 DM
5 PF
Name: species_id, dtype: object
But trying to set new values using this syntax may not work as expected. Say we want to set the species_id column in those rows to a new value, “FD”. Try running the code in the cell below:
PYTHON
2:6]["species_id"] = "FD" surveys[
OUTPUT
C:\...\1234567890.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
surveys[2:6]["species_id"] = "FD"
You should have received a SettingWithCopyWarning after running that cell. This warning tells us that the data in the original dataframe has not been modified. This is because the square brackets returned a copy, meaning that any changes are reflected in the copy, not the original. We can verify that the original dataframe has not been changed by displaying the rows that would have been modified:
PYTHON
surveys[2:6]
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
5 | 6 | 7 | 16 | 1977 | 1 | PF | M | 14.0 | NaN |
Using loc to modify existing cells
One way to modify existing data in pandas is to use the loc attribute. This attribute allows you to extract and modify cells in a DataFrame using the following syntax: df.loc[row_indexer, col_indexer].
The row_indexer argument is used to select one or more rows. It can be:
- A row label (i.e., the bold column on the far left)
  - 0 returns the row with label 0
- A slice including multiple rows:
  - : returns all rows
  - :2 returns all rows from the beginning of the dataframe to the row labeled 2, inclusive
  - 2: returns all rows from the row labeled 2 to the end of the dataframe, inclusive
  - 2:5 returns all rows between those labeled 2 and 5, inclusive
- A conditional, as in the examples above
The col_indexer argument is used to select one or more columns. It will typically be a list of column names and can be omitted, in which case all columns will be returned.
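For example, here is a quick sketch of both forms (output omitted):
PYTHON
# Pass a list to select multiple columns by name
surveys.loc[2:5, ["species_id", "sex"]]

# Omit the column indexer to return all columns
surveys.loc[2:5]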
Row labels and indexes
The row labels in this dataframe happen to be numeric and aligned exactly to the row’s index (so for the first row both the index and the label are 0, for the second row both the index and the label are 1, etc.) This is often but not always true in pandas. For example, if we used record_id as the label, the row labels would be one-based and the row indexes would be zero-based.
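As a minimal sketch of the difference, we could relabel the rows using the set_index() method (the relabeled variable below is just for illustration):
PYTHON
# Use record_id as the row labels instead of the default 0, 1, 2, ...
relabeled = surveys.set_index("record_id")

# The first row, at position 0, now has the label 1
relabeled.loc[1]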
loc slicing behavior
Slices using loc are inclusive: rows matching both the start and end values are included in the returned slice. This differs from list slices, where the start but not the end value is included in the slice. loc works this way because it is looking at the row label, not the row index.
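To see the difference, compare a list slice with a loc slice over the same bounds (a quick sketch, with results as comments):
PYTHON
letters = ["a", "b", "c", "d", "e", "f"]
letters[2:5]      # ['c', 'd', 'e'] -- the end position is excluded

surveys.loc[2:5]  # rows labeled 2, 3, 4, and 5 -- the end label is included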
We’ll be making some changes to our data, so let’s work from a copy instead of modifying the original. Create a copy using the copy() method:
PYTHON
surveys_copy = surveys.copy()
To select a subset of rows and columns using loc, use:
PYTHON
2:5, "species_id"] surveys_copy.loc[
OUTPUT
2 DM
3 DM
4 DM
5 PF
Name: species_id, dtype: object
Unlike the methods earlier in the lesson, this is a view, not a copy, of the data in the surveys_copy dataframe. That means that the object returned by loc is live and can be used to change the original dataframe. We can now assign a new value to the species_id column in the matching rows of the original dataframe:
PYTHON
2:5, "species_id"] = "FD" surveys_copy.loc[
We can see that these changes are reflected in the original surveys_copy object:
PYTHON
1:6, "species_id"] surveys_copy.loc[
OUTPUT
1 NL
2 FD
3 FD
4 FD
5 FD
6 PE
Name: species_id, dtype: object
Slicing with iloc
pandas provides another indexer, iloc, that allows us to select and modify data using row and column indexes instead of labels. Learn more in the pandas documentation.
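As a brief sketch, the loc selection above could be written with iloc using positions instead of labels. Note that, unlike loc, iloc slices exclude the end position:
PYTHON
# Rows at positions 2-5 and the column at position 5 (species_id)
surveys_copy.iloc[2:6, 5]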
Key Points
- Use square brackets to access rows, columns, and specific cells
- Use operators like +, -, and / to perform arithmetic on rows and columns
- Store the results of calculations in a dataframe by adding a new column or overwriting an existing column
- Sort data, rename columns, and get unique values in a dataframe using methods provided by pandas
- By default, most dataframe operations return a copy of the original data
Content from Aggregating and Grouping Data
Last updated on 2022-07-18 | Edit this page
Estimated time: 60 minutes
Overview
Questions
- How do we calculate summary statistics?
- How do we group data?
- How do null values affect calculations?
Objectives
- Introduce aggregation calculations in pandas
- Introduce grouping in pandas
- Learn about how pandas handles null values
As always, we’ll begin by importing pandas and reading our CSV:
PYTHON
import pandas as pd

surveys = pd.read_csv("data/surveys.csv")
Aggregating data
Aggregation allows us to describe the data in our dataframe by calculating totals (like the number of records) and statistics (like the mean value of a column). pandas allows us to run these calculations on dataframes or subsets.
Suppose we need to know how many records are in our dataset. We’ve already seen that we can use the info() method to get high-level information about the dataset, including the number of entries. What if we just wanted the number of rows? One approach is to use the built-in function len(), which is used to calculate the number of items in an object (for example, the number of characters in a string or the number of items in a list). When used on a dataframe, len() returns the number of rows:
PYTHON
len(surveys)
OUTPUT
35549
pandas provides a suite of aggregation methods that go beyond this simple case. For example, we can count the number of non-NaN values in each column using count():
PYTHON
surveys.count()
OUTPUT
record_id 35549
month 35549
day 35549
year 35549
plot_id 35549
species_id 34786
sex 33038
hindfoot_length 31438
weight 32283
dtype: int64
Or we can find out the total weight of all individuals in grams using sum():
PYTHON
"weight"].sum() surveys[
OUTPUT
1377594.0
Other aggregation methods supported by pandas include min(), max(), and mean(). These methods all ignore NaNs, so missing data does not affect these calculations.
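We can see this by comparing the default behavior to the skipna keyword accepted by these methods (a quick sketch, with results as comments):
PYTHON
surveys["weight"].mean()              # 42.672..., NaNs ignored by default
surveys["weight"].mean(skipna=False)  # nan, because NaNs are included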
Challenge
Calculate the total weight, average weight, minimum and maximum weights for all animals caught over the duration of the survey. Can you modify your code so that it outputs these values only for weights between 5 and 10 grams?
PYTHON
# Create a subset of only the animals between 5 and 10 grams
weights = surveys[(surveys["weight"] > 5) & (surveys["weight"] < 10)]["weight"]

# Display aggregation calculations using a dict
{
    "sum": weights.sum(),
    "mean": weights.mean(),
    "min": weights.min(),
    "max": weights.max(),
}
OUTPUT
{'sum': 16994.0, 'mean': 7.91523055426176, 'min': 6.0, 'max': 9.0}
To quickly generate summary statistics, we can use the describe() method instead of calling each aggregation method separately. When we use this method on a dataframe, it calculates stats for all columns with numeric data:
PYTHON
surveys.describe()
record_id | month | day | year | plot_id | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|
count | 35549.000000 | 35549.000000 | 35549.000000 | 35549.000000 | 35549.000000 | 31438.000000 | 32283.000000 |
mean | 17775.000000 | 6.477847 | 15.991195 | 1990.475231 | 11.397001 | 29.287932 | 42.672428 |
std | 10262.256696 | 3.396925 | 8.257366 | 7.493355 | 6.799406 | 9.564759 | 36.631259 |
min | 1.000000 | 1.000000 | 1.000000 | 1977.000000 | 1.000000 | 2.000000 | 4.000000 |
25% | 8888.000000 | 4.000000 | 9.000000 | 1984.000000 | 5.000000 | 21.000000 | 20.000000 |
50% | 17775.000000 | 6.000000 | 16.000000 | 1990.000000 | 11.000000 | 32.000000 | 37.000000 |
75% | 26662.000000 | 10.000000 | 23.000000 | 1997.000000 | 17.000000 | 36.000000 | 48.000000 |
max | 35549.000000 | 12.000000 | 31.000000 | 2002.000000 | 24.000000 | 70.000000 | 280.000000 |
You can see that describe() isn’t picky: It includes both ID and date columns in its results. Notice also that counts differ between columns. This is because count() only counts non-NaN rows.
If desired, we can also describe a single column at a time:
PYTHON
"weight"].describe() surveys[
OUTPUT
count 32283.000000
mean 42.672428
std 36.631259
min 4.000000
25% 20.000000
50% 37.000000
75% 48.000000
max 280.000000
Name: weight, dtype: float64
If we need more control over the output (for example, if we want to calculate the total weight of all animals, as in the challenge above), pandas provides the agg() method, which allows us to specify methods by column. The argument passed to this method is a dict. Each key must be a column name and each value a list of the names of aggregation methods. To calculate the total weight, mean weight, and mean hindfoot length of all records in the survey, we can use:
PYTHON
"weight": ["sum", "mean"], "hindfoot_length": ["mean"]}) surveys.agg({
weight | hindfoot_length | |
---|---|---|
sum | 1.377594e+06 | NaN |
mean | 4.267243e+01 | 29.287932 |
Grouping data
Now, let’s find out how many individuals were counted for each species. We do this using groupby(), which creates an object similar to a dataframe where rows are grouped by the data in one or more columns. To group by species_id, use:
PYTHON
grouped = surveys.groupby("species_id")
When we aggregate grouped data, pandas makes separate calculations for each member of the group. In the example below, we’ll calculate the number of times each species appears in the dataset. Rather than outputting the full dataframe, we’ll limit the count to a single column. Because count ignores NaN cells, it’s good practice to use a column that does not contain nulls. Record ID fields are a good choice because they make it clear that we are counting rows. The fields used to group the data also work.
PYTHON
"species_id"].count() grouped[
OUTPUT
species_id
AB 303
AH 437
AS 2
BA 46
CB 50
CM 13
CQ 16
CS 1
CT 1
CU 1
CV 1
DM 10596
DO 3027
DS 2504
DX 40
NL 1252
OL 1006
OT 2249
OX 12
PB 2891
PC 39
PE 1299
PF 1597
PG 8
PH 32
PI 9
PL 36
PM 899
PP 3123
PU 5
PX 6
RF 75
RM 2609
RO 8
RX 2
SA 75
SC 1
SF 43
SH 147
SO 43
SS 248
ST 1
SU 5
UL 4
UP 8
UR 10
US 4
ZL 2
Name: species_id, dtype: int64
To group by multiple columns, we can pass a list to groupby():
PYTHON
"species_id", "year"])["record_id"].count() surveys.groupby([
OUTPUT
species_id year
AB 1980 5
1981 7
1982 34
1983 41
1984 12
..
UR 1994 1
2001 2
2002 1
US 2002 4
ZL 1988 2
Name: record_id, Length: 509, dtype: int64
Challenge
Write statements that return:
- How many individuals were counted in each year in total
- How many were counted each year, for each different species
- The average weights of each species in each year
- How many individuals were counted for each species that was observed more than 10 times
Can you get the answer to both 2 and 3 in a single query?
How many individuals were counted in each year in total?
PYTHON
# Individual counts per year
"year")["record_id"].count() surveys.groupby(
OUTPUT
year
1977 503
1978 1048
1979 719
1980 1415
1981 1472
1982 1978
1983 1673
1984 981
1985 1438
1986 942
1987 1671
1988 1469
1989 1569
1990 1311
1991 1347
1992 1038
1993 750
1994 668
1995 1222
1996 1706
1997 2493
1998 1610
1999 1135
2000 1552
2001 1610
2002 2229
Name: record_id, dtype: int64
How many individuals were counted each year, for each different species?
PYTHON
# Individual counts by species and year
result = surveys.groupby(["year", "species_id"])["record_id"].count()

# Use the option_context context manager to show all rows
with pd.option_context("display.max_rows", None):
    print(result)
OUTPUT
year species_id
1977 DM 264
DO 12
DS 98
NL 31
OL 10
OT 17
OX 7
PE 6
PF 31
PP 7
RM 2
SH 1
SS 1
1978 AH 1
DM 389
DO 22
DS 320
NL 48
OL 47
OT 66
OX 1
PE 17
PF 40
PM 3
PP 24
RM 3
SA 2
SH 1
SS 8
1979 DM 209
DO 31
DS 204
NL 30
OL 46
OT 68
PE 17
PF 19
PM 6
PP 20
RM 6
SS 2
1980 AB 5
CB 1
CM 12
CQ 2
DM 493
DO 89
DS 346
DX 1
NL 57
OL 85
OT 96
PE 20
PF 95
PM 1
PP 17
RM 46
SS 9
1981 AB 7
CB 1
CQ 4
DM 559
DO 123
DS 354
NL 63
OL 55
OT 65
PE 43
PF 59
PM 16
PP 20
RM 29
SA 11
SS 8
1982 AB 34
AH 1
CQ 8
DM 609
DO 117
DS 354
NL 111
OL 83
OT 94
OX 1
PC 2
PE 107
PF 152
PM 72
PP 32
RF 2
RM 154
SA 17
SH 2
SS 12
1983 AB 41
AH 2
CB 3
DM 528
DO 167
DS 280
DX 1
NL 98
OL 65
OT 86
PC 3
PE 38
PF 114
PG 1
PM 42
PP 32
RM 127
SA 10
SH 2
SS 12
1984 AB 12
AH 2
CB 1
CQ 2
DM 396
DO 108
DS 76
NL 64
OL 71
OT 49
OX 1
PE 20
PF 7
PH 7
PM 12
PP 26
PU 2
RM 60
SA 20
SS 15
1985 AB 14
AH 14
CB 1
DM 667
DO 225
DS 98
NL 45
OL 94
OT 62
PE 51
PF 1
PH 3
PM 17
PP 35
RM 72
SA 3
SS 14
1986 AB 5
AH 19
CB 4
DM 406
DO 115
DS 88
DX 1
NL 60
OL 43
OT 25
PE 26
PF 2
PH 1
PM 34
PP 26
RM 53
SA 1
SH 1
SS 12
1987 AB 35
AH 41
CB 5
CM 1
CV 1
DM 469
DO 213
DS 104
NL 128
OL 79
OT 43
PC 5
PE 142
PF 4
PM 133
PP 45
RM 189
SA 1
SH 5
SS 20
UP 2
UR 3
1988 AB 39
AH 26
CB 6
DM 365
DO 144
DS 54
DX 1
NL 102
OL 51
OT 86
PE 190
PF 2
PG 2
PH 1
PM 54
PP 53
RF 11
RM 211
SA 1
SH 31
SS 20
UL 1
UP 3
UR 1
ZL 2
1989 AB 31
AH 30
BA 3
CB 1
CS 1
DM 321
DO 119
DS 33
NL 67
OL 57
OT 102
OX 1
PC 1
PE 177
PF 13
PH 2
PM 27
PP 45
RF 49
RM 398
SA 3
SC 1
SF 5
SH 55
SS 18
UP 2
1990 AB 27
AH 22
BA 11
DM 462
DO 171
DS 17
NL 29
OL 25
OT 77
PC 8
PE 75
PF 51
PG 4
PH 7
PP 34
RF 12
RM 231
SF 18
SH 9
SS 5
UP 1
1991 AB 15
AH 21
AS 1
BA 26
CB 2
DM 404
DO 122
DS 11
DX 4
NL 30
OL 36
OT 107
PC 6
PE 65
PF 62
PP 88
RM 307
RO 1
SA 2
SF 7
SH 1
SO 11
SS 2
UR 1
1992 AB 10
AH 16
BA 6
CB 7
DM 307
DO 94
DS 18
DX 12
NL 15
OL 45
OT 42
PC 9
PE 41
PF 57
PH 7
PI 1
PP 131
RM 158
SF 3
SO 20
SS 7
UL 1
UR 1
1993 AB 9
AH 15
AS 1
CB 5
CU 1
DM 253
DO 29
DS 18
DX 3
NL 24
OL 35
OT 22
OX 1
PC 1
PE 13
PF 70
PP 100
RM 103
SF 3
SO 6
SS 1
ST 1
1994 AB 7
AH 19
CB 1
DM 293
DO 25
DS 9
DX 2
NL 10
OL 21
OT 21
PC 1
PE 3
PF 73
PP 74
PU 2
RM 40
SO 3
SS 7
SU 1
UL 2
UR 1
1995 AB 4
AH 36
CB 2
DM 436
DO 58
DX 3
NL 8
OL 29
OT 38
PB 7
PC 1
PE 14
PF 158
PH 3
PI 1
PL 3
PM 8
PP 277
PU 1
RM 81
RX 1
SS 7
SU 4
1996 AH 25
DM 492
DO 174
DX 1
NL 7
OL 13
OT 108
PB 39
PC 1
PE 36
PF 330
PL 6
PM 50
PP 298
PX 2
RM 90
SA 1
SS 16
1997 AH 37
CB 3
DM 576
DO 253
DS 6
NL 48
OL 8
OT 258
PB 259
PE 57
PF 186
PH 1
PL 19
PM 271
PP 325
PX 2
RF 1
RM 158
RX 1
SA 1
SH 7
SO 3
SS 12
1998 AB 2
AH 33
CB 4
CT 1
DM 503
DO 111
DS 9
DX 2
NL 32
OT 164
PB 329
PC 1
PE 24
PF 26
PL 7
PM 103
PP 208
PX 1
RM 13
SA 2
SS 15
1999 AH 14
DM 348
DO 84
DS 7
DX 3
NL 20
OT 105
PB 272
PE 7
PM 44
PP 167
RM 28
SH 2
SS 7
2000 AB 3
AH 12
DM 233
DO 91
DX 4
NL 29
OT 154
PB 555
PE 15
PG 1
PL 1
PM 2
PP 381
PX 1
RM 15
SH 8
SS 4
2001 AB 1
AH 23
CB 3
DM 305
DO 81
DX 1
NL 48
OT 167
PB 538
PE 35
PF 27
PI 2
PM 3
PP 273
RM 15
SH 11
SS 5
UR 2
2002 AB 2
AH 28
DM 309
DO 249
DX 1
NL 48
OL 8
OT 127
PB 892
PE 60
PF 18
PI 5
PM 1
PP 385
RM 20
RO 7
SF 7
SH 11
SS 9
UR 1
US 4
Name: record_id, dtype: int64
What was the average weight of each species in each year?
PYTHON
# Mean weight by species and year
result = surveys.groupby(["year", "species_id"])["weight"].mean()

# Use the option_context context manager to show all rows
with pd.option_context("display.max_rows", None):
    print(result)
OUTPUT
year species_id
1977 DM 41.141304
DO 42.666667
DS 121.437500
NL NaN
OL 21.666667
OT 21.000000
OX 21.333333
PE 19.500000
PF 7.173913
PP 15.333333
RM 10.000000
SH NaN
SS NaN
1978 AH NaN
DM 40.795455
DO 45.000000
DS 115.198653
NL 184.656250
OL 31.027027
OT 23.220000
OX NaN
PE 20.428571
PF 7.088235
PM 20.666667
PP 14.869565
RM 7.500000
SA NaN
SH 89.000000
SS 130.000000
1979 DM 43.507317
DO 45.862069
DS 111.796954
NL 138.000000
OL 33.717391
OT 23.075758
PE 20.529412
PF 7.529412
PM 23.666667
PP 15.105263
RM 8.333333
SS NaN
1980 AB NaN
CB NaN
CM NaN
CQ NaN
DM 44.136082
DO 48.058824
DS 120.725664
DX NaN
NL 158.982143
OL 33.107143
OT 24.083333
PE 22.450000
PF 7.455556
PM 21.000000
PP 14.176471
RM 10.227273
SS NaN
1981 AB NaN
CB NaN
CQ NaN
DM 44.032907
DO 49.111111
DS 125.818444
NL 165.903226
OL 33.296296
OT 24.415385
PE 20.558140
PF 7.152542
PM 19.933333
PP 13.950000
RM 11.178571
SA NaN
SS 57.000000
1982 AB NaN
AH NaN
CQ NaN
DM 41.975042
DO 47.919643
DS 115.647929
NL 160.613208
OL 35.590361
OT 24.925532
OX 24.000000
PC NaN
PE 21.173077
PF 6.918919
PM 21.391304
PP 16.125000
RF 11.500000
RM 10.436242
SA NaN
SH 79.000000
SS NaN
1983 AB NaN
AH NaN
CB NaN
DM 40.944551
DO 47.150602
DS 122.033088
DX NaN
NL 156.691489
OL 34.584615
OT 24.697674
PC NaN
PE 20.864865
PF 6.833333
PG NaN
PM 22.023810
PP 15.468750
RM 9.872000
SA NaN
SH NaN
SS NaN
1984 AB NaN
AH NaN
CB NaN
CQ NaN
DM 40.765306
DO 48.415094
DS 124.082192
NL 150.095238
OL 32.550725
OT 22.416667
OX 18.000000
PE 20.210526
PF 7.428571
PH 28.000000
PM 19.545455
PP 15.307692
PU NaN
RM 11.152542
SA NaN
SS NaN
1985 AB NaN
AH NaN
CB NaN
DM 41.507645
DO 47.956731
DS 124.326316
NL 148.829268
OL 32.108696
OT 22.064516
PE 20.360000
PF NaN
PH 32.666667
PM 19.000000
PP 15.764706
RM 8.371429
SA NaN
SS NaN
1986 AB NaN
AH NaN
CB NaN
DM 43.361596
DO 49.372727
DS 128.071429
DX NaN
NL 159.724138
OL 30.880952
OT 21.500000
PE 22.520000
PF 8.000000
PH 39.000000
PM 21.696970
PP 16.750000
RM 10.750000
SA NaN
SH 55.000000
SS NaN
1987 AB NaN
AH NaN
CB NaN
CM NaN
CV NaN
DM 43.232609
DO 50.087379
DS 126.383838
NL 158.840000
OL 30.779221
OT 23.069767
PC NaN
PE 21.625899
PF 7.500000
PM 22.297710
PP 17.840909
RM 10.657609
SA NaN
SH 73.800000
SS NaN
UP NaN
UR NaN
1988 AB NaN
AH NaN
CB NaN
DM 43.397790
DO 51.463768
DS 129.490566
DX NaN
NL 163.104167
OL 30.306122
OT 24.070588
PE 22.625668
PF 8.000000
PG NaN
PH 33.000000
PM 21.759259
PP 18.280000
RF 13.818182
RM 10.322115
SA NaN
SH 72.806452
SS NaN
UL NaN
UP NaN
UR NaN
ZL NaN
1989 AB NaN
AH NaN
BA 7.000000
CB NaN
CS NaN
DM 44.349206
DO 51.025641
DS 121.896552
NL 151.278689
OL 31.947368
OT 24.663366
OX 20.000000
PC NaN
PE 21.935673
PF 7.230769
PH 30.500000
PM 20.222222
PP 17.409091
RF 13.346939
RM 10.411311
SA NaN
SC NaN
SF 54.800000
SH 76.345455
SS NaN
UP NaN
1990 AB NaN
AH NaN
BA 8.000000
DM 41.769912
DO 48.512048
DS 121.187500
NL 154.275862
OL 31.200000
OT 23.675325
PC NaN
PE 21.671233
PF 7.117647
PG NaN
PH 31.142857
PP 16.029412
RF 12.916667
RM 10.305677
SF 52.611111
SH 76.888889
SS NaN
UP NaN
1991 AB NaN
AH NaN
AS NaN
BA 9.240000
CB NaN
DM 43.148338
DO 49.695652
DS 113.000000
DX NaN
NL 148.785714
OL 28.171429
OT 24.588235
PC NaN
PE 21.435484
PF 7.827586
PP 17.904762
RM 10.498305
RO 11.000000
SA NaN
SF 93.166667
SH 63.000000
SO 53.909091
SS NaN
UR NaN
1992 AB NaN
AH NaN
BA 7.833333
CB NaN
DM 43.877966
DO 48.367816
DS 112.352941
DX NaN
NL 139.000000
OL 28.454545
OT 24.928571
PC NaN
PE 22.750000
PF 8.767857
PH 31.142857
PI 18.000000
PP 17.479675
RM 10.904762
SF 43.000000
SO 55.263158
SS NaN
UL NaN
UR NaN
1993 AB NaN
AH NaN
AS NaN
CB NaN
CU NaN
DM 43.373984
DO 48.461538
DS 105.722222
DX NaN
NL 127.391304
OL 27.545455
OT 22.363636
OX NaN
PC NaN
PE 19.230769
PF 8.238806
PP 17.954023
RM 10.747475
SF 44.000000
SO 55.600000
SS NaN
ST NaN
1994 AB NaN
AH NaN
CB NaN
DM 42.373288
DO 47.750000
DS 106.625000
DX NaN
NL 186.875000
OL 21.900000
OT 25.285714
PC NaN
PE 18.000000
PF 7.855072
PP 17.585714
PU NaN
RM 10.675000
SO 62.333333
SS NaN
SU NaN
UL NaN
UR NaN
1995 AB NaN
AH NaN
CB NaN
DM 44.806147
DO 49.592593
DX NaN
NL 155.833333
OL 27.296296
OT 24.868421
PB 34.000000
PC NaN
PE 21.230769
PF 8.780645
PH 35.333333
PI NaN
PL 22.333333
PM 27.375000
PP 16.844444
PU NaN
RM 12.653846
RX 20.000000
SS NaN
SU NaN
1996 AH NaN
DM 44.506173
DO 47.234940
DX NaN
NL 148.000000
OL 27.769231
OT 23.453704
PB 32.578947
PC NaN
PE 22.027778
PF 8.393846
PL 21.166667
PM 20.224490
PP 18.095563
PX NaN
RM 11.455556
SA NaN
SS NaN
1997 AH NaN
CB NaN
DM 44.708551
DO 48.177866
DS 111.000000
NL 150.688889
OL 33.625000
OT 24.785156
PB 31.085603
PE 20.929825
PF 8.448649
PH 22.000000
PL 18.789474
PM 21.126394
PP 18.175000
PX 19.000000
RF 20.000000
RM 11.230769
RX 11.000000
SA NaN
SH 50.857143
SO 54.666667
SS NaN
1998 AB NaN
AH NaN
CB NaN
CT NaN
DM 43.131403
DO 49.731183
DS 116.000000
DX NaN
NL 159.466667
OT 24.675676
PB 30.082237
PC NaN
PE 20.304348
PF 8.720000
PL 16.714286
PM 20.591398
PP 16.266990
PX NaN
RM 13.100000
SA NaN
SS NaN
1999 AH NaN
DM 43.945402
DO 48.012048
DS 120.714286
DX NaN
NL 182.200000
OT 25.723810
PB 31.710037
PE 25.428571
PM 22.523810
PP 16.521212
RM 10.555556
SH 54.000000
SS NaN
2000 AB NaN
AH NaN
DM 43.126638
DO 49.224719
DX NaN
NL 179.307692
OT 25.303448
PB 30.878899
PE 21.615385
PG NaN
PL 21.000000
PM 20.500000
PP 16.788618
PX NaN
RM 11.400000
SH 73.375000
SS NaN
2001 AB NaN
AH NaN
CB NaN
DM 45.442177
DO 52.233766
DX NaN
NL 167.851064
OT 23.707792
PB 32.798851
PE 20.400000
PF 8.538462
PI 18.000000
PM 26.666667
PP 17.752896
RM 11.333333
SH 79.900000
SS NaN
UR NaN
2002 AB NaN
AH NaN
DM 46.168317
DO 49.514403
DX NaN
NL 182.159091
OL 25.428571
OT 23.833333
PB 32.359447
PE 21.719298
PF 7.388889
PI 20.000000
PM 19.000000
PP 17.018617
RM 10.000000
RO 10.142857
SF 62.166667
SH 64.666667
SS NaN
UR NaN
US NaN
Name: weight, dtype: float64
How many individuals were counted for each species that was observed more than 10 times?
PYTHON
# Counts by species that appear more than 10 times
species = surveys.groupby("species_id")["record_id"].count()
species[species > 10]
OUTPUT
species_id
AB 303
AH 437
BA 46
CB 50
CM 13
CQ 16
DM 10596
DO 3027
DS 2504
DX 40
NL 1252
OL 1006
OT 2249
OX 12
PB 2891
PC 39
PE 1299
PF 1597
PH 32
PL 36
PM 899
PP 3123
RF 75
RM 2609
SA 75
SF 43
SH 147
SO 43
SS 248
Name: record_id, dtype: int64
Calculate the number of individuals observed and the average weight for each species in each year.
Note that weight contains NaNs, so counts for record_id and weight differ.
PYTHON
# Counts and mean weight by species and year
"year", "species_id"]).agg(
surveys.groupby(["record_id": "count", "weight": ["count", "mean"]}
{ )
record_id | weight | |||
---|---|---|---|---|
count | count | mean | ||
year | species_id | |||
1977 | DM | 264 | 184 | 41.141304 |
DO | 12 | 12 | 42.666667 | |
DS | 98 | 32 | 121.437500 | |
NL | 31 | 0 | NaN | |
OL | 10 | 3 | 21.666667 | |
… | … | … | … | … |
2002 | SF | 7 | 6 | 62.166667 |
SH | 11 | 9 | 64.666667 | |
SS | 9 | 0 | NaN | |
UR | 1 | 0 | NaN | |
US | 4 | 0 | NaN |
509 rows × 3 columns
Handling missing data
As we’ve discussed, some columns in the surveys dataframe have the value NaN instead of text or numbers. NaN, short for “not a number,” is a special type of float used by pandas to represent missing data. When reading from a CSV, as we have done throughout this lesson, pandas interprets certain values as NaN (see na_values in the pd.read_csv() documentation for the default list). NaNs are excluded from groups and most aggregation calculations in pandas, including counts.
It is crucial to understand how missing data is represented in a dataset. Failing to do so may introduce errors into our analysis. The ecology dataset used in this lesson uses empty cells to represent missing data, but other disciplines have different conventions. For example, some geographic datasets use -9999 to represent null values. Failure to convert such values to NaN will result in significant errors in any calculations performed on that dataset.
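If we know ahead of time that a dataset uses a sentinel value like -9999, we can tell pandas to convert it to NaN when reading the file. A minimal sketch (the file name here is hypothetical):
PYTHON
# Treat -9999 as NaN while reading a CSV
df = pd.read_csv("data/geodata.csv", na_values=[-9999])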
In some cases, it can be useful to fill in cells containing NaN with a non-null value. For example, the groupby() method excludes NaN cells. When looking at sex, the following code counts only those cells with either F (female) or M (male):
PYTHON
"sex")["record_id"].count() surveys.groupby(
OUTPUT
sex
F 15690
M 17348
Name: record_id, dtype: int64
But not all records specify a sex. To include records where sex was not specified, we can use the fillna() method on the sex column. This method replaces each NaN cell with the value passed as the first argument to the function call. To replace all NaN values in sex with “U”, use:
PYTHON
"sex"] = surveys["sex"].fillna("U") surveys[
The grouped calculation now accounts for all records in the dataframe:
PYTHON
"sex")["record_id"].count() surveys.groupby(
OUTPUT
sex
F 15690
M 17348
U 2511
Name: record_id, dtype: int64
In other cases, we may want to ignore rows that contain NaNs. This can be done using dropna():
PYTHON
surveys.dropna()
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
62 | 63 | 8 | 19 | 1977 | 3 | DM | M | 35.0 | 40.0 |
63 | 64 | 8 | 19 | 1977 | 7 | DM | M | 37.0 | 48.0 |
64 | 65 | 8 | 19 | 1977 | 4 | DM | F | 34.0 | 29.0 |
65 | 66 | 8 | 19 | 1977 | 4 | DM | F | 35.0 | 46.0 |
66 | 67 | 8 | 19 | 1977 | 7 | DM | M | 35.0 | 36.0 |
… | … | … | … | … | … | … | … | … | … |
35540 | 35541 | 12 | 31 | 2002 | 15 | PB | F | 24.0 | 31.0 |
35541 | 35542 | 12 | 31 | 2002 | 15 | PB | F | 26.0 | 29.0 |
35542 | 35543 | 12 | 31 | 2002 | 15 | PB | F | 27.0 | 34.0 |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 |
30738 rows × 9 columns
This method returns a copy of the dataframe containing only those rows that have valid data in every field.
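If dropping every row with any missing value is too aggressive, dropna() also accepts a subset keyword that limits the check to specific columns. A quick sketch:
PYTHON
# Drop only the rows that are missing a weight
surveys.dropna(subset=["weight"])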
Key Points
- Calculate individual summary statistics using dataframe methods like mean(), max(), and min()
- Calculate multiple summary statistics at once using the dataframe methods describe() and agg()
- Group data by one or more columns using the groupby() method
- pandas uses NaN to represent missing data in a dataframe
- Failing to consider how missing data is interpreted in a dataset can introduce errors into calculations
Content from Combining Dataframes
Last updated on 2022-07-18 | Edit this page
Estimated time: 60 minutes
Overview
Questions
- How do we combine data from multiple sources using pandas?
- How do we add data to an existing dataframe?
- How do we split and combine data columns?
Objectives
- Use pd.merge() to add species info to the survey dataset
- Use pd.concat() to add additional rows to the dataset
- Use string methods to combine, split, and modify text columns using the str accessor
Dataframes can be used to organize and group data by common characteristics. Often, we need to combine elements from separate dataframes into one for analysis and visualization. A merge (or join) allows us to combine two dataframes using values common to each. Likewise, we may need to append data collected under different circumstances. In this lesson, we will show how to merge, concatenate, and split data using pandas.
Merging dataframes
The survey dataframe we’ve been using throughout this lesson has a column called species_id. We used this column in the previous lesson to calculate summary statistics about observations of each species. But the species_id is just a two-letter code—what does each code stand for? To find out, we’ll now load both the survey dataset and a second dataset containing more detailed information about the various species observed. Read the second dataframe from a file called species.csv:
PYTHON
import pandas as pd

surveys = pd.read_csv("data/surveys.csv")
species = pd.read_csv("data/species.csv")

species
species_id | genus | species | taxa | |
---|---|---|---|---|
0 | AB | Amphispiza | bilineata | Bird |
1 | AH | Ammospermophilus | harrisi | Rodent |
2 | AS | Ammodramus | savannarum | Bird |
3 | BA | Baiomys | taylori | Rodent |
4 | CB | Campylorhynchus | brunneicapillus | Bird |
5 | CM | Calamospiza | melanocorys | Bird |
6 | CQ | Callipepla | squamata | Bird |
7 | CS | Crotalus | scutalatus | Reptile |
8 | CT | Cnemidophorus | tigris | Reptile |
9 | CU | Cnemidophorus | uniparens | Reptile |
10 | CV | Crotalus | viridis | Reptile |
11 | DM | Dipodomys | merriami | Rodent |
12 | DO | Dipodomys | ordii | Rodent |
13 | DS | Dipodomys | spectabilis | Rodent |
14 | DX | Dipodomys | sp. | Rodent |
15 | EO | Eumeces | obsoletus | Reptile |
16 | GS | Gambelia | silus | Reptile |
17 | NL | Neotoma | albigula | Rodent |
18 | NX | Neotoma | sp. | Rodent |
19 | OL | Onychomys | leucogaster | Rodent |
20 | OT | Onychomys | torridus | Rodent |
21 | OX | Onychomys | sp. | Rodent |
22 | PB | Chaetodipus | baileyi | Rodent |
23 | PC | Pipilo | chlorurus | Bird |
24 | PE | Peromyscus | eremicus | Rodent |
25 | PF | Perognathus | flavus | Rodent |
26 | PG | Pooecetes | gramineus | Bird |
27 | PH | Perognathus | hispidus | Rodent |
28 | PI | Chaetodipus | intermedius | Rodent |
29 | PL | Peromyscus | leucopus | Rodent |
30 | PM | Peromyscus | maniculatus | Rodent |
31 | PP | Chaetodipus | penicillatus | Rodent |
32 | PU | Pipilo | fuscus | Bird |
33 | PX | Chaetodipus | sp. | Rodent |
34 | RF | Reithrodontomys | fulvescens | Rodent |
35 | RM | Reithrodontomys | megalotis | Rodent |
36 | RO | Reithrodontomys | montanus | Rodent |
37 | RX | Reithrodontomys | sp. | Rodent |
38 | SA | Sylvilagus | audubonii | Rabbit |
39 | SB | Spizella | breweri | Bird |
40 | SC | Sceloporus | clarki | Reptile |
41 | SF | Sigmodon | fulviventer | Rodent |
42 | SH | Sigmodon | hispidus | Rodent |
43 | SO | Sigmodon | ochrognathus | Rodent |
44 | SS | Spermophilus | spilosoma | Rodent |
45 | ST | Spermophilus | tereticaudus | Rodent |
46 | SU | Sceloporus | undulatus | Reptile |
47 | SX | Sigmodon | sp. | Rodent |
48 | UL | Lizard | sp. | Reptile |
49 | UP | Pipilo | sp. | Bird |
50 | UR | Rodent | sp. | Rodent |
51 | US | Sparrow | sp. | Bird |
52 | ZL | Zonotrichia | leucophrys | Bird |
53 | ZM | Zenaida | macroura | Bird |
We can see that the species dataframe includes a genus, species, and taxon for each species_id. This is much more useful than the species_id included in the original dataframe. How can we add that data to our surveys dataframe? Adding it by hand would be tedious and error prone. Fortunately, pandas provides the pd.merge() function to join two dataframes.
Managing repetitive data
Why store species data in a separate table in the first place? Species information is repetitive: Every observation of the same species has the same genus, species, and taxa. Storing it in the original survey table would require including that data in every record, increasing the complexity of the table and creating the possibility of errors. Storing that data in a separate table means we only have to enter and validate it once. A tool like pandas then allows us to access that data when we need it.
To merge the surveys and species dataframes, use:
PYTHON
merged = pd.merge(surveys, species)
merged
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN | Neotoma | albigula | Rodent |
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN | Neotoma | albigula | Rodent |
2 | 22 | 7 | 17 | 1977 | 15 | NL | F | 31.0 | NaN | Neotoma | albigula | Rodent |
3 | 38 | 7 | 17 | 1977 | 17 | NL | M | 33.0 | NaN | Neotoma | albigula | Rodent |
4 | 72 | 8 | 19 | 1977 | 2 | NL | M | 31.0 | NaN | Neotoma | albigula | Rodent |
… | … | … | … | … | … | … | … | … | … | … | … | … |
34781 | 28988 | 12 | 23 | 1998 | 6 | CT | NaN | NaN | NaN | Cnemidophorus | tigris | Reptile |
34782 | 35512 | 12 | 31 | 2002 | 11 | US | NaN | NaN | NaN | Sparrow | sp. | Bird |
34783 | 35513 | 12 | 31 | 2002 | 11 | US | NaN | NaN | NaN | Sparrow | sp. | Bird |
34784 | 35528 | 12 | 31 | 2002 | 13 | US | NaN | NaN | NaN | Sparrow | sp. | Bird |
34785 | 35544 | 12 | 31 | 2002 | 15 | US | NaN | NaN | NaN | Sparrow | sp. | Bird |
34786 rows × 12 columns
Following the merge, the genus, species, and taxa columns have all been added to the survey dataframe. We can now use those columns to filter and summarize our data.
Joins
The pd.merge() function is equivalent to the JOIN operation in SQL.
Challenge
Filter the merged dataframe to show the genus, the species name, and the weight for every individual captured at the site
PYTHON
"genus", "species", "weight"]] merged[[
genus | species | weight | |
---|---|---|---|
0 | Neotoma | albigula | NaN |
1 | Neotoma | albigula | NaN |
2 | Neotoma | albigula | NaN |
3 | Neotoma | albigula | NaN |
4 | Neotoma | albigula | NaN |
… | … | … | … |
34781 | Cnemidophorus | tigris | NaN |
34782 | Sparrow | sp. | NaN |
34783 | Sparrow | sp. | NaN |
34784 | Sparrow | sp. | NaN |
34785 | Sparrow | sp. | NaN |
34786 rows × 3 columns
In the example above, we didn’t provide any information about how we wanted to merge the dataframes together, so pandas used its default arguments to make an educated guess. It looked at the columns in each of the dataframes, then merged them based on the columns that appear in both. Here, the only shared name is species_id, so that’s the column pandas used to merge. For more complex tables, we may want to specify which columns are used for merging. We can do so by passing one or more column names using the on keyword argument:
PYTHON
="species_id") pd.merge(surveys, species, on
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN | Neotoma | albigula | Rodent |
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN | Neotoma | albigula | Rodent |
2 | 22 | 7 | 17 | 1977 | 15 | NL | F | 31.0 | NaN | Neotoma | albigula | Rodent |
3 | 38 | 7 | 17 | 1977 | 17 | NL | M | 33.0 | NaN | Neotoma | albigula | Rodent |
4 | 72 | 8 | 19 | 1977 | 2 | NL | M | 31.0 | NaN | Neotoma | albigula | Rodent |
… | … | … | … | … | … | … | … | … | … | … | … | … |
34781 | 28988 | 12 | 23 | 1998 | 6 | CT | NaN | NaN | NaN | Cnemidophorus | tigris | Reptile |
34782 | 35512 | 12 | 31 | 2002 | 11 | US | NaN | NaN | NaN | Sparrow | sp. | Bird |
34783 | 35513 | 12 | 31 | 2002 | 11 | US | NaN | NaN | NaN | Sparrow | sp. | Bird |
34784 | 35528 | 12 | 31 | 2002 | 13 | US | NaN | NaN | NaN | Sparrow | sp. | Bird |
34785 | 35544 | 12 | 31 | 2002 | 15 | US | NaN | NaN | NaN | Sparrow | sp. | Bird |
34786 rows × 12 columns
Challenge
Compare the number of rows in the original and merged survey dataframes. How do they differ? Why do you think that might be?
Hint: Use the pd.unique() function to look at the species_id column in each dataframe.
PYTHON
"species_id"].sort_values()) pd.unique(surveys[
OUTPUT
array(['AB', 'AH', 'AS', 'BA', 'CB', 'CM', 'CQ', 'CS', 'CT', 'CU', 'CV',
'DM', 'DO', 'DS', 'DX', 'NL', 'OL', 'OT', 'OX', 'PB', 'PC', 'PE',
'PF', 'PG', 'PH', 'PI', 'PL', 'PM', 'PP', 'PU', 'PX', 'RF', 'RM',
'RO', 'RX', 'SA', 'SC', 'SF', 'SH', 'SO', 'SS', 'ST', 'SU', 'UL',
'UP', 'UR', 'US', 'ZL', nan], dtype=object)
PYTHON
"species_id"].sort_values()) pd.unique(species[
OUTPUT
array(['AB', 'AH', 'AS', 'BA', 'CB', 'CM', 'CQ', 'CS', 'CT', 'CU', 'CV',
'DM', 'DO', 'DS', 'DX', 'EO', 'GS', 'NL', 'NX', 'OL', 'OT', 'OX',
'PB', 'PC', 'PE', 'PF', 'PG', 'PH', 'PI', 'PL', 'PM', 'PP', 'PU',
'PX', 'RF', 'RM', 'RO', 'RX', 'SA', 'SB', 'SC', 'SF', 'SH', 'SO',
'SS', 'ST', 'SU', 'SX', 'UL', 'UP', 'UR', 'US', 'ZL', 'ZM'],
dtype=object)
Some records in the surveys dataframe do not specify a species. By default, only records with a value that occurs in both the surveys and species dataframes appear in the merged dataframe, so rows without a species_id are excluded.
The built-in set type can be used to quickly assess differences like this:
PYTHON
set(surveys["species_id"]) - set(species["species_id"])
OUTPUT
{nan}
In practice, the values in the columns used to join two dataframes may not align exactly. Above, the surveys dataframe contains a few hundred rows where species_id is NaN. These are the rows that were dropped when the dataframes were merged.
By default, pd.merge() performs an inner join. This means that a row will only appear if the value in the shared column appears in both of the datasets being merged. In this case, that means that survey observations that don’t have a species_id or have a species_id that does not appear in the species dataframe will be dropped.
This is not always desirable behavior. Fortunately, pandas supports additional types of merges:
- Inner: Include all rows with common values in the join column. This is the default behavior.
- Left: Include all rows from the left dataframe. Columns from the right dataframe are populated if a common value exists and set to NaN if not.
- Right: Include all rows from the right dataframe. Columns from the left dataframe are populated if a common value exists and set to NaN if not.
- Outer: Include all rows from both dataframes. Columns without a matching value in the other dataframe are set to NaN.
We want to keep all of our observations, so let’s do a left join instead. To specify the type of merge, we use the how keyword argument:
PYTHON
="left") pd.merge(surveys, species, how
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN | Neotoma | albigula | Rodent |
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN | Neotoma | albigula | Rodent |
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN | Dipodomys | merriami | Rodent |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN | Dipodomys | merriami | Rodent |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN | Dipodomys | merriami | Rodent |
… | … | … | … | … | … | … | … | … | … | … | … | … |
35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN | Ammospermophilus | harrisi | Rodent |
35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN | Ammospermophilus | harrisi | Rodent |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 | Reithrodontomys | megalotis | Rodent |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 | Dipodomys | ordii | Rodent |
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
35549 rows × 12 columns
Now all 35,549 rows appear in the merged dataframe.
Appending rows to a dataframe
Merges address the case where information about the same set of observations is spread across multiple files. What about when the observations themselves are split into more than one file? For a survey like the one we’ve been looking at in this lesson, we might get a new file once a year with the same columns but a completely new set of observations. How can we add those new observations to our dataframe?
We’ll simulate this operation by splitting data from two different years, 2001 and 2002, into separate dataframes. We can do this using conditionals, as we saw in lesson 3:
PYTHON
surveys_2001 = surveys[surveys["year"] == 2001].copy()
surveys_2001
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
31710 | 31711 | 1 | 21 | 2001 | 1 | PB | F | 26.0 | 25.0 |
31711 | 31712 | 1 | 21 | 2001 | 1 | DM | M | 37.0 | 43.0 |
31712 | 31713 | 1 | 21 | 2001 | 1 | PB | M | 29.0 | 44.0 |
31713 | 31714 | 1 | 21 | 2001 | 1 | DO | M | 34.0 | 53.0 |
31714 | 31715 | 1 | 21 | 2001 | 2 | OT | M | 20.0 | 27.0 |
… | … | … | … | … | … | … | … | … | … |
33315 | 33316 | 12 | 16 | 2001 | 11 | NaN | NaN | NaN | NaN |
33316 | 33317 | 12 | 16 | 2001 | 13 | NaN | NaN | NaN | NaN |
33317 | 33318 | 12 | 16 | 2001 | 14 | NaN | NaN | NaN | NaN |
33318 | 33319 | 12 | 16 | 2001 | 15 | NaN | NaN | NaN | NaN |
33319 | 33320 | 12 | 16 | 2001 | 16 | NaN | NaN | NaN | NaN |
1610 rows × 9 columns
PYTHON
surveys_2002 = surveys[surveys["year"] == 2002].copy()
surveys_2002
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
33320 | 33321 | 1 | 12 | 2002 | 1 | DM | M | 38.0 | 44.0 |
33321 | 33322 | 1 | 12 | 2002 | 1 | DO | M | 37.0 | 58.0 |
33322 | 33323 | 1 | 12 | 2002 | 1 | PB | M | 28.0 | 45.0 |
33323 | 33324 | 1 | 12 | 2002 | 1 | AB | NaN | NaN | NaN |
33324 | 33325 | 1 | 12 | 2002 | 1 | DO | M | 35.0 | 29.0 |
… | … | … | … | … | … | … | … | … | … |
35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 |
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN |
2229 rows × 9 columns
We now have two different dataframes with the same columns but different data, one with 1,610 rows, the other with 2,229 rows. We can combine them into a new dataframe using pd.concat(), which stacks the dataframes vertically (that is, it appends records from the 2002 dataset to the 2001 dataset). This function accepts a list and will concatenate each item moving from left to right. We’re only combining two dataframes here but could do more if needed.
PYTHON
surveys_2001_2002 = pd.concat([surveys_2001, surveys_2002])
surveys_2001_2002
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
31710 | 31711 | 1 | 21 | 2001 | 1 | PB | F | 26.0 | 25.0 |
31711 | 31712 | 1 | 21 | 2001 | 1 | DM | M | 37.0 | 43.0 |
31712 | 31713 | 1 | 21 | 2001 | 1 | PB | M | 29.0 | 44.0 |
31713 | 31714 | 1 | 21 | 2001 | 1 | DO | M | 34.0 | 53.0 |
31714 | 31715 | 1 | 21 | 2001 | 2 | OT | M | 20.0 | 27.0 |
… | … | … | … | … | … | … | … | … | … |
35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 |
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN |
3839 rows × 9 columns
The combined dataframe includes all rows from both dataframes.
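Note that pd.concat() keeps the row labels from the original dataframes. If we would rather renumber the combined rows from zero, we can pass the ignore_index keyword argument (a quick sketch):
PYTHON
# Renumber the rows of the combined dataframe from 0
pd.concat([surveys_2001, surveys_2002], ignore_index=True)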
In some cases, the exact columns may change from year to year even within the same project. For example, researchers may decide to add an additional column to track a new piece of data or to provide a quality check. If a column is present in only one dataset, you can still concatenate the datasets. Any column that does not appear in a given dataset will be set to NaN for those rows in the combined dataframe.
To illustrate this, we’ll add a validated column to the 2002 survey, then re-run pd.concat():
PYTHON
"validated"] = True
surveys_2002[
= pd.concat([surveys_2001, surveys_2002])
surveys_2001_2002 surveys_2001_2002
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | validated | |
---|---|---|---|---|---|---|---|---|---|---|
31710 | 31711 | 1 | 21 | 2001 | 1 | PB | F | 26.0 | 25.0 | NaN |
31711 | 31712 | 1 | 21 | 2001 | 1 | DM | M | 37.0 | 43.0 | NaN |
31712 | 31713 | 1 | 21 | 2001 | 1 | PB | M | 29.0 | 44.0 | NaN |
31713 | 31714 | 1 | 21 | 2001 | 1 | DO | M | 34.0 | 53.0 | NaN |
31714 | 31715 | 1 | 21 | 2001 | 2 | OT | M | 20.0 | 27.0 | NaN |
… | … | … | … | … | … | … | … | … | … | … |
35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN | True |
35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN | True |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 | True |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 | True |
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN | True |
3839 rows × 10 columns
As expected, the validated column has a value of NaN for the 2001 data in the combined dataframe.
Joining and splitting columns
Sometimes we’d like to combine values from two or more columns into a single column. For example, we might want to refer to the species in each record by both its genus and species names. In Python, we use the + operator to concatenate (or join) strings, and pandas works the same way:
PYTHON
"genus_species"] = species["genus"] + " " + species["species"]
species["genus_species"] species[
OUTPUT
0 Amphispiza bilineata
1 Ammospermophilus harrisi
2 Ammodramus savannarum
3 Baiomys taylori
4 Campylorhynchus brunneicapillus
5 Calamospiza melanocorys
6 Callipepla squamata
7 Crotalus scutalatus
8 Cnemidophorus tigris
9 Cnemidophorus uniparens
10 Crotalus viridis
11 Dipodomys merriami
12 Dipodomys ordii
13 Dipodomys spectabilis
14 Dipodomys sp.
15 Eumeces obsoletus
16 Gambelia silus
17 Neotoma albigula
18 Neotoma sp.
19 Onychomys leucogaster
20 Onychomys torridus
21 Onychomys sp.
22 Chaetodipus baileyi
23 Pipilo chlorurus
24 Peromyscus eremicus
25 Perognathus flavus
26 Pooecetes gramineus
27 Perognathus hispidus
28 Chaetodipus intermedius
29 Peromyscus leucopus
30 Peromyscus maniculatus
31 Chaetodipus penicillatus
32 Pipilo fuscus
33 Chaetodipus sp.
34 Reithrodontomys fulvescens
35 Reithrodontomys megalotis
36 Reithrodontomys montanus
37 Reithrodontomys sp.
38 Sylvilagus audubonii
39 Spizella breweri
40 Sceloporus clarki
41 Sigmodon fulviventer
42 Sigmodon hispidus
43 Sigmodon ochrognathus
44 Spermophilus spilosoma
45 Spermophilus tereticaudus
46 Sceloporus undulatus
47 Sigmodon sp.
48 Lizard sp.
49 Pipilo sp.
50 Rodent sp.
51 Sparrow sp.
52 Zonotrichia leucophrys
53 Zenaida macroura
Name: genus_species, dtype: object
Note that the + operator is also used to add numeric columns. In Python, the same operator can be used to perform different operations for different data types (but keep reading for an important caveat).
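A quick illustration with plain Python values (results as comments):
PYTHON
1 + 2        # 3 (addition)
"1" + "2"    # '12' (concatenation)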
Another common need is to join or split dates. In the ecology dataset, the date is split across year, month, and day columns. However, pandas has a special data type, datetime64, for representing dates that is useful for plotting, comparing, and resampling time series data. To make use of that functionality, we can concatenate the date columns and convert them to a datetime object. For clarity, we’ll use an unambiguous date format: YYYY-MM-DD.
We need to perform an additional step before combining the date columns. Year, month, and day are all stored as integers in our dataframe (specifically, they use the int64 data type). If we try to concatenate them as they are, we’ll receive an error. This is because the + operator only works when each object has a similar type, that is, all objects are either the same type or can be coerced to a common type (like int and float, which are distinct types that generally play well together). In this case, the columns are integers and the hyphens are strings. pandas cannot determine exactly how the user wants to combine the values, so it gives an error.
To resolve the error, we can use the astype() method to convert each column to a string before combining the columns:
PYTHON
= surveys["year"].astype(str)
year = surveys["month"].astype(str)
month = surveys["day"].astype(str)
day
"date"] = year + "-" + month + "-" + day
surveys["date"] surveys[
OUTPUT
0 1977-7-16
1 1977-7-16
2 1977-7-16
3 1977-7-16
4 1977-7-16
...
35544 2002-12-31
35545 2002-12-31
35546 2002-12-31
35547 2002-12-31
35548 2002-12-31
Name: date, Length: 35549, dtype: object
Note that some of the dates look a little funny because single-digit days and months do not include a leading zero. For example, in the first row we have 1977-7-16 instead of 1977-07-16. This is usually not a big deal but can be neatened up using the str accessor.
In pandas, an accessor is an attribute that provides additional functionality to an object. Here, the str accessor allows us to access many of the methods from the built-in str data type, including zfill(), which pads a string to a given length by adding zeroes to the start of the string:
PYTHON
= "2"
text 3) text.zfill(
OUTPUT
'002'
Using the str accessor, we can use that method to zero-pad the data in a Series:
PYTHON
# Pad month and day to two characters
month = month.str.zfill(2)
day = day.str.zfill(2)

surveys["date"] = year + "-" + month + "-" + day
surveys["date"]
OUTPUT
0 1977-07-16
1 1977-07-16
2 1977-07-16
3 1977-07-16
4 1977-07-16
...
35544 2002-12-31
35545 2002-12-31
35546 2002-12-31
35547 2002-12-31
35548 2002-12-31
Name: date, Length: 35549, dtype: object
The month and day values in date are now padded to a length of two, allowing us to create a well-formed YYYY-MM-DD date string. Other string methods, like upper() and lower(), can be used in the same way.
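For example, a one-line sketch that upper-cases a text column:
PYTHON
# Convert the genus column to upper case
species["genus"].str.upper()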
Before we convert the date column to a datetime, we’re going to use the date string to show the opposite operation: Splitting a value stored in one column into multiple columns. One way to do this in pandas is to use str.split(), which splits each value in a series based on a delimiter, a character used as a boundary between parts of a string. Here, a hyphen is used to delimit the year, month, and day in each date. By splitting the column on a hyphen, we can extract each of those components into its own column. We also pass True to the expand keyword argument, which makes the str.split() method return a dataframe:
PYTHON
"date"].str.split("-", expand=True) surveys[
0 | 1 | 2 | |
---|---|---|---|
0 | 1977 | 07 | 16 |
1 | 1977 | 07 | 16 |
2 | 1977 | 07 | 16 |
3 | 1977 | 07 | 16 |
4 | 1977 | 07 | 16 |
… | … | … | … |
35544 | 2002 | 12 | 31 |
35545 | 2002 | 12 | 31 |
35546 | 2002 | 12 | 31 |
35547 | 2002 | 12 | 31 |
35548 | 2002 | 12 | 31 |
35549 rows × 3 columns
Now let’s go ahead and convert our date column into a datetime object using pd.to_datetime():
PYTHON
"date"] = pd.to_datetime(surveys["date"])
surveys["date"] surveys[
OUTPUT
0 1977-07-16
1 1977-07-16
2 1977-07-16
3 1977-07-16
4 1977-07-16
...
35544 2002-12-31
35545 2002-12-31
35546 2002-12-31
35547 2002-12-31
35548 2002-12-31
Name: date, Length: 35549, dtype: datetime64[ns]
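With the column stored as a datetime, the dt accessor (the datetime counterpart of the str accessor above) lets us work with date components directly. A brief sketch:
PYTHON
# Extract the year from each date
surveys["date"].dt.year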
Challenge
pandas can help us answer specific questions about our data. The real skill is knowing how to translate our scientific questions into a sensible approach (and subsequently visualize and interpret our results).
Try using pandas to answer the following questions.
- How many specimens of each sex are there for each year, including those whose sex is unknown?
- What is the average weight of each taxa?
- What are the minimum, maximum and average weight for each species of Rodent?
- What is the average hindfoot length for male and female rodents of each species? Is there a male/female difference?
- What is the average weight of each rodent species over the course of the years? Is there any noticeable trend for any of the species?
How many specimens of each sex are there for each year, including those whose sex is unknown?
PYTHON
# Fill in NaN values in sex with U
"sex"] = surveys["sex"].fillna("U")
surveys[
# Count records by sex
= surveys.groupby(["year", "sex"])["record_id"].count()
result
# Use the option_context context manager to show all rows
with pd.option_context("display.max_rows", None):
print(result)
OUTPUT
year sex
1977 F 204
M 214
U 85
1978 F 503
M 433
U 112
1979 F 327
M 324
U 68
1980 F 605
M 727
U 83
1981 F 631
M 745
U 96
1982 F 823
M 1027
U 128
1983 F 771
M 797
U 105
1984 F 445
M 443
U 93
1985 F 636
M 716
U 86
1986 F 414
M 455
U 73
1987 F 677
M 862
U 132
1988 F 604
M 737
U 128
1989 F 678
M 780
U 111
1990 F 581
M 636
U 94
1991 F 606
M 637
U 104
1992 F 443
M 477
U 118
1993 F 285
M 380
U 85
1994 F 243
M 327
U 98
1995 F 574
M 534
U 114
1996 F 725
M 910
U 71
1997 F 1071
M 1357
U 65
1998 F 717
M 798
U 95
1999 F 545
M 530
U 60
2000 F 690
M 779
U 83
2001 F 743
M 744
U 123
2002 F 1149
M 979
U 101
Name: record_id, dtype: int64
What is the average weight of each taxa?
PYTHON
# Create the merged dataframe
merged = pd.merge(surveys, species, how="left")

# Group by taxa
grouped = merged.groupby("taxa")

# Calculate the mean weight
grouped["weight"].mean()
OUTPUT
taxa
Bird NaN
Rabbit NaN
Reptile NaN
Rodent 42.672428
Name: weight, dtype: float64
What are the minimum, maximum and average weight for each species of Rodent?
PYTHON
# Create the merged dataframe
merged = pd.merge(surveys, species, how="left")

# Limit merged dataframe to rodents
rodents = merged[merged["taxa"] == "Rodent"]

# Group rodents by species
grouped = rodents.groupby("species_id")

# Calculate the min, max, and mean weight
grouped.agg({"weight": ["min", "max", "mean"]})
weight | |||
---|---|---|---|
min | max | mean | |
species_id | |||
AH | NaN | NaN | NaN |
BA | 6.0 | 18.0 | 8.600000 |
DM | 10.0 | 66.0 | 43.157864 |
DO | 12.0 | 76.0 | 48.870523 |
DS | 12.0 | 190.0 | 120.130546 |
DX | NaN | NaN | NaN |
NL | 30.0 | 280.0 | 159.245660 |
OL | 10.0 | 56.0 | 31.575258 |
OT | 5.0 | 46.0 | 24.230556 |
OX | 18.0 | 24.0 | 21.000000 |
PB | 12.0 | 55.0 | 31.735943 |
PE | 8.0 | 40.0 | 21.586508 |
PF | 4.0 | 25.0 | 7.923127 |
PH | 18.0 | 48.0 | 31.064516 |
PI | 17.0 | 21.0 | 19.250000 |
PL | 8.0 | 27.0 | 19.138889 |
PM | 7.0 | 49.0 | 21.364155 |
PP | 4.0 | 74.0 | 17.173942 |
PX | 18.0 | 20.0 | 19.000000 |
RF | 9.0 | 20.0 | 13.386667 |
RM | 4.0 | 29.0 | 10.585010 |
RO | 8.0 | 13.0 | 10.250000 |
RX | 11.0 | 20.0 | 15.500000 |
SF | 24.0 | 199.0 | 58.878049 |
SH | 16.0 | 140.0 | 73.148936 |
SO | 15.0 | 105.0 | 55.414634 |
SS | 57.0 | 130.0 | 93.500000 |
ST | NaN | NaN | NaN |
UR | NaN | NaN | NaN |
What is the average hindfoot length for male and female rodents of each species? Is there a male/female difference?
PYTHON
# Create the merged dataframe
merged = pd.merge(surveys, species, how="left")

# Limit merged dataframe to rodents
rodents = merged[merged["taxa"] == "Rodent"]

# Group rodents by species and sex
grouped = rodents.groupby(["species_id", "sex"])

# Calculate the mean hindfoot length, plus count and standard deviation
# to better assess the question
with pd.option_context("display.max_rows", None):
    print(grouped["hindfoot_length"].agg(["count", "mean", "std"]))
OUTPUT
count mean std
species_id sex
AH M 0 NaN NaN
U 2 33.000000 2.828427
BA F 31 13.161290 1.529636
M 14 12.642857 2.097880
U 0 NaN NaN
DM F 4302 35.712692 1.433067
M 5658 36.188229 1.455396
U 12 35.583333 0.996205
DO F 1244 35.486334 1.726414
M 1640 35.698780 1.611656
U 3 36.000000 0.000000
DS F 1046 49.583174 1.973573
M 1083 50.301939 2.128120
U 3 50.000000 1.732051
DX U 0 NaN NaN
NL F 620 32.024194 1.669181
M 452 32.577434 1.903603
U 2 52.000000 25.455844
OL F 438 20.287671 1.101022
M 480 20.758333 1.651195
U 2 20.000000 0.000000
OT F 1019 20.281649 1.519412
M 1115 20.257399 1.087354
U 5 19.600000 1.140175
OX F 3 19.666667 1.154701
M 5 18.800000 3.346640
U 0 NaN NaN
PB F 1645 25.882675 0.928842
M 1213 26.431987 1.444130
U 6 26.166667 0.983192
PE F 544 20.272059 1.156084
M 666 20.132132 1.177721
U 2 20.500000 0.707107
PF F 721 15.550624 1.381627
M 770 15.620779 1.142208
U 2 13.000000 4.242641
PH F 20 25.900000 1.293709
M 11 25.545455 1.752920
PI M 8 22.500000 0.534522
U 1 20.000000 NaN
PL F 16 19.750000 2.720294
M 19 20.263158 0.933459
U 1 20.000000 NaN
PM F 361 20.279778 1.162782
M 483 20.530021 1.275621
U 3 21.333333 4.041452
PP F 1574 21.676620 1.316529
M 1444 21.838643 1.201006
U 9 20.888889 2.027588
PX F 1 19.000000 NaN
M 1 20.000000 NaN
U 0 NaN NaN
RF F 55 17.527273 0.813191
M 18 17.500000 0.985184
RM F 1133 16.391880 1.123946
M 1296 16.491512 1.134202
U 13 15.846154 2.444250
RO F 4 14.750000 0.957427
M 4 16.000000 1.154701
RX M 2 18.500000 2.121320
SF F 16 27.500000 2.683282
M 23 26.347826 3.524056
U 2 24.500000 0.707107
SH F 71 29.028169 2.298893
M 60 27.983333 2.801281
U 0 NaN NaN
SO F 30 25.633333 4.139660
M 11 25.727273 1.954017
SS F 0 NaN NaN
M 0 NaN NaN
U 0 NaN NaN
ST U 0 NaN NaN
UR U 0 NaN NaN
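To answer the male/female question more directly, one option (a sketch building on the grouped result above, not part of the original solution) is to pivot the means with unstack() and subtract one column from the other:
PYTHON
# Mean hindfoot length with sexes as columns; positive differences
# indicate that males average longer hindfeet for that species
means = grouped["hindfoot_length"].mean().unstack()
print((means["M"] - means["F"]).dropna())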
What is the average weight of each rodent species over the course of the years? Is there any noticeable trend for any of the species?
PYTHON
# Create the merged dataframe
merged = pd.merge(surveys, species, how="left")

# Limit merged dataframe to rodents
rodents = merged[merged["taxa"] == "Rodent"]

# Group rodents by species and year
grouped = rodents.groupby(["species_id", "year"])

# Calculate the mean weight by year
result = grouped["weight"].mean()

# Use the option_context context manager to show all rows
with pd.option_context("display.max_rows", None):
    print(result)
OUTPUT
species_id year
AH 1978 NaN
1982 NaN
1983 NaN
1984 NaN
1985 NaN
1986 NaN
1987 NaN
1988 NaN
1989 NaN
1990 NaN
1991 NaN
1992 NaN
1993 NaN
1994 NaN
1995 NaN
1996 NaN
1997 NaN
1998 NaN
1999 NaN
2000 NaN
2001 NaN
2002 NaN
BA 1989 7.000000
1990 8.000000
1991 9.240000
1992 7.833333
DM 1977 41.141304
1978 40.795455
1979 43.507317
1980 44.136082
1981 44.032907
1982 41.975042
1983 40.944551
1984 40.765306
1985 41.507645
1986 43.361596
1987 43.232609
1988 43.397790
1989 44.349206
1990 41.769912
1991 43.148338
1992 43.877966
1993 43.373984
1994 42.373288
1995 44.806147
1996 44.506173
1997 44.708551
1998 43.131403
1999 43.945402
2000 43.126638
2001 45.442177
2002 46.168317
DO 1977 42.666667
1978 45.000000
1979 45.862069
1980 48.058824
1981 49.111111
1982 47.919643
1983 47.150602
1984 48.415094
1985 47.956731
1986 49.372727
1987 50.087379
1988 51.463768
1989 51.025641
1990 48.512048
1991 49.695652
1992 48.367816
1993 48.461538
1994 47.750000
1995 49.592593
1996 47.234940
1997 48.177866
1998 49.731183
1999 48.012048
2000 49.224719
2001 52.233766
2002 49.514403
DS 1977 121.437500
1978 115.198653
1979 111.796954
1980 120.725664
1981 125.818444
1982 115.647929
1983 122.033088
1984 124.082192
1985 124.326316
1986 128.071429
1987 126.383838
1988 129.490566
1989 121.896552
1990 121.187500
1991 113.000000
1992 112.352941
1993 105.722222
1994 106.625000
1997 111.000000
1998 116.000000
1999 120.714286
DX 1980 NaN
1983 NaN
1986 NaN
1988 NaN
1991 NaN
1992 NaN
1993 NaN
1994 NaN
1995 NaN
1996 NaN
1998 NaN
1999 NaN
2000 NaN
2001 NaN
2002 NaN
NL 1977 NaN
1978 184.656250
1979 138.000000
1980 158.982143
1981 165.903226
1982 160.613208
1983 156.691489
1984 150.095238
1985 148.829268
1986 159.724138
1987 158.840000
1988 163.104167
1989 151.278689
1990 154.275862
1991 148.785714
1992 139.000000
1993 127.391304
1994 186.875000
1995 155.833333
1996 148.000000
1997 150.688889
1998 159.466667
1999 182.200000
2000 179.307692
2001 167.851064
2002 182.159091
OL 1977 21.666667
1978 31.027027
1979 33.717391
1980 33.107143
1981 33.296296
1982 35.590361
1983 34.584615
1984 32.550725
1985 32.108696
1986 30.880952
1987 30.779221
1988 30.306122
1989 31.947368
1990 31.200000
1991 28.171429
1992 28.454545
1993 27.545455
1994 21.900000
1995 27.296296
1996 27.769231
1997 33.625000
2002 25.428571
OT 1977 21.000000
1978 23.220000
1979 23.075758
1980 24.083333
1981 24.415385
1982 24.925532
1983 24.697674
1984 22.416667
1985 22.064516
1986 21.500000
1987 23.069767
1988 24.070588
1989 24.663366
1990 23.675325
1991 24.588235
1992 24.928571
1993 22.363636
1994 25.285714
1995 24.868421
1996 23.453704
1997 24.785156
1998 24.675676
1999 25.723810
2000 25.303448
2001 23.707792
2002 23.833333
OX 1977 21.333333
1978 NaN
1982 24.000000
1984 18.000000
1989 20.000000
1993 NaN
PB 1995 34.000000
1996 32.578947
1997 31.085603
1998 30.082237
1999 31.710037
2000 30.878899
2001 32.798851
2002 32.359447
PE 1977 19.500000
1978 20.428571
1979 20.529412
1980 22.450000
1981 20.558140
1982 21.173077
1983 20.864865
1984 20.210526
1985 20.360000
1986 22.520000
1987 21.625899
1988 22.625668
1989 21.935673
1990 21.671233
1991 21.435484
1992 22.750000
1993 19.230769
1994 18.000000
1995 21.230769
1996 22.027778
1997 20.929825
1998 20.304348
1999 25.428571
2000 21.615385
2001 20.400000
2002 21.719298
PF 1977 7.173913
1978 7.088235
1979 7.529412
1980 7.455556
1981 7.152542
1982 6.918919
1983 6.833333
1984 7.428571
1985 NaN
1986 8.000000
1987 7.500000
1988 8.000000
1989 7.230769
1990 7.117647
1991 7.827586
1992 8.767857
1993 8.238806
1994 7.855072
1995 8.780645
1996 8.393846
1997 8.448649
1998 8.720000
2001 8.538462
2002 7.388889
PH 1984 28.000000
1985 32.666667
1986 39.000000
1988 33.000000
1989 30.500000
1990 31.142857
1992 31.142857
1995 35.333333
1997 22.000000
PI 1992 18.000000
1995 NaN
2001 18.000000
2002 20.000000
PL 1995 22.333333
1996 21.166667
1997 18.789474
1998 16.714286
2000 21.000000
PM 1978 20.666667
1979 23.666667
1980 21.000000
1981 19.933333
1982 21.391304
1983 22.023810
1984 19.545455
1985 19.000000
1986 21.696970
1987 22.297710
1988 21.759259
1989 20.222222
1995 27.375000
1996 20.224490
1997 21.126394
1998 20.591398
1999 22.523810
2000 20.500000
2001 26.666667
2002 19.000000
PP 1977 15.333333
1978 14.869565
1979 15.105263
1980 14.176471
1981 13.950000
1982 16.125000
1983 15.468750
1984 15.307692
1985 15.764706
1986 16.750000
1987 17.840909
1988 18.280000
1989 17.409091
1990 16.029412
1991 17.904762
1992 17.479675
1993 17.954023
1994 17.585714
1995 16.844444
1996 18.095563
1997 18.175000
1998 16.266990
1999 16.521212
2000 16.788618
2001 17.752896
2002 17.018617
PX 1996 NaN
1997 19.000000
1998 NaN
2000 NaN
RF 1982 11.500000
1988 13.818182
1989 13.346939
1990 12.916667
1997 20.000000
RM 1977 10.000000
1978 7.500000
1979 8.333333
1980 10.227273
1981 11.178571
1982 10.436242
1983 9.872000
1984 11.152542
1985 8.371429
1986 10.750000
1987 10.657609
1988 10.322115
1989 10.411311
1990 10.305677
1991 10.498305
1992 10.904762
1993 10.747475
1994 10.675000
1995 12.653846
1996 11.455556
1997 11.230769
1998 13.100000
1999 10.555556
2000 11.400000
2001 11.333333
2002 10.000000
RO 1991 11.000000
2002 10.142857
RX 1995 20.000000
1997 11.000000
SF 1989 54.800000
1990 52.611111
1991 93.166667
1992 43.000000
1993 44.000000
2002 62.166667
SH 1977 NaN
1978 89.000000
1982 79.000000
1983 NaN
1986 55.000000
1987 73.800000
1988 72.806452
1989 76.345455
1990 76.888889
1991 63.000000
1997 50.857143
1999 54.000000
2000 73.375000
2001 79.900000
2002 64.666667
SO 1991 53.909091
1992 55.263158
1993 55.600000
1994 62.333333
1997 54.666667
SS 1977 NaN
1978 130.000000
1979 NaN
1980 NaN
1981 57.000000
1982 NaN
1983 NaN
1984 NaN
1985 NaN
1986 NaN
1987 NaN
1988 NaN
1989 NaN
1990 NaN
1991 NaN
1992 NaN
1993 NaN
1994 NaN
1995 NaN
1996 NaN
1997 NaN
1998 NaN
1999 NaN
2000 NaN
2001 NaN
2002 NaN
ST 1993 NaN
UR 1987 NaN
1988 NaN
1991 NaN
1992 NaN
1994 NaN
2001 NaN
2002 NaN
Name: weight, dtype: float64
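Trends are much easier to judge from a plot than from this listing. A minimal sketch, assuming matplotlib is installed (pandas' .plot() uses it behind the scenes):
PYTHON
# One line per species over the years; species with only NaN
# weights simply plot nothing
result.unstack(level="species_id").plot(legend=False)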
Key Points
- Combine two dataframes on one or more common values using pd.merge()
- Append rows from one dataframe to another using pd.concat()
- Combine multiple text columns into one using the + operator
- Use the str accessor to apply string methods like split() and zfill() to text columns
- Convert date strings to datetime objects using pd.to_datetime()
Content from Data Workflows and Automation
Last updated on 2024-07-17 | Edit this page
Estimated time: 90 minutes
Overview
Questions
- Can I automate operations in Python?
- What are functions and why should I use them?
Objectives
- Describe why for loops are used in Python.
- Employ for loops to automate data analysis.
- Write unique filenames in Python.
- Build reusable code in Python.
- Write functions using conditional statements (if, then, else).
So far, we’ve used Python and the pandas library to explore and manipulate individual datasets by hand, much like we would do in a spreadsheet. The beauty of using a programming language like Python, though, comes from the ability to automate data processing through the use of loops and functions.
For loops
Loops allow us to repeat a workflow (or series of actions) a given number of times or while some condition is true. We would use a loop to automatically process data that’s stored in multiple files (daily values with one file per year, for example). Loops lighten our workload by performing repeated tasks without our direct involvement and make it less likely that we’ll introduce errors by making mistakes while processing each file by hand.
Let’s write a simple for loop that simulates what a kid might see during a visit to the zoo:
PYTHON
animals = ['lion', 'tiger', 'crocodile', 'vulture', 'hippo']
print(animals)
OUTPUT
['lion', 'tiger', 'crocodile', 'vulture', 'hippo']
PYTHON
for creature in animals:
    print(creature)
OUTPUT
lion
tiger
crocodile
vulture
hippo
The line defining the loop must start with for and end with a colon, and the body of the loop must be indented. In this example, creature is the loop variable that takes the value of the next entry in animals every time the loop goes around. We can call the loop variable anything we like. After the loop finishes, the loop variable will still exist and will have the value of the last entry in the collection:
PYTHON
animals = ['lion', 'tiger', 'crocodile', 'vulture', 'hippo']
for creature in animals:
    pass
OUTPUT
PYTHON
print('The loop variable is now: ' + creature)
OUTPUT
The loop variable is now: hippo
We are not asking Python to print the value of the loop variable anymore, but the for loop still runs and the value of creature changes on each pass through the loop. The statement pass in the body of the loop means “do nothing”.
Challenge - Loops
- What happens if we don’t include the pass statement?
- Rewrite the loop so that the animals are separated by commas, not new lines (Hint: You can concatenate strings using a plus sign. For example, print(string1 + string2) outputs ‘string1string2’).
- Omitting the pass statement leaves the loop body empty, and Python raises an error:
PYTHON
animals = ['lion', 'tiger', 'crocodile', 'vulture', 'hippo']
for creature in animals:
ERROR
IndentationError: expected an indented block
- Using the end argument to print():
PYTHON
for creature in animals:
    print(creature + ',', end='')
OUTPUT
lion,tiger,crocodile,vulture,hippo,
This puts a comma on the end of the list, which is not ideal. To avoid this, we need to use an altogether different approach: string objects in Python have a join method, which can be used to concatenate items in a list with the string in between, e.g.
PYTHON
', '.join(animals)
OUTPUT
'lion, tiger, crocodile, vulture, hippo'
Automating data processing using For Loops
The file we’ve been using so far, surveys.csv, contains 25 years of data and is very large. We would like to separate the data for each year into a separate file.
Let’s start by making a new directory inside the folder data to store all of these files using the module os:
PYTHON
import os

os.mkdir('data/yearly_files')
The command os.mkdir is equivalent to mkdir in the shell. Just so we are sure, we can check that the new directory was created within the data folder:
PYTHON
os.listdir('data')
OUTPUT
['plots.csv',
'portal_mammals.sqlite',
'species.csv',
'survey2001.csv',
'survey2002.csv',
'surveys.csv',
'surveys2002_temp.csv',
'yearly_files']
The command os.listdir is equivalent to ls in the shell.
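One caveat (an aside, not from the original lesson): os.mkdir raises a FileExistsError if the directory already exists, which is easy to trigger when re-running a notebook cell. os.makedirs with exist_ok=True is a safe alternative:
PYTHON
import os

# Create the directory only if it does not already exist
os.makedirs('data/yearly_files', exist_ok=True)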
In previous lessons, we saw how to use the library pandas to load the species data into memory as a DataFrame, how to select a subset of the data using some criteria, and how to write the DataFrame into a CSV file. Let’s write a script that performs those three steps in sequence for the year 2002:
PYTHON
import pandas as pd
# Load the data into a DataFrame
surveys_df = pd.read_csv('data/surveys.csv')

# Select only data for the year 2002
surveys2002 = surveys_df[surveys_df.year == 2002]

# Write the new DataFrame to a CSV file
surveys2002.to_csv('data/yearly_files/surveys2002.csv')
To create yearly data files, we could repeat the last two commands over and over, once for each year of data. Repeating code is neither elegant nor practical, and is very likely to introduce errors into your code. We want to turn what we’ve just written into a loop that repeats the last two commands for every year in the dataset.
Let’s start by writing a loop that prints the names of the files we want to create - the dataset we are using covers 1977 through 2002, and we’ll create a separate file for each of those years. Listing the filenames is a good way to confirm that the loop is behaving as we expect.
We have seen that we can loop over a list of items, so we need a list of years to loop over. We can get the years in our DataFrame with:
PYTHON
surveys_df['year']
OUTPUT
0 1977
1 1977
2 1977
3 1977
...
35545 2002
35546 2002
35547 2002
35548 2002
but we want only unique years, which we can get using the unique() method that we have already seen:
PYTHON
surveys_df['year'].unique()
OUTPUT
array([1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987,
1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998,
1999, 2000, 2001, 2002], dtype=int64)
Putting this into our for loop, we get:
PYTHON
for year in surveys_df['year'].unique():
    filename = 'data/yearly_files/surveys' + str(year) + '.csv'
    print(filename)
OUTPUT
data/yearly_files/surveys1977.csv
data/yearly_files/surveys1978.csv
data/yearly_files/surveys1979.csv
data/yearly_files/surveys1980.csv
data/yearly_files/surveys1981.csv
data/yearly_files/surveys1982.csv
data/yearly_files/surveys1983.csv
data/yearly_files/surveys1984.csv
data/yearly_files/surveys1985.csv
data/yearly_files/surveys1986.csv
data/yearly_files/surveys1987.csv
data/yearly_files/surveys1988.csv
data/yearly_files/surveys1989.csv
data/yearly_files/surveys1990.csv
data/yearly_files/surveys1991.csv
data/yearly_files/surveys1992.csv
data/yearly_files/surveys1993.csv
data/yearly_files/surveys1994.csv
data/yearly_files/surveys1995.csv
data/yearly_files/surveys1996.csv
data/yearly_files/surveys1997.csv
data/yearly_files/surveys1998.csv
data/yearly_files/surveys1999.csv
data/yearly_files/surveys2000.csv
data/yearly_files/surveys2001.csv
data/yearly_files/surveys2002.csv
We can now add the rest of the steps we need to create separate text files:
PYTHON
# Load the data into a DataFrame
surveys_df = pd.read_csv('data/surveys.csv')

for year in surveys_df['year'].unique():
    # Select data for the year
    surveys_year = surveys_df[surveys_df.year == year]

    # Write the new DataFrame to a CSV file
    filename = 'data/yearly_files/surveys' + str(year) + '.csv'
    surveys_year.to_csv(filename)
Look inside the yearly_files directory and check a couple of the files you just created to confirm that everything worked as expected.
Writing Unique File Names
Notice that the code above created a unique filename for each year.
PYTHON
filename = 'data/yearly_files/surveys' + str(year) + '.csv'
Let’s break down the parts of this name:
- The first part is some text that specifies the directory to store our data file in (data/yearly_files/) and the first part of the file name (surveys): 'data/yearly_files/surveys'
- We can concatenate this with the value of a variable, in this case year, by using the plus (+) sign and the variable we want to add to the file name: + str(year)
- Then we add the file extension as another text string: + '.csv'
Notice that we use single quotes to add text strings. The variable is not surrounded by quotes. This code produces the string data/yearly_files/surveys2002.csv, which contains the path to the new file AND the file name itself.
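As an aside (not in the original lesson), an f-string builds the same path in one readable step; the resulting string is identical:
PYTHON
# The expression in braces is converted to text and interpolated in place
filename = f'data/yearly_files/surveys{year}.csv'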
Challenge - Modifying loops
- Some of the surveys you saved are missing data (they have null values that show up as NaN - Not A Number - in the DataFrames and do not show up in the text files). Modify the for loop so that the entries with null values are not included in the yearly files.
- Let’s say you only want to look at data from a given multiple of years. How would you modify your loop in order to generate a data file for only every 5th year, starting from 1977?
- Instead of splitting out the data by years, a colleague wants to analyze each species separately. How would you write a unique CSV file for each species?
PYTHON
surveys_year = surveys_df[surveys_df.year == year].dropna()
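Note that dropna() with no arguments drops a row if any of its values are missing. If only certain columns matter, the subset parameter (a standard pandas option) restricts the check; the column names here are just an example:
PYTHON
# Keep rows whose weight and hindfoot_length are both recorded,
# even if other fields are missing
surveys_year = surveys_df[surveys_df.year == year].dropna(
    subset=['weight', 'hindfoot_length'])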
-
You could just make a list manually; however, why not check the first and last years by making use of the code itself?
PYTHON
n_year = 5  # using a variable gives a better overview
first_year = surveys_df['year'].min()
last_year = surveys_df['year'].max()

for year in range(first_year, last_year, n_year):
    print(year)

    # Select data for the year
    surveys_year = surveys_df[surveys_df.year == year].dropna()
-
PYTHON
for species in surveys_df['species_id'].dropna().unique():
    surveys_species = surveys_df[surveys_df.species_id == species]
    filename = 'data/species_files/surveys' + species + '.csv'
    surveys_species.to_csv(filename)
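As with the yearly files, the destination directory must exist before the loop writes to it; a minimal sketch using the same os module introduced earlier:
PYTHON
import os

# Create the destination directory once, before the loop runs
os.makedirs('data/species_files', exist_ok=True)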