Content from Introducing Python
Last updated on 2025-07-03
Overview
Questions
- What is Python?
- How do we assign variables in Python?
- How do we perform actions in Python?
- How does Python handle different types of data?
Objectives
- Introduce some Python terminology
- Learn how to assign values to variables
- Learn how to use functions and methods
- Introduce the built-in Python data types and functions
- Introduce the Python Standard Library
About Python
Python is a high-level, general purpose language. In practical terms, that means it dispenses with many of the complexities of writing code. It is widely used and has an active community of developers, making it suitable for a large number of tasks.
A core philosophical tenet of Python is that code is read more often than it is written. Best practices for coding in Python therefore emphasize writing well-documented, consistently formatted code that follows language-wide standards. In a Pythonic world, everyone’s code would look the same.
Whitespace
One feature to be aware of going in is that Python uses whitespace (as opposed to braces) to organize blocks of code. This can be a sticking point for people used to other languages.
Getting started
Everything in Python is an object. Each object has a type, which controls what the object can do. For example, we can add integers together or append a string to a list. An object can be assigned to a variable, and a function performs an action.
Python is an object-oriented language. This means that each object includes built-in variables and functions that can be used from the object itself. We use different terms to refer to variables and functions associated with an object:
- an attribute stores a piece of data
- a method performs an action
Let’s look at one of the main built-in types, str, to see how this works in practice. In Python, str, short for string, is used to store and manipulate text. To get started, we’ll assign the string “hello world” to the variable text. In Python, we use a single equal sign for assignment and quotes to create a string.
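A minimal cell for this step looks like the following:
PYTHON
# Assign the string "hello world" to the variable text
text = "hello world"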
Using quotes
Either single or double quotes can be used to create strings, but try to use them consistently! We will use double quotes in the examples here.
Now, instead of retyping the full string every time we need it, we can access that string by referring to the variable:
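Entering just the variable name as the last line of a cell displays its value:
PYTHON
text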
OUTPUT
'hello world'
We can check the type of the variable using the type() function:
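For example:
PYTHON
type(text)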
OUTPUT
str
A str is also an object, which means that it includes methods and attributes that we can use. Let’s use the upper() method to convert the text to upper case. We call methods by adding a period and the name of the method to the variable:
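Like so:
PYTHON
text.upper()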
OUTPUT
'HELLO WORLD'
After running a cell, the notebook displays the object that appears in the last line of a cell (but note that some actions, like assigning to a variable, have no output). Some cells will have more than one line, and it is often useful to display output from earlier in a cell. This can be accomplished using the built-in print() function. In the cell below, we’ll use print() to show a series of strings:
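One cell that would produce the output below mixes bare strings with print() calls:
PYTHON
"hello"       # not printed and not the last line, so it produces no output
print("h")
print("e")
print("l")
print("l")
print("o")
"world"       # last line of the cell, so the notebook displays it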
OUTPUT
h
e
l
l
o
'world'
Note that the string “hello” at the top of the cell lacks a print statement and does not appear in the output, whereas the text “world” does appear in the output because it is the last line of the cell.
Each object may contain many attributes and methods. Use the help() function on any object, including functions or methods, to show a description of the object and list the available methods.
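For example, to read about the str type:
PYTHON
help(str)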
OUTPUT
Help on class str in module builtins:
class str(object)
| str(object='') -> str
| str(bytes_or_buffer[, encoding[, errors]]) -> str
|
| Create a new string object from the given object. If encoding or
| errors is specified, then the object must expose a data buffer
| that will be decoded using the given encoding and error handler.
| Otherwise, returns the result of object.__str__() (if defined)
| or repr(object).
| encoding defaults to 'utf-8'.
| errors defaults to 'strict'.
|
| Methods defined here:
|
| __add__(self, value, /)
| Return self+value.
|
| __contains__(self, key, /)
| Return bool(key in self).
|
| __eq__(self, value, /)
| Return self==value.
|
| __format__(self, format_spec, /)
| Return a formatted version of the string as described by format_spec.
|
| __ge__(self, value, /)
| Return self>=value.
|
| __getitem__(self, key, /)
| Return self[key].
|
| __getnewargs__(self, /)
|
| __gt__(self, value, /)
| Return self>value.
|
| __hash__(self, /)
| Return hash(self).
|
| __iter__(self, /)
| Implement iter(self).
|
| __le__(self, value, /)
| Return self<=value.
|
| __len__(self, /)
| Return len(self).
|
| __lt__(self, value, /)
| Return self<value.
|
| __mod__(self, value, /)
| Return self%value.
|
| __mul__(self, value, /)
| Return self*value.
|
| __ne__(self, value, /)
| Return self!=value.
|
| __repr__(self, /)
| Return repr(self).
|
| __rmod__(self, value, /)
| Return value%self.
|
| __rmul__(self, value, /)
| Return value*self.
|
| __sizeof__(self, /)
| Return the size of the string in memory, in bytes.
|
| __str__(self, /)
| Return str(self).
|
| capitalize(self, /)
| Return a capitalized version of the string.
|
| More specifically, make the first character have upper case and the rest lower
| case.
|
| casefold(self, /)
| Return a version of the string suitable for caseless comparisons.
|
| center(self, width, fillchar=' ', /)
| Return a centered string of length width.
|
| Padding is done using the specified fill character (default is a space).
|
| count(self, sub[, start[, end]], /)
| Return the number of non-overlapping occurrences of substring sub in string S[start:end].
|
| Optional arguments start and end are interpreted as in slice notation.
|
| encode(self, /, encoding='utf-8', errors='strict')
| Encode the string using the codec registered for encoding.
|
| encoding
| The encoding in which to encode the string.
| errors
| The error handling scheme to use for encoding errors.
| The default is 'strict' meaning that encoding errors raise a
| UnicodeEncodeError. Other possible values are 'ignore', 'replace' and
| 'xmlcharrefreplace' as well as any other name registered with
| codecs.register_error that can handle UnicodeEncodeErrors.
|
| endswith(self, suffix[, start[, end]], /)
| Return True if the string ends with the specified suffix, False otherwise.
|
| suffix
| A string or a tuple of strings to try.
| start
| Optional start position. Default: start of the string.
| end
| Optional stop position. Default: end of the string.
|
| expandtabs(self, /, tabsize=8)
| Return a copy where all tab characters are expanded using spaces.
|
| If tabsize is not given, a tab size of 8 characters is assumed.
|
| find(self, sub[, start[, end]], /)
| Return the lowest index in S where substring sub is found, such that sub is contained within S[start:end].
|
| Optional arguments start and end are interpreted as in slice notation.
| Return -1 on failure.
|
| format(self, /, *args, **kwargs)
| Return a formatted version of the string, using substitutions from args and kwargs.
| The substitutions are identified by braces ('{' and '}').
|
| format_map(self, mapping, /)
| Return a formatted version of the string, using substitutions from mapping.
| The substitutions are identified by braces ('{' and '}').
|
| index(self, sub[, start[, end]], /)
| Return the lowest index in S where substring sub is found, such that sub is contained within S[start:end].
|
| Optional arguments start and end are interpreted as in slice notation.
| Raises ValueError when the substring is not found.
|
| isalnum(self, /)
| Return True if the string is an alpha-numeric string, False otherwise.
|
| A string is alpha-numeric if all characters in the string are alpha-numeric and
| there is at least one character in the string.
|
| isalpha(self, /)
| Return True if the string is an alphabetic string, False otherwise.
|
| A string is alphabetic if all characters in the string are alphabetic and there
| is at least one character in the string.
|
| isascii(self, /)
| Return True if all characters in the string are ASCII, False otherwise.
|
| ASCII characters have code points in the range U+0000-U+007F.
| Empty string is ASCII too.
|
| isdecimal(self, /)
| Return True if the string is a decimal string, False otherwise.
|
| A string is a decimal string if all characters in the string are decimal and
| there is at least one character in the string.
|
| isdigit(self, /)
| Return True if the string is a digit string, False otherwise.
|
| A string is a digit string if all characters in the string are digits and there
| is at least one character in the string.
|
| isidentifier(self, /)
| Return True if the string is a valid Python identifier, False otherwise.
|
| Call keyword.iskeyword(s) to test whether string s is a reserved identifier,
| such as "def" or "class".
|
| islower(self, /)
| Return True if the string is a lowercase string, False otherwise.
|
| A string is lowercase if all cased characters in the string are lowercase and
| there is at least one cased character in the string.
|
| isnumeric(self, /)
| Return True if the string is a numeric string, False otherwise.
|
| A string is numeric if all characters in the string are numeric and there is at
| least one character in the string.
|
| isprintable(self, /)
| Return True if all characters in the string are printable, False otherwise.
|
| A character is printable if repr() may use it in its output.
|
| isspace(self, /)
| Return True if the string is a whitespace string, False otherwise.
|
| A string is whitespace if all characters in the string are whitespace and there
| is at least one character in the string.
|
| istitle(self, /)
| Return True if the string is a title-cased string, False otherwise.
|
| In a title-cased string, upper- and title-case characters may only
| follow uncased characters and lowercase characters only cased ones.
|
| isupper(self, /)
| Return True if the string is an uppercase string, False otherwise.
|
| A string is uppercase if all cased characters in the string are uppercase and
| there is at least one cased character in the string.
|
| join(self, iterable, /)
| Concatenate any number of strings.
|
| The string whose method is called is inserted in between each given string.
| The result is returned as a new string.
|
| Example: '.'.join(['ab', 'pq', 'rs']) -> 'ab.pq.rs'
|
| ljust(self, width, fillchar=' ', /)
| Return a left-justified string of length width.
|
| Padding is done using the specified fill character (default is a space).
|
| lower(self, /)
| Return a copy of the string converted to lowercase.
|
| lstrip(self, chars=None, /)
| Return a copy of the string with leading whitespace removed.
|
| If chars is given and not None, remove characters in chars instead.
|
| partition(self, sep, /)
| Partition the string into three parts using the given separator.
|
| This will search for the separator in the string. If the separator is found,
| returns a 3-tuple containing the part before the separator, the separator
| itself, and the part after it.
|
| If the separator is not found, returns a 3-tuple containing the original string
| and two empty strings.
|
| removeprefix(self, prefix, /)
| Return a str with the given prefix string removed if present.
|
| If the string starts with the prefix string, return string[len(prefix):].
| Otherwise, return a copy of the original string.
|
| removesuffix(self, suffix, /)
| Return a str with the given suffix string removed if present.
|
| If the string ends with the suffix string and that suffix is not empty,
| return string[:-len(suffix)]. Otherwise, return a copy of the original
| string.
|
| replace(self, old, new, /, count=-1)
| Return a copy with all occurrences of substring old replaced by new.
|
| count
| Maximum number of occurrences to replace.
| -1 (the default value) means replace all occurrences.
|
| If the optional argument count is given, only the first count occurrences are
| replaced.
|
| rfind(self, sub[, start[, end]], /)
| Return the highest index in S where substring sub is found, such that sub is contained within S[start:end].
|
| Optional arguments start and end are interpreted as in slice notation.
| Return -1 on failure.
|
| rindex(self, sub[, start[, end]], /)
| Return the highest index in S where substring sub is found, such that sub is contained within S[start:end].
|
| Optional arguments start and end are interpreted as in slice notation.
| Raises ValueError when the substring is not found.
|
| rjust(self, width, fillchar=' ', /)
| Return a right-justified string of length width.
|
| Padding is done using the specified fill character (default is a space).
|
| rpartition(self, sep, /)
| Partition the string into three parts using the given separator.
|
| This will search for the separator in the string, starting at the end. If
| the separator is found, returns a 3-tuple containing the part before the
| separator, the separator itself, and the part after it.
|
| If the separator is not found, returns a 3-tuple containing two empty strings
| and the original string.
|
| rsplit(self, /, sep=None, maxsplit=-1)
| Return a list of the substrings in the string, using sep as the separator string.
|
| sep
| The separator used to split the string.
|
| When set to None (the default value), will split on any whitespace
| character (including \n \r \t \f and spaces) and will discard
| empty strings from the result.
| maxsplit
| Maximum number of splits.
| -1 (the default value) means no limit.
|
| Splitting starts at the end of the string and works to the front.
|
| rstrip(self, chars=None, /)
| Return a copy of the string with trailing whitespace removed.
|
| If chars is given and not None, remove characters in chars instead.
|
| split(self, /, sep=None, maxsplit=-1)
| Return a list of the substrings in the string, using sep as the separator string.
|
| sep
| The separator used to split the string.
|
| When set to None (the default value), will split on any whitespace
| character (including \n \r \t \f and spaces) and will discard
| empty strings from the result.
| maxsplit
| Maximum number of splits.
| -1 (the default value) means no limit.
|
| Splitting starts at the front of the string and works to the end.
|
| Note, str.split() is mainly useful for data that has been intentionally
| delimited. With natural text that includes punctuation, consider using
| the regular expression module.
|
| splitlines(self, /, keepends=False)
| Return a list of the lines in the string, breaking at line boundaries.
|
| Line breaks are not included in the resulting list unless keepends is given and
| true.
|
| startswith(self, prefix[, start[, end]], /)
| Return True if the string starts with the specified prefix, False otherwise.
|
| prefix
| A string or a tuple of strings to try.
| start
| Optional start position. Default: start of the string.
| end
| Optional stop position. Default: end of the string.
|
| strip(self, chars=None, /)
| Return a copy of the string with leading and trailing whitespace removed.
|
| If chars is given and not None, remove characters in chars instead.
|
| swapcase(self, /)
| Convert uppercase characters to lowercase and lowercase characters to uppercase.
|
| title(self, /)
| Return a version of the string where each word is titlecased.
|
| More specifically, words start with uppercased characters and all remaining
| cased characters have lower case.
|
| translate(self, table, /)
| Replace each character in the string using the given translation table.
|
| table
| Translation table, which must be a mapping of Unicode ordinals to
| Unicode ordinals, strings, or None.
|
| The table must implement lookup/indexing via __getitem__, for instance a
| dictionary or list. If this operation raises LookupError, the character is
| left untouched. Characters mapped to None are deleted.
|
| upper(self, /)
| Return a copy of the string converted to uppercase.
|
| zfill(self, width, /)
| Pad a numeric string with zeros on the left, to fill a field of the given width.
|
| The string is never truncated.
|
| ----------------------------------------------------------------------
| Static methods defined here:
|
| __new__(*args, **kwargs)
| Create and return a new object. See help(type) for accurate signature.
|
| maketrans(x, y=<unrepresentable>, z=<unrepresentable>, /)
| Return a translation table usable for str.translate().
|
| If there is only one argument, it must be a dictionary mapping Unicode
| ordinals (integers) or characters to Unicode ordinals, strings or None.
| Character keys will be then converted to ordinals.
| If there are two arguments, they must be strings of equal length, and
| in the resulting dictionary, each character in x will be mapped to the
| character at the same position in y. If there is a third argument, it
| must be a string, whose characters will be mapped to None in the result.
Data types
Python includes data types for representing other kinds of data, such as numbers and collections. The core Python data types are introduced in the table below. We’ll talk more about some of these as we encounter them in the lesson:
Type | Definition | Example |
---|---|---|
str | Character string | "hello world" |
int | Integer numerical value | 42 |
float | Approximate numerical value | 0.406 |
bool | Stores True or False values | True or False |
list | Sequence that can be modified | ["a", "b", "c"] |
tuple | Sequence that cannot be modified | ("a", "b", "c") |
dict | Mapping of keys to values | {"DC": "District of Columbia", "MD": "Maryland", "VA": "Virginia"} |
set | Collection of unique values | {"1", "2", 1} |
Libraries
Python includes a number of built-in functions that are available wherever Python is installed. See the table below for some examples.
Examples of built-in functions
Name | Description | Example |
---|---|---|
abs() | Gets the absolute value of a number | abs(-1.5) # returns 1.5 |
max() | Gets the highest value in a sequence | max([1, 2, 3]) # returns 3 |
min() | Gets the lowest value in a sequence | min([1, 2, 3]) # returns 1 |
round() | Rounds a number to the nearest integer | round(5.4) # returns 5 |
Python also includes a number of built-in modules. A module bundles functions and other code related to a single task or data type. They are used to simplify the performance of common tasks. By using a common, well-tested code base, modules allow coders to work more quickly and with fewer errors.
Modules, packages, and libraries
A module is generally a single file. Collections of modules are referred to as packages or libraries.
The modules built into Python are referred to as the Python Standard Library. They can be accessed through a typical Python installation and do not require any additional downloads. A few examples are included in the table below, but as with the table of built-in functions, there are more where those came from.
Examples from the Python Standard Library
Library | Description |
---|---|
datetime | Read, write, and analyze dates and times |
os | Create, manipulate, and get information about files and paths |
random | Generate pseudo-random numbers |
Documentation
The documentation for each module can be viewed by clicking the link in the table. Documentation is an invaluable resource. It provides descriptions of what a module does and detailed information about how it can be used, often including examples.
Unlike the built-in functions, we must import a library before we can use it. We do so using the import statement:
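For example, to import the datetime module used below:
PYTHON
import datetime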
Once imported, the module is available for use anywhere in the current document. A module is itself an object, and we must include the name of the module to access any functions it defines. For example, to create a datetime.date object (that is, a date object defined by the datetime module), we include both the module name and the class name:
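A sketch of this step, assuming we store the result in a variable named date and use the date shown in the output further down (January 1, 1970):
PYTHON
date = datetime.date(1970, 1, 1)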
As with the built-in types, the datetime.date object includes its own suite of attributes and methods. We can, for example, use the year attribute to get the year:
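Continuing with the date variable assumed above:
PYTHON
date.year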
OUTPUT
1970
Or convert the date to a string using the strftime() method and Python’s date format codes. Here, %Y corresponds to YYYY, %m to MM, and %d to DD.
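Again using the assumed date variable:
PYTHON
date.strftime("%Y-%m-%d")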
OUTPUT
'1970-01-01'
Like Python itself, the Python Standard Library is maintained by the Python Software Foundation. The built-in modules are limited to a relatively small set of operations expected to be useful to a broad population of users. However, Python allows users to create their own packages to perform actions that are beyond the scope of core Python. The rest of this lesson will focus on an external package called pandas.
Key Points
- Python is a widely used language that can be used for a variety of tasks, including analyzing data
- Python uses different data types to handle text, numbers, collections, and other kinds of data
- Assign values to variables using the = operator
- Use functions and methods to perform specific actions
- Python’s functionality can be extended using packages developed by the community
- Use the help() function and developer documentation to learn more about Python modules and packages
Content from Introducing pandas
Last updated on 2025-07-03
Overview
Questions
- What data will we be working with in this lesson?
- What is pandas?
- Why use pandas for data analysis?
- How do we read and write data using pandas?
Objectives
- Learn about the dataset we’ll be working with
- Look at the benefits of using pandas to analyze data
- Import data from a CSV into a pandas dataframe
- Learn how pandas handles different types of data
- Write a dataframe to a CSV
Dataset description
The dataset description was taken from Data Management with SQL for Ecologists (CC-BY-4.0)
The data we will be using is a time-series for a small mammal community in southern Arizona. This is part of a project studying the effects of rodents and ants on the plant community that has been running for almost 40 years. The rodents are sampled on a series of 24 plots, with different experimental manipulations controlling which rodents are allowed to access which plots. Each row represents a single observation of an animal, including its species, weight, and hindfoot length.
This is a real dataset that has been used in over 100 publications. We’ve simplified it for the workshop, but you can download the full dataset and work with it using exactly the same tools we’ll learn about today.
Answering questions using data
Let’s look at some of the cleaned spreadsheets we downloaded during Setup to complete this challenge. Over the course of this lesson, we’ll be working with the following three files:
- surveys.csv
- species.csv
- plots.csv
Discussion
Open each of these csv files and explore them. What information is contained in each file? Specifically, if we had the following research questions:
- How has the hindfoot length and weight of Dipodomys species changed over time?
- What is the average weight of each species, per year?
- What information can I learn about Dipodomys species in the 2000s, over time?
What would we need to answer these questions? Which files have the data we need? What operations would we need to perform if we were doing these analyses by hand?
Hint: We can view CSV files by clicking on them in the left sidebar.
In order to answer the questions described above, we’ll need to do the following basic data operations:
- select subsets of the data (rows and columns)
- group subsets of data
- do math and other calculations
- combine data across spreadsheets
- plot data to identify patterns
In addition, we don’t want to do this manually! Instead of searching for the right pieces of data ourselves, or clicking between spreadsheets, or manually sorting columns, we want to make the computer do the work.
In particular, we want to use a tool where it’s easy to repeat our analysis in case our data changes. We also want to do all this searching without actually modifying our source data.
Why use pandas?
The Python Data Analysis Library, or pandas, is a Python package used to work with dataframes. A dataframe is a representation of tabular data similar to a well-organized spreadsheet, consisting of rows (representing records) and columns (representing fields or variables). Tables are a very common format for representing scientific data and should be very familiar to anyone taking this course.
Pandas offers the same advantages as any well-written package: It creates a common codebase for working on a single task, in this case, analyzing data. Some benefits of this approach include:
- Reliability: Provides flexible, well-tested methods for reading, querying, aggregating, grouping, and plotting data
- Repeatability: Repeat the same analyses when data is added or changed
- Speed: Faster in many cases than coding our own functions in Python
- Reproducibility: Document and share code in narrative form using tools like Jupyter notebooks
- Community: Access a large, active community for help when we run into problems
Importing data using pandas
Unlike the modules in the Python Standard Library discussed in the previous lesson, pandas is not part of the typical Python installation. Once it has been installed, however, it can be accessed using the same import command used to import built-in modules. By convention, pandas is imported using the alias “pd”. We can assign the alias using the as keyword:
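In a code cell:
PYTHON
import pandas as pd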
Using aliases
Why use an alias? We will refer to pandas many, many times when writing a script, so it’s useful to abbreviate the name and save some keystrokes. But it’s also a matter of consistency with the larger community. Many of the core scientific Python packages, including pandas, recommend a specific alias, so most code shared online will use those aliases as well.
Now that pandas has been imported, we can access the function we need to load data from a CSV, pd.read_csv(). The function call has three parts:
- The name (or in this case, alias) of the object that defines the function. This can be a module, package, or any other object. It can also be omitted in some cases (for example, when using a function built into Python).
- The name of the method we’d like to use
- A set of parentheses that tells the function to run.
Many functions include parameters that allow the user to modify the behavior of the function. Parameters may be positional or named. In Python, data passed to positional parameters are called arguments (often abbreviated as args), and data passed to named parameters are called keyword arguments (often abbreviated as kwargs). In either case, the arguments are included inside the parentheses used to call the function.
Below, we will pass a single argument to pd.read_csv(): a string that tells the function where to find the surveys.csv file.
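A minimal call, assuming surveys.csv sits in the current working directory (adjust the path if it lives elsewhere):
PYTHON
pd.read_csv("surveys.csv")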
OUTPUT
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN |
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN |
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
… | … | … | … | … | … | … | … | … | … |
35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 |
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN |
35549 rows × 9 columns
Here are a few things to observe about how the dataframe is structured:
- By default, the notebook displays the first and last five rows of the dataframe
- Each row represents a record
- Each column represents a field
- The unlabeled column on the far left is called the row label
- pandas has done a lot of work behind the scenes when reading the
data, including:
- Assigning the row index as the row label
- Assigning each column a data type based on its contents
- Assigning certain cells the value NaN, which stands for “not a number” and is used to designate null values in the dataset. Here, those cells represent blank cells in the spreadsheet.
Much of this behavior can be controlled when the spreadsheet is first read by using keyword arguments. For example, to force pd.read_csv() to use the existing record_id column as the row label, use the index_col keyword argument:
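For example (again assuming the file is in the working directory):
PYTHON
pd.read_csv("surveys.csv", index_col="record_id")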
OUTPUT
month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|
record_id | ||||||||
1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN |
2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN |
3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
… | … | … | … | … | … | … | … | … |
35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 |
35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 |
35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN |
35549 rows × 8 columns
Now record_id appears on the far left, indicating that pandas is using that column as the row label.
Challenge
Run help() on the pd.read_csv() function (note that you should omit the trailing parentheses). Alternatively, take a look at the much prettier documentation provided by the pandas developers. Based on that information, answer the following questions:
- What parameter would you use to control how null values are interpreted?
- What values are interpreted as NaN by default?
na_values
In addition to empty cells, the following values are interpreted as NaN, or null, by pandas. These strings may look familiar from programs like Excel.
- #N/A
- #N/A N/A
- #NA
- -1.#IND
- -1.#QNAN
- -NaN
- -nan
- 1.#IND
- 1.#QNAN
- <NA>
- N/A
- NA
- NULL
- NaN
- n/a
- nan
- null
Assigning to variables
We can assign the dataframe to a variable so we don’t need to load it every time we want to access it. As we saw in the previous lesson, we use a single equals sign to assign an object to a variable. The variable name should be short and descriptive. By convention, variable names in Python use snake_case (that is, lower case with individual words separated by underscores).
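For example, to store the dataframe in a variable named surveys and display it:
PYTHON
surveys = pd.read_csv("surveys.csv")
surveys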
OUTPUT
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN |
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN |
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
… | … | … | … | … | … | … | … | … | … |
35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 |
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN |
35549 rows × 9 columns
Understanding data types in pandas
We can use the info() method to see how pandas interpreted each column in the dataset. This method gives us the name, count, and data type of each column and provides some information about the dataset as a whole (for example, memory usage, which is helpful to know when working with large datasets):
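For example:
PYTHON
surveys.info()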
OUTPUT
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35549 entries, 0 to 35548
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 record_id 35549 non-null int64
1 month 35549 non-null int64
2 day 35549 non-null int64
3 year 35549 non-null int64
4 plot_id 35549 non-null int64
5 species_id 34786 non-null object
6 sex 33038 non-null object
7 hindfoot_length 31438 non-null float64
8 weight 32283 non-null float64
dtypes: float64(2), int64(5), object(2)
memory usage: 2.4+ MB
Note that the data types used by pandas look a little different than the ones used by Python itself. See the table below for the rough equivalents.
Saving a dataframe
When analyzing a dataset, we’ll often want to save our work to a file. The to_csv() method can be used to write a dataframe to a CSV, which is a text file which places each row on a separate line and separates individual cells with commas. We will include a positional argument with the path to which we want to save the file. We will also include the index keyword argument. Setting that parameter to False tells pandas not to include the row label when writing the CSV.
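A sketch of this step; the output filename is only an example:
PYTHON
# Write the dataframe to a new CSV file, omitting the row label
surveys.to_csv("surveys_output.csv", index=False)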
This is far from the only option for writing data—pandas supports a variety of file types for both reading and writing. Try searching for “read_” in the pandas API reference to see other supported formats.
Key Points
- This lesson uses real data from a decades-long survey of rodents in Arizona
- pandas is a data analysis package that allows users to read, manipulate, and view tabular data using Python
- pandas represents data as a dataframe consisting of rows (records) and columns (fields or variables)
- We can read a dataframe from a CSV file using the pd.read_csv() function and write a dataframe to a CSV file using the to_csv() method
- The behavior of a function can be modified by including arguments and keyword arguments when the function is called
- pandas uses its own classes to represent text, numbers, booleans, and datetimes
Content from Accessing and Filtering Data
Last updated on 2025-07-15
Overview
Questions
- How can we look at individual rows and columns in a dataframe?
- How can we look at subsets of the dataset?
Objectives
- Access individual rows and columns
- Access multiple columns at once using a list
- Filter the dataframe based on the data it contains
- Sort the dataframe
In the previous lesson, we saw how to load a dataframe from a file. Now we’ll look at how to access the data within that dataframe. We’ll begin by importing pandas and reading our CSV, as we did in the previous lesson:
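For example:
PYTHON
import pandas as pd

surveys = pd.read_csv("surveys.csv")
surveys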
OUTPUT
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN |
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN |
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
… | … | … | … | … | … | … | … | … | … |
35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 |
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN |
35549 rows × 9 columns
Getting a column
We can get a single column from the dataframe using square brackets. Square brackets are used in Python to access objects inside a container like a list, dict, or DataFrame. To get a column, we pass the name of the column inside a set of brackets appended to the dataframe. For example, to retrieve the year, use:
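In a code cell:
PYTHON
surveys["year"]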
OUTPUT
0 1977
1 1977
2 1977
3 1977
4 1977
...
35544 2002
35545 2002
35546 2002
35547 2002
35548 2002
Name: year, Length: 35549, dtype: int64
A single column is returned as a Series, which is why the output for this cell is formatted differently than the other cells in this lesson. The main difference between a Series and a DataFrame is that a DataFrame can have multiple columns. The two classes share many but not all attributes and methods.
Note also that this is a copy of data from the original dataframe. Changing values in the series will have no effect on the original dataframe. Most operations on a DataFrame or Series return a copy of the original object.
Getting unique values
We can get the list of unique values within a column using the unique() method:
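For example, to list the unique species codes:
PYTHON
surveys["species_id"].unique()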
OUTPUT
array(['NL', 'DM', 'PF', 'PE', 'DS', 'PP', 'SH', 'OT', 'DO', 'OX', 'SS',
'OL', 'RM', nan, 'SA', 'PM', 'AH', 'DX', 'AB', 'CB', 'CM', 'CQ',
'RF', 'PC', 'PG', 'PH', 'PU', 'CV', 'UR', 'UP', 'ZL', 'UL', 'CS',
'SC', 'BA', 'SF', 'RO', 'AS', 'SO', 'PI', 'ST', 'CU', 'SU', 'RX',
'PB', 'PL', 'PX', 'CT', 'US'], dtype=object)
Getting multiple columns
It is also possible to retrieve more than one column at a time. To do so, we’ll use the built-in list data type.
Python uses list to store sequences, that is, ordered lists of objects. Any type of Python object can be stored in a list: strings, integers, floats, even other lists or collections. Once created, a list can be modified by appending, inserting, or deleting items. We will not be going into detail about these operations here, but as always you can learn more about how to use a list using help() or the Python docs.
You can create a list using square brackets. Let’s create a list of the three columns in our dataframe that together give the date of an observation:
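A sketch of this step; the list name cols is only an example:
PYTHON
cols = ["year", "month", "day"]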
When we pass this list to the survey dataframe using square brackets, as we did above, we get a copy of the dataframe containing just those columns. Note that, because we asked for more than one column, pandas returns a dataframe:
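Using the cols list assumed above:
PYTHON
surveys[cols]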
OUTPUT
year | month | day | |
---|---|---|---|
0 | 1977 | 7 | 16 |
1 | 1977 | 7 | 16 |
2 | 1977 | 7 | 16 |
3 | 1977 | 7 | 16 |
4 | 1977 | 7 | 16 |
… | … | … | … |
35544 | 2002 | 12 | 31 |
35545 | 2002 | 12 | 31 |
35546 | 2002 | 12 | 31 |
35547 | 2002 | 12 | 31 |
35548 | 2002 | 12 | 31 |
35549 rows × 3 columns
Suppose we want to get the unique values for multiple columns. The unique() method only works on a Series, that is, a single column. Instead, we can use the drop_duplicates() method on a copy of the dataframe with the columns we’re interested in. Like any well-named method, drop_duplicates() does exactly what the name implies: It returns a copy of the dataframe with all duplicate rows removed.
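For example, to find the unique combinations of plot and species:
PYTHON
surveys[["plot_id", "species_id"]].drop_duplicates()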
OUTPUT
plot_id | species_id | |
---|---|---|
0 | 2 | NL |
1 | 3 | NL |
2 | 2 | DM |
3 | 7 | DM |
4 | 3 | DM |
… | … | … |
35474 | 22 | SF |
35511 | 11 | US |
35519 | 9 | SF |
35527 | 13 | US |
35543 | 15 | US |
612 rows × 2 columns
Getting one or more rows
Pandas provides a variety of ways to view rows within a dataframe. We can get the rows at the beginning of the dataframe using head():
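For example:
PYTHON
surveys.head()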
OUTPUT
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN |
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN |
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
By default, this method returns the first five rows. We can provide a number inside the parentheses if we want to view a different number of rows:
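For example, to view the first ten rows:
PYTHON
surveys.head(10)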
OUTPUT
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN |
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN |
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
5 | 6 | 7 | 16 | 1977 | 1 | PF | M | 14.0 | NaN |
6 | 7 | 7 | 16 | 1977 | 2 | PE | F | NaN | NaN |
7 | 8 | 7 | 16 | 1977 | 1 | DM | M | 37.0 | NaN |
8 | 9 | 7 | 16 | 1977 | 1 | DM | F | 34.0 | NaN |
9 | 10 | 7 | 16 | 1977 | 6 | PF | F | 20.0 | NaN |
The tail() method is similar, except it returns rows from the end of the table:
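For example:
PYTHON
surveys.tail()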
OUTPUT
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 |
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN |
Or we can use sample() to return a random row from anywhere in the dataframe:
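For example:
PYTHON
surveys.sample()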
OUTPUT
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
25936 | 25937 | 5 | 11 | 1997 | 15 | OT | F | 20.0 | 25.0 |
If you’re following along, you may notice that the output of this method on your screen differs from what’s shown here. That’s exactly what we’d expect to see. Remember, sample() is returning a random row, so it would be more surprising if the outputs were the same!
Slicing the dataframe
The head(), tail(), and sample() methods are useful for getting a feel for how our data is structured, but we may also want to look at specific rows. One way to do so is to extract rows based on where they appear in the dataframe. We can use square brackets to extract these slices. A slice is a subset of the dataframe starting at one row and ending at another. To get a slice, we pass the starting and ending indexes to the square brackets as start:end:
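For example, to get the slice starting at row 2 and ending before row 5:
PYTHON
surveys[2:5]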
OUTPUT
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
There are three things to be aware of when slicing a dataframe:
- Row indexes are zero-based. The first row has an index of 0, not 1.
- When slicing, the slice includes the start index but not the end index. In this case, that means the slice includes rows 2, 3, and 4 but not 5.
- The row label can be different from the row index. They happen to be the same here, but don’t count on that being true.
Core Python types like list and tuple use the same conventions, as do most Python packages that work with sequences.
Filtering data
It is often more useful to subset a dataframe based on the data itself. Pandas provides a variety of ways to filter a dataframe in this way. For example, suppose we want to look at a specific species in the surveys dataframe. We can view the rows matching a given species using the same square brackets we used above to select specific columns and rows. Here, however, instead of using a value or list of values, we will use a conditional expression.
A conditional expression is a statement that evaluates as either True or False. Conditional expressions often make use of comparison operators, for example:
- == for equals
- != for does not equal
- > for greater than
- >= for greater than or equal to
- < for less than
- <= for less than or equal to
Examples of conditional statements include:
- "a" == "b" evaluates False
- "a" != "b" evaluates True
- 3 > 4 evaluates False
Note that, when comparing strings, evaluations are case sensitive:
- "a" == "A" evaluates False
= for assignment, == for equality
Remember that, in Python, a single equal sign is used to assign values to variables. We’ve already used the assignment operator in this lesson, for example, when we assigned the dataframe returned by pd.read_csv() to the surveys variable.
To limit the dataframe to rows matching the species “DM”, we will again use square brackets. This time, instead of passing a string or a number, we will include the conditional surveys["species_id"] == "DM" inside the square brackets:
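In a code cell:
PYTHON
surveys[surveys["species_id"] == "DM"]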
OUTPUT
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
7 | 8 | 7 | 16 | 1977 | 1 | DM | M | 37.0 | NaN |
8 | 9 | 7 | 16 | 1977 | 1 | DM | F | 34.0 | NaN |
… | … | … | … | … | … | … | … | … | … |
35532 | 35533 | 12 | 31 | 2002 | 14 | DM | F | 36.0 | 48.0 |
35533 | 35534 | 12 | 31 | 2002 | 14 | DM | M | 37.0 | 56.0 |
35534 | 35535 | 12 | 31 | 2002 | 14 | DM | M | 37.0 | 53.0 |
35535 | 35536 | 12 | 31 | 2002 | 14 | DM | F | 35.0 | 42.0 |
35536 | 35537 | 12 | 31 | 2002 | 14 | DM | F | 36.0 | 46.0 |
10596 rows × 9 columns
Other comparisons can be used in the same way. To limit our results to observations made in or after 2000, use:
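For example:
PYTHON
surveys[surveys["year"] >= 2000]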
OUTPUT
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
30158 | 30159 | 1 | 8 | 2000 | 1 | PP | F | 22.0 | 17.0 |
30159 | 30160 | 1 | 8 | 2000 | 1 | DO | M | 35.0 | 53.0 |
30160 | 30161 | 1 | 8 | 2000 | 1 | PP | F | 21.0 | 17.0 |
30161 | 30162 | 1 | 8 | 2000 | 1 | DM | M | 36.0 | 50.0 |
30162 | 30163 | 1 | 8 | 2000 | 1 | PP | M | 20.0 | 16.0 |
… | … | … | … | … | … | … | … | … | … |
35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 |
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN |
5391 rows × 9 columns
As when we selected columns above, each filtering operation returns a copy of the dataframe.
Using complex filters
When analyzing data, we will often need to filter on multiple columns at one time. In pandas, we can combine conditionals using bitwise operators. These work like the terms AND and OR in many search interfaces:
- & : True if conditions on both sides of the operator are True (and)
- | : True if a condition on either side is True (or)
To return all observations of DM in or after 2000, we can combine the two conditionals we used previously into a single operation. Note that, when joining conditionals using & or |, we must wrap each individual condition in parentheses. If we omit the parentheses, pandas will not perform the comparisons in the expected order.
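For example:
PYTHON
surveys[(surveys["species_id"] == "DM") & (surveys["year"] >= 2000)]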
OUTPUT
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
30161 | 30162 | 1 | 8 | 2000 | 1 | DM | M | 36.0 | 50.0 |
30178 | 30179 | 1 | 8 | 2000 | 12 | DM | M | 36.0 | 60.0 |
30195 | 30196 | 1 | 8 | 2000 | 17 | DM | M | 37.0 | 52.0 |
30196 | 30197 | 1 | 8 | 2000 | 17 | DM | F | 34.0 | 43.0 |
30209 | 30210 | 1 | 8 | 2000 | 22 | DM | M | 38.0 | 56.0 |
… | … | … | … | … | … | … | … | … | … |
35532 | 35533 | 12 | 31 | 2002 | 14 | DM | F | 36.0 | 48.0 |
35533 | 35534 | 12 | 31 | 2002 | 14 | DM | M | 37.0 | 56.0 |
35534 | 35535 | 12 | 31 | 2002 | 14 | DM | M | 37.0 | 53.0 |
35535 | 35536 | 12 | 31 | 2002 | 14 | DM | F | 35.0 | 42.0 |
35536 | 35537 | 12 | 31 | 2002 | 14 | DM | F | 36.0 | 46.0 |
847 rows × 9 columns
We can also use methods to filter the dataframe. For example, isin() can be used to match a list of values. Methods can be combined with other conditionals as above. The example below returns rows from 2000 or later with either “DM”, “DO”, or “DS” in the species_id column:
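In a code cell:
PYTHON
surveys[(surveys["year"] >= 2000) & (surveys["species_id"].isin(["DM", "DO", "DS"]))]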
OUTPUT
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
30159 | 30160 | 1 | 8 | 2000 | 1 | DO | M | 35.0 | 53.0 |
30161 | 30162 | 1 | 8 | 2000 | 1 | DM | M | 36.0 | 50.0 |
30166 | 30167 | 1 | 8 | 2000 | 1 | DO | M | 35.0 | 41.0 |
30170 | 30171 | 1 | 8 | 2000 | 2 | DO | M | 36.0 | 52.0 |
30172 | 30173 | 1 | 8 | 2000 | 2 | DO | F | 35.0 | 54.0 |
… | … | … | … | … | … | … | … | … | … |
35533 | 35534 | 12 | 31 | 2002 | 14 | DM | M | 37.0 | 56.0 |
35534 | 35535 | 12 | 31 | 2002 | 14 | DM | M | 37.0 | 53.0 |
35535 | 35536 | 12 | 31 | 2002 | 14 | DM | F | 35.0 | 42.0 |
35536 | 35537 | 12 | 31 | 2002 | 14 | DM | F | 36.0 | 46.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 |
1268 rows × 9 columns
Sorting data
We can sort a dataframe using the sort_values() method. To sort by weight, we’ll pass the name of that column to sort_values():
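For example:
PYTHON
surveys.sort_values("weight")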
OUTPUT
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
28125 | 28126 | 6 | 28 | 1998 | 15 | PF | M | NaN | 4.0 |
217 | 218 | 9 | 13 | 1977 | 1 | PF | M | 13.0 | 4.0 |
9822 | 9823 | 1 | 19 | 1985 | 23 | RM | M | 16.0 | 4.0 |
9852 | 9853 | 1 | 19 | 1985 | 17 | RM | M | 16.0 | 4.0 |
9936 | 9937 | 2 | 16 | 1985 | 21 | RM | M | 16.0 | 4.0 |
… | … | … | … | … | … | … | … | … | … |
35530 | 35531 | 12 | 31 | 2002 | 13 | PB | F | 27.0 | NaN |
35543 | 35544 | 12 | 31 | 2002 | 15 | US | NaN | NaN | NaN |
35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN |
35549 rows × 9 columns
By default, rows are sorted in ascending order (smallest to largest). We can reorder them from largest to smallest using the ascending keyword argument:
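For example:
PYTHON
surveys.sort_values("weight", ascending=False)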
OUTPUT
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
33048 | 33049 | 11 | 17 | 2001 | 12 | NL | M | 33.0 | 280.0 |
12870 | 12871 | 5 | 28 | 1987 | 2 | NL | M | 32.0 | 278.0 |
15458 | 15459 | 1 | 11 | 1989 | 9 | NL | M | 36.0 | 275.0 |
2132 | 2133 | 10 | 25 | 1979 | 2 | NL | F | 33.0 | 274.0 |
12728 | 12729 | 4 | 26 | 1987 | 2 | NL | M | 32.0 | 270.0 |
… | … | … | … | … | … | … | … | … | … |
35530 | 35531 | 12 | 31 | 2002 | 13 | PB | F | 27.0 | NaN |
35543 | 35544 | 12 | 31 | 2002 | 15 | US | NaN | NaN | NaN |
35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN |
35549 rows × 9 columns
We can sort on multiple fields at once by passing a list of column names. We can control how each column sorts by passing a list with the same number of values (that is, one value per column) to the ascending keyword. The cell below sorts the results first by species_id (largest to smallest), then by weight (smallest to largest):
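In a code cell:
PYTHON
surveys.sort_values(["species_id", "weight"], ascending=[False, True])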
OUTPUT
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
14249 | 14250 | 3 | 20 | 1988 | 18 | ZL | NaN | NaN | NaN |
14350 | 14351 | 4 | 17 | 1988 | 23 | ZL | NaN | NaN | NaN |
35511 | 35512 | 12 | 31 | 2002 | 11 | US | NaN | NaN | NaN |
35512 | 35513 | 12 | 31 | 2002 | 11 | US | NaN | NaN | NaN |
35527 | 35528 | 12 | 31 | 2002 | 13 | US | NaN | NaN | NaN |
… | … | … | … | … | … | … | … | … | … |
34756 | 34757 | 9 | 10 | 2002 | 23 | NaN | NaN | NaN | NaN |
34969 | 34970 | 10 | 6 | 2002 | 10 | NaN | NaN | NaN | NaN |
35187 | 35188 | 11 | 10 | 2002 | 10 | NaN | NaN | NaN | NaN |
35384 | 35385 | 12 | 8 | 2002 | 10 | NaN | NaN | NaN | NaN |
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN |
35549 rows × 9 columns
As with the dataframe methods above, sort_values() returns a copy of the original dataframe and leaves the original untouched.
Challenge
Write a query that returns year, species_id, and weight from the surveys table, sorted with the largest weights at the top.
PYTHON
# Create a subset containing only year, species_id, and weight
subset = surveys[["year", "species_id", "weight"]]
# Sort the subset by weight
subset.sort_values("weight", ascending=False)
OUTPUT
year | species_id | weight | |
---|---|---|---|
33048 | 2001 | NL | 280.0 |
12870 | 1987 | NL | 278.0 |
15458 | 1989 | NL | 275.0 |
2132 | 1979 | NL | 274.0 |
12728 | 1987 | NL | 270.0 |
… | … | … | … |
35530 | 2002 | PB | NaN |
35543 | 2002 | US | NaN |
35544 | 2002 | AH | NaN |
35545 | 2002 | AH | NaN |
35548 | 2002 | NaN | NaN |
35549 rows × 3 columns
Showing data on a plot
We will discuss data visualization using plotly in depth in lesson 6 but will introduce some fundamental concepts as we go along. Like pandas, plotly is an external package installed separately from Python itself. It can be used to create interactive, highly customizable plots based on pandas dataframes using just a few lines of code. For example, to create a scatter plot of the weight and hindfoot length, we need only to import plotly:
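The conventional import for plotly’s high-level plotting interface assigns the alias px, which the next step relies on:
PYTHON
import plotly.express as px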
Once plotly is loaded, we can create a scatter plot using the px.scatter() method. We include the dataframe as the first argument, then the x and y keyword arguments to select the columns we want to show on our scatter plot:
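A sketch of this call; which measurement goes on which axis is an assumption:
PYTHON
px.scatter(surveys, x="hindfoot_length", y="weight")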
OUTPUT
[Interactive scatter plot of hindfoot length versus weight for the surveys dataframe]
Content from Aggregating Data
Last updated on 2025-07-15
Overview
Questions
- How do we calculate summary statistics?
- How do we group data?
- How do null values affect calculations?
Objectives
- Introduce aggregation calculations in pandas
- Introduce grouping in pandas
- Learn about how pandas handles null values
- Make a boxplot showing summary stats
In the previous lesson, we saw how to access and filter our dataframe. Here, we’ll use pandas to calculate summary statistics for some of that data.
As always, we’ll begin by importing pandas and reading our CSV:
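For example:
PYTHON
import pandas as pd

surveys = pd.read_csv("surveys.csv")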
Aggregating data
Aggregation allows us to describe the data in our dataframe by calculating totals (like the number of records) and statistics (like the mean value of a column). Pandas allows us to run these calculations on dataframes or on subsets of interest.
Suppose we want to know how many records are in our dataset. We’ve already seen that we can use the info() method to get high-level information about the dataset, including the number of entries. What if we just wanted the number of rows? One approach is to use the built-in function len(), which is used to count the number of items in an object (for example, the number of characters in a string or the number of items in a list). When used on a dataframe, len() returns the number of rows:
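For example:
PYTHON
len(surveys)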
OUTPUT
35549
Pandas provides a suite of aggregation methods that go beyond this simple case. For example, we can count the number of non-NaN values in each column using count():
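In a code cell:
PYTHON
surveys.count()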
OUTPUT
record_id 35549
month 35549
day 35549
year 35549
plot_id 35549
species_id 34786
sex 33038
hindfoot_length 31438
weight 32283
dtype: int64
Or we can find out the total weight of all individuals using sum():
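For example:
PYTHON
surveys["weight"].sum()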
OUTPUT
np.float64(1377594.0)
Other aggregation methods supported by pandas include min(), max(), and mean(). These methods all ignore NaNs, so missing data does not affect the calculations.
Challenge
Calculate the total weight, average weight, minimum and maximum weights for all animals observed during the survey. Can you modify your code so that it outputs these values only for weights between 5 and 10 grams?
PYTHON
# Create a subset of only the animals between 5 and 10 grams
weights = surveys[(surveys["weight"] > 5) & (surveys["weight"] < 10)]["weight"]
# Display aggregation calculations using a dict
{
"sum": weights.sum(),
"mean": weights.mean(),
"min": weights.min(),
"max": weights.max(),
}
OUTPUT
{'sum': np.float64(16994.0),
'mean': np.float64(7.91523055426176),
'min': np.float64(6.0),
'max': np.float64(9.0)}
To quickly generate summary statistics, we can use the describe() method instead of calling each aggregation method separately. When we use this method on a dataframe, it calculates stats for all columns with numeric data:
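For example:
PYTHON
surveys.describe()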
OUTPUT
record_id | month | day | year | plot_id | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|
count | 35549.000000 | 35549.000000 | 35549.000000 | 35549.000000 | 35549.000000 | 31438.000000 | 32283.000000 |
mean | 17775.000000 | 6.477847 | 15.991195 | 1990.475231 | 11.397001 | 29.287932 | 42.672428 |
std | 10262.256696 | 3.396925 | 8.257366 | 7.493355 | 6.799406 | 9.564759 | 36.631259 |
min | 1.000000 | 1.000000 | 1.000000 | 1977.000000 | 1.000000 | 2.000000 | 4.000000 |
25% | 8888.000000 | 4.000000 | 9.000000 | 1984.000000 | 5.000000 | 21.000000 | 20.000000 |
50% | 17775.000000 | 6.000000 | 16.000000 | 1990.000000 | 11.000000 | 32.000000 | 37.000000 |
75% | 26662.000000 | 10.000000 | 23.000000 | 1997.000000 | 17.000000 | 36.000000 | 48.000000 |
max | 35549.000000 | 12.000000 | 31.000000 | 2002.000000 | 24.000000 | 70.000000 | 280.000000 |
You can see that describe() isn’t picky: It includes the ID, year, month, and day columns in its results. Notice also that counts differ between columns. This is because count() only counts non-NaN rows.
If desired, we can also describe a single column at a time:
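For example, to describe only the weight column:
PYTHON
surveys["weight"].describe()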
OUTPUT
count 32283.000000
mean 42.672428
std 36.631259
min 4.000000
25% 20.000000
50% 37.000000
75% 48.000000
max 280.000000
Name: weight, dtype: float64
If we need more control over the output (for example, if we want to calculate the total weight of all animals, as in the challenge above), pandas provides the agg() method, which allows us to specify aggregation methods by column. This method requires a dict, which is a built-in data type that we have not discussed yet.
Like a list, a dict is a container that can include more than one object. Whereas a list is a sequence, a dict is a mapping consisting of keys that map to values. Because it is built in, it can be used in any Python application without having to import anything.
Let’s see what that looks like in practice by defining a
dict
that maps lowercase to uppercase letters. We use curly
braces to define a dict
. The parts of each item are
separated by a colon, with the key on the left and the value on the
right.
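A small sketch of such a dict (the letters included here are illustrative):
PYTHON
# Map a few lowercase letters to their uppercase equivalents
lowercase_to_uppercase = {
    "a": "A",
    "b": "B",
    "c": "C",
}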
To retrieve the value for a given key, we use square brackets:
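For example, retrieving the value mapped to the key "a":
PYTHON
lowercase_to_uppercase["a"]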
OUTPUT
'A'
There are many other ways to interact with a dict
. We
can add or delete items, remap a key to a new value, or iterate over all
the keys, values, or items inside. Use help()
or check out
the
Python docs to learn more about dict
.
Now let’s return to calculating summary statistics using agg(). For this method, each key must be a column name and each value a list of the names of aggregation methods.
To calculate the total weight, mean weight, and mean hindfoot length of all records in the survey, we can use:
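A sketch of that call, requesting sum and mean for weight and mean only for hindfoot_length:
PYTHON
surveys.agg({"weight": ["sum", "mean"], "hindfoot_length": ["mean"]})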
OUTPUT
weight | hindfoot_length | |
---|---|---|
sum | 1.377594e+06 | NaN |
mean | 4.267243e+01 | 29.287932 |
Grouping data
Up to now, we have been calculating statistics based on all records
in the dataframe, but the dataset includes a variety of species, each
with a characteristic size and number of observations. We may find it
useful to look at subsets for individual species (or plots, years,
etc.). We can do so using the groupby()
method, which
groups records based on the data in one or more columns. To group by
species_id, use:
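Something along these lines:
PYTHON
surveys.groupby("species_id")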
OUTPUT
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f3cfb63d7f0>
The groupby() method returns a special object that
includes the rows associated with each species_id, but we wouldn’t know
that based on the output of the previous cell. Instead of a table or
another view of the grouped records, the method outputs a string with
the type of object
(pandas.core.groupby.generic.DataFrameGroupBy
) and some
information about where that object is stored (0x0...
). Not
the most useful thing in the world! We might encounter strings like this
when developers determine that there is no concise, human-readable way
to represent an object.
Nevertheless, we can use the DataFrameGroupBy
object to
calculate summary statistics for each group. In the example below, we’ll
calculate the number of times each species appears in the dataset. To
simplify the output, we’ll limit the count to a single column,
species_id, using square brackets. (Because count()
excludes NaNs, it’s usually a good practice to choose a column that does
not contain missing data. Record ID fields are a good choice, but any
field that is populated for every row will work.)
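One way to produce the counts below, counting the species_id column itself (which has no NaNs within each group):
PYTHON
surveys.groupby("species_id")["species_id"].count()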
OUTPUT
species_id
AB 303
AH 437
AS 2
BA 46
CB 50
CM 13
CQ 16
CS 1
CT 1
CU 1
CV 1
DM 10596
DO 3027
DS 2504
DX 40
NL 1252
OL 1006
OT 2249
OX 12
PB 2891
PC 39
PE 1299
PF 1597
PG 8
PH 32
PI 9
PL 36
PM 899
PP 3123
PU 5
PX 6
RF 75
RM 2609
RO 8
RX 2
SA 75
SC 1
SF 43
SH 147
SO 43
SS 248
ST 1
SU 5
UL 4
UP 8
UR 10
US 4
ZL 2
Name: species_id, dtype: int64
To group by multiple columns, we can pass a list to
groupby()
. To determine how many observations of each
species were made in each year, we can use:
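A sketch of that call, counting the record_id column within each group:
PYTHON
surveys.groupby(["species_id", "year"])["record_id"].count()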
OUTPUT
species_id year
AB 1980 5
1981 7
1982 34
1983 41
1984 12
..
UR 1994 1
2001 2
2002 1
US 2002 4
ZL 1988 2
Name: record_id, Length: 509, dtype: int64
Challenge
Write statements to answer the following questions:
- How many individuals were counted in each year in total?
- How many were counted each year, for each different species?
- What was the average weight of each species in each year?
- How many individuals were counted for each species that was observed more than 10 times?
Can you get the answer to both 2 and 3 in a single query?
How many individuals were counted in each year in total?
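One way to answer this (a sketch):
PYTHON
# Individual counts by year
surveys.groupby("year")["record_id"].count()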
OUTPUT
year
1977 503
1978 1048
1979 719
1980 1415
1981 1472
1982 1978
1983 1673
1984 981
1985 1438
1986 942
1987 1671
1988 1469
1989 1569
1990 1311
1991 1347
1992 1038
1993 750
1994 668
1995 1222
1996 1706
1997 2493
1998 1610
1999 1135
2000 1552
2001 1610
2002 2229
Name: record_id, dtype: int64
How many individuals were counted each year for each different species?
PYTHON
# Individual counts by species and year
result = surveys.groupby(["year", "species_id"])["record_id"].count()
# Use the option_context context manager to show all rows
with pd.option_context("display.max_rows", None):
print(result)
OUTPUT
year species_id
1977 DM 264
DO 12
DS 98
NL 31
OL 10
OT 17
OX 7
PE 6
PF 31
PP 7
RM 2
SH 1
SS 1
1978 AH 1
DM 389
DO 22
DS 320
NL 48
OL 47
OT 66
OX 1
PE 17
PF 40
PM 3
PP 24
RM 3
SA 2
SH 1
SS 8
1979 DM 209
DO 31
DS 204
NL 30
OL 46
OT 68
PE 17
PF 19
PM 6
PP 20
RM 6
SS 2
1980 AB 5
CB 1
CM 12
CQ 2
DM 493
DO 89
DS 346
DX 1
NL 57
OL 85
OT 96
PE 20
PF 95
PM 1
PP 17
RM 46
SS 9
1981 AB 7
CB 1
CQ 4
DM 559
DO 123
DS 354
NL 63
OL 55
OT 65
PE 43
PF 59
PM 16
PP 20
RM 29
SA 11
SS 8
1982 AB 34
AH 1
CQ 8
DM 609
DO 117
DS 354
NL 111
OL 83
OT 94
OX 1
PC 2
PE 107
PF 152
PM 72
PP 32
RF 2
RM 154
SA 17
SH 2
SS 12
1983 AB 41
AH 2
CB 3
DM 528
DO 167
DS 280
DX 1
NL 98
OL 65
OT 86
PC 3
PE 38
PF 114
PG 1
PM 42
PP 32
RM 127
SA 10
SH 2
SS 12
1984 AB 12
AH 2
CB 1
CQ 2
DM 396
DO 108
DS 76
NL 64
OL 71
OT 49
OX 1
PE 20
PF 7
PH 7
PM 12
PP 26
PU 2
RM 60
SA 20
SS 15
1985 AB 14
AH 14
CB 1
DM 667
DO 225
DS 98
NL 45
OL 94
OT 62
PE 51
PF 1
PH 3
PM 17
PP 35
RM 72
SA 3
SS 14
1986 AB 5
AH 19
CB 4
DM 406
DO 115
DS 88
DX 1
NL 60
OL 43
OT 25
PE 26
PF 2
PH 1
PM 34
PP 26
RM 53
SA 1
SH 1
SS 12
1987 AB 35
AH 41
CB 5
CM 1
CV 1
DM 469
DO 213
DS 104
NL 128
OL 79
OT 43
PC 5
PE 142
PF 4
PM 133
PP 45
RM 189
SA 1
SH 5
SS 20
UP 2
UR 3
1988 AB 39
AH 26
CB 6
DM 365
DO 144
DS 54
DX 1
NL 102
OL 51
OT 86
PE 190
PF 2
PG 2
PH 1
PM 54
PP 53
RF 11
RM 211
SA 1
SH 31
SS 20
UL 1
UP 3
UR 1
ZL 2
1989 AB 31
AH 30
BA 3
CB 1
CS 1
DM 321
DO 119
DS 33
NL 67
OL 57
OT 102
OX 1
PC 1
PE 177
PF 13
PH 2
PM 27
PP 45
RF 49
RM 398
SA 3
SC 1
SF 5
SH 55
SS 18
UP 2
1990 AB 27
AH 22
BA 11
DM 462
DO 171
DS 17
NL 29
OL 25
OT 77
PC 8
PE 75
PF 51
PG 4
PH 7
PP 34
RF 12
RM 231
SF 18
SH 9
SS 5
UP 1
1991 AB 15
AH 21
AS 1
BA 26
CB 2
DM 404
DO 122
DS 11
DX 4
NL 30
OL 36
OT 107
PC 6
PE 65
PF 62
PP 88
RM 307
RO 1
SA 2
SF 7
SH 1
SO 11
SS 2
UR 1
1992 AB 10
AH 16
BA 6
CB 7
DM 307
DO 94
DS 18
DX 12
NL 15
OL 45
OT 42
PC 9
PE 41
PF 57
PH 7
PI 1
PP 131
RM 158
SF 3
SO 20
SS 7
UL 1
UR 1
1993 AB 9
AH 15
AS 1
CB 5
CU 1
DM 253
DO 29
DS 18
DX 3
NL 24
OL 35
OT 22
OX 1
PC 1
PE 13
PF 70
PP 100
RM 103
SF 3
SO 6
SS 1
ST 1
1994 AB 7
AH 19
CB 1
DM 293
DO 25
DS 9
DX 2
NL 10
OL 21
OT 21
PC 1
PE 3
PF 73
PP 74
PU 2
RM 40
SO 3
SS 7
SU 1
UL 2
UR 1
1995 AB 4
AH 36
CB 2
DM 436
DO 58
DX 3
NL 8
OL 29
OT 38
PB 7
PC 1
PE 14
PF 158
PH 3
PI 1
PL 3
PM 8
PP 277
PU 1
RM 81
RX 1
SS 7
SU 4
1996 AH 25
DM 492
DO 174
DX 1
NL 7
OL 13
OT 108
PB 39
PC 1
PE 36
PF 330
PL 6
PM 50
PP 298
PX 2
RM 90
SA 1
SS 16
1997 AH 37
CB 3
DM 576
DO 253
DS 6
NL 48
OL 8
OT 258
PB 259
PE 57
PF 186
PH 1
PL 19
PM 271
PP 325
PX 2
RF 1
RM 158
RX 1
SA 1
SH 7
SO 3
SS 12
1998 AB 2
AH 33
CB 4
CT 1
DM 503
DO 111
DS 9
DX 2
NL 32
OT 164
PB 329
PC 1
PE 24
PF 26
PL 7
PM 103
PP 208
PX 1
RM 13
SA 2
SS 15
1999 AH 14
DM 348
DO 84
DS 7
DX 3
NL 20
OT 105
PB 272
PE 7
PM 44
PP 167
RM 28
SH 2
SS 7
2000 AB 3
AH 12
DM 233
DO 91
DX 4
NL 29
OT 154
PB 555
PE 15
PG 1
PL 1
PM 2
PP 381
PX 1
RM 15
SH 8
SS 4
2001 AB 1
AH 23
CB 3
DM 305
DO 81
DX 1
NL 48
OT 167
PB 538
PE 35
PF 27
PI 2
PM 3
PP 273
RM 15
SH 11
SS 5
UR 2
2002 AB 2
AH 28
DM 309
DO 249
DX 1
NL 48
OL 8
OT 127
PB 892
PE 60
PF 18
PI 5
PM 1
PP 385
RM 20
RO 7
SF 7
SH 11
SS 9
UR 1
US 4
Name: record_id, dtype: int64
What was the average weight of each species in each year?
PYTHON
# Mean weight by species and year
result = surveys.groupby(["year", "species_id"])["weight"].mean()
# Use the option_context context manager to show all rows
with pd.option_context("display.max_rows", None):
print(result)
OUTPUT
year species_id
1977 DM 41.141304
DO 42.666667
DS 121.437500
NL NaN
OL 21.666667
OT 21.000000
OX 21.333333
PE 19.500000
PF 7.173913
PP 15.333333
RM 10.000000
SH NaN
SS NaN
1978 AH NaN
DM 40.795455
DO 45.000000
DS 115.198653
NL 184.656250
OL 31.027027
OT 23.220000
OX NaN
PE 20.428571
PF 7.088235
PM 20.666667
PP 14.869565
RM 7.500000
SA NaN
SH 89.000000
SS 130.000000
1979 DM 43.507317
DO 45.862069
DS 111.796954
NL 138.000000
OL 33.717391
OT 23.075758
PE 20.529412
PF 7.529412
PM 23.666667
PP 15.105263
RM 8.333333
SS NaN
1980 AB NaN
CB NaN
CM NaN
CQ NaN
DM 44.136082
DO 48.058824
DS 120.725664
DX NaN
NL 158.982143
OL 33.107143
OT 24.083333
PE 22.450000
PF 7.455556
PM 21.000000
PP 14.176471
RM 10.227273
SS NaN
1981 AB NaN
CB NaN
CQ NaN
DM 44.032907
DO 49.111111
DS 125.818444
NL 165.903226
OL 33.296296
OT 24.415385
PE 20.558140
PF 7.152542
PM 19.933333
PP 13.950000
RM 11.178571
SA NaN
SS 57.000000
1982 AB NaN
AH NaN
CQ NaN
DM 41.975042
DO 47.919643
DS 115.647929
NL 160.613208
OL 35.590361
OT 24.925532
OX 24.000000
PC NaN
PE 21.173077
PF 6.918919
PM 21.391304
PP 16.125000
RF 11.500000
RM 10.436242
SA NaN
SH 79.000000
SS NaN
1983 AB NaN
AH NaN
CB NaN
DM 40.944551
DO 47.150602
DS 122.033088
DX NaN
NL 156.691489
OL 34.584615
OT 24.697674
PC NaN
PE 20.864865
PF 6.833333
PG NaN
PM 22.023810
PP 15.468750
RM 9.872000
SA NaN
SH NaN
SS NaN
1984 AB NaN
AH NaN
CB NaN
CQ NaN
DM 40.765306
DO 48.415094
DS 124.082192
NL 150.095238
OL 32.550725
OT 22.416667
OX 18.000000
PE 20.210526
PF 7.428571
PH 28.000000
PM 19.545455
PP 15.307692
PU NaN
RM 11.152542
SA NaN
SS NaN
1985 AB NaN
AH NaN
CB NaN
DM 41.507645
DO 47.956731
DS 124.326316
NL 148.829268
OL 32.108696
OT 22.064516
PE 20.360000
PF NaN
PH 32.666667
PM 19.000000
PP 15.764706
RM 8.371429
SA NaN
SS NaN
1986 AB NaN
AH NaN
CB NaN
DM 43.361596
DO 49.372727
DS 128.071429
DX NaN
NL 159.724138
OL 30.880952
OT 21.500000
PE 22.520000
PF 8.000000
PH 39.000000
PM 21.696970
PP 16.750000
RM 10.750000
SA NaN
SH 55.000000
SS NaN
1987 AB NaN
AH NaN
CB NaN
CM NaN
CV NaN
DM 43.232609
DO 50.087379
DS 126.383838
NL 158.840000
OL 30.779221
OT 23.069767
PC NaN
PE 21.625899
PF 7.500000
PM 22.297710
PP 17.840909
RM 10.657609
SA NaN
SH 73.800000
SS NaN
UP NaN
UR NaN
1988 AB NaN
AH NaN
CB NaN
DM 43.397790
DO 51.463768
DS 129.490566
DX NaN
NL 163.104167
OL 30.306122
OT 24.070588
PE 22.625668
PF 8.000000
PG NaN
PH 33.000000
PM 21.759259
PP 18.280000
RF 13.818182
RM 10.322115
SA NaN
SH 72.806452
SS NaN
UL NaN
UP NaN
UR NaN
ZL NaN
1989 AB NaN
AH NaN
BA 7.000000
CB NaN
CS NaN
DM 44.349206
DO 51.025641
DS 121.896552
NL 151.278689
OL 31.947368
OT 24.663366
OX 20.000000
PC NaN
PE 21.935673
PF 7.230769
PH 30.500000
PM 20.222222
PP 17.409091
RF 13.346939
RM 10.411311
SA NaN
SC NaN
SF 54.800000
SH 76.345455
SS NaN
UP NaN
1990 AB NaN
AH NaN
BA 8.000000
DM 41.769912
DO 48.512048
DS 121.187500
NL 154.275862
OL 31.200000
OT 23.675325
PC NaN
PE 21.671233
PF 7.117647
PG NaN
PH 31.142857
PP 16.029412
RF 12.916667
RM 10.305677
SF 52.611111
SH 76.888889
SS NaN
UP NaN
1991 AB NaN
AH NaN
AS NaN
BA 9.240000
CB NaN
DM 43.148338
DO 49.695652
DS 113.000000
DX NaN
NL 148.785714
OL 28.171429
OT 24.588235
PC NaN
PE 21.435484
PF 7.827586
PP 17.904762
RM 10.498305
RO 11.000000
SA NaN
SF 93.166667
SH 63.000000
SO 53.909091
SS NaN
UR NaN
1992 AB NaN
AH NaN
BA 7.833333
CB NaN
DM 43.877966
DO 48.367816
DS 112.352941
DX NaN
NL 139.000000
OL 28.454545
OT 24.928571
PC NaN
PE 22.750000
PF 8.767857
PH 31.142857
PI 18.000000
PP 17.479675
RM 10.904762
SF 43.000000
SO 55.263158
SS NaN
UL NaN
UR NaN
1993 AB NaN
AH NaN
AS NaN
CB NaN
CU NaN
DM 43.373984
DO 48.461538
DS 105.722222
DX NaN
NL 127.391304
OL 27.545455
OT 22.363636
OX NaN
PC NaN
PE 19.230769
PF 8.238806
PP 17.954023
RM 10.747475
SF 44.000000
SO 55.600000
SS NaN
ST NaN
1994 AB NaN
AH NaN
CB NaN
DM 42.373288
DO 47.750000
DS 106.625000
DX NaN
NL 186.875000
OL 21.900000
OT 25.285714
PC NaN
PE 18.000000
PF 7.855072
PP 17.585714
PU NaN
RM 10.675000
SO 62.333333
SS NaN
SU NaN
UL NaN
UR NaN
1995 AB NaN
AH NaN
CB NaN
DM 44.806147
DO 49.592593
DX NaN
NL 155.833333
OL 27.296296
OT 24.868421
PB 34.000000
PC NaN
PE 21.230769
PF 8.780645
PH 35.333333
PI NaN
PL 22.333333
PM 27.375000
PP 16.844444
PU NaN
RM 12.653846
RX 20.000000
SS NaN
SU NaN
1996 AH NaN
DM 44.506173
DO 47.234940
DX NaN
NL 148.000000
OL 27.769231
OT 23.453704
PB 32.578947
PC NaN
PE 22.027778
PF 8.393846
PL 21.166667
PM 20.224490
PP 18.095563
PX NaN
RM 11.455556
SA NaN
SS NaN
1997 AH NaN
CB NaN
DM 44.708551
DO 48.177866
DS 111.000000
NL 150.688889
OL 33.625000
OT 24.785156
PB 31.085603
PE 20.929825
PF 8.448649
PH 22.000000
PL 18.789474
PM 21.126394
PP 18.175000
PX 19.000000
RF 20.000000
RM 11.230769
RX 11.000000
SA NaN
SH 50.857143
SO 54.666667
SS NaN
1998 AB NaN
AH NaN
CB NaN
CT NaN
DM 43.131403
DO 49.731183
DS 116.000000
DX NaN
NL 159.466667
OT 24.675676
PB 30.082237
PC NaN
PE 20.304348
PF 8.720000
PL 16.714286
PM 20.591398
PP 16.266990
PX NaN
RM 13.100000
SA NaN
SS NaN
1999 AH NaN
DM 43.945402
DO 48.012048
DS 120.714286
DX NaN
NL 182.200000
OT 25.723810
PB 31.710037
PE 25.428571
PM 22.523810
PP 16.521212
RM 10.555556
SH 54.000000
SS NaN
2000 AB NaN
AH NaN
DM 43.126638
DO 49.224719
DX NaN
NL 179.307692
OT 25.303448
PB 30.878899
PE 21.615385
PG NaN
PL 21.000000
PM 20.500000
PP 16.788618
PX NaN
RM 11.400000
SH 73.375000
SS NaN
2001 AB NaN
AH NaN
CB NaN
DM 45.442177
DO 52.233766
DX NaN
NL 167.851064
OT 23.707792
PB 32.798851
PE 20.400000
PF 8.538462
PI 18.000000
PM 26.666667
PP 17.752896
RM 11.333333
SH 79.900000
SS NaN
UR NaN
2002 AB NaN
AH NaN
DM 46.168317
DO 49.514403
DX NaN
NL 182.159091
OL 25.428571
OT 23.833333
PB 32.359447
PE 21.719298
PF 7.388889
PI 20.000000
PM 19.000000
PP 17.018617
RM 10.000000
RO 10.142857
SF 62.166667
SH 64.666667
SS NaN
UR NaN
US NaN
Name: weight, dtype: float64
How many individuals were counted for each species that was observed more than 10 times?
PYTHON
# Counts by species that appear more than 10 times
species = surveys.groupby("species_id")["record_id"].count()
species[species > 10]
OUTPUT
species_id
AB 303
AH 437
BA 46
CB 50
CM 13
CQ 16
DM 10596
DO 3027
DS 2504
DX 40
NL 1252
OL 1006
OT 2249
OX 12
PB 2891
PC 39
PE 1299
PF 1597
PH 32
PL 36
PM 899
PP 3123
RF 75
RM 2609
SA 75
SF 43
SH 147
SO 43
SS 248
Name: record_id, dtype: int64
Calculate the number of individuals observed and the average weight for each species in each year.
Note that weight contains NaNs, so counts for record_id and weight differ.
PYTHON
# Counts and mean weight by species and year
surveys.groupby(["year", "species_id"]).agg(
{"record_id": "count", "weight": ["count", "mean"]}
)
OUTPUT
record_id | weight | |||
---|---|---|---|---|
count | count | mean | ||
year | species_id | |||
1977 | DM | 264 | 184 | 41.141304 |
DO | 12 | 12 | 42.666667 | |
DS | 98 | 32 | 121.437500 | |
NL | 31 | 0 | NaN | |
OL | 10 | 3 | 21.666667 | |
… | … | … | … | … |
2002 | SF | 7 | 6 | 62.166667 |
SH | 11 | 9 | 64.666667 | |
SS | 9 | 0 | NaN | |
UR | 1 | 0 | NaN | |
US | 4 | 0 | NaN |
509 rows × 3 columns
Handling missing data
As we’ve discussed, some columns in the surveys dataframe have the value NaN instead of text or numbers. NaN, short for “not a number,” is a special type of float used by pandas to represent missing data. When reading from a CSV, pandas interprets certain values as NaN. NaNs are excluded from groups and most aggregation calculations in pandas. Below, we’ll see how they can also cause issues when plotting.
Defaults
See na_values in the pd.read_csv() documentation for the values that pandas interprets as NaN by default.
When analyzing a dataset, it is critical to understand how missing data is represented. Failing to do so may introduce errors into the analysis. The ecology dataset used in this lesson uses empty cells to represent missing data, but other disciplines have different conventions. For example, some geographic datasets use -9999 to represent null values. Failure to convert such values to NaN will introduce significant errors into any calculations performed on that dataset.
In some cases, it can be useful to fill in cells containing NaN with
a non-null value. Recall that the groupby()
method excludes
NaN cells. When looking at sex, the following code counts only those
cells with either F (female) or M (male):
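A sketch of that count:
PYTHON
surveys.groupby("sex")["record_id"].count()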
OUTPUT
sex
F 15690
M 17348
Name: record_id, dtype: int64
But not all records specify a sex. To include records where sex was
not specified, we can use the fillna()
method on the sex
column. This method replaces each NaN with the first argument passed to
the function call. To replace all NaN values in sex with “U”, use:
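A sketch of that call:
PYTHON
surveys["sex"] = surveys["sex"].fillna("U")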
The aggregation calculation now includes all records in the dataframe:
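Repeating the count from above (a sketch):
PYTHON
surveys.groupby("sex")["record_id"].count()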
OUTPUT
sex
F 15690
M 17348
U 2511
Name: record_id, dtype: int64
In other cases, we may want to ignore rows that contain NaNs. This
can be done using dropna()
:
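A sketch of that call:
PYTHON
surveys.dropna()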
This method returns a copy of the dataframe containing only those rows that have non-NaN data in every field.
Visualizing groups and statistics
Plots can be a useful way to identify patterns, commonalities, and differences within a dataset. For example, they can use color, shape, and size to highlight features of interest. Plotly allows us to style the markers on a scatter plot in much the same way as we selected x and y variables to plot at the end of the last lesson. To color the markers on a scatter plot based on species_id, we can use the color keyword argument:
PYTHON
import plotly.express as px
px.scatter(surveys, x="weight", y="hindfoot_length", color="species_id")
OUTPUT
[Scatter plot of hindfoot_length vs. weight, with points colored by species_id]
Content from Combining Dataframes
Last updated on 2025-07-03 | Edit this page
Overview
Questions
- How do we combine data from multiple sources using pandas?
- How do we add data to an existing dataframe?
- How do we combine columns within a single dataframe?
Objectives
- Use
pd.merge()
to add related columns to a dataframe - Use
pd.concat()
to add rows to a dataframe - Combine text columns
- Combine year, month, and day columns into a
datetime64
object
Dataframes can be used to organize and group data by common characteristics. Often, we need to combine elements from separate dataframes into one for analysis and visualization. Likewise, we may need to append data collected under different circumstances. In this chapter, we will show how to merge, concatenate, and combine data using pandas.
Merging dataframes
The survey dataframe we’ve been using throughout this lesson has a column called species_id. We used this column in the previous lesson to calculate summary statistics about observations of each species. But the species_id is just a two-letter code. Without knowing what each code represents, we can’t really understand, describe, or even properly label the observations. We need a way to go from the code to more detailed species information.
Managing repetitive data
Why store species data in a separate table in the first place? Species information is repetitive: every observation of the same species has the same genus, species, and taxa. Storing it in the original survey table would require including that data in every record, increasing both the size of the table and the likelihood of errors. Storing that data in a separate table means we only have to enter and validate it once. A tool like pandas then allows us to access the authoritative data when we need it.
We can enhance the species data in the dataframe using an
authoritative dataset that contains the taxonomic information associated
with each species. This data is stored in the species.csv file, which
can be loaded using pd.read_csv()
:
PYTHON
import pandas as pd
surveys = pd.read_csv("data/surveys.csv")
species = pd.read_csv("data/species.csv")
species
OUTPUT
species_id | genus | species | taxa | |
---|---|---|---|---|
0 | AB | Amphispiza | bilineata | Bird |
1 | AH | Ammospermophilus | harrisi | Rodent |
2 | AS | Ammodramus | savannarum | Bird |
3 | BA | Baiomys | taylori | Rodent |
4 | CB | Campylorhynchus | brunneicapillus | Bird |
5 | CM | Calamospiza | melanocorys | Bird |
6 | CQ | Callipepla | squamata | Bird |
7 | CS | Crotalus | scutalatus | Reptile |
8 | CT | Cnemidophorus | tigris | Reptile |
9 | CU | Cnemidophorus | uniparens | Reptile |
10 | CV | Crotalus | viridis | Reptile |
11 | DM | Dipodomys | merriami | Rodent |
12 | DO | Dipodomys | ordii | Rodent |
13 | DS | Dipodomys | spectabilis | Rodent |
14 | DX | Dipodomys | sp. | Rodent |
15 | EO | Eumeces | obsoletus | Reptile |
16 | GS | Gambelia | silus | Reptile |
17 | NL | Neotoma | albigula | Rodent |
18 | NX | Neotoma | sp. | Rodent |
19 | OL | Onychomys | leucogaster | Rodent |
20 | OT | Onychomys | torridus | Rodent |
21 | OX | Onychomys | sp. | Rodent |
22 | PB | Chaetodipus | baileyi | Rodent |
23 | PC | Pipilo | chlorurus | Bird |
24 | PE | Peromyscus | eremicus | Rodent |
25 | PF | Perognathus | flavus | Rodent |
26 | PG | Pooecetes | gramineus | Bird |
27 | PH | Perognathus | hispidus | Rodent |
28 | PI | Chaetodipus | intermedius | Rodent |
29 | PL | Peromyscus | leucopus | Rodent |
30 | PM | Peromyscus | maniculatus | Rodent |
31 | PP | Chaetodipus | penicillatus | Rodent |
32 | PU | Pipilo | fuscus | Bird |
33 | PX | Chaetodipus | sp. | Rodent |
34 | RF | Reithrodontomys | fulvescens | Rodent |
35 | RM | Reithrodontomys | megalotis | Rodent |
36 | RO | Reithrodontomys | montanus | Rodent |
37 | RX | Reithrodontomys | sp. | Rodent |
38 | SA | Sylvilagus | audubonii | Rabbit |
39 | SB | Spizella | breweri | Bird |
40 | SC | Sceloporus | clarki | Reptile |
41 | SF | Sigmodon | fulviventer | Rodent |
42 | SH | Sigmodon | hispidus | Rodent |
43 | SO | Sigmodon | ochrognathus | Rodent |
44 | SS | Spermophilus | spilosoma | Rodent |
45 | ST | Spermophilus | tereticaudus | Rodent |
46 | SU | Sceloporus | undulatus | Reptile |
47 | SX | Sigmodon | sp. | Rodent |
48 | UL | Lizard | sp. | Reptile |
49 | UP | Pipilo | sp. | Bird |
50 | UR | Rodent | sp. | Rodent |
51 | US | Sparrow | sp. | Bird |
52 | ZL | Zonotrichia | leucophrys | Bird |
53 | ZM | Zenaida | macroura | Bird |
We can see that the species dataframe includes the genus, species,
and taxon for each species_id. This is much more useful than the
species_id included in the original dataframe—how can we integrate that
data into our surveys dataframe? Adding it by hand would be tedious and
error prone. Fortunately, pandas provides the pd.merge()
function to join two dataframes. We’ll assign the new, merged dataframe to the variable merged:
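A sketch of the merge:
PYTHON
merged = pd.merge(surveys, species)
merged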
OUTPUT
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN | Neotoma | albigula | Rodent |
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN | Neotoma | albigula | Rodent |
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN | Dipodomys | merriami | Rodent |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN | Dipodomys | merriami | Rodent |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN | Dipodomys | merriami | Rodent |
… | … | … | … | … | … | … | … | … | … | … | … | … |
34781 | 35544 | 12 | 31 | 2002 | 15 | US | NaN | NaN | NaN | Sparrow | sp. | Bird |
34782 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN | Ammospermophilus | harrisi | Rodent |
34783 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN | Ammospermophilus | harrisi | Rodent |
34784 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 | Reithrodontomys | megalotis | Rodent |
34785 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 | Dipodomys | ordii | Rodent |
34786 rows × 12 columns
Following the merge, the genus, species, and taxa columns have all been added to the survey dataframe. We can now use those columns to filter and summarize our data.
Joins
The merge() method performs the same function as joins in SQL.
Challenge
We previously described the dataset as consisting of observations of rodents. That’s mostly true, but observations of other animals are also included. Based on the merged dataframe, what other types of animals are present in the dataframe? We are interested in the type of animal, not the Latin name.
In the example above, we didn’t provide any information about how we wanted to merge the dataframes together, so pandas made an educated guess by looking at the columns in each of the dataframes and merging them on the only column that appeared in both datasets, species_id. For more complex tables, we may need more control over the operation, for example, by specifying the exact columns we want to merge on. We can do that by passing one or more column names using the on keyword argument:
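A sketch of the same merge with the join column named explicitly:
PYTHON
pd.merge(surveys, species, on="species_id")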
OUTPUT
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN | Neotoma | albigula | Rodent |
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN | Neotoma | albigula | Rodent |
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN | Dipodomys | merriami | Rodent |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN | Dipodomys | merriami | Rodent |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN | Dipodomys | merriami | Rodent |
… | … | … | … | … | … | … | … | … | … | … | … | … |
34781 | 35544 | 12 | 31 | 2002 | 15 | US | NaN | NaN | NaN | Sparrow | sp. | Bird |
34782 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN | Ammospermophilus | harrisi | Rodent |
34783 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN | Ammospermophilus | harrisi | Rodent |
34784 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 | Reithrodontomys | megalotis | Rodent |
34785 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 | Dipodomys | ordii | Rodent |
34786 rows × 12 columns
It’s not necessary here, but we can also merge datasets where the common data is stored in columns with different names. To do so, we’d use the left_on (to specify the column name in the first dataframe) and right_on (to specify the column name in the second dataframe) keyword arguments.
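A sketch of that call (in this dataset both columns happen to have the same name):
PYTHON
pd.merge(surveys, species, left_on="species_id", right_on="species_id")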
OUTPUT
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN | Neotoma | albigula | Rodent |
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN | Neotoma | albigula | Rodent |
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN | Dipodomys | merriami | Rodent |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN | Dipodomys | merriami | Rodent |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN | Dipodomys | merriami | Rodent |
… | … | … | … | … | … | … | … | … | … | … | … | … |
34781 | 35544 | 12 | 31 | 2002 | 15 | US | NaN | NaN | NaN | Sparrow | sp. | Bird |
34782 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN | Ammospermophilus | harrisi | Rodent |
34783 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN | Ammospermophilus | harrisi | Rodent |
34784 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 | Reithrodontomys | megalotis | Rodent |
34785 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 | Dipodomys | ordii | Rodent |
34786 rows × 12 columns
Challenge
Compare the number of rows in the original and merged survey dataframes. How do they differ? Why might that be?
Hint: Use the pd.unique() function to look at the species_id column in each dataframe.
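For example, checking the original surveys dataframe (a sketch; the merged dataframe can be checked the same way):
PYTHON
pd.unique(surveys["species_id"])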
OUTPUT
array(['AB', 'AH', 'AS', 'BA', 'CB', 'CM', 'CQ', 'CS', 'CT', 'CU', 'CV',
'DM', 'DO', 'DS', 'DX', 'NL', 'OL', 'OT', 'OX', 'PB', 'PC', 'PE',
'PF', 'PG', 'PH', 'PI', 'PL', 'PM', 'PP', 'PU', 'PX', 'RF', 'RM',
'RO', 'RX', 'SA', 'SC', 'SF', 'SH', 'SO', 'SS', 'ST', 'SU', 'UL',
'UP', 'UR', 'US', 'ZL', nan], dtype=object)
The merged dataframe omits rows with no value in the species_id column.
The number of rows in the merged dataframe is lower than in the
original surveys dataframe. By default, merge()
performs an
inner join. This means that a row will only appear in
the merged dataframe if the value(s) in the join column(s) appear in
both dataframes. Here, observations that did not include a species_id or
that included a species_id that was not defined in the species dataframe
were dropped.
This is not always desirable behavior. Fortunately, pandas supports other kinds of merges:
- Inner: Include all rows with common values in the join columns. This is the default behavior.
- Left: Include all rows from the left dataframe. Columns from the right dataframe are populated if a common value exists and set to NaN if not.
- Right: Include all rows from the right dataframe. Columns from the left dataframe are populated if a common value exists and set to NaN if not.
- Outer: Include all rows from both dataframes
We want to keep all of the observations, so let’s do a left join instead. To specify the type of merge, we use the how keyword argument:
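A sketch of the left join:
PYTHON
pd.merge(surveys, species, how="left")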
OUTPUT
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN | Neotoma | albigula | Rodent |
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN | Neotoma | albigula | Rodent |
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN | Dipodomys | merriami | Rodent |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN | Dipodomys | merriami | Rodent |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN | Dipodomys | merriami | Rodent |
… | … | … | … | … | … | … | … | … | … | … | … | … |
35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN | Ammospermophilus | harrisi | Rodent |
35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN | Ammospermophilus | harrisi | Rodent |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 | Reithrodontomys | megalotis | Rodent |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 | Dipodomys | ordii | Rodent |
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
35549 rows × 12 columns
Now all 35,549 rows appear in the merged dataframe.
Appending rows to a dataframe
Merges address the case where information for a set of observations is spread across multiple files. What about when the observations themselves are in more than one file? For a survey like the one we’ve been looking at in this lesson, we might get a new file once a year with the same columns but a completely new set of observations. How can we add those new observations to our dataframe?
We’ll simulate this operation by splitting data from two different years, 2001 and 2002, into separate dataframes. We can filter the dataset using conditionals, as we saw in lesson 3:
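A sketch of the 2001 subset:
PYTHON
surveys_2001 = surveys[surveys["year"] == 2001]
surveys_2001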
OUTPUT
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
31710 | 31711 | 1 | 21 | 2001 | 1 | PB | F | 26.0 | 25.0 |
31711 | 31712 | 1 | 21 | 2001 | 1 | DM | M | 37.0 | 43.0 |
31712 | 31713 | 1 | 21 | 2001 | 1 | PB | M | 29.0 | 44.0 |
31713 | 31714 | 1 | 21 | 2001 | 1 | DO | M | 34.0 | 53.0 |
31714 | 31715 | 1 | 21 | 2001 | 2 | OT | M | 20.0 | 27.0 |
… | … | … | … | … | … | … | … | … | … |
33315 | 33316 | 12 | 16 | 2001 | 11 | NaN | NaN | NaN | NaN |
33316 | 33317 | 12 | 16 | 2001 | 13 | NaN | NaN | NaN | NaN |
33317 | 33318 | 12 | 16 | 2001 | 14 | NaN | NaN | NaN | NaN |
33318 | 33319 | 12 | 16 | 2001 | 15 | NaN | NaN | NaN | NaN |
33319 | 33320 | 12 | 16 | 2001 | 16 | NaN | NaN | NaN | NaN |
1610 rows × 9 columns
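And the 2002 subset (a sketch):
PYTHON
surveys_2002 = surveys[surveys["year"] == 2002]
surveys_2002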
OUTPUT
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
33320 | 33321 | 1 | 12 | 2002 | 1 | DM | M | 38.0 | 44.0 |
33321 | 33322 | 1 | 12 | 2002 | 1 | DO | M | 37.0 | 58.0 |
33322 | 33323 | 1 | 12 | 2002 | 1 | PB | M | 28.0 | 45.0 |
33323 | 33324 | 1 | 12 | 2002 | 1 | AB | NaN | NaN | NaN |
33324 | 33325 | 1 | 12 | 2002 | 1 | DO | M | 35.0 | 29.0 |
… | … | … | … | … | … | … | … | … | … |
35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 |
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN |
2229 rows × 9 columns
We now have two different dataframes with the same columns but
different data, one with 1,610 rows, the other with 2,229 rows, for a
total of 3,839 records. We can combine them into a new dataframe using
pd.concat()
, which stacks the dataframes vertically (that
is, it adds records from one dataset to the end of another). This function accepts a list of dataframes and works from left to right (so the
leftmost dataframe will be at the top of the new dataframe and the
rightmost will be at the bottom). We’re only combining two dataframes
here but could include more if necessary.
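A sketch of the concatenation:
PYTHON
surveys_2001_2002 = pd.concat([surveys_2001, surveys_2002])
surveys_2001_2002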
OUTPUT
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
31710 | 31711 | 1 | 21 | 2001 | 1 | PB | F | 26.0 | 25.0 |
31711 | 31712 | 1 | 21 | 2001 | 1 | DM | M | 37.0 | 43.0 |
31712 | 31713 | 1 | 21 | 2001 | 1 | PB | M | 29.0 | 44.0 |
31713 | 31714 | 1 | 21 | 2001 | 1 | DO | M | 34.0 | 53.0 |
31714 | 31715 | 1 | 21 | 2001 | 2 | OT | M | 20.0 | 27.0 |
… | … | … | … | … | … | … | … | … | … |
35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 |
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN |
3839 rows × 9 columns
The combined dataframe includes all rows from both dataframes.
In some cases, the exact columns may change from year to year even within the same project. For example, researchers may decide to add an additional column to track a new piece of data or to provide a quality check. If a column is present in only one dataset, you can still concatenate the datasets. Any column that does not appear in a given dataset will have the value NaN for those rows in the combined dataframe.
To illustrate this, we’ll add a column called test to the 2002
survey, then re-run pd.concat()
:
PYTHON
surveys_2002["test"] = True
surveys_2001_2002 = pd.concat([surveys_2001, surveys_2002])
surveys_2001_2002
OUTPUT
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | test | |
---|---|---|---|---|---|---|---|---|---|---|
31710 | 31711 | 1 | 21 | 2001 | 1 | PB | F | 26.0 | 25.0 | NaN |
31711 | 31712 | 1 | 21 | 2001 | 1 | DM | M | 37.0 | 43.0 | NaN |
31712 | 31713 | 1 | 21 | 2001 | 1 | PB | M | 29.0 | 44.0 | NaN |
31713 | 31714 | 1 | 21 | 2001 | 1 | DO | M | 34.0 | 53.0 | NaN |
31714 | 31715 | 1 | 21 | 2001 | 2 | OT | M | 20.0 | 27.0 | NaN |
… | … | … | … | … | … | … | … | … | … | … |
35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN | True |
35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN | True |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 | True |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 | True |
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN | True |
3839 rows × 10 columns
As expected, the test column has a value of NaN for the 2001 data in the combined dataframe.
Combining data in multiple columns
Sometimes we want to combine values from two or more columns into a
single column. For example, we might want to refer to the species in
each record by both its genus and species names. In Python, we use the
+
operator to concatenate (or join) strings, and pandas
works the same way:
PYTHON
species["genus_species"] = species["genus"] + " " + species["species"]
species["genus_species"]
OUTPUT
0 Amphispiza bilineata
1 Ammospermophilus harrisi
2 Ammodramus savannarum
3 Baiomys taylori
4 Campylorhynchus brunneicapillus
5 Calamospiza melanocorys
6 Callipepla squamata
7 Crotalus scutalatus
8 Cnemidophorus tigris
9 Cnemidophorus uniparens
10 Crotalus viridis
11 Dipodomys merriami
12 Dipodomys ordii
13 Dipodomys spectabilis
14 Dipodomys sp.
15 Eumeces obsoletus
16 Gambelia silus
17 Neotoma albigula
18 Neotoma sp.
19 Onychomys leucogaster
20 Onychomys torridus
21 Onychomys sp.
22 Chaetodipus baileyi
23 Pipilo chlorurus
24 Peromyscus eremicus
25 Perognathus flavus
26 Pooecetes gramineus
27 Perognathus hispidus
28 Chaetodipus intermedius
29 Peromyscus leucopus
30 Peromyscus maniculatus
31 Chaetodipus penicillatus
32 Pipilo fuscus
33 Chaetodipus sp.
34 Reithrodontomys fulvescens
35 Reithrodontomys megalotis
36 Reithrodontomys montanus
37 Reithrodontomys sp.
38 Sylvilagus audubonii
39 Spizella breweri
40 Sceloporus clarki
41 Sigmodon fulviventer
42 Sigmodon hispidus
43 Sigmodon ochrognathus
44 Spermophilus spilosoma
45 Spermophilus tereticaudus
46 Sceloporus undulatus
47 Sigmodon sp.
48 Lizard sp.
49 Pipilo sp.
50 Rodent sp.
51 Sparrow sp.
52 Zonotrichia leucophrys
53 Zenaida macroura
Name: genus_species, dtype: object
Note that the +
operator can also be used to add numeric
columns. In Python, the same operator can be used to perform different
operations for different data types.
Combining dates
Another common need is to join or split dates. In the ecology
dataset, the date of each observation is split across year, month, and
day columns. However, pandas has a special data type,
datetime64
, for representing dates that can be useful for
analyzing time series data. To make use of that functionality, we can use the pd.to_datetime() function to create a datetime object:
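A sketch of that call, passing the year, month, and day columns:
PYTHON
surveys["date"] = pd.to_datetime(surveys[["year", "month", "day"]])
surveys["date"]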
OUTPUT
0 1977-07-16
1 1977-07-16
2 1977-07-16
3 1977-07-16
4 1977-07-16
...
35544 2002-12-31
35545 2002-12-31
35546 2002-12-31
35547 2002-12-31
35548 2002-12-31
Name: date, Length: 35549, dtype: datetime64[ns]
Challenge
Pandas allows us to ask specific questions about our data. A key skill is knowing how to translate our scientific questions into a sensible approach (and subsequently visualize and interpret our results).
Try using pandas to answer the following questions.
- How many specimens of each sex are there for each year, including those whose sex is unknown?
- What is the average weight of each taxa?
- What are the minimum, maximum and average weight for each species of Rodent?
- What is the average hindfoot length for male and female rodents of each species? Is there a male/female difference?
- What is the average weight of each rodent species over the course of the years? Is there any noticeable trend for any of the species?
How many specimens of each sex are there for each year, including those whose sex is unknown?
PYTHON
# Fill in NaN values in sex with U
surveys["sex"] = surveys["sex"].fillna("U")
# Count records by sex
result = surveys.groupby(["year", "sex"])["record_id"].count()
# Use the option_context context manager to show all rows
with pd.option_context("display.max_rows", None):
print(result)
OUTPUT
year sex
1977 F 204
M 214
U 85
1978 F 503
M 433
U 112
1979 F 327
M 324
U 68
1980 F 605
M 727
U 83
1981 F 631
M 745
U 96
1982 F 823
M 1027
U 128
1983 F 771
M 797
U 105
1984 F 445
M 443
U 93
1985 F 636
M 716
U 86
1986 F 414
M 455
U 73
1987 F 677
M 862
U 132
1988 F 604
M 737
U 128
1989 F 678
M 780
U 111
1990 F 581
M 636
U 94
1991 F 606
M 637
U 104
1992 F 443
M 477
U 118
1993 F 285
M 380
U 85
1994 F 243
M 327
U 98
1995 F 574
M 534
U 114
1996 F 725
M 910
U 71
1997 F 1071
M 1357
U 65
1998 F 717
M 798
U 95
1999 F 545
M 530
U 60
2000 F 690
M 779
U 83
2001 F 743
M 744
U 123
2002 F 1149
M 979
U 101
Name: record_id, dtype: int64
What is the average weight of each taxa?
PYTHON
# Create the merged dataframe
merged = pd.merge(surveys, species, how="left")
# Group by taxa
grouped = merged.groupby("taxa")
# Calculate the mean weight
grouped["weight"].mean()
OUTPUT
taxa
Bird NaN
Rabbit NaN
Reptile NaN
Rodent 42.672428
Name: weight, dtype: float64
What are the minimum, maximum and average weight for each species of Rodent?
PYTHON
# Create the merged dataframe
merged = pd.merge(surveys, species, how="left")
# Limit merged dataframe to rodents
rodents = merged[merged["taxa"] == "Rodent"]
# Group rodents by species
grouped = rodents.groupby("species_id")
# Calculate the min, max, and mean weight
grouped.agg({"weight": ["min", "max", "mean"]})
OUTPUT
weight | |||
---|---|---|---|
min | max | mean | |
species_id | |||
AH | NaN | NaN | NaN |
BA | 6.0 | 18.0 | 8.600000 |
DM | 10.0 | 66.0 | 43.157864 |
DO | 12.0 | 76.0 | 48.870523 |
DS | 12.0 | 190.0 | 120.130546 |
DX | NaN | NaN | NaN |
NL | 30.0 | 280.0 | 159.245660 |
OL | 10.0 | 56.0 | 31.575258 |
OT | 5.0 | 46.0 | 24.230556 |
OX | 18.0 | 24.0 | 21.000000 |
PB | 12.0 | 55.0 | 31.735943 |
PE | 8.0 | 40.0 | 21.586508 |
PF | 4.0 | 25.0 | 7.923127 |
PH | 18.0 | 48.0 | 31.064516 |
PI | 17.0 | 21.0 | 19.250000 |
PL | 8.0 | 27.0 | 19.138889 |
PM | 7.0 | 49.0 | 21.364155 |
PP | 4.0 | 74.0 | 17.173942 |
PX | 18.0 | 20.0 | 19.000000 |
RF | 9.0 | 20.0 | 13.386667 |
RM | 4.0 | 29.0 | 10.585010 |
RO | 8.0 | 13.0 | 10.250000 |
RX | 11.0 | 20.0 | 15.500000 |
SF | 24.0 | 199.0 | 58.878049 |
SH | 16.0 | 140.0 | 73.148936 |
SO | 15.0 | 105.0 | 55.414634 |
SS | 57.0 | 130.0 | 93.500000 |
ST | NaN | NaN | NaN |
UR | NaN | NaN | NaN |
What is the average hindfoot length for male and female rodents of each species? Is there a male/female difference?
PYTHON
# Create the merged dataframe
merged = pd.merge(surveys, species, how="left")
# Limit merged dataframe to rodents
rodents = merged[merged["taxa"] == "Rodent"]
# Group rodents by species and sex
grouped = rodents.groupby(["species_id", "sex"])
# Calculate the mean hindfoot length, plus count and standard deviation
# to better assess the question
with pd.option_context("display.max_rows", None):
print(grouped["hindfoot_length"].agg(["count", "mean", "std"]))
OUTPUT
count mean std
species_id sex
AH M 0 NaN NaN
U 2 33.000000 2.828427
BA F 31 13.161290 1.529636
M 14 12.642857 2.097880
U 0 NaN NaN
DM F 4302 35.712692 1.433067
M 5658 36.188229 1.455396
U 12 35.583333 0.996205
DO F 1244 35.486334 1.726414
M 1640 35.698780 1.611656
U 3 36.000000 0.000000
DS F 1046 49.583174 1.973573
M 1083 50.301939 2.128120
U 3 50.000000 1.732051
DX U 0 NaN NaN
NL F 620 32.024194 1.669181
M 452 32.577434 1.903603
U 2 52.000000 25.455844
OL F 438 20.287671 1.101022
M 480 20.758333 1.651195
U 2 20.000000 0.000000
OT F 1019 20.281649 1.519412
M 1115 20.257399 1.087354
U 5 19.600000 1.140175
OX F 3 19.666667 1.154701
M 5 18.800000 3.346640
U 0 NaN NaN
PB F 1645 25.882675 0.928842
M 1213 26.431987 1.444130
U 6 26.166667 0.983192
PE F 544 20.272059 1.156084
M 666 20.132132 1.177721
U 2 20.500000 0.707107
PF F 721 15.550624 1.381627
M 770 15.620779 1.142208
U 2 13.000000 4.242641
PH F 20 25.900000 1.293709
M 11 25.545455 1.752920
PI M 8 22.500000 0.534522
U 1 20.000000 NaN
PL F 16 19.750000 2.720294
M 19 20.263158 0.933459
U 1 20.000000 NaN
PM F 361 20.279778 1.162782
M 483 20.530021 1.275621
U 3 21.333333 4.041452
PP F 1574 21.676620 1.316529
M 1444 21.838643 1.201006
U 9 20.888889 2.027588
PX F 1 19.000000 NaN
M 1 20.000000 NaN
U 0 NaN NaN
RF F 55 17.527273 0.813191
M 18 17.500000 0.985184
RM F 1133 16.391880 1.123946
M 1296 16.491512 1.134202
U 13 15.846154 2.444250
RO F 4 14.750000 0.957427
M 4 16.000000 1.154701
RX M 2 18.500000 2.121320
SF F 16 27.500000 2.683282
M 23 26.347826 3.524056
U 2 24.500000 0.707107
SH F 71 29.028169 2.298893
M 60 27.983333 2.801281
U 0 NaN NaN
SO F 30 25.633333 4.139660
M 11 25.727273 1.954017
SS F 0 NaN NaN
M 0 NaN NaN
U 0 NaN NaN
ST U 0 NaN NaN
UR U 0 NaN NaN
What is the average weight of each rodent species over the course of the years? Is there any noticeable trend for any of the species?
PYTHON
# Create the merged dataframe
merged = pd.merge(surveys, species, how="left")
# Limit merged dataframe to rodents
rodents = merged[merged["taxa"] == "Rodent"]
# Group rodents by species and year
grouped = rodents.groupby(["species_id", "year"])
# Calculate the mean weight by year
result = grouped["weight"].mean()
# Use the option_context context manager to show all rows
with pd.option_context("display.max_rows", None):
print(result)
OUTPUT
species_id year
AH 1978 NaN
1982 NaN
1983 NaN
1984 NaN
1985 NaN
1986 NaN
1987 NaN
1988 NaN
1989 NaN
1990 NaN
1991 NaN
1992 NaN
1993 NaN
1994 NaN
1995 NaN
1996 NaN
1997 NaN
1998 NaN
1999 NaN
2000 NaN
2001 NaN
2002 NaN
BA 1989 7.000000
1990 8.000000
1991 9.240000
1992 7.833333
DM 1977 41.141304
1978 40.795455
1979 43.507317
1980 44.136082
1981 44.032907
1982 41.975042
1983 40.944551
1984 40.765306
1985 41.507645
1986 43.361596
1987 43.232609
1988 43.397790
1989 44.349206
1990 41.769912
1991 43.148338
1992 43.877966
1993 43.373984
1994 42.373288
1995 44.806147
1996 44.506173
1997 44.708551
1998 43.131403
1999 43.945402
2000 43.126638
2001 45.442177
2002 46.168317
DO 1977 42.666667
1978 45.000000
1979 45.862069
1980 48.058824
1981 49.111111
1982 47.919643
1983 47.150602
1984 48.415094
1985 47.956731
1986 49.372727
1987 50.087379
1988 51.463768
1989 51.025641
1990 48.512048
1991 49.695652
1992 48.367816
1993 48.461538
1994 47.750000
1995 49.592593
1996 47.234940
1997 48.177866
1998 49.731183
1999 48.012048
2000 49.224719
2001 52.233766
2002 49.514403
DS 1977 121.437500
1978 115.198653
1979 111.796954
1980 120.725664
1981 125.818444
1982 115.647929
1983 122.033088
1984 124.082192
1985 124.326316
1986 128.071429
1987 126.383838
1988 129.490566
1989 121.896552
1990 121.187500
1991 113.000000
1992 112.352941
1993 105.722222
1994 106.625000
1997 111.000000
1998 116.000000
1999 120.714286
DX 1980 NaN
1983 NaN
1986 NaN
1988 NaN
1991 NaN
1992 NaN
1993 NaN
1994 NaN
1995 NaN
1996 NaN
1998 NaN
1999 NaN
2000 NaN
2001 NaN
2002 NaN
NL 1977 NaN
1978 184.656250
1979 138.000000
1980 158.982143
1981 165.903226
1982 160.613208
1983 156.691489
1984 150.095238
1985 148.829268
1986 159.724138
1987 158.840000
1988 163.104167
1989 151.278689
1990 154.275862
1991 148.785714
1992 139.000000
1993 127.391304
1994 186.875000
1995 155.833333
1996 148.000000
1997 150.688889
1998 159.466667
1999 182.200000
2000 179.307692
2001 167.851064
2002 182.159091
OL 1977 21.666667
1978 31.027027
1979 33.717391
1980 33.107143
1981 33.296296
1982 35.590361
1983 34.584615
1984 32.550725
1985 32.108696
1986 30.880952
1987 30.779221
1988 30.306122
1989 31.947368
1990 31.200000
1991 28.171429
1992 28.454545
1993 27.545455
1994 21.900000
1995 27.296296
1996 27.769231
1997 33.625000
2002 25.428571
OT 1977 21.000000
1978 23.220000
1979 23.075758
1980 24.083333
1981 24.415385
1982 24.925532
1983 24.697674
1984 22.416667
1985 22.064516
1986 21.500000
1987 23.069767
1988 24.070588
1989 24.663366
1990 23.675325
1991 24.588235
1992 24.928571
1993 22.363636
1994 25.285714
1995 24.868421
1996 23.453704
1997 24.785156
1998 24.675676
1999 25.723810
2000 25.303448
2001 23.707792
2002 23.833333
OX 1977 21.333333
1978 NaN
1982 24.000000
1984 18.000000
1989 20.000000
1993 NaN
PB 1995 34.000000
1996 32.578947
1997 31.085603
1998 30.082237
1999 31.710037
2000 30.878899
2001 32.798851
2002 32.359447
PE 1977 19.500000
1978 20.428571
1979 20.529412
1980 22.450000
1981 20.558140
1982 21.173077
1983 20.864865
1984 20.210526
1985 20.360000
1986 22.520000
1987 21.625899
1988 22.625668
1989 21.935673
1990 21.671233
1991 21.435484
1992 22.750000
1993 19.230769
1994 18.000000
1995 21.230769
1996 22.027778
1997 20.929825
1998 20.304348
1999 25.428571
2000 21.615385
2001 20.400000
2002 21.719298
PF 1977 7.173913
1978 7.088235
1979 7.529412
1980 7.455556
1981 7.152542
1982 6.918919
1983 6.833333
1984 7.428571
1985 NaN
1986 8.000000
1987 7.500000
1988 8.000000
1989 7.230769
1990 7.117647
1991 7.827586
1992 8.767857
1993 8.238806
1994 7.855072
1995 8.780645
1996 8.393846
1997 8.448649
1998 8.720000
2001 8.538462
2002 7.388889
PH 1984 28.000000
1985 32.666667
1986 39.000000
1988 33.000000
1989 30.500000
1990 31.142857
1992 31.142857
1995 35.333333
1997 22.000000
PI 1992 18.000000
1995 NaN
2001 18.000000
2002 20.000000
PL 1995 22.333333
1996 21.166667
1997 18.789474
1998 16.714286
2000 21.000000
PM 1978 20.666667
1979 23.666667
1980 21.000000
1981 19.933333
1982 21.391304
1983 22.023810
1984 19.545455
1985 19.000000
1986 21.696970
1987 22.297710
1988 21.759259
1989 20.222222
1995 27.375000
1996 20.224490
1997 21.126394
1998 20.591398
1999 22.523810
2000 20.500000
2001 26.666667
2002 19.000000
PP 1977 15.333333
1978 14.869565
1979 15.105263
1980 14.176471
1981 13.950000
1982 16.125000
1983 15.468750
1984 15.307692
1985 15.764706
1986 16.750000
1987 17.840909
1988 18.280000
1989 17.409091
1990 16.029412
1991 17.904762
1992 17.479675
1993 17.954023
1994 17.585714
1995 16.844444
1996 18.095563
1997 18.175000
1998 16.266990
1999 16.521212
2000 16.788618
2001 17.752896
2002 17.018617
PX 1996 NaN
1997 19.000000
1998 NaN
2000 NaN
RF 1982 11.500000
1988 13.818182
1989 13.346939
1990 12.916667
1997 20.000000
RM 1977 10.000000
1978 7.500000
1979 8.333333
1980 10.227273
1981 11.178571
1982 10.436242
1983 9.872000
1984 11.152542
1985 8.371429
1986 10.750000
1987 10.657609
1988 10.322115
1989 10.411311
1990 10.305677
1991 10.498305
1992 10.904762
1993 10.747475
1994 10.675000
1995 12.653846
1996 11.455556
1997 11.230769
1998 13.100000
1999 10.555556
2000 11.400000
2001 11.333333
2002 10.000000
RO 1991 11.000000
2002 10.142857
RX 1995 20.000000
1997 11.000000
SF 1989 54.800000
1990 52.611111
1991 93.166667
1992 43.000000
1993 44.000000
2002 62.166667
SH 1977 NaN
1978 89.000000
1982 79.000000
1983 NaN
1986 55.000000
1987 73.800000
1988 72.806452
1989 76.345455
1990 76.888889
1991 63.000000
1997 50.857143
1999 54.000000
2000 73.375000
2001 79.900000
2002 64.666667
SO 1991 53.909091
1992 55.263158
1993 55.600000
1994 62.333333
1997 54.666667
SS 1977 NaN
1978 130.000000
1979 NaN
1980 NaN
1981 57.000000
1982 NaN
1983 NaN
1984 NaN
1985 NaN
1986 NaN
1987 NaN
1988 NaN
1989 NaN
1990 NaN
1991 NaN
1992 NaN
1993 NaN
1994 NaN
1995 NaN
1996 NaN
1997 NaN
1998 NaN
1999 NaN
2000 NaN
2001 NaN
2002 NaN
ST 1993 NaN
UR 1987 NaN
1988 NaN
1991 NaN
1992 NaN
1994 NaN
2001 NaN
2002 NaN
Name: weight, dtype: float64
Key Points
- Combine two dataframes on one or more common values using
pd.merge()
- Append rows from one dataframe to another using
pd.concat()
- Combine multiple text columns into one using the
+
operator - Convert date info to datetime objects using
pd.to_datetime()
Content from Visualizing Data
Last updated on 2025-07-15 | Edit this page
Overview
Questions
- How can we create different types of plots using Python?
- How can we style plots?
- How can we add descriptive titles to plots and axes?
Objectives
- Review processes for reading, modifying, and combining dataframes
- Make and customize scatter, box, and bar plots
We’ll begin by loading and preparing the data we’d like to plot. This will include operations introduced in previous lessons, including reading CSVs into dataframes, merging dataframes, sorting a dataframe, and removing records that include null values. We’ll start by importing pandas:
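As in earlier lessons:
PYTHON
import pandas as pd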
Next we’ll load the surveys dataset using
pd.read_csv()
:
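As before (assuming the CSV lives in the data directory):
PYTHON
surveys = pd.read_csv("data/surveys.csv")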
Now we want to take a quick look at the surveys dataset. Since we’re
going to be plotting data, we need to consider how we want to handle any
null values in the dataset. The info()
method provides an
overview, including counts of non-null values, that we can use to assess
the dataset.
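A sketch of that call:
PYTHON
surveys.info()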
OUTPUT
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35549 entries, 0 to 35548
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 record_id 35549 non-null int64
1 month 35549 non-null int64
2 day 35549 non-null int64
3 year 35549 non-null int64
4 plot_id 35549 non-null int64
5 species_id 34786 non-null object
6 sex 33038 non-null object
7 hindfoot_length 31438 non-null float64
8 weight 32283 non-null float64
dtypes: float64(2), int64(5), object(2)
memory usage: 2.4+ MB
There are 35,549 records in the table. Four columns—species_id, sex, hindfoot_length, and weight—include null values, that is, they contain fewer non-null values than there are rows in the dataframe. We can use fillna() to replace null values where it makes sense to do so. For example, some specimens do not specify a sex. We can fill those values in with the letter U (for unknown):
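As in the previous lesson:
PYTHON
surveys["sex"] = surveys["sex"].fillna("U")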
The other three columns that contain null values are all required for
the plots we will create below. This means that we can use
dropna()
to drop all rows where any column is null:
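A sketch, reassigning the result so the dataframe keeps only complete rows:
PYTHON
surveys = surveys.dropna()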
Now we’ll merge the main surveys dataframe with two other datasets containing additional information:
- species.csv provides the genus and species corresponding to species_id
- plots.csv provides the plot type corresponding to plot_id
We will read each CSV and merge it into the main dataframe:
PYTHON
species = pd.read_csv("data/species.csv")
plots = pd.read_csv("data/plots.csv")
surveys = surveys.merge(species, how="left").merge(plots, how="left")
Chaining
The previous cell performs two merges in the same line of code. Performing multiple operations on the same object in a single line of code is called chaining.
We now have a dataframe that includes all observations from the Portal dataset that specify a species, weight, and hindfoot length, as well as descriptive metadata about each species and plot type. The taxa column contains the general type of animal. If we look at the unique values in this column–
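One way to check (a sketch):
PYTHON
surveys["taxa"].unique()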
OUTPUT
array(['Rodent'], dtype=object)
–we can see that all remaining observations are of rodents. In honor of this, we will assign our dataframe to a new variable:
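A sketch; the sample() call just displays a few random rows, so the exact rows shown will vary:
PYTHON
rodents = surveys.copy()
rodents.sample(5)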
OUTPUT
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa | plot_type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
12369 | 15232 | 12 | 14 | 1988 | 11 | PE | M | 20.0 | 22.0 | Peromyscus | eremicus | Rodent | Control |
30117 | 34885 | 10 | 6 | 2002 | 4 | PB | F | 27.0 | 25.0 | Chaetodipus | baileyi | Rodent | Control |
2723 | 3815 | 1 | 31 | 1981 | 13 | OL | F | 21.0 | 35.0 | Onychomys | leucogaster | Rodent | Short-term Krat Exclosure |
22412 | 26462 | 7 | 9 | 1997 | 19 | PP | M | 21.0 | 14.0 | Chaetodipus | penicillatus | Rodent | Long-term Krat Exclosure |
13991 | 17109 | 1 | 30 | 1990 | 4 | DM | M | 37.0 | 49.0 | Dipodomys | merriami | Rodent | Control |
Finally, we’ll save the rodents dataframe to a file that we can load directly in the future if needed:
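A sketch, assuming we want a CSV named rodents.csv in the data directory (the exact filename is up to you):
PYTHON
rodents.to_csv("data/rodents.csv", index=False)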
Now we’re ready to plot.
Re-introducing plotly
We’ve already worked with plotly a little in previous lessons, but we haven’t provided a comprehensive introduction. Plotly is a data visualization package for Python that allows us to create customizable, interactive plots of a variety of different types. Plotly makes plots that are:
- Customizable. Allows the appearance of plots to be extensively modified.
- Interactive. Pan and zoom across plots, or hover over elements to get additional information about them.
- Flexible. Many different plot types can be created, often with only a few lines of code. Because plotly uses similar syntax for each plot type, it is also easy to quickly change plot types to get a different perspective on a dataset.
- Embeddable. Interactive plots can be embedded on websites using plotly’s JavaScript library.
Other plotting libraries
The R community has largely coalesced around ggplot2 for plotting. In contrast, the Python community has no clear consensus pick and makes use of a number of data visualization packages. Some other commonly used packages include:
The functionality of these packages overlaps to a large degree, and which one to use depends in large part on personal preference.
Plotly has two main ways of making plots:
-
plotly.express
provides a simplified interface for quickly building and customizing plots -
plotly.graph_objects
uses a more complex interface to provide more granular control over the contents of a plot
We will use plotly.express
in this lesson.
We’ll begin by reproducing the scatter plot from the end
of lesson 3, which included weight on the x axis and hindfoot length
on the y axis. Like pandas, the developers of
plotly.express
have a preferred alias, px, that we will use
when we import plotly.express
:
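A sketch of the import and the basic scatter plot:
PYTHON
import plotly.express as px
px.scatter(rodents, x="weight", y="hindfoot_length")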
OUTPUT
[Interactive scatter plot of hindfoot_length vs. weight]
Before diving into the content of this plot, let’s take a quick look at the interactive elements plotly makes available. When we hover over a plotly plot, a toolbar appears in the upper right corner. Each icon on the toolbar is a widget that allows us to interact with the plot. By default, the toolbar includes the following widgets:
- The camera allows us to save the current view as a PNG file
- The next four widgets are toggles that control how click-and-drag
affects the plot. Only one can be active at a time.
- The magnifying glass enables drawing a box to zoom
- The crossing arrows enable panning
- The dotted box enables drawing a box to select data
- The dotted lasso enables drawing an arbitrary shape to select data
- The plus box allows us to zoom in
- The minus box allows us to zoom out
- The crossing arrows autoscale the plot to show all data
- The house resets the plot to the original view
When we hover over a data point, we get a tooltip that shows information about that point.
Now let’s think about the plot itself.
Challenge
What are some limitations to the plot above? Think about how the data itself is presented as well as the general appearance of the plot.
Some issues with the plot include the following:
- All points are the same color
- Large number of overlapping points, making it difficult to understand how data is distributed
- Axis labels include underscores and lack units
- No plot title
Any others?
One issue with the plot is that many of the points in the dataframe overlap, making it difficult to get a feel for how the data is distributed. Does it cluster in places? Is it evenly distributed? We really can’t tell.
We can mitigate this issue in part by making the points semitransparent using the opacity keyword argument. We’ll use a value of 0.2, which means the points are mostly transparent.
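A sketch of the updated call, keeping the same x and y columns as before:
PYTHON
# Make overlapping points easier to see by lowering the marker opacity
px.scatter(rodents, x="weight", y="hindfoot_length", opacity=0.2)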
OUTPUT
With the points now partially transparent, the places where they overlap are more obvious, and we can see several clusters. To get a better sense of what those clusters might be, let’s add some color to the plot.
At the end of lesson 4, we colored this scatter plot based on the species_id column. This revealed that several of the clusters reflected the species of rodent observed. But the use of species_id was problematic in other ways. For one, there are so many different species that colors ended up being re-used. For another, the species codes are not legible to people who are not intimately familiar with this dataset.
With those concerns in mind, let's use genus for color instead. Genus information wasn't available in lesson 4, but we can use it now that we've merged the surveys and species dataframes.
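One way to do this is to pass the genus column to the color keyword argument (a sketch, building on the previous call):
PYTHON
# Color each point by genus
px.scatter(rodents, x="weight", y="hindfoot_length", color="genus", opacity=0.2)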
OUTPUT
Choosing colors
When making plots, one important consideration is making sure they are legible to as broad an audience as possible. Clear, informative labels are one way to do this. Color is another.
Plotly makes a number of color palettes available via its colors module. Because we are working with categorical data, we will use a qualitative palette, which consists of a list of discrete colors. Other palette types are also available. For example, plots showing a range of values might benefit from a sequential or diverging color scheme, which uses a continuous range of colors (for example, blue to red for a heat map).
Qualitative palettes are available under px.colors.qualitative. We can view the available palettes using the swatches() function.
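The call itself is short; swatches() returns a figure showing each palette:
PYTHON
# Display the built-in qualitative color palettes
px.colors.qualitative.swatches()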
OUTPUT
In the spirit of effective communication with a wide audience, we will use px.colors.qualitative.Safe, a colorblind-safe palette. Because we will be using the same palette for the rest of the lesson, we will store it as a variable.
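A minimal assignment might look like this (the variable name colors matches its use in the plots below):
PYTHON
# Store the colorblind-safe qualitative palette for reuse
colors = px.colors.qualitative.Safe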
In addition to simplifying the code for future plots a bit, storing the palette as a variable also allows us to quickly change the color scheme for all our plots at once if needed.
Let’s apply the safe colors to our scatter plot:
PYTHON
px.scatter(
rodents,
x="weight",
y="hindfoot_length",
color="genus",
opacity=0.2,
color_discrete_sequence=colors,
)
OUTPUT
Sorting data
Take a look at the legend of the plot. The genera from the dataset are all listed, but they are in no apparent order. This makes it difficult for anyone looking at the plot to quickly pick out a given genus. We can alphabetize the legend to make it more readable. To do so, we can use the category_orders keyword argument.
This argument requires a dict. Recall that a dict is a mapping of keys to values defined using curly braces. The dict passed to category_orders maps a column name from the dataframe to a list of values in the preferred order.
One approach to creating an alphabetical list would be to build the list manually. That would work well enough here, where we have only a handful of values, but it quickly becomes unwieldy for larger datasets. Instead, we will use the built-in sorted() function to create a sorted list of the values in the genus column. Because we will use the same order in all of the following plots, we will store the dict with the sorted values in a variable that we can refer to whenever we need it and change if needed.
Challenge
Create the dict needed for category_orders. The dict should map the column name, genus, to an ordered list of unique values from that column.
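One possible solution, sketched step by step (assuming the merged dataframe is named rodents as above):
PYTHON
# Get the unique values from the genus column
genera = rodents["genus"].unique()

# Sort the genera alphabetically
genera = sorted(genera)

# Map the column name to the ordered list of values
cat_order = {"genus": genera}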
In addition to the approach used in the challenge, we can create the dict we need with a single line of code:
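A sketch of that one-liner, using the same cat_order variable name that appears in the plots below:
PYTHON
# Map the genus column to an alphabetized list of its unique values
cat_order = {"genus": sorted(rodents["genus"].unique())}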
That variable can now be passed to the category_orders keyword argument, producing a new version of the plot with an alphabetical legend.
PYTHON
px.scatter(
rodents,
x="weight",
y="hindfoot_length",
color="genus",
opacity=0.2,
color_discrete_sequence=colors,
category_orders=cat_order,
)
OUTPUT
Note that the colors in the plot have also changed. Colors are assigned based on the same category order used by the legend.
Adding plot and axis titles
By default, plotly uses the column names from the dataframe to label the axes of a plot. Here, the axis labels are adequate but would benefit from removing the underscore and including units. We can use the labels keyword argument to assign human-readable labels to our plot.
We will turn again to a dict for this, which we will use to map the underlying column names to preferred display values:
PYTHON
labels = {
"hindfoot_length": "Hindfoot length (mm)",
"genus": "Genus",
"weight": "Weight (g)"
}
We can then update the plot itself using the title and labels keyword arguments:
PYTHON
px.scatter(
rodents,
x="weight",
y="hindfoot_length",
color="genus",
opacity=0.2,
color_discrete_sequence=colors,
category_orders=cat_order,
title="Rodent size by genus",
labels=labels,
)
OUTPUT
Create a faceted plot
Even with the semitransparent points, there is a good deal of overlap among the data on this plot, particularly in the lower left part. It may be useful to plot each genus separately to see if anything interesting is being obscured. We could do so using some of the techniques we've already covered. For example, we could filter the dataframe by genus and create a separate plot for each.
Plotly provides a simpler approach called faceting. A facet is a type of filter, and a faceted plot includes a separate subplot for each unique value. In plotly, we can create a faceted plot using the facet_col keyword argument. This argument produces a separate subplot for each unique value in the specified column. The subplots are arranged in a single row.
The plots we have created so far have not specified dimensions, but we will need to consider the size of the faceted plot. Because this plot includes a number of subplots, it may appear cramped unless it is quite wide. We can use the height and width keyword arguments to set the size of the plot. Each of these arguments requires an integer specifying a size in pixels.
Let’s update our plot to facet it by genus and make it 1400 pixels wide by 400 pixels tall:
PYTHON
px.scatter(
rodents,
x="weight",
y="hindfoot_length",
color="genus",
opacity=0.2,
color_discrete_sequence=colors,
category_orders=cat_order,
title="Rodent size by genus",
labels=labels,
facet_col="genus",
width=1400,
height=400,
)
OUTPUT
Row facets
The facet_col argument produces a single row of subplots. There is another argument, facet_row, that places each subplot in a separate row instead. Column and row facets can even be combined, for example, to produce a grid of subplots with genus as columns and sex as rows.
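For example, a sketch of a genus-by-sex grid (the sex column comes from the original surveys data; the dimensions here are arbitrary):
PYTHON
px.scatter(
    rodents,
    x="weight",
    y="hindfoot_length",
    facet_col="genus",
    facet_row="sex",
    width=1400,
    height=800,
)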
Making a box plot
We briefly discussed box plots (also known as box-and-whisker plots)
in lesson
4 as part of the discussion of summary statistics. Box plots are an
effective way to visualize the distribution of data. Plotly uses the
px.box()
method to generate them.
Some of the issues raised about the plot created during the earlier lesson, including the arbitrary order in which data was plotted, the repeated colors, and the inclusion of species with no data, have already been addressed above. We can carry those same fixes over to the box plot using the approaches we used for the scatter plots. Let's create a box plot of hindfoot length by genus using the same color and ordering rules we defined earlier:
PYTHON
px.box(
rodents,
x="genus",
y="hindfoot_length",
color="genus",
color_discrete_sequence=colors,
category_orders=cat_order,
title="Rodent hindfoot length by genus",
labels=labels,
)
OUTPUT
We now have a box plot with colors corresponding to the scatter plots above, with an alphabetically ordered x axis and legend.
There are other aspects of the box plot that we may want to tweak. We'll start with a concern about how the data is being represented. By default, plotly's box plots show individual points only for outliers, that is, points that fall outside the upper and lower fences. This works well enough for normally distributed data but can obscure patterns for more complex distributions. And indeed, some genera, like Dipodomys, show a large number of outliers on the box plot and multiple clusters of data on the scatter plot. How might we update the box plot to better convey these distributions?
The px.box() function includes a keyword argument, points, that allows us to change how the underlying data is displayed. It accepts three values:
- "outliers" shows only the outliers (default)
- "all" shows all points
- False shows no points
Let’s try updating the box plot to show all the underlying data:
PYTHON
px.box(
rodents,
x="genus",
y="hindfoot_length",
color="genus",
color_discrete_sequence=colors,
category_orders=cat_order,
title="Rodent hindfoot length by genus",
labels=labels,
points="all",
)
OUTPUT
A point cloud is now visible to the left of each box-and-whisker. We can see that plotly has spread the points out a bit along the x axis. This process, called jitter, is necessary because otherwise the points for each category would fall in a vertical line. We can also see that we’ve run into the same problem we did above with the scatter plot: The large number of overlapping points for each box makes it hard to see what is going on inside each point cloud.
We can again address this problem by changing the opacity of the markers. It's a little more complicated than it was for the scatter plot, however, because the px.box() function does not allow us to set the marker opacity directly. Instead, we have to build the plot, then update the existing markers using the update_traces() method. (Plotly refers to markers, lines, and other elements as traces.)
PYTHON
fig = px.box(
rodents,
x="genus",
y="hindfoot_length",
color="genus",
color_discrete_sequence=colors,
category_orders=cat_order,
title="Rodent hindfoot length by genus",
labels=labels,
points="all",
)
fig.update_traces(marker={"opacity": 0.1})
OUTPUT
With the points now semitransparent, it is possible to see separate populations among some genera, like Dipodomys.
Another way to examine the distribution of data is a violin plot, which visualizes the distribution of points as a smoothed curve, similar to a bell curve. We can change our box plot to a violin plot by swapping px.violin() in for px.box():
PYTHON
fig = px.violin(
rodents,
x="genus",
y="hindfoot_length",
color="genus",
color_discrete_sequence=colors,
category_orders=cat_order,
title="Rodent hindfoot length by genus",
labels=labels,
points="all",
)
fig.update_traces(marker={"opacity": 0.1})
OUTPUT
This plot makes it easier to identify complex distributions, like the bimodal distribution for Chaetodipus, that are visible but not necessarily obvious in the point clouds.
Challenge
Let’s return to a question posed all the way back in lesson 2: How has the weight of Dipodomys species changed over time? Make a plot that tries to answer this question.
There are several reasonable approaches to this question using the scatter plots and box plots covered so far in this lesson. One possibility is a faceted plot showing the mean weight of each Dipodomys species over the course of the study.
PYTHON
# Create genus_species column
rodents["genus_species"] = rodents["genus"] + " " + rodents["species"]
# Limit to Dipodomys
dipodomys = rodents[rodents["genus"] == "Dipodomys"].copy()
# Group by genus_species and year, then compute the mean weight
grouped = dipodomys.groupby(["genus_species", "year"])["weight"].mean().reset_index()
# Create scatter plot
px.scatter(grouped, x="year", y="weight", facet_col="genus_species")
The plot generated by this code shows that the mean weights for these species oscillated over the course of the study. Overall, mean weight increased slightly for the two smaller species but decreased for the largest species (albeit with a large oscillation that makes determining a trend difficult).
Key Points
- Plotly offers a wide variety of ways to build and style scatter plots
- Use scatter plots to visualize how parameters covary
- Use box and violin plots to visualize the distribution of a parameter