Visualizing Data
Last updated on 2025-07-15
Estimated time: 60 minutes
Overview
Questions
- How can we create different types of plots using Python?
- How can we style plots?
- How can we add descriptive titles to plots and axes?
Objectives
- Review processes for reading, modifying, and combining dataframes
- Make and customize scatter, box, and bar plots
We’ll begin by loading and preparing the data we’d like to plot. This will include operations introduced in previous lessons, including reading CSVs into dataframes, merging dataframes, sorting a dataframe, and removing records that include null values. First, we’ll import pandas:
Next we’ll load the surveys dataset using pd.read_csv():
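As a minimal, self-contained sketch of this step, the example below reads a few made-up rows from an in-memory string instead of the real file. The rows are illustrative only, and the data/surveys.csv path mentioned in the comment is an assumption based on the paths used for the other CSVs in this lesson.

```python
import io

import pandas as pd

# Stand-in for the surveys file: a few made-up rows with the same column names
csv_text = """record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight
1,7,16,1977,2,NL,M,32.0,
2,7,16,1977,3,NL,M,33.0,
3,7,16,1977,2,DM,F,37.0,
4,7,16,1977,7,DM,,36.0,40.0
"""

# pd.read_csv() accepts a file path or any file-like object.
# In the lesson this would likely be pd.read_csv("data/surveys.csv") (assumed path).
surveys = pd.read_csv(io.StringIO(csv_text))

# Empty fields in the CSV are read as null values
print(surveys.shape)
```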
Now we want to take a quick look at the surveys dataset. Since we’re
going to be plotting data, we need to consider how we want to handle any
null values in the dataset. The info() method provides an overview,
including counts of non-null values, that we can use to assess the
dataset.
OUTPUT
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35549 entries, 0 to 35548
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 record_id 35549 non-null int64
1 month 35549 non-null int64
2 day 35549 non-null int64
3 year 35549 non-null int64
4 plot_id 35549 non-null int64
5 species_id 34786 non-null object
6 sex 33038 non-null object
7 hindfoot_length 31438 non-null float64
8 weight 32283 non-null float64
dtypes: float64(2), int64(5), object(2)
memory usage: 2.4+ MB
There are 35,549 records in the table. Four columns (species_id, sex,
hindfoot_length, and weight) include null values, that is, they contain
fewer non-null values than there are rows in the dataframe. We can use
fillna() to replace null values where it makes sense to do so. For
example, some specimens do not specify a sex. We can fill those values
in with the letter U (for unknown):
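A minimal sketch of that replacement, using a small made-up dataframe in place of the full surveys table (the values are illustrative only):

```python
import pandas as pd

# Made-up rows standing in for the surveys dataframe
surveys = pd.DataFrame(
    {
        "species_id": ["DM", "DM", "PE"],
        "sex": ["M", None, "F"],
        "weight": [40.0, 48.0, None],
    }
)

# Replace nulls in the sex column only; other columns are left untouched
surveys["sex"] = surveys["sex"].fillna("U")

print(surveys)
```

Assigning the result back to the column keeps the change, since fillna() returns a new object by default.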
The other three columns that contain null values are all required for
the plots we will create below. This means that we can use dropna() to
drop all rows where any column is null:
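The effect of dropna() can be sketched on a small made-up dataframe (illustrative values, not rows from the Portal dataset). Called with no arguments, it removes every row that has a null in any column:

```python
import pandas as pd

# Made-up rows: only the first row has a value in every column
surveys = pd.DataFrame(
    {
        "species_id": ["DM", None, "PE", "DM"],
        "sex": ["M", "U", "F", "U"],
        "hindfoot_length": [37.0, 21.0, None, 36.0],
        "weight": [40.0, 22.0, 19.0, None],
    }
)

# Drop any row containing at least one null value
surveys = surveys.dropna()

print(len(surveys))
```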
Now we’ll merge the main surveys dataframe with two other datasets containing additional information:
- species.csv provides the genus and species corresponding to species_id
- plots.csv provides the plot type corresponding to plot_id
We will read each CSV and merge it into the main dataframe:
PYTHON
species = pd.read_csv("data/species.csv")
plots = pd.read_csv("data/plots.csv")
surveys = surveys.merge(species, how="left").merge(plots, how="left")
Chaining
The previous cell performs two merges in the same line of code. Performing multiple operations on the same object in a single line of code is called chaining.
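To make the idea concrete, the sketch below compares a chained merge with the same work done one step at a time. The three small dataframes are made-up stand-ins for the lesson's tables, and the names step_one and step_two are hypothetical.

```python
import pandas as pd

# Made-up stand-ins for the surveys, species, and plots tables
surveys = pd.DataFrame(
    {"record_id": [1, 2], "species_id": ["DM", "PE"], "plot_id": [2, 3]}
)
species = pd.DataFrame(
    {"species_id": ["DM", "PE"], "genus": ["Dipodomys", "Peromyscus"]}
)
plots = pd.DataFrame(
    {"plot_id": [2, 3], "plot_type": ["Control", "Long-term Krat Exclosure"]}
)

# Chained: each merge returns a new dataframe, so the next call can follow directly
chained = surveys.merge(species, how="left").merge(plots, how="left")

# Unchained equivalent, one step at a time
step_one = surveys.merge(species, how="left")
step_two = step_one.merge(plots, how="left")

print(chained.equals(step_two))
```

Chaining avoids intermediate variables, while the step-by-step form can be easier to debug because each intermediate result can be inspected.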
We now have a dataframe that includes all observations from the Portal dataset that specify a species, weight, and hindfoot length, as well as descriptive metadata about each species and plot type. The taxa column contains the general type of animal. If we look at the unique values in this column–
OUTPUT
array(['Rodent'], dtype=object)
–we can see that all remaining observations are of rodents. In honor of this, we will assign our dataframe to a new variable:
OUTPUT
| | record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa | plot_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12369 | 15232 | 12 | 14 | 1988 | 11 | PE | M | 20.0 | 22.0 | Peromyscus | eremicus | Rodent | Control |
| 30117 | 34885 | 10 | 6 | 2002 | 4 | PB | F | 27.0 | 25.0 | Chaetodipus | baileyi | Rodent | Control |
| 2723 | 3815 | 1 | 31 | 1981 | 13 | OL | F | 21.0 | 35.0 | Onychomys | leucogaster | Rodent | Short-term Krat Exclosure |
| 22412 | 26462 | 7 | 9 | 1997 | 19 | PP | M | 21.0 | 14.0 | Chaetodipus | penicillatus | Rodent | Long-term Krat Exclosure |
| 13991 | 17109 | 1 | 30 | 1990 | 4 | DM | M | 37.0 | 49.0 | Dipodomys | merriami | Rodent | Control |
Finally, we’ll save the rodents dataframe to a file that we can load directly in the future if needed:
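A minimal sketch of saving and reloading a dataframe follows. It assumes a made-up rodents dataframe and writes to a temporary directory; the file name rodents.csv is hypothetical, and in the lesson the file would more likely sit alongside the other CSVs.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for the rodents dataframe
rodents = pd.DataFrame(
    {"genus": ["Dipodomys", "Peromyscus"], "weight": [49.0, 22.0]}
)

# A temporary directory stands in for the lesson's data folder (assumption)
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "rodents.csv")

# index=False leaves the row index out of the file
rodents.to_csv(out_path, index=False)

# Reading the file back gives a dataframe with the same columns and rows
reloaded = pd.read_csv(out_path)
print(reloaded)
```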
Now we’re ready to plot.
Re-introducing plotly
We’ve already worked with plotly a little in previous lessons, but we haven’t provided a comprehensive introduction. Plotly is a data visualization package for Python that allows us to create customizable, interactive plots of a variety of different types. Plotly makes plots that are:
- Customizable. Allows the appearance of plots to be extensively modified.
- Interactive. Pan and zoom across plots, or hover over elements to get additional information about them.
- Flexible. Many different plot types can be created, often with only a few lines of code. Because plotly uses similar syntax for each plot type, it is also easy to quickly change plot types to get a different perspective on a dataset.
- Embeddable. Interactive plots can be embedded on websites using plotly’s JavaScript library.
Other plotting libraries
The R community has largely coalesced around ggplot2 for plotting. In contrast, the Python community has no clear consensus pick and makes use of a number of data visualization packages. Some other commonly used packages include:
The functionality of these packages overlaps to a large degree, and which one to use depends in large part on personal preference.
Plotly has two main ways of making plots:
- plotly.express provides a simplified interface for quickly building and customizing plots
- plotly.graph_objects uses a more complex interface to provide more granular control over the contents of a plot
We will use plotly.express in this lesson.
We’ll begin by reproducing the scatter plot from the end of lesson 3,
which included weight on the x axis and hindfoot length on the y axis.
As with pandas, the developers of plotly.express have a preferred alias,
px, that we will use when we import plotly.express:
OUTPUT
Choosing colors
One concern when making plots is to make sure they are legible to as broad an audience as possible. Clear, informative labels are one way to do this. Color is another.
Plotly makes a number of color palettes available via its color module. Because we are working with categorical data, we will use a qualitative palette, which consists of a list of discrete colors. Other palette types are also available. For example, plots showing a range of values might benefit from using a sequential or diverging color scheme, which use a continuous range of colors (for example, blue to red for a heat map).
Qualitative palettes are available under px.colors.qualitative. We can
view the available palettes using the swatches function:
OUTPUT
In the spirit of effective communication with a wide audience, we will
use px.colors.qualitative.Safe, a colorblind-safe palette. Because we
will be using the same palette for the rest of the lesson, we will store
the palette as a variable:
In addition to simplifying the code for future plots a bit, storing the palette as a variable also allows us to quickly change the color scheme for all our plots at once if needed.
Let’s apply the safe colors to our scatter plot:
PYTHON
px.scatter(
rodents,
x="weight",
y="hindfoot_length",
color="genus",
opacity=0.2,
color_discrete_sequence=colors,
)
OUTPUT
Sorting data
Take a look at the legend of the plot. The genera from the dataset are all listed, but they are in no apparent order. This makes it difficult for anyone looking at the plot to quickly pick out a given genus. We can alphabetize the legend to make it more readable. To do so, we can use the category_orders keyword argument.
This argument requires a dict. Recall that a dict is a mapping of keys
to values defined using curly braces. The dict passed to category_orders
maps a column name from the dataframe to a list of values in the
preferred order.
One approach to creating an alphabetical list would be to simply
build the list manually. That would work well enough here, where we have
only a handful of values, but quickly becomes unwieldy for larger
datasets. Instead, we will use the built-in sorted() function to create
a sorted list of values in the genus column. Because we will be using
the same order in all following plots, we will store the dict with the
sorted values in a variable that we can refer to whenever we need to and
can change if needed.
Challenge
Create the dict needed for category_orders. The dict should map the
column name, genus, to an ordered list of unique values from that
column.
In addition to the approach used in the challenge, we can create the
dict we need with a single line of code:
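One way to write that line is sketched below, using a small made-up rodents dataframe. The variable name cat_order matches the name passed to category_orders in the plots that follow.

```python
import pandas as pd

# Made-up stand-in for the rodents dataframe, with genera out of order
rodents = pd.DataFrame(
    {"genus": ["Peromyscus", "Dipodomys", "Chaetodipus", "Dipodomys"]}
)

# unique() removes duplicates; sorted() returns an alphabetical list
cat_order = {"genus": sorted(rodents["genus"].unique())}

print(cat_order)
```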
That variable can now be passed to the category_orders keyword argument, producing a new version of the plot with an alphabetical legend.
PYTHON
px.scatter(
rodents,
x="weight",
y="hindfoot_length",
color="genus",
opacity=0.2,
color_discrete_sequence=colors,
category_orders=cat_order,
)
OUTPUT
Note that the colors in the plot have also changed. Colors are assigned based on the same category order used by the legend.
Adding plot and axis titles
By default, plotly uses the column names from the dataframe to label the axes of a plot. Here, the axis labels are adequate but would benefit from removing the underscore and including units. We can use the labels keyword argument to assign human-readable labels to our plot.
We will turn again to a dict for this, which we will use to map the
underlying column names to preferred display values:
PYTHON
labels = {
"hindfoot_length": "Hindfoot length (mm)",
"genus": "Genus",
"weight": "Weight (g)"
}
We can then update the plot itself using the title and labels keyword arguments:
PYTHON
px.scatter(
rodents,
x="weight",
y="hindfoot_length",
color="genus",
opacity=0.2,
color_discrete_sequence=colors,
category_orders=cat_order,
title="Rodent size by genus",
labels=labels,
)
OUTPUT
Create a faceted plot
Even with the semitransparent points, there is a good deal of overlap among the data on this plot, particularly in the lower left part. It may be useful to plot each genus separately to see if anything interesting is being obscured. We could do so using some of the techniques we’ve already covered. For example, we could filter the dataframe by genus and create a separate plot for each.
Plotly provides a simpler approach called faceting. A facet is a type of filter, and a faceted plot includes a separate subplot for each unique value. In plotly, we can create a faceted plot using the facet_col keyword argument. This argument produces a separate subplot for each unique value in the specified column. The subplots are arranged in a single row.
The plots we have created so far have not specified dimensions, but we will need to consider the size of the faceted plot. Because this plot includes a number of subplots, it may appear cramped unless it is quite wide. We can use the height and width keyword arguments to set the size of the plot. Each of these arguments requires an integer specifying a size in pixels.
Let’s update our plot to facet it by genus and make it 1400 pixels wide by 400 pixels tall:
PYTHON
px.scatter(
rodents,
x="weight",
y="hindfoot_length",
color="genus",
opacity=0.2,
color_discrete_sequence=colors,
category_orders=cat_order,
title="Rodent size by genus",
labels=labels,
facet_col="genus",
width=1400,
height=400,
)
OUTPUT
Row facets
The facet_col argument produces a single row of subplots. There is another argument, facet_row, that places each subplot in a separate row instead. Column and row facets can even be combined, for example, to produce a grid of subplots with genus as columns and sex as rows.
Making a box plot
We briefly discussed box plots (also known as box-and-whisker plots) in
lesson 4 as part of the discussion of summary statistics. Box plots are
an effective way to visualize the distribution of data. Plotly uses the
px.box() function to generate them.
Some issues raised about the plot created during the earlier lesson—including the arbitrary order in which data was plotted, the repeated colors, and the inclusion of species with no data—have already been addressed above. We can integrate them into the box plot using the same approaches that we used for the scatter plots previously. Let’s create a box plot of hindfoot length by genus using the same color and ordering rules we defined earlier:
PYTHON
px.box(
rodents,
x="genus",
y="hindfoot_length",
color="genus",
color_discrete_sequence=colors,
category_orders=cat_order,
title="Rodent hindfoot length by genus",
labels=labels,
)
OUTPUT
We now have a box plot with colors corresponding to the scatter plots above, with an alphabetically ordered x axis and legend.
There are other aspects of the box plot that we may want to tweak. We’ll start with a concern about how the data is being represented. By default, plotly’s box plots show individual points only for outliers, that is, points that plot outside the upper and lower fences. This works well enough for normally distributed data but can obscure patterns for more complex distributions. And indeed, some genera, like Dipodomys, show a large number of outliers on the box plot and multiple clusters of data on the scatter plot. How might we update the box plot to better convey these distributions?
The px.box() function includes a keyword argument, points, that allows
us to change how the underlying data is displayed. Its main options are:
- outliers only shows the outliers (default)
- all shows all points
- False shows no points
Let’s try updating the box plot to show all the underlying data:
PYTHON
px.box(
rodents,
x="genus",
y="hindfoot_length",
color="genus",
color_discrete_sequence=colors,
category_orders=cat_order,
title="Rodent hindfoot length by genus",
labels=labels,
points="all",
)
OUTPUT
A point cloud is now visible to the left of each box-and-whisker. We can see that plotly has spread the points out a bit along the x axis. This process, called jitter, is necessary because otherwise the points for each category would fall in a vertical line. We can also see that we’ve run into the same problem we did above with the scatter plot: The large number of overlapping points for each box makes it hard to see what is going on inside each point cloud.
We can again address this problem by changing the opacity of the
markers. It’s a little more complicated than it was for the scatter
plot, however, because the px.box() function does not allow us to set
the marker opacity directly. Instead, we have to build the plot, then
update the existing markers using the update_traces() method. (Plotly
refers to markers, lines, and other elements as traces.)
PYTHON
fig = px.box(
rodents,
x="genus",
y="hindfoot_length",
color="genus",
color_discrete_sequence=colors,
category_orders=cat_order,
title="Rodent hindfoot length by genus",
labels=labels,
points="all",
)
fig.update_traces(marker={"opacity": 0.1})
OUTPUT
With the points now semitransparent, it is possible to see separate populations among some genera, like Dipodomys.
Another way to examine the distribution of data is a violin plot, which
draws the estimated density of the points as a smooth outline, similar
to a bell curve. We can change our box plot to a violin plot by swapping
px.violin() in for px.box():
PYTHON
fig = px.violin(
rodents,
x="genus",
y="hindfoot_length",
color="genus",
color_discrete_sequence=colors,
category_orders=cat_order,
title="Rodent hindfoot length by genus",
labels=labels,
points="all",
)
fig.update_traces(marker={"opacity": 0.1})
OUTPUT
This plot makes it easier to identify complex distributions, like the bimodal distribution for Chaetodipus, that are visible but not necessarily obvious in the point clouds.
Challenge
Let’s return to a question posed all the way back in lesson 2: How has the weight of Dipodomys species changed over time? Make a plot that tries to answer this question.
There are several reasonable approaches to this question using the scatter plots and box plots covered so far in this lesson. One possibility is a faceted plot showing the mean weight of each Dipodomys species over the course of the study.
PYTHON
# Create genus_species column
rodents["genus_species"] = rodents["genus"] + " " + rodents["species"]
# Limit to Dipodomys
dipodomys = rodents[rodents["genus"] == "Dipodomys"].copy()
# Group by genus_species
grouped = dipodomys.groupby(["genus_species", "year"])["weight"].mean().reset_index()
# Create scatter plot
px.scatter(grouped, x="year", y="weight", facet_col="genus_species")
The plot generated by this code shows that the mean weights for these species oscillated over the course of the study. Overall, mean weight increased slightly for the two smaller species but decreased for the largest species (albeit with a large oscillation that makes determining a trend difficult).
Key Points
- Plotly offers a wide variety of ways to build and style scatter plots
- Use scatter plots to visualize how parameters covary
- Use box and violin plots to visualize the distribution of a parameter