Aggregating and Grouping Data
Last updated on 2022-07-18
Estimated time: 60 minutes
Overview
Questions
- How do we calculate summary statistics?
- How do we group data?
- How do null values affect calculations?
Objectives
- Introduce aggregation calculations in pandas
- Introduce grouping in pandas
- Learn about how pandas handles null values
As always, we’ll begin by importing pandas and reading our CSV:
PYTHON
import pandas as pd
= pd.read_csv("data/surveys.csv") surveys
Aggregating data
Aggregation allows us to describe the data in our dataframe by calculating totals (like the number of records) and statistics (like the mean value of a column). pandas allows us to run these calculations on dataframes or subsets.
Suppose we need to know how many records are in our dataset. We’ve already seen that we can use the info() method to get high-level information about the dataset, including the number of entries. What if we just wanted the number of rows? One approach is to use the built-in function len(), which calculates the number of items in an object (for example, the number of characters in a string or the number of items in a list). When used on a dataframe, len() returns the number of rows:
PYTHON
len(surveys)
OUTPUT
35549
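As an aside, the shape attribute reports rows and columns at once. A minimal sketch, using a small stand-in dataframe rather than the survey CSV:

```python
import pandas as pd

# A tiny stand-in dataframe (the lesson itself uses data/surveys.csv)
df = pd.DataFrame({"species_id": ["DM", "DO", "DM"], "weight": [40.0, 48.0, 41.0]})

print(len(df))    # number of rows
print(df.shape)   # (rows, columns) as a tuple
```

len(df) always matches the first element of df.shape.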
pandas provides a suite of aggregation methods that go beyond this simple case. For example, we can count the number of non-NaN values in each column using count():
PYTHON
surveys.count()
OUTPUT
record_id 35549
month 35549
day 35549
year 35549
plot_id 35549
species_id 34786
sex 33038
hindfoot_length 31438
weight 32283
dtype: int64
Or we can find out the total weight of all individuals in grams using sum():
PYTHON
"weight"].sum() surveys[
OUTPUT
1377594.0
Other aggregation methods supported by pandas include min(), max(), and mean(). These methods all ignore NaNs by default, so missing data does not affect the calculations.
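A small sketch, using a toy Series standing in for the weight column, shows this default skipna behavior and what happens when it is switched off:

```python
import numpy as np
import pandas as pd

# Toy column with one missing value
s = pd.Series([4.0, np.nan, 8.0])

print(s.mean())              # NaN is ignored: (4 + 8) / 2 = 6.0
print(s.mean(skipna=False))  # NaN propagates: the result is NaN
```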
Challenge
Calculate the total weight, average weight, minimum and maximum weights for all animals caught over the duration of the survey. Can you modify your code so that it outputs these values only for weights between 5 and 10 grams?
PYTHON
# Create a subset of only the animals between 5 and 10 grams
weights = surveys[(surveys["weight"] > 5) & (surveys["weight"] < 10)]["weight"]

# Display aggregation calculations using a dict
{
    "sum": weights.sum(),
    "mean": weights.mean(),
    "min": weights.min(),
    "max": weights.max(),
}
OUTPUT
{'sum': 16994.0, 'mean': 7.91523055426176, 'min': 6.0, 'max': 9.0}
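As a variation on the solution above, the same four statistics can be computed in one call by passing a list of method names to agg(), which returns a Series indexed by statistic. A sketch with a toy weights Series:

```python
import pandas as pd

# Toy weights between 5 and 10 grams
weights = pd.Series([6.0, 7.0, 9.0, 8.0])

# One call instead of four separate ones
stats = weights.agg(["sum", "mean", "min", "max"])
print(stats)
```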
To quickly generate summary statistics, we can use the describe() method instead of calling each aggregation method separately. When we use this method on a dataframe, it calculates stats for all columns with numeric data:
PYTHON
surveys.describe()
 | record_id | month | day | year | plot_id | hindfoot_length | weight |
---|---|---|---|---|---|---|---|
count | 35549.000000 | 35549.000000 | 35549.000000 | 35549.000000 | 35549.000000 | 31438.000000 | 32283.000000 |
mean | 17775.000000 | 6.477847 | 15.991195 | 1990.475231 | 11.397001 | 29.287932 | 42.672428 |
std | 10262.256696 | 3.396925 | 8.257366 | 7.493355 | 6.799406 | 9.564759 | 36.631259 |
min | 1.000000 | 1.000000 | 1.000000 | 1977.000000 | 1.000000 | 2.000000 | 4.000000 |
25% | 8888.000000 | 4.000000 | 9.000000 | 1984.000000 | 5.000000 | 21.000000 | 20.000000 |
50% | 17775.000000 | 6.000000 | 16.000000 | 1990.000000 | 11.000000 | 32.000000 | 37.000000 |
75% | 26662.000000 | 10.000000 | 23.000000 | 1997.000000 | 17.000000 | 36.000000 | 48.000000 |
max | 35549.000000 | 12.000000 | 31.000000 | 2002.000000 | 24.000000 | 70.000000 | 280.000000 |
You can see that describe() isn’t picky: it includes both ID and date columns in its results. Notice also that the counts differ between columns. This is because count() only counts non-NaN rows.
If desired, we can also describe a single column at a time:
PYTHON
"weight"].describe() surveys[
OUTPUT
count 32283.000000
mean 42.672428
std 36.631259
min 4.000000
25% 20.000000
50% 37.000000
75% 48.000000
max 280.000000
Name: weight, dtype: float64
If we need more control over the output (for example, if we want to calculate the total weight of all animals, as in the challenge above), pandas provides the agg() method, which allows us to specify methods by column. The argument passed to this method is a dict. Each key must be a column name and each value a list of the names of aggregation methods. To calculate the total weight, mean weight, and mean hindfoot length of all records in the survey, we can use:
PYTHON
"weight": ["sum", "mean"], "hindfoot_length": ["mean"]}) surveys.agg({
 | weight | hindfoot_length |
---|---|---|
sum | 1.377594e+06 | NaN |
mean | 4.267243e+01 | 29.287932 |
Grouping data
Now, let’s find out how many individuals were counted for each species. We do this using groupby(), which creates an object similar to a dataframe where rows are grouped by the data in one or more columns. To group by species_id, use:
PYTHON
grouped = surveys.groupby("species_id")
When we aggregate grouped data, pandas makes separate calculations for each member of the group. In the example below, we’ll calculate the number of times each species appears in the dataset. Rather than outputting the full dataframe, we’ll limit the count to a single column. Because count() ignores NaN cells, it’s good practice to use a column that does not contain nulls. Record ID fields are a good choice because they make it clear that we are counting rows. The fields used to group the data also work.
PYTHON
"species_id"].count() grouped[
OUTPUT
species_id
AB 303
AH 437
AS 2
BA 46
CB 50
CM 13
CQ 16
CS 1
CT 1
CU 1
CV 1
DM 10596
DO 3027
DS 2504
DX 40
NL 1252
OL 1006
OT 2249
OX 12
PB 2891
PC 39
PE 1299
PF 1597
PG 8
PH 32
PI 9
PL 36
PM 899
PP 3123
PU 5
PX 6
RF 75
RM 2609
RO 8
RX 2
SA 75
SC 1
SF 43
SH 147
SO 43
SS 248
ST 1
SU 5
UL 4
UP 8
UR 10
US 4
ZL 2
Name: species_id, dtype: int64
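A related aside: grouped objects also have a size() method, which counts all rows in each group, NaNs included, whereas count() skips nulls. A minimal sketch on a toy dataframe:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "species_id": ["DM", "DM", "DO"],
    "weight": [40.0, np.nan, 48.0],
})

grouped = df.groupby("species_id")
print(grouped["weight"].count())  # DM -> 1 (the NaN weight is excluded)
print(grouped.size())             # DM -> 2 (every row is counted)
```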
To group by multiple columns, we can pass a list to groupby():
PYTHON
"species_id", "year"])["record_id"].count() surveys.groupby([
OUTPUT
species_id year
AB 1980 5
1981 7
1982 34
1983 41
1984 12
..
UR 1994 1
2001 2
2002 1
US 2002 4
ZL 1988 2
Name: record_id, Length: 509, dtype: int64
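A two-level result like this can be pivoted into a species-by-year table with unstack(), where missing species/year pairs become NaN. A sketch on a toy dataframe:

```python
import pandas as pd

df = pd.DataFrame({
    "species_id": ["AB", "AB", "UR"],
    "year": [1980, 1981, 1994],
    "record_id": [1, 2, 3],
})

counts = df.groupby(["species_id", "year"])["record_id"].count()
table = counts.unstack()  # years become columns; absent pairs become NaN
print(table)
```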
Challenge
Write statements that return:
- How many individuals were counted in each year in total
- How many were counted each year, for each different species
- The average weights of each species in each year
- How many individuals were counted for each species that was observed more than 10 times
Can you get the answer to both 2 and 3 in a single query?
How many individuals were counted in each year in total?
PYTHON
# Individual counts per year
"year")["record_id"].count() surveys.groupby(
OUTPUT
year
1977 503
1978 1048
1979 719
1980 1415
1981 1472
1982 1978
1983 1673
1984 981
1985 1438
1986 942
1987 1671
1988 1469
1989 1569
1990 1311
1991 1347
1992 1038
1993 750
1994 668
1995 1222
1996 1706
1997 2493
1998 1610
1999 1135
2000 1552
2001 1610
2002 2229
Name: record_id, dtype: int64
How many individuals were counted each year, for each different species?
PYTHON
# Individual counts by species and year
result = surveys.groupby(["year", "species_id"])["record_id"].count()

# Use the option_context context manager to show all rows
with pd.option_context("display.max_rows", None):
    print(result)
OUTPUT
year species_id
1977 DM 264
DO 12
DS 98
NL 31
OL 10
OT 17
OX 7
PE 6
PF 31
PP 7
RM 2
SH 1
SS 1
1978 AH 1
DM 389
DO 22
DS 320
NL 48
OL 47
OT 66
OX 1
PE 17
PF 40
PM 3
PP 24
RM 3
SA 2
SH 1
SS 8
1979 DM 209
DO 31
DS 204
NL 30
OL 46
OT 68
PE 17
PF 19
PM 6
PP 20
RM 6
SS 2
1980 AB 5
CB 1
CM 12
CQ 2
DM 493
DO 89
DS 346
DX 1
NL 57
OL 85
OT 96
PE 20
PF 95
PM 1
PP 17
RM 46
SS 9
1981 AB 7
CB 1
CQ 4
DM 559
DO 123
DS 354
NL 63
OL 55
OT 65
PE 43
PF 59
PM 16
PP 20
RM 29
SA 11
SS 8
1982 AB 34
AH 1
CQ 8
DM 609
DO 117
DS 354
NL 111
OL 83
OT 94
OX 1
PC 2
PE 107
PF 152
PM 72
PP 32
RF 2
RM 154
SA 17
SH 2
SS 12
1983 AB 41
AH 2
CB 3
DM 528
DO 167
DS 280
DX 1
NL 98
OL 65
OT 86
PC 3
PE 38
PF 114
PG 1
PM 42
PP 32
RM 127
SA 10
SH 2
SS 12
1984 AB 12
AH 2
CB 1
CQ 2
DM 396
DO 108
DS 76
NL 64
OL 71
OT 49
OX 1
PE 20
PF 7
PH 7
PM 12
PP 26
PU 2
RM 60
SA 20
SS 15
1985 AB 14
AH 14
CB 1
DM 667
DO 225
DS 98
NL 45
OL 94
OT 62
PE 51
PF 1
PH 3
PM 17
PP 35
RM 72
SA 3
SS 14
1986 AB 5
AH 19
CB 4
DM 406
DO 115
DS 88
DX 1
NL 60
OL 43
OT 25
PE 26
PF 2
PH 1
PM 34
PP 26
RM 53
SA 1
SH 1
SS 12
1987 AB 35
AH 41
CB 5
CM 1
CV 1
DM 469
DO 213
DS 104
NL 128
OL 79
OT 43
PC 5
PE 142
PF 4
PM 133
PP 45
RM 189
SA 1
SH 5
SS 20
UP 2
UR 3
1988 AB 39
AH 26
CB 6
DM 365
DO 144
DS 54
DX 1
NL 102
OL 51
OT 86
PE 190
PF 2
PG 2
PH 1
PM 54
PP 53
RF 11
RM 211
SA 1
SH 31
SS 20
UL 1
UP 3
UR 1
ZL 2
1989 AB 31
AH 30
BA 3
CB 1
CS 1
DM 321
DO 119
DS 33
NL 67
OL 57
OT 102
OX 1
PC 1
PE 177
PF 13
PH 2
PM 27
PP 45
RF 49
RM 398
SA 3
SC 1
SF 5
SH 55
SS 18
UP 2
1990 AB 27
AH 22
BA 11
DM 462
DO 171
DS 17
NL 29
OL 25
OT 77
PC 8
PE 75
PF 51
PG 4
PH 7
PP 34
RF 12
RM 231
SF 18
SH 9
SS 5
UP 1
1991 AB 15
AH 21
AS 1
BA 26
CB 2
DM 404
DO 122
DS 11
DX 4
NL 30
OL 36
OT 107
PC 6
PE 65
PF 62
PP 88
RM 307
RO 1
SA 2
SF 7
SH 1
SO 11
SS 2
UR 1
1992 AB 10
AH 16
BA 6
CB 7
DM 307
DO 94
DS 18
DX 12
NL 15
OL 45
OT 42
PC 9
PE 41
PF 57
PH 7
PI 1
PP 131
RM 158
SF 3
SO 20
SS 7
UL 1
UR 1
1993 AB 9
AH 15
AS 1
CB 5
CU 1
DM 253
DO 29
DS 18
DX 3
NL 24
OL 35
OT 22
OX 1
PC 1
PE 13
PF 70
PP 100
RM 103
SF 3
SO 6
SS 1
ST 1
1994 AB 7
AH 19
CB 1
DM 293
DO 25
DS 9
DX 2
NL 10
OL 21
OT 21
PC 1
PE 3
PF 73
PP 74
PU 2
RM 40
SO 3
SS 7
SU 1
UL 2
UR 1
1995 AB 4
AH 36
CB 2
DM 436
DO 58
DX 3
NL 8
OL 29
OT 38
PB 7
PC 1
PE 14
PF 158
PH 3
PI 1
PL 3
PM 8
PP 277
PU 1
RM 81
RX 1
SS 7
SU 4
1996 AH 25
DM 492
DO 174
DX 1
NL 7
OL 13
OT 108
PB 39
PC 1
PE 36
PF 330
PL 6
PM 50
PP 298
PX 2
RM 90
SA 1
SS 16
1997 AH 37
CB 3
DM 576
DO 253
DS 6
NL 48
OL 8
OT 258
PB 259
PE 57
PF 186
PH 1
PL 19
PM 271
PP 325
PX 2
RF 1
RM 158
RX 1
SA 1
SH 7
SO 3
SS 12
1998 AB 2
AH 33
CB 4
CT 1
DM 503
DO 111
DS 9
DX 2
NL 32
OT 164
PB 329
PC 1
PE 24
PF 26
PL 7
PM 103
PP 208
PX 1
RM 13
SA 2
SS 15
1999 AH 14
DM 348
DO 84
DS 7
DX 3
NL 20
OT 105
PB 272
PE 7
PM 44
PP 167
RM 28
SH 2
SS 7
2000 AB 3
AH 12
DM 233
DO 91
DX 4
NL 29
OT 154
PB 555
PE 15
PG 1
PL 1
PM 2
PP 381
PX 1
RM 15
SH 8
SS 4
2001 AB 1
AH 23
CB 3
DM 305
DO 81
DX 1
NL 48
OT 167
PB 538
PE 35
PF 27
PI 2
PM 3
PP 273
RM 15
SH 11
SS 5
UR 2
2002 AB 2
AH 28
DM 309
DO 249
DX 1
NL 48
OL 8
OT 127
PB 892
PE 60
PF 18
PI 5
PM 1
PP 385
RM 20
RO 7
SF 7
SH 11
SS 9
UR 1
US 4
Name: record_id, dtype: int64
What was the average weight of each species in each year?
PYTHON
# Mean weight by species and year
result = surveys.groupby(["year", "species_id"])["weight"].mean()

# Use the option_context context manager to show all rows
with pd.option_context("display.max_rows", None):
    print(result)
OUTPUT
year species_id
1977 DM 41.141304
DO 42.666667
DS 121.437500
NL NaN
OL 21.666667
OT 21.000000
OX 21.333333
PE 19.500000
PF 7.173913
PP 15.333333
RM 10.000000
SH NaN
SS NaN
1978 AH NaN
DM 40.795455
DO 45.000000
DS 115.198653
NL 184.656250
OL 31.027027
OT 23.220000
OX NaN
PE 20.428571
PF 7.088235
PM 20.666667
PP 14.869565
RM 7.500000
SA NaN
SH 89.000000
SS 130.000000
1979 DM 43.507317
DO 45.862069
DS 111.796954
NL 138.000000
OL 33.717391
OT 23.075758
PE 20.529412
PF 7.529412
PM 23.666667
PP 15.105263
RM 8.333333
SS NaN
1980 AB NaN
CB NaN
CM NaN
CQ NaN
DM 44.136082
DO 48.058824
DS 120.725664
DX NaN
NL 158.982143
OL 33.107143
OT 24.083333
PE 22.450000
PF 7.455556
PM 21.000000
PP 14.176471
RM 10.227273
SS NaN
1981 AB NaN
CB NaN
CQ NaN
DM 44.032907
DO 49.111111
DS 125.818444
NL 165.903226
OL 33.296296
OT 24.415385
PE 20.558140
PF 7.152542
PM 19.933333
PP 13.950000
RM 11.178571
SA NaN
SS 57.000000
1982 AB NaN
AH NaN
CQ NaN
DM 41.975042
DO 47.919643
DS 115.647929
NL 160.613208
OL 35.590361
OT 24.925532
OX 24.000000
PC NaN
PE 21.173077
PF 6.918919
PM 21.391304
PP 16.125000
RF 11.500000
RM 10.436242
SA NaN
SH 79.000000
SS NaN
1983 AB NaN
AH NaN
CB NaN
DM 40.944551
DO 47.150602
DS 122.033088
DX NaN
NL 156.691489
OL 34.584615
OT 24.697674
PC NaN
PE 20.864865
PF 6.833333
PG NaN
PM 22.023810
PP 15.468750
RM 9.872000
SA NaN
SH NaN
SS NaN
1984 AB NaN
AH NaN
CB NaN
CQ NaN
DM 40.765306
DO 48.415094
DS 124.082192
NL 150.095238
OL 32.550725
OT 22.416667
OX 18.000000
PE 20.210526
PF 7.428571
PH 28.000000
PM 19.545455
PP 15.307692
PU NaN
RM 11.152542
SA NaN
SS NaN
1985 AB NaN
AH NaN
CB NaN
DM 41.507645
DO 47.956731
DS 124.326316
NL 148.829268
OL 32.108696
OT 22.064516
PE 20.360000
PF NaN
PH 32.666667
PM 19.000000
PP 15.764706
RM 8.371429
SA NaN
SS NaN
1986 AB NaN
AH NaN
CB NaN
DM 43.361596
DO 49.372727
DS 128.071429
DX NaN
NL 159.724138
OL 30.880952
OT 21.500000
PE 22.520000
PF 8.000000
PH 39.000000
PM 21.696970
PP 16.750000
RM 10.750000
SA NaN
SH 55.000000
SS NaN
1987 AB NaN
AH NaN
CB NaN
CM NaN
CV NaN
DM 43.232609
DO 50.087379
DS 126.383838
NL 158.840000
OL 30.779221
OT 23.069767
PC NaN
PE 21.625899
PF 7.500000
PM 22.297710
PP 17.840909
RM 10.657609
SA NaN
SH 73.800000
SS NaN
UP NaN
UR NaN
1988 AB NaN
AH NaN
CB NaN
DM 43.397790
DO 51.463768
DS 129.490566
DX NaN
NL 163.104167
OL 30.306122
OT 24.070588
PE 22.625668
PF 8.000000
PG NaN
PH 33.000000
PM 21.759259
PP 18.280000
RF 13.818182
RM 10.322115
SA NaN
SH 72.806452
SS NaN
UL NaN
UP NaN
UR NaN
ZL NaN
1989 AB NaN
AH NaN
BA 7.000000
CB NaN
CS NaN
DM 44.349206
DO 51.025641
DS 121.896552
NL 151.278689
OL 31.947368
OT 24.663366
OX 20.000000
PC NaN
PE 21.935673
PF 7.230769
PH 30.500000
PM 20.222222
PP 17.409091
RF 13.346939
RM 10.411311
SA NaN
SC NaN
SF 54.800000
SH 76.345455
SS NaN
UP NaN
1990 AB NaN
AH NaN
BA 8.000000
DM 41.769912
DO 48.512048
DS 121.187500
NL 154.275862
OL 31.200000
OT 23.675325
PC NaN
PE 21.671233
PF 7.117647
PG NaN
PH 31.142857
PP 16.029412
RF 12.916667
RM 10.305677
SF 52.611111
SH 76.888889
SS NaN
UP NaN
1991 AB NaN
AH NaN
AS NaN
BA 9.240000
CB NaN
DM 43.148338
DO 49.695652
DS 113.000000
DX NaN
NL 148.785714
OL 28.171429
OT 24.588235
PC NaN
PE 21.435484
PF 7.827586
PP 17.904762
RM 10.498305
RO 11.000000
SA NaN
SF 93.166667
SH 63.000000
SO 53.909091
SS NaN
UR NaN
1992 AB NaN
AH NaN
BA 7.833333
CB NaN
DM 43.877966
DO 48.367816
DS 112.352941
DX NaN
NL 139.000000
OL 28.454545
OT 24.928571
PC NaN
PE 22.750000
PF 8.767857
PH 31.142857
PI 18.000000
PP 17.479675
RM 10.904762
SF 43.000000
SO 55.263158
SS NaN
UL NaN
UR NaN
1993 AB NaN
AH NaN
AS NaN
CB NaN
CU NaN
DM 43.373984
DO 48.461538
DS 105.722222
DX NaN
NL 127.391304
OL 27.545455
OT 22.363636
OX NaN
PC NaN
PE 19.230769
PF 8.238806
PP 17.954023
RM 10.747475
SF 44.000000
SO 55.600000
SS NaN
ST NaN
1994 AB NaN
AH NaN
CB NaN
DM 42.373288
DO 47.750000
DS 106.625000
DX NaN
NL 186.875000
OL 21.900000
OT 25.285714
PC NaN
PE 18.000000
PF 7.855072
PP 17.585714
PU NaN
RM 10.675000
SO 62.333333
SS NaN
SU NaN
UL NaN
UR NaN
1995 AB NaN
AH NaN
CB NaN
DM 44.806147
DO 49.592593
DX NaN
NL 155.833333
OL 27.296296
OT 24.868421
PB 34.000000
PC NaN
PE 21.230769
PF 8.780645
PH 35.333333
PI NaN
PL 22.333333
PM 27.375000
PP 16.844444
PU NaN
RM 12.653846
RX 20.000000
SS NaN
SU NaN
1996 AH NaN
DM 44.506173
DO 47.234940
DX NaN
NL 148.000000
OL 27.769231
OT 23.453704
PB 32.578947
PC NaN
PE 22.027778
PF 8.393846
PL 21.166667
PM 20.224490
PP 18.095563
PX NaN
RM 11.455556
SA NaN
SS NaN
1997 AH NaN
CB NaN
DM 44.708551
DO 48.177866
DS 111.000000
NL 150.688889
OL 33.625000
OT 24.785156
PB 31.085603
PE 20.929825
PF 8.448649
PH 22.000000
PL 18.789474
PM 21.126394
PP 18.175000
PX 19.000000
RF 20.000000
RM 11.230769
RX 11.000000
SA NaN
SH 50.857143
SO 54.666667
SS NaN
1998 AB NaN
AH NaN
CB NaN
CT NaN
DM 43.131403
DO 49.731183
DS 116.000000
DX NaN
NL 159.466667
OT 24.675676
PB 30.082237
PC NaN
PE 20.304348
PF 8.720000
PL 16.714286
PM 20.591398
PP 16.266990
PX NaN
RM 13.100000
SA NaN
SS NaN
1999 AH NaN
DM 43.945402
DO 48.012048
DS 120.714286
DX NaN
NL 182.200000
OT 25.723810
PB 31.710037
PE 25.428571
PM 22.523810
PP 16.521212
RM 10.555556
SH 54.000000
SS NaN
2000 AB NaN
AH NaN
DM 43.126638
DO 49.224719
DX NaN
NL 179.307692
OT 25.303448
PB 30.878899
PE 21.615385
PG NaN
PL 21.000000
PM 20.500000
PP 16.788618
PX NaN
RM 11.400000
SH 73.375000
SS NaN
2001 AB NaN
AH NaN
CB NaN
DM 45.442177
DO 52.233766
DX NaN
NL 167.851064
OT 23.707792
PB 32.798851
PE 20.400000
PF 8.538462
PI 18.000000
PM 26.666667
PP 17.752896
RM 11.333333
SH 79.900000
SS NaN
UR NaN
2002 AB NaN
AH NaN
DM 46.168317
DO 49.514403
DX NaN
NL 182.159091
OL 25.428571
OT 23.833333
PB 32.359447
PE 21.719298
PF 7.388889
PI 20.000000
PM 19.000000
PP 17.018617
RM 10.000000
RO 10.142857
SF 62.166667
SH 64.666667
SS NaN
UR NaN
US NaN
Name: weight, dtype: float64
How many individuals were counted for each species that was observed more than 10 times?
PYTHON
# Counts by species that appear more than 10 times
species = surveys.groupby("species_id")["record_id"].count()
species[species > 10]
OUTPUT
species_id
AB 303
AH 437
BA 46
CB 50
CM 13
CQ 16
DM 10596
DO 3027
DS 2504
DX 40
NL 1252
OL 1006
OT 2249
OX 12
PB 2891
PC 39
PE 1299
PF 1597
PH 32
PL 36
PM 899
PP 3123
RF 75
RM 2609
SA 75
SF 43
SH 147
SO 43
SS 248
Name: record_id, dtype: int64
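If we want the matching rows themselves rather than the counts, grouped objects also provide a filter() method that keeps every row belonging to a group that satisfies a condition. A sketch on a toy dataframe (with a threshold of 2 so the effect is visible on a few rows):

```python
import pandas as pd

df = pd.DataFrame({
    "species_id": ["DM", "DM", "DM", "CS"],
    "record_id": [1, 2, 3, 4],
})

# Keep all rows belonging to species observed more than 2 times
frequent = df.groupby("species_id").filter(lambda g: len(g) > 2)
print(frequent)
```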
Calculate the number of individuals observed and the average weight for each species in each year.
Note that weight contains NaNs, so counts for record_id and weight differ.
PYTHON
# Counts and mean weight by species and year
"year", "species_id"]).agg(
surveys.groupby(["record_id": "count", "weight": ["count", "mean"]}
{ )
year | species_id | record_id count | weight count | weight mean |
---|---|---|---|---|
1977 | DM | 264 | 184 | 41.141304 |
DO | 12 | 12 | 42.666667 | |
DS | 98 | 32 | 121.437500 | |
NL | 31 | 0 | NaN | |
OL | 10 | 3 | 21.666667 | |
… | … | … | … | … |
2002 | SF | 7 | 6 | 62.166667 |
SH | 11 | 9 | 64.666667 | |
SS | 9 | 0 | NaN | |
UR | 1 | 0 | NaN | |
US | 4 | 0 | NaN |
509 rows × 3 columns
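An aside: when agg() is given lists of methods, as above, the result carries two-level column labels; a common pattern is to join the levels into flat names. A sketch on a toy dataframe:

```python
import pandas as pd

df = pd.DataFrame({
    "species_id": ["DM", "DM"],
    "record_id": [1, 2],
    "weight": [40.0, 42.0],
})

out = df.groupby("species_id").agg({"record_id": "count", "weight": ["count", "mean"]})

# Join the (column, method) pairs into flat labels like "weight_mean"
out.columns = ["_".join(col) for col in out.columns]
print(out.columns.tolist())
```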
Handling missing data
As we’ve discussed, some columns in the surveys dataframe have the value NaN instead of text or numbers. NaN, short for “not a number,” is a special type of float used by pandas to represent missing data. When reading from a CSV, as we have done throughout this lesson, pandas interprets certain values as NaN (see na_values in the pd.read_csv() documentation for the default list). NaNs are excluded from groups and most aggregation calculations in pandas, including counts.
It is crucial to understand how missing data is represented in a dataset. Failing to do so may introduce errors into our analysis. The ecology dataset used in this lesson uses empty cells to represent missing data, but other disciplines have different conventions. For example, some geographic datasets use -9999 to represent null values. Failure to convert such values to NaN will result in significant errors on any calculations performed on that dataset.
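A sentinel like -9999 can be converted to NaN at read time by passing na_values to pd.read_csv(). A minimal sketch using an in-memory CSV rather than a file on disk:

```python
import io

import pandas as pd

# In-memory CSV where -9999 marks a missing weight
csv = io.StringIO("record_id,weight\n1,40\n2,-9999\n")

df = pd.read_csv(csv, na_values=[-9999])
print(df["weight"].mean())  # the sentinel is now NaN and is skipped
```

Without na_values, the -9999 would be treated as a real weight and drag the mean far below the true value.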
In some cases, it can be useful to fill in cells containing NaN with a non-null value. For example, the groupby() method excludes NaN cells. When looking at sex, the following code counts only those cells with either F (female) or M (male):
PYTHON
"sex")["record_id"].count() surveys.groupby(
OUTPUT
sex
F 15690
M 17348
Name: record_id, dtype: int64
But not all records specify a sex. To include records where sex was not specified, we can use the fillna() method on the sex column. This method replaces each NaN cell with the value passed as the first argument to the function call. To replace all NaN values in sex with “U”, use:
PYTHON
"sex"] = surveys["sex"].fillna("U") surveys[
The grouped calculation now accounts for all records in the dataframe:
PYTHON
"sex")["record_id"].count() surveys.groupby(
OUTPUT
sex
F 15690
M 17348
U 2511
Name: record_id, dtype: int64
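fillna() works for numeric columns too. One option (an editorial choice, not part of the lesson’s workflow) is to fill missing weights with the mean of the observed values, though doing so changes any statistics computed afterwards:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"weight": [40.0, np.nan, 48.0]})

# Replace each missing weight with the mean of the observed weights
mean_weight = df["weight"].mean()
df["weight"] = df["weight"].fillna(mean_weight)
print(df["weight"].tolist())
```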
In other cases, we may want to ignore rows that contain NaNs. This can be done using dropna():
PYTHON
surveys.dropna()
 | record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight |
---|---|---|---|---|---|---|---|---|---|
62 | 63 | 8 | 19 | 1977 | 3 | DM | M | 35.0 | 40.0 |
63 | 64 | 8 | 19 | 1977 | 7 | DM | M | 37.0 | 48.0 |
64 | 65 | 8 | 19 | 1977 | 4 | DM | F | 34.0 | 29.0 |
65 | 66 | 8 | 19 | 1977 | 4 | DM | F | 35.0 | 46.0 |
66 | 67 | 8 | 19 | 1977 | 7 | DM | M | 35.0 | 36.0 |
… | … | … | … | … | … | … | … | … | … |
35540 | 35541 | 12 | 31 | 2002 | 15 | PB | F | 24.0 | 31.0 |
35541 | 35542 | 12 | 31 | 2002 | 15 | PB | F | 26.0 | 29.0 |
35542 | 35543 | 12 | 31 | 2002 | 15 | PB | F | 27.0 | 34.0 |
35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 |
35547 | 35548 | 12 | 31 | 2002 | 7 | DO | M | 36.0 | 51.0 |
30738 rows × 9 columns
This method returns a copy of the dataframe containing only those rows that have valid data in every field.
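dropna() can also be limited to particular columns via its subset argument, so that only rows missing those values are dropped. A toy example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "species_id": ["DM", None, "DO"],
    "weight": [40.0, 35.0, np.nan],
})

# Drop rows only when weight is missing; a missing species_id is kept
print(df.dropna(subset=["weight"]))
```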
Key Points
- Calculate individual summary statistics using dataframe methods like mean(), max(), and min()
- Calculate multiple summary statistics at once using the dataframe methods describe() and agg()
- Group data by one or more columns using the groupby() method
- pandas uses NaN to represent missing data in a dataframe
- Failing to consider how missing data is interpreted in a dataset can introduce errors into calculations