groupby

groupby lets you perform aggregations of your data. It is similar to the SQL GROUP BY operation. I'll use the iris dataset to demonstrate, here's a sample of that data.

sepal_length sepal_width petal_length petal_width species
139 6.9 3.1 5.4 2.1 virginica
137 6.4 3.1 5.5 1.8 virginica
128 6.4 2.8 5.6 2.1 virginica
19 5.1 3.8 1.5 0.3 setosa
143 6.8 3.2 5.9 2.3 virginica

Usually perform the grouping by passing in a list of columns that you want to group by. Here I'll just group by a single column, but you can easily add in multiple.


(
    df
    .groupby(['species'])
)

# returns <pandas.core.groupby.DataFrameGroupBy object at 0x1065c37f0>
        

There is a good section in the pandas documentation about the operations you can perform on a groupby object. I'm just going to extract the bits that I use a lot and have to look up.

Performing aggregations

Most frequently you want to aggregate up your group somehow. Here I'll take the mean of all the numeric columns in my groups.


(
    df
    .groupby(['species'])
    .agg(np.mean)
)
        
sepal_length sepal_width petal_length petal_width
species
setosa 5.006 3.428 1.462 0.246
versicolor 5.936 2.770 4.260 1.326
virginica 6.588 2.974 5.552 2.026

You can also specify particular aggregation functions to different columns in the group. You do this by passing in a dictionary to the agg function. Here I'll take the mean of the sepal_length, and take the max of the sepal_width.


(
    df
    .groupby(['species'])
    .agg({
        'sepal_length': np.mean,
        'sepal_width': np.max
    })
)
        
sepal_length sepal_width
species
setosa 5.006 4.4
versicolor 5.936 3.4
virginica 6.588 3.8

Applying function to each group

The transform function applies a function over the group and returns a transformed version of that group. The output has the same number of observations as go into the function. I've used it to calculate the z-score of all the observations in the data frame.


(
    df
    .groupby(['species'])
    .transform(
        # formula for the z score
        lambda x: (x - x.mean()) / x.std()
    )
)
        
sepal_length sepal_width petal_length petal_width
0 0.266674 0.189941 -0.357011 -0.436492
1 -0.300718 -1.129096 -0.357011 -0.436492
2 -0.868111 -0.601481 -0.932836 -0.436492
3 -1.151807 -0.865288 0.218813 -0.436492
4 -0.017022 0.453749 -0.357011 -0.436492

To perform more complex aggregations, you usually want to write a function that does that aggregation. It should return a dataframe, a series, or scalar. The apply function will figure out how to join them all back together into a dataframe.


def summarise(x):
    return pd.Series({
        'sepal_length_max': x.sepal_length.max(),
        'sepal_length_min': x.sepal_length.min()
    })

(
    df
    .groupby(['species'])
    .apply(summarise)
)

        
sepal_length_max sepal_length_min
species
setosa 5.8 4.3
versicolor 7.0 4.9
virginica 7.9 4.9

This is very flexible, but will be slower than the aggregate or transform functions. Note that this will apply the function twice to the first group. If you're applying a function with side effects, for example writing results out to a file, you might get unexpected results.

Grouping by without an index

When you do a group by, it creates an index on the output from the columns you've grouped by. This can be fiddly for subsequent operations that expect a plain dataframe. Either you can use the reset_index() function to remove the index, or you can pass the as_index=False argument to the groupby.


(
    df
    .groupby(['species'], as_index=False)
    .agg({
        'sepal_length': np.mean,
        'sepal_width': np.max
    })
)
        
species sepal_length sepal_width
0 setosa 5.006 4.4
1 versicolor 5.936 3.4
2 virginica 6.588 3.8

That looks pretty much the same as the grouping from above, but note that the grouping column (species) doesn't appear in the index column any more.