pipe

All the other data verbs we've been using are methods of the dataframe class. That set of verbs is comprehensive and covers most use cases. The pipe operation gives us more flexibility: it lets us use any function as a data verb. It applies the function to the dataframe and returns whatever the function returns.
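As a minimal sketch of that behaviour, the snippet below uses a small made-up dataframe (`toy`, not the movies data) and a hypothetical function `mean_runtime`. It shows that `df.pipe(f)` gives the same result as `f(df)`, and that the result does not have to be a dataframe.

```python
import pandas as pd

# Hypothetical toy data, standing in for a real dataframe.
toy = pd.DataFrame({"title": ["A", "B", "C"], "runtime": [162.0, 169.0, 148.0]})

def mean_runtime(df):
    # Returns a single number rather than a dataframe.
    return df["runtime"].mean()

# Calling the function through pipe...
average_via_pipe = toy.pipe(mean_runtime)

# ...is equivalent to calling it directly on the dataframe.
average_direct = mean_runtime(toy)
```

Because pipe hands back whatever the function returns, a function that returns a number would normally be the last step in a chain.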

I'll be using the movies dataframe. Here's a sample of that data.

budget id original_title overview popularity release_date runtime status tagline title vote_average vote_count year
0 237000000 19995 Avatar In the 22nd century, a paraplegic Marine is di... 150.437577 2009-12-10 162.0 Released Enter the World of Pandora. Avatar 7.2 11800 2009.0
1 300000000 285 Pirates of the Caribbean: At World's End Captain Barbossa, long believed to be dead, ha... 139.082615 2007-05-19 169.0 Released At the end of the world, the adventure begins. Pirates of the Caribbean: At World's End 6.9 4500 2007.0
2 245000000 206647 Spectre A cryptic message from Bond’s past sends him o... 107.376788 2015-10-26 148.0 Released A Plan No One Escapes Spectre 6.3 4466 2015.0
3 250000000 49026 The Dark Knight Rises Following the death of District Attorney Harve... 112.312950 2012-07-16 165.0 Released The Legend Ends The Dark Knight Rises 7.6 9106 2012.0
4 260000000 49529 John Carter John Carter is a war-weary, former military ca... 43.926995 2012-03-07 132.0 Released Lost in our world, found in another. John Carter 6.1 2124 2012.0

Arbitrary operations, making your own verbs

You can make your own functions that work like verbs. These take a dataframe and return a copy of that dataframe with some changes applied. Here I'm making two functions: one that converts a string-formatted date to a pandas datetime, and a second that extracts the year from that date.


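import pandas as pd

def convert_date(x):
    # Add a `date` column parsed from the string column `release_date`.
    return (
        x
        .assign(date = pd.to_datetime(x.release_date))
    )

def extract_year(x):
    # Add a `yr` column holding the year of the `date` column.
    return (
        x
        .assign(yr = x.date.dt.year)
    )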
          

Instead of having to apply these functions like this:


extract_year(convert_date(movies))
        

I can use them in a chain like this:


(
    movies
    .pipe(convert_date)
    .pipe(extract_year)
)
        

This gives the same result, but the steps read top to bottom in the order they are applied, instead of inside out.

budget id original_title overview popularity release_date runtime status tagline title vote_average vote_count date yr
0 237000000 19995 Avatar In the 22nd century, a paraplegic Marine is di... 150.437577 2009-12-10 162.0 Released Enter the World of Pandora. Avatar 7.2 11800 2009-12-10 2009.0
1 300000000 285 Pirates of the Caribbean: At World's End Captain Barbossa, long believed to be dead, ha... 139.082615 2007-05-19 169.0 Released At the end of the world, the adventure begins. Pirates of the Caribbean: At World's End 6.9 4500 2007-05-19 2007.0
2 245000000 206647 Spectre A cryptic message from Bond’s past sends him o... 107.376788 2015-10-26 148.0 Released A Plan No One Escapes Spectre 6.3 4466 2015-10-26 2015.0
3 250000000 49026 The Dark Knight Rises Following the death of District Attorney Harve... 112.312950 2012-07-16 165.0 Released The Legend Ends The Dark Knight Rises 7.6 9106 2012-07-16 2012.0
4 260000000 49529 John Carter John Carter is a war-weary, former military ca... 43.926995 2012-03-07 132.0 Released Lost in our world, found in another. John Carter 6.1 2124 2012-03-07 2012.0
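Custom verbs can also take parameters, because pipe forwards any extra positional or keyword arguments to the function. The sketch below is one way that might look. The function names `add_datetime` and `add_year`, their parameters, and the `toy` dataframe are all made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the movies dataframe.
toy = pd.DataFrame({"release_date": ["2009-12-10", "2007-05-19"]})

def add_datetime(df, source, target):
    # Return a copy of df with `target` holding `source` parsed as datetimes.
    return df.assign(**{target: pd.to_datetime(df[source])})

def add_year(df, source, target="yr"):
    # Return a copy of df with `target` holding the year of the datetime column `source`.
    return df.assign(**{target: df[source].dt.year})

result = (
    toy
    # Keyword arguments after the function are passed through to it.
    .pipe(add_datetime, source="release_date", target="date")
    # Positional arguments work the same way.
    .pipe(add_year, "date")
)
```

Since both functions build their result with assign, `toy` itself is left unchanged, which matches the idea of a verb returning a modified copy.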

Working with a mutated dataframe

This is a specific case of making your own verb, but a very useful one. Suppose a chain mutates the dataframe, for example by creating a new column. If a later step refers to that new column through the original variable, pandas will say that it can't find it, because the original dataframe never received the column. Here I'm making a new date column, which I then want to take the month from.


(
    movies
    .assign(date = pd.to_datetime(movies.release_date))
    .assign(month = movies.date.dt.month)
)
# Fails with AttributeError: 'DataFrame' object has no attribute 'date'
          

If I'd been overwriting an existing column instead (assigning the new values to the release_date column), then the second line would silently work with the original column and not the mutated one. That can cause some tricky bugs.
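To make that silent failure concrete, here is a small sketch on a made-up dataframe (`toy` is an assumption, not the movies data). The first step overwrites release_date with datetimes, but the second step still reads the original strings through `toy`, so no error is raised.

```python
import pandas as pd

# Hypothetical stand-in for the movies dataframe.
toy = pd.DataFrame({"release_date": ["2009-12-10", "2007-05-19"]})

result = (
    toy
    .assign(release_date=pd.to_datetime(toy.release_date))
    # toy.release_date is still the original string column, so .str works
    # here without error even though the chained frame now holds datetimes.
    .assign(year_text=toy.release_date.str[:4])
)

# The chained frame has datetimes, the original still has strings.
chained_is_datetime = pd.api.types.is_datetime64_any_dtype(result["release_date"])
original_is_datetime = pd.api.types.is_datetime64_any_dtype(toy["release_date"])
```

Here `chained_is_datetime` is True while `original_is_datetime` is False, and `year_text` was computed from the old strings. Nothing in the output signals that the second step ignored the first.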

Instead we can use the pipe method to get access to the mutated columns in subsequent stages of the chain. Each function receives the dataframe as it exists at that point in the chain. I'm using lambda functions here because I don't intend to use them again.


(
    movies
    .pipe(lambda x: x.assign(date = pd.to_datetime(x.release_date)))
    .pipe(lambda x: x.assign(month = x.date.dt.month))
)
          

This gives the intended result.

budget id original_title overview popularity release_date runtime status tagline title vote_average vote_count date month
0 237000000 19995 Avatar In the 22nd century, a paraplegic Marine is di... 150.437577 2009-12-10 162.0 Released Enter the World of Pandora. Avatar 7.2 11800 2009-12-10 12.0
1 300000000 285 Pirates of the Caribbean: At World's End Captain Barbossa, long believed to be dead, ha... 139.082615 2007-05-19 169.0 Released At the end of the world, the adventure begins. Pirates of the Caribbean: At World's End 6.9 4500 2007-05-19 5.0
2 245000000 206647 Spectre A cryptic message from Bond’s past sends him o... 107.376788 2015-10-26 148.0 Released A Plan No One Escapes Spectre 6.3 4466 2015-10-26 10.0
3 250000000 49026 The Dark Knight Rises Following the death of District Attorney Harve... 112.312950 2012-07-16 165.0 Released The Legend Ends The Dark Knight Rises 7.6 9106 2012-07-16 7.0
4 260000000 49529 John Carter John Carter is a war-weary, former military ca... 43.926995 2012-03-07 132.0 Released Lost in our world, found in another. John Carter 6.1 2124 2012-03-07 3.0
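The same chain can be tried on a small made-up dataframe to see why it works. In this sketch (`toy` is an assumption standing in for movies), the argument `x` of each lambda is the intermediate dataframe, so the second step can see the date column, while the original variable never gains it.

```python
import pandas as pd

# Hypothetical stand-in for the movies dataframe.
toy = pd.DataFrame({"release_date": ["2009-12-10", "2007-05-19"]})

result = (
    toy
    # x is the frame as it stands at this point in the chain.
    .pipe(lambda x: x.assign(date=pd.to_datetime(x.release_date)))
    # Here x already contains the date column created in the previous step.
    .pipe(lambda x: x.assign(month=x.date.dt.month))
)

print(list(result.columns))  # ['release_date', 'date', 'month']
print(list(toy.columns))     # ['release_date']
```

The second print shows that `toy` was never changed, which is why referring to `toy.date` inside the chain would fail.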