merge

Merge is a lot like the JOIN operation in SQL. You use it to join together dataframe that contain a common key. The common key is a column in each dataframe that you can use to look up between the dataframes, some sort of identifier.

I'm going to use three dataframes for this that contain information about movies. There are dataframes named movies, cast, and crew.

budget id original_title overview popularity release_date runtime status tagline title vote_average vote_count year
0 237000000 19995 Avatar In the 22nd century, a paraplegic Marine is di... 150.437577 2009-12-10 162.0 Released Enter the World of Pandora. Avatar 7.2 11800 2009.0
1 300000000 285 Pirates of the Caribbean: At World's End Captain Barbossa, long believed to be dead, ha... 139.082615 2007-05-19 169.0 Released At the end of the world, the adventure begins. Pirates of the Caribbean: At World's End 6.9 4500 2007.0
2 245000000 206647 Spectre A cryptic message from Bond’s past sends him o... 107.376788 2015-10-26 148.0 Released A Plan No One Escapes Spectre 6.3 4466 2015.0
3 250000000 49026 The Dark Knight Rises Following the death of District Attorney Harve... 112.312950 2012-07-16 165.0 Released The Legend Ends The Dark Knight Rises 7.6 9106 2012.0
4 260000000 49529 John Carter John Carter is a war-weary, former military ca... 43.926995 2012-03-07 132.0 Released Lost in our world, found in another. John Carter 6.1 2124 2012.0
Movies dataframe
cast_id character credit_id gender id movie_id name order
0 242 Jake Sully 5602a8a7c3a3685532001c9a 2 65731 19995 Sam Worthington 0
1 3 Neytiri 52fe48009251416c750ac9cb 1 8691 19995 Zoe Saldana 1
2 25 Dr. Grace Augustine 52fe48009251416c750aca39 1 10205 19995 Sigourney Weaver 2
3 4 Col. Quaritch 52fe48009251416c750ac9cf 2 32747 19995 Stephen Lang 3
4 5 Trudy Chacon 52fe48009251416c750ac9d3 1 17647 19995 Michelle Rodriguez 4
Cast dataframe
credit_id department gender id job movie_id name
0 52fe48009251416c750aca23 Editing 0 1721 Editor 19995 Stephen E. Rivkin
1 539c47ecc3a36810e3001f87 Art 2 496 Production Design 19995 Rick Carter
2 54491c89c3a3680fb4001cf7 Sound 0 900 Sound Designer 19995 Christopher Boyes
3 54491cb70e0a267480001bd0 Sound 0 900 Supervising Sound Editor 19995 Christopher Boyes
4 539c4a4cc3a36810c9002101 Production 1 1262 Casting 19995 Mali Finn
Crew dataframe

Note that there is an id column in the movies dataframe. This matches the movie_id column in the cast and crew dataframes.

Joins in pandas


(
    # first a subquery to get all the Christopher Nolan directed films
    crew
    .query('job == "Director"')
    .query('name == "Christopher Nolan"')
    # now join with the movies and get their titles
    .merge(
        movies,
        left_on='movie_id',
        right_on='id'
    )
    [['title', 'release_date']]
)
          
title release_date
0 The Dark Knight Rises 2012-07-16
1 The Dark Knight 2008-07-16
2 Interstellar 2014-11-05
3 Inception 2010-07-14
4 Batman Begins 2005-06-10
5 Insomnia 2002-05-24
6 The Prestige 2006-10-19
7 Memento 2000-10-11

left, right, inner, outer

By default pandas does an inner join. This means that only rows in both the dataframes are returned. The how parameter lets you do different sorts of joins. This (slightly contrived) example returns any years that either Susan Sarandon or Steven Spielberg made a film. This is an outer join, because we want rows that are present in either.


# get all the susan sarandon movies
sarandon = (
    cast
    .query('name == "Susan Sarandon"')
    .merge(movies,
           left_on='movie_id',
           right_on='id')
    [['name', 'title', 'year']]
)

# get all the steven spielberg movies
spielberg = (
    crew
    .query('job == "Director"')
    .query('name == "Steven Spielberg"')
    .merge(movies,
           left_on='movie_id',
           right_on='id')
    [['name', 'title', 'year']]
)

# perform the outer join
(
    sarandon
    .merge(
        spielberg,
        how='outer',
        on='year',
    )
    [['year', 'name_x', 'name_y']]
    .drop_duplicates()
    .sort_values('year')
)
            
year name_x name_y
41 1974.0 Susan Sarandon NaN
54 1975.0 NaN Steven Spielberg
52 1977.0 NaN Steven Spielberg
49 1979.0 NaN Steven Spielberg
51 1981.0 NaN Steven Spielberg
55 1982.0 NaN Steven Spielberg
56 1983.0 NaN Steven Spielberg
50 1984.0 NaN Steven Spielberg
53 1985.0 NaN Steven Spielberg
47 1989.0 NaN Steven Spielberg
44 1991.0 NaN Steven Spielberg
40 1992.0 Susan Sarandon NaN
45 1993.0 NaN Steven Spielberg
19 1994.0 Susan Sarandon NaN
37 1995.0 Susan Sarandon NaN
42 1997.0 NaN Steven Spielberg
16 1998.0 Susan Sarandon Steven Spielberg
25 1999.0 Susan Sarandon NaN
27 2000.0 Susan Sarandon NaN
11 2001.0 Susan Sarandon Steven Spielberg
28 2002.0 Susan Sarandon Steven Spielberg
21 2004.0 Susan Sarandon Steven Spielberg
12 2005.0 Susan Sarandon Steven Spielberg
8 2007.0 Susan Sarandon NaN
0 2008.0 Susan Sarandon Steven Spielberg
5 2009.0 Susan Sarandon NaN
10 2010.0 Susan Sarandon NaN
38 2011.0 Susan Sarandon Steven Spielberg
1 2012.0 Susan Sarandon Steven Spielberg
23 2013.0 Susan Sarandon NaN
34 2014.0 Susan Sarandon NaN
48 2015.0 NaN Steven Spielberg
18 2016.0 Susan Sarandon Steven Spielberg

Checking your merges

A nice feature introduced in pandas 0.21 is the ability to validate your merges, using the validate parameter. Sometimes you know that there should be only one matching row in the dataframe you're joining. For example, I thought there would only be one director for each film. Putting in the validation told me that I was wrong.


directors = (
    crew
    .query('job == "Director"')
)

(
    movies
    .merge(
        directors,
        left_on='id',
        right_on='movie_id',
        validate='1:1'
    )
)

# This fails with a MergeError
          

That code fails. Looking into the data a bit more, I see that my assumption was incorrect. Some films have multiple directors, and Paris, je t'aime has 21! I might have to put logic further along in my analysis to catch these cases.


(
    (
        movies
        .merge(
            directors,
            left_on='id',
            right_on='movie_id',
        )
    )
    .groupby('title')
    .agg({'name': 'count'})
    .sort_values('name')
)
          
name
title
Bambi 7
Fantasia 2000 8
Fantasia 12
Movie 43 12
Paris, je t'aime 21

This is very useful if you are working with messy data, or if you want to apply your code to new data in the future. It helps you validate that your assumptions about the structure of the data are correct. It makes your code more resilient.