The two best sources I've found for learning pandas are the book Python for Data Analysis and Tom Augspurger's Modern Pandas blog posts. I like to work through the Modern Pandas posts every few months to refresh my memory.

Writing clean pandas

The techniques in this section make complex data processing pipelines easier to read and understand.

Method chaining

Method chaining lets you apply a sequence of functions to your dataframe in a single expression. Say I have three functions (preprocess, aggregate, postprocess) that I want to apply to my dataframe in order. Instead of writing code like this:


results = postprocess(aggregate(preprocess(df)))
              

You start writing code like this:


results = df.preprocess().aggregate().postprocess()
            

This works if each function is a method of the df object and each one returns an adjusted df object. pandas encourages this technique: many of the methods on the pandas DataFrame object work this way, returning an adjusted copy of the original dataframe. I'll talk about using these dataframe methods in the data verbs section below.
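If your steps are plain functions rather than DataFrame methods, pandas' `DataFrame.pipe` lets you chain them anyway. Here is a minimal sketch, assuming three made-up steps and a toy dataframe (the column names `store` and `sales` and the function bodies are purely illustrative):

```python
import pandas as pd

# Hypothetical pipeline steps: each takes a DataFrame and returns a new one.
def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Drop rows with missing sales values.
    return df.dropna(subset=["sales"])

def aggregate(df: pd.DataFrame) -> pd.DataFrame:
    # Total sales per store.
    return df.groupby("store", as_index=False)["sales"].sum()

def postprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Largest totals first, with a clean index.
    return df.sort_values("sales", ascending=False).reset_index(drop=True)

# Toy data, invented for this example.
df = pd.DataFrame({
    "store": ["a", "b", "a", "b", "c"],
    "sales": [10.0, 5.0, None, 20.0, 1.0],
})

# Nested calls read inside-out...
nested = postprocess(aggregate(preprocess(df)))

# ...while the chained version reads in the order the steps run.
chained = df.pipe(preprocess).pipe(aggregate).pipe(postprocess)
```

Both versions produce the same dataframe, and the original `df` is left untouched because each step returns a new object.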

Wrap long lines

An older revision of PEP 8 gave this advice on long lines: "The preferred way of wrapping long lines is by using Python's implied line continuation inside parentheses, brackets and braces. If necessary, you can add an extra pair of parentheses around an expression, but sometimes using a backslash looks better. Make sure to indent the continued line appropriately. The preferred place to break around a binary operator is after the operator, not before it." Note that current versions of PEP 8 prefer parentheses over backslashes and suggest breaking before binary operators in new code.

Very long lines make code hard to read. Following the recommendation from PEP 8, we can wrap a long chain in parentheses and put each step on its own line.


results = (
  df
  .preprocess()
  .aggregate()
  .postprocess()
)
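Here is a runnable version of the same layout, sketched with built-in DataFrame methods on a made-up dataset (the `city` and `temp_c` columns are invented for illustration). One step per line leaves room for a comment on each step and makes it easy to comment a step out while debugging:

```python
import pandas as pd

# Toy data, invented for this example.
df = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Bergen", "Bergen"],
    "temp_c": [3.0, 5.0, 7.0, 9.0],
})

results = (
    df
    .assign(temp_f=lambda d: d["temp_c"] * 9 / 5 + 32)  # add a derived column
    .query("temp_c > 3")                                 # keep the warmer readings
    .groupby("city", as_index=False)["temp_f"].mean()    # one row per city
    .sort_values("temp_f", ascending=False)              # warmest city first
    .reset_index(drop=True)                              # tidy up the index
)
```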

Data verbs