Mihai Ion
  • Home
  • Research
  • Teaching
  • CV
  1. FIN 525
  2. Lectures
  3. L08: Pandas filtering
  • Teaching
  • FIN 525
    • Lectures
      • L00: Jupyter basics
      • L01: Introduction
      • L02: Variables, types, operators
      • L03: Data structures
      • L04: Conditionals, loops
      • L05: Functions, packages
      • L06: Pandas intro
      • L07: Pandas I/O
      • L08: Pandas filtering
      • L09: Pandas data cleaning
      • L10: Merging, reshaping datasets
      • L11: Dates, lags, sorting
      • L12: Descriptive stats
      • L13: Conditional stats, outliers
      • L14: Conditional stats applied
      • L15: Linear regression intro
      • L16: Linear regression applications
      • L17: Panel regression intro
      • L18: Robust panel regression
      • L19: Robust timeseries regression
      • L20: Backtesting - data prep
      • L21: Backtesting - sumstats
      • L22: Backtesting -returns
      • L23: Backtesting - risk adjustment
  • FIN 421
    • Lectures
      • L01: Introduction
      • L02: Analyzing past returns
      • L03: Modeling future returns
      • L04: Portfolio theory intro
      • L05: Optimal capital allocation
      • L06: Tangency portfolios
      • L07_08: Optimal asset allocation
      • L09: Review
      • L10_11: Statistical models of returns
      • L12: CAPM
      • L13: Cost of equity
      • L14: Bond pricing
      • L15: Bond yields
      • L16: Bond risk
      • L17: Valuation data processing
      • L18_19: Multiples valuation
      • L20_21: Dividend discount models
      • L22_23: Discounted cash flow analysis
      • L24: Valuation sensitivity analysis
      • L25: Options intro
      • L26: Risk management with options

On this page

  • Preliminaries
  • Filtering (subsetting) pandas DataFrames
    • The “.loc” attribute
      • Subsetting with explicit labels
      • Subsetting with slices on labels
      • Subsetting with boolean arrays
    • The “.iloc” attribute
    • The “.filter()” attribute
    • Copies vs “views”
  • Filtering on a MultiIndex

L08: Pandas filtering

Preliminaries

import pandas as pd
import numpy as np
df = pd.DataFrame(data=[['TSLA',1000, 0.1],['AAPL',2000, 0.05], ['MSFT', 100, 0.07]], 
                  index = ['Tesla','Apple', 'Microsoft'], 
                  columns = ['ticker','price', 'return'])
df
ticker price return
Tesla TSLA 1000 0.10
Apple AAPL 2000 0.05
Microsoft MSFT 100 0.07

Filtering (subsetting) pandas DataFrames

The “.loc” attribute

The most common way to access a subset of the data in a dataframe is through the “.loc” attribute. This attribute uses square brackets instead of parentheses and contains two arguments: the first one tells Python which rows you want, and the second one tells it which columns you want, which generally looks like this:

DataFrame.loc[<which_rows>, <which_columns>]

where, instead of DataFrame you wouls use the name of the full dataframe you want to subset. Pandas allows for a lot of flexibility as to what you can use instead of <which_rows> and <which_column> above. See the examples in the official documentation to get a more complete picture of what is possible with .loc[]: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html

Below, I cover the most common ways to specify which rows and columns you want:

  1. By explicitly specifying the names (labels) of the index and/or column(s)
  2. Using slices (ranges) on the index labels or common labels
  3. Anything that returns a boolean sequence (True will be interpreted as “I want this row/column”)

Subsetting with explicit labels

df
ticker price return
Tesla TSLA 1000 0.10
Apple AAPL 2000 0.05
Microsoft MSFT 100 0.07
df.loc['Tesla', 'price']
1000
df.loc[['Tesla', 'Microsoft'], ['ticker','return']]
ticker return
Tesla TSLA 0.10
Microsoft MSFT 0.07
df.loc[:,['ticker']] #this returns a Pandas DataFrame
ticker
Tesla TSLA
Apple AAPL
Microsoft MSFT
df.loc[:,'ticker'] #this returns a Pandas Series
Tesla        TSLA
Apple        AAPL
Microsoft    MSFT
Name: ticker, dtype: object
df.loc[['Apple'],:]
ticker price return
Apple AAPL 2000 0.05
df.loc['Apple']
ticker    AAPL
price     2000
return    0.05
Name: Apple, dtype: object

Subsetting with slices on labels

df
ticker price return
Tesla TSLA 1000 0.10
Apple AAPL 2000 0.05
Microsoft MSFT 100 0.07
df.loc['Apple':'Microsoft', 'price':'return'] #note that endpoints of the range are included
price return
Apple 2000 0.05
Microsoft 100 0.07
df.loc['Tesla':'Apple']
ticker price return
Tesla TSLA 1000 0.10
Apple AAPL 2000 0.05
df.loc[:,'ticker':'price']
ticker price
Tesla TSLA 1000
Apple AAPL 2000
Microsoft MSFT 100

Subsetting with boolean arrays

df
ticker price return
Tesla TSLA 1000 0.10
Apple AAPL 2000 0.05
Microsoft MSFT 100 0.07
df.loc[[True,True,False],
      [True, False, True]]
ticker return
Tesla TSLA 0.10
Apple AAPL 0.05
df.loc[df['return'] > 0.05, :]
ticker price return
Tesla TSLA 1000 0.10
Microsoft MSFT 100 0.07
df.loc[:, df.columns.str.contains("tic")] #we'll talk more about ".str" next class
ticker
Tesla TSLA
Apple AAPL
Microsoft MSFT
df.loc[df['return'] > 0.07,
       df.columns.str.contains("tic")]
ticker
Tesla TSLA

The “.iloc” attribute

Works almost identically to .loc() with one crucial exception: .iloc() uses index/column integer positions (as opposed to labels like .loc()).

df
ticker price return
Tesla TSLA 1000 0.10
Apple AAPL 2000 0.05
Microsoft MSFT 100 0.07
df.iloc[[0,2], [0,2]]
ticker return
Tesla TSLA 0.10
Microsoft MSFT 0.07
df.loc[['Tesla', 'Microsoft'], ['ticker','return']]
ticker return
Tesla TSLA 0.10
Microsoft MSFT 0.07

Slicing also works, but this time we have to use index/column numbers, and the right-most end of the range is not included:

df.iloc[1:2, 0:2]
ticker price
Apple AAPL 2000

We rarely use boolean arrays with .iloc() so we will not cover it here.

The “.filter()” attribute

The .filter() attribute comes in handy if we want to subset based on index or column names (labels). In particular, its like parameter allows us to specify that we want all rows/columns that contain a particular piece of text in their label.

Syntax:

DataFrame.filter(items=None, like=None, regex=None, axis=None)

For example:

df
ticker price return
Tesla TSLA 1000 0.10
Apple AAPL 2000 0.05
Microsoft MSFT 100 0.07
df.filter(like="esla", axis=0)
ticker price return
Tesla TSLA 1000 0.1
df.filter(like='ret', axis=1)
return
Tesla 0.10
Apple 0.05
Microsoft 0.07

Copies vs “views”

Let’s make a copy of df that we can safely change for this section:

newdf = df
newdf
ticker price return
Tesla TSLA 1000 0.10
Apple AAPL 2000 0.05
Microsoft MSFT 100 0.07

Many times, we want to store a subset of a dataframe inside a new dataframe. For example:

sub = newdf.loc[:,'price']
sub
Tesla        1000
Apple        2000
Microsoft     100
Name: price, dtype: int64

Now suppose we have to make a change to the larger dataframe newdf. For example:

newdf.loc['Tesla','price'] = 0
newdf
ticker price return
Tesla TSLA 0 0.10
Apple AAPL 2000 0.05
Microsoft MSFT 100 0.07

This change will be passed to sub, even tough we never created this change explicitly ourselves:

sub
Tesla           0
Apple        2000
Microsoft     100
Name: price, dtype: int64

This happened because, when we created sub with the command sub = newdf.loc[:,'price'], Pyhton did not actually create an entirely new dataframe. Instead, it just returned something like an address of where in newdf the price data can be found. This is called a view of the data.

This is done to preserve memory and speed up the code, but, like we saw above, it can cause some of our dataframes change when we edit other dataframes.

To avoid this possible problem, I recommend always telling Python to create a copy of the subset of data you want, using the .copy() attribute. In our example above, sub should have been created like this:

sub = newdf.loc[:,'price'].copy()
sub
Tesla           0
Apple        2000
Microsoft     100
Name: price, dtype: int64

Now, changes to newdf, like this:

newdf.loc['Apple','price'] = 123
newdf
ticker price return
Tesla TSLA 0 0.10
Apple AAPL 123 0.05
Microsoft MSFT 100 0.07

Will not cause sub to change:

sub
Tesla           0
Apple        2000
Microsoft     100
Name: price, dtype: int64

Filtering on a MultiIndex

So far, the all the dataframes we’ve seen have had a one-dimensional index (a signle column). Dataframes can have a higher-dimensional index, and when they do, Pandas calls that index a MultiIndex (for a more thorough tutorial on MultiIndex, see https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html)

Let’s create an example dataframe with a MultiIndex. Below, we create the MultiIndex using the pd.MultiIndex.from_product() function, but there are several other ways of creating one (see the link above).

m = pd.DataFrame(data = np.random.rand(9,3), 
                 columns = list('ABC'),
                 index = pd.MultiIndex.from_product([['AAPL','TSLA','MSFT'], [2007,2008,2009]],
                                                    names = ['ticker','year']
                                                   )
                )
m
A B C
ticker year
AAPL 2007 0.116294 0.686215 0.323383
2008 0.343762 0.697231 0.670671
2009 0.419409 0.657433 0.987052
TSLA 2007 0.994865 0.463342 0.093536
2008 0.762471 0.623193 0.221052
2009 0.439225 0.659502 0.111453
MSFT 2007 0.006004 0.500773 0.291589
2008 0.861580 0.300969 0.346424
2009 0.693928 0.514345 0.973823

Now, the index has two columns instead of one:

m.index
MultiIndex([('AAPL', 2007),
            ('AAPL', 2008),
            ('AAPL', 2009),
            ('TSLA', 2007),
            ('TSLA', 2008),
            ('TSLA', 2009),
            ('MSFT', 2007),
            ('MSFT', 2008),
            ('MSFT', 2009)],
           names=['ticker', 'year'])

Each entry (row) in the index is a tuple (note the parentheses):

m.index[0]
('AAPL', 2007)

This means that if we want to use index labels with .loc to extract some subset of a dataframe, we need to use a tuple when we specify those index labels:

m.loc[('AAPL',2007), :]
A    0.116294
B    0.686215
C    0.323383
Name: (AAPL, 2007), dtype: float64

We can use slices on each dimension of the index, though you should make sure that you sorted your data by the values in the index first (using the sort_index() function:

m = m.sort_index()
m
A B C
ticker year
AAPL 2007 0.116294 0.686215 0.323383
2008 0.343762 0.697231 0.670671
2009 0.419409 0.657433 0.987052
MSFT 2007 0.006004 0.500773 0.291589
2008 0.861580 0.300969 0.346424
2009 0.693928 0.514345 0.973823
TSLA 2007 0.994865 0.463342 0.093536
2008 0.762471 0.623193 0.221052
2009 0.439225 0.659502 0.111453

I recommend using the slice() to slide on each dimension of the index. Below, slice(None) means no condition should be imposed on that dimension of the index:

m.loc[(slice('AAPL','MSFT'), slice(None)), :]
A B C
ticker year
AAPL 2007 0.116294 0.686215 0.323383
2008 0.343762 0.697231 0.670671
2009 0.419409 0.657433 0.987052
MSFT 2007 0.006004 0.500773 0.291589
2008 0.861580 0.300969 0.346424
2009 0.693928 0.514345 0.973823
m.loc[(slice(None), slice(2008,2010)), :]
A B C
ticker year
AAPL 2008 0.343762 0.697231 0.670671
2009 0.419409 0.657433 0.987052
MSFT 2008 0.861580 0.300969 0.346424
2009 0.693928 0.514345 0.973823
TSLA 2008 0.762471 0.623193 0.221052
2009 0.439225 0.659502 0.111453
m.loc[(slice('AAPL','MSFT'), slice(2008,2010)), :]
A B C
ticker year
AAPL 2008 0.343762 0.697231 0.670671
2009 0.419409 0.657433 0.987052
MSFT 2008 0.861580 0.300969 0.346424
2009 0.693928 0.514345 0.973823

Though indexing on the first dimension of the index is a lot easier:

m.loc['AAPL':'MSFT']
A B C
ticker year
AAPL 2007 0.116294 0.686215 0.323383
2008 0.343762 0.697231 0.670671
2009 0.419409 0.657433 0.987052
MSFT 2007 0.006004 0.500773 0.291589
2008 0.861580 0.300969 0.346424
2009 0.693928 0.514345 0.973823

However, the convenient syntax above does not work for the other dimensions of the index (e.g. m.loc[2008:2010] will not work).