L08: Pandas filtering

Preliminaries

import pandas as pd
import numpy as np

df = pd.DataFrame(data=[['TSLA',1000, 0.1],['AAPL',2000, 0.05], ['MSFT', 100, 0.07]], 
                  index = ['Tesla','Apple', 'Microsoft'], 
                  columns = ['ticker','price', 'return'])
df

	ticker	price	return
Tesla	TSLA	1000	0.10
Apple	AAPL	2000	0.05
Microsoft	MSFT	100	0.07

Filtering (subsetting) pandas DataFrames

The “.loc” attribute

The most common way to access a subset of the data in a dataframe is through the “.loc” attribute. This attribute uses square brackets instead of parentheses and contains two arguments: the first one tells Python which rows you want, and the second one tells it which columns you want, which generally looks like this:

DataFrame.loc[<which_rows>, <which_columns>]

where, instead of DataFrame you wouls use the name of the full dataframe you want to subset. Pandas allows for a lot of flexibility as to what you can use instead of <which_rows> and <which_column> above. See the examples in the official documentation to get a more complete picture of what is possible with .loc[]: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html

Below, I cover the most common ways to specify which rows and columns you want:

By explicitly specifying the names (labels) of the index and/or column(s)
Using slices (ranges) on the index labels or common labels
Anything that returns a boolean sequence (True will be interpreted as “I want this row/column”)

Subsetting with explicit labels

df

	ticker	price	return
Tesla	TSLA	1000	0.10
Apple	AAPL	2000	0.05
Microsoft	MSFT	100	0.07

df.loc['Tesla', 'price']

df.loc[['Tesla', 'Microsoft'], ['ticker','return']]

	ticker	return
Tesla	TSLA	0.10
Microsoft	MSFT	0.07

df.loc[:,['ticker']] #this returns a Pandas DataFrame

	ticker
Tesla	TSLA
Apple	AAPL
Microsoft	MSFT

df.loc[:,'ticker'] #this returns a Pandas Series

Tesla        TSLA
Apple        AAPL
Microsoft    MSFT
Name: ticker, dtype: object

df.loc[['Apple'],:]

	ticker	price	return
Apple	AAPL	2000	0.05

df.loc['Apple']

ticker    AAPL
price     2000
return    0.05
Name: Apple, dtype: object

Subsetting with slices on labels

df

	ticker	price	return
Tesla	TSLA	1000	0.10
Apple	AAPL	2000	0.05
Microsoft	MSFT	100	0.07

df.loc['Apple':'Microsoft', 'price':'return'] #note that endpoints of the range are included

	price	return
Apple	2000	0.05
Microsoft	100	0.07

df.loc['Tesla':'Apple']

	ticker	price	return
Tesla	TSLA	1000	0.10
Apple	AAPL	2000	0.05

df.loc[:,'ticker':'price']

	ticker	price
Tesla	TSLA	1000
Apple	AAPL	2000
Microsoft	MSFT	100

Subsetting with boolean arrays

df

	ticker	price	return
Tesla	TSLA	1000	0.10
Apple	AAPL	2000	0.05
Microsoft	MSFT	100	0.07

df.loc[[True,True,False],
      [True, False, True]]

	ticker	return
Tesla	TSLA	0.10
Apple	AAPL	0.05

df.loc[df['return'] > 0.05, :]

	ticker	price	return
Tesla	TSLA	1000	0.10
Microsoft	MSFT	100	0.07

df.loc[:, df.columns.str.contains("tic")] #we'll talk more about ".str" next class

	ticker
Tesla	TSLA
Apple	AAPL
Microsoft	MSFT

df.loc[df['return'] > 0.07,
       df.columns.str.contains("tic")]

	ticker
Tesla	TSLA

The “.iloc” attribute

Works almost identically to .loc() with one crucial exception: .iloc() uses index/column integer positions (as opposed to labels like .loc()).

df

	ticker	price	return
Tesla	TSLA	1000	0.10
Apple	AAPL	2000	0.05
Microsoft	MSFT	100	0.07

df.iloc[[0,2], [0,2]]

	ticker	return
Tesla	TSLA	0.10
Microsoft	MSFT	0.07

df.loc[['Tesla', 'Microsoft'], ['ticker','return']]

	ticker	return
Tesla	TSLA	0.10
Microsoft	MSFT	0.07

Slicing also works, but this time we have to use index/column numbers, and the right-most end of the range is not included:

df.iloc[1:2, 0:2]

	ticker	price
Apple	AAPL	2000

We rarely use boolean arrays with .iloc() so we will not cover it here.

The “.filter()” attribute

The .filter() attribute comes in handy if we want to subset based on index or column names (labels). In particular, its like parameter allows us to specify that we want all rows/columns that contain a particular piece of text in their label.

Syntax:

DataFrame.filter(items=None, like=None, regex=None, axis=None)

For example:

df

	ticker	price	return
Tesla	TSLA	1000	0.10
Apple	AAPL	2000	0.05
Microsoft	MSFT	100	0.07

df.filter(like="esla", axis=0)

	ticker	price	return
Tesla	TSLA	1000	0.1

df.filter(like='ret', axis=1)

	return
Tesla	0.10
Apple	0.05
Microsoft	0.07

Copies vs “views”

Let’s make a copy of df that we can safely change for this section:

newdf = df
newdf

	ticker	price	return
Tesla	TSLA	1000	0.10
Apple	AAPL	2000	0.05
Microsoft	MSFT	100	0.07

Many times, we want to store a subset of a dataframe inside a new dataframe. For example:

sub = newdf.loc[:,'price']
sub

Tesla        1000
Apple        2000
Microsoft     100
Name: price, dtype: int64

Now suppose we have to make a change to the larger dataframe newdf. For example:

newdf.loc['Tesla','price'] = 0
newdf

	ticker	price	return
Tesla	TSLA	0	0.10
Apple	AAPL	2000	0.05
Microsoft	MSFT	100	0.07

This change will be passed to sub, even tough we never created this change explicitly ourselves:

sub

Tesla           0
Apple        2000
Microsoft     100
Name: price, dtype: int64

This happened because, when we created sub with the command sub = newdf.loc[:,'price'], Pyhton did not actually create an entirely new dataframe. Instead, it just returned something like an address of where in newdf the price data can be found. This is called a view of the data.

This is done to preserve memory and speed up the code, but, like we saw above, it can cause some of our dataframes change when we edit other dataframes.

To avoid this possible problem, I recommend always telling Python to create a copy of the subset of data you want, using the .copy() attribute. In our example above, sub should have been created like this:

sub = newdf.loc[:,'price'].copy()
sub

Tesla           0
Apple        2000
Microsoft     100
Name: price, dtype: int64

Now, changes to newdf, like this:

newdf.loc['Apple','price'] = 123
newdf

	ticker	price	return
Tesla	TSLA	0	0.10
Apple	AAPL	123	0.05
Microsoft	MSFT	100	0.07

Will not cause sub to change:

sub

Tesla           0
Apple        2000
Microsoft     100
Name: price, dtype: int64

Filtering on a MultiIndex

So far, the all the dataframes we’ve seen have had a one-dimensional index (a signle column). Dataframes can have a higher-dimensional index, and when they do, Pandas calls that index a MultiIndex (for a more thorough tutorial on MultiIndex, see https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html)

Let’s create an example dataframe with a MultiIndex. Below, we create the MultiIndex using the pd.MultiIndex.from_product() function, but there are several other ways of creating one (see the link above).

m = pd.DataFrame(data = np.random.rand(9,3), 
                 columns = list('ABC'),
                 index = pd.MultiIndex.from_product([['AAPL','TSLA','MSFT'], [2007,2008,2009]],
                                                    names = ['ticker','year']
                                                   )
                )
m

		A	B	C
ticker	year
AAPL	2007	0.116294	0.686215	0.323383
	2008	0.343762	0.697231	0.670671
	2009	0.419409	0.657433	0.987052
TSLA	2007	0.994865	0.463342	0.093536
	2008	0.762471	0.623193	0.221052
	2009	0.439225	0.659502	0.111453
MSFT	2007	0.006004	0.500773	0.291589
	2008	0.861580	0.300969	0.346424
	2009	0.693928	0.514345	0.973823

Now, the index has two columns instead of one:

m.index

MultiIndex([('AAPL', 2007),
            ('AAPL', 2008),
            ('AAPL', 2009),
            ('TSLA', 2007),
            ('TSLA', 2008),
            ('TSLA', 2009),
            ('MSFT', 2007),
            ('MSFT', 2008),
            ('MSFT', 2009)],
           names=['ticker', 'year'])

Each entry (row) in the index is a tuple (note the parentheses):

m.index[0]

('AAPL', 2007)

This means that if we want to use index labels with .loc to extract some subset of a dataframe, we need to use a tuple when we specify those index labels:

m.loc[('AAPL',2007), :]

A    0.116294
B    0.686215
C    0.323383
Name: (AAPL, 2007), dtype: float64

We can use slices on each dimension of the index, though you should make sure that you sorted your data by the values in the index first (using the sort_index() function:

m = m.sort_index()
m

		A	B	C
ticker	year
AAPL	2007	0.116294	0.686215	0.323383
	2008	0.343762	0.697231	0.670671
	2009	0.419409	0.657433	0.987052
MSFT	2007	0.006004	0.500773	0.291589
	2008	0.861580	0.300969	0.346424
	2009	0.693928	0.514345	0.973823
TSLA	2007	0.994865	0.463342	0.093536
	2008	0.762471	0.623193	0.221052
	2009	0.439225	0.659502	0.111453

I recommend using the slice() to slide on each dimension of the index. Below, slice(None) means no condition should be imposed on that dimension of the index:

m.loc[(slice('AAPL','MSFT'), slice(None)), :]

		A	B	C
ticker	year
AAPL	2007	0.116294	0.686215	0.323383
	2008	0.343762	0.697231	0.670671
	2009	0.419409	0.657433	0.987052
MSFT	2007	0.006004	0.500773	0.291589
	2008	0.861580	0.300969	0.346424
	2009	0.693928	0.514345	0.973823

m.loc[(slice(None), slice(2008,2010)), :]

		A	B	C
ticker	year
AAPL	2008	0.343762	0.697231	0.670671
AAPL	2009	0.419409	0.657433	0.987052
MSFT	2008	0.861580	0.300969	0.346424
MSFT	2009	0.693928	0.514345	0.973823
TSLA	2008	0.762471	0.623193	0.221052
TSLA	2009	0.439225	0.659502	0.111453

m.loc[(slice('AAPL','MSFT'), slice(2008,2010)), :]

		A	B	C
ticker	year
AAPL	2008	0.343762	0.697231	0.670671
AAPL	2009	0.419409	0.657433	0.987052
MSFT	2008	0.861580	0.300969	0.346424
MSFT	2009	0.693928	0.514345	0.973823

Though indexing on the first dimension of the index is a lot easier:

m.loc['AAPL':'MSFT']

		A	B	C
ticker	year
AAPL	2007	0.116294	0.686215	0.323383
	2008	0.343762	0.697231	0.670671
	2009	0.419409	0.657433	0.987052
MSFT	2007	0.006004	0.500773	0.291589
	2008	0.861580	0.300969	0.346424
	2009	0.693928	0.514345	0.973823

However, the convenient syntax above does not work for the other dimensions of the index (e.g. m.loc[2008:2010] will not work).