L09: Pandas data cleaning

Preliminaries

import pandas as pd
import numpy as np
df = pd.read_excel('./rawdata.xlsx')
df
firmid date return industry ind_code
0 1 12/31/2008 0.05 Finance 1
1 2 NaN 0.45 Construction 2
2 3 12/31/2008 23.00 NaN M
3 1 12/31/2009 0.87 Finance 1
4 2 12/31/2009 NaN Construct 2
5 2 12/31/2009 0.34 Construction 2
df.dtypes
firmid        int64
date         object
return      float64
industry     object
ind_code     object
dtype: object
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   firmid    6 non-null      int64  
 1   date      5 non-null      object 
 2   return    5 non-null      float64
 3   industry  5 non-null      object 
 4   ind_code  6 non-null      object 
dtypes: float64(1), int64(1), object(3)
memory usage: 368.0+ bytes

The “object” data type (for the industry and ind_code columns) is a catch-all that Pandas uses when it cannot determine a single, more specific data type for a column (e.g. int, float, str). Columns containing strings often end up with this data type, as do columns that mix types: ind_code, for instance, mixes integers with the letter “M”.
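
As a quick illustration (a made-up series, not part of our dataset), a column that mixes types falls back to the object data type:

pd.Series([1, 'M']).dtype   # mixed int and str
dtype('O')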

df.describe()
firmid return
count 6.000000 5.000000
mean 1.833333 4.942000
std 0.752773 10.099018
min 1.000000 0.050000
25% 1.250000 0.340000
50% 2.000000 0.450000
75% 2.000000 0.870000
max 3.000000 23.000000

Missing values

Missing values appear as a special code that depends on the data type of the column in which they appear: NaN (which stands for “not a number”) for numeric data types, None or NaN for the object data type, and NaT (“not a time”) for “datetime” columns (more on this data type later).
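
Here is a minimal sketch (made-up series, separate from our dataset) of how each code shows up:

s_num = pd.Series([1.0, np.nan])                          # numeric: missing shows as NaN
s_obj = pd.Series(['a', None])                            # object: missing shows as None
s_dt  = pd.to_datetime(pd.Series(['12/31/2008', None]))   # datetime: missing shows as NaT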

To find the missing values in the data, we can use the .isnull() method (or its equivalent, .isna()):

df['return'].isnull()
0    False
1    False
2    False
3    False
4     True
5    False
Name: return, dtype: bool
df.loc[df['return'].isnull(), :]
firmid date return industry ind_code
4 2 12/31/2009 NaN Construct 2

We can drop all the rows that have any missing values using the .dropna() function:

df2 = df.dropna()
df2
firmid date return industry ind_code
0 1 12/31/2008 0.05 Finance 1
3 1 12/31/2009 0.87 Finance 1
5 2 12/31/2009 0.34 Construction 2

If instead we want to drop only the rows in which every value is missing, we have to use how = 'all' as a parameter:

df2 = df.dropna(how = 'all')
df2
firmid date return industry ind_code
0 1 12/31/2008 0.05 Finance 1
1 2 NaN 0.45 Construction 2
2 3 12/31/2008 23.00 NaN M
3 1 12/31/2009 0.87 Finance 1
4 2 12/31/2009 NaN Construct 2
5 2 12/31/2009 0.34 Construction 2

If we want to remove all rows that have a missing value in a particular column, we can use .loc[] together with the .notnull() method:

df2 = df.loc[df['return'].notnull(), :]
df2
firmid date return industry ind_code
0 1 12/31/2008 0.05 Finance 1
1 2 NaN 0.45 Construction 2
2 3 12/31/2008 23.00 NaN M
3 1 12/31/2009 0.87 Finance 1
5 2 12/31/2009 0.34 Construction 2

or we can use the subset parameter of the dropna function, which tells the function to look for missing values only in a subset of the columns:

df2 = df.dropna(subset=['return'])
df2
firmid date return industry ind_code
0 1 12/31/2008 0.05 Finance 1
1 2 NaN 0.45 Construction 2
2 3 12/31/2008 23.00 NaN M
3 1 12/31/2009 0.87 Finance 1
5 2 12/31/2009 0.34 Construction 2
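
The subset list can also contain more than one column. For instance (a sketch using the same dataframe), the following drops any row that is missing either its return or its industry value:

df2 = df.dropna(subset=['return', 'industry'])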

Changing data types

Many times, a particular column in our dataframe does not have the data type we want. Several functions allow us to convert one data type to another; below, we cover the most commonly used ones:

The “.astype()” method

Specify the new datatype that you want to convert to as an argument to .astype():

df2 = df.copy()
df2.dtypes
firmid        int64
date         object
return      float64
industry     object
ind_code     object
dtype: object
df2['firmid'] = df2['firmid'].astype('float64')
df2.dtypes
firmid      float64
date         object
return      float64
industry     object
ind_code     object
dtype: object
df2
firmid date return industry ind_code
0 1.0 12/31/2008 0.05 Finance 1
1 2.0 NaN 0.45 Construction 2
2 3.0 12/31/2008 23.00 NaN M
3 1.0 12/31/2009 0.87 Finance 1
4 2.0 12/31/2009 NaN Construct 2
5 2.0 12/31/2009 0.34 Construction 2
df2['firmid'] = df2['firmid'].astype('string')
df2.dtypes
firmid       string
date         object
return      float64
industry     object
ind_code     object
dtype: object
df2
firmid date return industry ind_code
0 1.0 12/31/2008 0.05 Finance 1
1 2.0 NaN 0.45 Construction 2
2 3.0 12/31/2008 23.00 NaN M
3 1.0 12/31/2009 0.87 Finance 1
4 2.0 12/31/2009 NaN Construct 2
5 2.0 12/31/2009 0.34 Construction 2

It may not look like firmid holds string data now, but it does. For example, the command below would not work if firmid were still numeric:

df2['newid'] = df2['firmid'] + "abc"
df2
firmid date return industry ind_code newid
0 1.0 12/31/2008 0.05 Finance 1 1.0abc
1 2.0 NaN 0.45 Construction 2 2.0abc
2 3.0 12/31/2008 23.00 NaN M 3.0abc
3 1.0 12/31/2009 0.87 Finance 1 1.0abc
4 2.0 12/31/2009 NaN Construct 2 2.0abc
5 2.0 12/31/2009 0.34 Construction 2 2.0abc
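
One caution worth flagging (a sketch on a made-up series, since none of our columns is integer-valued with gaps): .astype('int64') raises an error if the column contains missing values, because NaN cannot be stored in a regular integer column. Pandas’ nullable 'Int64' data type handles this case:

s = pd.Series([1.0, 2.0, np.nan])
# s.astype('int64')    # would raise an error: NaN cannot be converted to int64
s.astype('Int64')      # nullable integers: NaN becomes <NA>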

The “pd.to_numeric()” function

This function is generally used to convert string (or object) columns to a numeric data type. Unlike .astype(), which is called on the column we want to convert, pd.to_numeric() is a function in the Pandas package itself, so we supply the column as an argument:

df2.dtypes
firmid       string
date         object
return      float64
industry     object
ind_code     object
newid        string
dtype: object
df2['firmid'] = pd.to_numeric(df2['firmid'])
df2.dtypes
firmid      float64
date         object
return      float64
industry     object
ind_code     object
newid        string
dtype: object

In some situations, the .to_numeric() function will not be successful unless you specify the parameter errors = 'coerce', which converts any value that cannot be parsed as a number to NaN instead of raising an error. For example, the code below would not work without that parameter (which is why I always specify it):

df2['ind_code'] = pd.to_numeric(df2['ind_code'], errors = 'coerce')
df2.dtypes
firmid      float64
date         object
return      float64
industry     object
ind_code    float64
newid        string
dtype: object

Note that this converted the non-numeric values in the ind_code column to NaN:

df2
firmid date return industry ind_code newid
0 1.0 12/31/2008 0.05 Finance 1.0 1.0abc
1 2.0 NaN 0.45 Construction 2.0 2.0abc
2 3.0 12/31/2008 23.00 NaN NaN 3.0abc
3 1.0 12/31/2009 0.87 Finance 1.0 1.0abc
4 2.0 12/31/2009 NaN Construct 2.0 2.0abc
5 2.0 12/31/2009 0.34 Construction 2.0 2.0abc
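
For reference, here is a sketch of what happens without errors = 'coerce': the default is errors = 'raise', so trying to convert the original ind_code column (which still contains the letter “M”) fails with a ValueError:

try:
    pd.to_numeric(df['ind_code'])
except ValueError as e:
    print('conversion failed:', e)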

Duplicates and counts

In many situations, it is important to know whether our data contains duplicate entries (most of the time we want to eliminate those), as well as to count how many times each value appears in a particular column or set of columns. We can do these operations with the .duplicated() and .value_counts() functions:

.duplicated()

Syntax:

DataFrame.duplicated(subset=None, keep='first')

where the subset parameter allows us to specify which columns to search for duplicated rows (if unspecified, Pandas will look for instances where an entire row is duplicated). The keep parameter allows us to specify which occurrence of a duplicated row is not flagged: keep='first' (the default) flags all duplicates except the first occurrence, keep='last' flags all except the last, and keep=False flags every duplicated row:

df.duplicated()
0    False
1    False
2    False
3    False
4    False
5    False
dtype: bool
df.duplicated(subset = ['firmid','date'])
0    False
1    False
2    False
3    False
4    False
5     True
dtype: bool
df.duplicated(subset = ['firmid','date'], keep='last')
0    False
1    False
2    False
3    False
4     True
5    False
dtype: bool
df.duplicated(subset = ['firmid','date'], keep=False)
0    False
1    False
2    False
3    False
4     True
5     True
dtype: bool

To drop duplicated data, we can use the .duplicated() function inside a .loc[]:

df2 = df.loc[~df.duplicated(['firmid','date'])]
df2
firmid date return industry ind_code
0 1 12/31/2008 0.05 Finance 1
1 2 NaN 0.45 Construction 2
2 3 12/31/2008 23.00 NaN M
3 1 12/31/2009 0.87 Finance 1
4 2 12/31/2009 NaN Construct 2

or, more commonly, using the .drop_duplicates() function:

df2 = df.drop_duplicates(['firmid','date'])
df2
firmid date return industry ind_code
0 1 12/31/2008 0.05 Finance 1
1 2 NaN 0.45 Construction 2
2 3 12/31/2008 23.00 NaN M
3 1 12/31/2009 0.87 Finance 1
4 2 12/31/2009 NaN Construct 2

Note that the above still keeps the row with index 4 and drops the row with index 5 (its duplicate). This is because keep='first' is the default for the .drop_duplicates() function. To eliminate both duplicated rows, we have to set keep=False:

df2 = df.drop_duplicates(['firmid','date'], keep = False)
df2
firmid date return industry ind_code
0 1 12/31/2008 0.05 Finance 1
1 2 NaN 0.45 Construction 2
2 3 12/31/2008 23.00 NaN M
3 1 12/31/2009 0.87 Finance 1

Note also that the meaning of “first” and “last” for the keep parameter depends on how your dataframe happens to be sorted at the time you drop the duplicates.
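
A common pattern, therefore, is to sort first so that “first” and “last” have a predictable meaning. As a sketch (assuming the date column has been converted to a datetime type so that it sorts chronologically), the following keeps only each firm’s most recent observation:

df2 = df.sort_values('date').drop_duplicates('firmid', keep='last')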

.value_counts()

Syntax:

DataFrame.value_counts(subset=None, normalize=False, sort=True, ascending=False, dropna=True)
df['industry'].value_counts()
Finance         2
Construction    2
Construct       1
Name: industry, dtype: int64
df.value_counts('industry')
industry
Construction    2
Finance         2
Construct       1
dtype: int64
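
The normalize parameter shown in the syntax above returns shares of the total instead of raw counts:

df['industry'].value_counts(normalize=True)   # e.g. Finance appears in 2 of 5 non-missing rows, i.e. 0.4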

Operating on text data (strings)

Working with text data is a huge topic in data analysis. The Pandas user guide offers a detailed discussion of how the package can be used to operate on text data: https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html. For the most part, all of this is done with the .str accessor and its methods.

Here, we cover a very small subset of the functions that are commonly used for string manipulation inside a dataframe.

We’ll work on the df dataframe:

df
firmid date return industry ind_code
0 1 12/31/2008 0.05 Finance 1
1 2 NaN 0.45 Construction 2
2 3 12/31/2008 23.00 NaN M
3 1 12/31/2009 0.87 Finance 1
4 2 12/31/2009 NaN Construct 2
5 2 12/31/2009 0.34 Construction 2

It is important to convert a text column to the string data type before we manipulate it with .str methods. For example, the industry column is currently of type object, so we will convert it to string:

df['industry'] = df['industry'].astype('string')
df
firmid date return industry ind_code
0 1 12/31/2008 0.05 Finance 1
1 2 NaN 0.45 Construction 2
2 3 12/31/2008 23.00 <NA> M
3 1 12/31/2009 0.87 Finance 1
4 2 12/31/2009 NaN Construct 2
5 2 12/31/2009 0.34 Construction 2

Slicing into string data

df['industry'].str[0:3]
0     Fin
1     Con
2    <NA>
3     Fin
4     Con
5     Con
Name: industry, dtype: string

Converting to lower case or upper case

df['industry'].str.lower()
0         finance
1    construction
2            <NA>
3         finance
4       construct
5    construction
Name: industry, dtype: string
df.columns.str.upper()
Index(['FIRMID', 'DATE', 'RETURN', 'INDUSTRY', 'IND_CODE'], dtype='object')

Substrings

df.loc[df['industry'].str.contains("Cons"),:]
firmid date return industry ind_code
1 2 NaN 0.45 Construction 2
4 2 12/31/2009 NaN Construct 2
5 2 12/31/2009 0.34 Construction 2

Suppose we want to fix the truncated “Construct” entries by replacing them with “Construction”. Note, however, that .str.replace() replaces every occurrence of the pattern, so it also matches the “Construct” inside “Construction”, producing “Constructionion”:

df['industry'] = df['industry'].str.replace("Construct","Construction")
df
firmid date return industry ind_code
0 1 12/31/2008 0.05 Finance 1
1 2 NaN 0.45 Constructionion 2
2 3 12/31/2008 23.00 <NA> M
3 1 12/31/2009 0.87 Finance 1
4 2 12/31/2009 NaN Construction 2
5 2 12/31/2009 0.34 Constructionion 2

A safer approach is to use .loc[] with .str.contains() to overwrite the affected entries outright (this also repairs the botched replacement above):

df.loc[df['industry'].str.contains('Cons'), 'industry'] = 'Construction'
df
firmid date return industry ind_code
0 1 12/31/2008 0.05 Finance 1
1 2 NaN 0.45 Construction 2
2 3 12/31/2008 23.00 <NA> M
3 1 12/31/2009 0.87 Finance 1
4 2 12/31/2009 NaN Construction 2
5 2 12/31/2009 0.34 Construction 2
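
As an aside, .str.contains() treats its pattern as a regular expression by default. If that is not what you want, the regex=False and case=False parameters (both part of the .str.contains() signature) make the matching literal and case-insensitive:

df['industry'].str.contains('cons', case=False, regex=False)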

Splitting

df[['month','day','year']] = df['date'].str.split(pat = "/", expand = True)
df
firmid date return industry ind_code month day year
0 1 12/31/2008 0.05 Finance 1 12 31 2008
1 2 NaN 0.45 Construction 2 NaN NaN NaN
2 3 12/31/2008 23.00 <NA> M 12 31 2008
3 1 12/31/2009 0.87 Finance 1 12 31 2009
4 2 12/31/2009 NaN Construction 2 12 31 2009
5 2 12/31/2009 0.34 Construction 2 12 31 2009
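
If the end goal is a true date column rather than three string pieces, a sketch of an alternative (assuming the mm/dd/yyyy format shown above) is to parse the strings directly; the missing date simply becomes NaT:

df['date2'] = pd.to_datetime(df['date'], format='%m/%d/%Y')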

Stripping white spaces

newdf = pd.DataFrame(np.random.randn(3, 2), columns=[" Column A ", " Column B "])
newdf
Column A Column B
0 1.764450 0.697717
1 -0.865316 -0.428507
2 0.165881 0.089415
#newdf['Column A'] #this will not work: the actual column name is " Column A ", with spaces
newdf.columns = newdf.columns.str.strip()
newdf
Column A Column B
0 1.764450 0.697717
1 -0.865316 -0.428507
2 0.165881 0.089415
newdf['Column A'] #this will work
0    1.764450
1   -0.865316
2    0.165881
Name: Column A, dtype: float64
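
The same idea applies to the values inside a column, not just to its labels. A quick sketch on a throwaway series:

pd.Series(['  Finance  ', ' Construction']).str.strip()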

Chaining .str methods

newdf.columns = newdf.columns.str.strip().str.replace(" ","_").str.lower()
newdf
column_a column_b
0 1.764450 0.697717
1 -0.865316 -0.428507
2 0.165881 0.089415