L09: Pandas data cleaning

Preliminaries

import pandas as pd
import numpy as np

df = pd.read_excel('./rawdata.xlsx')
df
| | firmid | date | return | industry | ind_code |
|---|---|---|---|---|---|
| 0 | 1 | 12/31/2008 | 0.05 | Finance | 1 |
| 1 | 2 | NaN | 0.45 | Construction | 2 |
| 2 | 3 | 12/31/2008 | 23.00 | NaN | M |
| 3 | 1 | 12/31/2009 | 0.87 | Finance | 1 |
| 4 | 2 | 12/31/2009 | NaN | Construct | 2 |
| 5 | 2 | 12/31/2009 | 0.34 | Construction | 2 |
df.dtypes
firmid int64
date object
return float64
industry object
ind_code object
dtype: object
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 firmid 6 non-null int64
1 date 5 non-null object
2 return 5 non-null float64
3 industry 5 non-null object
4 ind_code 6 non-null object
dtypes: float64(1), int64(1), object(3)
memory usage: 368.0+ bytes
The “object” data type (for the industry and ind_code columns) is a catch-all that Pandas uses when it cannot determine a more specific data type for the column (e.g. int, float, str, etc.). Columns containing strings often end up with this data type.
df.describe()
| | firmid | return |
|---|---|---|
| count | 6.000000 | 5.000000 |
| mean | 1.833333 | 4.942000 |
| std | 0.752773 | 10.099018 |
| min | 1.000000 | 0.050000 |
| 25% | 1.250000 | 0.340000 |
| 50% | 2.000000 | 0.450000 |
| 75% | 2.000000 | 0.870000 |
| max | 3.000000 | 23.000000 |
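Note that .describe() summarizes only the numeric columns by default. If you also want summary information for the non-numeric columns (count, number of unique values, most frequent value), you can pass the include parameter. A quick sketch (output omitted):

df.describe(include='all')   # adds unique / top / freq statistics for the object columns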
Missing values
Missing values appear as a special code depending on the data type of the column in which they appear: NaN (which stands for “not a number”) for numeric data types, None or NaN for the object data type, and NaT for “datetime” columns (more on this data type later).
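As a quick illustration of these codes (a minimal sketch on made-up data, separate from our df):

pd.Series([1.0, None])                            # missing value shows as NaN (float64 column)
pd.Series(['a', None])                            # missing value stays None (object column)
pd.to_datetime(pd.Series(['12/31/2008', None]))   # missing value shows as NaT (datetime column)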
To find the missing values in the data, we can use the .isnull() function (or its equivalent, .isna()):
df['return'].isnull()
0 False
1 False
2 False
3 False
4 True
5 False
Name: return, dtype: bool
df.loc[df['return'].isnull(), :]
| | firmid | date | return | industry | ind_code |
|---|---|---|---|---|---|
| 4 | 2 | 12/31/2009 | NaN | Construct | 2 |
We can drop all the rows that have any missing values using the .dropna() function:
df2 = df.dropna()
df2
| | firmid | date | return | industry | ind_code |
|---|---|---|---|---|---|
| 0 | 1 | 12/31/2008 | 0.05 | Finance | 1 |
| 3 | 1 | 12/31/2009 | 0.87 | Finance | 1 |
| 5 | 2 | 12/31/2009 | 0.34 | Construction | 2 |
If we want to drop only the rows in which all values are missing, we have to use how = 'all' as a parameter:
df2 = df.dropna(how = 'all')
df2
| | firmid | date | return | industry | ind_code |
|---|---|---|---|---|---|
| 0 | 1 | 12/31/2008 | 0.05 | Finance | 1 |
| 1 | 2 | NaN | 0.45 | Construction | 2 |
| 2 | 3 | 12/31/2008 | 23.00 | NaN | M |
| 3 | 1 | 12/31/2009 | 0.87 | Finance | 1 |
| 4 | 2 | 12/31/2009 | NaN | Construct | 2 |
| 5 | 2 | 12/31/2009 | 0.34 | Construction | 2 |
If we want to remove all rows that contain missing values in a given column, we can use .loc[] together with the .notnull() function:
df2 = df.loc[df['return'].notnull(), :]
df2
| | firmid | date | return | industry | ind_code |
|---|---|---|---|---|---|
| 0 | 1 | 12/31/2008 | 0.05 | Finance | 1 |
| 1 | 2 | NaN | 0.45 | Construction | 2 |
| 2 | 3 | 12/31/2008 | 23.00 | NaN | M |
| 3 | 1 | 12/31/2009 | 0.87 | Finance | 1 |
| 5 | 2 | 12/31/2009 | 0.34 | Construction | 2 |
or we can use the subset parameter of the dropna() function, which tells the function to look for missing values only in a subset of the columns:
df2 = df.dropna(subset=['return'])
df2
| | firmid | date | return | industry | ind_code |
|---|---|---|---|---|---|
| 0 | 1 | 12/31/2008 | 0.05 | Finance | 1 |
| 1 | 2 | NaN | 0.45 | Construction | 2 |
| 2 | 3 | 12/31/2008 | 23.00 | NaN | M |
| 3 | 1 | 12/31/2009 | 0.87 | Finance | 1 |
| 5 | 2 | 12/31/2009 | 0.34 | Construction | 2 |
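Before dropping anything, it is often useful to know how many values are missing in each column. Since .isnull() returns booleans, and True sums as 1, a small sketch:

df.isnull().sum()   # number of missing values in each column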
Changing data types
Many times, a particular column in our dataframe does not have the data type we want. Several functions allow us to convert one data type to another. Below, we cover the most commonly used ones:
The .astype() method
Specify the new data type that you want to convert to as an argument to .astype():
df2 = df.copy()
df2.dtypes
firmid int64
date object
return float64
industry object
ind_code object
dtype: object
df2['firmid'] = df2['firmid'].astype('float64')
df2.dtypes
firmid float64
date object
return float64
industry object
ind_code object
dtype: object
df2
| | firmid | date | return | industry | ind_code |
|---|---|---|---|---|---|
| 0 | 1.0 | 12/31/2008 | 0.05 | Finance | 1 |
| 1 | 2.0 | NaN | 0.45 | Construction | 2 |
| 2 | 3.0 | 12/31/2008 | 23.00 | NaN | M |
| 3 | 1.0 | 12/31/2009 | 0.87 | Finance | 1 |
| 4 | 2.0 | 12/31/2009 | NaN | Construct | 2 |
| 5 | 2.0 | 12/31/2009 | 0.34 | Construction | 2 |
df2['firmid'] = df2['firmid'].astype('string')
df2.dtypes
firmid string
date object
return float64
industry object
ind_code object
dtype: object
df2
| | firmid | date | return | industry | ind_code |
|---|---|---|---|---|---|
| 0 | 1.0 | 12/31/2008 | 0.05 | Finance | 1 |
| 1 | 2.0 | NaN | 0.45 | Construction | 2 |
| 2 | 3.0 | 12/31/2008 | 23.00 | NaN | M |
| 3 | 1.0 | 12/31/2009 | 0.87 | Finance | 1 |
| 4 | 2.0 | 12/31/2009 | NaN | Construct | 2 |
| 5 | 2.0 | 12/31/2009 | 0.34 | Construction | 2 |
It may not look like firmid is a string data type now, but it is. For example, the command below would not work if firmid were still numeric:
df2['newid'] = df2['firmid'] + "abc"
df2
| | firmid | date | return | industry | ind_code | newid |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 12/31/2008 | 0.05 | Finance | 1 | 1.0abc |
| 1 | 2.0 | NaN | 0.45 | Construction | 2 | 2.0abc |
| 2 | 3.0 | 12/31/2008 | 23.00 | NaN | M | 3.0abc |
| 3 | 1.0 | 12/31/2009 | 0.87 | Finance | 1 | 1.0abc |
| 4 | 2.0 | 12/31/2009 | NaN | Construct | 2 | 2.0abc |
| 5 | 2.0 | 12/31/2009 | 0.34 | Construction | 2 | 2.0abc |
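One caveat with .astype(): a column that contains missing values cannot be converted to the regular 'int64' type, because NaN has no integer representation. Pandas' nullable 'Int64' type (with a capital I) does accept missing values. A small sketch on made-up data:

pd.Series([1.0, None]).astype('Int64')    # works: values become 1 and <NA>
# pd.Series([1.0, None]).astype('int64')  # would raise an error because of the missing value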
The pd.to_numeric() function
This is generally used to convert string (or object) data types to a numeric data type. Unlike .astype(), which is called as a method on the column we want to convert, pd.to_numeric() takes that column as an argument:
df2.dtypes
firmid string
date object
return float64
industry object
ind_code object
newid string
dtype: object
df2['firmid'] = pd.to_numeric(df2['firmid'])
df2.dtypes
firmid float64
date object
return float64
industry object
ind_code object
newid string
dtype: object
In some situations, the pd.to_numeric() function will not succeed unless you specify the parameter errors = 'coerce'. For example, the code below would not work without that parameter (which is why I always specify it):
df2['ind_code'] = pd.to_numeric(df2['ind_code'], errors = 'coerce')
df2.dtypes
firmid float64
date object
return float64
industry object
ind_code float64
newid string
dtype: object
Note that this converted the non-numeric values in the ind_code column to NaN:
df2
| | firmid | date | return | industry | ind_code | newid |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 12/31/2008 | 0.05 | Finance | 1.0 | 1.0abc |
| 1 | 2.0 | NaN | 0.45 | Construction | 2.0 | 2.0abc |
| 2 | 3.0 | 12/31/2008 | 23.00 | NaN | NaN | 3.0abc |
| 3 | 1.0 | 12/31/2009 | 0.87 | Finance | 1.0 | 1.0abc |
| 4 | 2.0 | 12/31/2009 | NaN | Construct | 2.0 | 2.0abc |
| 5 | 2.0 | 12/31/2009 | 0.34 | Construction | 2.0 | 2.0abc |
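To see why errors = 'coerce' was needed above, here is a small sketch on made-up data (wrapping the call in try/except so the error message is printed rather than raised):

try:
    pd.to_numeric(pd.Series(['1', '2', 'M']))
except ValueError as e:
    print(e)   # 'M' cannot be parsed as a number, so pd.to_numeric raises a ValueError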
Duplicates and counts
In many situations, it is important to know whether our data contains any duplicate entries (most of the time we want to eliminate those), as well as to explicitly count repeated entries in any particular column or set of columns in our data. We can do these operations with the .duplicated() and .value_counts() functions:
.duplicated()

Syntax:

DataFrame.duplicated(subset=None, keep='first')
where the subset parameter allows us to specify where in the dataset (which columns) we are looking for duplicated rows (if unspecified, Pandas looks for instances where an entire row is duplicated). The keep parameter allows us to specify which occurrences get flagged: with keep='first' (the default), every duplicate except the first occurrence is marked True; with keep='last', every duplicate except the last occurrence is marked; and with keep=False, all duplicates are marked. The examples below illustrate each option:
df.duplicated()
0 False
1 False
2 False
3 False
4 False
5 False
dtype: bool
df.duplicated(subset = ['firmid','date'])
0 False
1 False
2 False
3 False
4 False
5 True
dtype: bool
df.duplicated(subset = ['firmid','date'], keep='last')
0 False
1 False
2 False
3 False
4 True
5 False
dtype: bool
df.duplicated(subset = ['firmid','date'], keep=False)
0 False
1 False
2 False
3 False
4 True
5 True
dtype: bool
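Since .duplicated() returns a boolean Series, summing it gives a quick count of the flagged rows. A small sketch:

df.duplicated(['firmid','date']).sum()   # number of rows flagged as duplicates (1 here, with the default keep='first')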
To drop duplicated data, we can use the .duplicated() function inside a .loc[]:
df2 = df.loc[~df.duplicated(['firmid','date'])]
df2
| | firmid | date | return | industry | ind_code |
|---|---|---|---|---|---|
| 0 | 1 | 12/31/2008 | 0.05 | Finance | 1 |
| 1 | 2 | NaN | 0.45 | Construction | 2 |
| 2 | 3 | 12/31/2008 | 23.00 | NaN | M |
| 3 | 1 | 12/31/2009 | 0.87 | Finance | 1 |
| 4 | 2 | 12/31/2009 | NaN | Construct | 2 |
or, more commonly, using the .drop_duplicates() function:
df2 = df.drop_duplicates(['firmid','date'])
df2
| | firmid | date | return | industry | ind_code |
|---|---|---|---|---|---|
| 0 | 1 | 12/31/2008 | 0.05 | Finance | 1 |
| 1 | 2 | NaN | 0.45 | Construction | 2 |
| 2 | 3 | 12/31/2008 | 23.00 | NaN | M |
| 3 | 1 | 12/31/2009 | 0.87 | Finance | 1 |
| 4 | 2 | 12/31/2009 | NaN | Construct | 2 |
Note that the above still keeps the row with index 4 and drops the row with index 5 (its duplicate). This is because keep='first' is the default for the .drop_duplicates() function as well. To eliminate both duplicated rows, we would have to set keep=False:
df2 = df.drop_duplicates(['firmid','date'], keep = False)
df2
| | firmid | date | return | industry | ind_code |
|---|---|---|---|---|---|
| 0 | 1 | 12/31/2008 | 0.05 | Finance | 1 |
| 1 | 2 | NaN | 0.45 | Construction | 2 |
| 2 | 3 | 12/31/2008 | 23.00 | NaN | M |
| 3 | 1 | 12/31/2009 | 0.87 | Finance | 1 |
Note also that the meaning of “first” and “last” for the keep parameter depends on how your dataframe happens to be sorted at the time you drop the duplicates.
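A common pattern, then, is to sort explicitly before dropping duplicates, so that “first” has a deterministic meaning. For example, a sketch that would keep the highest return within each firmid/date pair:

df.sort_values('return', ascending=False).drop_duplicates(['firmid','date'])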
.value_counts()
Syntax:
DataFrame.value_counts(subset=None, normalize=False, sort=True, ascending=False, dropna=True)
df['industry'].value_counts()
Finance 2
Construction 2
Construct 1
Name: industry, dtype: int64
df.value_counts('industry')
industry
Construction 2
Finance 2
Construct 1
dtype: int64
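The normalize parameter from the syntax above converts the counts into proportions. A quick sketch (output omitted):

df['industry'].value_counts(normalize=True)   # fractions of non-missing entries, summing to 1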
Operating on text data (strings)
Working with text data is a huge topic in data analysis. The Pandas user guide offers a detailed discussion of the ways the Pandas package can be used to operate on text data: https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#. For the most part, all of this is done with the .str accessor and its methods.
Here, we cover a very small subset of the functions that are commonly used for string manipulation inside a dataframe.
We’ll work on the df dataframe:
df
| | firmid | date | return | industry | ind_code |
|---|---|---|---|---|---|
| 0 | 1 | 12/31/2008 | 0.05 | Finance | 1 |
| 1 | 2 | NaN | 0.45 | Construction | 2 |
| 2 | 3 | 12/31/2008 | 23.00 | NaN | M |
| 3 | 1 | 12/31/2009 | 0.87 | Finance | 1 |
| 4 | 2 | 12/31/2009 | NaN | Construct | 2 |
| 5 | 2 | 12/31/2009 | 0.34 | Construction | 2 |
It is important to convert a text column to the string data type before we manipulate it with .str functions. For example, the industry column is currently of type object, so we will convert it to string:
df['industry'] = df['industry'].astype('string')
df
| | firmid | date | return | industry | ind_code |
|---|---|---|---|---|---|
| 0 | 1 | 12/31/2008 | 0.05 | Finance | 1 |
| 1 | 2 | NaN | 0.45 | Construction | 2 |
| 2 | 3 | 12/31/2008 | 23.00 | <NA> | M |
| 3 | 1 | 12/31/2009 | 0.87 | Finance | 1 |
| 4 | 2 | 12/31/2009 | NaN | Construct | 2 |
| 5 | 2 | 12/31/2009 | 0.34 | Construction | 2 |
Slicing into string data
df['industry'].str[0:3]
0 Fin
1 Con
2 <NA>
3 Fin
4 Con
5 Con
Name: industry, dtype: string
Converting to lower case or upper case
df['industry'].str.lower()
0 finance
1 construction
2 <NA>
3 finance
4 construct
5 construction
Name: industry, dtype: string
df.columns.str.upper()
Index(['FIRMID', 'DATE', 'RETURN', 'INDUSTRY', 'IND_CODE'], dtype='object')
Substrings
df.loc[df['industry'].str.contains("Cons"), :]
| | firmid | date | return | industry | ind_code |
|---|---|---|---|---|---|
| 1 | 2 | NaN | 0.45 | Construction | 2 |
| 4 | 2 | 12/31/2009 | NaN | Construct | 2 |
| 5 | 2 | 12/31/2009 | 0.34 | Construction | 2 |
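One caution with .str.contains(): if the column contains missing values, the resulting boolean mask contains missing values too, which can make filtering fail depending on the column's data type and your Pandas version. The na parameter lets us explicitly treat missing values as non-matches. A defensive sketch:

df.loc[df['industry'].str.contains("Cons", na=False), :]   # <NA> rows are treated as False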
df['industry'] = df['industry'].str.replace("Construct","Construction")
df
| | firmid | date | return | industry | ind_code |
|---|---|---|---|---|---|
| 0 | 1 | 12/31/2008 | 0.05 | Finance | 1 |
| 1 | 2 | NaN | 0.45 | Constructionion | 2 |
| 2 | 3 | 12/31/2008 | 23.00 | <NA> | M |
| 3 | 1 | 12/31/2009 | 0.87 | Finance | 1 |
| 4 | 2 | 12/31/2009 | NaN | Construction | 2 |
| 5 | 2 | 12/31/2009 | 0.34 | Constructionion | 2 |
Notice that .str.replace() also matched the "Construct" substring inside "Construction", producing "Constructionion". One way to fix all of these values at once is to overwrite the matching rows with .loc[]:

df.loc[df['industry'].str.contains('Cons'), 'industry'] = 'Construction'
df
| | firmid | date | return | industry | ind_code |
|---|---|---|---|---|---|
| 0 | 1 | 12/31/2008 | 0.05 | Finance | 1 |
| 1 | 2 | NaN | 0.45 | Construction | 2 |
| 2 | 3 | 12/31/2008 | 23.00 | <NA> | M |
| 3 | 1 | 12/31/2009 | 0.87 | Finance | 1 |
| 4 | 2 | 12/31/2009 | NaN | Construction | 2 |
| 5 | 2 | 12/31/2009 | 0.34 | Construction | 2 |
Splitting
df[['month','day','year']] = df['date'].str.split(pat = "/", expand = True)
df
| | firmid | date | return | industry | ind_code | month | day | year |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 12/31/2008 | 0.05 | Finance | 1 | 12 | 31 | 2008 |
| 1 | 2 | NaN | 0.45 | Construction | 2 | NaN | NaN | NaN |
| 2 | 3 | 12/31/2008 | 23.00 | <NA> | M | 12 | 31 | 2008 |
| 3 | 1 | 12/31/2009 | 0.87 | Finance | 1 | 12 | 31 | 2009 |
| 4 | 2 | 12/31/2009 | NaN | Construction | 2 | 12 | 31 | 2009 |
| 5 | 2 | 12/31/2009 | 0.34 | Construction | 2 | 12 | 31 | 2009 |
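Note that the new month, day, and year columns produced by .str.split() are themselves strings. If we want to use them as numbers, we can convert them with pd.to_numeric() as before. A sketch:

df['year'] = pd.to_numeric(df['year'], errors = 'coerce')   # missing dates become NaN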
Stripping white spaces
newdf = pd.DataFrame(np.random.randn(3, 2), columns=[" Column A ", " Column B "])
newdf
| | Column A | Column B |
|---|---|---|
| 0 | 1.764450 | 0.697717 |
| 1 | -0.865316 | -0.428507 |
| 2 | 0.165881 | 0.089415 |
# newdf['Column A']   # this will not work: the actual column name is " Column A ", with surrounding spaces
newdf.columns = newdf.columns.str.strip()
newdf
| | Column A | Column B |
|---|---|---|
| 0 | 1.764450 | 0.697717 |
| 1 | -0.865316 | -0.428507 |
| 2 | 0.165881 | 0.089415 |
newdf['Column A'] # this will work
0 1.764450
1 -0.865316
2 0.165881
Name: Column A, dtype: float64
Chaining .str methods
newdf.columns = newdf.columns.str.strip().str.replace(" ","_").str.lower()
newdf
| | column_a | column_b |
|---|---|---|
| 0 | 1.764450 | 0.697717 |
| 1 | -0.865316 | -0.428507 |
| 2 | 0.165881 | 0.089415 |
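The same chaining works on any string column, not just on the column index. For example, a sketch that would standardize the industry labels in our earlier df (assuming the column is still of string type):

df['industry'].str.strip().str.lower().str.replace(" ", "_")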