L09: Pandas data cleaning

Preliminaries

import pandas as pd
import numpy as np

df = pd.read_excel('./rawdata.xlsx')
df
| | firmid | date | return | industry | ind_code |
|---|---|---|---|---|---|
| 0 | 1 | 12/31/2008 | 0.05 | Finance | 1 |
| 1 | 2 | NaN | 0.45 | Construction | 2 |
| 2 | 3 | 12/31/2008 | 23.00 | NaN | M |
| 3 | 1 | 12/31/2009 | 0.87 | Finance | 1 |
| 4 | 2 | 12/31/2009 | NaN | Construct | 2 |
| 5 | 2 | 12/31/2009 | 0.34 | Construction | 2 |
df.dtypes
firmid int64
date object
return float64
industry object
ind_code object
dtype: object
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 firmid 6 non-null int64
1 date 5 non-null object
2 return 5 non-null float64
3 industry 5 non-null object
4 ind_code 6 non-null object
dtypes: float64(1), int64(1), object(3)
memory usage: 368.0+ bytes
The “object” data type (for the industry and ind_code columns) is a catch-all that Pandas uses when it cannot determine a more specific data type for the column (e.g. int, float, str, etc.). Columns containing strings often end up with this data type.
df.describe()
| | firmid | return |
|---|---|---|
| count | 6.000000 | 5.000000 |
| mean | 1.833333 | 4.942000 |
| std | 0.752773 | 10.099018 |
| min | 1.000000 | 0.050000 |
| 25% | 1.250000 | 0.340000 |
| 50% | 2.000000 | 0.450000 |
| 75% | 2.000000 | 0.870000 |
| max | 3.000000 | 23.000000 |
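Note that .describe() summarizes only the numeric columns by default. If you also want summary information for the non-numeric columns (count, number of unique values, most frequent value), you can pass the include parameter. A quick sketch (output omitted):

df.describe(include='all')   # adds unique / top / freq statistics for the object columns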
Missing values
Missing values appear as a special code depending on the data type of the column in which they appear: NaN (which stands for “not a number”) for numeric data types, None or NaN for the object data type, and NaT for “datetime” columns (more on this data type later).
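As a quick illustration of these codes (a minimal sketch on made-up data, separate from our df):

pd.Series([1.0, None])                            # missing value shows as NaN (float64 column)
pd.Series(['a', None])                            # missing value stays None (object column)
pd.to_datetime(pd.Series(['12/31/2008', None]))   # missing value shows as NaT (datetime column)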
To find the missing values in the data, we can use the .isnull() function (or its equivalent, .isna()):
df['return'].isnull()
0 False
1 False
2 False
3 False
4 True
5 False
Name: return, dtype: bool
df.loc[df['return'].isnull(), :]
| | firmid | date | return | industry | ind_code |
|---|---|---|---|---|---|
| 4 | 2 | 12/31/2009 | NaN | Construct | 2 |
We can drop all the rows that have any missing values using the .dropna() function:
df2 = df.dropna()
df2
| | firmid | date | return | industry | ind_code |
|---|---|---|---|---|---|
| 0 | 1 | 12/31/2008 | 0.05 | Finance | 1 |
| 3 | 1 | 12/31/2009 | 0.87 | Finance | 1 |
| 5 | 2 | 12/31/2009 | 0.34 | Construction | 2 |
If we want to drop only the rows in which all values are missing, we have to use how = 'all' as a parameter:
df2 = df.dropna(how = 'all')
df2
| | firmid | date | return | industry | ind_code |
|---|---|---|---|---|---|
| 0 | 1 | 12/31/2008 | 0.05 | Finance | 1 |
| 1 | 2 | NaN | 0.45 | Construction | 2 |
| 2 | 3 | 12/31/2008 | 23.00 | NaN | M |
| 3 | 1 | 12/31/2009 | 0.87 | Finance | 1 |
| 4 | 2 | 12/31/2009 | NaN | Construct | 2 |
| 5 | 2 | 12/31/2009 | 0.34 | Construction | 2 |
If we want to remove all rows that contain missing values in a given column, we can use .loc[] together with the .notnull() function:
df2 = df.loc[df['return'].notnull(), :]
df2
| | firmid | date | return | industry | ind_code |
|---|---|---|---|---|---|
| 0 | 1 | 12/31/2008 | 0.05 | Finance | 1 |
| 1 | 2 | NaN | 0.45 | Construction | 2 |
| 2 | 3 | 12/31/2008 | 23.00 | NaN | M |
| 3 | 1 | 12/31/2009 | 0.87 | Finance | 1 |
| 5 | 2 | 12/31/2009 | 0.34 | Construction | 2 |
or we can use the subset parameter of the dropna() function, which tells the function to look for missing values only in a subset of the columns:
df2 = df.dropna(subset=['return'])
df2
| | firmid | date | return | industry | ind_code |
|---|---|---|---|---|---|
| 0 | 1 | 12/31/2008 | 0.05 | Finance | 1 |
| 1 | 2 | NaN | 0.45 | Construction | 2 |
| 2 | 3 | 12/31/2008 | 23.00 | NaN | M |
| 3 | 1 | 12/31/2009 | 0.87 | Finance | 1 |
| 5 | 2 | 12/31/2009 | 0.34 | Construction | 2 |
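Before dropping anything, it is often useful to know how many values are missing in each column. Since .isnull() returns booleans, and True sums as 1, a small sketch:

df.isnull().sum()   # number of missing values in each column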
Changing data types
Many times, a particular column in our dataframe does not have the data type we want. Several functions allow us to convert one data type to another. Below, we cover the most commonly used ones:
The .astype() method
Specify the new data type that you want to convert to as an argument to .astype():
df2 = df.copy()
df2.dtypes
firmid int64
date object
return float64
industry object
ind_code object
dtype: object
df2['firmid'] = df2['firmid'].astype('float64')
df2.dtypes
firmid float64
date object
return float64
industry object
ind_code object
dtype: object
df2
| | firmid | date | return | industry | ind_code |
|---|---|---|---|---|---|
| 0 | 1.0 | 12/31/2008 | 0.05 | Finance | 1 |
| 1 | 2.0 | NaN | 0.45 | Construction | 2 |
| 2 | 3.0 | 12/31/2008 | 23.00 | NaN | M |
| 3 | 1.0 | 12/31/2009 | 0.87 | Finance | 1 |
| 4 | 2.0 | 12/31/2009 | NaN | Construct | 2 |
| 5 | 2.0 | 12/31/2009 | 0.34 | Construction | 2 |
df2['firmid'] = df2['firmid'].astype('string')
df2.dtypes
firmid string
date object
return float64
industry object
ind_code object
dtype: object
df2
| | firmid | date | return | industry | ind_code |
|---|---|---|---|---|---|
| 0 | 1.0 | 12/31/2008 | 0.05 | Finance | 1 |
| 1 | 2.0 | NaN | 0.45 | Construction | 2 |
| 2 | 3.0 | 12/31/2008 | 23.00 | NaN | M |
| 3 | 1.0 | 12/31/2009 | 0.87 | Finance | 1 |
| 4 | 2.0 | 12/31/2009 | NaN | Construct | 2 |
| 5 | 2.0 | 12/31/2009 | 0.34 | Construction | 2 |
It may not look like firmid is a string data type now, but it is. For example, the command below would not work if firmid were still numeric:
df2['newid'] = df2['firmid'] + "abc"
df2
| | firmid | date | return | industry | ind_code | newid |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 12/31/2008 | 0.05 | Finance | 1 | 1.0abc |
| 1 | 2.0 | NaN | 0.45 | Construction | 2 | 2.0abc |
| 2 | 3.0 | 12/31/2008 | 23.00 | NaN | M | 3.0abc |
| 3 | 1.0 | 12/31/2009 | 0.87 | Finance | 1 | 1.0abc |
| 4 | 2.0 | 12/31/2009 | NaN | Construct | 2 | 2.0abc |
| 5 | 2.0 | 12/31/2009 | 0.34 | Construction | 2 | 2.0abc |
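One caveat with .astype(): a column that contains missing values cannot be converted to the regular 'int64' type, because NaN has no integer representation. Pandas' nullable 'Int64' type (with a capital I) does accept missing values. A small sketch on made-up data:

pd.Series([1.0, None]).astype('Int64')    # works: values become 1 and <NA>
# pd.Series([1.0, None]).astype('int64')  # would raise an error because of the missing value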
The pd.to_numeric() function
This is generally used to convert string (or object) data types to a numeric data type. Unlike .astype(), which is called as a method on the column we want to convert, pd.to_numeric() takes that column as an argument:
df2.dtypes
firmid string
date object
return float64
industry object
ind_code object
newid string
dtype: object
df2['firmid'] = pd.to_numeric(df2['firmid'])
df2.dtypes
firmid float64
date object
return float64
industry object
ind_code object
newid string
dtype: object
In some situations, the pd.to_numeric() function will not succeed unless you specify the parameter errors = 'coerce'. For example, the code below would not work without that parameter (which is why I always specify it):
df2['ind_code'] = pd.to_numeric(df2['ind_code'], errors = 'coerce')
df2.dtypes
firmid float64
date object
return float64
industry object
ind_code float64
newid string
dtype: object
Note that this converted the non-numeric values in the ind_code column to NaN:
df2
| | firmid | date | return | industry | ind_code | newid |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 12/31/2008 | 0.05 | Finance | 1.0 | 1.0abc |
| 1 | 2.0 | NaN | 0.45 | Construction | 2.0 | 2.0abc |
| 2 | 3.0 | 12/31/2008 | 23.00 | NaN | NaN | 3.0abc |
| 3 | 1.0 | 12/31/2009 | 0.87 | Finance | 1.0 | 1.0abc |
| 4 | 2.0 | 12/31/2009 | NaN | Construct | 2.0 | 2.0abc |
| 5 | 2.0 | 12/31/2009 | 0.34 | Construction | 2.0 | 2.0abc |
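To see why errors = 'coerce' was needed above, here is a small sketch on made-up data (wrapping the call in try/except so the error message is printed rather than raised):

try:
    pd.to_numeric(pd.Series(['1', '2', 'M']))
except ValueError as e:
    print(e)   # 'M' cannot be parsed as a number, so pd.to_numeric raises a ValueError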
Duplicates and counts
In many situations, it is important to know whether our data contains any duplicate entries (most of the time we want to eliminate those), as well as to explicitly count repeated entries in any particular column or set of columns in our data. We can do these operations with the .duplicated() and .value_counts() functions:
.duplicated()

Syntax:

DataFrame.duplicated(subset=None, keep='first')
where the subset parameter allows us to specify where in the dataset (which columns) we are looking for duplicated rows (if unspecified, Pandas looks for instances where an entire row is duplicated). The keep parameter allows us to specify which occurrences get flagged: with keep='first' (the default), every duplicate except the first occurrence is marked True; with keep='last', every duplicate except the last occurrence is marked; and with keep=False, all duplicates are marked. The examples below illustrate each option:
df.duplicated()
0 False
1 False
2 False
3 False
4 False
5 False
dtype: bool
df.duplicated(subset = ['firmid','date'])
0 False
1 False
2 False
3 False
4 False
5 True
dtype: bool
df.duplicated(subset = ['firmid','date'], keep='last')
0 False
1 False
2 False
3 False
4 True
5 False
dtype: bool
df.duplicated(subset = ['firmid','date'], keep=False)
0 False
1 False
2 False
3 False
4 True
5 True
dtype: bool
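Since .duplicated() returns a boolean Series, summing it gives a quick count of the flagged rows. A small sketch:

df.duplicated(['firmid','date']).sum()   # number of rows flagged as duplicates (1 here, with the default keep='first')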
To drop duplicated data, we can use the .duplicated() function inside a .loc[]:
df2 = df.loc[~df.duplicated(['firmid','date'])]
df2
| | firmid | date | return | industry | ind_code |
|---|---|---|---|---|---|
| 0 | 1 | 12/31/2008 | 0.05 | Finance | 1 |
| 1 | 2 | NaN | 0.45 | Construction | 2 |
| 2 | 3 | 12/31/2008 | 23.00 | NaN | M |
| 3 | 1 | 12/31/2009 | 0.87 | Finance | 1 |
| 4 | 2 | 12/31/2009 | NaN | Construct | 2 |
or, more commonly, using the .drop_duplicates() function:
df2 = df.drop_duplicates(['firmid','date'])
df2
| | firmid | date | return | industry | ind_code |
|---|---|---|---|---|---|
| 0 | 1 | 12/31/2008 | 0.05 | Finance | 1 |
| 1 | 2 | NaN | 0.45 | Construction | 2 |
| 2 | 3 | 12/31/2008 | 23.00 | NaN | M |
| 3 | 1 | 12/31/2009 | 0.87 | Finance | 1 |
| 4 | 2 | 12/31/2009 | NaN | Construct | 2 |
Note that the above still keeps the row with index 4 and drops the row with index 5 (its duplicate). This is because keep='first' is the default for the .drop_duplicates() function as well. To eliminate both duplicated rows, we would have to set keep=False:
df2 = df.drop_duplicates(['firmid','date'], keep = False)
df2
| | firmid | date | return | industry | ind_code |
|---|---|---|---|---|---|
| 0 | 1 | 12/31/2008 | 0.05 | Finance | 1 |
| 1 | 2 | NaN | 0.45 | Construction | 2 |
| 2 | 3 | 12/31/2008 | 23.00 | NaN | M |
| 3 | 1 | 12/31/2009 | 0.87 | Finance | 1 |
Note also that the meaning of “first” and “last” for the keep parameter depends on how your dataframe happens to be sorted at the time you drop the duplicates.
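A common pattern, then, is to sort explicitly before dropping duplicates, so that “first” has a deterministic meaning. For example, a sketch that would keep the highest return within each firmid/date pair:

df.sort_values('return', ascending=False).drop_duplicates(['firmid','date'])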
.value_counts()
Syntax:
DataFrame.value_counts(subset=None, normalize=False, sort=True, ascending=False, dropna=True)
df['industry'].value_counts()
Finance 2
Construction 2
Construct 1
Name: industry, dtype: int64
df.value_counts('industry')
industry
Construction 2
Finance 2
Construct 1
dtype: int64
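The normalize parameter from the syntax above converts the counts into proportions. A quick sketch (output omitted):

df['industry'].value_counts(normalize=True)   # fractions of non-missing entries, summing to 1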
Operating on text data (strings)
Working with text data is a huge topic in data analysis. The Pandas user guide offers a detailed discussion of the ways the Pandas package can be used to operate on text data: https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#. For the most part, all of this is done with the .str accessor and its methods.
Here, we cover a very small subset of the functions that are commonly used for string manipulation inside a dataframe.
We’ll work on the df dataframe:
df
| | firmid | date | return | industry | ind_code |
|---|---|---|---|---|---|
| 0 | 1 | 12/31/2008 | 0.05 | Finance | 1 |
| 1 | 2 | NaN | 0.45 | Construction | 2 |
| 2 | 3 | 12/31/2008 | 23.00 | NaN | M |
| 3 | 1 | 12/31/2009 | 0.87 | Finance | 1 |
| 4 | 2 | 12/31/2009 | NaN | Construct | 2 |
| 5 | 2 | 12/31/2009 | 0.34 | Construction | 2 |
It is important to convert a text column to the string data type before we manipulate it with .str functions. For example, the industry column is currently of type object, so we will convert it to string:
df['industry'] = df['industry'].astype('string')
df
| | firmid | date | return | industry | ind_code |
|---|---|---|---|---|---|
| 0 | 1 | 12/31/2008 | 0.05 | Finance | 1 |
| 1 | 2 | NaN | 0.45 | Construction | 2 |
| 2 | 3 | 12/31/2008 | 23.00 | <NA> | M |
| 3 | 1 | 12/31/2009 | 0.87 | Finance | 1 |
| 4 | 2 | 12/31/2009 | NaN | Construct | 2 |
| 5 | 2 | 12/31/2009 | 0.34 | Construction | 2 |
Slicing into string data
df['industry'].str[0:3]
0 Fin
1 Con
2 <NA>
3 Fin
4 Con
5 Con
Name: industry, dtype: string
Converting to lower case or upper case
df['industry'].str.lower()
0 finance
1 construction
2 <NA>
3 finance
4 construct
5 construction
Name: industry, dtype: string
df.columns.str.upper()
Index(['FIRMID', 'DATE', 'RETURN', 'INDUSTRY', 'IND_CODE'], dtype='object')
Substrings
df.loc[df['industry'].str.contains("Cons"), :]
| | firmid | date | return | industry | ind_code |
|---|---|---|---|---|---|
| 1 | 2 | NaN | 0.45 | Construction | 2 |
| 4 | 2 | 12/31/2009 | NaN | Construct | 2 |
| 5 | 2 | 12/31/2009 | 0.34 | Construction | 2 |
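One caution with .str.contains(): if the column contains missing values, the resulting boolean mask contains missing values too, which can make filtering fail depending on the column's data type and your Pandas version. The na parameter lets us explicitly treat missing values as non-matches. A defensive sketch:

df.loc[df['industry'].str.contains("Cons", na=False), :]   # <NA> rows are treated as False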
df['industry'] = df['industry'].str.replace("Construct","Construction")
df
| | firmid | date | return | industry | ind_code |
|---|---|---|---|---|---|
| 0 | 1 | 12/31/2008 | 0.05 | Finance | 1 |
| 1 | 2 | NaN | 0.45 | Constructionion | 2 |
| 2 | 3 | 12/31/2008 | 23.00 | <NA> | M |
| 3 | 1 | 12/31/2009 | 0.87 | Finance | 1 |
| 4 | 2 | 12/31/2009 | NaN | Construction | 2 |
| 5 | 2 | 12/31/2009 | 0.34 | Constructionion | 2 |
Notice that .str.replace() also matched the "Construct" substring inside "Construction", producing "Constructionion". One way to fix all of these values at once is to overwrite the matching rows with .loc[]:

df.loc[df['industry'].str.contains('Cons'), 'industry'] = 'Construction'
df
| | firmid | date | return | industry | ind_code |
|---|---|---|---|---|---|
| 0 | 1 | 12/31/2008 | 0.05 | Finance | 1 |
| 1 | 2 | NaN | 0.45 | Construction | 2 |
| 2 | 3 | 12/31/2008 | 23.00 | <NA> | M |
| 3 | 1 | 12/31/2009 | 0.87 | Finance | 1 |
| 4 | 2 | 12/31/2009 | NaN | Construction | 2 |
| 5 | 2 | 12/31/2009 | 0.34 | Construction | 2 |
Splitting
df[['month','day','year']] = df['date'].str.split(pat = "/", expand = True)
df
| | firmid | date | return | industry | ind_code | month | day | year |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 12/31/2008 | 0.05 | Finance | 1 | 12 | 31 | 2008 |
| 1 | 2 | NaN | 0.45 | Construction | 2 | NaN | NaN | NaN |
| 2 | 3 | 12/31/2008 | 23.00 | <NA> | M | 12 | 31 | 2008 |
| 3 | 1 | 12/31/2009 | 0.87 | Finance | 1 | 12 | 31 | 2009 |
| 4 | 2 | 12/31/2009 | NaN | Construction | 2 | 12 | 31 | 2009 |
| 5 | 2 | 12/31/2009 | 0.34 | Construction | 2 | 12 | 31 | 2009 |
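Note that the new month, day, and year columns produced by .str.split() are themselves strings. If we want to use them as numbers, we can convert them with pd.to_numeric() as before. A sketch:

df['year'] = pd.to_numeric(df['year'], errors = 'coerce')   # missing dates become NaN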
Stripping white spaces
newdf = pd.DataFrame(np.random.randn(3, 2), columns=[" Column A ", " Column B "])
newdf
| | Column A | Column B |
|---|---|---|
| 0 | 1.764450 | 0.697717 |
| 1 | -0.865316 | -0.428507 |
| 2 | 0.165881 | 0.089415 |
# newdf['Column A']   # this will not work: the actual column name is " Column A ", with surrounding spaces
newdf.columns = newdf.columns.str.strip()
newdf
| | Column A | Column B |
|---|---|---|
| 0 | 1.764450 | 0.697717 |
| 1 | -0.865316 | -0.428507 |
| 2 | 0.165881 | 0.089415 |
newdf['Column A'] # this will work
0 1.764450
1 -0.865316
2 0.165881
Name: Column A, dtype: float64
Chaining .str methods
newdf.columns = newdf.columns.str.strip().str.replace(" ","_").str.lower()
newdf
| | column_a | column_b |
|---|---|---|
| 0 | 1.764450 | 0.697717 |
| 1 | -0.865316 | -0.428507 |
| 2 | 0.165881 | 0.089415 |
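The same chaining works on any string column, not just on the column index. For example, a sketch that would standardize the industry labels in our earlier df (assuming the column is still of string type):

df['industry'].str.strip().str.lower().str.replace(" ", "_")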