core

Extends pandas with common functions used in finance and economics research

Almost all these functions make a copy of the input DataFrame. When that DataFrame is large, use these functions as df = func(df).

Exported source

from __future__ import annotations
from typing import List, Callable 
import os, glob 
import pandas as pd
import numpy as np

First, we set up an example dataset to showcase the functions in this module.

raw = pd.DataFrame(np.random.rand(15,2), 
                  columns=list('AB'), 
                  index=pd.MultiIndex.from_product(
                      [[1,2, np.nan],[np.nan,'2010-01','2010-02','2010-02','2010-04']],
                      names = ['permno','date'])
                    ).reset_index()
raw

	permno	date	A	B
0	1.0	NaN	0.234344	0.698915
1	1.0	2010-01	0.104762	0.923778
2	1.0	2010-02	0.134963	0.287128
3	1.0	2010-02	0.561185	0.629761
4	1.0	2010-04	0.452953	0.137185
5	2.0	NaN	0.242651	0.010276
6	2.0	2010-01	0.481129	0.734026
7	2.0	2010-02	0.663068	0.887246
8	2.0	2010-02	0.627125	0.525601
9	2.0	2010-04	0.413195	0.860653
10	NaN	NaN	0.858186	0.907775
11	NaN	2010-01	0.660384	0.601805
12	NaN	2010-02	0.037070	0.510598
13	NaN	2010-02	0.579008	0.682778
14	NaN	2010-04	0.697347	0.750504

Common panel setup procedures

source

order_columns

 order_columns (df:pandas.core.frame.DataFrame, these_first:List[str])

Returns df with reordered columns. Use as df = order_columns(df,_)

order_columns(raw, these_first=['B']).head()

	B	permno	date	A
0	0.698915	1.0	NaN	0.234344
1	0.923778	1.0	2010-01	0.104762
2	0.287128	1.0	2010-02	0.134963
3	0.629761	1.0	2010-02	0.561185
4	0.137185	1.0	2010-04	0.452953

source

process_dates

 process_dates (df:pandas.core.frame.DataFrame, time_var:str='date',
                time_var_format:str='%Y-%m-%d', dtdate_var:str='dtdate',
                freq:str=None)

Makes datetime date dtdate_var from time_var; adds period date f'{freq}date'.

	Type	Default	Details
df	DataFrame		Function returns copy of this df with `dtdate_var` and `f'{freq}date'` cols added
time_var	str	date	This will be the date variable used to generate datetime var `dtdate_var`
time_var_format	str	%Y-%m-%d	Format of `time_var`; must be valid pandas `strftime`
dtdate_var	str	dtdate	Name of datetime var to be created from `time_var`
freq	str	None	Used to create `f'{freq}date'` period date; must be valid pandas offset string
Returns	DataFrame

newdf = process_dates(raw, time_var_format="%Y-%m", freq='M')
newdf.head()

	date	dtdate	Mdate	permno	A	B
0	NaN	NaT	NaT	1.0	0.234344	0.698915
1	2010-01	2010-01-01	2010-01	1.0	0.104762	0.923778
2	2010-02	2010-02-01	2010-02	1.0	0.134963	0.287128
3	2010-02	2010-02-01	2010-02	1.0	0.561185	0.629761
4	2010-04	2010-04-01	2010-04	1.0	0.452953	0.137185

source

setup_tseries

 setup_tseries (df:pandas.core.series.Series|pandas.core.frame.DataFrame,
                dates_processed:bool=False, time_var:str='date',
                time_var_format:str='%Y-%m-%d', dtdate_var:str='dtdate',
                freq:str=None, drop_missing_index_vals:bool=True,
                drop_index_duplicates:bool=True,
                duplicates_which_keep:str='last')

Applies process_dates to df; cleans up resulting f'{freq}date' period date and sets it as index.

	Type	Default	Details
df	pandas.core.series.Series \| pandas.core.frame.DataFrame		Input DataFrame; a copy is returned
dates_processed	bool	False	If True, assumes dates are already processed with `process_dates`
time_var	str	date	This will be the date variable used to generate datetime var `dtdate_var`
time_var_format	str	%Y-%m-%d	Format of `time_var`; must be valid pandas `strftime`
dtdate_var	str	dtdate	Name of datetime var to be created from `time_var`
freq	str	None	Used to create `f'{freq}date'` period date; must be valid pandas offset string
drop_missing_index_vals	bool	True	What to do with missing `f'{freq}date'`
drop_index_duplicates	bool	True	What to do with duplicates in `f'{freq}date'` values
duplicates_which_keep	str	last	If duplicates in index, which to keep; must be ‘first’, ‘last’ or `False`
Returns	DataFrame

raw.query('permno==1')

	permno	date	A	B
0	1.0	NaN	0.234344	0.698915
1	1.0	2010-01	0.104762	0.923778
2	1.0	2010-02	0.134963	0.287128
3	1.0	2010-02	0.561185	0.629761
4	1.0	2010-04	0.452953	0.137185

df = setup_tseries(raw.query('permno==1'),
                 time_var='date', time_var_format="%Y-%m",
                 freq='M')
df

	date	dtdate	permno	A	B
Mdate
2010-01	2010-01	2010-01-01	1.0	0.104762	0.923778
2010-02	2010-02	2010-02-01	1.0	0.561185	0.629761
2010-04	2010-04	2010-04-01	1.0	0.452953	0.137185

source

setup_panel

 setup_panel (df:pandas.core.frame.DataFrame, panel_ids:str=None,
              dates_processed:bool=False, time_var:str='date',
              time_var_format:str='%Y-%m-%d', dtdate_var:str='dtdate',
              freq:str=None, drop_missing_index_vals:bool=True,
              panel_ids_toint:str='Int64',
              drop_index_duplicates:bool=True,
              duplicates_which_keep:str='last')

Applies process_dates to df; cleans up (panel_ids ,f'{freq}date') and sets it as index.

	Type	Default	Details
df	DataFrame		Input DataFrame; a copy is returned
panel_ids	str	None	Name of variable that identifies panel entities
dates_processed	bool	False	If True, assumes dates are already processed with `process_dates`
time_var	str	date	This will be the date variable used to generate datetime var `dtdate_var`
time_var_format	str	%Y-%m-%d	Format of `time_var`; must be valid pandas `strftime`
dtdate_var	str	dtdate	Name of datetime var to be created from `time_var`
freq	str	None	Used to create `f'{freq}date'` period date; must be valid pandas offset string
drop_missing_index_vals	bool	True	What to do with missing `panel_ids` or `f'{freq}date'`
panel_ids_toint	str	Int64	Converts `panel_ids` to int in place; use falsy value if not wanted
drop_index_duplicates	bool	True	What to do with duplicates in (`panel_ids`, `f'{freq}date'`) values
duplicates_which_keep	str	last	If duplicates in index, which to keep; must be ‘first’, ‘last’ or `False`
Returns	DataFrame

df = setup_panel(raw,
                 panel_ids='permno',
                 time_var='date', time_var_format="%Y-%m",
                 freq='M')
df

		date	dtdate	A	B
permno	Mdate
1	2010-01	2010-01	2010-01-01	0.104762	0.923778
	2010-02	2010-02	2010-02-01	0.561185	0.629761
	2010-04	2010-04	2010-04-01	0.452953	0.137185
2	2010-01	2010-01	2010-01-01	0.481129	0.734026
	2010-02	2010-02	2010-02-01	0.627125	0.525601
	2010-04	2010-04	2010-04-01	0.413195	0.860653

Robust lagging

Lagging with shift fails when we have (1) panel data, (2) gaps in the time-series, (3) duplicate dates, (4) data is not sorted by dates (5) NaN dates.

The fast_lag function below correctly lags data (using shift()), assuming we do not have problems (2), (3), (4), and (5).

The lag function below correctly lags data (using merge()), assuming we do not have problem (5).

source

fast_lag

 fast_lag (df:pandas.core.series.Series|pandas.core.frame.DataFrame,
           n:int=1)

Lag data in df by n periods. ASSUMES DATA IS SORTED BY DATES AND HAS NO DUPLICATE OR NaN DATES, AND NO GAPS IN THE TIME SERIES. Apply df = setup_panel(df) before using.

	Type	Default	Details
df	pandas.core.series.Series \| pandas.core.frame.DataFrame		Index of `df` (or level 1 of MultiIndex) must be pandas period date.
n	int	1	Number of periods to lag based on frequency of df.index; Negative values means lead.
Returns	Series		Series with lagged values of `df`; Name is taken from `df.columns[0]`, with ’_lag{n}’ or ’_lead{n}’ suffixed.

source

lag

 lag (df:pandas.core.series.Series|pandas.core.frame.DataFrame, n:int=1,
      fast:bool=False)

Lag data in ‘df’ by ‘n’ periods. ASSUMES NO NaN DATES. Apply df = setup_panel(df) before using.

	Type	Default	Details
df	pandas.core.series.Series \| pandas.core.frame.DataFrame		Index (or level 1 of MultiIndex) must be period date with no missing values.
n	int	1	Number of periods to lag based on frequency of `df.index`; Negative values means lead.
fast	bool	False	If True, uses `fast_lag()`, which assumes data is sorted by date and has no duplicate or missing dates
Returns	Series		Series with lagged values of `df`; Name is taken from `df.columns[0]`, with ’_lag{n}’ or ’_lead{n}’ suffixed.

The index of the df parameter can not contain missing values.

lag(df['A'])

permno  Mdate  
1       2010-01         NaN
        2010-02    0.104762
        2010-04         NaN
2       2010-01         NaN
        2010-02    0.481129
        2010-04         NaN
Name: A_lag1, dtype: float64

lag(df['A'],fast=False)

permno  Mdate  
1       2010-01         NaN
        2010-02    0.104762
        2010-04         NaN
2       2010-01         NaN
        2010-02    0.481129
        2010-04         NaN
Name: A_lag1, dtype: float64

source

add_lags

 add_lags (df:pandas.core.series.Series|pandas.core.frame.DataFrame,
           vars:Union[str,List[str]], lags:Union[int,List[int]]=1,
           lag_suffix:str='_lag', lead_suffix:str='_lead',
           use_fast_lags:bool=False)

Returns a copy of df with all lags of all vars added to it.

	Type	Default	Details
df	pandas.core.series.Series \| pandas.core.frame.DataFrame		If pd.Series, it must have a name equal to `vars` param
vars	Union		Variables to be lagged; must be a subset of `df.columns()`
lags	Union	1	Which lags to be added
lag_suffix	str	_lag	Used to create new lagged variable names
lead_suffix	str	_lead	Used to create new lead variable names
use_fast_lags	bool	False	Weather to use `fast_lag()` function when lagging
Returns	DataFrame

Because this makes a copy of df, when df is a large dataset, this should be used as df = add_lags(df).

add_lags(df['A'], vars='A')

		A	A_lag1
permno	Mdate
1	2010-01	0.104762	NaN
	2010-02	0.561185	0.104762
	2010-04	0.452953	NaN
2	2010-01	0.481129	NaN
	2010-02	0.627125	0.481129
	2010-04	0.413195	NaN

add_lags(df, vars=['A','B'], lags=[2,-1])

		date	dtdate	A	B	A_lag2	A_lead1	B_lag2	B_lead1
permno	Mdate
1	2010-01	2010-01	2010-01-01	0.104762	0.923778	NaN	0.561185	NaN	0.629761
	2010-02	2010-02	2010-02-01	0.561185	0.629761	NaN	NaN	NaN	NaN
	2010-04	2010-04	2010-04-01	0.452953	0.137185	0.561185	NaN	0.629761	NaN
2	2010-01	2010-01	2010-01-01	0.481129	0.734026	NaN	0.627125	NaN	0.525601
	2010-02	2010-02	2010-02-01	0.627125	0.525601	NaN	NaN	NaN	NaN
	2010-04	2010-04	2010-04-01	0.413195	0.860653	0.627125	NaN	0.525601	NaN

add_lags(df,vars=['A','B'],lags=[2,-2], lag_suffix='_lg', lead_suffix='_ld')

		date	dtdate	A	B	A_lg2	A_ld2	B_lg2	B_ld2
permno	Mdate
1	2010-01	2010-01	2010-01-01	0.104762	0.923778	NaN	NaN	NaN	NaN
	2010-02	2010-02	2010-02-01	0.561185	0.629761	NaN	0.452953	NaN	0.137185
	2010-04	2010-04	2010-04-01	0.452953	0.137185	0.561185	NaN	0.629761	NaN
2	2010-01	2010-01	2010-01-01	0.481129	0.734026	NaN	NaN	NaN	NaN
	2010-02	2010-02	2010-02-01	0.627125	0.525601	NaN	0.413195	NaN	0.860653
	2010-04	2010-04	2010-04-01	0.413195	0.860653	0.627125	NaN	0.525601	NaN

And remember that by default, lag uses fast=True, which is not robust to duplicate dates (or unsorted dates).

Utilities using robust lagging

source

rpct_change

 rpct_change (df:pandas.core.series.Series, n:int=1, use_fast_lags=False)

Percentage change using robust lag() or fast_lag() function.

rpct_change(df['A'])

permno  Mdate  
1       2010-01         NaN
        2010-02    4.356736
        2010-04         NaN
2       2010-01         NaN
        2010-02    0.303446
        2010-04         NaN
dtype: float64

source

rdiff

 rdiff (df:pandas.core.series.Series, n:int=1, use_fast_lags=False)

Difference using robust lag() or fast_lag() function.

rdiff(df['A'])

permno  Mdate  
1       2010-01         NaN
        2010-02    0.456422
        2010-04         NaN
2       2010-01         NaN
        2010-02    0.145997
        2010-04         NaN
dtype: float64

source

rrolling

 rrolling (df:pandas.core.series.Series|pandas.core.frame.DataFrame,
           func:str, window:int=None, skipna:bool|None=False,
           use_fast_lags:bool=False)

Like pd.DataFrame.rolling but using robust lags. Run df = setup_tseries(df) or df = setup_panel(df) prior to using.

	Type	Default	Details
df	pandas.core.series.Series \| pandas.core.frame.DataFrame		Must have period date Index (if Series) or (panel_id, period_date) Multiindex (if DataFrame)
func	str		Name of any pandas aggregation function (to applied to `df` data within each rolling window
window	int	None	Rolling window length; if None, uses ‘expanding’ without fixing lags
skipna	bool \| None	False	Use None if `func` does not take `skipna` arg.
use_fast_lags	bool	False
Returns	Series

df.assign(rolling_A = rrolling(df['A'], func='mean', window=2, skipna=True))

		date	dtdate	A	B	rolling_A
permno	Mdate
1	2010-01	2010-01	2010-01-01	0.104762	0.923778	0.104762
	2010-02	2010-02	2010-02-01	0.561185	0.629761	0.332974
	2010-04	2010-04	2010-04-01	0.452953	0.137185	0.452953
2	2010-01	2010-01	2010-01-01	0.481129	0.734026	0.481129
	2010-02	2010-02	2010-02-01	0.627125	0.525601	0.554127
	2010-04	2010-04	2010-04-01	0.413195	0.860653	0.413195

df.assign(rolling_A = rrolling(df['A'], func='mean', window=2, skipna=False))

		date	dtdate	A	B	rolling_A
permno	Mdate
1	2010-01	2010-01	2010-01-01	0.104762	0.923778	NaN
	2010-02	2010-02	2010-02-01	0.561185	0.629761	0.332974
	2010-04	2010-04	2010-04-01	0.452953	0.137185	NaN
2	2010-01	2010-01	2010-01-01	0.481129	0.734026	NaN
	2010-02	2010-02	2010-02-01	0.627125	0.525601	0.554127
	2010-04	2010-04	2010-04-01	0.413195	0.860653	NaN

Test that it works for a time-series, not just a panel

df['A'].loc[1]

Mdate
2010-01    0.104762
2010-02    0.561185
2010-04    0.452953
Freq: M, Name: A, dtype: float64

rrolling(df['A'].loc[1], func='mean', window=2, skipna=True)

Mdate
2010-01    0.104762
2010-02    0.332974
2010-04    0.452953
Freq: M, dtype: float64

Very common data transformations

source

wins

 wins (df:pandas.core.series.Series|pandas.core.frame.DataFrame, low=0.01,
       high=0.99, byvars:List[str]=None)

Winsorizes all columns in df.

	Type	Default	Details
df	pandas.core.series.Series \| pandas.core.frame.DataFrame
low	float	0.01	Lower quantile at which to winsorize
high	float	0.99	Upper quantile at which to winsorize
byvars	List	None	If None, quantiles use full sample, o/w they are calculate within each group given by `byvars`
Returns	DataFrame

source

norm

 norm (df:pandas.core.series.Series|pandas.core.frame.DataFrame,
       divide_by_mean=False)

Subtract means from all columns of df and divide by their std. deviations, unless divide_by_mean is True

I/O

source

to_stata

 to_stata (df:pandas.core.frame.DataFrame=None, outfile:str=None,
           obj_drop:bool=False, obj_to_str:bool=False, **to_stata_kwargs)

Writes df to stata outfile

	Type	Default	Details
df	DataFrame	None
outfile	str	None	Output file path; must include .dta extension
obj_drop	bool	False	Whether to drop all columns of `object` type
obj_to_str	bool	False	Whether to convert all columns of `object` type to `string` type
to_stata_kwargs	VAR_KEYWORD

Index data is automatically included in the output, unless the index is the default range from 0 to len(df).

Columns of object data type are by default left to pd.to_stata to figure out how to convert. Note that columns of strings with missing data might cause an error in this default case. In this case, the best thing to do is for you convert it to string type before calling this function. If not, setting obj_to_str to True will deal with this internally but it will be slower.

For time data, columns of period type and columns of datetime64[ns] type with time zone information will be dropped. All other datetime64[ns] variables will be converted to td dates in Stata.

to_stata(df, outfile='../data/df.dta', version=117)

df = pd.read_stata('../data/df.dta')
df

	permno	date	dtdate	A	B
0	1	2010-01	2010-01-01	0.104762	0.923778
1	1	2010-02	2010-02-01	0.561185	0.629761
2	1	2010-04	2010-04-01	0.452953	0.137185
3	2	2010-01	2010-01-01	0.481129	0.734026
4	2	2010-02	2010-02-01	0.627125	0.525601
5	2	2010-04	2010-04-01	0.413195	0.860653

Sorting and binning

source

bins_using_masked_cutoffs

 bins_using_masked_cutoffs (df:pandas.core.frame.DataFrame=None,
                            sortvar:str=None, maskvar:str=None,
                            quantiles:list=None, outvar:str=None)

Returns column of bin numbers (1 to len(quantiles)) created by binning sortvar based on cuttoffs give by quantiles of df.loc[df[maskvar], sortvar]

	Type	Default	Details
df	DataFrame	None	Dataframe containing `sortvar` and `maskvar`. Must have panelvar x datevar multiindex
sortvar	str	None	Variable containing the values to be binned
maskvar	str	None	Mask to be applied to `df[sortvar]` before bin cutoffs are calculated
quantiles	list	None	List of quantiles to be applied to df.loc[df[maskvar], sortvar] to determine bin cutoffs
outvar	str	None	Name to give to the column of bins created. If none, will use f”{sortvar}_bins”
Returns	DataFrame