pandasmore
  1. core
  • pandasmore
  • core

On this page

  • Common panel setup procedures
    • order_columns
    • process_dates
    • setup_tseries
    • setup_panel
  • Robust lagging
    • fast_lag
    • lag
    • add_lags
  • Utilities using robust lagging
    • rpct_change
    • rdiff
    • rrolling
  • Very common data transformations
    • wins
    • norm
  • I/O
    • to_stata
  • Sorting and binning
    • bins_using_masked_cutoffs
  • Report an issue

core

Extends pandas with common functions used in finance and economics research

Almost all these functions make a copy of the input DataFrame. When that DataFrame is large, use these functions as df = func(df).

Exported source
from __future__ import annotations
from typing import List, Callable 
import os, glob 
import pandas as pd
import numpy as np

First, we set up an example dataset to showcase the functions in this module.

raw = pd.DataFrame(np.random.rand(15,2), 
                  columns=list('AB'), 
                  index=pd.MultiIndex.from_product(
                      [[1,2, np.nan],[np.nan,'2010-01','2010-02','2010-02','2010-04']],
                      names = ['permno','date'])
                    ).reset_index()
raw
permno date A B
0 1.0 NaN 0.234344 0.698915
1 1.0 2010-01 0.104762 0.923778
2 1.0 2010-02 0.134963 0.287128
3 1.0 2010-02 0.561185 0.629761
4 1.0 2010-04 0.452953 0.137185
5 2.0 NaN 0.242651 0.010276
6 2.0 2010-01 0.481129 0.734026
7 2.0 2010-02 0.663068 0.887246
8 2.0 2010-02 0.627125 0.525601
9 2.0 2010-04 0.413195 0.860653
10 NaN NaN 0.858186 0.907775
11 NaN 2010-01 0.660384 0.601805
12 NaN 2010-02 0.037070 0.510598
13 NaN 2010-02 0.579008 0.682778
14 NaN 2010-04 0.697347 0.750504

Common panel setup procedures


source

order_columns

 order_columns (df:pandas.core.frame.DataFrame, these_first:List[str])

Returns df with reordered columns. Use as df = order_columns(df,_)

order_columns(raw, these_first=['B']).head()
B permno date A
0 0.698915 1.0 NaN 0.234344
1 0.923778 1.0 2010-01 0.104762
2 0.287128 1.0 2010-02 0.134963
3 0.629761 1.0 2010-02 0.561185
4 0.137185 1.0 2010-04 0.452953

source

process_dates

 process_dates (df:pandas.core.frame.DataFrame, time_var:str='date',
                time_var_format:str='%Y-%m-%d', dtdate_var:str='dtdate',
                freq:str=None)

Makes datetime date dtdate_var from time_var; adds period date f'{freq}date'.

Type Default Details
df DataFrame Function returns copy of this df with dtdate_var and f'{freq}date' cols added
time_var str date This will be the date variable used to generate datetime var dtdate_var
time_var_format str %Y-%m-%d Format of time_var; must be valid pandas strftime
dtdate_var str dtdate Name of datetime var to be created from time_var
freq str None Used to create f'{freq}date' period date; must be valid pandas offset string
Returns DataFrame
newdf = process_dates(raw, time_var_format="%Y-%m", freq='M')
newdf.head()
date dtdate Mdate permno A B
0 NaN NaT NaT 1.0 0.234344 0.698915
1 2010-01 2010-01-01 2010-01 1.0 0.104762 0.923778
2 2010-02 2010-02-01 2010-02 1.0 0.134963 0.287128
3 2010-02 2010-02-01 2010-02 1.0 0.561185 0.629761
4 2010-04 2010-04-01 2010-04 1.0 0.452953 0.137185

source

setup_tseries

 setup_tseries (df:pandas.core.series.Series|pandas.core.frame.DataFrame,
                dates_processed:bool=False, time_var:str='date',
                time_var_format:str='%Y-%m-%d', dtdate_var:str='dtdate',
                freq:str=None, drop_missing_index_vals:bool=True,
                drop_index_duplicates:bool=True,
                duplicates_which_keep:str='last')

Applies process_dates to df; cleans up resulting f'{freq}date' period date and sets it as index.

Type Default Details
df pandas.core.series.Series | pandas.core.frame.DataFrame Input DataFrame; a copy is returned
dates_processed bool False If True, assumes dates are already processed with process_dates
time_var str date This will be the date variable used to generate datetime var dtdate_var
time_var_format str %Y-%m-%d Format of time_var; must be valid pandas strftime
dtdate_var str dtdate Name of datetime var to be created from time_var
freq str None Used to create f'{freq}date' period date; must be valid pandas offset string
drop_missing_index_vals bool True What to do with missing f'{freq}date'
drop_index_duplicates bool True What to do with duplicates in f'{freq}date' values
duplicates_which_keep str last If duplicates in index, which to keep; must be ‘first’, ‘last’ or False
Returns DataFrame
raw.query('permno==1')
permno date A B
0 1.0 NaN 0.234344 0.698915
1 1.0 2010-01 0.104762 0.923778
2 1.0 2010-02 0.134963 0.287128
3 1.0 2010-02 0.561185 0.629761
4 1.0 2010-04 0.452953 0.137185
df = setup_tseries(raw.query('permno==1'),
                 time_var='date', time_var_format="%Y-%m",
                 freq='M')
df
date dtdate permno A B
Mdate
2010-01 2010-01 2010-01-01 1.0 0.104762 0.923778
2010-02 2010-02 2010-02-01 1.0 0.561185 0.629761
2010-04 2010-04 2010-04-01 1.0 0.452953 0.137185

source

setup_panel

 setup_panel (df:pandas.core.frame.DataFrame, panel_ids:str=None,
              dates_processed:bool=False, time_var:str='date',
              time_var_format:str='%Y-%m-%d', dtdate_var:str='dtdate',
              freq:str=None, drop_missing_index_vals:bool=True,
              panel_ids_toint:str='Int64',
              drop_index_duplicates:bool=True,
              duplicates_which_keep:str='last')

Applies process_dates to df; cleans up (panel_ids ,f'{freq}date') and sets it as index.

Type Default Details
df DataFrame Input DataFrame; a copy is returned
panel_ids str None Name of variable that identifies panel entities
dates_processed bool False If True, assumes dates are already processed with process_dates
time_var str date This will be the date variable used to generate datetime var dtdate_var
time_var_format str %Y-%m-%d Format of time_var; must be valid pandas strftime
dtdate_var str dtdate Name of datetime var to be created from time_var
freq str None Used to create f'{freq}date' period date; must be valid pandas offset string
drop_missing_index_vals bool True What to do with missing panel_ids or f'{freq}date'
panel_ids_toint str Int64 Converts panel_ids to int in place; use falsy value if not wanted
drop_index_duplicates bool True What to do with duplicates in (panel_ids, f'{freq}date') values
duplicates_which_keep str last If duplicates in index, which to keep; must be ‘first’, ‘last’ or False
Returns DataFrame
df = setup_panel(raw,
                 panel_ids='permno',
                 time_var='date', time_var_format="%Y-%m",
                 freq='M')
df
date dtdate A B
permno Mdate
1 2010-01 2010-01 2010-01-01 0.104762 0.923778
2010-02 2010-02 2010-02-01 0.561185 0.629761
2010-04 2010-04 2010-04-01 0.452953 0.137185
2 2010-01 2010-01 2010-01-01 0.481129 0.734026
2010-02 2010-02 2010-02-01 0.627125 0.525601
2010-04 2010-04 2010-04-01 0.413195 0.860653

Robust lagging

Lagging with shift fails when we have (1) panel data, (2) gaps in the time-series, (3) duplicate dates, (4) data is not sorted by dates (5) NaN dates.

The fast_lag function below correctly lags data (using shift()), assuming we do not have problems (2), (3), (4), and (5).

The lag function below correctly lags data (using merge()), assuming we do not have problem (5).


source

fast_lag

 fast_lag (df:pandas.core.series.Series|pandas.core.frame.DataFrame,
           n:int=1)

Lag data in df by n periods. ASSUMES DATA IS SORTED BY DATES AND HAS NO DUPLICATE OR NaN DATES, AND NO GAPS IN THE TIME SERIES. Apply df = setup_panel(df) before using.

Type Default Details
df pandas.core.series.Series | pandas.core.frame.DataFrame Index of df (or level 1 of MultiIndex) must be pandas period date.
n int 1 Number of periods to lag based on frequency of df.index; Negative values means lead.
Returns Series **Series with lagged values of df; Name is taken from df.columns[0], with ’_lag{n}’ or ’_lead{n}’ suffixed.**

source

lag

 lag (df:pandas.core.series.Series|pandas.core.frame.DataFrame, n:int=1,
      fast:bool=False)

Lag data in ‘df’ by ‘n’ periods. ASSUMES NO NaN DATES. Apply df = setup_panel(df) before using.

Type Default Details
df pandas.core.series.Series | pandas.core.frame.DataFrame Index (or level 1 of MultiIndex) must be period date with no missing values.
n int 1 Number of periods to lag based on frequency of df.index; Negative values means lead.
fast bool False If True, uses fast_lag(), which assumes data is sorted by date and has no duplicate or missing dates
Returns Series **Series with lagged values of df; Name is taken from df.columns[0], with ’_lag{n}’ or ’_lead{n}’ suffixed.**

The index of the df parameter can not contain missing values.

lag(df['A'])
permno  Mdate  
1       2010-01         NaN
        2010-02    0.104762
        2010-04         NaN
2       2010-01         NaN
        2010-02    0.481129
        2010-04         NaN
Name: A_lag1, dtype: float64
lag(df['A'],fast=False)
permno  Mdate  
1       2010-01         NaN
        2010-02    0.104762
        2010-04         NaN
2       2010-01         NaN
        2010-02    0.481129
        2010-04         NaN
Name: A_lag1, dtype: float64

source

add_lags

 add_lags (df:pandas.core.series.Series|pandas.core.frame.DataFrame,
           vars:Union[str,List[str]], lags:Union[int,List[int]]=1,
           lag_suffix:str='_lag', lead_suffix:str='_lead',
           use_fast_lags:bool=False)

Returns a copy of df with all lags of all vars added to it.

Type Default Details
df pandas.core.series.Series | pandas.core.frame.DataFrame If pd.Series, it must have a name equal to vars param
vars Union Variables to be lagged; must be a subset of df.columns()
lags Union 1 Which lags to be added
lag_suffix str _lag Used to create new lagged variable names
lead_suffix str _lead Used to create new lead variable names
use_fast_lags bool False Weather to use fast_lag() function when lagging
Returns DataFrame

Because this makes a copy of df, when df is a large dataset, this should be used as df = add_lags(df).

add_lags(df['A'], vars='A')
A A_lag1
permno Mdate
1 2010-01 0.104762 NaN
2010-02 0.561185 0.104762
2010-04 0.452953 NaN
2 2010-01 0.481129 NaN
2010-02 0.627125 0.481129
2010-04 0.413195 NaN
add_lags(df, vars=['A','B'], lags=[2,-1])
date dtdate A B A_lag2 A_lead1 B_lag2 B_lead1
permno Mdate
1 2010-01 2010-01 2010-01-01 0.104762 0.923778 NaN 0.561185 NaN 0.629761
2010-02 2010-02 2010-02-01 0.561185 0.629761 NaN NaN NaN NaN
2010-04 2010-04 2010-04-01 0.452953 0.137185 0.561185 NaN 0.629761 NaN
2 2010-01 2010-01 2010-01-01 0.481129 0.734026 NaN 0.627125 NaN 0.525601
2010-02 2010-02 2010-02-01 0.627125 0.525601 NaN NaN NaN NaN
2010-04 2010-04 2010-04-01 0.413195 0.860653 0.627125 NaN 0.525601 NaN
add_lags(df,vars=['A','B'],lags=[2,-2], lag_suffix='_lg', lead_suffix='_ld')
date dtdate A B A_lg2 A_ld2 B_lg2 B_ld2
permno Mdate
1 2010-01 2010-01 2010-01-01 0.104762 0.923778 NaN NaN NaN NaN
2010-02 2010-02 2010-02-01 0.561185 0.629761 NaN 0.452953 NaN 0.137185
2010-04 2010-04 2010-04-01 0.452953 0.137185 0.561185 NaN 0.629761 NaN
2 2010-01 2010-01 2010-01-01 0.481129 0.734026 NaN NaN NaN NaN
2010-02 2010-02 2010-02-01 0.627125 0.525601 NaN 0.413195 NaN 0.860653
2010-04 2010-04 2010-04-01 0.413195 0.860653 0.627125 NaN 0.525601 NaN

And remember that by default, lag uses fast=True, which is not robust to duplicate dates (or unsorted dates).

Utilities using robust lagging


source

rpct_change

 rpct_change (df:pandas.core.series.Series, n:int=1, use_fast_lags=False)

Percentage change using robust lag() or fast_lag() function.

rpct_change(df['A'])
permno  Mdate  
1       2010-01         NaN
        2010-02    4.356736
        2010-04         NaN
2       2010-01         NaN
        2010-02    0.303446
        2010-04         NaN
dtype: float64

source

rdiff

 rdiff (df:pandas.core.series.Series, n:int=1, use_fast_lags=False)

Difference using robust lag() or fast_lag() function.

rdiff(df['A'])
permno  Mdate  
1       2010-01         NaN
        2010-02    0.456422
        2010-04         NaN
2       2010-01         NaN
        2010-02    0.145997
        2010-04         NaN
dtype: float64

source

rrolling

 rrolling (df:pandas.core.series.Series|pandas.core.frame.DataFrame,
           func:str, window:int=None, skipna:bool|None=False,
           use_fast_lags:bool=False)

Like pd.DataFrame.rolling but using robust lags. Run df = setup_tseries(df) or df = setup_panel(df) prior to using.

Type Default Details
df pandas.core.series.Series | pandas.core.frame.DataFrame Must have period date Index (if Series) or (panel_id, period_date) Multiindex (if DataFrame)
func str Name of any pandas aggregation function (to applied to df data within each rolling window
window int None Rolling window length; if None, uses ‘expanding’ without fixing lags
skipna bool | None False Use None if func does not take skipna arg.
use_fast_lags bool False
Returns Series
df.assign(rolling_A = rrolling(df['A'], func='mean', window=2, skipna=True))
date dtdate A B rolling_A
permno Mdate
1 2010-01 2010-01 2010-01-01 0.104762 0.923778 0.104762
2010-02 2010-02 2010-02-01 0.561185 0.629761 0.332974
2010-04 2010-04 2010-04-01 0.452953 0.137185 0.452953
2 2010-01 2010-01 2010-01-01 0.481129 0.734026 0.481129
2010-02 2010-02 2010-02-01 0.627125 0.525601 0.554127
2010-04 2010-04 2010-04-01 0.413195 0.860653 0.413195
df.assign(rolling_A = rrolling(df['A'], func='mean', window=2, skipna=False))
date dtdate A B rolling_A
permno Mdate
1 2010-01 2010-01 2010-01-01 0.104762 0.923778 NaN
2010-02 2010-02 2010-02-01 0.561185 0.629761 0.332974
2010-04 2010-04 2010-04-01 0.452953 0.137185 NaN
2 2010-01 2010-01 2010-01-01 0.481129 0.734026 NaN
2010-02 2010-02 2010-02-01 0.627125 0.525601 0.554127
2010-04 2010-04 2010-04-01 0.413195 0.860653 NaN

Test that it works for a time-series, not just a panel

df['A'].loc[1]
Mdate
2010-01    0.104762
2010-02    0.561185
2010-04    0.452953
Freq: M, Name: A, dtype: float64
rrolling(df['A'].loc[1], func='mean', window=2, skipna=True)
Mdate
2010-01    0.104762
2010-02    0.332974
2010-04    0.452953
Freq: M, dtype: float64

Very common data transformations


source

wins

 wins (df:pandas.core.series.Series|pandas.core.frame.DataFrame, low=0.01,
       high=0.99, byvars:List[str]=None)

Winsorizes all columns in df.

Type Default Details
df pandas.core.series.Series | pandas.core.frame.DataFrame
low float 0.01 Lower quantile at which to winsorize
high float 0.99 Upper quantile at which to winsorize
byvars List None If None, quantiles use full sample, o/w they are calculate within each group given by byvars
Returns DataFrame

source

norm

 norm (df:pandas.core.series.Series|pandas.core.frame.DataFrame,
       divide_by_mean=False)

Subtract means from all columns of df and divide by their std. deviations, unless divide_by_mean is True

I/O


source

to_stata

 to_stata (df:pandas.core.frame.DataFrame=None, outfile:str=None,
           obj_drop:bool=False, obj_to_str:bool=False, **to_stata_kwargs)

Writes df to stata outfile

Type Default Details
df DataFrame None
outfile str None Output file path; must include .dta extension
obj_drop bool False Whether to drop all columns of object type
obj_to_str bool False Whether to convert all columns of object type to string type
to_stata_kwargs VAR_KEYWORD

Index data is automatically included in the output, unless the index is the default range from 0 to len(df).

Columns of object data type are by default left to pd.to_stata to figure out how to convert. Note that columns of strings with missing data might cause an error in this default case. In this case, the best thing to do is for you convert it to string type before calling this function. If not, setting obj_to_str to True will deal with this internally but it will be slower.

For time data, columns of period type and columns of datetime64[ns] type with time zone information will be dropped. All other datetime64[ns] variables will be converted to td dates in Stata.

to_stata(df, outfile='../data/df.dta', version=117)
df = pd.read_stata('../data/df.dta')
df
permno date dtdate A B
0 1 2010-01 2010-01-01 0.104762 0.923778
1 1 2010-02 2010-02-01 0.561185 0.629761
2 1 2010-04 2010-04-01 0.452953 0.137185
3 2 2010-01 2010-01-01 0.481129 0.734026
4 2 2010-02 2010-02-01 0.627125 0.525601
5 2 2010-04 2010-04-01 0.413195 0.860653

Sorting and binning


source

bins_using_masked_cutoffs

 bins_using_masked_cutoffs (df:pandas.core.frame.DataFrame=None,
                            sortvar:str=None, maskvar:str=None,
                            quantiles:list=None, outvar:str=None)

Returns column of bin numbers (1 to len(quantiles)) created by binning sortvar based on cuttoffs give by quantiles of df.loc[df[maskvar], sortvar]

Type Default Details
df DataFrame None Dataframe containing sortvar and maskvar. Must have panelvar x datevar multiindex
sortvar str None Variable containing the values to be binned
maskvar str None Mask to be applied to df[sortvar] before bin cutoffs are calculated
quantiles list None List of quantiles to be applied to df.loc[df[maskvar], sortvar] to determine bin cutoffs
outvar str None Name to give to the column of bins created. If none, will use f”{sortvar}_bins”
Returns DataFrame
  • Report an issue