Exported source
from __future__ import annotations
from typing import List, Callable
import os, glob
import pandas as pd
import numpy as npAlmost all these functions make a copy of the input DataFrame. When that DataFrame is large, use these functions as df = func(df).
First, we set up an example dataset to showcase the functions in this module.
| permno | date | A | B | |
|---|---|---|---|---|
| 0 | 1.0 | NaN | 0.234344 | 0.698915 |
| 1 | 1.0 | 2010-01 | 0.104762 | 0.923778 |
| 2 | 1.0 | 2010-02 | 0.134963 | 0.287128 |
| 3 | 1.0 | 2010-02 | 0.561185 | 0.629761 |
| 4 | 1.0 | 2010-04 | 0.452953 | 0.137185 |
| 5 | 2.0 | NaN | 0.242651 | 0.010276 |
| 6 | 2.0 | 2010-01 | 0.481129 | 0.734026 |
| 7 | 2.0 | 2010-02 | 0.663068 | 0.887246 |
| 8 | 2.0 | 2010-02 | 0.627125 | 0.525601 |
| 9 | 2.0 | 2010-04 | 0.413195 | 0.860653 |
| 10 | NaN | NaN | 0.858186 | 0.907775 |
| 11 | NaN | 2010-01 | 0.660384 | 0.601805 |
| 12 | NaN | 2010-02 | 0.037070 | 0.510598 |
| 13 | NaN | 2010-02 | 0.579008 | 0.682778 |
| 14 | NaN | 2010-04 | 0.697347 | 0.750504 |
order_columns (df:pandas.core.frame.DataFrame, these_first:List[str])
Returns df with reordered columns. Use as df = order_columns(df,_)
| B | permno | date | A | |
|---|---|---|---|---|
| 0 | 0.698915 | 1.0 | NaN | 0.234344 |
| 1 | 0.923778 | 1.0 | 2010-01 | 0.104762 |
| 2 | 0.287128 | 1.0 | 2010-02 | 0.134963 |
| 3 | 0.629761 | 1.0 | 2010-02 | 0.561185 |
| 4 | 0.137185 | 1.0 | 2010-04 | 0.452953 |
process_dates (df:pandas.core.frame.DataFrame, time_var:str='date', time_var_format:str='%Y-%m-%d', dtdate_var:str='dtdate', freq:str=None)
Makes datetime date dtdate_var from time_var; adds period date f'{freq}date'.
| Type | Default | Details | |
|---|---|---|---|
| df | DataFrame | Function returns copy of this df with dtdate_var and f'{freq}date' cols added |
|
| time_var | str | date | This will be the date variable used to generate datetime var dtdate_var |
| time_var_format | str | %Y-%m-%d | Format of time_var; must be valid pandas strftime |
| dtdate_var | str | dtdate | Name of datetime var to be created from time_var |
| freq | str | None | Used to create f'{freq}date' period date; must be valid pandas offset string |
| Returns | DataFrame |
| date | dtdate | Mdate | permno | A | B | |
|---|---|---|---|---|---|---|
| 0 | NaN | NaT | NaT | 1.0 | 0.234344 | 0.698915 |
| 1 | 2010-01 | 2010-01-01 | 2010-01 | 1.0 | 0.104762 | 0.923778 |
| 2 | 2010-02 | 2010-02-01 | 2010-02 | 1.0 | 0.134963 | 0.287128 |
| 3 | 2010-02 | 2010-02-01 | 2010-02 | 1.0 | 0.561185 | 0.629761 |
| 4 | 2010-04 | 2010-04-01 | 2010-04 | 1.0 | 0.452953 | 0.137185 |
setup_tseries (df:pandas.core.series.Series|pandas.core.frame.DataFrame, dates_processed:bool=False, time_var:str='date', time_var_format:str='%Y-%m-%d', dtdate_var:str='dtdate', freq:str=None, drop_missing_index_vals:bool=True, drop_index_duplicates:bool=True, duplicates_which_keep:str='last')
Applies process_dates to df; cleans up resulting f'{freq}date' period date and sets it as index.
| Type | Default | Details | |
|---|---|---|---|
| df | pandas.core.series.Series | pandas.core.frame.DataFrame | Input DataFrame; a copy is returned | |
| dates_processed | bool | False | If True, assumes dates are already processed with process_dates |
| time_var | str | date | This will be the date variable used to generate datetime var dtdate_var |
| time_var_format | str | %Y-%m-%d | Format of time_var; must be valid pandas strftime |
| dtdate_var | str | dtdate | Name of datetime var to be created from time_var |
| freq | str | None | Used to create f'{freq}date' period date; must be valid pandas offset string |
| drop_missing_index_vals | bool | True | What to do with missing f'{freq}date' |
| drop_index_duplicates | bool | True | What to do with duplicates in f'{freq}date' values |
| duplicates_which_keep | str | last | If duplicates in index, which to keep; must be ‘first’, ‘last’ or False |
| Returns | DataFrame |
| permno | date | A | B | |
|---|---|---|---|---|
| 0 | 1.0 | NaN | 0.234344 | 0.698915 |
| 1 | 1.0 | 2010-01 | 0.104762 | 0.923778 |
| 2 | 1.0 | 2010-02 | 0.134963 | 0.287128 |
| 3 | 1.0 | 2010-02 | 0.561185 | 0.629761 |
| 4 | 1.0 | 2010-04 | 0.452953 | 0.137185 |
| date | dtdate | permno | A | B | |
|---|---|---|---|---|---|
| Mdate | |||||
| 2010-01 | 2010-01 | 2010-01-01 | 1.0 | 0.104762 | 0.923778 |
| 2010-02 | 2010-02 | 2010-02-01 | 1.0 | 0.561185 | 0.629761 |
| 2010-04 | 2010-04 | 2010-04-01 | 1.0 | 0.452953 | 0.137185 |
setup_panel (df:pandas.core.frame.DataFrame, panel_ids:str=None, dates_processed:bool=False, time_var:str='date', time_var_format:str='%Y-%m-%d', dtdate_var:str='dtdate', freq:str=None, drop_missing_index_vals:bool=True, panel_ids_toint:str='Int64', drop_index_duplicates:bool=True, duplicates_which_keep:str='last')
Applies process_dates to df; cleans up (panel_ids ,f'{freq}date') and sets it as index.
| Type | Default | Details | |
|---|---|---|---|
| df | DataFrame | Input DataFrame; a copy is returned | |
| panel_ids | str | None | Name of variable that identifies panel entities |
| dates_processed | bool | False | If True, assumes dates are already processed with process_dates |
| time_var | str | date | This will be the date variable used to generate datetime var dtdate_var |
| time_var_format | str | %Y-%m-%d | Format of time_var; must be valid pandas strftime |
| dtdate_var | str | dtdate | Name of datetime var to be created from time_var |
| freq | str | None | Used to create f'{freq}date' period date; must be valid pandas offset string |
| drop_missing_index_vals | bool | True | What to do with missing panel_ids or f'{freq}date' |
| panel_ids_toint | str | Int64 | Converts panel_ids to int in place; use falsy value if not wanted |
| drop_index_duplicates | bool | True | What to do with duplicates in (panel_ids, f'{freq}date') values |
| duplicates_which_keep | str | last | If duplicates in index, which to keep; must be ‘first’, ‘last’ or False |
| Returns | DataFrame |
| date | dtdate | A | B | ||
|---|---|---|---|---|---|
| permno | Mdate | ||||
| 1 | 2010-01 | 2010-01 | 2010-01-01 | 0.104762 | 0.923778 |
| 2010-02 | 2010-02 | 2010-02-01 | 0.561185 | 0.629761 | |
| 2010-04 | 2010-04 | 2010-04-01 | 0.452953 | 0.137185 | |
| 2 | 2010-01 | 2010-01 | 2010-01-01 | 0.481129 | 0.734026 |
| 2010-02 | 2010-02 | 2010-02-01 | 0.627125 | 0.525601 | |
| 2010-04 | 2010-04 | 2010-04-01 | 0.413195 | 0.860653 |
Lagging with shift fails when we have (1) panel data, (2) gaps in the time-series, (3) duplicate dates, (4) data is not sorted by dates (5) NaN dates.
The fast_lag function below correctly lags data (using shift()), assuming we do not have problems (2), (3), (4), and (5).
The lag function below correctly lags data (using merge()), assuming we do not have problem (5).
fast_lag (df:pandas.core.series.Series|pandas.core.frame.DataFrame, n:int=1)
Lag data in df by n periods. ASSUMES DATA IS SORTED BY DATES AND HAS NO DUPLICATE OR NaN DATES, AND NO GAPS IN THE TIME SERIES. Apply df = setup_panel(df) before using.
| Type | Default | Details | |
|---|---|---|---|
| df | pandas.core.series.Series | pandas.core.frame.DataFrame | Index of df (or level 1 of MultiIndex) must be pandas period date. |
|
| n | int | 1 | Number of periods to lag based on frequency of df.index; Negative values means lead. |
| Returns | Series | **Series with lagged values of df; Name is taken from df.columns[0], with ’_lag{n}’ or ’_lead{n}’ suffixed.** |
lag (df:pandas.core.series.Series|pandas.core.frame.DataFrame, n:int=1, fast:bool=False)
Lag data in ‘df’ by ‘n’ periods. ASSUMES NO NaN DATES. Apply df = setup_panel(df) before using.
| Type | Default | Details | |
|---|---|---|---|
| df | pandas.core.series.Series | pandas.core.frame.DataFrame | Index (or level 1 of MultiIndex) must be period date with no missing values. | |
| n | int | 1 | Number of periods to lag based on frequency of df.index; Negative values means lead. |
| fast | bool | False | If True, uses fast_lag(), which assumes data is sorted by date and has no duplicate or missing dates |
| Returns | Series | **Series with lagged values of df; Name is taken from df.columns[0], with ’_lag{n}’ or ’_lead{n}’ suffixed.** |
The index of the df parameter can not contain missing values.
permno Mdate
1 2010-01 NaN
2010-02 0.104762
2010-04 NaN
2 2010-01 NaN
2010-02 0.481129
2010-04 NaN
Name: A_lag1, dtype: float64
permno Mdate
1 2010-01 NaN
2010-02 0.104762
2010-04 NaN
2 2010-01 NaN
2010-02 0.481129
2010-04 NaN
Name: A_lag1, dtype: float64
add_lags (df:pandas.core.series.Series|pandas.core.frame.DataFrame, vars:Union[str,List[str]], lags:Union[int,List[int]]=1, lag_suffix:str='_lag', lead_suffix:str='_lead', use_fast_lags:bool=False)
Returns a copy of df with all lags of all vars added to it.
| Type | Default | Details | |
|---|---|---|---|
| df | pandas.core.series.Series | pandas.core.frame.DataFrame | If pd.Series, it must have a name equal to vars param |
|
| vars | Union | Variables to be lagged; must be a subset of df.columns() |
|
| lags | Union | 1 | Which lags to be added |
| lag_suffix | str | _lag | Used to create new lagged variable names |
| lead_suffix | str | _lead | Used to create new lead variable names |
| use_fast_lags | bool | False | Weather to use fast_lag() function when lagging |
| Returns | DataFrame |
Because this makes a copy of df, when df is a large dataset, this should be used as df = add_lags(df).
| A | A_lag1 | ||
|---|---|---|---|
| permno | Mdate | ||
| 1 | 2010-01 | 0.104762 | NaN |
| 2010-02 | 0.561185 | 0.104762 | |
| 2010-04 | 0.452953 | NaN | |
| 2 | 2010-01 | 0.481129 | NaN |
| 2010-02 | 0.627125 | 0.481129 | |
| 2010-04 | 0.413195 | NaN |
| date | dtdate | A | B | A_lag2 | A_lead1 | B_lag2 | B_lead1 | ||
|---|---|---|---|---|---|---|---|---|---|
| permno | Mdate | ||||||||
| 1 | 2010-01 | 2010-01 | 2010-01-01 | 0.104762 | 0.923778 | NaN | 0.561185 | NaN | 0.629761 |
| 2010-02 | 2010-02 | 2010-02-01 | 0.561185 | 0.629761 | NaN | NaN | NaN | NaN | |
| 2010-04 | 2010-04 | 2010-04-01 | 0.452953 | 0.137185 | 0.561185 | NaN | 0.629761 | NaN | |
| 2 | 2010-01 | 2010-01 | 2010-01-01 | 0.481129 | 0.734026 | NaN | 0.627125 | NaN | 0.525601 |
| 2010-02 | 2010-02 | 2010-02-01 | 0.627125 | 0.525601 | NaN | NaN | NaN | NaN | |
| 2010-04 | 2010-04 | 2010-04-01 | 0.413195 | 0.860653 | 0.627125 | NaN | 0.525601 | NaN |
| date | dtdate | A | B | A_lg2 | A_ld2 | B_lg2 | B_ld2 | ||
|---|---|---|---|---|---|---|---|---|---|
| permno | Mdate | ||||||||
| 1 | 2010-01 | 2010-01 | 2010-01-01 | 0.104762 | 0.923778 | NaN | NaN | NaN | NaN |
| 2010-02 | 2010-02 | 2010-02-01 | 0.561185 | 0.629761 | NaN | 0.452953 | NaN | 0.137185 | |
| 2010-04 | 2010-04 | 2010-04-01 | 0.452953 | 0.137185 | 0.561185 | NaN | 0.629761 | NaN | |
| 2 | 2010-01 | 2010-01 | 2010-01-01 | 0.481129 | 0.734026 | NaN | NaN | NaN | NaN |
| 2010-02 | 2010-02 | 2010-02-01 | 0.627125 | 0.525601 | NaN | 0.413195 | NaN | 0.860653 | |
| 2010-04 | 2010-04 | 2010-04-01 | 0.413195 | 0.860653 | 0.627125 | NaN | 0.525601 | NaN |
And remember that by default, lag uses fast=True, which is not robust to duplicate dates (or unsorted dates).
rpct_change (df:pandas.core.series.Series, n:int=1, use_fast_lags=False)
Percentage change using robust lag() or fast_lag() function.
permno Mdate
1 2010-01 NaN
2010-02 4.356736
2010-04 NaN
2 2010-01 NaN
2010-02 0.303446
2010-04 NaN
dtype: float64
rdiff (df:pandas.core.series.Series, n:int=1, use_fast_lags=False)
Difference using robust lag() or fast_lag() function.
permno Mdate
1 2010-01 NaN
2010-02 0.456422
2010-04 NaN
2 2010-01 NaN
2010-02 0.145997
2010-04 NaN
dtype: float64
rrolling (df:pandas.core.series.Series|pandas.core.frame.DataFrame, func:str, window:int=None, skipna:bool|None=False, use_fast_lags:bool=False)
Like pd.DataFrame.rolling but using robust lags. Run df = setup_tseries(df) or df = setup_panel(df) prior to using.
| Type | Default | Details | |
|---|---|---|---|
| df | pandas.core.series.Series | pandas.core.frame.DataFrame | Must have period date Index (if Series) or (panel_id, period_date) Multiindex (if DataFrame) | |
| func | str | Name of any pandas aggregation function (to applied to df data within each rolling window |
|
| window | int | None | Rolling window length; if None, uses ‘expanding’ without fixing lags |
| skipna | bool | None | False | Use None if func does not take skipna arg. |
| use_fast_lags | bool | False | |
| Returns | Series |
| date | dtdate | A | B | rolling_A | ||
|---|---|---|---|---|---|---|
| permno | Mdate | |||||
| 1 | 2010-01 | 2010-01 | 2010-01-01 | 0.104762 | 0.923778 | 0.104762 |
| 2010-02 | 2010-02 | 2010-02-01 | 0.561185 | 0.629761 | 0.332974 | |
| 2010-04 | 2010-04 | 2010-04-01 | 0.452953 | 0.137185 | 0.452953 | |
| 2 | 2010-01 | 2010-01 | 2010-01-01 | 0.481129 | 0.734026 | 0.481129 |
| 2010-02 | 2010-02 | 2010-02-01 | 0.627125 | 0.525601 | 0.554127 | |
| 2010-04 | 2010-04 | 2010-04-01 | 0.413195 | 0.860653 | 0.413195 |
| date | dtdate | A | B | rolling_A | ||
|---|---|---|---|---|---|---|
| permno | Mdate | |||||
| 1 | 2010-01 | 2010-01 | 2010-01-01 | 0.104762 | 0.923778 | NaN |
| 2010-02 | 2010-02 | 2010-02-01 | 0.561185 | 0.629761 | 0.332974 | |
| 2010-04 | 2010-04 | 2010-04-01 | 0.452953 | 0.137185 | NaN | |
| 2 | 2010-01 | 2010-01 | 2010-01-01 | 0.481129 | 0.734026 | NaN |
| 2010-02 | 2010-02 | 2010-02-01 | 0.627125 | 0.525601 | 0.554127 | |
| 2010-04 | 2010-04 | 2010-04-01 | 0.413195 | 0.860653 | NaN |
Test that it works for a time-series, not just a panel
Mdate
2010-01 0.104762
2010-02 0.561185
2010-04 0.452953
Freq: M, Name: A, dtype: float64
wins (df:pandas.core.series.Series|pandas.core.frame.DataFrame, low=0.01, high=0.99, byvars:List[str]=None)
Winsorizes all columns in df.
| Type | Default | Details | |
|---|---|---|---|
| df | pandas.core.series.Series | pandas.core.frame.DataFrame | ||
| low | float | 0.01 | Lower quantile at which to winsorize |
| high | float | 0.99 | Upper quantile at which to winsorize |
| byvars | List | None | If None, quantiles use full sample, o/w they are calculate within each group given by byvars |
| Returns | DataFrame |
norm (df:pandas.core.series.Series|pandas.core.frame.DataFrame, divide_by_mean=False)
Subtract means from all columns of df and divide by their std. deviations, unless divide_by_mean is True
to_stata (df:pandas.core.frame.DataFrame=None, outfile:str=None, obj_drop:bool=False, obj_to_str:bool=False, **to_stata_kwargs)
Writes df to stata outfile
| Type | Default | Details | |
|---|---|---|---|
| df | DataFrame | None | |
| outfile | str | None | Output file path; must include .dta extension |
| obj_drop | bool | False | Whether to drop all columns of object type |
| obj_to_str | bool | False | Whether to convert all columns of object type to string type |
| to_stata_kwargs | VAR_KEYWORD |
Index data is automatically included in the output, unless the index is the default range from 0 to len(df).
Columns of object data type are by default left to pd.to_stata to figure out how to convert. Note that columns of strings with missing data might cause an error in this default case. In this case, the best thing to do is for you convert it to string type before calling this function. If not, setting obj_to_str to True will deal with this internally but it will be slower.
For time data, columns of period type and columns of datetime64[ns] type with time zone information will be dropped. All other datetime64[ns] variables will be converted to td dates in Stata.
| permno | date | dtdate | A | B | |
|---|---|---|---|---|---|
| 0 | 1 | 2010-01 | 2010-01-01 | 0.104762 | 0.923778 |
| 1 | 1 | 2010-02 | 2010-02-01 | 0.561185 | 0.629761 |
| 2 | 1 | 2010-04 | 2010-04-01 | 0.452953 | 0.137185 |
| 3 | 2 | 2010-01 | 2010-01-01 | 0.481129 | 0.734026 |
| 4 | 2 | 2010-02 | 2010-02-01 | 0.627125 | 0.525601 |
| 5 | 2 | 2010-04 | 2010-04-01 | 0.413195 | 0.860653 |
bins_using_masked_cutoffs (df:pandas.core.frame.DataFrame=None, sortvar:str=None, maskvar:str=None, quantiles:list=None, outvar:str=None)
Returns column of bin numbers (1 to len(quantiles)) created by binning sortvar based on cuttoffs give by quantiles of df.loc[df[maskvar], sortvar]
| Type | Default | Details | |
|---|---|---|---|
| df | DataFrame | None | Dataframe containing sortvar and maskvar. Must have panelvar x datevar multiindex |
| sortvar | str | None | Variable containing the values to be binned |
| maskvar | str | None | Mask to be applied to df[sortvar] before bin cutoffs are calculated |
| quantiles | list | None | List of quantiles to be applied to df.loc[df[maskvar], sortvar] to determine bin cutoffs |
| outvar | str | None | Name to give to the column of bins created. If none, will use f”{sortvar}_bins” |
| Returns | DataFrame |