import pandas as pd
import numpy as np
import pandasmore as pdmpandasmore
Extends pandas with common functions used in finance and economics research
The full documentation site is here, and the GitHub page is here.
Here is a short description of some of the main functions (more details below and in the documentation):
setup_tseries: cleans up dates and sets them as the indexsetup_panel: cleans up dates and panel id’s and sets them as the index (panel id, period date)lag: robust lagging that accounts for panel structure, unsorted or duplicate dates, or gaps in the time-series
Install
pip install pandasmoreHow to use
First, we set up an example dataset to showcase the functions in this module.
raw = pd.DataFrame(np.random.rand(15,2),
columns=list('AB'),
index=pd.MultiIndex.from_product(
[[1,2, np.nan],[np.nan,'2010-01','2010-02','2010-02','2010-04']],
names = ['firm_id','date'])
).reset_index()
raw| firm_id | date | A | B | |
|---|---|---|---|---|
| 0 | 1.0 | NaN | 0.249370 | 0.926335 |
| 1 | 1.0 | 2010-01 | 0.282501 | 0.513859 |
| 2 | 1.0 | 2010-02 | 0.804278 | 0.307171 |
| 3 | 1.0 | 2010-02 | 0.828895 | 0.746789 |
| 4 | 1.0 | 2010-04 | 0.569099 | 0.331814 |
| 5 | 2.0 | NaN | 0.533977 | 0.823457 |
| 6 | 2.0 | 2010-01 | 0.207558 | 0.401378 |
| 7 | 2.0 | 2010-02 | 0.086001 | 0.959371 |
| 8 | 2.0 | 2010-02 | 0.054230 | 0.993980 |
| 9 | 2.0 | 2010-04 | 0.062525 | 0.200272 |
| 10 | NaN | NaN | 0.091012 | 0.635409 |
| 11 | NaN | 2010-01 | 0.866369 | 0.972394 |
| 12 | NaN | 2010-02 | 0.432087 | 0.837597 |
| 13 | NaN | 2010-02 | 0.878219 | 0.148009 |
| 14 | NaN | 2010-04 | 0.820386 | 0.834821 |
df = pdm.setup_tseries(raw.query('firm_id==1'),
time_var='date', time_var_format="%Y-%m",
freq='M')
df| date | dtdate | firm_id | A | B | |
|---|---|---|---|---|---|
| Mdate | |||||
| 2010-01 | 2010-01 | 2010-01-01 | 1.0 | 0.282501 | 0.513859 |
| 2010-02 | 2010-02 | 2010-02-01 | 1.0 | 0.828895 | 0.746789 |
| 2010-04 | 2010-04 | 2010-04-01 | 1.0 | 0.569099 | 0.331814 |
df = pdm.setup_panel(raw,
panel_ids='firm_id',
time_var='date', time_var_format="%Y-%m",
freq='M')
df| date | dtdate | A | B | ||
|---|---|---|---|---|---|
| firm_id | Mdate | ||||
| 1 | 2010-01 | 2010-01 | 2010-01-01 | 0.282501 | 0.513859 |
| 2010-02 | 2010-02 | 2010-02-01 | 0.828895 | 0.746789 | |
| 2010-04 | 2010-04 | 2010-04-01 | 0.569099 | 0.331814 | |
| 2 | 2010-01 | 2010-01 | 2010-01-01 | 0.207558 | 0.401378 |
| 2010-02 | 2010-02 | 2010-02-01 | 0.054230 | 0.993980 | |
| 2010-04 | 2010-04 | 2010-04-01 | 0.062525 | 0.200272 |
pdm.lag(df['A'])firm_id Mdate
1 2010-01 NaN
2010-02 0.282501
2010-04 NaN
2 2010-01 NaN
2010-02 0.207558
2010-04 NaN
Name: A_lag1, dtype: float64