from __future__ import annotations
from typing import List
import pandas as pd
import numpy as np
import pandasmore as pdm
from finsets.wrds import wrds_apicompa_ccm
PROVIDER = 'Wharton Research Data Services (WRDS)'
URL = 'https://wrds-www.wharton.upenn.edu/pages/get-data/center-research-security-prices-crsp/annual-update/crspcompustat-merged/fundamentals-annual/'
LIBRARY = 'comp'
TABLE = 'funda'
LINK_LIBRARY = 'crsp'
LINK_TABLE = 'ccmxpf_lnkhist'
COMPANY_TABLE = 'company' #contains some header information that is missing from comp.funda (e.g. sic and naics) 
FREQ = 'A'
MIN_YEAR = 1950
MAX_YEAR = None
ENTITY_ID_IN_RAW_DSET = 'permno'
ENTITY_ID_IN_CLEAN_DSET = 'permno'
TIME_VAR_IN_RAW_DSET = 'datadate'
TIME_VAR_IN_CLEAN_DSET = f'{FREQ}date'list_all_vars
list_all_vars ()
Collects names of all available variables from WRDS f{LIBRARY}.{TABLE} and {LIBRARY}.{COMPANY_TABLE}.
all_vars = list_all_vars()all_vars.name.count()987default_raw_vars
default_raw_vars ()
Defines default variables used in get_raw_data if none are specified.
print(default_raw_vars())['datadate', 'gvkey', 'cusip', 'cik', 'tic', 'fyear', 'fyr', 'fic', 'naicsh', 'sich', 'sic', 'naics', 'exchg', 'lt', 'at', 'txditc', 'pstkl', 'pstkrv', 'pstk', 'csho', 'ajex', 'rdip', 'act', 'dvc', 'xad', 'seq', 'che', 'lct', 'dlc', 'ib', 'dvp', 'txdi', 'dp', 'txp', 'oancf', 'ivncf', 'fincf', 'dltt', 'mib', 'ceq', 'invt', 'cogs', 'revt', 'sale', 'capx', 'xrd', 'txdb', 'prcc_f', 'sstk', 'prstkc', 'dltis', 'dltr', 'emp', 'dd1', 'ppegt', 'ppent', 'xint', 'txt', 'sppe', 'gdwl', 'xrent', 're', 'dvpsx_f', 'tstk', 'wcap', 'rect', 'xsga', 'aqc', 'oibdp', 'dpact', 'ni', 'ivao', 'ivst', 'dv', 'intan', 'pi', 'txfo', 'pifo', 'xpp', 'drc', 'drlt', 'ap', 'xacc', 'itcb']parse_varlist
parse_varlist (vars:List[str]=None, required_vars=[])
Figures out which vars come from the {LIBRARY}.{TABLE} table and which come from the {LIBRARY}.{COMPANY_TABLE} table and adds a. and b. prefixes to variable names to feed into an SQL query
parse_varlist(['at','lt'])'a.at,a.lt'get_raw_data
get_raw_data (vars:List[str]=None, required_vars:List[str]=['gvkey', 'datadate'], nrows:int=None, start_date:str=None, end_date:str=None)
Downloads vars from start_date to end_date from WRDS {LIBRARY}.{TABLE} library and adds PERMNO and PERMCO as in CCM
| Type | Default | Details | |
|---|---|---|---|
| vars | List[str] | None | If None, downloads default_raw_vars; use ’*’ to get all available variables | 
| required_vars | List[str] | [‘gvkey’, ‘datadate’] | list of variables that will get downloaded, even if not in vars | 
| nrows | int | None | Number of rows to download. If None, full dataset will be downloaded | 
| start_date | str | None | Start date in MM/DD/YYYY format | 
| end_date | str | None | End date in MM/DD/YYYY format | 
| Returns | pd.DataFrame | 
The get_raw_data function will produce identical results to the ones we would obtain if we used the WRDS website (with the default options unchanged).
It results in no permno-datadate duplicates, but there is a small number of gvkey-datadate duplicates (about 1% of the data) because each permno maps to a unique gvkey+iid value and some gvkeys have multiple share classes (different iid’s).
If we restrict ourselves to primary securities, i.e. linkprim in ('P','C') (which retains 99% of the data), this results in unique gvkey-datadate records which is why we use it as the default option in the process_raw_data function below.
r = get_raw_data(vars='*', nrows=1)
r| permno | permco | liid | linkprim | gvkey | datadate | fyear | indfmt | consol | popsrc | ... | priusa | sic | spcindcd | spcseccd | spcsrc | state | stko | weburl | dldte | ipodate | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25881.0 | 23369.0 | 01 | P | 001000 | 1970-12-31 | 1970.0 | INDL | C | D | ... | 01 | 3089 | 325.0 | 978.0 | None | None | 0.0 | None | 1978-06-30 | None | 
1 rows × 986 columns
raw = get_raw_data(vars = ['at', 'lt'], start_date='01/01/2021', end_date='01/01/2022', nrows=1)raw.head(0)| permno | permco | liid | linkprim | gvkey | datadate | at | lt | 
|---|
raw = get_raw_data(start_date='01/01/2021', end_date='01/01/2023')raw.head(0)| permno | permco | liid | linkprim | gvkey | datadate | rect | sich | cik | fyr | ... | ceq | dltt | ppegt | txdi | dltis | sstk | drc | rdip | cusip | txfo | 
|---|
0 rows × 88 columns
process_raw_data
process_raw_data (df:pandas.core.frame.DataFrame=None, linkprim_filter:list=['P', 'C'], clean_kwargs:dict={})
Applies pandasmore.setup_panel to df
| Type | Default | Details | |
|---|---|---|---|
| df | pd.DataFrame | None | Must contain permnoanddatadatecolumns | 
| linkprim_filter | list | [‘P’, ‘C’] | |
| clean_kwargs | dict | {} | Params to pass to pdm.setup_panelother thanpanel_ids,time_var, andfreq | 
| Returns | pd.DataFrame | 
df_clean = process_raw_data(raw)df_clean.head(0)| datadate | dtdate | permco | liid | linkprim | gvkey | rect | sich | cik | fyr | ... | ceq | dltt | ppegt | txdi | dltis | sstk | drc | rdip | cusip | txfo | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| permno | Adate | 
0 rows × 88 columns
features
features (df:pandas.core.frame.DataFrame=None)
ftrs = features(df_clean)ftrs.head(1)| sic_full | naics_full | stock_price | lag_at | mktcap | pstk0 | pref_stock | shreq | bookeq | bookeq_w_itcb | ... | cflow_is | cflow_cfs | cflow_full | cflow_is_2la | cflow_cfs_2la | cflow_full_2la | cash_2a | booklev | dividends_2la | repurchases_2la | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| permno | Adate | |||||||||||||||||||||
| 10026 | 2021 | 2050 | 311812 | 152.82 | NaN | 2916.41688 | 0.0 | 0.0 | 845.654 | 907.232 | 907.232 | ... | 104.998 | 101.499 | 101.499 | NaN | NaN | NaN | 0.259461 | 0.053934 | NaN | NaN | 
1 rows × 34 columns