finsets
  1. WRDS
  2. compa_ccm
  • finsets
  • FRED
    • fred
    • fred_api
  • WRDS
    • wrds_api
    • crspm
    • crspd
    • compa
    • compa_ccm
    • compq
    • compq_ccm
    • ratios
    • ibes_ltg
    • bondret
    • mergent
    • linking
  • Papers
    • Dickerson, et al. (2023)
    • Gilchrist and Zakrajsek (2012)
    • Gürkaynak, et al. (2007)
    • Hassan, et al. (2019)
    • Hoberg, Phillips (2010, 2016)
    • Peters and Taylor (2016)

On this page

  • list_all_vars
  • default_raw_vars
  • parse_varlist
  • get_raw_data
  • process_raw_data
  • features
  • Report an issue
  1. WRDS
  2. compa_ccm

compa_ccm

Retrieve and process data from WRDS CRSP/Compustat Merged (CCM) Fundamentals Annual
from __future__ import annotations
from typing import List

import pandas as pd
import numpy as np

import pandasmore as pdm
from finsets.wrds import wrds_api
PROVIDER = 'Wharton Research Data Services (WRDS)'
URL = 'https://wrds-www.wharton.upenn.edu/pages/get-data/center-research-security-prices-crsp/annual-update/crspcompustat-merged/fundamentals-annual/'
LIBRARY = 'comp'
TABLE = 'funda'
LINK_LIBRARY = 'crsp'
LINK_TABLE = 'ccmxpf_lnkhist'
COMPANY_TABLE = 'company' #contains some header information that is missing from comp.funda (e.g. sic and naics) 
FREQ = 'A'
MIN_YEAR = 1950
MAX_YEAR = None
ENTITY_ID_IN_RAW_DSET = 'permno'
ENTITY_ID_IN_CLEAN_DSET = 'permno'
TIME_VAR_IN_RAW_DSET = 'datadate'
TIME_VAR_IN_CLEAN_DSET = f'{FREQ}date'

source

list_all_vars

 list_all_vars ()

Collects names of all available variables from WRDS f{LIBRARY}.{TABLE} and {LIBRARY}.{COMPANY_TABLE}.

all_vars = list_all_vars()
all_vars.name.count()
987

source

default_raw_vars

 default_raw_vars ()

Defines default variables used in get_raw_data if none are specified.

print(default_raw_vars())
['datadate', 'gvkey', 'cusip', 'cik', 'tic', 'fyear', 'fyr', 'fic', 'naicsh', 'sich', 'sic', 'naics', 'exchg', 'lt', 'at', 'txditc', 'pstkl', 'pstkrv', 'pstk', 'csho', 'ajex', 'rdip', 'act', 'dvc', 'xad', 'seq', 'che', 'lct', 'dlc', 'ib', 'dvp', 'txdi', 'dp', 'txp', 'oancf', 'ivncf', 'fincf', 'dltt', 'mib', 'ceq', 'invt', 'cogs', 'revt', 'sale', 'capx', 'xrd', 'txdb', 'prcc_f', 'sstk', 'prstkc', 'dltis', 'dltr', 'emp', 'dd1', 'ppegt', 'ppent', 'xint', 'txt', 'sppe', 'gdwl', 'xrent', 're', 'dvpsx_f', 'tstk', 'wcap', 'rect', 'xsga', 'aqc', 'oibdp', 'dpact', 'ni', 'ivao', 'ivst', 'dv', 'intan', 'pi', 'txfo', 'pifo', 'xpp', 'drc', 'drlt', 'ap', 'xacc', 'itcb']

source

parse_varlist

 parse_varlist (vars:List[str]=None, required_vars=[])

Figures out which vars come from the {LIBRARY}.{TABLE} table and which come from the {LIBRARY}.{COMPANY_TABLE} table and adds a. and b. prefixes to variable names to feed into an SQL query

parse_varlist(['at','lt'])
'a.at,a.lt'

source

get_raw_data

 get_raw_data (vars:List[str]=None, required_vars:List[str]=['gvkey',
               'datadate'], nrows:int=None, start_date:str=None,
               end_date:str=None)

Downloads vars from start_date to end_date from WRDS {LIBRARY}.{TABLE} library and adds PERMNO and PERMCO as in CCM

Type Default Details
vars List[str] None If None, downloads default_raw_vars; use ’*’ to get all available variables
required_vars List[str] [‘gvkey’, ‘datadate’] list of variables that will get downloaded, even if not in vars
nrows int None Number of rows to download. If None, full dataset will be downloaded
start_date str None Start date in MM/DD/YYYY format
end_date str None End date in MM/DD/YYYY format
Returns pd.DataFrame

The get_raw_data function will produce identical results to the ones we would obtain if we used the WRDS website (with the default options unchanged).

It results in no permno-datadate duplicates, but there is a small number of gvkey-datadate duplicates (about 1% of the data) because each permno maps to a unique gvkey+iid value and some gvkeys have multiple share classes (different iid’s).

If we restrict ourselves to primary securities, i.e. linkprim in ('P','C') (which retains 99% of the data), this results in unique gvkey-datadate records which is why we use it as the default option in the process_raw_data function below.

r = get_raw_data(vars='*', nrows=1)
r
permno permco liid linkprim gvkey datadate fyear indfmt consol popsrc ... priusa sic spcindcd spcseccd spcsrc state stko weburl dldte ipodate
0 25881.0 23369.0 01 P 001000 1970-12-31 1970.0 INDL C D ... 01 3089 325.0 978.0 None None 0.0 None 1978-06-30 None

1 rows × 986 columns

raw = get_raw_data(vars = ['at', 'lt'], start_date='01/01/2021', end_date='01/01/2022', nrows=1)
raw.head(0)
permno permco liid linkprim gvkey datadate at lt
raw = get_raw_data(start_date='01/01/2021', end_date='01/01/2023')
raw.head(0)
permno permco liid linkprim gvkey datadate rect sich cik fyr ... ceq dltt ppegt txdi dltis sstk drc rdip cusip txfo

0 rows × 88 columns


source

process_raw_data

 process_raw_data (df:pandas.core.frame.DataFrame=None,
                   linkprim_filter:list=['P', 'C'], clean_kwargs:dict={})

Applies pandasmore.setup_panel to df

Type Default Details
df pd.DataFrame None Must contain permno and datadate columns
linkprim_filter list [‘P’, ‘C’]
clean_kwargs dict {} Params to pass to pdm.setup_panel other than panel_ids, time_var, and freq
Returns pd.DataFrame
df_clean = process_raw_data(raw)
df_clean.head(0)
datadate dtdate permco liid linkprim gvkey rect sich cik fyr ... ceq dltt ppegt txdi dltis sstk drc rdip cusip txfo
permno Adate

0 rows × 88 columns


source

features

 features (df:pandas.core.frame.DataFrame=None)
ftrs = features(df_clean)
ftrs.head(1)
sic_full naics_full stock_price lag_at mktcap pstk0 pref_stock shreq bookeq bookeq_w_itcb ... cflow_is cflow_cfs cflow_full cflow_is_2la cflow_cfs_2la cflow_full_2la cash_2a booklev dividends_2la repurchases_2la
permno Adate
10026 2021 2050 311812 152.82 NaN 2916.41688 0.0 0.0 845.654 907.232 907.232 ... 104.998 101.499 101.499 NaN NaN NaN 0.259461 0.053934 NaN NaN

1 rows × 34 columns

  • Report an issue