Mihai Ion
  • Home
  • Research
  • Teaching
  • CV
  1. FIN 525
  2. Lectures
  3. L06: Pandas intro
  • Teaching
  • FIN 525
    • Lectures
      • L00: Jupyter basics
      • L01: Introduction
      • L02: Variables, types, operators
      • L03: Data structures
      • L04: Conditionals, loops
      • L05: Functions, packages
      • L06: Pandas intro
      • L07: Pandas I/O
      • L08: Pandas filtering
      • L09: Pandas data cleaning
      • L10: Merging, reshaping datasets
      • L11: Dates, lags, sorting
      • L12: Descriptive stats
      • L13: Conditional stats, outliers
      • L14: Conditional stats applied
      • L15: Linear regression intro
      • L16: Linear regression applications
      • L17: Panel regression intro
      • L18: Robust panel regression
      • L19: Robust timeseries regression
      • L20: Backtesting - data prep
      • L21: Backtesting - sumstats
      • L22: Backtesting -returns
      • L23: Backtesting - risk adjustment
  • FIN 421
    • Lectures
      • L01: Introduction
      • L02: Analyzing past returns
      • L03: Modeling future returns
      • L04: Portfolio theory intro
      • L05: Optimal capital allocation
      • L06: Tangency portfolios
      • L07_08: Optimal asset allocation
      • L09: Review
      • L10_11: Statistical models of returns
      • L12: CAPM
      • L13: Cost of equity
      • L14: Bond pricing
      • L15: Bond yields
      • L16: Bond risk
      • L17: Valuation data processing
      • L18_19: Multiples valuation
      • L20_21: Dividend discount models
      • L22_23: Discounted cash flow analysis
      • L24: Valuation sensitivity analysis
      • L25: Options intro
      • L26: Risk management with options

On this page

  • Pandas DataFrames
  • Creating dataframes with the “.DataFrame()” function
  • The structure of a DataFrame: index, columns, values
  • Manipulating the index
    • .reset_index()
    • .set_index()
    • .rename()
  • Manipulating columns
    • .rename()
    • Accessing and creating entire columns
  • Some common Pandas functions
  • Pandas Series

L06: Pandas intro

Pandas DataFrames

Almost all our data analysis in this course will be performed using a new data structure called Pandas DataFrames. These data structures are implemented in the Pandas package, which comes with the Anaconda installation. It is customary to import the Pandas package as below.

import pandas as pd

The Pandas package has a very rich functionality. In this course we will cover only a subset of what one can do with this package. If you are interested in a higher level of detail than what is covered in this course, I strongly recommend the official user guide for the package:

https://pandas.pydata.org/docs/user_guide/index.html

We will be referencing different parts of this user guide at different points in the class. You are not expected to know all the details in the user guide, only the parts that are covered in class, or in the practice problems.

The pandas package comes with many different functions/attributes. We will introduce some of the more commonly used attributes gradually over the course of the semester. Remember, in Pyhton, an object attribute is anything that you can put (using dot notation) after the name of that object. We will use the terms “attribute”, “method” and “function” interchangeably.

Creating dataframes with the “.DataFrame()” function

One of the most important functions in the pandas package is the .DataFrame() function, which allows us to create a new dataframe (this is similar to using [] to create lists, and using {} to create dictionaries).

The general syntax for .DataFrame() is as follows:

pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)

As an aside, every time we introduce a new Pandas function, you should check its official documentation to see how you’re supposed to use it. Usually, this is the first result if you Google “pandas” followed by the name of that function. For example, Google “pandas DataFrame()” and you should get this https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html as one of the first results.

If you imported pandas as pd (as above), you would replace “pandas” in the syntax above with “pd”.

For example:

df = pd.DataFrame(data=[['TSLA',1000],['AAPL',2000]], 
                  index = ['Tesla','Apple'], 
                  columns = ['ticker','price'])

The cell above creates a new variable df of type pandas.core.frame.DataFrame:

print(type(df))
<class 'pandas.core.frame.DataFrame'>

To print df, we can use the print() function as usual:

print(df)
      ticker  price
Tesla   TSLA   1000
Apple   AAPL   2000

Though simply using the name of the dataframe also works, and the output looks nicer. We will use this method throughout the course:

df
ticker price
Tesla TSLA 1000
Apple AAPL 2000

From the syntax of the .DataFrame() function, you should be able to tell that any of the three parameters of DataFrame can be omitted (because they all have default falues).

For example:

df2 = pd.DataFrame(data=[['TSLA',1000],['AAPL',2000]], index = ['Tesla','Apple'])
df2
0 1
Tesla TSLA 1000
Apple AAPL 2000
df3 = pd.DataFrame(data=[['TSLA',1000],['AAPL',2000]], columns = ['ticker','price'])
df3
ticker price
0 TSLA 1000
1 AAPL 2000
df4 = pd.DataFrame( index = ['Tesla','Apple'], columns = ['ticker','price'])
df4
ticker price
Tesla NaN NaN
Apple NaN NaN

Finally, if you check the documentation of .DataFrame(), you will see that there are several different types of objects you can supply to each of its parameters.

For example, we could have created our initial dataframe df using a dictionary for the data parameter (instead of a list):

df5 = pd.DataFrame(data = {'ticker': ['TSLA','AAPL'], 'price': [1000, 2000]}, 
                  index = ['Tesla','Apple'])
df5
ticker price
Tesla TSLA 1000
Apple AAPL 2000

Note how the columns parameter is missing, because column names are specified as the keys in the dictionary supplied to the data parameter.

We will not go through all the different ways to create a new dataframe. The key point is that most Pandas functions offer a large degree of flexibility with respect to how they can be used. This is very helpful for more advanced users, but it can be quite confusing for beginners. You can only get better at this with practice. Look at the documentation, and experiment with different ways to use that particular function. Check the results and see if you can explain what happened.

I will NOT expect you to know every little detail about the documentation of all the functions we introduce throughout the course. But you should be able to use those functions in the form that I use them in the notes. For example, given the code above, I expect you to know how to create a new dataframe from a list (first approach at the top) and from a dictionary (see code cell right above this one). In the code below, I also show you how to use the numpy package to create a dataframe containing some randomly generate data (this comes in handy when you want to quickly generate an example dataframe to test some of your code, when the contents of the data don’t really matter):

import numpy as np #usually we have all import statements at the top of the file
rdata = pd.DataFrame(data = np.random.rand(10,3), columns = list('abc'))
rdata
a b c
0 0.956222 0.766637 0.285147
1 0.437727 0.215754 0.932937
2 0.834266 0.366227 0.295984
3 0.297755 0.764719 0.124215
4 0.376822 0.802006 0.850846
5 0.768934 0.477132 0.203595
6 0.045300 0.445239 0.430616
7 0.124063 0.178999 0.262997
8 0.277809 0.292087 0.605525
9 0.610705 0.627855 0.923457

The structure of a DataFrame: index, columns, values

The syntax of the .DataFrame() function should make it clear what the main components of a dataframe are:

  • the index: the names of the rows (also referred to as axis 0 of the dataframe)
  • the columns: the names of the columns (also referred to as axis 1 of the dataframe)
  • the values: the data contained in the table itself

We can access each of these individual parts of the dataframe using the .index, .columns and .values attributes:

df
ticker price
Tesla TSLA 1000
Apple AAPL 2000
df.index
Index(['Tesla', 'Apple'], dtype='object')
df.columns
Index(['ticker', 'price'], dtype='object')
df.values
array([['TSLA', 1000],
       ['AAPL', 2000]], dtype=object)

Note that these attributes are not followed by parentheses. They are still attributes, but they do not compute anything, they simply return some property of the dataframe. It is important to pay attention to these details. For example, the index attribute above is very different from the pd.Index() attribute (which is more appropriately called a “function”):https://pandas.pydata.org/docs/reference/api/pandas.Index.html.

Finally, note that the data inside a dataframe is represented as a numpy array:

type(df.values)
numpy.ndarray

At this point in time, it is not important to know anything about the numpy package or the ndarray type that comes with it. I just wanted to point out a very important feature of the Python programming language: many packages are built using other existing packages (e.g. pandas uses the numpy and matplotlib packages to name a few).

The .info() function gives us another way to quickly check the structure of our DataFrame:

df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, Tesla to Apple
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   ticker  2 non-null      object
 1   price   2 non-null      int64 
dtypes: int64(1), object(1)
memory usage: 48.0+ bytes

The .describe() function gives us summary statistics for the numerical columns in the dataframe:

df.describe()
price
count 2.000000
mean 1500.000000
std 707.106781
min 1000.000000
25% 1250.000000
50% 1500.000000
75% 1750.000000
max 2000.000000

The .dtypes attribute tells us the data type for each column:

df.dtypes
ticker    object
price      int64
dtype: object

The .shape attribute tells us the shape of the dataframe:

df.shape
(2, 2)

Manipulating the index

df
ticker price
Tesla TSLA 1000
Apple AAPL 2000

.reset_index()

We can switch to a numerical index for our dataframe using the .reset_index() function.

Syntax:

reset_index(level=None, drop=False, inplace=False, col_level=0, col_fill='')

For example:

df.reset_index()
index ticker price
0 Tesla TSLA 1000
1 Apple AAPL 2000

Note how this pushed the old index inside the table itself, as a new column.

Very important: the code above did not actually change df:

df
ticker price
Tesla TSLA 1000
Apple AAPL 2000

To do that, we have to set the inplace parameter to True:

df.reset_index(inplace = True)
df
index ticker price
0 Tesla TSLA 1000
1 Apple AAPL 2000

Or we can simply re-write the df dataframe:

df = df.reset_index()
df
level_0 index ticker price
0 0 Tesla TSLA 1000
1 1 Apple AAPL 2000

.set_index()

We can use the data in one of the columns, as the new index of the table with the .set_index() function:

Syntax:

set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)

For example:

df = df.set_index('ticker')
df
level_0 index price
ticker
TSLA 0 Tesla 1000
AAPL 1 Apple 2000

Note that, once we make the ‘ticker’ column into an index, it stops showing up as a column:

df.columns
Index(['level_0', 'index', 'price'], dtype='object')

So if we try to set the index to ‘ticker’ again, we will get an error, because Python can not find the ‘ticker’ column:

#df.set_index('ticker') #this won't work

.rename()

We can rename one or more of the index values using the .rename() function.

Syntax:

rename(mapper=None, index=None, columns=None, axis=None, copy=True, inplace=False, level=None, errors='ignore')

We will supply a dictionary to the index parameter, where the keys are the current names we want to change in the index, and the values are the new names we want to use.

For example:

df.rename(index = {'TSLA': 'tsla'})
level_0 index price
ticker
tsla 0 Tesla 1000
AAPL 1 Apple 2000

As with .set_index(), .rename() does not actually change the df dataframe unless we set inplace = True or we redefine the df variable. This behavior should be expected of any pandas function that has an inplace parameter:

df
level_0 index price
ticker
TSLA 0 Tesla 1000
AAPL 1 Apple 2000

Manipulating columns

Let’s rewrite df to make sure we’re on the same page at this point:

ndf = pd.DataFrame(data=[['TSLA',1000],['AAPL',2000]], 
                  index = ['Tesla','Apple'], 
                  columns = ['ticker','price'])
ndf
ticker price
Tesla TSLA 1000
Apple AAPL 2000

.rename()

We can also rename one or more of the columns using the .rename() function.

This time, we will supply a dictionary to the columns parameter, where the keys are the current names of the columns we want to change, and the values are the new names we want to use.

For example

ndf = ndf.rename(columns = {'ticker': 't', 'price':'p'})
ndf
t p
Tesla TSLA 1000
Apple AAPL 2000

Accessing and creating entire columns

We can access the data in an entire column of a dataframe using square brackets after the dataframe name:

ndf['t']
Tesla    TSLA
Apple    AAPL
Name: t, dtype: object

If we want to access multiple columns at once, we have to put their names in a list:

ndf[['t','p']]
t p
Tesla TSLA 1000
Apple AAPL 2000

We can create new columns by just supplying the data manually:

ndf['new'] = [1,2]
ndf
t p new
Tesla TSLA 1000 1
Apple AAPL 2000 2

But most of the time, new columns will be created by bringing them over from other dataframes (we’ll cover this in a future lecture) or by some calculation involving other existing columns from the dataframe.

For example:

ndf['somecalc'] = ndf['p'] + ndf['new']
ndf
t p new somecalc
Tesla TSLA 1000 1 1001
Apple AAPL 2000 2 2002

Some common Pandas functions

In this section, we introduce some more Pandas functions that are very commonly used. These are very intuitive so we will not present the syntax for each of them. For the most part, it is very clear what these functions do, just from their name. We will cover many more such functions throughout the course.

First, let’s create an example dataframe with random integers:

rd = pd.DataFrame(data = np.random.rand(3,2), columns = list('ab'))
rd
a b
0 0.409593 0.323978
1 0.779860 0.474937
2 0.832432 0.196289
rd.head(2)
a b
0 0.409593 0.323978
1 0.779860 0.474937
rd.tail(2)
a b
1 0.779860 0.474937
2 0.832432 0.196289

Count number of non-missing values:

rd.count()
a    3
b    3
dtype: int64

Calculate means, medians, standard deviations and variances:

rd.mean()
a    0.673962
b    0.331734
dtype: float64
rd.median()
a    0.779860
b    0.323978
dtype: float64
rd.std()
a    0.230454
b    0.139486
dtype: float64
rd.var()
a    0.053109
b    0.019456
dtype: float64

Calculate percentage changes from one row to the next:

rd.pct_change()
a b
0 NaN NaN
1 0.903987 0.465956
2 0.067412 -0.586706

Calculate differences between the current row and a different row (default is the prior row):

rd.diff()
a b
0 NaN NaN
1 0.370267 0.150959
2 0.052572 -0.278648

Calculate cumulative sums (sum of values up to that row):

rd.cumsum()
a b
0 0.409593 0.323978
1 1.189453 0.798915
2 2.021885 0.995203

Calculate cummulative products (product of values up to that row):

rd.cumprod()
a b
0 0.409593 0.323978
1 0.319425 0.153869
2 0.265900 0.030203

Pandas Series

The pandas package comes with another commonly used data structure: the Series. This is a data structure that contains a single column and an index.

For example:

s = pd.Series(data = [12,34])
s
0    12
1    34
dtype: int64
type(s)
pandas.core.series.Series

We will almost never use Series in our analysis. We only introduce it here because some pandas functions return a Series as a result. For the most part, we will convert series to dataframes using the .to_frame() function:

s = s.to_frame()
s
0
0 12
1 34
type(s)
pandas.core.frame.DataFrame