Mihai Ion
  • Home
  • Research
  • Teaching
  • CV
  1. FIN 525
  2. Lectures
  3. L10: Merging, reshaping datasets
  • Teaching
  • FIN 525
    • Lectures
      • L00: Jupyter basics
      • L01: Introduction
      • L02: Variables, types, operators
      • L03: Data structures
      • L04: Conditionals, loops
      • L05: Functions, packages
      • L06: Pandas intro
      • L07: Pandas I/O
      • L08: Pandas filtering
      • L09: Pandas data cleaning
      • L10: Merging, reshaping datasets
      • L11: Dates, lags, sorting
      • L12: Descriptive stats
      • L13: Conditional stats, outliers
      • L14: Conditional stats applied
      • L15: Linear regression intro
      • L16: Linear regression applications
      • L17: Panel regression intro
      • L18: Robust panel regression
      • L19: Robust timeseries regression
      • L20: Backtesting - data prep
      • L21: Backtesting - sumstats
      • L22: Backtesting -returns
      • L23: Backtesting - risk adjustment
  • FIN 421
    • Lectures
      • L01: Introduction
      • L02: Analyzing past returns
      • L03: Modeling future returns
      • L04: Portfolio theory intro
      • L05: Optimal capital allocation
      • L06: Tangency portfolios
      • L07_08: Optimal asset allocation
      • L09: Review
      • L10_11: Statistical models of returns
      • L12: CAPM
      • L13: Cost of equity
      • L14: Bond pricing
      • L15: Bond yields
      • L16: Bond risk
      • L17: Valuation data processing
      • L18_19: Multiples valuation
      • L20_21: Dividend discount models
      • L22_23: Discounted cash flow analysis
      • L24: Valuation sensitivity analysis
      • L25: Options intro
      • L26: Risk management with options

On this page

  • Preliminaries
  • Merging datasets
    • Merging by columns with .merge()
      • Inner join
      • Outer join
      • Left join
      • Right join
    • Merging on index using .join() (and .merge())
  • Concatenating datasets with .concat()
  • Reshaping datasets
    • From long to wide (unstacking) with .pivot()
    • From wide to long (stacking) with .stack()

L10: Merging, reshaping datasets

Preliminaries

The data we need for our projects is rarely all in one place (in a single dataframe) or organized the way we need it. This means that we very often have to combine two or more datasets into a single dataset and/or change the organization of the dataset (what appears in the rows and what appears in the columns) to better suit our needs. Here we cover some of the main tools we can use to perform these operations using the Pandas package.

import pandas as pd
from IPython.display import display #allows us to ``pretty print`` multiple objects in the same cell

Let’s create some example datasets:

df1 = pd.DataFrame({'year': [2001, 2002, 2003], 
                    'tic': ['MSFT','TSLA','AAPL'], 
                    'fy':[2002,2003,2004]})
df1
year tic fy
0 2001 MSFT 2002
1 2002 TSLA 2003
2 2003 AAPL 2004
df2 = pd.DataFrame({'year': [2001, 2002, 2004], 
                    'ticker': ['MSFT','NFLX','AAPL'], 
                    'fy':[12,12,12]})
df2
year ticker fy
0 2001 MSFT 12
1 2002 NFLX 12
2 2004 AAPL 12
#example of display
display(df1)
df2
year tic fy
0 2001 MSFT 2002
1 2002 TSLA 2003
2 2003 AAPL 2004
year ticker fy
0 2001 MSFT 12
1 2002 NFLX 12
2 2004 AAPL 12

Merging datasets

When we say we want to “merge” two datasets, we generally mean that we want the columns of the two datasets to appear side by side in one final dataset. The important question is: How should the ROWS of the two datasets be matched? To perform this match, we need to have one or more columns that contain the same information in each of the two datasets. These common columns are usually referred to as the “keys” on which the rows are matched.

The second thing we have to decide is what to do with the rows that do NOT match after the merge. This is where we have to decide if we want an “inner”, “outer”, “left”, or “right” merge (aka join), as specified below.

Most commonly, we use the .merge() function when the keys on which we want to merge the datasets are columns, and the .join() function when they are indexes (even though .merge() can also merge on index). We cover these two functions separately in the two subsections below:

Merging by columns with .merge()

If the keys we want to merge on are columns in our dataframes (as opposed to indexes) we have to use the .merge() (.join() will not work for this purpose):

Abbreviated syntax:

DataFrame.merge(right, how='inner', on=None, 
                left_on=None, right_on=None, 
                left_index=False, right_index=False, 
                sort=False, suffixes=('_x', '_y'))

We’ll replace DataFrame and right with the names of the two dataframes we want to merge (respectively). In the subsections below, we explain what the different choices for how mean. When the key(s) on which we want to merge have the same name(s) in the two datasets, we use the on parameter to specify the names of the keys (see the example immediately below), otherwise we need to specify them using the left_on and right_on parameters.

The documentation at https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html has some additional examples that you may find useful.

Inner join

The inner join combines the datasets based on the INTERSECTION of the values in the “key” columns.

For example, if we want to inner merge the two dataframes using the year column as a key:

inner1 = df1.merge(df2, how='inner', on='year')
inner1
year tic fy_x ticker fy_y
0 2001 MSFT 2002 MSFT 12
1 2002 TSLA 2003 NFLX 12

But if we want to merge on both year and ticker information, we can not use on because the ticker information has different names in the different dataframes:

inner2 = df1.merge(df2, how='inner', 
                   left_on = ['year','tic'], right_on = ['year','ticker'])
inner2
year tic fy_x ticker fy_y
0 2001 MSFT 2002 MSFT 12

Note how the name of the fy column in each of the two datasets has been changed. To control that process ourselves, we can use the suffixes parameter to specify suffixes that should be appended at the end of common column names:

inner2 = df1.merge(df2, how='inner', 
                   left_on = ['year','tic'], right_on = ['year','ticker'],
                  suffixes = ('_df1', '_df2'))
inner2
year tic fy_df1 ticker fy_df2
0 2001 MSFT 2002 MSFT 12

Outer join

The outer join combines the datasets based on the UNION of the values in the “key” columns. For example:

outer = df1.merge(df2, how='outer', 
                 left_on = ['year','tic'], right_on = ['year','ticker'])
outer
year tic fy_x ticker fy_y
0 2001 MSFT 2002.0 MSFT 12.0
1 2002 TSLA 2003.0 NaN NaN
2 2003 AAPL 2004.0 NaN NaN
3 2002 NaN NaN NFLX 12.0
4 2004 NaN NaN AAPL 12.0

Left join

In a left join, the unmatched keys from the left dataset are kept, but the unmatched keys from the right dataset are discarded.

left = df1.merge(df2, how='left', 
                 left_on = ['year','tic'], right_on = ['year','ticker'])
left
year tic fy_x ticker fy_y
0 2001 MSFT 2002 MSFT 12.0
1 2002 TSLA 2003 NaN NaN
2 2003 AAPL 2004 NaN NaN

Right join

In a right join, the unmatched keys from the right dataset are kept, but the unmatched keys from the left dataset are discarded.

right = df1.merge(df2, how='right', 
                 left_on = ['year','tic'], right_on = ['year','ticker'])
right
year tic fy_x ticker fy_y
0 2001 MSFT 2002.0 MSFT 12
1 2002 NaN NaN NFLX 12
2 2004 NaN NaN AAPL 12

Merging on index using .join() (and .merge())

As mentioned above, this covers the situation when the keys on which we want to perform the merge are indexes in the dataframes we want to merge. In this case, we can either use the .merge() function and specify left_index=True, right_index=True, or we can use the .join() funciton without specifying any keys at all (because .join() assumes that you are using the index):

First, let’s add an index (MultiIndex) to the two dataframes:

df3 = df1.set_index(['year','tic'])
df3
fy
year tic
2001 MSFT 2002
2002 TSLA 2003
2003 AAPL 2004
df4 = df2.set_index(['year','ticker'])
df4
fy
year ticker
2001 MSFT 12
2002 NFLX 12
2004 AAPL 12

Before we can use .merge(), we have to make the index names match (tic does not match ticker):

df4.index.names = ['year','tic']
display(df3)
df4
fy
year tic
2001 MSFT 2002
2002 TSLA 2003
2003 AAPL 2004
fy
year tic
2001 MSFT 12
2002 NFLX 12
2004 AAPL 12

Now, say we want to perform an outer merge on the index:

outerm = df3.merge(df4, how='outer', 
                  left_index = True, right_index = True,
                  suffixes = ('_df3', '_df4'))
outerm
fy_df3 fy_df4
year tic
2001 MSFT 2002.0 12.0
2002 NFLX NaN 12.0
TSLA 2003.0 NaN
2003 AAPL 2004.0 NaN
2004 AAPL NaN 12.0

Using the .join() function will not work unless we specify how the fy columns need to be renamed. We use the lsuffix and rsuffix parameters for this purpose.

Syntax:

DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)

For example, we can perform the same outer merge as above using:

outerj = df3.join(df4, how = 'outer', 
                  lsuffix = '_df3', rsuffix = '_df4')
outerj
fy_df3 fy_df4
year tic
2001 MSFT 2002.0 12.0
2002 NFLX NaN 12.0
TSLA 2003.0 NaN
2003 AAPL 2004.0 NaN
2004 AAPL NaN 12.0

Concatenating datasets with .concat()

When we say that we want to “concatenate” two datasets, we generally mean that we want the columns of the two datasets to be stacked on top of each other (vertically) or side by side (horizontally). We usually want to concatenate vertically (aka “appending”) when we obtain more data on a given set of variables, and we just want to add it at the bottom of a dataset already containing those variables (columns).

Unlike merging, concatenating horizontally means that we want to combine two datasets side by side, but “as they are”, without concern for matching rows in any meaningful way (like merge). This is usually done when we want to combine different pieces of information (columns) from the two datasets, and we are certain that existing datasets are already organized in such a way that the rows will match meaningfully.

The .concat() function can perform both types of concatenation (it can also perform a merge but we will not use it for that purpose in this course):

Syntax:

pandas.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, sort=False, copy=True)

We specify axis=0 for vertical concatenation and axis=1 for horizontal concatenation:

df1
year tic fy
0 2001 MSFT 2002
1 2002 TSLA 2003
2 2003 AAPL 2004
df2
year ticker fy
0 2001 MSFT 12
1 2002 NFLX 12
2 2004 AAPL 12
vertical = pd.concat([df1,df2], axis = 0)
vertical
year tic fy ticker
0 2001 MSFT 2002 NaN
1 2002 TSLA 2003 NaN
2 2003 AAPL 2004 NaN
0 2001 NaN 12 MSFT
1 2002 NaN 12 NFLX
2 2004 NaN 12 AAPL
horizontal = pd.concat([df1,df2], axis=1)
horizontal
year tic fy year ticker fy
0 2001 MSFT 2002 2001 MSFT 12
1 2002 TSLA 2003 2002 NFLX 12
2 2003 AAPL 2004 2004 AAPL 12

Note that, .concat() with axis=1 is equivalent to .join() with how='outer'. In other words, .concat() performs a merge on index. Its default join argument is set to outer but you can change that to perform any kind of join you want.

Reshaping datasets

By reshaping a dataset we generally mean that we want to change the structure of the dataset so that either

  1. Some data stored in one column is converted to multiple columns (but the same row)
    • In pandas, this is called “unstacking”
    • Informally, we say that we are converting the dataset from long to wide

or

  1. Some data stored in multiple columns (but the same row) is converted to a single column
    • In pandas, this is called “stacking”
    • Informally, we say that we are converting the dataset from wide to long

See the official user guide page for more details on reshaping: https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html

From long to wide (unstacking) with .pivot()

Syntax:

DataFrame.pivot(index=None, columns=None, values=None)

Let’s create an example dataset:

long1 = pd.DataFrame({'portfolio':[1,1,2,2], 
                      'year': ['2005','2006','2005','2006'], 
                      'return': [0.1,0.15,0.05,0.01],
                      'nfirms':[5,3,4,10]})
long1
portfolio year return nfirms
0 1 2005 0.10 5
1 1 2006 0.15 3
2 2 2005 0.05 4
3 2 2006 0.01 10

For example, if we want to reshape the return data so that each year gets its own row (index value), and each portfolio gets its own column, we would use:

wide = long1.pivot(index='year', columns = 'portfolio', values = 'return')
wide
portfolio 1 2
year
2005 0.10 0.05
2006 0.15 0.01
wide.columns
Int64Index([1, 2], dtype='int64', name='portfolio')
wide.index
Index(['2005', '2006'], dtype='object', name='year')

If we want to reshape both the return data and the nfirms data in the same way, at the same time:

wide2 = long1.pivot(index='year', columns='portfolio', values=['return','nfirms'])
wide2
return nfirms
portfolio 1 2 1 2
year
2005 0.10 0.05 5.0 4.0
2006 0.15 0.01 3.0 10.0

Note that now the column labels are two-dimensional:

wide2.columns
MultiIndex([('return', 1),
            ('return', 2),
            ('nfirms', 1),
            ('nfirms', 2)],
           names=[None, 'portfolio'])
wide2.index
Index(['2005', '2006'], dtype='object', name='year')

Unstacking based on values in the index can be done using the .unstack() function (usually for datasets with a MultiIndex).

From wide to long (stacking) with .stack()

We can stack the data back up to a “long” shape, based on information in the index of that dataframe.

Syntax:

DataFrame.stack(level=-1, dropna=True)

The level parameter is used in case the dataframe we want to reshape has a MultiIndex in the columns i.e. multi-dimensional column names (like wide2 above). The default level=-1 works for dataframes that were reshaped from a different dataframe (like wide2 above). In that case level=-1 tells Python to just undo that reshaping:

long3 = wide2.stack()
long3
return nfirms
year portfolio
2005 1 0.10 5.0
2 0.05 4.0
2006 1 0.15 3.0
2 0.01 10.0

If we want to stack the information in the first dimension of the column MultiIndex (i.e. the part that contains “return” and “nfirms”), then we use level=0:

wide2.stack(level=0)
portfolio 1 2
year
2005 nfirms 5.00 4.00
return 0.10 0.05
2006 nfirms 3.00 10.00
return 0.15 0.01
wide2
return nfirms
portfolio 1 2 1 2
year
2005 0.10 0.05 5.0 4.0
2006 0.15 0.01 3.0 10.0

If we want to stack the information in the second dimension of the column MultiIndex (i.e. the part that contains 1’s and 2’s), then we use level=1:

wide2.stack(level=1)
return nfirms
year portfolio
2005 1 0.10 5.0
2 0.05 4.0
2006 1 0.15 3.0
2 0.01 10.0