import pandas as pd
from IPython.display import display #allows us to ``pretty print`` multiple objects in the same cell
L10: Merging, reshaping datasets
Preliminaries
The data we need for our projects is rarely all in one place (in a single dataframe) or organized the way we need it. This means that we very often have to combine two or more datasets into a single dataset and/or change the organization of the dataset (what appears in the rows and what appears in the columns) to better suit our needs. Here we cover some of the main tools we can use to perform these operations using the Pandas package.
Let’s create some example datasets:
= pd.DataFrame({'year': [2001, 2002, 2003],
df1 'tic': ['MSFT','TSLA','AAPL'],
'fy':[2002,2003,2004]})
df1
year | tic | fy | |
---|---|---|---|
0 | 2001 | MSFT | 2002 |
1 | 2002 | TSLA | 2003 |
2 | 2003 | AAPL | 2004 |
= pd.DataFrame({'year': [2001, 2002, 2004],
df2 'ticker': ['MSFT','NFLX','AAPL'],
'fy':[12,12,12]})
df2
year | ticker | fy | |
---|---|---|---|
0 | 2001 | MSFT | 12 |
1 | 2002 | NFLX | 12 |
2 | 2004 | AAPL | 12 |
#example of display
display(df1) df2
year | tic | fy | |
---|---|---|---|
0 | 2001 | MSFT | 2002 |
1 | 2002 | TSLA | 2003 |
2 | 2003 | AAPL | 2004 |
year | ticker | fy | |
---|---|---|---|
0 | 2001 | MSFT | 12 |
1 | 2002 | NFLX | 12 |
2 | 2004 | AAPL | 12 |
Merging datasets
When we say we want to “merge” two datasets, we generally mean that we want the columns of the two datasets to appear side by side in one final dataset. The important question is: How should the ROWS of the two datasets be matched? To perform this match, we need to have one or more columns that contain the same information in each of the two datasets. These common columns are usually referred to as the “keys” on which the rows are matched.
The second thing we have to decide is what to do with the rows that do NOT match after the merge. This is where we have to decide if we want an “inner”, “outer”, “left”, or “right” merge (aka join), as specified below.
Most commonly, we use the .merge()
function when the keys on which we want to merge the datasets are columns, and the .join()
function when they are indexes (even though .merge()
can also merge on index). We cover these two functions separately in the two subsections below:
Merging by columns with .merge()
If the keys we want to merge on are columns in our dataframes (as opposed to indexes) we have to use the .merge()
(.join()
will not work for this purpose):
Abbreviated syntax:
='inner', on=None,
DataFrame.merge(right, how=None, right_on=None,
left_on=False, right_index=False,
left_index=False, suffixes=('_x', '_y')) sort
We’ll replace DataFrame
and right
with the names of the two dataframes we want to merge (respectively). In the subsections below, we explain what the different choices for how
mean. When the key(s) on which we want to merge have the same name(s) in the two datasets, we use the on
parameter to specify the names of the keys (see the example immediately below), otherwise we need to specify them using the left_on
and right_on
parameters.
The documentation at https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html has some additional examples that you may find useful.
Inner join
The inner join combines the datasets based on the INTERSECTION of the values in the “key” columns.
For example, if we want to inner merge the two dataframes using the year
column as a key:
= df1.merge(df2, how='inner', on='year')
inner1 inner1
year | tic | fy_x | ticker | fy_y | |
---|---|---|---|---|---|
0 | 2001 | MSFT | 2002 | MSFT | 12 |
1 | 2002 | TSLA | 2003 | NFLX | 12 |
But if we want to merge on both year
and ticker information, we can not use on
because the ticker information has different names in the different dataframes:
= df1.merge(df2, how='inner',
inner2 = ['year','tic'], right_on = ['year','ticker'])
left_on inner2
year | tic | fy_x | ticker | fy_y | |
---|---|---|---|---|---|
0 | 2001 | MSFT | 2002 | MSFT | 12 |
Note how the name of the fy
column in each of the two datasets has been changed. To control that process ourselves, we can use the suffixes
parameter to specify suffixes that should be appended at the end of common column names:
= df1.merge(df2, how='inner',
inner2 = ['year','tic'], right_on = ['year','ticker'],
left_on = ('_df1', '_df2'))
suffixes inner2
year | tic | fy_df1 | ticker | fy_df2 | |
---|---|---|---|---|---|
0 | 2001 | MSFT | 2002 | MSFT | 12 |
Outer join
The outer join combines the datasets based on the UNION of the values in the “key” columns. For example:
= df1.merge(df2, how='outer',
outer = ['year','tic'], right_on = ['year','ticker'])
left_on outer
year | tic | fy_x | ticker | fy_y | |
---|---|---|---|---|---|
0 | 2001 | MSFT | 2002.0 | MSFT | 12.0 |
1 | 2002 | TSLA | 2003.0 | NaN | NaN |
2 | 2003 | AAPL | 2004.0 | NaN | NaN |
3 | 2002 | NaN | NaN | NFLX | 12.0 |
4 | 2004 | NaN | NaN | AAPL | 12.0 |
Left join
In a left join, the unmatched keys from the left dataset are kept, but the unmatched keys from the right dataset are discarded.
= df1.merge(df2, how='left',
left = ['year','tic'], right_on = ['year','ticker'])
left_on left
year | tic | fy_x | ticker | fy_y | |
---|---|---|---|---|---|
0 | 2001 | MSFT | 2002 | MSFT | 12.0 |
1 | 2002 | TSLA | 2003 | NaN | NaN |
2 | 2003 | AAPL | 2004 | NaN | NaN |
Right join
In a right join, the unmatched keys from the right dataset are kept, but the unmatched keys from the left dataset are discarded.
= df1.merge(df2, how='right',
right = ['year','tic'], right_on = ['year','ticker'])
left_on right
year | tic | fy_x | ticker | fy_y | |
---|---|---|---|---|---|
0 | 2001 | MSFT | 2002.0 | MSFT | 12 |
1 | 2002 | NaN | NaN | NFLX | 12 |
2 | 2004 | NaN | NaN | AAPL | 12 |
Merging on index using .join()
(and .merge()
)
As mentioned above, this covers the situation when the keys on which we want to perform the merge are indexes in the dataframes we want to merge. In this case, we can either use the .merge()
function and specify left_index=True, right_index=True
, or we can use the .join()
funciton without specifying any keys at all (because .join()
assumes that you are using the index):
First, let’s add an index (MultiIndex) to the two dataframes:
= df1.set_index(['year','tic'])
df3 df3
fy | ||
---|---|---|
year | tic | |
2001 | MSFT | 2002 |
2002 | TSLA | 2003 |
2003 | AAPL | 2004 |
= df2.set_index(['year','ticker'])
df4 df4
fy | ||
---|---|---|
year | ticker | |
2001 | MSFT | 12 |
2002 | NFLX | 12 |
2004 | AAPL | 12 |
Before we can use .merge()
, we have to make the index names match (tic
does not match ticker
):
= ['year','tic']
df4.index.names
display(df3) df4
fy | ||
---|---|---|
year | tic | |
2001 | MSFT | 2002 |
2002 | TSLA | 2003 |
2003 | AAPL | 2004 |
fy | ||
---|---|---|
year | tic | |
2001 | MSFT | 12 |
2002 | NFLX | 12 |
2004 | AAPL | 12 |
Now, say we want to perform an outer merge on the index:
= df3.merge(df4, how='outer',
outerm = True, right_index = True,
left_index = ('_df3', '_df4'))
suffixes outerm
fy_df3 | fy_df4 | ||
---|---|---|---|
year | tic | ||
2001 | MSFT | 2002.0 | 12.0 |
2002 | NFLX | NaN | 12.0 |
TSLA | 2003.0 | NaN | |
2003 | AAPL | 2004.0 | NaN |
2004 | AAPL | NaN | 12.0 |
Using the .join()
function will not work unless we specify how the fy
columns need to be renamed. We use the lsuffix
and rsuffix
parameters for this purpose.
Syntax:
=None, how='left', lsuffix='', rsuffix='', sort=False) DataFrame.join(other, on
For example, we can perform the same outer merge as above using:
= df3.join(df4, how = 'outer',
outerj = '_df3', rsuffix = '_df4')
lsuffix outerj
fy_df3 | fy_df4 | ||
---|---|---|---|
year | tic | ||
2001 | MSFT | 2002.0 | 12.0 |
2002 | NFLX | NaN | 12.0 |
TSLA | 2003.0 | NaN | |
2003 | AAPL | 2004.0 | NaN |
2004 | AAPL | NaN | 12.0 |
Concatenating datasets with .concat()
When we say that we want to “concatenate” two datasets, we generally mean that we want the columns of the two datasets to be stacked on top of each other (vertically) or side by side (horizontally). We usually want to concatenate vertically (aka “appending”) when we obtain more data on a given set of variables, and we just want to add it at the bottom of a dataset already containing those variables (columns).
Unlike merging, concatenating horizontally means that we want to combine two datasets side by side, but “as they are”, without concern for matching rows in any meaningful way (like merge). This is usually done when we want to combine different pieces of information (columns) from the two datasets, and we are certain that existing datasets are already organized in such a way that the rows will match meaningfully.
The .concat()
function can perform both types of concatenation (it can also perform a merge but we will not use it for that purpose in this course):
Syntax:
=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, sort=False, copy=True) pandas.concat(objs, axis
We specify axis=0
for vertical concatenation and axis=1
for horizontal concatenation:
df1
year | tic | fy | |
---|---|---|---|
0 | 2001 | MSFT | 2002 |
1 | 2002 | TSLA | 2003 |
2 | 2003 | AAPL | 2004 |
df2
year | ticker | fy | |
---|---|---|---|
0 | 2001 | MSFT | 12 |
1 | 2002 | NFLX | 12 |
2 | 2004 | AAPL | 12 |
= pd.concat([df1,df2], axis = 0)
vertical vertical
year | tic | fy | ticker | |
---|---|---|---|---|
0 | 2001 | MSFT | 2002 | NaN |
1 | 2002 | TSLA | 2003 | NaN |
2 | 2003 | AAPL | 2004 | NaN |
0 | 2001 | NaN | 12 | MSFT |
1 | 2002 | NaN | 12 | NFLX |
2 | 2004 | NaN | 12 | AAPL |
= pd.concat([df1,df2], axis=1)
horizontal horizontal
year | tic | fy | year | ticker | fy | |
---|---|---|---|---|---|---|
0 | 2001 | MSFT | 2002 | 2001 | MSFT | 12 |
1 | 2002 | TSLA | 2003 | 2002 | NFLX | 12 |
2 | 2003 | AAPL | 2004 | 2004 | AAPL | 12 |
Note that, .concat()
with axis=1
is equivalent to .join()
with how='outer'
. In other words, .concat()
performs a merge on index. Its default join
argument is set to outer
but you can change that to perform any kind of join you want.
Reshaping datasets
By reshaping a dataset we generally mean that we want to change the structure of the dataset so that either
- Some data stored in one column is converted to multiple columns (but the same row)
- In pandas, this is called “unstacking”
- Informally, we say that we are converting the dataset from long to wide
or
- Some data stored in multiple columns (but the same row) is converted to a single column
- In pandas, this is called “stacking”
- Informally, we say that we are converting the dataset from wide to long
See the official user guide page for more details on reshaping: https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html
From long to wide (unstacking) with .pivot()
Syntax:
=None, columns=None, values=None) DataFrame.pivot(index
Let’s create an example dataset:
= pd.DataFrame({'portfolio':[1,1,2,2],
long1 'year': ['2005','2006','2005','2006'],
'return': [0.1,0.15,0.05,0.01],
'nfirms':[5,3,4,10]})
long1
portfolio | year | return | nfirms | |
---|---|---|---|---|
0 | 1 | 2005 | 0.10 | 5 |
1 | 1 | 2006 | 0.15 | 3 |
2 | 2 | 2005 | 0.05 | 4 |
3 | 2 | 2006 | 0.01 | 10 |
For example, if we want to reshape the return
data so that each year
gets its own row (index value), and each portfolio
gets its own column, we would use:
= long1.pivot(index='year', columns = 'portfolio', values = 'return')
wide wide
portfolio | 1 | 2 |
---|---|---|
year | ||
2005 | 0.10 | 0.05 |
2006 | 0.15 | 0.01 |
wide.columns
Int64Index([1, 2], dtype='int64', name='portfolio')
wide.index
Index(['2005', '2006'], dtype='object', name='year')
If we want to reshape both the return
data and the nfirms
data in the same way, at the same time:
= long1.pivot(index='year', columns='portfolio', values=['return','nfirms'])
wide2 wide2
return | nfirms | |||
---|---|---|---|---|
portfolio | 1 | 2 | 1 | 2 |
year | ||||
2005 | 0.10 | 0.05 | 5.0 | 4.0 |
2006 | 0.15 | 0.01 | 3.0 | 10.0 |
Note that now the column labels are two-dimensional:
wide2.columns
MultiIndex([('return', 1),
('return', 2),
('nfirms', 1),
('nfirms', 2)],
names=[None, 'portfolio'])
wide2.index
Index(['2005', '2006'], dtype='object', name='year')
Unstacking based on values in the index can be done using the .unstack()
function (usually for datasets with a MultiIndex).
From wide to long (stacking) with .stack()
We can stack the data back up to a “long” shape, based on information in the index of that dataframe.
Syntax:
=-1, dropna=True) DataFrame.stack(level
The level
parameter is used in case the dataframe we want to reshape has a MultiIndex in the columns i.e. multi-dimensional column names (like wide2
above). The default level=-1
works for dataframes that were reshaped from a different dataframe (like wide2
above). In that case level=-1
tells Python to just undo that reshaping:
= wide2.stack()
long3 long3
return | nfirms | ||
---|---|---|---|
year | portfolio | ||
2005 | 1 | 0.10 | 5.0 |
2 | 0.05 | 4.0 | |
2006 | 1 | 0.15 | 3.0 |
2 | 0.01 | 10.0 |
If we want to stack the information in the first dimension of the column MultiIndex (i.e. the part that contains “return” and “nfirms”), then we use level=0
:
=0) wide2.stack(level
portfolio | 1 | 2 | |
---|---|---|---|
year | |||
2005 | nfirms | 5.00 | 4.00 |
return | 0.10 | 0.05 | |
2006 | nfirms | 3.00 | 10.00 |
return | 0.15 | 0.01 |
wide2
return | nfirms | |||
---|---|---|---|---|
portfolio | 1 | 2 | 1 | 2 |
year | ||||
2005 | 0.10 | 0.05 | 5.0 | 4.0 |
2006 | 0.15 | 0.01 | 3.0 | 10.0 |
If we want to stack the information in the second dimension of the column MultiIndex (i.e. the part that contains 1’s and 2’s), then we use level=1
:
=1) wide2.stack(level
return | nfirms | ||
---|---|---|---|
year | portfolio | ||
2005 | 1 | 0.10 | 5.0 |
2 | 0.05 | 4.0 | |
2006 | 1 | 0.15 | 3.0 |
2 | 0.01 | 10.0 |