L10: Merging, reshaping datasets

Preliminaries

The data we need for our projects is rarely all in one place (in a single dataframe) or organized the way we need it. This means that we very often have to combine two or more datasets into a single dataset and/or change the organization of the dataset (what appears in the rows and what appears in the columns) to better suit our needs. Here we cover some of the main tools we can use to perform these operations using the Pandas package.

import pandas as pd
from IPython.display import display #allows us to ``pretty print`` multiple objects in the same cell

Let’s create some example datasets:

df1 = pd.DataFrame({'year': [2001, 2002, 2003], 
                    'tic': ['MSFT','TSLA','AAPL'], 
                    'fy':[2002,2003,2004]})
df1

	year	tic	fy
0	2001	MSFT	2002
1	2002	TSLA	2003
2	2003	AAPL	2004

df2 = pd.DataFrame({'year': [2001, 2002, 2004], 
                    'ticker': ['MSFT','NFLX','AAPL'], 
                    'fy':[12,12,12]})
df2

	year	ticker	fy
0	2001	MSFT	12
1	2002	NFLX	12
2	2004	AAPL	12

#example of display
display(df1)
df2

	year	tic	fy
0	2001	MSFT	2002
1	2002	TSLA	2003
2	2003	AAPL	2004

	year	ticker	fy
0	2001	MSFT	12
1	2002	NFLX	12
2	2004	AAPL	12

Merging datasets

When we say we want to “merge” two datasets, we generally mean that we want the columns of the two datasets to appear side by side in one final dataset. The important question is: How should the ROWS of the two datasets be matched? To perform this match, we need to have one or more columns that contain the same information in each of the two datasets. These common columns are usually referred to as the “keys” on which the rows are matched.

The second thing we have to decide is what to do with the rows that do NOT match after the merge. This is where we have to decide if we want an “inner”, “outer”, “left”, or “right” merge (aka join), as specified below.

Most commonly, we use the .merge() function when the keys on which we want to merge the datasets are columns, and the .join() function when they are indexes (even though .merge() can also merge on index). We cover these two functions separately in the two subsections below:

Merging by columns with `.merge()`

If the keys we want to merge on are columns in our dataframes (as opposed to indexes) we have to use the .merge() (.join() will not work for this purpose):

Abbreviated syntax:

DataFrame.merge(right, how='inner', on=None, 
                left_on=None, right_on=None, 
                left_index=False, right_index=False, 
                sort=False, suffixes=('_x', '_y'))

We’ll replace DataFrame and right with the names of the two dataframes we want to merge (respectively). In the subsections below, we explain what the different choices for how mean. When the key(s) on which we want to merge have the same name(s) in the two datasets, we use the on parameter to specify the names of the keys (see the example immediately below), otherwise we need to specify them using the left_on and right_on parameters.

The documentation at https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html has some additional examples that you may find useful.

Inner join

The inner join combines the datasets based on the INTERSECTION of the values in the “key” columns.

For example, if we want to inner merge the two dataframes using the year column as a key:

inner1 = df1.merge(df2, how='inner', on='year')
inner1

	year	tic	fy_x	ticker	fy_y
0	2001	MSFT	2002	MSFT	12
1	2002	TSLA	2003	NFLX	12

But if we want to merge on both year and ticker information, we can not use on because the ticker information has different names in the different dataframes:

inner2 = df1.merge(df2, how='inner', 
                   left_on = ['year','tic'], right_on = ['year','ticker'])
inner2

	year	tic	fy_x	ticker	fy_y
0	2001	MSFT	2002	MSFT	12

Note how the name of the fy column in each of the two datasets has been changed. To control that process ourselves, we can use the suffixes parameter to specify suffixes that should be appended at the end of common column names:

inner2 = df1.merge(df2, how='inner', 
                   left_on = ['year','tic'], right_on = ['year','ticker'],
                  suffixes = ('_df1', '_df2'))
inner2

	year	tic	fy_df1	ticker	fy_df2
0	2001	MSFT	2002	MSFT	12

Outer join

The outer join combines the datasets based on the UNION of the values in the “key” columns. For example:

outer = df1.merge(df2, how='outer', 
                 left_on = ['year','tic'], right_on = ['year','ticker'])
outer

	year	tic	fy_x	ticker	fy_y
0	2001	MSFT	2002.0	MSFT	12.0
1	2002	TSLA	2003.0	NaN	NaN
2	2003	AAPL	2004.0	NaN	NaN
3	2002	NaN	NaN	NFLX	12.0
4	2004	NaN	NaN	AAPL	12.0

Left join

In a left join, the unmatched keys from the left dataset are kept, but the unmatched keys from the right dataset are discarded.

left = df1.merge(df2, how='left', 
                 left_on = ['year','tic'], right_on = ['year','ticker'])
left

	year	tic	fy_x	ticker	fy_y
0	2001	MSFT	2002	MSFT	12.0
1	2002	TSLA	2003	NaN	NaN
2	2003	AAPL	2004	NaN	NaN

Right join

In a right join, the unmatched keys from the right dataset are kept, but the unmatched keys from the left dataset are discarded.

right = df1.merge(df2, how='right', 
                 left_on = ['year','tic'], right_on = ['year','ticker'])
right

	year	tic	fy_x	ticker	fy_y
0	2001	MSFT	2002.0	MSFT	12
1	2002	NaN	NaN	NFLX	12
2	2004	NaN	NaN	AAPL	12

Merging on index using `.join()` (and `.merge()`)

As mentioned above, this covers the situation when the keys on which we want to perform the merge are indexes in the dataframes we want to merge. In this case, we can either use the .merge() function and specify left_index=True, right_index=True, or we can use the .join() funciton without specifying any keys at all (because .join() assumes that you are using the index):

First, let’s add an index (MultiIndex) to the two dataframes:

df3 = df1.set_index(['year','tic'])
df3

		fy
year	tic
2001	MSFT	2002
2002	TSLA	2003
2003	AAPL	2004

df4 = df2.set_index(['year','ticker'])
df4

		fy
year	ticker
2001	MSFT	12
2002	NFLX	12
2004	AAPL	12

Before we can use .merge(), we have to make the index names match (tic does not match ticker):

df4.index.names = ['year','tic']
display(df3)
df4

		fy
year	tic
2001	MSFT	2002
2002	TSLA	2003
2003	AAPL	2004

		fy
year	tic
2001	MSFT	12
2002	NFLX	12
2004	AAPL	12

Now, say we want to perform an outer merge on the index:

outerm = df3.merge(df4, how='outer', 
                  left_index = True, right_index = True,
                  suffixes = ('_df3', '_df4'))
outerm

		fy_df3	fy_df4
year	tic
2001	MSFT	2002.0	12.0
2002	NFLX	NaN	12.0
2002	TSLA	2003.0	NaN
2003	AAPL	2004.0	NaN
2004	AAPL	NaN	12.0

Using the .join() function will not work unless we specify how the fy columns need to be renamed. We use the lsuffix and rsuffix parameters for this purpose.

Syntax:

DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)

For example, we can perform the same outer merge as above using:

outerj = df3.join(df4, how = 'outer', 
                  lsuffix = '_df3', rsuffix = '_df4')
outerj

		fy_df3	fy_df4
year	tic
2001	MSFT	2002.0	12.0
2002	NFLX	NaN	12.0
2002	TSLA	2003.0	NaN
2003	AAPL	2004.0	NaN
2004	AAPL	NaN	12.0

Concatenating datasets with `.concat()`

When we say that we want to “concatenate” two datasets, we generally mean that we want the columns of the two datasets to be stacked on top of each other (vertically) or side by side (horizontally). We usually want to concatenate vertically (aka “appending”) when we obtain more data on a given set of variables, and we just want to add it at the bottom of a dataset already containing those variables (columns).

Unlike merging, concatenating horizontally means that we want to combine two datasets side by side, but “as they are”, without concern for matching rows in any meaningful way (like merge). This is usually done when we want to combine different pieces of information (columns) from the two datasets, and we are certain that existing datasets are already organized in such a way that the rows will match meaningfully.

The .concat() function can perform both types of concatenation (it can also perform a merge but we will not use it for that purpose in this course):

Syntax:

pandas.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, sort=False, copy=True)

We specify axis=0 for vertical concatenation and axis=1 for horizontal concatenation:

df1

	year	tic	fy
0	2001	MSFT	2002
1	2002	TSLA	2003
2	2003	AAPL	2004

df2

	year	ticker	fy
0	2001	MSFT	12
1	2002	NFLX	12
2	2004	AAPL	12

vertical = pd.concat([df1,df2], axis = 0)
vertical

	year	tic	fy	ticker
0	2001	MSFT	2002	NaN
1	2002	TSLA	2003	NaN
2	2003	AAPL	2004	NaN
0	2001	NaN	12	MSFT
1	2002	NaN	12	NFLX
2	2004	NaN	12	AAPL

horizontal = pd.concat([df1,df2], axis=1)
horizontal

	year	tic	fy	year	ticker	fy
0	2001	MSFT	2002	2001	MSFT	12
1	2002	TSLA	2003	2002	NFLX	12
2	2003	AAPL	2004	2004	AAPL	12

Note that, .concat() with axis=1 is equivalent to .join() with how='outer'. In other words, .concat() performs a merge on index. Its default join argument is set to outer but you can change that to perform any kind of join you want.

Reshaping datasets

By reshaping a dataset we generally mean that we want to change the structure of the dataset so that either

Some data stored in one column is converted to multiple columns (but the same row)
- In pandas, this is called “unstacking”
- Informally, we say that we are converting the dataset from long to wide

Some data stored in multiple columns (but the same row) is converted to a single column
- In pandas, this is called “stacking”
- Informally, we say that we are converting the dataset from wide to long

See the official user guide page for more details on reshaping: https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html

From long to wide (unstacking) with `.pivot()`

Syntax:

DataFrame.pivot(index=None, columns=None, values=None)

Let’s create an example dataset:

long1 = pd.DataFrame({'portfolio':[1,1,2,2], 
                      'year': ['2005','2006','2005','2006'], 
                      'return': [0.1,0.15,0.05,0.01],
                      'nfirms':[5,3,4,10]})
long1

	portfolio	year	return	nfirms
0	1	2005	0.10	5
1	1	2006	0.15	3
2	2	2005	0.05	4
3	2	2006	0.01	10

For example, if we want to reshape the return data so that each year gets its own row (index value), and each portfolio gets its own column, we would use:

wide = long1.pivot(index='year', columns = 'portfolio', values = 'return')
wide

portfolio	1	2
year
2005	0.10	0.05
2006	0.15	0.01

wide.columns

Int64Index([1, 2], dtype='int64', name='portfolio')

wide.index

Index(['2005', '2006'], dtype='object', name='year')

If we want to reshape both the return data and the nfirms data in the same way, at the same time:

wide2 = long1.pivot(index='year', columns='portfolio', values=['return','nfirms'])
wide2

	return		nfirms
portfolio	1	2	1	2
year
2005	0.10	0.05	5.0	4.0
2006	0.15	0.01	3.0	10.0

Note that now the column labels are two-dimensional:

wide2.columns

MultiIndex([('return', 1),
            ('return', 2),
            ('nfirms', 1),
            ('nfirms', 2)],
           names=[None, 'portfolio'])

wide2.index

Index(['2005', '2006'], dtype='object', name='year')

Unstacking based on values in the index can be done using the .unstack() function (usually for datasets with a MultiIndex).

From wide to long (stacking) with `.stack()`

We can stack the data back up to a “long” shape, based on information in the index of that dataframe.

Syntax:

DataFrame.stack(level=-1, dropna=True)

The level parameter is used in case the dataframe we want to reshape has a MultiIndex in the columns i.e. multi-dimensional column names (like wide2 above). The default level=-1 works for dataframes that were reshaped from a different dataframe (like wide2 above). In that case level=-1 tells Python to just undo that reshaping:

long3 = wide2.stack()
long3

		return	nfirms
year	portfolio
2005	1	0.10	5.0
2005	2	0.05	4.0
2006	1	0.15	3.0
2006	2	0.01	10.0

If we want to stack the information in the first dimension of the column MultiIndex (i.e. the part that contains “return” and “nfirms”), then we use level=0:

wide2.stack(level=0)

	portfolio	1	2
year
2005	nfirms	5.00	4.00
2005	return	0.10	0.05
2006	nfirms	3.00	10.00
2006	return	0.15	0.01

wide2

	return		nfirms
portfolio	1	2	1	2
year
2005	0.10	0.05	5.0	4.0
2006	0.15	0.01	3.0	10.0

If we want to stack the information in the second dimension of the column MultiIndex (i.e. the part that contains 1’s and 2’s), then we use level=1:

wide2.stack(level=1)

		return	nfirms
year	portfolio
2005	1	0.10	5.0
2005	2	0.05	4.0
2006	1	0.15	3.0
2006	2	0.01	10.0

Preliminaries

Merging datasets

Merging by columns with .merge()

Inner join

Outer join

Left join

Right join

Merging on index using .join() (and .merge())

Concatenating datasets with .concat()

Reshaping datasets

From long to wide (unstacking) with .pivot()

From wide to long (stacking) with .stack()

Merging by columns with `.merge()`

Merging on index using `.join()` (and `.merge()`)

Concatenating datasets with `.concat()`

From long to wide (unstacking) with `.pivot()`

From wide to long (stacking) with `.stack()`