L06: Pandas intro

Pandas DataFrames

Almost all our data analysis in this course will be performed using a new data structure called Pandas DataFrames. These data structures are implemented in the Pandas package, which comes with the Anaconda installation. It is customary to import the Pandas package as below.

import pandas as pd

The Pandas package has a very rich functionality. In this course we will cover only a subset of what one can do with this package. If you are interested in a higher level of detail than what is covered in this course, I strongly recommend the official user guide for the package:

https://pandas.pydata.org/docs/user_guide/index.html

We will be referencing different parts of this user guide at different points in the class. You are not expected to know all the details in the user guide, only the parts that are covered in class, or in the practice problems.

The pandas package comes with many different functions/attributes. We will introduce some of the more commonly used attributes gradually over the course of the semester. Remember, in Pyhton, an object attribute is anything that you can put (using dot notation) after the name of that object. We will use the terms “attribute”, “method” and “function” interchangeably.

Creating dataframes with the “.DataFrame()” function

One of the most important functions in the pandas package is the .DataFrame() function, which allows us to create a new dataframe (this is similar to using [] to create lists, and using {} to create dictionaries).

The general syntax for .DataFrame() is as follows:

pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)

As an aside, every time we introduce a new Pandas function, you should check its official documentation to see how you’re supposed to use it. Usually, this is the first result if you Google “pandas” followed by the name of that function. For example, Google “pandas DataFrame()” and you should get this https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html as one of the first results.

If you imported pandas as pd (as above), you would replace “pandas” in the syntax above with “pd”.

For example:

df = pd.DataFrame(data=[['TSLA',1000],['AAPL',2000]], 
                  index = ['Tesla','Apple'], 
                  columns = ['ticker','price'])

The cell above creates a new variable df of type pandas.core.frame.DataFrame:

print(type(df))

<class 'pandas.core.frame.DataFrame'>

To print df, we can use the print() function as usual:

print(df)

      ticker  price
Tesla   TSLA   1000
Apple   AAPL   2000

Though simply using the name of the dataframe also works, and the output looks nicer. We will use this method throughout the course:

df

	ticker	price
Tesla	TSLA	1000
Apple	AAPL	2000

From the syntax of the .DataFrame() function, you should be able to tell that any of the three parameters of DataFrame can be omitted (because they all have default falues).

For example:

df2 = pd.DataFrame(data=[['TSLA',1000],['AAPL',2000]], index = ['Tesla','Apple'])
df2

	0	1
Tesla	TSLA	1000
Apple	AAPL	2000

df3 = pd.DataFrame(data=[['TSLA',1000],['AAPL',2000]], columns = ['ticker','price'])
df3

	ticker	price
0	TSLA	1000
1	AAPL	2000

df4 = pd.DataFrame( index = ['Tesla','Apple'], columns = ['ticker','price'])
df4

	ticker	price
Tesla	NaN	NaN
Apple	NaN	NaN

Finally, if you check the documentation of .DataFrame(), you will see that there are several different types of objects you can supply to each of its parameters.

For example, we could have created our initial dataframe df using a dictionary for the data parameter (instead of a list):

df5 = pd.DataFrame(data = {'ticker': ['TSLA','AAPL'], 'price': [1000, 2000]}, 
                  index = ['Tesla','Apple'])
df5

	ticker	price
Tesla	TSLA	1000
Apple	AAPL	2000

Note how the columns parameter is missing, because column names are specified as the keys in the dictionary supplied to the data parameter.

We will not go through all the different ways to create a new dataframe. The key point is that most Pandas functions offer a large degree of flexibility with respect to how they can be used. This is very helpful for more advanced users, but it can be quite confusing for beginners. You can only get better at this with practice. Look at the documentation, and experiment with different ways to use that particular function. Check the results and see if you can explain what happened.

I will NOT expect you to know every little detail about the documentation of all the functions we introduce throughout the course. But you should be able to use those functions in the form that I use them in the notes. For example, given the code above, I expect you to know how to create a new dataframe from a list (first approach at the top) and from a dictionary (see code cell right above this one). In the code below, I also show you how to use the numpy package to create a dataframe containing some randomly generate data (this comes in handy when you want to quickly generate an example dataframe to test some of your code, when the contents of the data don’t really matter):

import numpy as np #usually we have all import statements at the top of the file

rdata = pd.DataFrame(data = np.random.rand(10,3), columns = list('abc'))
rdata

	a	b	c
0	0.956222	0.766637	0.285147
1	0.437727	0.215754	0.932937
2	0.834266	0.366227	0.295984
3	0.297755	0.764719	0.124215
4	0.376822	0.802006	0.850846
5	0.768934	0.477132	0.203595
6	0.045300	0.445239	0.430616
7	0.124063	0.178999	0.262997
8	0.277809	0.292087	0.605525
9	0.610705	0.627855	0.923457

The structure of a DataFrame: index, columns, values

The syntax of the .DataFrame() function should make it clear what the main components of a dataframe are:

the index: the names of the rows (also referred to as axis 0 of the dataframe)
the columns: the names of the columns (also referred to as axis 1 of the dataframe)
the values: the data contained in the table itself

We can access each of these individual parts of the dataframe using the .index, .columns and .values attributes:

df

	ticker	price
Tesla	TSLA	1000
Apple	AAPL	2000

df.index

Index(['Tesla', 'Apple'], dtype='object')

df.columns

Index(['ticker', 'price'], dtype='object')

df.values

array([['TSLA', 1000],
       ['AAPL', 2000]], dtype=object)

Note that these attributes are not followed by parentheses. They are still attributes, but they do not compute anything, they simply return some property of the dataframe. It is important to pay attention to these details. For example, the index attribute above is very different from the pd.Index() attribute (which is more appropriately called a “function”):https://pandas.pydata.org/docs/reference/api/pandas.Index.html.

Finally, note that the data inside a dataframe is represented as a numpy array:

type(df.values)

numpy.ndarray

At this point in time, it is not important to know anything about the numpy package or the ndarray type that comes with it. I just wanted to point out a very important feature of the Python programming language: many packages are built using other existing packages (e.g. pandas uses the numpy and matplotlib packages to name a few).

The .info() function gives us another way to quickly check the structure of our DataFrame:

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, Tesla to Apple
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   ticker  2 non-null      object
 1   price   2 non-null      int64 
dtypes: int64(1), object(1)
memory usage: 48.0+ bytes

The .describe() function gives us summary statistics for the numerical columns in the dataframe:

df.describe()

	price
count	2.000000
mean	1500.000000
std	707.106781
min	1000.000000
25%	1250.000000
50%	1500.000000
75%	1750.000000
max	2000.000000

The .dtypes attribute tells us the data type for each column:

df.dtypes

ticker    object
price      int64
dtype: object

The .shape attribute tells us the shape of the dataframe:

df.shape

(2, 2)

Manipulating the index

df

	ticker	price
Tesla	TSLA	1000
Apple	AAPL	2000

`.reset_index()`

We can switch to a numerical index for our dataframe using the .reset_index() function.

Syntax:

reset_index(level=None, drop=False, inplace=False, col_level=0, col_fill='')

For example:

df.reset_index()

	index	ticker	price
0	Tesla	TSLA	1000
1	Apple	AAPL	2000

Note how this pushed the old index inside the table itself, as a new column.

Very important: the code above did not actually change df:

df

	ticker	price
Tesla	TSLA	1000
Apple	AAPL	2000

To do that, we have to set the inplace parameter to True:

df.reset_index(inplace = True)

df

	index	ticker	price
0	Tesla	TSLA	1000
1	Apple	AAPL	2000

Or we can simply re-write the df dataframe:

df = df.reset_index()

df

	level_0	index	ticker	price
0	0	Tesla	TSLA	1000
1	1	Apple	AAPL	2000

`.set_index()`

We can use the data in one of the columns, as the new index of the table with the .set_index() function:

Syntax:

set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)

For example:

df = df.set_index('ticker')

df

	level_0	index	price
ticker
TSLA	0	Tesla	1000
AAPL	1	Apple	2000

Note that, once we make the ‘ticker’ column into an index, it stops showing up as a column:

df.columns

Index(['level_0', 'index', 'price'], dtype='object')

So if we try to set the index to ‘ticker’ again, we will get an error, because Python can not find the ‘ticker’ column:

#df.set_index('ticker') #this won't work

`.rename()`

We can rename one or more of the index values using the .rename() function.

Syntax:

rename(mapper=None, index=None, columns=None, axis=None, copy=True, inplace=False, level=None, errors='ignore')

We will supply a dictionary to the index parameter, where the keys are the current names we want to change in the index, and the values are the new names we want to use.

For example:

df.rename(index = {'TSLA': 'tsla'})

	level_0	index	price
ticker
tsla	0	Tesla	1000
AAPL	1	Apple	2000

As with .set_index(), .rename() does not actually change the df dataframe unless we set inplace = True or we redefine the df variable. This behavior should be expected of any pandas function that has an inplace parameter:

df

	level_0	index	price
ticker
TSLA	0	Tesla	1000
AAPL	1	Apple	2000

Manipulating columns

Let’s rewrite df to make sure we’re on the same page at this point:

ndf = pd.DataFrame(data=[['TSLA',1000],['AAPL',2000]], 
                  index = ['Tesla','Apple'], 
                  columns = ['ticker','price'])
ndf

	ticker	price
Tesla	TSLA	1000
Apple	AAPL	2000

`.rename()`

We can also rename one or more of the columns using the .rename() function.

This time, we will supply a dictionary to the columns parameter, where the keys are the current names of the columns we want to change, and the values are the new names we want to use.

For example

ndf = ndf.rename(columns = {'ticker': 't', 'price':'p'})

ndf

	t	p
Tesla	TSLA	1000
Apple	AAPL	2000

Accessing and creating entire columns

We can access the data in an entire column of a dataframe using square brackets after the dataframe name:

ndf['t']

Tesla    TSLA
Apple    AAPL
Name: t, dtype: object

If we want to access multiple columns at once, we have to put their names in a list:

ndf[['t','p']]

	t	p
Tesla	TSLA	1000
Apple	AAPL	2000

We can create new columns by just supplying the data manually:

ndf['new'] = [1,2]
ndf

	t	p	new
Tesla	TSLA	1000	1
Apple	AAPL	2000	2

But most of the time, new columns will be created by bringing them over from other dataframes (we’ll cover this in a future lecture) or by some calculation involving other existing columns from the dataframe.

For example:

ndf['somecalc'] = ndf['p'] + ndf['new']
ndf

	t	p	new	somecalc
Tesla	TSLA	1000	1	1001
Apple	AAPL	2000	2	2002

Some common Pandas functions

In this section, we introduce some more Pandas functions that are very commonly used. These are very intuitive so we will not present the syntax for each of them. For the most part, it is very clear what these functions do, just from their name. We will cover many more such functions throughout the course.

First, let’s create an example dataframe with random integers:

rd = pd.DataFrame(data = np.random.rand(3,2), columns = list('ab'))
rd

	a	b
0	0.409593	0.323978
1	0.779860	0.474937
2	0.832432	0.196289

rd.head(2)

	a	b
0	0.409593	0.323978
1	0.779860	0.474937

rd.tail(2)

	a	b
1	0.779860	0.474937
2	0.832432	0.196289

Count number of non-missing values:

rd.count()

a    3
b    3
dtype: int64

Calculate means, medians, standard deviations and variances:

rd.mean()

a    0.673962
b    0.331734
dtype: float64

rd.median()

a    0.779860
b    0.323978
dtype: float64

rd.std()

a    0.230454
b    0.139486
dtype: float64

rd.var()

a    0.053109
b    0.019456
dtype: float64

Calculate percentage changes from one row to the next:

rd.pct_change()

	a	b
0	NaN	NaN
1	0.903987	0.465956
2	0.067412	-0.586706

Calculate differences between the current row and a different row (default is the prior row):

rd.diff()

	a	b
0	NaN	NaN
1	0.370267	0.150959
2	0.052572	-0.278648

Calculate cumulative sums (sum of values up to that row):

rd.cumsum()

	a	b
0	0.409593	0.323978
1	1.189453	0.798915
2	2.021885	0.995203

Calculate cummulative products (product of values up to that row):

rd.cumprod()

	a	b
0	0.409593	0.323978
1	0.319425	0.153869
2	0.265900	0.030203

Pandas Series

The pandas package comes with another commonly used data structure: the Series. This is a data structure that contains a single column and an index.

For example:

s = pd.Series(data = [12,34])
s

0    12
1    34
dtype: int64

type(s)

pandas.core.series.Series

We will almost never use Series in our analysis. We only introduce it here because some pandas functions return a Series as a result. For the most part, we will convert series to dataframes using the .to_frame() function:

s = s.to_frame()
s

	0
0	12
1	34

type(s)

pandas.core.frame.DataFrame