import pandas as pd
L06: Pandas intro
Pandas DataFrames
Almost all our data analysis in this course will be performed using a new data structure called Pandas DataFrames. These data structures are implemented in the Pandas package, which comes with the Anaconda installation. It is customary to import the Pandas package as below.
The Pandas package has a very rich functionality. In this course we will cover only a subset of what one can do with this package. If you are interested in a higher level of detail than what is covered in this course, I strongly recommend the official user guide for the package:
https://pandas.pydata.org/docs/user_guide/index.html
We will be referencing different parts of this user guide at different points in the class. You are not expected to know all the details in the user guide, only the parts that are covered in class, or in the practice problems.
The pandas package comes with many different functions/attributes. We will introduce some of the more commonly used attributes gradually over the course of the semester. Remember, in Pyhton, an object attribute is anything that you can put (using dot notation) after the name of that object. We will use the terms “attribute”, “method” and “function” interchangeably.
Creating dataframes with the “.DataFrame()” function
One of the most important functions in the pandas package is the .DataFrame()
function, which allows us to create a new dataframe (this is similar to using []
to create lists, and using {}
to create dictionaries).
The general syntax for .DataFrame()
is as follows:
=None, index=None, columns=None, dtype=None, copy=None) pandas.DataFrame(data
As an aside, every time we introduce a new Pandas function, you should check its official documentation to see how you’re supposed to use it. Usually, this is the first result if you Google “pandas” followed by the name of that function. For example, Google “pandas DataFrame()” and you should get this https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html as one of the first results.
If you imported pandas as pd (as above), you would replace “pandas” in the syntax above with “pd”.
For example:
= pd.DataFrame(data=[['TSLA',1000],['AAPL',2000]],
df = ['Tesla','Apple'],
index = ['ticker','price']) columns
The cell above creates a new variable df
of type pandas.core.frame.DataFrame
:
print(type(df))
<class 'pandas.core.frame.DataFrame'>
To print df
, we can use the print()
function as usual:
print(df)
ticker price
Tesla TSLA 1000
Apple AAPL 2000
Though simply using the name of the dataframe also works, and the output looks nicer. We will use this method throughout the course:
df
ticker | price | |
---|---|---|
Tesla | TSLA | 1000 |
Apple | AAPL | 2000 |
From the syntax of the .DataFrame()
function, you should be able to tell that any of the three parameters of DataFrame can be omitted (because they all have default falues).
For example:
= pd.DataFrame(data=[['TSLA',1000],['AAPL',2000]], index = ['Tesla','Apple'])
df2 df2
0 | 1 | |
---|---|---|
Tesla | TSLA | 1000 |
Apple | AAPL | 2000 |
= pd.DataFrame(data=[['TSLA',1000],['AAPL',2000]], columns = ['ticker','price'])
df3 df3
ticker | price | |
---|---|---|
0 | TSLA | 1000 |
1 | AAPL | 2000 |
= pd.DataFrame( index = ['Tesla','Apple'], columns = ['ticker','price'])
df4 df4
ticker | price | |
---|---|---|
Tesla | NaN | NaN |
Apple | NaN | NaN |
Finally, if you check the documentation of .DataFrame()
, you will see that there are several different types of objects you can supply to each of its parameters.
For example, we could have created our initial dataframe df
using a dictionary for the data
parameter (instead of a list):
= pd.DataFrame(data = {'ticker': ['TSLA','AAPL'], 'price': [1000, 2000]},
df5 = ['Tesla','Apple'])
index df5
ticker | price | |
---|---|---|
Tesla | TSLA | 1000 |
Apple | AAPL | 2000 |
Note how the columns
parameter is missing, because column names are specified as the keys in the dictionary supplied to the data
parameter.
We will not go through all the different ways to create a new dataframe. The key point is that most Pandas functions offer a large degree of flexibility with respect to how they can be used. This is very helpful for more advanced users, but it can be quite confusing for beginners. You can only get better at this with practice. Look at the documentation, and experiment with different ways to use that particular function. Check the results and see if you can explain what happened.
I will NOT expect you to know every little detail about the documentation of all the functions we introduce throughout the course. But you should be able to use those functions in the form that I use them in the notes. For example, given the code above, I expect you to know how to create a new dataframe from a list (first approach at the top) and from a dictionary (see code cell right above this one). In the code below, I also show you how to use the numpy
package to create a dataframe containing some randomly generate data (this comes in handy when you want to quickly generate an example dataframe to test some of your code, when the contents of the data don’t really matter):
import numpy as np #usually we have all import statements at the top of the file
= pd.DataFrame(data = np.random.rand(10,3), columns = list('abc'))
rdata rdata
a | b | c | |
---|---|---|---|
0 | 0.956222 | 0.766637 | 0.285147 |
1 | 0.437727 | 0.215754 | 0.932937 |
2 | 0.834266 | 0.366227 | 0.295984 |
3 | 0.297755 | 0.764719 | 0.124215 |
4 | 0.376822 | 0.802006 | 0.850846 |
5 | 0.768934 | 0.477132 | 0.203595 |
6 | 0.045300 | 0.445239 | 0.430616 |
7 | 0.124063 | 0.178999 | 0.262997 |
8 | 0.277809 | 0.292087 | 0.605525 |
9 | 0.610705 | 0.627855 | 0.923457 |
The structure of a DataFrame: index, columns, values
The syntax of the .DataFrame()
function should make it clear what the main components of a dataframe are:
- the index: the names of the rows (also referred to as axis 0 of the dataframe)
- the columns: the names of the columns (also referred to as axis 1 of the dataframe)
- the values: the data contained in the table itself
We can access each of these individual parts of the dataframe using the .index
, .columns
and .values
attributes:
df
ticker | price | |
---|---|---|
Tesla | TSLA | 1000 |
Apple | AAPL | 2000 |
df.index
Index(['Tesla', 'Apple'], dtype='object')
df.columns
Index(['ticker', 'price'], dtype='object')
df.values
array([['TSLA', 1000],
['AAPL', 2000]], dtype=object)
Note that these attributes are not followed by parentheses. They are still attributes, but they do not compute anything, they simply return some property of the dataframe. It is important to pay attention to these details. For example, the index
attribute above is very different from the pd.Index()
attribute (which is more appropriately called a “function”):https://pandas.pydata.org/docs/reference/api/pandas.Index.html.
Finally, note that the data inside a dataframe is represented as a numpy array:
type(df.values)
numpy.ndarray
At this point in time, it is not important to know anything about the numpy
package or the ndarray
type that comes with it. I just wanted to point out a very important feature of the Python programming language: many packages are built using other existing packages (e.g. pandas uses the numpy and matplotlib packages to name a few).
The .info() function gives us another way to quickly check the structure of our DataFrame:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, Tesla to Apple
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ticker 2 non-null object
1 price 2 non-null int64
dtypes: int64(1), object(1)
memory usage: 48.0+ bytes
The .describe() function gives us summary statistics for the numerical columns in the dataframe:
df.describe()
price | |
---|---|
count | 2.000000 |
mean | 1500.000000 |
std | 707.106781 |
min | 1000.000000 |
25% | 1250.000000 |
50% | 1500.000000 |
75% | 1750.000000 |
max | 2000.000000 |
The .dtypes attribute tells us the data type for each column:
df.dtypes
ticker object
price int64
dtype: object
The .shape attribute tells us the shape of the dataframe:
df.shape
(2, 2)
Manipulating the index
df
ticker | price | |
---|---|---|
Tesla | TSLA | 1000 |
Apple | AAPL | 2000 |
.reset_index()
We can switch to a numerical index for our dataframe using the .reset_index()
function.
Syntax:
=None, drop=False, inplace=False, col_level=0, col_fill='') reset_index(level
For example:
df.reset_index()
index | ticker | price | |
---|---|---|---|
0 | Tesla | TSLA | 1000 |
1 | Apple | AAPL | 2000 |
Note how this pushed the old index inside the table itself, as a new column.
Very important: the code above did not actually change df
:
df
ticker | price | |
---|---|---|
Tesla | TSLA | 1000 |
Apple | AAPL | 2000 |
To do that, we have to set the inplace
parameter to True:
= True) df.reset_index(inplace
df
index | ticker | price | |
---|---|---|---|
0 | Tesla | TSLA | 1000 |
1 | Apple | AAPL | 2000 |
Or we can simply re-write the df
dataframe:
= df.reset_index() df
df
level_0 | index | ticker | price | |
---|---|---|---|---|
0 | 0 | Tesla | TSLA | 1000 |
1 | 1 | Apple | AAPL | 2000 |
.set_index()
We can use the data in one of the columns, as the new index of the table with the .set_index()
function:
Syntax:
=True, append=False, inplace=False, verify_integrity=False) set_index(keys, drop
For example:
= df.set_index('ticker') df
df
level_0 | index | price | |
---|---|---|---|
ticker | |||
TSLA | 0 | Tesla | 1000 |
AAPL | 1 | Apple | 2000 |
Note that, once we make the ‘ticker’ column into an index, it stops showing up as a column:
df.columns
Index(['level_0', 'index', 'price'], dtype='object')
So if we try to set the index to ‘ticker’ again, we will get an error, because Python can not find the ‘ticker’ column:
#df.set_index('ticker') #this won't work
.rename()
We can rename one or more of the index values using the .rename()
function.
Syntax:
=None, index=None, columns=None, axis=None, copy=True, inplace=False, level=None, errors='ignore') rename(mapper
We will supply a dictionary to the index
parameter, where the keys are the current names we want to change in the index, and the values are the new names we want to use.
For example:
= {'TSLA': 'tsla'}) df.rename(index
level_0 | index | price | |
---|---|---|---|
ticker | |||
tsla | 0 | Tesla | 1000 |
AAPL | 1 | Apple | 2000 |
As with .set_index()
, .rename()
does not actually change the df
dataframe unless we set inplace = True
or we redefine the df variable. This behavior should be expected of any pandas function that has an inplace
parameter:
df
level_0 | index | price | |
---|---|---|---|
ticker | |||
TSLA | 0 | Tesla | 1000 |
AAPL | 1 | Apple | 2000 |
Manipulating columns
Let’s rewrite df
to make sure we’re on the same page at this point:
= pd.DataFrame(data=[['TSLA',1000],['AAPL',2000]],
ndf = ['Tesla','Apple'],
index = ['ticker','price'])
columns ndf
ticker | price | |
---|---|---|
Tesla | TSLA | 1000 |
Apple | AAPL | 2000 |
.rename()
We can also rename one or more of the columns using the .rename()
function.
This time, we will supply a dictionary to the columns
parameter, where the keys are the current names of the columns we want to change, and the values are the new names we want to use.
For example
= ndf.rename(columns = {'ticker': 't', 'price':'p'}) ndf
ndf
t | p | |
---|---|---|
Tesla | TSLA | 1000 |
Apple | AAPL | 2000 |
Accessing and creating entire columns
We can access the data in an entire column of a dataframe using square brackets after the dataframe name:
't'] ndf[
Tesla TSLA
Apple AAPL
Name: t, dtype: object
If we want to access multiple columns at once, we have to put their names in a list:
't','p']] ndf[[
t | p | |
---|---|---|
Tesla | TSLA | 1000 |
Apple | AAPL | 2000 |
We can create new columns by just supplying the data manually:
'new'] = [1,2]
ndf[ ndf
t | p | new | |
---|---|---|---|
Tesla | TSLA | 1000 | 1 |
Apple | AAPL | 2000 | 2 |
But most of the time, new columns will be created by bringing them over from other dataframes (we’ll cover this in a future lecture) or by some calculation involving other existing columns from the dataframe.
For example:
'somecalc'] = ndf['p'] + ndf['new']
ndf[ ndf
t | p | new | somecalc | |
---|---|---|---|---|
Tesla | TSLA | 1000 | 1 | 1001 |
Apple | AAPL | 2000 | 2 | 2002 |
Some common Pandas functions
In this section, we introduce some more Pandas functions that are very commonly used. These are very intuitive so we will not present the syntax for each of them. For the most part, it is very clear what these functions do, just from their name. We will cover many more such functions throughout the course.
First, let’s create an example dataframe with random integers:
= pd.DataFrame(data = np.random.rand(3,2), columns = list('ab'))
rd rd
a | b | |
---|---|---|
0 | 0.409593 | 0.323978 |
1 | 0.779860 | 0.474937 |
2 | 0.832432 | 0.196289 |
2) rd.head(
a | b | |
---|---|---|
0 | 0.409593 | 0.323978 |
1 | 0.779860 | 0.474937 |
2) rd.tail(
a | b | |
---|---|---|
1 | 0.779860 | 0.474937 |
2 | 0.832432 | 0.196289 |
Count number of non-missing values:
rd.count()
a 3
b 3
dtype: int64
Calculate means, medians, standard deviations and variances:
rd.mean()
a 0.673962
b 0.331734
dtype: float64
rd.median()
a 0.779860
b 0.323978
dtype: float64
rd.std()
a 0.230454
b 0.139486
dtype: float64
rd.var()
a 0.053109
b 0.019456
dtype: float64
Calculate percentage changes from one row to the next:
rd.pct_change()
a | b | |
---|---|---|
0 | NaN | NaN |
1 | 0.903987 | 0.465956 |
2 | 0.067412 | -0.586706 |
Calculate differences between the current row and a different row (default is the prior row):
rd.diff()
a | b | |
---|---|---|
0 | NaN | NaN |
1 | 0.370267 | 0.150959 |
2 | 0.052572 | -0.278648 |
Calculate cumulative sums (sum of values up to that row):
rd.cumsum()
a | b | |
---|---|---|
0 | 0.409593 | 0.323978 |
1 | 1.189453 | 0.798915 |
2 | 2.021885 | 0.995203 |
Calculate cummulative products (product of values up to that row):
rd.cumprod()
a | b | |
---|---|---|
0 | 0.409593 | 0.323978 |
1 | 0.319425 | 0.153869 |
2 | 0.265900 | 0.030203 |
Pandas Series
The pandas package comes with another commonly used data structure: the Series. This is a data structure that contains a single column and an index.
For example:
= pd.Series(data = [12,34])
s s
0 12
1 34
dtype: int64
type(s)
pandas.core.series.Series
We will almost never use Series in our analysis. We only introduce it here because some pandas functions return a Series as a result. For the most part, we will convert series to dataframes using the .to_frame()
function:
= s.to_frame()
s s
0 | |
---|---|
0 | 12 |
1 | 34 |
type(s)
pandas.core.frame.DataFrame