# Workshop: pandas

### Monday, March 20, 2023

In today's workshop we'll get a bit of practice using `pandas`.
If you are already familiar with dataframes in `R`, then you will find `pandas` to be fairly easy to pick up.
In today's exercises, we'll concentrate on the basics of working with `pandas` dataframes (e.g., constructing and accessing them).
On Wednesday, we'll turn our attention to actually computing statistics from dataframes and using them to create plots.

In [1]:
import pandas as pd

## Problem 1: the building blocks

Let's recall that the two most basic units in Python `pandas` are the `Series` and `DataFrame` objects.
Roughly speaking, a `Series` object represents a labeled sequence of values, and each column of a `DataFrame` object is a `Series`.
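
As a minimal sketch of how the two fit together, the example below builds a `DataFrame` from a dictionary of `Series` and then pulls one column back out. The names `temps`, `cities`, and `weather` are hypothetical and are not used elsewhere in the workshop.

```python
import pandas as pd

# Hypothetical data, used only for illustration.
temps = pd.Series([21.5, 19.0, 25.3])
cities = pd.Series(['Madison', 'Chicago', 'Austin'])

# Each dictionary key becomes a column name; each Series becomes a column.
weather = pd.DataFrame({'City': cities, 'TempC': temps})

# Selecting a single column returns a Series.
print(type(weather['TempC']))
print(weather.shape)  # (number of rows, number of columns)
```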

### 1.1
Create three `pandas` series objects: one called `primes` containing the first ten primes; one called `squares` containing the first ten squares; and one called `alpha` containing the first ten letters of the alphabet (lower- or upper-case; whatever you prefer).
If you're looking for a challenge, you can try using list comprehensions for this, but you are free to just list these out by hand if you prefer.

In [None]:
primes = None
squares = None
alpha = None

### 1.2 
Verify that you can mostly treat `pd.Series` objects as `numpy` arrays-- try writing `primes**2` and see what happens.

In [None]:
primes**2 #pd.Series objects support universal functions

In [None]:
[1,2,3]**2 #Reminder: lists do not!
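
For comparison, here is a small sketch on a hypothetical series `s` (not one of the workshop variables). NumPy functions apply elementwise to a `Series`, while squaring a plain list raises a `TypeError`.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 4, 9])  # hypothetical example data

# NumPy universal functions operate elementwise and return a Series.
roots = np.sqrt(s)
print(roots)

# A plain list does not support ** with an integer.
try:
    [1, 4, 9] ** 2
except TypeError as err:
    print('TypeError:', err)
```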

### 1.3

Try adding two `pd.Series` objects.

In [None]:
primes+squares
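
One detail worth knowing: addition between two `Series` objects matches entries by index label, not by position. The sketch below uses hypothetical series `a` and `b` whose labels only partly overlap.

```python
import pandas as pd

# Hypothetical example data with partially overlapping labels.
a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['y', 'z', 'w'])

# Labels present in both series are added; labels present in only one
# of them produce NaN in the result.
total = a + b
print(total)
```

When both series use the default integer index of the same length, as in the exercise, this reduces to ordinary position-by-position addition.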

What if you try to square the `Series` object `alpha`, which has strings as its entries?

In [None]:
#TODO: code goes here.

### 1.4
Use the `pandas` map functionality to construct another `Series` object called `sqprimes` containing the squares of the first ten primes.

<b>Hint:</b> remember lambda expressions?
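
As a reminder of the syntax, here is a sketch of `Series.map` with a lambda expression. It uses a hypothetical series `words` and solves a deliberately different task from the exercise.

```python
import pandas as pd

words = pd.Series(['apple', 'fig', 'banana'])  # hypothetical example data

# map applies the function to each entry and returns a new Series.
lengths = words.map(lambda w: len(w))
print(lengths)
```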

In [None]:
sqprimes = primes**2 # This would give us the solution, but we want to use map

In [None]:
# TODO: code goes here.

### 1.5
Now, put the four `Series` objects created above into a single `pandas` `DataFrame` object called `my_first_df`, with column names `Primes`, `Squares`, `Alpha` and `SqPrimes`.

<b>Hint:</b> read the documentation here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

In [None]:
my_first_df = None # TODO: update this line.

## Problem 2: Basic Indexing

The block of code below creates a table containing some basic information about the colleges and universities with the most NCAA Division I Ice Hockey championships (see <a href="https://en.wikipedia.org/wiki/NCAA_Division_I_Men's_Ice_Hockey_Tournament">here</a> for more information).

In [2]:
schools = ['Michigan', 'Denver', 'North Dakota', 'Wisconsin', 'Boston College',
           'Boston University', 'Minnesota', 'Lake Superior State']
titles = [9,8,8,6,5,5,5,3]
states = ['MI', 'CO', 'ND', 'WI', 'MA', 'MA', 'MI', 'MI']
hockey_dict = {'College':pd.Series(schools), '#Titles':pd.Series(titles), 'State':pd.Series(states) }
h = pd.DataFrame( hockey_dict )
h

| | College | #Titles | State |
|---|---|---|---|
| 0 | Michigan | 9 | MI |
| 1 | Denver | 8 | CO |
| 2 | North Dakota | 8 | ND |
| 3 | Wisconsin | 6 | WI |
| 4 | Boston College | 5 | MA |
| 5 | Boston University | 5 | MA |
| 6 | Minnesota | 5 | MI |
| 7 | Lake Superior State | 3 | MI |


### 2.1
Oops! There's a typo in there-- U. Minnesota is in Minnesota (state abbreviation MN), not Michigan!
Correct the typo (using indexing, not by modifying the code above!).
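
A single entry can be overwritten by assigning through `.loc` with a row label and a column name. The sketch below shows the pattern on a hypothetical frame `pets`, not on the hockey table.

```python
import pandas as pd

# Hypothetical data with one wrong entry: Tom is a cat, not a dog.
pets = pd.DataFrame({'Name': ['Rex', 'Tom'], 'Species': ['dog', 'dog']})

# Row label 1, column 'Species': fix the incorrect entry in place.
pets.loc[1, 'Species'] = 'cat'
print(pets)
```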

In [None]:
#TODO: code goes here.

### 2.2
Modify the code above so that the rows are indexed by the schools' names, instead of by the integers 0 to 7.
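
One possible approach, among several, is `DataFrame.set_index`, which promotes an existing column to the row index. The sketch below uses a hypothetical frame `pets`.

```python
import pandas as pd

pets = pd.DataFrame({'Name': ['Rex', 'Tom'], 'Age': [3, 5]})  # hypothetical data

# set_index returns a new DataFrame; the original is left unchanged.
by_name = pets.set_index('Name')

# Rows can now be looked up by name instead of by integer position.
print(by_name.loc['Tom', 'Age'])
```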

In [None]:
# TODO: code goes here.

See this Stack Overflow thread for another approach to this problem. You may find it helpful to first read the documentation on `DataFrame` and `Index` objects in `pandas` (see the links in Problem 2.4 below).

https://stackoverflow.com/questions/19609631/python-changing-row-index-of-pandas-data-frame

### 2.3
Add a column called `InWisconsin?` to this `DataFrame`, whose values are Booleans encoding whether or not each of these universities is in Wisconsin.
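
Comparing a column against a value produces a `Series` of Booleans, which can be assigned as a new column. A sketch on a hypothetical frame `pets`, where the column name `IsCat?` is made up for illustration:

```python
import pandas as pd

pets = pd.DataFrame({'Name': ['Rex', 'Tom'], 'Species': ['dog', 'cat']})

# The comparison is evaluated elementwise and yields a Series of Booleans,
# which is then stored as a new column.
pets['IsCat?'] = pets['Species'] == 'cat'
print(pets)
```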

In [None]:
#TODO: code goes here.

### 2.4
Add a row for Michigan State, which is in Michigan (state abbreviation MI), and has won 3 championships.

<b>Hint:</b> you may find it useful to read up on indexing and Index objects in pandas:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
https://pandas.pydata.org/pandas-docs/stable/reference/indexing.html
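
One way to add a row, sketched here on a hypothetical frame `pets` indexed by name, is to assign to a label that is not yet in the index. This is only one option, and it assumes you supply one value per column.

```python
import pandas as pd

# Hypothetical data, indexed by name rather than by integers.
pets = pd.DataFrame({'Age': [3, 5], 'Species': ['dog', 'cat']},
                    index=['Rex', 'Tom'])

# Assigning to a label that does not exist yet appends a new row.
# The values are given in the same order as the columns.
pets.loc['Kiwi'] = [1, 'bird']
print(pets)
```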


In [None]:
#TODO: code goes here

### 2.5
Write code to extract only the rows corresponding to universities in Michigan (state code MI).
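
Row selection of this kind is usually done with a Boolean mask. A minimal sketch on a hypothetical frame `pets`:

```python
import pandas as pd

pets = pd.DataFrame({'Name': ['Rex', 'Tom', 'Ada'],
                     'Species': ['dog', 'cat', 'dog']})

# Indexing with a Boolean Series keeps only the rows where it is True.
dogs = pets[pets['Species'] == 'dog']
print(dogs)
```

The selected rows keep their original index labels.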

In [None]:
# TODO: code goes here.

### 2.6
Save the `DataFrame` in a CSV file called `ncaa_hockey.csv`.

In [None]:
# TODO: code goes here.

### 2.7
Verify that everything worked correctly by reading your file back into `pandas`, saving it in a new variable called `hockey_test`.
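
The sketch below shows a CSV round trip on a hypothetical frame `pets`, written to a temporary directory rather than to `ncaa_hockey.csv`. By default, `to_csv` writes the row index as the first column, so passing `index_col=0` to `read_csv` is one way to restore it. Without that argument, an unnamed index comes back as an ordinary column called `Unnamed: 0`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data, indexed by name.
pets = pd.DataFrame({'Age': [3, 5]}, index=['Rex', 'Tom'])

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'pets.csv')
    pets.to_csv(path)                            # index is written as the first column
    pets_back = pd.read_csv(path, index_col=0)   # use that column as the index again

print(pets_back)
```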

In [None]:
# TODO: code goes here.