Lab

Probability and Statistics

Alessandro D. Gagliardi

On Groundhog Day, February 2, a famous groundhog in Punxsutawney, PA (Punxsutawney Phil) is used to predict whether winter will be long, based on whether or not he sees his shadow. I collected data on whether he saw his shadow from here and stored some of it in this table.

Although Phil is on the East Coast, I wondered if the information says anything about whether or not we will experience a rainy winter out here in California. For this, I found rainfall data, and saved it in a table. To see how this was extracted see this notebook.

Make a boxplot of the average rainfall in Northern California comparing the years Phil sees his shadow versus the years he does not.
Construct a 90% confidence interval for the difference between the mean rainfall in years Phil sees his shadow and years he does not.
Interpret the interval in part 2.
At level $\alpha = 0.05$, would you reject the null hypothesis that the average rainfall in Northern California during the month of February was the same in years Phil sees his shadow versus years he does not?
What assumptions are you making in forming your confidence interval and in your hypothesis test?

%matplotlib inline
import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm
import twitter
import yaml
from pymongo import MongoClient

Part 1

rainfall = pd.read_csv('http://stats191.stanford.edu/data/rainfall.csv')
groundhog = pd.read_csv('http://stats191.stanford.edu/data/groundhog.table')

df = rainfall.merge(groundhog, left_on='WY', right_on='year')[['Total', 'shadow']]

df.boxplot(column='Total', by='shadow')

Part 2

mod = sm.OLS.from_formula("Total ~ shadow == 'Y'", df)
res = mod.fit()
print(res.summary(alpha=0.1))  # alpha=0.1 requests 90% confidence intervals

I report the confidence interval [-20.135, 10.453] for the difference between the shadow == 'Y' and the shadow == 'N' means.
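The reason the coefficient on the `shadow == 'Y'` indicator can be read as a difference of means is that, with a single 0/1 regressor, the least-squares slope equals the mean of the 'Y' group minus the mean of the 'N' group, and the intercept equals the 'N' mean. The sketch below checks this on made-up numbers. The `total` and `shadow` lists are hypothetical stand-ins for the lab's columns, not the actual rainfall data, and the fit uses the closed-form simple-regression formulas in plain Python rather than statsmodels.

```python
# Hypothetical toy data standing in for the lab's `Total` and `shadow` columns.
total = [10.2, 14.8, 9.1, 12.5, 20.3, 7.7, 16.0, 11.4]
shadow = ['Y', 'N', 'Y', 'Y', 'N', 'Y', 'N', 'Y']

# Indicator regressor: 1 when shadow == 'Y', else 0.
x = [1.0 if s == 'Y' else 0.0 for s in shadow]


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Closed-form simple-regression estimates: slope = S_xy / S_xx.
x_bar, y_bar = mean(x), mean(total)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, total))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Group means computed directly, for comparison with the fit.
mean_y = mean([t for t, s in zip(total, shadow) if s == 'Y'])
mean_n = mean([t for t, s in zip(total, shadow) if s == 'N'])

print("slope:", slope, "mean_Y - mean_N:", mean_y - mean_n)
print("intercept:", intercept, "mean_N:", mean_n)
```

The two numbers on each printed line should agree up to floating-point rounding, which is why the interval reported for the indicator coefficient is an interval for the difference in group means.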

Part 3

Suppose I repeatedly drew a new sample of shadow and rainfall observations in Northern California (assuming the years are IID) and formed a confidence interval each time in the same way as above. Then 90% of those intervals would cover the true underlying difference in mean rainfall between years the groundhog sees his shadow and years he does not.
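This repeated-sampling reading can be illustrated by simulation. The sketch below is a minimal example under stated assumptions: the group means, the common standard deviation, and the group size of 10 are all invented for illustration, and the 0.95 quantile of the t distribution with 18 degrees of freedom is hard-coded (approximately 1.734) so that the code needs only the standard library. It draws many pairs of normal samples, forms the pooled-variance 90% interval each time, and counts how often the interval covers the true difference.

```python
import math
import random
import statistics

# Approximate 0.95 quantile of Student's t with 18 degrees of freedom
# (two groups of 10, so df = 10 + 10 - 2). Hard-coded to avoid extra dependencies.
T_CRIT = 1.734


def pooled_interval(a, b, t_crit):
    """Two-sided pooled-variance interval for mean(a) - mean(b)."""
    n_a, n_b = len(a), len(b)
    diff = statistics.mean(a) - statistics.mean(b)
    pooled_var = ((n_a - 1) * statistics.variance(a)
                  + (n_b - 1) * statistics.variance(b)) / (n_a + n_b - 2)
    half_width = t_crit * math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return diff - half_width, diff + half_width


def coverage(true_diff=-5.0, sigma=10.0, n=10, reps=5000, seed=0):
    """Fraction of simulated 90% intervals that contain true_diff."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        # Hypothetical populations: 'N' years centred at 30, 'Y' years shifted by true_diff.
        yes = [rng.gauss(30.0 + true_diff, sigma) for _ in range(n)]
        no = [rng.gauss(30.0, sigma) for _ in range(n)]
        low, high = pooled_interval(yes, no, T_CRIT)
        hits += low <= true_diff <= high
    return hits / reps


print(coverage())
```

The printed fraction should land close to 0.90. Any single interval either covers the true difference or does not; the 90% describes the procedure over many repetitions.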

Part 4

As the reported $p$-value is 0.591, I fail to reject the null hypothesis at the 5% level.
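Comparing the $p$-value to 0.05 is equivalent to comparing the absolute value of the $t$ statistic to the 0.975 quantile of the $t$ distribution with the residual degrees of freedom. The sketch below shows that decision rule on invented data: the two lists are hypothetical, not the lab's rainfall values, and the critical value for 8 degrees of freedom (approximately 2.306) is hard-coded so the example runs with the standard library alone.

```python
import math
import statistics

# Approximate 0.975 quantile of Student's t with 8 degrees of freedom
# (two groups of 5, so df = 5 + 5 - 2).
T_CRIT_975_DF8 = 2.306


def pooled_t_statistic(a, b):
    """Equal-variance two-sample t statistic for mean(a) - mean(b)."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * statistics.variance(a)
                  + (n_b - 1) * statistics.variance(b)) / (n_a + n_b - 2)
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (statistics.mean(a) - statistics.mean(b)) / std_err


# Hypothetical rainfall totals for shadow == 'Y' and shadow == 'N' years.
yes = [31.0, 28.5, 35.2, 30.1, 29.4]
no = [33.0, 27.9, 36.4, 31.8, 30.6]

t_stat = pooled_t_statistic(yes, no)
reject = abs(t_stat) > T_CRIT_975_DF8  # two-sided test at the 5% level

print("t statistic:", round(t_stat, 3))
print("reject H0 at 5%:", reject)
```

This is the same statistic that appears in the `t` column of the OLS summary for the indicator coefficient, so the rule and the $p$-value comparison always lead to the same decision.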

Part 5

I am assuming that the rainfall measurements are independent $N(\mu_i,\sigma^2)$ where $\mu_i=\mu_N$ in the shadow == 'N' years and $\mu_i=\mu_Y$ in the shadow == 'Y' years.

1-2 Pairs

Start with the data on US and Canada trends from last week.

Create a histogram of text_len within each group. (Use alpha = 0.5 to overlay them.)
Compute the sample mean and standard deviation in the two groups.
Create a DataFrame concatenating data from each collection adding a country column to distinguish US from Canada.
i.e. given ca_text_len and us_text_len as Series containing the length of each text in the Canadian and US collections respectively:
```
 text_len_df = pd.concat([pd.DataFrame({'text_len': ca_text_len, 'country': 'CA'}), 
                          pd.DataFrame({'text_len': us_text_len, 'country': 'US'})])
```
Use this DataFrame to create a boxplot of the text_len by country.
Use OLS to compute a 90% confidence interval for the difference in text_len between the two groups. Name a problem with describing the confidence interval of tweet length in this way.
At level $\alpha=5\%$, test the null hypothesis that the average text length does not differ between the two groups. What can you conclude?
Repeat the above steps but pick your own tags and try to find a pair with a more significant difference.

Homework

Use enron.db from last week.

Create a boxplot of the message recipient count (MAX(rno)), splitting the data based on the seniority of the sender.
Compute the sample mean and standard deviation in the two groups.
Create a histogram of the recipient count within each group.
Compute a 90% confidence interval for the difference in recipient count between the two groups. What is a problem with this? How might you fix it?
At level $\alpha=5\%$, test the null hypothesis that the average recipient count does not differ between the two groups. What assumptions are you making? What can you conclude?