Data Science

Probability and Statistics

Alessandro D. Gagliardi

There are three kinds of lies: lies, damned lies, and statistics.

– Benjamin Disraeli (?)

How to Lie with Statistics

Data Science:

How to be Honest with Statistics

Agenda

  • What is a regression model?
  • History of Probability
  • Descriptive statistics -- numerical
  • Descriptive statistics -- graphical
  • p-values and Hypothesis Testing
  • Inference about a population mean
  • Difference between two population means
In [1]:
%matplotlib inline
%load_ext rmagic
from __future__ import division

import warnings

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pandas.rpy.common as com
import statsmodels.api as sm

from matplotlib import pylab
from scipy import stats
from scipy.stats import binom
from scipy.stats import t as tdist
from sympy import Eq, S, init_printing
from sympy.stats import Binomial, Die, P, density, sample, sample_iter

warnings.filterwarnings('ignore')

init_printing()

What is a regression model?

A functional relationship between input & response variables

A simple linear regression model captures a linear relationship between an input x and response variable y

$$ y = \alpha + \beta x + \epsilon $$

The purpose of models is not to fit the data but to sharpen the questions.

Samuel Karlin (1924-2007), mathematician

What do the terms in this model mean?

$y = \alpha + \beta x + \epsilon$

$y =$ response variable (the one we want to predict)

$x =$ input variable (the one we use to train the model)

$\alpha =$ intercept (where the line crosses the y-axis)

$\beta =$ regression coefficient (the model “parameter”)

$\epsilon =$ residual (the prediction error)

What is a regression model?

A regression model is a model of the relationships between some covariates (predictors) and an outcome.

Specifically, regression is a model of the average outcome given or having fixed the covariates.

Heights of mothers and daughters

  • We will consider the heights of mothers and daughters collected by Karl Pearson in the late 19th century.

  • One of our goals is to understand height of the daughter, D, knowing the height of the mother, M.

  • A mathematical model might look like $$ D = f(M) + \varepsilon$$ where $f$ gives the average height of the daughter of a mother of height M and $\varepsilon$ is error: not every daughter has the same height.

  • A statistical question: is there any relationship between covariates and outcomes -- is $f$ just a constant?

Let's create a plot of the heights of the mother/daughter pairs. The data is in an R package that can be downloaded from CRAN with the command:

install.packages("alr3")

If the package is not installed, then you will get an error message when calling library(alr3).

In [2]:
%%R
library(alr3)
data(heights)
Loading required package: car

In [3]:
heights = com.load_data('heights')
heights_fig = plt.figure(figsize=(10,6))
axes = heights_fig.gca()
axes.scatter(heights.Mheight, heights.Dheight, c='red')
axes.set_xlabel("Mother's height (inches)", size=20)
axes.set_ylabel("Daughter's height (inches)", size=20)
Out[3]:
<matplotlib.text.Text at 0x10e37c350>

In the first part of this session we'll talk about fitting a line to this data. Let's do that and remake the plot, including this "best fitting line".

In [4]:
res = sm.OLS.from_formula("Dheight ~ Mheight", heights).fit()
axes.plot([heights.Mheight.min(), heights.Mheight.max()], 
          [res.params.Intercept + res.params.Mheight * heights.Mheight.min(), 
           res.params.Intercept + res.params.Mheight * heights.Mheight.max()],
          linewidth=3)
heights_fig
Out[4]:

Linear regression model

  • How do find this line? With a model.

  • We might model the data as $$ D = \beta_0+ \beta_1 M + \varepsilon. $$

  • This model is linear in $\beta_1$, the coefficient of M (the mother's height), it is a simple linear regression model.

Polynomial Regression

Consider the following model: $$ D = \beta_0 + \beta_1 M + \beta_2 M^2 + \beta_3 F + \varepsilon $$ where $F$ is the height of the daughter's father.

Q: This represents a nonlinear relationship. Is it still a linear model?

A: Yes, because it’s linear in the $\beta$’s!

Polynomial Regression

Although polynomial regression fits a nonlinear model to the data, as statistical estimation problem it is linear, in the sense that the regression function $E(y|x)$ is linear in the unknown parameters that are estimated from the data. For this reason, polynomial regression is considered to be a special case of multiple linear regression.

(from Wikipedia)

Which model is better? We will need a tool to compare models... more to come later.

A more complex model

  • Our example here was rather simple: we only had one independent variable.

  • Independent variables are sometimes called features, covariates, or explanatory variables.

  • In practice, we often have many more than one explanatory variable.

Right-to-work

This example considers the effect of right-to-work legislation (which varies by state) on various factors. A description of the data can be found here.

The variables are:

  • Income: income for a four-person family

  • COL: cost of living for a four-person family

  • PD: Population density

  • URate: rate of unionization in 1978

  • Pop: Population

  • Taxes: Property taxes in 1972

  • RTWL: right-to-work indicator

In a study like this, there are many possible questions of interest. Our focus will be on the relationship between RTWL and Income. However, we should recognize that other variables have an effect on Income. Let's look at some of these relationships.

In [5]:
url = "http://www.ilr.cornell.edu/~hadi/RABE4/Data4/P005.txt"
rtw = pd.read_table(url)
rtw.head()
Out[5]:
City COL PD URate Pop Taxes Income RTWL
0 Atlanta 169 414 13.6 1790128 5128 2961 1
1 Austin 143 239 11.0 396891 4303 1711 1
2 Bakersfield 339 43 23.7 349874 4166 2122 0
3 Baltimore 173 951 21.0 2147850 5001 4654 0
4 Baton Rouge 99 255 16.0 411725 3965 1620 1

5 rows × 8 columns

A graphical way to visualize the relationship between Income and RTWL is the boxplot.

In [6]:
rtw.boxplot(column="Income", by="RTWL", figsize=(12,6))
Out[6]:
<matplotlib.axes.AxesSubplot at 0x10f935290>

One variable that may have an important effect on the relationship between is the cost of living COL. It also varies between right-to-work states.

In [7]:
rtw.boxplot(column="COL", by="RTWL", figsize=(12,6))
Out[7]:
<matplotlib.axes.AxesSubplot at 0x10fa55d90>

We may want to include more than one plot in a given display. We can do so with subplots.

In [8]:
f, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(12,6))
ax1.scatter(rtw.URate, rtw.COL)
ax2.scatter(rtw.URate, rtw.Income)
ax3.scatter(rtw.URate, rtw.Pop)
ax4.scatter(rtw.COL, rtw.Income)
Out[8]:
<matplotlib.collections.PathCollection at 0x10fdc3710>

pandas has a builtin function that will try to display all pairwise relationships in a given dataset, the function scatter_matrix.

In [9]:
axes = pd.scatter_matrix(rtw, figsize=(12,6))

In looking at all the pairwise relationships. There is a point that stands out from all the rest. This data point is New York City, the 27th row of the table.

In [10]:
rtw.ix[26]
Out[10]:
City      New York
COL            323
PD            6908
URate         39.2
Pop        9561089
Taxes         5260
Income        4862
RTWL             0
Name: 26, dtype: object
In [11]:
%%R
print(rtw.table[27,])
pairs(rtw.table[-27,], pch=23, bg='red')
Error in print(rtw.table[27, ]) : object 'rtw.table' not found

Right-to-work example

Building a model

Some of the main goals of this course:

  • Build a statistical model describing the effect of RTWL on Income.

  • This model should recognize that other variables also affect Income.

  • What sort of statistical confidence do we have in our conclusion about RTWL and Income?

  • Is the model adequate do describe this dataset?

  • Are there other (simpler, more complicated) models?

History of Probability

Chevalier de Méré (~1654)

What's likelier:

  • rolling at least one six in 4 throws of a single die
  • rolling at least one double six in 24 throws of a pair of dice

Chevalier de Méré's reasoning:

  • chance of one six $= \frac{1}{6}$
  • acerage number in four rolls $= 4 \times \left(\frac{1}{6}\right) = \frac{2}{3}$
  • chance of double six in one roll $= \frac{1}{36}$
  • average number in 24 rolls $= 24 \times \left(\frac{1}{36}\right) = \frac{2}{3}$
iteration  | bet 1 | bet 2 | bet 1: avg | bet 2: avg
        0     win     win     1.0          0.0
        1     win     win     1.0          1.0
        2     win     win     1.0          1.0
        3     lose    lose    0.75         0.75
        4     win     lose    0.8          0.6
        5     lose    lose    0.667        0.5
        6     lose    win     0.571        0.571
        7     win     win     0.625        0.625
        8     lose    win     0.556        0.667
        9     lose    lose    0.5          0.6

Let's run a simulation:

First let us instantiate a fair die using the sympy library:

In [12]:
D6 = Die('D6') 
density(D6).dict
Out[12]:
$$\begin{Bmatrix}1 : \frac{1}{6}, & 2 : \frac{1}{6}, & 3 : \frac{1}{6}, & 4 : \frac{1}{6}, & 5 : \frac{1}{6}, & 6 : \frac{1}{6}\end{Bmatrix}$$

Let's simulate one roll of the dice:

In [13]:
6 in (sample(D6), sample(D6), sample(D6), sample(D6))
Out[13]:
False

Now lets simulate 2000 trials:

In [14]:
one6in4 = pd.Series(6 in (sample(D6), sample(D6), sample(D6), sample(D6)) for _ in range(2000))
(one6in4.cumsum()/one6in4.index).plot(ylim=(0,1))
one6in4.mean()
Out[14]:
$$0.5175$$

Somewhat less than $^2/_3$ but still good if one is betting even odds.

What about two dice?

In [15]:
A, B = Die('A'), Die('B')
density(A + B).dict
Out[15]:
$$\begin{Bmatrix}2 : \frac{1}{36}, & 3 : \frac{1}{18}, & 4 : \frac{1}{12}, & 5 : \frac{1}{9}, & 6 : \frac{5}{36}, & 7 : \frac{1}{6}, & 8 : \frac{5}{36}, & 9 : \frac{1}{9}, & 10 : \frac{1}{12}, & 11 : \frac{1}{18}, & 12 : \frac{1}{36}\end{Bmatrix}$$

Can we get boxcars on 24 rolls of a pair of dice?

In [16]:
12 in sample_iter(A + B, numsamples=24)
Out[16]:
False

Now let's simulate 2000 trials:

In [17]:
one12in24 = pd.Series(12 in sample_iter(A + B, numsamples=24) for _ in range(2000))
(one12in24.cumsum()/one12in24.index).plot(ylim=(0,1))
one12in24.mean()
Out[17]:
$$0.499$$

Not so good for de Méré

Why did he lose more often with the second gamble?

De Méré asked his friend, Blaise Pascal what was going on.

Pascal consulted with Pierre de Fermat and the Theory of Probability was born.

What are the chances with one die?

In [18]:
P(Eq(D6, 6)), P(D6 < 6)
Out[18]:
$$\begin{pmatrix}\frac{1}{6}, & \frac{5}{6}\end{pmatrix}$$

What are the chances with two dice?

$$ P(win) = \frac{1}{6} + \frac{5}{6} \times \frac{1}{6} $$ $$ P(lose) = \frac{5}{6} \times \frac{5}{6} $$

In [19]:
P(Eq(D6, 6)) + P(Eq(D6, 6)) * P(D6 < 6), P(D6 < 6) * P(D6 < 6)
Out[19]:
$$\begin{pmatrix}\frac{11}{36}, & \frac{25}{36}\end{pmatrix}$$

What about three dice?

$$ P(win) = \frac{1}{6} + \frac{5}{6} \times \frac{1}{6} + \frac{5}{6} \times \frac{5}{6} \times \frac{1}{6}$$ $$ P(lose) = \frac{5}{6} \times \frac{5}{6} \times \frac{5}{6}$$

In [20]:
P(Eq(D6, 6)) + P(Eq(D6, 6)) * P(D6 < 6) + P(Eq(D6, 6)) * P(Eq(D6, 6)) * P(D6 < 6), P(D6 < 6)**3
Out[20]:
$$\begin{pmatrix}\frac{71}{216}, & \frac{125}{216}\end{pmatrix}$$

How about four dice?

$$ P(lose) = \left(\frac{5}{6}\right)^4 $$ $$ P(win) = 1 - P(lose) $$

In [21]:
1 - P(D6 < 6)**4, P(D6 < 6)**4
Out[21]:
$$\begin{pmatrix}\frac{671}{1296}, & \frac{625}{1296}\end{pmatrix}$$
In [22]:
(1 - P(D6 < 6)**4).evalf(3)
Out[22]:
$$0.518$$

Better than $^{50}/_{50}$, but not as good as de Méré thought.

Now, what are the chances of getting 12 on a pair of dice?

In [23]:
P(Eq(A + B, 12)), P(A + B < 12)
Out[23]:
$$\begin{pmatrix}\frac{1}{36}, & \frac{35}{36}\end{pmatrix}$$

So far, so good. What about 24 tosses of a pair of dice?

$$ P(lose) = \left(\frac{35}{36}\right)^{24} $$ $$ P(win) = 1 - P(lose) $$

In [24]:
1 - P(A + B < 12)**24, P(A + B < 12)**24
Out[24]:
$$\begin{pmatrix}\frac{11033126465283976852912127963392284191}{22452257707354557240087211123792674816}, & \frac{11419131242070580387175083160400390625}{22452257707354557240087211123792674816}\end{pmatrix}$$
In [25]:
(1 - P(A + B < 12)**24).evalf(3)
Out[25]:
$$0.491$$

Now we understand why de Méré kept losing the second bet.

Later we will investigate how long it might have taken de Méré to realize he was wrong, but first...

Numerical descriptive statistics

Mean of a sample

Given a sample of numbers $X=(X_1, \dots, X_n)$ the sample mean, $\overline{X}$ is $$ \overline{X} = \frac1n \sum_{i=1}^n X_i.$$

There are many ways to compute this in python.

In [26]:
X = [1,3,5,7,8,12,19]
print X
print np.mean(X)
print (X[0]+X[1]+X[2]+X[3]+X[4]+X[5]+X[6])/7
print np.sum(X)/len(X)
[1, 3, 5, 7, 8, 12, 19]
7.85714285714
7.85714285714
7.85714285714

We'll also illustrate thes calculations with part of an example we consider below, on differences in blood pressure between two groups.

In [27]:
url = 'http://lib.stat.cmu.edu/DASL/Datafiles/Calcium.html'
calcium = pd.read_table(url, skiprows=28, nrows=21)
treated = calcium.Decrease[calcium.Treatment == 'Calcium']
placebo = calcium.Decrease[calcium.Treatment == 'Placebo']
treated.mean(), placebo.mean()
Out[27]:
$$\begin{pmatrix}5.0, & -0.272727272727\end{pmatrix}$$

Standard deviation of a sample

Given a sample of numbers $X=(X_1, \dots, X_n)$ the sample standard deviation $S_X$ is $$ S^2_X = \frac{1}{n-1} \sum_{i=1}^n (X_i-\overline{X})^2.$$

In [28]:
S2 = sum((treated - treated.mean())**2) / (len(treated) - 1)
print np.sqrt(S2)
print treated.std()
8.74325136574
8.74325136574

Median of a sample

Given a sample of numbers $X=(X_1, \dots, X_n)$ the sample median is the middle of the sample: if $n$ is even, it is the average of the middle two points. If $n$ is odd, it is the midpoint.

In [29]:
X = pd.Series([1,3,5,7,8,12,19])
X.median()
Out[29]:
$$7.0$$

Quantiles of a sample

Given a sample of numbers $X=(X_1, \dots, X_n)$ the $q$-th quantile is a point $x_q$ in the data such that $q \cdot 100\%$ of the data lie to the left of $x_q$.

Example

The $0.5$-quantile is the median: half of the data lie to the right of the median.

In [30]:
X.quantile(.25)
Out[30]:
$$4.0$$

Graphical statistical summaries

We've already seen a boxplot. Another common statistical summary is a histogram.

In [31]:
treated.hist(figsize=(12,6), color="orange")
plt.title('Treated group')
plt.xlabel('Decrease')
Out[31]:
<matplotlib.text.Text at 0x111ef9e90>

p-values and Hypothesis Testing

Back to de Méré's Null Hypothesis $(H_0)$: $$ P(win) = \frac{2}{3} $$

In [32]:
X = Binomial('X', 1, S(2)/3)
density(X).dict 
Out[32]:
$$\begin{Bmatrix}0 : \frac{1}{3}, & 1 : \frac{2}{3}\end{Bmatrix}$$

If he plays three times, according to $H_0$, he should win:

In [33]:
X = Binomial('X', 3, S(2)/3)
density(X).dict
Out[33]:
$$\begin{Bmatrix}0 : \frac{1}{27}, & 1 : \frac{2}{9}, & 2 : \frac{4}{9}, & 3 : \frac{8}{27}\end{Bmatrix}$$

If he plays nine times, his odds:

In [34]:
X = Binomial('X', 9, S(2)/3)
density(X).dict
Out[34]:
$$\begin{Bmatrix}0 : \frac{1}{19683}, & 1 : \frac{2}{2187}, & 2 : \frac{16}{2187}, & 3 : \frac{224}{6561}, & 4 : \frac{224}{2187}, & 5 : \frac{448}{2187}, & 6 : \frac{1792}{6561}, & 7 : \frac{512}{2187}, & 8 : \frac{256}{2187}, & 9 : \frac{512}{19683}\end{Bmatrix}$$
In [35]:
plt.figure(figsize=(12,6))
colors = matplotlib.rcParams['axes.color_cycle']
p = 2/3
for k, color in zip(np.arange(4,13), colors):
    rv = binom(k, p)
    x = np.linspace(0.0, 1.0, k+1)
    y = rv.pmf(range(k+1))
    plt.plot(x, y, lw=2, color=color, label=k)
    plt.fill_between(x, y, color=color, alpha=1/9)
plt.legend()
plt.title("Relative distribution given $p = 2/3$")
plt.tight_layout()
plt.ylabel("PDF")
plt.xlabel("Relative Frequency")
Out[35]:
<matplotlib.text.Text at 0x113654990>

What are the chances of losing 6 out of 12 games, given: $$ H_0: P(win) = \frac{2}{3} $$

In [36]:
X = Binomial('X', 12, S(2)/3)
P(X <= 6).evalf(3)
Out[36]:
$$0.178$$

In other words, if we were to play 12 games 100 times, and $H_0$ were true, we could expect to lose half of those 12 about 18% of the time.

How many games do we need to play before we can conclude, from losing half the time, that $p(win) < \frac{2}{3}$

In [37]:
for t in range(12,53,2):
    X = Binomial('X', t, S(2)/3)
    print "{} trials: p-value = {}".format(t, P(X <= t/2).evalf(3))
12 trials: p-value = 0.178
14 trials: p-value = 0.149
16 trials: p-value = 0.126
18 trials: p-value = 0.108
20 trials: p-value = 0.0919
22 trials: p-value = 0.0787
24 trials: p-value = 0.0677
26 trials: p-value = 0.0583
28 trials: p-value = 0.0503
30 trials: p-value = 0.0435
32 trials: p-value = 0.0377
34 trials: p-value = 0.0327
36 trials: p-value = 0.0284
38 trials: p-value = 0.0246
40 trials: p-value = 0.0214
42 trials: p-value = 0.0187
44 trials: p-value = 0.0163
46 trials: p-value = 0.0142
48 trials: p-value = 0.0124
50 trials: p-value = 0.0108
52 trials: p-value = 0.00946

What is a p-value?

What it's not:

  • It is not the probability that the null hypothesis is true.
  • It is not the inverse of the probability that the alternative hypothesis is true.

What it is:

  • The p-value is the probability of an observed outcome if the null hypothesis were true.

N.B. Statistics is incapable of proving that anything is true. It can only suggest that something probably isn't.

In [38]:
plt.figure(figsize=(12,6))
k = range(30)
H_0 = binom(29, 0.667).pmf(k)
plt.plot(k, H_0, lw=2, color=colors[1], label="$H_0$")
plt.fill_between(k, H_0, color=colors[1], alpha=.5)
H_a = binom(29, 0.491).pmf(k)
plt.plot(k, H_a, lw=2, color=colors[0], label="$H_a$")
plt.fill_between(k, H_a, color=colors[0], alpha=.5)
plt.legend()
plt.title("Binomial distribution")
plt.tight_layout()
plt.ylabel("PDF at $k$")
plt.xlabel("$k$")
Out[38]:
<matplotlib.text.Text at 0x10ec011d0>

Inference about a population mean

A testing scenario

  • Suppose we want to determine the efficacy of a new drug on blood pressure.

  • Our study design is: we will treat a large patient population (maybe not so large: budget constraints limit it $n=20$) with the drug and measure their blood pressure before and after taking the drug.

  • One way to conclude that the drug is effective if the blood pressure has decreased. That is, if the average difference between before and after is positive.

Setting up the test

  • The null hypothesis, $H_0$ is: the average difference is less than zero.

  • The alternative hypothesis, $H_a$, is: the average difference is greater than zero.

  • Sometimes (actually, often), people will test the alternative, $H_a$: the average difference is not zero vs. $H_0$: the average difference is zero.

  • The test is performed by estimating the average difference and converting to standardized units.

Drawing from a box

  • Formally, could set up the above test as drawing from a box of differences in blood pressure.

  • A box model is a useful theoretical device that describes the experiment under consideration. In our example, we can think of the sample of decreases drawn 20 patients at random from a large population (box) containing all the possible decreases in blood pressure.

A simulated box model

  • In our box model, we will assume that the decrease is an integer drawn at random from $-3$ to 6.

  • We will draw 20 random integers from -3 to 6 with replacement and test whether the mean of our "box" is 0 or not.

In [39]:
mysample = np.random.randint(-3, 7, 20)
mysample
Out[39]:
array([ 4, -1,  0, -3,  4, -3,  6, -2,  6,  2,  4,  2,  0,  1,  5,  1,  0,
        2,  3,  2])

The test is usually a $T$ test that has the form $$ T = \frac{\overline{X}-0}{S_X/\sqrt{n}} $$

In [40]:
T = (mysample.mean() - 0) / (mysample.std() / np.sqrt(20))
T
Out[40]:
$$2.78354671907$$

This $T$ value is compared to a table for the appropriate $T$ distribution (in this case there are 19 degrees of freedom) and the 5% cutoff is

In [41]:
cutoff = stats.t.ppf(.975, 19)
cutoff
Out[41]:
$$2.09302405441$$

The result of the test is

In [42]:
result = abs(T) > cutoff
result
Out[42]:
True

If result is TRUE, then we reject $H_0$ the mean is 0, while if it is FALSE we do not reject. Of course, in this example we know the mean in our "box" is not 0, it is 1.5.

This rule can be visualized with the $T$ density. The code to produce the figure is a little long, but the figure is hopefully clear. The total grey area is 0.05=5%, and the cutoff is chosen to be symmetric around zero and such that this area is exactly 5%.

For a test of size $\alpha$ we write this cutoff $t_{n-1,1-\alpha/2}$.

In [43]:
density_fig = plt.figure(figsize=(10,6))
axes = density_fig.gca()

df = 19
x = np.linspace(-4,4,101)
axes.plot(x,tdist.pdf(x, df), linewidth=2, label=r'$T$, df=%d' % df)

xR = np.linspace(cutoff, 4, 101)
yR = tdist.pdf(xR, df)
xRpoly, yRpoly = pylab.poly_between(xR, 0*xR, yR)
axes.fill(xRpoly, yRpoly, facecolor='gray', hatch='\\', alpha=0.5)

xL = np.linspace(-4, -cutoff, 101)
yL = tdist.pdf(xL, df)
xLpoly, yLpoly = pylab.poly_between(xL, 0*xL, yL)
axes.fill(xLpoly, yLpoly, facecolor='gray', hatch='\\', alpha=0.5)

axes.set_xlabel('standardized units')
axes.set_ylabel('density')
axes.legend()
Out[43]:
<matplotlib.legend.Legend at 0x1150f4f10>

As mentioned above, sometimes tests are one-sided. If the null hypothesis we tested was that the mean is less than 0, then we would reject this hypothesis if our observed mean was much smaller than 0. This corresponds to a positive $T$ value.

In [44]:
density_fig = plt.figure(figsize=(10,6))
axes = density_fig.gca()

df = 19
cutoff = tdist.isf(0.05,df)

x = np.linspace(-4,4,101)
axes.plot(x,tdist.pdf(x, df), linewidth=2, label=r'$T$, df=%d' % df)

xR = np.linspace(cutoff, 4, 101)
yR = tdist.pdf(xR, df)
xLpoly, yLpoly = pylab.poly_between(xR, 0*xR, yR)
axes.fill(xLpoly, yLpoly, facecolor='gray', hatch='\\', alpha=0.5)

axes.set_xlabel('standardized units')
axes.set_ylabel('density')
axes.legend()
Out[44]:
<matplotlib.legend.Legend at 0x115100410>
The rejection rules are affected by the degrees of freedom. Here is the rejection region when we only have 5 samples from our "box".
In [45]:
density_fig = plt.figure(figsize=(10,6))
axes = density_fig.gca()

df = 4
cutoff = tdist.isf(0.05,df)

x = np.linspace(-4,4,101)
axes.plot(x,tdist.pdf(x, df), linewidth=2, label=r'$T$, df=%d' % df)

xR = np.linspace(cutoff, 4, 101)
yR = tdist.pdf(xR, df)
xRpoly, yRpoly = pylab.poly_between(xR, 0*xR, yR)
axes.fill(xRpoly, yRpoly, facecolor='gray', hatch='\\', alpha=0.5)

axes.set_xlabel('standardized units')
axes.set_ylabel('density')
axes.legend()
Out[45]:
<matplotlib.legend.Legend at 0x111e6b590>

Confidence intervals

  • Instead of testing a particular hypothesis, we might be interested in coming up with a reasonable range for the mean of our "box".

  • Statistically, this is done via a confidence interval.

  • If the 5% cutoff is $q$ for our test, then the 95% confidence interval is $$ [\bar{X} - q S_X / \sqrt{n}, \bar{X} + q S_X / \sqrt{n}] $$ where we recall $q=t_{n-1,0.975}$ with $n=20$.

  • If we wanted 90% confidence interval, we would use $q=t_{19,0.95}$. Why?

In [46]:
L = mysample.mean() - cutoff*mysample.std()/np.sqrt(20)
U = mysample.mean() + cutoff*mysample.std()/np.sqrt(20)
print "L: {}\tU: {}".format(L, U)
L: 0.386307472368	U: 2.91369252763

Note that the endpoints above depend on the data. Not every interval will cover the true mean of our "box" which is 1.5. Let's take a look at 100 intervals of size 90%. We would expect that roughly 90 of them cover 1.5.

In [47]:
cutoff = stats.t.ppf(.95, 19)
L, U, covered = [], [], []
for i in range(100):
    mysample = np.random.randint(-3, 7, 20)
    l = mysample.mean() - cutoff*mysample.std()/np.sqrt(20)
    u = mysample.mean() + cutoff*mysample.std()/np.sqrt(20)
    L.append(l)
    U.append(u)
    covered.append((l < 1.5) * (u > 1.5))

sum(covered)
Out[47]:
$$89$$

A useful picture is to plot all these intervals so we can see the randomness in the intervals, why the true mean of the box is unchanged.

In [48]:
mu = 1.5
plt.figure(figsize=(12,6))
plt.xlabel('Sample')
plt.xlim((0,100))
plt.ylabel('Confidence Intervals')
plt.ylim((-2.5+mu,2.5+mu))
for i in range(100):
    if covered[i]:
        col='g'
    else:
        col='r'
    plt.plot([i,i], [L[i],U[i]], color=col, linewidth=2)
plt.plot([0, 100], [mu, mu], '--', color='k', linewidth=4)
Out[48]:
[<matplotlib.lines.Line2D at 0x111afead0>]

Blood pressure example

  • A study was conducted to study the effect of calcium supplements on blood pressure.

  • More detailed data description can be here.

  • We had loaded the data above, storing the two samples in the variables treated and placebo.

  • Some questions might be:

    • What is the mean decrease in BP in the treated group? placebo group?
    • What is the median decrease in BP in the treated group? placebo group?
    • What is the standard deviation of decrease in BP in the treated group? placebo group?
    • Is there a difference between the two groups? Did BP decrease more in \ the treated group?
In [49]:
print pd.DataFrame({'treated': treated.describe(), 'placebo': placebo.describe()}).T
calcium.boxplot('Decrease', 'Treatment', figsize=(12,6))
         count      mean       std  min   25%  50%    75%  max
placebo     11 -0.272727  5.900693  -11 -3.00   -1   2.50   12
treated     10  5.000000  8.743251   -5 -2.75    4  10.75   18

[2 rows x 8 columns]

Out[49]:
<matplotlib.axes.AxesSubplot at 0x11302e890>

A hypothesis test

In our setting, we have two groups that we have reason to believe are different.

  • We have two samples:

    • $(X_1, \dots, X_{10})$ (treated)
    • $(Z_1, \dots, Z_{11})$ (placebo)
  • We can answer this statistically by testing the null hypothesis $$H_0:\mu_X = \mu_Z.$$

  • If variances are equal, the pooled $t$-test is appropriate.

Pooled $t$ test

  • The test statistic is $$ T = \frac{\overline{X} - \overline{Z}}{S_P \sqrt{\frac{1}{10} + \frac{1}{11}}}, \qquad S^2_P = \frac{9 \cdot S^2_X + 10 \cdot S^2_Z}{19}.$$

  • For two-sided test at level $\alpha=0.05$, reject if $|T| > t_{19, 0.975}$.

  • Confidence interval: for example, a $90\%$ confidence interval for $\mu_X-\mu_Z$ is $$ \overline{X}-\overline{Z} \pm S_P \sqrt{\frac{1}{10} + \frac{1}{11}} \cdot t_{19,0.95}.$$

In [50]:
sdP = np.sqrt((9*treated.std()**2 + 10*placebo.std()**2)/19)
T = (treated.mean() - placebo.mean()) / (sdP * np.sqrt(1/10 + 1/11))
T
Out[50]:
$$1.63410824159$$

scipy.stats has a builtin function to perform such $t$-tests.

In [51]:
t, p = stats.ttest_ind(treated, placebo, equal_var=True)
't-statistic = {:6.6} pvalue = {:.4}'.format(t, p)
Out[51]:
't-statistic = 1.6341 pvalue = 0.1187'

If we don't make the assumption of equal variance, R will give a slightly different result.

In [52]:
t, p = stats.ttest_ind(treated, placebo, equal_var=False)
't-statistic = {:6.6} pvalue = {:.4}'.format(t, p)
Out[52]:
't-statistic = 1.6037 pvalue = 0.1288'

Pooled estimate of variance

  • The rule for the $SD$ of differences is $$ SD(\overline{X}-\overline{Z}) = \sqrt{SD(\overline{X})^2+SD(\overline{Z})^2}$$

  • By this rule, we might take our estimate to be $$ \widehat{SD(\overline{X}-\overline{Z})} = \sqrt{\frac{S^2_X}{10} + \frac{S^2_Z}{11}}. $$

  • The pooled estimate assumes $\mathbb{E}(S^2_X)=\mathbb{E}(S^2_Z)=\sigma^2$ and replaces the $S^2$'s above with $S^2_P$, a better estimate of $\sigma^2$ than either $S^2_X$ or $S^2_Z$.

Where do we get $df=19$?

Well, the $X$ sample has $10-1=9$ degrees of freedom to estimate $\sigma^2$ while the $Z$ sample has $11-1=10$ degrees of freedom.

Therefore, the total degrees of freedom is $9+10=19$.

Our first regression model

  • We can put the two samples together: $$Y=(X_1,\dots, X_{10}, Z_1, \dots, Z_{11}).$$

  • Under the same assumptions as the pooled $t$-test: $$ \begin{aligned} Y_i &\sim N(\mu_i, \sigma^2)\\ \mu_i &= \begin{cases} \mu_X & 1 \leq i \leq 10 \\ \mu_Z & 11 \leq i \leq 21. \end{cases} \end{aligned} $$

  • This is a (regression) model for the sample $Y$. The (qualitative) variable Treatment is called a covariate or predictor.

  • The decrease in BP is the outcome.

  • We assume that the relationship between treatment and average decrease in BP is simple: it depends only on which group a subject is in.

  • This relationship is modelled through the mean vector $\mu=(\mu_1, \dots, \mu_{21})$.

In [53]:
mod = sm.OLS.from_formula(formula='Decrease ~ Treatment', data=calcium)
res = mod.fit()
print res.summary()
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               Decrease   R-squared:                       0.123
Model:                            OLS   Adj. R-squared:                  0.077
Method:                 Least Squares   F-statistic:                     2.670
Date:                Mon, 28 Apr 2014   Prob (F-statistic):              0.119
Time:                        18:04:50   Log-Likelihood:                -70.735
No. Observations:                  21   AIC:                             145.5
Df Residuals:                      19   BIC:                             147.6
Df Model:                           1                                         
========================================================================================
                           coef    std err          t      P>|t|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------------
Intercept                5.0000      2.335      2.141      0.045         0.112     9.888
Treatment[T.Placebo]    -5.2727      3.227     -1.634      0.119       -12.026     1.481
==============================================================================
Omnibus:                        1.015   Durbin-Watson:                   2.085
Prob(Omnibus):                  0.602   Jarque-Bera (JB):                0.918
Skew:                           0.314   Prob(JB):                        0.632
Kurtosis:                       2.191   Cond. No.                         2.68
==============================================================================