Alessandro D. Gagliardi
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import scipy.stats as stats
import statsmodels.formula.api as smf
This dataset contains six numeric survey responses $X_1, \ldots, X_6$ and a dependent variable $Y$ (perceived supervisor quality).
We want to predict $Y$ from the $X$'s.
x = pd.read_table('http://www.ats.ucla.edu/stat/examples/chp/p054.txt')
x.head()
The column names have trailing whitespace, so I fix that by mapping str.strip onto the column names.
x.columns = x.columns.map(str.strip)
This set of scatterplots gives us an idea of the pairwise relationships present in the dataset.
_ = pd.plotting.scatter_matrix(x, figsize=(7, 7))
This linear fit represents the "full model", i.e., the fit with all of the independent variables included.
lm = smf.ols('Y ~ X1 + X2 + X3 + X4 + X5 + X6', data=x)
fit = lm.fit()
print(fit.summary())
Remove the feature with the lowest absolute t-statistic (here X5).
fit2 = smf.ols('Y ~ X1 + X2 + X3 + X4 + X6', data=x).fit()
print(fit2.summary())
Note that $R^2$ decreases slightly, but adjusted $R^2$ increases slightly
$\rightarrow$ increasing bias, decreasing variance
The same step is repeated: drop the predictor with the lowest absolute t-statistic (X4 here, then X2 in the following fit).
fit3 = smf.ols('Y ~ X1 + X2 + X3 + X6', data=x).fit()
print(fit3.summary())
fit4 = smf.ols('Y ~ X1 + X3 + X6', data=x).fit()
print(fit4.summary())
Stopping criterion met: all features have $|t| > 1$
$\rightarrow$ optimal bias-variance point reached
$\rightarrow$ Residual standard error (RSE) minimized
fit5 = smf.ols('Y ~ X1 + X3', data=x).fit()
print(fit5.summary())
Note that this model is weaker (lower adjusted $R^2$).
fit6 = smf.ols('Y ~ X1', data=x).fit()
print(fit6.summary())
Likewise: this model is weaker still (lower adjusted $R^2$).
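The manual elimination above can be automated. Below is a minimal sketch (an addition, not part of the original analysis) that repeatedly drops the predictor with the smallest absolute t-statistic until every remaining predictor has $|t| > 1$; with this data it should stop at the same model as fit4.
# Backward elimination using the stopping rule above (all |t| > 1).
# Assumes `x` and `smf` are defined as in the cells above.
predictors = ['X1', 'X2', 'X3', 'X4', 'X5', 'X6']
while True:
    fit_step = smf.ols('Y ~ ' + ' + '.join(predictors), data=x).fit()
    tvals = fit_step.tvalues.drop('Intercept').abs()
    if tvals.min() > 1:
        break                          # stopping criterion met
    predictors.remove(tvals.idxmin())  # drop the weakest predictor
print(fit_step.summary())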
We want to see an absence of structure in the residual scatterplot (the residuals should look like Gaussian white noise).
fit4.resid.plot(style='o', figsize=(12,8))
$\rightarrow$ this plot looks reasonable; also note that the residual quartiles look good
plt.figure(figsize=(12,8))
_ = stats.probplot(fit4.resid, dist="norm", plot=plt)
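A related diagnostic, added here as a sketch, is to plot the residuals against the fitted values; any visible pattern would suggest non-linearity or non-constant variance.
# Residuals vs. fitted values for the selected model (fit4);
# a patternless horizontal band supports the linearity and
# constant-variance assumptions.
plt.figure(figsize=(12, 8))
plt.scatter(fit4.fittedvalues, fit4.resid)
plt.axhline(0, color='gray', linestyle='--')
plt.xlabel('fitted values')
plt.ylabel('residuals')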
A national insurance organization wanted to study the consumption pattern of cigarettes in all 50 states and the District of Columbia. The variables chosen for the study are:
Age: Median age of a person living in a state.
HS: Percentage of people over 25 years of age in a state who had completed high school.
Income: Per capita personal income for a state (income in dollars).
Black: Percentage of blacks living in a state.
Female: Percentage of females living in a state.
Price: Weighted average price (in cents) of a pack of cigarettes in a state.
Sales: Number of packs of cigarettes sold in a state on a per capita basis.
The data can be found at http://www1.aucegypt.edu/faculty/hadi/RABE5/Data5/P088.txt.
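The file can be loaded the same way as the supervisor data above; this sketch assumes it is tab-delimited with a header row containing the variable names listed above.
# Load the cigarette data; strip any stray whitespace from the column names.
cig = pd.read_table('http://www1.aucegypt.edu/faculty/hadi/RABE5/Data5/P088.txt')
cig.columns = cig.columns.map(str.strip)
cig.head()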
Below, specify the null and alternative hypotheses, the test used, and your conclusion using a 5% level of significance. (A sketch of the statsmodels mechanics follows these questions.)
Test the hypothesis that the variable Female is not needed in the regression equation relating Sales to the six predictor variables.
Test the hypothesis that the variables Female and HS are not needed in the above regression equation.
Compute a 95% confidence interval for the true regression coefficient of the variable Income.
What percentage of the variation in Sales can be accounted for when Income is removed from the above regression equation? Which model did you use?
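These questions can be addressed with partial F-tests comparing nested models. Here is a minimal sketch of the mechanics, assuming the data were loaded into a DataFrame `cig` with the column names listed above; stating the hypotheses and drawing conclusions is left as part of the exercise.
# Full model and two reduced models for the partial F-tests.
full = smf.ols('Sales ~ Age + HS + Income + Black + Female + Price', data=cig).fit()
no_female = smf.ols('Sales ~ Age + HS + Income + Black + Price', data=cig).fit()
no_female_hs = smf.ols('Sales ~ Age + Income + Black + Price', data=cig).fit()
# Partial F-test: H0 is that the dropped coefficient(s) equal zero.
print(full.compare_f_test(no_female))      # (F statistic, p-value, df difference)
print(full.compare_f_test(no_female_hs))
# 95% confidence interval for the Income coefficient.
print(full.conf_int().loc['Income'])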
Use enron.db from last week.
Select a DataFrame containing the recipient count (as before) and the department of the sender for each message.
Create a histogram of the recipient count.
Create a new column based on the log of the recipient count.
Create a histogram of that log.
Create a boxplot of the log, splitting the data based on the department of the sender.
Compute the sample mean and standard deviation of the log in the three groups.
Compute a 95% confidence interval for the difference in recipient count between the three groups.
At level $\alpha=5\%$, test the null hypothesis that the average recipient count does not differ between the three groups. What assumptions are you making? What can you conclude? (One possible setup is sketched below.)
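A sketch of one possible setup, assuming the query results are already in a DataFrame `df` with placeholder columns `rcount` (recipient count) and `department` (the sender's department); these column names are assumptions, not part of the exercise.
import numpy as np
# Log-transform the recipient count and look at its distribution.
df['log_rcount'] = np.log(df['rcount'])
df['log_rcount'].hist(bins=30)
df.boxplot(column='log_rcount', by='department')
# Per-group sample mean and standard deviation of the log.
print(df.groupby('department')['log_rcount'].agg(['mean', 'std']))
# One-way ANOVA: H0 is that the group means are equal
# (assumes independent samples, roughly normal logs, and equal variances).
groups = [g['log_rcount'].values for _, g in df.groupby('department')]
print(stats.f_oneway(*groups))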
A criminologist studying the relationship between income level and assaults in U.S. cities (among other things) collected the following data for 2215 communities. The dataset can be found on the UCI machine learning site.
We are interested in the per capita assault rate and its relation to median income.
crime = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/00211/CommViolPredUnnormalizedData.txt", header = None, na_values = '?',
names = ['communityname', 'state', 'countyCode', 'communityCode', 'fold', 'population', 'householdsize', 'racepctblack', 'racePctWhite', 'racePctAsian', 'racePctHisp', 'agePct12t21', 'agePct12t29', 'agePct16t24', 'agePct65up', 'numbUrban', 'pctUrban', 'medIncome', 'pctWWage', 'pctWFarmSelf', 'pctWInvInc', 'pctWSocSec', 'pctWPubAsst', 'pctWRetire', 'medFamInc', 'perCapInc', 'whitePerCap', 'blackPerCap', 'indianPerCap', 'AsianPerCap', 'OtherPerCap', 'HispPerCap', 'NumUnderPov', 'PctPopUnderPov', 'PctLess9thGrade', 'PctNotHSGrad', 'PctBSorMore', 'PctUnemployed', 'PctEmploy', 'PctEmplManu', 'PctEmplProfServ', 'PctOccupManu', 'PctOccupMgmtProf', 'MalePctDivorce', 'MalePctNevMarr', 'FemalePctDiv', 'TotalPctDiv', 'PersPerFam', 'PctFam2Par', 'PctKids2Par', 'PctYoungKids2Par', 'PctTeen2Par', 'PctWorkMomYoungKids', 'PctWorkMom', 'NumKidsBornNeverMar', 'PctKidsBornNeverMar', 'NumImmig', 'PctImmigRecent', 'PctImmigRec5', 'PctImmigRec8', 'PctImmigRec10', 'PctRecentImmig', 'PctRecImmig5', 'PctRecImmig8', 'PctRecImmig10', 'PctSpeakEnglOnly', 'PctNotSpeakEnglWell', 'PctLargHouseFam', 'PctLargHouseOccup', 'PersPerOccupHous', 'PersPerOwnOccHous', 'PersPerRentOccHous', 'PctPersOwnOccup', 'PctPersDenseHous', 'PctHousLess3BR', 'MedNumBR', 'HousVacant', 'PctHousOccup', 'PctHousOwnOcc', 'PctVacantBoarded', 'PctVacMore6Mos', 'MedYrHousBuilt', 'PctHousNoPhone', 'PctWOFullPlumb', 'OwnOccLowQuart', 'OwnOccMedVal', 'OwnOccHiQuart', 'OwnOccQrange', 'RentLowQ', 'RentMedian', 'RentHighQ', 'RentQrange', 'MedRent', 'MedRentPctHousInc', 'MedOwnCostPctInc', 'MedOwnCostPctIncNoMtg', 'NumInShelters', 'NumStreet', 'PctForeignBorn', 'PctBornSameState', 'PctSameHouse85', 'PctSameCity85', 'PctSameState85', 'LemasSwornFT', 'LemasSwFTPerPop', 'LemasSwFTFieldOps', 'LemasSwFTFieldPerPop', 'LemasTotalReq', 'LemasTotReqPerPop', 'PolicReqPerOffic', 'PolicPerPop', 'RacialMatchCommPol', 'PctPolicWhite', 'PctPolicBlack', 'PctPolicHisp', 'PctPolicAsian', 'PctPolicMinor', 'OfficAssgnDrugUnits', 'NumKindsDrugsSeiz', 'PolicAveOTWorked', 'LandArea', 'PopDens', 'PctUsePubTrans', 'PolicCars', 'PolicOperBudg', 'LemasPctPolicOnPatr', 'LemasGangUnitDeploy', 'LemasPctOfficDrugUn', 'PolicBudgPerPop', 'murders', 'murdPerPop', 'rapes', 'rapesPerPop', 'robberies', 'robbbPerPop', 'assaults', 'assaultPerPop', 'burglaries', 'burglPerPop', 'larcenies', 'larcPerPop', 'autoTheft', 'autoTheftPerPop', 'arsons', 'arsonsPerPop', 'ViolentCrimesPerPop', 'nonViolPerPop'])
Fit a simple linear regression model to the data with np.log(crime.assaults) as the dependent variable and np.log(crime.medIncome) as the independent variable. Plot the estimated regression line. (A sketch of this fit appears after these questions.)
Test whether there is a linear relationship between assaults and medIncome at level $\alpha=0.05$. State the null hypothesis, the alternative, the conclusion and the $p$-value.
Give a 95% confidence interval for the slope of the regression line. Interpret your interval.
Report the $R^2$ and the adjusted $R^2$ of the model, as well as an estimate of the variance of the errors in the model.
Go to archive.ics.uci.edu/ml/datasets/Communities+and+Crime+Unnormalized and pick out a few other factors that might help you predict assaults.
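A minimal sketch of the log-log fit described above; dropping missing and non-positive values before taking logs is an assumption added here, not part of the original prompt.
import numpy as np
# Keep rows where both variables are present and positive (needed for the log).
sub = crime[['assaults', 'medIncome']].dropna()
sub = sub[(sub.assaults > 0) & (sub.medIncome > 0)]
crime_fit = smf.ols('np.log(assaults) ~ np.log(medIncome)', data=sub).fit()
print(crime_fit.summary())      # slope t-test and p-value, R^2, adjusted R^2
print(crime_fit.conf_int())     # 95% CIs for the intercept and slope
print(crime_fit.mse_resid)      # estimate of the error variance
# Scatter of the logged data with the estimated regression line.
plt.figure(figsize=(8, 6))
plt.scatter(np.log(sub.medIncome), np.log(sub.assaults), alpha=0.3)
order = np.argsort(np.log(sub.medIncome).values)
plt.plot(np.log(sub.medIncome).values[order], crime_fit.fittedvalues.values[order], color='red')
plt.xlabel('log(medIncome)')
plt.ylabel('log(assaults)')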