Data Science

Introduction and Overview

Alessandro Gagliardi
Sr. Data Scientist, Glassdoor.com

Welcome!

Instructor: Alessandro Gagliardi
ADFGagliardi+GA@Gmail.com
EiRs: Drew Stevens, dss9@columbia.edu
Nir Kaldero, nirkaldero@gmail.com
Classes: 6:30pm-9:30pm, Mondays and Wednesdays
April 14 – June 30, 2014 (no class May 26)
Office Hours: TBD

Who am I?

Alessandro Gagliardi

  • 1997 - 2001 - Studied Computer Science at UC Santa Cruz
  • 2001 - 2002 - Developed web-based educational CRM in New York
  • 2002 - 2003 - Took some time off
  • 2003 - 2005 - Worked as an independent consultant for startups in New York
  • 2005 - 2010 - Studied Integrative Neuroscience at Rutgers
  • 2010 - 2011 - Taught Psychology and Neuroscience at USF, NDNU, CIIS
  • 2011 - 2014 - Returned to industry as a Data Scientist at Socialize, Path, and Glassdoor

Who are the Experts in Residence?

Drew Stevens

  • 1989 - 1993 - Studied Mechanical Engineering and Computer Science at Columbia University.
  • 1993 - 1997 - Technology Consultant at Andersen Consulting (now Accenture) in New York; clients included Philip Morris, Smith Barney, Xerox & Tosco (now ConocoPhillips)
  • 1997 - 2000 - Application Architect for a Knowledge Management System at McKinsey & Company in New York
  • 2000 - 2001 - Director of Technology at Offerlab, a short-lived co-marketing exchange in New York
  • 2001 - 2002 - Director of Technology at Voyant, a web-based virtual eyeglass shop marketed to ophthalmologists in New York
  • 2002 - Sabbatical in Southeast Asia
  • 2003 - 2008 - Technology Consultant on various projects to improve legal transparency in Hanoi, Vietnam
  • 2009 - 2014 - Technical Architect at Stem, an energy optimization company in the San Francisco Bay Area

Who are the Experts in Residence?

Nir Kaldero

  • 2007 - 2010 - Served in a data science (Economics) unit @ the IDF Intelligence Corps
  • 2010 - 2012 - Studied Economics and Business @ IDC, Israel
  • 2011 - Worked as a Consultant & Team Leader for Maccabi World Union (YLC)
  • 2011 - 2012 - Worked as a Strategic Consultant for Cadillac
  • 2012 - 2013 - Transferred to UC Berkeley as a visiting PhD student @ the Econ Department
  • 2012 - Present - Research Scholar (Visiting Scholar) @ UC Berkeley, Econ
  • 2013 - Present - Working as an independent consultant for startups in the Bay Area

Fields of Research: Industrial Organization, Marketing (Strategy), Microeconomics, Finance.

Who are you?

  • Your name
  • Where you work and what you do there
  • What you hope to get out of this course

Today:

  1. Lecture:
    1. What is Data Science?
    2. Origins of Data Science
    3. Goals of the Course
  2. Lab:
    • Python

A. What is Data Science?

Data Science = Data + Science

Data can be:

Unstructured (e.g. Email, Photos, Books, etc.)

Semi-Structured (e.g. XML, JSON, NoSQL, APIs, etc.)

Structured (e.g. SQL, Data Frames, etc.)

  1. Most data in the world is unstructured. That is, it does not conform to any predetermined computer-readable form. Before we can work with this sort of data, we need to extract it, transform it into something usable, and load it into our system, whatever that system may be. Extract, transform, load (ETL) is one of the less glamorous and more time-consuming parts of the job, so it is important to know how to do it efficiently. We will cover this in the next class.
  2. Semi-structured data is a relatively new thing and has a lot to do with the web. Semi-structured data is machine-readable but does not conform to a rigid structure. This is both a blessing and a curse. The flexibility it provides makes it a lot easier for different systems to talk to one another, but that same flexibility makes working with the data in aggregate more difficult. Many "NoSQL" databases use a semi-structured schema. Those of you who are over 30 probably remember XML, which was a popular standard for semi-structured data in the late 1990s and early 2000s. Fortunately, XML has largely been replaced by JSON, which is easier for both humans (because it doesn't have so many angle brackets) and computers (because it is less flexible and therefore less ambiguous) to read. We will cover how to work with this sort of data next Monday.
  3. Ultimately, data scientists need data to be structured before we can do anything with it. Relational databases are structured, and once we get into Pandas and R, we will learn about data frames, which are also structured. Structured data are consistent throughout: a given field will have the same data type no matter what. This is extremely important because it makes it possible to work across all of the data in aggregate, which opens up everything from calculating sums to measuring relationships and predicting outcomes. (A minimal code sketch of the semi-structured-to-structured step follows this list.) We will cover the basics of how to work with this kind of data next week, and for the most part, the rest of this course will deal with data in this form.
    1. As an aside, you might be wondering where Excel spreadsheets fit in this list. I would place them somewhere between structured and semi-structured; we might call them "mostly structured". Unlike fully structured data, an Excel spreadsheet will accept different data types in a single column. But unlike semi-structured data, it will complain when you do this (at least as soon as you try to do an operation on that column). It probably belongs in the semi-structured category, but because of the ability to do structured operations in a spreadsheet, I'm reluctant to put it there.
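
To make the semi-structured vs. structured distinction concrete, here is a minimal Python sketch (the records and field names are made up for illustration) that parses semi-structured JSON and loads it into a structured pandas data frame:

    import json
    import pandas as pd

    # Semi-structured: machine-readable, but records need not share a schema.
    raw = '''
    [{"user": "alice", "age": 34, "city": "New York"},
     {"user": "bob", "city": "San Francisco"}]
    '''

    records = json.loads(raw)   # "bob" has no "age" field: flexible, but inconsistent
    df = pd.DataFrame(records)  # structured: one consistently typed column per field
    print(df.dtypes)            # the missing value becomes NaN, so "age" is float64
    print(df["age"].mean())     # aggregate operations now work across all rows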

Science can be:

Explorations / Explanations

  • Data Visualization (e.g. ggplot2, Tableau, d3.js, etc.)
  • Unsupervised Machine Learning (e.g. clustering, etc.)
  • etc....

Inferences / Predictions

  • Regression Models (e.g. Linear Models, Logistic Regression)
  • Supervised Machine Learning (e.g. Neural Nets, Genetic Algorithms)
  • etc....
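
As a concrete preview of this split, here is a minimal Python sketch (assuming scikit-learn is available; the toy data is made up): k-means clustering finds groups in unlabeled data (exploration), while logistic regression learns from labeled examples in order to predict new ones (prediction).

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LogisticRegression

    # Toy two-dimensional data: two loose blobs of 50 points each.
    rng = np.random.RandomState(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
    y = np.array([0] * 50 + [1] * 50)

    # Unsupervised: k-means sees only X and proposes two clusters on its own.
    clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    # Supervised: logistic regression sees X with its labels y, then predicts.
    model = LogisticRegression().fit(X, y)
    print(model.predict([[4.5, 5.2]]))  # a point near the second blob: likely class 1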

Data Science Workflow

From "A Taxonomy of Data Science" (by the Dataists)

A. Obtain

B. Scrub

C. Explore

D. Model

E. Interpret

Workflow Example:

Problem: What are the leading indicators that a user will make a new purchase?

A. Collect data around user retention and user actions within the product; potentially find data outside the company

B. Extract aggregated values from the raw data (see the code sketch after this example)

  1. How many times did a user share through Facebook within a week? A month?
  2. How often did they open up our emails?

C. Examine data to find common distributions and correlations

D. Model the data to predict whether a user will purchase again

E. Share results (and probably also go back to the drawing board)
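
To make step B concrete, here is a minimal pandas sketch (the event log and its column names are hypothetical; a real log would also carry timestamps for the per-week counts above) that rolls raw per-action records up into per-user feature counts, with the start of step C:

    import pandas as pd

    # Hypothetical raw event log: one row per user action.
    events = pd.DataFrame({
        "user_id": [1, 1, 1, 2, 2, 2, 2, 3],
        "action": ["fb_share", "email_open", "email_open", "fb_share",
                   "fb_share", "fb_share", "email_open", "email_open"],
    })

    # B. Scrub: aggregate raw events into one row of counts per user.
    features = pd.crosstab(events["user_id"], events["action"])
    print(features)

    # C. Explore: examine distributions and correlations among the features.
    print(features.describe())
    print(features.corr())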

B. Origins of Data Science

1974 - Peter Naur published Concise Survey of Computer Methods:

A basic principle of data science is this: The data representation must be chosen with due regard to the transformation to be achieved and the data processing tools available. This stresses the importance of concern for the characteristics of the data processing tools.

Data Science = Computer Science + Statistics

Computer Science

  1. Artificial Intelligence (Alan Turing)
  2. Relational Calculus / First-Order Logic (Edgar F. Codd)

Statistics

  1. Probability and Significance (Ronald Fisher)
  2. Statistical Learning Theory (Vladimir N. Vapnik)

Other Fields Contributing to Data Science

Psychology

  1. Signal Detection Theory
  2. Connectionism & Parallel Distributed Processing

Complex Systems

  1. Artificial Life
  2. Graph Theory

What else?

C. Goals of the Course

At the completion of this course, you will be able to:

  • Employ the Map/Reduce paradigm to transform big unstructured data
  • Access data from web-based application programming interfaces (APIs)
  • Use Structured Query Language (SQL) operations like JOIN and GROUP BY
  • Explore and present data through visualizations
  • Apply generalized linear models (GLMs)
  • Detect clusters in multivariate data
  • Predict categories using supervised machine learning techniques

At the completion of this course, you will be able to:

  • Explore and present data through visualizations
  • Apply the 1st, 2nd, and 3rd Normal Forms
  • Access data from web-based application programming interfaces (APIs)
  • Determine statistical significance
  • Apply generalized linear models (GLMs)
  • Detect clusters in multivariate data
  • Predict categories using supervised machine learning techniques
  • Apply these techniques to big data

Tentative Course Outline

  1. Intro and Overview
  2. Getting started with IPython & Git
  3. SQL and Relational Theory
  4. APIs and semi-structured data
  5. Structuring Data
  6. Probability, A/B Tests & Statistical Significance
  7. Intro to ML and Linear Regression, ANOVA
  8. Logistic Regression, Regularization
  9. Principal Components Analysis
  10. Clustering: Hierarchical and K-Means
  11. Time Series Analysis
  12. Midterm Review & Project Elevator Pitches
  13. Nonlinear Models
  14. Guest Lecture TBD
  15. Grid Search and Parameter Selection
  16. IPython.parallel & StarCluster
  17. Hadoop Distributed File System and Streaming
  18. Guest Lecture TBD & Final Project Working Session
  19. Guest Lecture TBD & Final Project Working Session
  20. Final Project Presentations (12 min. each)
  21. Final Project Presentations (12 min. each)
  22. Where to Go Next

Project Schedule

Date      Milestone
4/23      Preliminary Project Proposals Due (3-4 sentences)
4/30      EiR Feedback on Project Proposals Returned
5/7       Formal Proposals Due (including data and methods chosen)
5/14      EiR Feedback on Formal Proposals Returned
5/14      Midterm Assessment Due
5/21      Projects live on GitHub
5/21      Project Elevator Pitch in class (4 minutes each)
5/28      Peer Feedback Due
6/4       At least one working model
6/11      Final EiR Feedback Due
6/23-25   Final Presentations (12 minutes each)

Typical Class

  • Pre-work (0-30 min.) Students investigate a topic and present
  • Theory and Background (30-80 min.) Laptops closed, paper and pen(cil) only
  • Code walk-through (20-45 min.) Laptops open, follow along
  • Break (5-10 min.) You know what this is
  • Pair Programming (30-100 min.) One student drives, the other navigates
  • Individual work (0-40 min.) Time permitting, otherwise homework

Lab