Data Science

Introduction and Overview

Alessandro Gagliardi
Sr. Data Scientist, Glassdoor.com

Welcome!

Instructor: Alessandro Gagliardi
ADFGagliardi+GA@Gmail.com
EiRs: Drew Stevens, dss9@columbia.edu
Nir Kaldero, nirkaldero@gmail.com
Classes: 6:30pm-9:30pm, Mondays and Wednesdays
April 14 – June 30, 2014 (no class May 26)
Office Hours: TBD

Who am I?

Alessandro Gagliardi

  • 1997 - 2001 - Studied Computer Science at UC Santa Cruz
  • 2001 - 2002 - Developed web-based educational CRM in New York
  • 2002 - 2003 - Took some time off
  • 2003 - 2005 - Worked as an independent consultant for startups in New York
  • 2005 - 2010 - Studied Integrative Neuroscience at Rutgers
  • 2010 - 2011 - Taught Psychology and Neuroscience at USF, NDNU, CIIS
  • 2011 - 2014 - Returned to industry as a Data Scientist at Socialize, Path, and Glassdoor

Who are the Experts in Residence?

Drew Stevens

  • 1989 - 1993 - Studied Mechanical Engineering and Computer Science at Columbia University.
  • 1993 - 1997 - Technology Consultant at Andersen Consulting (now Accenture) in New York; clients included Philip Morris, Smith Barney, Xerox & Tosco (now ConocoPhillips)
  • 1997 - 2000 - Application Architect for a Knowledge Management System at McKinsey & Company in New York
  • 2000 - 2001 - Director of Technology at Offerlab, a short-lived co-marketing exchange in New York
  • 2001 - 2002 - Director of Technology at Voyant, a web-based virtual eyeglass shop marketed to ophthalmologists in New York
  • 2002 - Sabbatical in Southeast Asia
  • 2003 - 2008 - Technology Consultant on various projects to improve legal transparency in Hanoi, Vietnam
  • 2009 - 2014 - Technical Architect at Stem, an energy optimization company in the San Francisco Bay Area

Who are the Experts in Residence?

Nir Kaldero

  • 2007 - 2010 - Served in a data science (Economics) unit @ the IDF Intelligence Corps
  • 2010 - 2012 - Studied Economics and Business @ IDC, Israel
  • 2011 - Worked as a Consultant & Team Leader for Maccabi World Union (YLC)
  • 2011 - 2012 - Worked as a Strategic Consultant for Cadillac
  • 2012 - 2013 - Transferred to UC Berkeley as a visiting PhD student @ the Econ Department
  • 2012 - Present - Research Scholar (Visiting Scholar) @ UC Berkeley, Econ
  • 2013 - Present - Working as an independent consultant for startups in the Bay Area

Fields of Research: Industrial Organization, Marketing (Strategy), Microeconomics, Finance.

Who are you?

  • Your name
  • Where you work and what you do there
  • What you hope to get out of this course

Today:

  1. Lecture:
    1. What is Data Science?
    2. Origins of Data Science
    3. Goals of the Course
  2. Lab:
    • Python

A. What is Data Science?

Data Science = Data + Science

Data can be:

Unstructured (e.g. Email, Photos, Books, etc.)

Semi-Structured (e.g. XML, JSON, NoSQL, APIs, etc.)

Structured (e.g. SQL, Data Frames, etc.)

  1. Most data in the world is unstructured. That is, it does not conform to any predetermined computer-readable form. Before we can work with this sort of data, we need to extract it, transform it into something usable, and load it into our system, whatever that system may be. Extract, transform, load (ETL) is one of the less glamorous and more time-consuming parts of the job, so it is important to know how to do it efficiently. We will cover this in the next class.
  2. Semi-structured data is a relatively new thing and has a lot to do with the web. Semi-structured data is machine-readable but does not conform to a rigid structure. This is both a blessing and a curse. The flexibility it provides makes it a lot easier for different systems to talk to one another, but that same flexibility makes working with the data in aggregate more difficult. Many "NoSQL" databases use a semi-structured schema. Those of you who are over 30 probably remember XML, which was a popular standard for semi-structured data in the late 1990s and early 2000s. Fortunately, XML has largely been replaced by JSON, which is easier for both humans (because it doesn't have so many angle brackets) and computers (because it is less flexible and therefore less ambiguous) to read. We will cover how to work with this sort of data next Monday.
  3. Ultimately, data scientists need data to be structured before we can do anything with it. Relational databases are structured, and once we get into Pandas and R, we will learn about data frames, which are also structured. Structured data are consistent throughout: a given field will have the same data type no matter what. This is extremely important because it makes it possible to work across all of the data in aggregate, which opens up everything from calculating sums to measuring relationships and predicting outcomes. (A minimal code sketch of the semi-structured-to-structured step follows this list.) We will cover the basics of how to work with this kind of data next week, and for the most part, the rest of this course will deal with data in this form.
    1. As an aside, you might be wondering where Excel spreadsheets fit in this list. I would place them somewhere between structured and semi-structured; we might call them "mostly structured". Unlike fully structured data, an Excel spreadsheet will accept different data types in a single column. But unlike semi-structured data, it will complain when you do this (at least as soon as you try to do an operation on that column). It probably belongs in the semi-structured category, but because of the ability to do structured operations in a spreadsheet, I'm reluctant to put it there.
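
To make the semi-structured vs. structured distinction concrete, here is a minimal Python sketch (the records and field names are made up for illustration) that parses semi-structured JSON and loads it into a structured pandas data frame:

    import json
    import pandas as pd

    # Semi-structured: machine-readable, but records need not share a schema.
    raw = '''
    [{"user": "alice", "age": 34, "city": "New York"},
     {"user": "bob", "city": "San Francisco"}]
    '''

    records = json.loads(raw)   # "bob" has no "age" field: flexible, but inconsistent
    df = pd.DataFrame(records)  # structured: one consistently typed column per field
    print(df.dtypes)            # the missing value becomes NaN, so "age" is float64
    print(df["age"].mean())     # aggregate operations now work across all rows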

Science can be:

Explorations / Explanations

  • Data Visualization (e.g. ggplot2, Tableau, d3.js, etc.)
  • Unsupervised Machine Learning (e.g. clustering, etc.)
  • etc....

Inferences / Predictions

  • Regression Models (e.g. Linear Models, Logistic Regression)
  • Supervised Machine Learning (e.g. Neural Nets, Genetic Algorithms)
  • etc....
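
As a concrete preview of this split, here is a minimal Python sketch (assuming scikit-learn is available; the toy data is made up): k-means clustering finds groups in unlabeled data (exploration), while logistic regression learns from labeled examples in order to predict new ones (prediction).

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LogisticRegression

    # Toy two-dimensional data: two loose blobs of 50 points each.
    rng = np.random.RandomState(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
    y = np.array([0] * 50 + [1] * 50)

    # Unsupervised: k-means sees only X and proposes two clusters on its own.
    clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    # Supervised: logistic regression sees X with its labels y, then predicts.
    model = LogisticRegression().fit(X, y)
    print(model.predict([[4.5, 5.2]]))  # a point near the second blob: likely class 1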

Data Science Workflow

From "A Taxonomy of Data Science" (by the Dataists)

A. Obtain

B. Scrub

C. Explore

D. Model

E. Interpret

Workflow Example:

Problem: What are the leading indicators that a user will make a new purchase?

A. Collect data around user retention and user actions within the product; potentially find data outside the company

B. Extract aggregated values from the raw data (see the code sketch after this example)

  1. How many times did a user share through Facebook within a week? A month?
  2. How often did they open up our emails?

C. Examine data to find common distributions and correlations

D. Model the data to predict whether a user will purchase again

E. Share results (and probably also go back to the drawing board)
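
To make step B concrete, here is a minimal pandas sketch (the event log and its column names are hypothetical; a real log would also carry timestamps for the per-week counts above) that rolls raw per-action records up into per-user feature counts, with the start of step C:

    import pandas as pd

    # Hypothetical raw event log: one row per user action.
    events = pd.DataFrame({
        "user_id": [1, 1, 1, 2, 2, 2, 2, 3],
        "action": ["fb_share", "email_open", "email_open", "fb_share",
                   "fb_share", "fb_share", "email_open", "email_open"],
    })

    # B. Scrub: aggregate raw events into one row of counts per user.
    features = pd.crosstab(events["user_id"], events["action"])
    print(features)

    # C. Explore: examine distributions and correlations among the features.
    print(features.describe())
    print(features.corr())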

B. Origins of Data Science

1974 - Peter Naur published Concise Survey of Computer Methods:

A basic principle of data science is this: The data representation must be chosen with due regard to the transformation to be achieved and the data processing tools available. This stresses the importance of concern for the characteristics of the data processing tools.

Data Science = Computer Science + Statistics

Computer Science

  1. Artificial Intelligence (Alan Turing)
  2. Relational Calculus / First-Order Logic (Edgar F. Codd)

Statistics

  1. Probability and Significance (Ronald Fisher)
  2. Statistical Learning Theory (Vladimir N. Vapnik)

Other Fields Contributing to Data Science

Psychology

  1. Signal Detection Theory
  2. Connectionism & Parallel Distributed Processing

Complex Systems

  1. Artificial Life
  2. Graph Theory

What else?

C. Goals of the Course

At the completion of this course, you will be able to:

  • Employ the Map/Reduce paradigm to transform big unstructured data
  • Access data from web-based application programming interfaces (APIs)
  • Use Structured Query Language (SQL) operations like JOIN and GROUP BY
  • Explore and present data through visualizations
  • Apply generalized linear models (GLMs)
  • Detect clusters in multivariate data
  • Predict categories using supervised machine learning techniques

At the completion of this course, you will be able to:

  • Explore and present data through visualizations
  • Apply the 1st, 2nd, and 3rd Normal Forms
  • Access data from web-based application programming interfaces (APIs)
  • Determine statistical significance
  • Apply generalized linear models (GLMs)
  • Detect clusters in multivariate data
  • Predict categories using supervised machine learning techniques
  • Apply these techniques to big data

Tentative Course Outline

  1. Intro and Overview
  2. Getting started with IPython & Git
  3. SQL and Relational Theory
  4. APIs and semi-structured data
  5. Structuring Data
  6. Probability, A/B Tests & Statistical Significance
  7. Intro to ML and Linear Regression, ANOVA
  8. Logistic Regression, Regularization
  9. Principal Components Analysis
  10. Clustering: Hierarchical and K-Means
  11. Time Series Analysis
  12. Midterm Review & Project Elevator Pitches
  13. Nonlinear Models
  14. Guest Lecture TBD
  15. Grid Search and Parameter Selection
  16. IPython.parallel & StarCluster
  17. Hadoop Distributed File System and Streaming
  18. Guest Lecture TBD & Final Project Working Session
  19. Guest Lecture TBD & Final Project Working Session
  20. Final Project Presentations (12 min. each)
  21. Final Project Presentations (12 min. each)
  22. Where to Go Next

Project Schedule

Date      Milestone
4/23      Preliminary Project Proposals Due (3-4 sentences)
4/30      EiR Feedback on Project Proposals Returned
5/7       Formal Proposals Due (including data and methods chosen)
5/14      EiR Feedback on Formal Proposals Returned
5/14      Midterm Assessment Due
5/21      Projects live on GitHub
5/21      Project Elevator Pitch in class (4 minutes each)
5/28      Peer Feedback Due
6/4       At least one working model
6/11      Final EiR Feedback Due
6/23-25   Final Presentations (12 minutes each)

Typical Class

  • Pre-work (0-30 min.) Students investigate a topic and present
  • Theory and Background (30-80 min.) Laptops closed, paper and pen(cil) only
  • Code walk-through (20-45 min.) Laptops open, follow along
  • Break (5-10 min.) You know what this is
  • Pair Programming (30-100 min.) One student drives, the other navigates
  • Individual work (0-40 min.) Time permitting, otherwise homework

Lab