Enron

Download the sample database at: tinyurl.com/enron-db

pip install ipython-sql
for sql magic

In [1]:

%load_ext sql

In [2]:

%%bash
sqlite3 enron.db .tables

Employee          EmployeeWithVars  MessageBase       RecipientBase   
EmployeeBase      Message           Recipient

We only care about the *Base tables:

EmployeeBase
MessageBase
RecipientBase

(The other tables are derived from those. We don't need them for this exercise.)

In [3]:

%%sql sqlite:///enron.db
SELECT * FROM EmployeeBase LIMIT 5

Done.

Out[3]:

eid	name	department	longdepartment	title	gender	seniority
1	John Arnold	Trading	ENA Gas Financial	VP Trading	Male	Senior
2	Harry Arora	Trading	ENA East Power	VP Trading	Male	Senior
3	Robert Badeer	Trading	ENA West Power	Mgr Trading	Male	Junior
4	Susan Bailey	Legal	ENA Legal	Specialist Legal	Female	Junior
5	Eric Bass	Trading	ENA Gas Texas	Trader	Male	Junior

In [4]:

%%sql sqlite:///enron.db
SELECT * FROM MessageBase LIMIT 5

Done.

Out[4]:

mid	filename	unix_time	subject	from_eid
1	taylor-m/sent/11	910930020	Cd$ CME letter	138
2	taylor-m/sent/17	911459940	Indemnification	138
3	taylor-m/sent/18	911463840	Re: Indemnification	138
4	taylor-m/sent/23	911874180	Re: Coral Energy, L.P.	138
5	taylor-m/sent/27	912396120	Bankruptcy Code revisions	138

from_eid looks like a foreign key

In [5]:

%%sql sqlite:///enron.db
SELECT unix_time, name, department, title, gender, seniority, subject, filename 
FROM EmployeeBase JOIN MessageBase ON eid = from_eid LIMIT 5

Done.

Out[5]:

unix_time	name	department	title	gender	seniority	subject	filename
910930020	Mark E. Taylor	Legal	VP & Gen Cnsl	Male	Senior	Cd$ CME letter	taylor-m/sent/11
911459940	Mark E. Taylor	Legal	VP & Gen Cnsl	Male	Senior	Indemnification	taylor-m/sent/17
911463840	Mark E. Taylor	Legal	VP & Gen Cnsl	Male	Senior	Re: Indemnification	taylor-m/sent/18
911874180	Mark E. Taylor	Legal	VP & Gen Cnsl	Male	Senior	Re: Coral Energy, L.P.	taylor-m/sent/23
912396120	Mark E. Taylor	Legal	VP & Gen Cnsl	Male	Senior	Bankruptcy Code revisions	taylor-m/sent/27

Finally, RecipientBase:

In [6]:

%%sql sqlite:///enron.db
SELECT * FROM RecipientBase LIMIT 9

Done.

Out[6]:

mid	rno	to_eid
1	1	59
2	1	15
3	1	15
4	1	109
4	2	49
4	3	120
4	4	59
5	1	45
5	2	53

It looks like we can use this to determine how many recipients each message received
i.e. MAX(rno)

Who sent the email to the most recipients?

In [10]:

%%sql sqlite:///enron.db
SELECT name, title, MAX(rno)
FROM EmployeeBase 
JOIN MessageBase ON eid = from_eid 
JOIN RecipientBase USING(mid)

Done.

Out[10]:

name	title	MAX(rno)
John J. Lavorato	ENA President & CEO	57

Who are the top five senders of mass emails?

In [11]:

%%sql sqlite:///enron.db
SELECT name, title, MAX(rno) AS recipient_count
FROM EmployeeBase 
JOIN MessageBase ON eid = from_eid 
JOIN RecipientBase USING(mid) 
GROUP BY name, title ORDER BY recipient_count DESC LIMIT 5

Done.

Out[11]:

name	title	recipient_count
John J. Lavorato	ENA President & CEO	57
Kenneth Lay	President & CEO	52
Sally Beck	VP	49
Monique Sanchez	Associate	24
Louise Kitchen	COO	21

Pandas Data Frames¶

In [12]:

import pandas as pd
import sqlite3
from pandas.io import sql
conn = sqlite3.connect('enron.db')

In [13]:

sql.frame_query("""SELECT mid, unix_time, subject, from_eid 
                FROM MessageBase LIMIT 5""", conn, "mid")

Out[13]:

	unix_time	subject	from_eid
mid
1	910930020	Cd$ CME letter	138
2	911459940	Indemnification	138
3	911463840	Re: Indemnification	138
4	911874180	Re: Coral Energy, L.P.	138
5	912396120	Bankruptcy Code revisions	138

5 rows × 3 columns

In [14]:

MessageBase = sql.frame_query("""SELECT mid, unix_time, subject, from_eid 
                FROM MessageBase""", conn, "mid")
MessageBase.head()

Out[14]:

	unix_time	subject	from_eid
mid
1	910930020	Cd$ CME letter	138
2	911459940	Indemnification	138
3	911463840	Re: Indemnification	138
4	911874180	Re: Coral Energy, L.P.	138
5	912396120	Bankruptcy Code revisions	138

5 rows × 3 columns

SQL JOIN (with USING):

In [15]:

sql.frame_query("""SELECT mid, unix_time, subject, from_eid, rno, to_eid 
                FROM MessageBase JOIN RecipientBase USING (mid) LIMIT 5""", conn, "mid")

Out[15]:

	unix_time	subject	from_eid	rno	to_eid
mid
1	910930020	Cd$ CME letter	138	1	59
2	911459940	Indemnification	138	1	15
3	911463840	Re: Indemnification	138	1	15
4	911874180	Re: Coral Energy, L.P.	138	1	109
4	911874180	Re: Coral Energy, L.P.	138	2	49

5 rows × 5 columns

pandas.DataFrame.join

In [16]:

RecipientBase = sql.frame_query("SELECT * FROM RecipientBase", conn, "mid")
MessageBase.join(RecipientBase).head()

Out[16]:

	unix_time	subject	from_eid	rno	to_eid
mid
1	910930020	Cd$ CME letter	138	1	59
2	911459940	Indemnification	138	1	15
3	911463840	Re: Indemnification	138	1	15
4	911874180	Re: Coral Energy, L.P.	138	1	109
4	911874180	Re: Coral Energy, L.P.	138	2	49

5 rows × 5 columns

SQL JOIN (with ON):

In [17]:

sql.frame_query("""SELECT mid, name, title, seniority,
                          unix_time, subject, from_eid 
                     FROM EmployeeBase 
                     JOIN MessageBase ON eid = from_eid 
                    LIMIT 5""", conn, "mid")

Out[17]:

	name	title	seniority	unix_time	subject	from_eid
mid
1	Mark E. Taylor	VP & Gen Cnsl	Senior	910930020	Cd$ CME letter	138
2	Mark E. Taylor	VP & Gen Cnsl	Senior	911459940	Indemnification	138
3	Mark E. Taylor	VP & Gen Cnsl	Senior	911463840	Re: Indemnification	138
4	Mark E. Taylor	VP & Gen Cnsl	Senior	911874180	Re: Coral Energy, L.P.	138
5	Mark E. Taylor	VP & Gen Cnsl	Senior	912396120	Bankruptcy Code revisions	138

5 rows × 6 columns

pandas.DataFrame.merge

In [18]:

EmployeeBase = sql.frame_query("""SELECT eid, name, title, seniority 
                                    FROM EmployeeBase""", conn, "eid")
EmployeeBase.merge(MessageBase, right_on = 'from_eid', left_index = True).head()

Out[18]:

	name	title	seniority	unix_time	subject	from_eid
mid
1611	John Arnold	VP Trading	Senior	954317280	Re: Insurance Call Spread	1
1615	John Arnold	VP Trading	Senior	954328920	New curve generation methodology	1
1722	John Arnold	VP Trading	Senior	955447080	Option Analysis on NG Price Book	1
1723	John Arnold	VP Trading	Senior	955448160		1
1763	John Arnold	VP Trading	Senior	955701240	Re: options	1

5 rows × 6 columns

Histograms are a good way to get a feel for the data. For example:

In [19]:

MessageBase.unix_time.hist()

Out[19]:

<matplotlib.axes.AxesSubplot at 0x106424f50>

What does this tell us?

Convert unix_time to human readable form (i.e. datetime)

In [21]:

from datetime import datetime
dt = MessageBase.unix_time.map(datetime.fromtimestamp)
MessageBase.groupby(dt).unix_time.count().resample('1W', how='sum').plot()

Out[21]:

<matplotlib.axes.AxesSubplot at 0x107f73150>

What does this tell us?

Consider recipient counts (i.e. MAX(rno))

In [22]:

recipient_counts = sql.frame_query("""SELECT mid, datetime(unix_time, 'unixepoch') AS time_sent, 
                                             MAX(rno) AS recipient_count 
                                        FROM MessageBase JOIN RecipientBase USING (mid) 
                                       GROUP BY mid, time_sent""", 
                                   conn, "mid")
recipient_counts.head()

Out[22]:

	time_sent	recipient_count
mid
1	1998-11-13 04:07:00	1
2	1998-11-19 07:19:00	1
3	1998-11-19 08:24:00	1
4	1998-11-24 02:23:00	4
5	1998-11-30 03:22:00	2

5 rows × 2 columns

grouping objects

In [23]:

w = recipient_counts.time_sent.map(lambda dt: pd.Period(dt, freq='W'))
recipient_counts.groupby(w).recipient_count.mean().plot(ylim=(0,10))

Out[23]:

<matplotlib.axes.AxesSubplot at 0x10853ba10>

Lab

1-2 Pairs

Use matplotlib.pyplot.scatter (or plt.scatter for short) to plot the length of the subjects against the recipient_count.

Individually:

Plot the median subject length over time (using the same weekly grouping object above)

mid	filename	unix_time	subject	from_eid
12116	baughman-d/all_documents/398	990510780		67
12151	baughman-d/ect_admin/22	990546780		67