pip install ipython-sql
for sql magic
%load_ext sql
%%bash
sqlite3 enron.db .tables
Employee EmployeeWithVars MessageBase RecipientBase EmployeeBase Message Recipient
We only care about the *Base tables:
(The other tables are derived from those. We don't need them for this exercise.)
%%sql sqlite:///enron.db
SELECT * FROM EmployeeBase LIMIT 5
Done.
| eid | name | department | longdepartment | title | gender | seniority |
|---|---|---|---|---|---|---|
| 1 | John Arnold | Trading | ENA Gas Financial | VP Trading | Male | Senior |
| 2 | Harry Arora | Trading | ENA East Power | VP Trading | Male | Senior |
| 3 | Robert Badeer | Trading | ENA West Power | Mgr Trading | Male | Junior |
| 4 | Susan Bailey | Legal | ENA Legal | Specialist Legal | Female | Junior |
| 5 | Eric Bass | Trading | ENA Gas Texas | Trader | Male | Junior |
%%sql sqlite:///enron.db
SELECT * FROM MessageBase LIMIT 5
Done.
| mid | filename | unix_time | subject | from_eid |
|---|---|---|---|---|
| 1 | taylor-m/sent/11 | 910930020 | Cd$ CME letter | 138 |
| 2 | taylor-m/sent/17 | 911459940 | Indemnification | 138 |
| 3 | taylor-m/sent/18 | 911463840 | Re: Indemnification | 138 |
| 4 | taylor-m/sent/23 | 911874180 | Re: Coral Energy, L.P. | 138 |
| 5 | taylor-m/sent/27 | 912396120 | Bankruptcy Code revisions | 138 |
from_eid looks like a foreign key
%%sql sqlite:///enron.db
SELECT unix_time, name, department, title, gender, seniority, subject, filename
FROM EmployeeBase JOIN MessageBase ON eid = from_eid LIMIT 5
Done.
| unix_time | name | department | title | gender | seniority | subject | filename |
|---|---|---|---|---|---|---|---|
| 910930020 | Mark E. Taylor | Legal | VP & Gen Cnsl | Male | Senior | Cd$ CME letter | taylor-m/sent/11 |
| 911459940 | Mark E. Taylor | Legal | VP & Gen Cnsl | Male | Senior | Indemnification | taylor-m/sent/17 |
| 911463840 | Mark E. Taylor | Legal | VP & Gen Cnsl | Male | Senior | Re: Indemnification | taylor-m/sent/18 |
| 911874180 | Mark E. Taylor | Legal | VP & Gen Cnsl | Male | Senior | Re: Coral Energy, L.P. | taylor-m/sent/23 |
| 912396120 | Mark E. Taylor | Legal | VP & Gen Cnsl | Male | Senior | Bankruptcy Code revisions | taylor-m/sent/27 |
Finally, RecipientBase:
%%sql sqlite:///enron.db
SELECT * FROM RecipientBase LIMIT 9
Done.
| mid | rno | to_eid |
|---|---|---|
| 1 | 1 | 59 |
| 2 | 1 | 15 |
| 3 | 1 | 15 |
| 4 | 1 | 109 |
| 4 | 2 | 49 |
| 4 | 3 | 120 |
| 4 | 4 | 59 |
| 5 | 1 | 45 |
| 5 | 2 | 53 |
It looks like we can use this to determine how many recipients each message received
i.e. MAX(rno)
%%sql sqlite:///enron.db
SELECT * FROM RecipientBase WHERE rno = 57
Done.
| mid | rno | to_eid |
|---|---|---|
| 12116 | 57 | 51 |
| 12151 | 57 | 51 |
%%sql sqlite:///enron.db
SELECT * FROM MessageBase WHERE mid IN (12116, 12151)
Done.
| mid | filename | unix_time | subject | from_eid |
|---|---|---|---|---|
| 12116 | baughman-d/all_documents/398 | 990510780 | 67 | |
| 12151 | baughman-d/ect_admin/22 | 990546780 | 67 |
%%sql sqlite:///enron.db
SELECT * FROM EmployeeBase WHERE eid=67
Done.
| eid | name | department | longdepartment | title | gender | seniority |
|---|---|---|---|---|---|---|
| 67 | John J. Lavorato | Other | EWS | ENA President & CEO | Male | Senior |
Who sent the email to the most recipients?
%%sql sqlite:///enron.db
SELECT name, title, MAX(rno)
FROM EmployeeBase
JOIN MessageBase ON eid = from_eid
JOIN RecipientBase USING(mid)
Done.
| name | title | MAX(rno) |
|---|---|---|
| John J. Lavorato | ENA President & CEO | 57 |
Who are the top five senders of mass emails?
%%sql sqlite:///enron.db
SELECT name, title, MAX(rno) AS recipient_count
FROM EmployeeBase
JOIN MessageBase ON eid = from_eid
JOIN RecipientBase USING(mid)
GROUP BY name, title ORDER BY recipient_count DESC LIMIT 5
Done.
| name | title | recipient_count |
|---|---|---|
| John J. Lavorato | ENA President & CEO | 57 |
| Kenneth Lay | President & CEO | 52 |
| Sally Beck | VP | 49 |
| Monique Sanchez | Associate | 24 |
| Louise Kitchen | COO | 21 |
import pandas as pd
import sqlite3
from pandas.io import sql
conn = sqlite3.connect('enron.db')
sql.frame_query("""SELECT mid, unix_time, subject, from_eid
FROM MessageBase LIMIT 5""", conn, "mid")
| unix_time | subject | from_eid | |
|---|---|---|---|
| mid | |||
| 1 | 910930020 | Cd$ CME letter | 138 |
| 2 | 911459940 | Indemnification | 138 |
| 3 | 911463840 | Re: Indemnification | 138 |
| 4 | 911874180 | Re: Coral Energy, L.P. | 138 |
| 5 | 912396120 | Bankruptcy Code revisions | 138 |
5 rows × 3 columns
MessageBase = sql.frame_query("""SELECT mid, unix_time, subject, from_eid
FROM MessageBase""", conn, "mid")
MessageBase.head()
| unix_time | subject | from_eid | |
|---|---|---|---|
| mid | |||
| 1 | 910930020 | Cd$ CME letter | 138 |
| 2 | 911459940 | Indemnification | 138 |
| 3 | 911463840 | Re: Indemnification | 138 |
| 4 | 911874180 | Re: Coral Energy, L.P. | 138 |
| 5 | 912396120 | Bankruptcy Code revisions | 138 |
5 rows × 3 columns
SQL JOIN (with USING):
sql.frame_query("""SELECT mid, unix_time, subject, from_eid, rno, to_eid
FROM MessageBase JOIN RecipientBase USING (mid) LIMIT 5""", conn, "mid")
| unix_time | subject | from_eid | rno | to_eid | |
|---|---|---|---|---|---|
| mid | |||||
| 1 | 910930020 | Cd$ CME letter | 138 | 1 | 59 |
| 2 | 911459940 | Indemnification | 138 | 1 | 15 |
| 3 | 911463840 | Re: Indemnification | 138 | 1 | 15 |
| 4 | 911874180 | Re: Coral Energy, L.P. | 138 | 1 | 109 |
| 4 | 911874180 | Re: Coral Energy, L.P. | 138 | 2 | 49 |
5 rows × 5 columns
pandas.DataFrame.join
RecipientBase = sql.frame_query("SELECT * FROM RecipientBase", conn, "mid")
MessageBase.join(RecipientBase).head()
| unix_time | subject | from_eid | rno | to_eid | |
|---|---|---|---|---|---|
| mid | |||||
| 1 | 910930020 | Cd$ CME letter | 138 | 1 | 59 |
| 2 | 911459940 | Indemnification | 138 | 1 | 15 |
| 3 | 911463840 | Re: Indemnification | 138 | 1 | 15 |
| 4 | 911874180 | Re: Coral Energy, L.P. | 138 | 1 | 109 |
| 4 | 911874180 | Re: Coral Energy, L.P. | 138 | 2 | 49 |
5 rows × 5 columns
SQL JOIN (with ON):
sql.frame_query("""SELECT mid, name, title, seniority,
unix_time, subject, from_eid
FROM EmployeeBase
JOIN MessageBase ON eid = from_eid
LIMIT 5""", conn, "mid")
| name | title | seniority | unix_time | subject | from_eid | |
|---|---|---|---|---|---|---|
| mid | ||||||
| 1 | Mark E. Taylor | VP & Gen Cnsl | Senior | 910930020 | Cd$ CME letter | 138 |
| 2 | Mark E. Taylor | VP & Gen Cnsl | Senior | 911459940 | Indemnification | 138 |
| 3 | Mark E. Taylor | VP & Gen Cnsl | Senior | 911463840 | Re: Indemnification | 138 |
| 4 | Mark E. Taylor | VP & Gen Cnsl | Senior | 911874180 | Re: Coral Energy, L.P. | 138 |
| 5 | Mark E. Taylor | VP & Gen Cnsl | Senior | 912396120 | Bankruptcy Code revisions | 138 |
5 rows × 6 columns
pandas.DataFrame.merge
EmployeeBase = sql.frame_query("""SELECT eid, name, title, seniority
FROM EmployeeBase""", conn, "eid")
EmployeeBase.merge(MessageBase, right_on = 'from_eid', left_index = True).head()
| name | title | seniority | unix_time | subject | from_eid | |
|---|---|---|---|---|---|---|
| mid | ||||||
| 1611 | John Arnold | VP Trading | Senior | 954317280 | Re: Insurance Call Spread | 1 |
| 1615 | John Arnold | VP Trading | Senior | 954328920 | New curve generation methodology | 1 |
| 1722 | John Arnold | VP Trading | Senior | 955447080 | Option Analysis on NG Price Book | 1 |
| 1723 | John Arnold | VP Trading | Senior | 955448160 | 1 | |
| 1763 | John Arnold | VP Trading | Senior | 955701240 | Re: options | 1 |
5 rows × 6 columns
Histograms are a good way to get a feel for the data. For example:
MessageBase.unix_time.hist()
<matplotlib.axes.AxesSubplot at 0x106424f50>
MessageBase.unix_time.mean()
985319365.65994918
What does this tell us?
Convert unix_time to human readable form (i.e. datetime)
from datetime import datetime
dt = MessageBase.unix_time.map(datetime.fromtimestamp)
MessageBase.groupby(dt).unix_time.count().resample('1W', how='sum').plot()
<matplotlib.axes.AxesSubplot at 0x107f73150>
What does this tell us?
Consider recipient counts (i.e. MAX(rno))
recipient_counts = sql.frame_query("""SELECT mid, datetime(unix_time, 'unixepoch') AS time_sent,
MAX(rno) AS recipient_count
FROM MessageBase JOIN RecipientBase USING (mid)
GROUP BY mid, time_sent""",
conn, "mid")
recipient_counts.head()
| time_sent | recipient_count | |
|---|---|---|
| mid | ||
| 1 | 1998-11-13 04:07:00 | 1 |
| 2 | 1998-11-19 07:19:00 | 1 |
| 3 | 1998-11-19 08:24:00 | 1 |
| 4 | 1998-11-24 02:23:00 | 4 |
| 5 | 1998-11-30 03:22:00 | 2 |
5 rows × 2 columns
grouping objects
w = recipient_counts.time_sent.map(lambda dt: pd.Period(dt, freq='W'))
recipient_counts.groupby(w).recipient_count.mean().plot(ylim=(0,10))
<matplotlib.axes.AxesSubplot at 0x10853ba10>
Use matplotlib.pyplot.scatter (or plt.scatter for short) to plot the length of the subjects against the recipient_count.
Hint: Take advantage of IPython's help functionality:
import matplotlib.pyplot as plt
plt.scatter?
Plot the median subject length over time (using the same weekly grouping object above)
Further exercises: Try to find your own discoveries! (e.g. is there anything interesting you could say about the subjects of the emails. What about who sent the first ones and who sent the last ones? What about the recipients themselves? (hint: to_eid is a foreign key to eid)
See Pandas for SQL Developers for more examples about how Pandas and SQL compare.