Alessandro D. Gagliardi
# standard library:
import os
from pprint import pprint
# other modules:
import matplotlib.pyplot as plt
import pandas as pd
import twitter
import yaml
from pymongo import MongoClient
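# safe_load parses plain data only; bare yaml.load() needs an explicit Loader in current PyYAML.
credentials = yaml.safe_load(open(os.path.expanduser('~/api_cred.yml')))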
auth = twitter.oauth.OAuth(credentials['ACCESS_TOKEN'],
credentials['ACCESS_TOKEN_SECRET'],
credentials['API_KEY'],
credentials['API_SECRET'])
twitter_api = twitter.Twitter(auth=auth)
# The Yahoo! Where On Earth ID for the entire world is 1.
# See https://dev.twitter.com/docs/api/1.1/get/trends/place and
# http://developer.yahoo.com/geo/geoplanet/
WORLD_WOE_ID = 1
US_WOE_ID = 23424977
# Prefix ID with the underscore for query string parameterization.
# Without the underscore, the twitter package appends the ID value
# to the URL itself as a special case keyword argument.
world_trends = twitter_api.trends.place(_id=WORLD_WOE_ID)
us_trends = twitter_api.trends.place(_id=US_WOE_ID)
pprint(world_trends)
pprint(us_trends)
world_trends_set = set([trend['name'] for trends in world_trends
for trend in trends['trends']])
us_trends_set = set([trend['name'] for trends in us_trends
for trend in trends['trends']])
world_trends_set.intersection(us_trends_set)
q = world_trends[0]['trends'][0]['name']
count = 100
# See https://dev.twitter.com/docs/api/1.1/get/search/tweets
search_results = twitter_api.search.tweets(q=q, count=count)
statuses = search_results['statuses']
len(statuses)
search_results['search_metadata'].get('next_results')
if 'next_results' in search_results['search_metadata']:
    next_results = search_results['search_metadata']['next_results']
    kwargs = dict([ kv.split('=') for kv in next_results[1:].split("&") ])
    search_results = twitter_api.search.tweets(**kwargs)
    statuses += search_results['statuses']
len(statuses)
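The next_results value is a URL-encoded query string, so splitting on "&" and "=" leaves values such as the search term percent-encoded. A minimal sketch of a more forgiving parse with the standard library's urllib.parse follows; the sample next_results string is made up for illustration and is not taken from a real response.

```python
from urllib.parse import parse_qsl

# Hypothetical value in the shape search_metadata['next_results'] takes.
next_results = "?max_id=313519052523986943&q=%23NCAA&count=100&include_entities=1"

# Drop the leading '?' and decode each key=value pair into a dict.
# parse_qsl also undoes percent-encoding, so '%23NCAA' becomes '#NCAA'.
kwargs = dict(parse_qsl(next_results[1:]))
print(kwargs)
```

The resulting dict can be passed on as twitter_api.search.tweets(**kwargs), just like the kwargs built in the cell above.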
From the command line:
$ mkdir -p data/db
$ mongod --dbpath data/db
c = MongoClient()
Database twitter:
db = c.twitter
Collection tweets:
statuses_ids = db.tweets.insert(statuses)
statuses_ids[:5]
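ObjectIds
ObjectId is a 12-byte BSON type. Its first 4 bytes are a timestamp (seconds since the Unix epoch); the remaining 8 bytes keep values generated within the same second distinct.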
In MongoDB, documents stored in a collection require a unique _id field that acts as a primary key. Because ObjectIds are small, most likely unique, and fast to generate, MongoDB uses ObjectIds as the default value for the _id field if the _id field is not specified.
Using ObjectIds for the _id field provides the following additional benefits:
You can access the creation time of an ObjectId using the .generation_time property in pymongo.
Sorting on an _id field that stores ObjectId values is roughly equivalent to sorting by insertion time.
c.database_names()
db = c.twitter
db.collection_names()
db.tweets.find_one()
Notice the _id included in the document along with the values we already saw before.
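Because the first 4 bytes of an ObjectId are a big-endian count of seconds since the Unix epoch, the creation time can be read straight from the hexadecimal form of an _id. The sketch below does this with the standard library only; the helper name objectid_timestamp and the sample id are made up for illustration, and in practice pymongo's .generation_time property gives the same information.

```python
from datetime import datetime, timezone

def objectid_timestamp(oid_hex):
    """Return the creation time encoded in a 24-character ObjectId hex string."""
    # The first 4 bytes (8 hex characters) are seconds since the Unix epoch.
    seconds = int(oid_hex[:8], 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

# Made-up ObjectId for illustration; only the first 8 hex digits matter here.
example = "500000000000000000000000"
print(objectid_timestamp(example))
```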
Now that we have our data in MongoDB, we can use some of its search functionality. For example:
popular_tweets = db.tweets.find({'retweet_count': {"$gte": 3}})
popular_tweets.count()
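To make the filter concrete: {'retweet_count': {"$gte": 3}} keeps documents whose retweet_count is at least 3, and documents without that field do not match. A rough plain-Python equivalent over made-up documents (not real tweets) might look like this:

```python
# Made-up documents standing in for stored tweets.
tweets = [
    {"text": "a", "retweet_count": 0},
    {"text": "b", "retweet_count": 3},
    {"text": "c", "retweet_count": 12},
    {"text": "d"},  # no retweet_count field at all
]

# Keep documents whose retweet_count exists and is at least 3.
# The -inf default makes documents without the field fail the comparison.
popular = [t for t in tweets if t.get("retweet_count", float("-inf")) >= 3]
print(len(popular))
```

The difference is that MongoDB evaluates the filter on the server, so only matching documents travel back to the client.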
pd.DataFrame(db.tweets.find(fields=['created_at', 'retweet_count', 'favorite_count']))
retweet_favorites = pd.DataFrame(list(db.tweets.find(fields=['created_at','retweet_count','favorite_count'])))
retweet_favorites.head()
.describe() is a useful method to get the gist of our data.
retweet_favorites.describe()
However, when applied to a DataFrame, it only describes numeric columns.
retweet_favorites.dtypes
.describe() can be called on individual columns (i.e. Series), even if they are not numeric.
retweet_favorites.created_at.describe()
However, in this case created_at is being treated as a string, which is not very helpful. We can fix that with pandas.to_datetime:
retweet_favorites.created_at.map(pd.to_datetime).describe()
Not all that interesting though, since all of these tweets were collected within a couple of seconds of each other.
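Twitter's created_at strings follow a fixed layout such as "Wed Aug 27 13:08:45 +0000 2008". As a sketch of what the conversion involves, the same string can be parsed with an explicit format using only the standard library; the constant name TWITTER_FORMAT is chosen here for illustration.

```python
from datetime import datetime

# Layout of created_at: weekday, month, day, time, UTC offset, year.
TWITTER_FORMAT = "%a %b %d %H:%M:%S %z %Y"

created_at = "Wed Aug 27 13:08:45 +0000 2008"
parsed = datetime.strptime(created_at, TWITTER_FORMAT)
print(parsed.isoformat())
```

pandas.to_datetime also accepts a format argument, so the same format string could probably be passed there for a whole column instead of mapping over each element.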
Mongo allows us to access subfields directly.
mentions_followers = list(db.tweets.find(fields=['entities.user_mentions', 'user.followers_count']))
pd.DataFrame(mentions_followers).head()
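A dotted field name such as 'user.followers_count' addresses a key inside a nested sub-document. The sketch below imitates that lookup on a plain dict; get_path is a hypothetical helper, the sample tweet is made up, and the sketch ignores arrays, which MongoDB's dot notation also handles.

```python
def get_path(doc, dotted):
    """Follow a dotted field name through nested dicts; return None if absent."""
    current = doc
    for key in dotted.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current

# Made-up tweet with the two sub-documents requested above.
tweet = {"user": {"followers_count": 42}, "entities": {"user_mentions": []}}
print(get_path(tweet, "user.followers_count"))
print(get_path(tweet, "entities.user_mentions"))
```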
Pandas doesn't know how to parse the sub-documents, however, so we must tell it explicitly:
mentions_followers_df = pd.DataFrame({'user_mentions': len(tweet['entities']['user_mentions']),
'followers_count': tweet['user'].get('followers_count')} for tweet in mentions_followers)
mentions_followers_df.head()
Perhaps user_mentions and followers_count are correlated?
plt.scatter(mentions_followers_df.user_mentions, mentions_followers_df.followers_count)
mentions_followers_df.corr()
Perhaps not.
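By default, DataFrame.corr() reports the Pearson correlation coefficient: the covariance of two columns divided by the product of their standard deviations. A small sketch of that calculation on made-up numbers (not the collected tweets) shows what the value measures:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up data: followers rise in exact proportion to mentions.
mentions = [0, 1, 2, 3]
followers = [10, 20, 30, 40]
print(pearson(mentions, followers))
```

Values near 1 or -1 indicate a strong linear relationship, while values near 0 indicate little or none.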
Raise two fingers if you understand the material well.
Raise one finger if you understand this material OK.
Now, 1s: find a 2 and sit next to them. While you work through this exercise, only 1s can type.
The Yahoo! Where On Earth ID of Canada is 23424775.
Use it to find Twitter trends in Canada and compare them to US trends. What's the difference between them?
canada collection.
us collections.
follower_count?
Find tweets where 'retweet_count': {"$gt": 0}.
What is their count?
Is there anything about them that stands out as retweetable?
Further exercises: