In this series of posts (part 1, part 2), I’ve been showing how to use Python and other data science tools to analyze a collection of tweets related to the 2014 Bonnaroo Music and Arts Festival. So far, the investigation has been limited to summary statistics of the full dataset. The beauty of Twitter is that it happens in realtime, so we can now peer into the fourth dimension and study these tweets as a function of time.
More Organic
Before we view the Bonnaroo tweets as a time series, I would like to make a quick remark about the organic-ness of the tweets. If you recall from the previous post, I removed duplicates and retweets from my collection in order to make the tweet database more indicative of true audience reactions. On further investigation, it seems that there were still many spammy media sources in the collection. To make the tweets even more organic, I decided to look at the source of the tweets.
Because Kanye West was the most popular artist in the previous posts’ analysis, I decided to look at the top 15 sources that mentioned him:
twitterfeed 1585
dlvr.it 749
Twitter for iPhone 366
IFTTT 256
Hootsuite 201
Twitter for Websites 188
Twitter Web Client 127
Facebook 120
Twitter for Android 119
WordPress.com 102
Tumblr 81
Instagram 73
iOS 42
TweetDeck 38
TweetAdder v4 37
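For reference, a count like this can be produced with pandas’ `value_counts`; the sketch below uses a small hypothetical DataFrame in place of the real tweet collection, but the call is the same:

```python
import pandas as pd

# Hypothetical mini-DataFrame standing in for the tweet collection;
# the real one has a 'source' column parsed from the Twitter API.
tweets = pd.DataFrame({
    'source': ['twitterfeed', 'dlvr.it', 'Twitter for iPhone',
               'twitterfeed', 'Twitter for Android', 'twitterfeed'],
    'text': ['...'] * 6,
})

# value_counts() sorts sources by tweet count, descending.
top_sources = tweets['source'].value_counts().head(15)
print(top_sources)
```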
twitterfeed and dlvr.it are social media platforms for deploying mass tweets, and a look at some of these tweets confirms this. So, I decided to create a list of “organic sources”, consisting of mobile Twitter clients, and use these to cull the tweet collection:
organic_sources = ['Twitter for iPhone', 'Twitter Web Client',
'Facebook', 'Twitter for Android', 'Instagram']
organics = organics[organics['source'].isin(organic_sources)]
With this new dataset, I re-ran the band popularity histogram from the previous post, and I was surprised to see that Kanye got bumped down to third place! It looks like Kanye’s popular with the media, but Jack White and Elton John were more popular with the Bonnaroo audience.
Let’s now look at the time dependence of the tweets. For this, we want to use the created_at
field as our index and tell pandas to treat its elements as datetime objects.
# Clean up field
organics['created_at'] = [tweetTime['$date'] for tweetTime in organics['created_at']]
organics['created_at'] = pd.to_datetime(Series(organics['created_at']))
organics = organics.set_index('created_at', drop=False)
organics.index = organics.index.tz_localize('UTC').tz_convert('EST')
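The localize-then-convert step is easy to get backwards, so here is a minimal, self-contained sketch of the same dance on a toy index (the timestamps are illustrative):

```python
import pandas as pd

# Timestamps from the Twitter API are naive UTC; Bonnaroo is in the
# Eastern time zone, so localize to UTC first, then convert.
idx = pd.to_datetime(['2014-06-14 02:30:00', '2014-06-14 03:00:00'])
est_idx = idx.tz_localize('UTC').tz_convert('EST')

# 02:30 UTC on June 14 becomes 21:30 on June 13, Eastern.
print(est_idx[0])
```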
To look at the number of tweets per hour, we have to resample our tweet collection.
ts_hist = organics['created_at'].resample('60t', how='count')
The majority of my time spent creating this blog post consisted of fighting with matplotlib, trying to get decent looking plots. I thought it would be cool to try to make a “fill between” plot, which took way longer to figure out than it should have. The key is that fill_between
takes 3 inputs: an array for the x-axis and two y-axis arrays between which the function fills color. If one just wants to plot a regular curve and fill down to the x-axis, one must create an array of zeros that is the same length as the curve. Also, I get pretty confused about which commands should be called on ax, plt, and fig. Anyway, the code and corresponding figure are below.
# Prettier pandas plot settings
# Not sure why 'default' is not the default...
pd.options.display.mpl_style = 'default'
x_date = ts_hist.index
zero_line = np.zeros(len(x_date))
fig, ax = plt.subplots()
ax.fill_between(x_date, zero_line, ts_hist.values, facecolor='blue', alpha=0.5)
# Format plot
plt.setp(ax.get_xticklabels(), fontsize=12, family='sans-serif')
plt.setp(ax.get_yticklabels(), fontsize=12, family='sans-serif')
plt.xlabel('Date', fontsize=30)
plt.ylabel('Counts', fontsize=30)
plt.show()
As you can see, tweet frequency was fairly consistent across each day of the festival and persisted until the early hours of each morning.
Band Popularity Time Series
We can now return to questions from the previous post and look at how the top 5 bands’ popularity changed with time. Using my program from the previous post, buildMentionHist
, we can add a column for each band to our current organics
dataframe. Each row of a band’s column will contain a True or False value indicating whether or not the artist was mentioned in that tweet. We resample the columns like above, but this time in bins of 10 minutes.
import buildMentionHist as bmh
import json

path = 'bonnaroooAliasList.json'
alias_dict = [json.loads(line) for line in open(path)][0]
bandPop = organics['text'].apply(bmh.build_apply_fun(alias_dict))
top_five = bandPop.index.tolist()[:5]  # Get top 5 artists' names
bandPop = pd.concat([organics, bandPop], axis=1)
top_five_ts = DataFrame()
for band in top_five:
    top_five_ts[band] = bandPop[bandPop[band] == True]['text'].resample('10min', how='count')
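The actual `build_apply_fun` lives in `buildMentionHist` from the previous post; as a rough, hypothetical sketch of the idea (not the real implementation), it returns a function that maps a tweet's text to one boolean per artist:

```python
import pandas as pd

def build_apply_fun(alias_dict):
    """Return a function mapping a tweet's text to a boolean Series,
    one entry per artist, True if any alias of that artist appears.
    Sketch only: the dict format here is illustrative."""
    def apply_fun(text):
        text = text.lower()
        return pd.Series({artist: any(alias.lower() in text
                                      for alias in [artist] + aliases)
                          for artist, aliases in alias_dict.items()})
    return apply_fun

# Toy alias dictionary; the real one is loaded from bonnaroooAliasList.json.
alias_dict = {'Kanye West': ['kanye', 'yeezus'], 'Jack White': ['jack white']}
row = build_apply_fun(alias_dict)('Yeezus tore it up at Bonnaroo')
print(row['Kanye West'])  # True
```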
We now have a dataframe called top_five_ts
that contains the time series information for the five most popular bands at Bonnaroo. All we have to do now is plot these time series. I wanted to again make some fill-between plots, but with different colors for each band. I used the prettyplotlib library to help with this because it has nicer looking default colors. I plot both the full time series and a “zoomed-in” time series centered on when the artists’ popularities peaked on Twitter. I ran into a lot of trouble trying to get the dates and times formatted correctly on the x-axis of the zoomed-in plot, so I’ve included that code below. There is probably a better way to do it, but at least this finally worked.
import pytz
import prettyplotlib as ppl
from prettyplotlib import brewer2mpl
from matplotlib import dates
from datetime import datetime

for band in top_five_ts:
    ppl.fill_between(top_five_ts.index.tolist(), 0., top_five_ts[band])
ax = plt.gca()
fig = plt.gcf()
set2 = brewer2mpl.get_map('Set2', 'qualitative', 8).mpl_colors
# Note: have to make the legend by hand for fill_between plots.
# BEGIN making legend
legendProxies = []
for color in set2:
    legendProxies.append(plt.Rectangle((0, 0), 1, 1, fc=color))
leg = plt.legend(legendProxies, top_five, loc=2)
leg.draw_frame(False)
# END making legend
# BEGIN formatting xaxis
datemin = datetime(2014, 6, 13, 12, 0, 0)
datemax = datetime(2014, 6, 16, 12, 0, 0)
est = pytz.timezone('EST')
plt.axis([est.localize(datemin), est.localize(datemax), 0, 80])
fmt = dates.DateFormatter('%m/%d %H:%M', tz=est)
ax.xaxis.set_major_formatter(fmt)
ax.xaxis.set_tick_params(direction='out')
# END formatting xaxis
plt.xlabel('Date', fontsize=30)
plt.ylabel('Counts', fontsize=30)
Here is the full time series:
And here is the zoomed-in time series:
If we look at when each band went on stage, we can see that each band’s popularity spiked while they were performing. That is good – it looks like we’re measuring genuinely “organic” interest on Twitter!
| Band | Performance Time |
|---|---|
| Jack White | 6/14 10:45PM – 12:15AM |
| Elton John | 6/15 9:30PM – 11:30PM |
| Kanye West | 6/13 10:00PM – 12:00AM |
| Skrillex | 6/14 1:30AM – 3:30AM |
| Vampire Weekend | 6/13 7:30PM – 8:45PM |
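One quick way to check this claim numerically is `idxmax`, which returns the bin where each band’s counts peak; the sketch below uses toy counts in place of the real `top_five_ts`:

```python
import pandas as pd

# Toy stand-in for top_five_ts: counts per 10-minute bin for one band,
# with made-up numbers peaking during the set.
idx = pd.date_range('2014-06-14 22:00', periods=4, freq='10min', tz='EST')
top_five_ts = pd.DataFrame({'Jack White': [5, 12, 48, 20]}, index=idx)

# idxmax gives, per column, the bin with the most mentions; it should
# fall inside the band's performance time.
peaks = top_five_ts.idxmax()
print(peaks['Jack White'])
```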
Delving into the Text
Up until now, I have not looked much at the actual text of the tweets, other than to find mentions of an artist. Using the nltk library, we can learn a little more about some general qualities of the text. The simplest quantity to look at is the most frequently used words. To do this, I go through every tweet and break all of the words up into individual elements of a list. In the language of natural language processing, we are “tokenizing” the text. Common English stopwords are omitted, as well as any mentions of the artists or artists’ aliases. I use a regular expression to only grab words from the sentences and ignore punctuation (other than apostrophes). I also take our alias_dict
from the previous post and make sure those words are not collected when tokenizing the tweets.
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import re

def custom_tokenize(text, custom_words=None, clean_custom_words=False):
    """
    This routine takes an input "text" and strips punctuation
    (except apostrophes), converts each word to lowercase,
    removes standard English stopwords, removes a set of
    custom_words (optional), and returns a list of all the
    leftover words.

    INPUTS:
    text = text string that one wants to tokenize
    custom_words = custom list or dictionary of words to omit
                   from the tokenization.
    clean_custom_words = Flag as True if you want to clean
                         these words.
                         Flag as False if mapping this function
                         to many keys. In that case,
                         pre-clean the words before running
                         this function.
    OUTPUTS:
    words = This is a list of the tokenized version of each word
            that was in "text"
    """
    tokenizer = RegexpTokenizer(r"[\w']+")
    stop_url = re.compile(r'http[^s]+')
    stops = stopwords.words('english')
    if custom_words is None:
        custom_words = []
    if clean_custom_words:
        custom_words = tokenize_custom_words(custom_words)
    words = [w.lower() for w in text.split() if not re.match(stop_url, w)]
    words = tokenizer.tokenize(' '.join(words))
    words = [w for w in words if w not in stops and w not in custom_words]
    return words
def tokenize_custom_words(custom_words):
    tokenizer = RegexpTokenizer(r"[\w']+")
    custom_tokens = []
    stops = stopwords.words('english')
    if type(custom_words) is dict:  # Useful for alias_dict
        for k, v in custom_words.items():
            k_tokens = [w.lower() for w in k.split() if w.lower() not in stops]
            # Remove all punctuation
            k_tokens = tokenizer.tokenize(' '.join(k_tokens))
            # Remove apostrophes
            k_tokens = [w.replace("'", "") for w in k_tokens]
            # Below takes care of nested lists, then tokenizes
            v_tokens = [word for listwords in v for word in listwords]
            v_tokens = tokenizer.tokenize(' '.join(v_tokens))
            # Remove apostrophes
            v_tokens = [w.replace("'", "") for w in v_tokens]
            custom_tokens.extend(k_tokens)
            custom_tokens.extend(v_tokens)
    elif type(custom_words) is list:
        custom_tokens = [w for words in custom_words
                         for w in tokenizer.tokenize(words)]
        custom_tokens = [w.replace("'", "") for w in custom_tokens]
    custom_tokens = set(custom_tokens)
    return custom_tokens
Using the above code, I can apply the custom_tokenize
function to each row of my organics
dataframe. Before doing this, though, I make sure to run the tokenize_custom_words
function on the alias dictionary. Otherwise, I would end up cleaning the aliases for every row in the dataframe, which is a waste of time.
import custom_tokenize as tk

clean_aliases = tk.tokenize_custom_words(alias_dict)
token_df = organics['text'].apply(tk.custom_tokenize,
                                  custom_words=clean_aliases,
                                  clean_custom_words=False)
Finally, I collect all of the tokens into one big list and use the FreqDist
nltk function to get the word frequency distribution.
# Need to flatten all tokens into one big list:
big_tokens = [y for x in token_df.values for y in x]
distr = nltk.FreqDist(big_tokens)
distr.pop('bonnaroo')  # Obviously highest frequency
distr.plot(25)
A couple of things caught my eye – the first being that people love to talk about themselves (see the popularity of “i’m”). Also, it was quite popular to misspell “Bonnaroo” (see the popularity of “bonaroo”). I wanted to see if there was any correlation between misspellings and, perhaps, people being intoxicated at night, but the time series behavior of the misspellings looks similar in shape (though not magnitude) to the full tweet time series plotted earlier in this post.
misspell = token_df.apply(lambda x: 'bonaroo' in x)
misspell = misspell[misspell].resample('60t', how='count')
The other thing that caught my eye was that the word “best” was one of the top 25 most frequent words. Assuming that “best” correlates with happiness, we can see that people got happier and happier as the festival progressed:
This is, of course, a fairly simplistic measure of text sentiment. In my next post, I would like to quantify more robust measures of Bonnaroo audience sentiment.
By the way, the code used in this whole series on Bonnaroo is available on my GitHub.