In my previous post, I wrote about how I collected tweets about the Bonnaroo Music and Arts Festival over the course of the entire festival. There are a number of questions that could be answered with this dataset, like
- Do people spell worse as they become more intoxicated throughout the night?
- Does text sentiment decline as people go more days without bathing?
- Who in the world tweets from a laptop during a music festival?
I would really like to answer the above questions (and plan to), but I'll focus on the most obvious question for this post:
Which band was the most popular?
And while this question seems simple to answer, there are many reasons this blog post is so long. To start, we don't even have a decent definition of the question!
What does it mean for an artist to be the most popular as measured by tweets? For now, let's work off of the oldest rule of PR: "Any publicity is good publicity". In that case, we can rank band popularity simply by the number of tweets that mention each artist. Let's try this and see what happens.
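To make that counting rule concrete before building it out for real, here is a minimal sketch in modern Python. The tweets and artist names here are made up purely for illustration; the real data is loaded from MongoDB below.

```python
import pandas as pd

# Toy stand-in for the real tweet dataset.
tweets = pd.DataFrame({'text': [
    'kanye west killed it at bonnaroo',
    'waiting in line to see kanye',
    'elton john forever',
]})

# Hypothetical artist list; the real one is scraped later in the post.
artists = ['kanye', 'elton john']

# Rank popularity by the raw number of tweets mentioning each artist.
mention_counts = {
    artist: int(tweets['text'].str.contains(artist, case=False).sum())
    for artist in artists
}
print(mention_counts)  # {'kanye': 2, 'elton john': 1}
```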
Of Pythons and Pandas
From the previous post, I have my dataset of Bonnaroo tweets sitting in a MongoDB database. I need to get those tweets out of the database and into IPython, a software package for interactive computing in Python. I exported the database as a giant JSON file and then loaded the records into IPython with a list comprehension.
The next step is to get the JSON records into pandas, a Python library used primarily for manipulating tabular data. The main object for dealing with tabular data, the DataFrame, conveniently reads in such JSON records.
import json
import pandas as pd
from pandas import DataFrame, Series

path = 'tweetCollection.json'
records = [json.loads(line) for line in open(path)]
df = DataFrame(records)
If I type df.count()
I see that, sure enough, all 157,600 tweets are present and that 8,656 of them contain location data.
_id           157600
created_at    157600
geo             8656
source        157600
text          157600
dtype: int64
While watching the tweets stream in, I noticed that there were a lot of retweets. To me, these don't seem as "organic" as a bona fide, original tweet. People and software can spam twitter all they want with retweets, but it's harder to spam with original tweets. So, I think a more robust measure of band popularity is unique tweets. Pandas allows us to easily check this with df['text'].describe():
count                                                157600
unique                                               107773
top       RT @502michael502: Islam The Religion Of Truth...
freq                                                   1938
Name: text, dtype: object
Of the 157,600 total tweets, only about two-thirds of them are unique. And wait, what's that most popular tweet that's repeated 1,938 times?
print df['text'].describe()['top']
RT @502michael502: Islam The Religion Of Truth
http://t.co/BO7Sjw6pSl
#FathersDay #AFLDonsDees #Bonnaroo #Brasil2014 #WorldCup #Jewish #…
Wow. Sure enough, if you go to the twitter page for @502michael502, you will see that random pro-Islam messages are tweeted and retweeted thousands of times with an assorted collection of hashtags containing trending and religious words. I guess Bonnaroo was popular enough to make it onto @502michael502's trending hashtags! And here I thought he was just a big jam band fan.
Okay, now we can try to remove retweets. We start by grabbing only the unique tweets.
# Retain only text-unique tweets
uniques = df.drop_duplicates(inplace=False, cols='text')
# I do not know how to make .startswith() case insensitive,
# so check both cases:
organics = uniques[ uniques['text'].str.startswith('RT')==False ]
organics = organics[ organics['text'].str.startswith('rt')==False ]
# In case RT was placed further into the text than the beginning.
# Include spaces around ' RT ' to prevent grabbing words like "start"
organics = organics[ organics['text'].str.contains(' RT ', case=False)==False ]
print organics.count()
_id           93311
created_at    93311
geo            8537
source        93311
text          93311
dtype: int64
There we go: we have gone from 157,600 total tweets down to 93,311 "organic" tweets. There is still more work we could do to isolate organic tweets. For example, I would argue that news media sources tweeting about artists at Bonnaroo are not a good measure of band popularity. Such tweets are harder to detect, though. One method could be to look at the source of the tweet – maybe tweets from cell phones are more likely to come from individuals than from media organizations? I'll save this for another post because we still have a lot of work to do!
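As a rough sketch of that source-based idea, the filter might look like the following. The source strings here are fabricated for illustration (real Twitter source fields are client names such as "Twitter for iPhone", usually wrapped in an HTML anchor, so the real field would need more careful parsing):

```python
import pandas as pd

# Hypothetical tweets with fabricated 'source' values.
df = pd.DataFrame({
    'text': ['so ready for day two', 'LIVE: full Bonnaroo coverage', 'dust everywhere'],
    'source': ['Twitter for iPhone', 'NewsOutletBot', 'Twitter for Android'],
})

# Keep only tweets whose source looks like a phone client.
from_phone = df['source'].str.contains('iPhone|Android', case=False)
individuals = df[from_phone]
print(len(individuals))  # 2
```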
Most BeautifulSoup in the Room
Now that I've collected all of the tweets that we care to analyze, we must search for mentions of each Bonnaroo artist. But I'm lazy. There are 189 different artists performing at Bonnaroo, and by no means do I feel like typing them all out.
Enter BeautifulSoup, a Python library for scraping websites. All I have to do is check out the band lineup on the Bonnaroo website, figure out which div elements correspond to the listed bands, and BeautifulSoup will grab the contents.
import urllib2
from BeautifulSoup import BeautifulSoup

url = 'http://lineup.bonnaroo.com/'
ufile = urllib2.urlopen(url)
soup = BeautifulSoup(ufile)
bandList = soup.find('div', {'class': 'ds-lineup ds-player'}).findAll('a')
fout = open('bonnarooBandList.txt', 'w')
for row in bandList:
    band = row.renderContents()
    fout.write(band + '\n')
fout.close()
With a Little Help from my (API) Friends
When I wrote the above script, I thought I was done. Later on, I thought about the fact that people don't always call bands by their full name. For example, the Red Hot Chili Peppers are often abbreviated as RHCP. I was amazed to find that MusicBrainz, an online music encyclopedia, not only keeps track of bands' aliases and misspellings, but actually has an API for accessing this information. Even better, somebody created a Python wrapper for the API.
I also needed to perform some "scrubbing" of the aliases retrieved from the MusicBrainz API. I consider a band to be "mentioned" in a tweet if all of the words in any one of the band's aliases are present in the tweet text. For example, a match for both "arctic" and "monkeys" in the text would count as a mention of "Arctic Monkeys". However, I don't want to miss a mention of "The Flaming Lips" just because "the" isn't included.
I ameliorated this issue by using nltk, a Natural Language Processing library. The library contains a list of English stopwords (common words like "the") which I used as a filter. Note: this could be a problem for bands like "The Head and the Heart", where the filter would leave behind "head" and "heart". Both of these words could easily appear in a tweet without relating to the band.
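To make that caveat concrete, here is a self-contained sketch of the stopword filter and the subset-style matching idea, with a small hard-coded stopword set standing in for nltk's list:

```python
# Minimal stand-in for nltk's English stopword list, so this sketch
# runs without nltk; the real code uses stopwords.words('english').
STOPWORDS = {'the', 'and', 'a', 'an', 'of'}

def alias_to_words(alias):
    """Lowercase an alias and drop stopwords, keeping the content words."""
    return [w for w in alias.lower().split() if w not in STOPWORDS]

# "The Flaming Lips" safely reduces to its content words...
print(alias_to_words('The Flaming Lips'))  # ['flaming', 'lips']

# ...but "The Head and the Heart" collapses to two common nouns,
# so an unrelated tweet can trigger a false positive:
alias_words = set(alias_to_words('The Head and the Heart'))
tweet_words = set('my head and my heart hurt after day three'.split())
print(alias_words.issubset(tweet_words))  # True
```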
The code below shows how I used the public APIs and nltk in order to get searchable aliases.
import musicbrainzngs as mbrainz
import json
import string
import re
import nltk
from nltk.corpus import stopwords

mbrainz.auth(username, password)  # Use a username and password
mbrainz.set_useragent(program_version, email_address)  # Include the name
                                                      # of your program and an email address
# To be used for removing punctuation
regex = re.compile('[%s]' % re.escape(string.punctuation))
################################
def clean_aliases(alias, regex):
    """
    This function converts each alias to lowercase, removes
    punctuation, and removes stop words. The output is a
    list containing the remaining words.
    """
    alias = alias.lower().replace(' &', '')  # Remove ampersands
    alias = regex.sub('', alias)  # Remove punctuation
    alias_words = \
        [w for w in alias.split() if w not in stopwords.words('english')]
    return alias_words
################################
with open('bonnarooBandList.txt', 'r') as fin:
    with open('bonnarooAliasList.json', 'w') as fout:
        aliasDict = {}  # Initialize alias dictionary
        for band in fin:
            band = band.rstrip('\n')
            # Remove ampersands
            band_query = band.lower().replace(' &', '')
            # Remove punctuation
            band_query = regex.sub('', band_query)
            # Only grab the first result
            result = \
                mbrainz.search_artists(artist=band_query, limit=1)
            # Remove stopwords
            band_query = clean_aliases(band_query, regex)
            # Initialize with the stripped version of the name
            # listed on the Bonnaroo website
            aliasList = [band_query]
            try:
                for alias in result['artist-list'][0]['alias-list']:
                    alias = clean_aliases(alias['alias'], regex)
                    aliasList.append(alias)  # Build alias list
            except:  # Some artists do not return aliases
                pass  # So do not do anything!
            aliasDict[band] = aliasList
        json.dump(aliasDict, fout)
The Final Histogram
Okay, so we now have a dictionary of easily searchable aliases for every artist that performed at Bonnaroo. All we have to do now is go through each tweet and see whether any of the aliases of any of the artists are mentioned. We can then build a histogram of "mentions" for each artist by adding up all of the mentions across all of the tweets for a given artist.
In the code below, I do just that. By running the function at the bottom, get_bandPop
, we get back a pandas Series containing each artist and the number of times they were mentioned across all of the tweets.
# To be used for removing punctuation
regex = re.compile('[%s]' % re.escape(string.punctuation))

def clean_sentence(sentence):
    """
    Converts each sentence to lowercase and removes
    punctuation.
    """
    sentence = sentence.lower().replace(' &', '')  # Remove ampersands
    sentence = regex.sub('', sentence)  # Remove punctuation
    return sentence

def find_mention(sentence, phrase_list):
    """
    Takes a phrase_list, which is a list of phrases where
    each phrase corresponds to a list of the words in the phrase,
    and checks whether all of the words of any of the phrases are
    present in "sentence".
    """
    for words in phrase_list:
        words = set(words)
        if words.issubset(sentence):
            return True
    return False  # None of the word lists were subsets

def check_each_alias(sentence, alias_dict):
    """
    Checks whether any of the aliases for each band
    in alias_dict are mentioned in "sentence".
    band_bool is a dictionary that contains all band
    names as keys and True or False as values corresponding
    to whether or not the band was mentioned in the sentence.
    """
    band_bool = {}
    sentence = set(clean_sentence(sentence).split())
    for k, v in alias_dict.iteritems():
        band_bool[k] = find_mention(sentence, v)
    return pd.Series(band_bool)

def build_apply_fun(alias_dict):
    """
    Turn check_each_alias into an anonymous function.
    """
    apply_fun = lambda x: check_each_alias(x, alias_dict)
    return apply_fun

def get_bandPop(df, alias_dict):
    """
    For tweet DataFrame input "df", build a histogram of mentions
    for each band in alias_dict.
    """
    bandPop = df['text'].apply(build_apply_fun(alias_dict))
    bandPop = bandPop.sum(axis=0)
    bandPop.sort(ascending=False)
    return bandPop
And now, finally, all we have to do is type bandPop[:10].plot(kind='bar')
(and maybe fiddle around in matplotlib for an hour adjusting properties of the figure) and we get a histogram of mentions for the ten most popular bands at Bonnaroo:
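For completeness, a minimal sketch of that final step, with fabricated mention counts standing in for the real bandPop Series (modern pandas spells the descending sort .sort_values()):

```python
import pandas as pd

# Fabricated counts purely for illustration.
bandPop = pd.Series({'Kanye West': 3400, 'Jack White': 2200, 'Elton John': 2100})
bandPop = bandPop.sort_values(ascending=False)

top = bandPop[:2]
print(top.index.tolist())  # ['Kanye West', 'Jack White']
# top.plot(kind='bar')  # draws the bar chart (requires matplotlib)
```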
And of course it's Kanye! Is anybody surprised?
Wow, that was a lot of work for one measly histogram! However, we now have a bunch of data-analysis machinery that we can use to delve deeper into this dataset. In my next post, I'll do just that!