brian: hacks

Splunking Pitchfork album reviews

One of my favorite sites is the record review and music news site Pitchfork Media. The site publishes a bunch of interesting lists, like the top records for each decade and year, but these are obviously more subjective than lists produced by crunching the raw ratings. For example, their #1 album of the nineties is Radiohead’s “OK Computer” (rated 10.0) and their #15 is “The Bends”, also by Radiohead (which isn’t reviewed on the site at all). I was interested in crunching the data provided by their wealth of reviews, so I downloaded all the record reviews using a simple Python script and parsed out the description, rating, label, reviewer, release year, title and artist using the following regex:

.*?<h2 class="fn">\s*(.*?):<br />([^\n]*)\n.*?<div class="info">\n\[([^<;]*);?\s*(\d*)\]?.*?<span class="rating">(.*?)<.*?<div class="content description">(.*?)</div>.*? - <span class="reviewer"><span class="vcard"><span class="fn">(.*?)</span>.*?title="\d+">(.*?)<
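To make the extraction step concrete, here is a minimal Python sketch of how a pattern like this might be applied to one review page. The sample markup, the `parse_review` helper and the named groups are all assumptions made for illustration: the pattern is a simplified variant with named groups, not the exact regex above, and the real page layout may differ.

```python
import re

# Hypothetical review markup, loosely modeled on the class names that appear
# in the regex above. The real page layout is an assumption here.
SAMPLE_HTML = """<h2 class="fn">Example Artist:<br />Example Album
</h2>
<div class="info">
[Example Label; 1997]
</div>
<span class="rating">8.5</span>
<div class="content description">A short example review body.</div>
 - <span class="reviewer"><span class="vcard"><span class="fn">Example Reviewer</span></span></span>
"""

# Simplified pattern with named groups (not the author's exact regex).
# re.DOTALL lets ".*?" skip across line breaks between the fields.
REVIEW_RE = re.compile(
    r'<h2 class="fn">\s*(?P<artist>.*?):<br />(?P<title>[^\n]*)\n'
    r'.*?<div class="info">\n\[(?P<label>[^<;]*);?\s*(?P<releaseYear>\d*)\]'
    r'.*?<span class="rating">(?P<rating>.*?)<'
    r'.*?<div class="content description">(?P<description>.*?)</div>'
    r'.*?<span class="fn">(?P<reviewer>.*?)</span>',
    re.DOTALL,
)


def parse_review(html):
    """Return a dict of review fields, or None if the page does not match."""
    match = REVIEW_RE.search(html)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric fields so they can be averaged and grouped later.
    fields["rating"] = float(fields["rating"])
    year = fields["releaseYear"]
    fields["releaseYear"] = int(year) if year else None
    return fields


review = parse_review(SAMPLE_HTML)
print(review["artist"], review["title"], review["releaseYear"], review["rating"])
```

Named groups make it easier to see which capture feeds which field, and the resulting key=value pairs are the kind of thing a log indexer can pick up as fields.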

I can now run some interesting Splunk queries:

  • * | chart avg(rating) by releaseYear

    This graphs the average rating by release year.
  • * | stats count(title), avg(rating) by artist | search "count(title)">2 | sort "avg(rating)" d | head 10

    This shows the ten top-rated artists that have at least 3 reviews on Pitchfork.
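For readers without a Splunk install, the two queries above can be approximated in plain Python. This is only a sketch of the same group-and-aggregate logic: the `REVIEWS` records are made up for illustration (not real Pitchfork data), and the helper names `avg_rating_by` and `top_artists` are hypothetical.

```python
from collections import defaultdict

# Made-up records purely for illustration; these are not real Pitchfork ratings.
REVIEWS = [
    {"artist": "Artist A", "title": "A1", "releaseYear": 1997, "rating": 9.0},
    {"artist": "Artist A", "title": "A2", "releaseYear": 2000, "rating": 8.0},
    {"artist": "Artist A", "title": "A3", "releaseYear": 2003, "rating": 7.0},
    {"artist": "Artist B", "title": "B1", "releaseYear": 1997, "rating": 5.0},
    {"artist": "Artist B", "title": "B2", "releaseYear": 2000, "rating": 6.0},
]


def avg_rating_by(reviews, field):
    """Rough equivalent of: * | chart avg(rating) by <field>"""
    groups = defaultdict(list)
    for review in reviews:
        groups[review[field]].append(review["rating"])
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


def top_artists(reviews, min_reviews=3, limit=10):
    """Rough equivalent of the stats / search / sort / head pipeline."""
    groups = defaultdict(list)
    for review in reviews:
        groups[review["artist"]].append(review["rating"])
    # Keep artists with enough reviews (count(title) > 2 means at least 3).
    ranked = [
        (artist, len(vals), sum(vals) / len(vals))
        for artist, vals in groups.items()
        if len(vals) >= min_reviews
    ]
    # Sort by average rating, highest first, then keep the first `limit` rows.
    ranked.sort(key=lambda row: row[2], reverse=True)
    return ranked[:limit]


print(avg_rating_by(REVIEWS, "releaseYear"))
print(top_artists(REVIEWS))
```

The minimum-review filter matters: without it, an artist with a single 10.0 review would outrank artists with a long, consistently strong catalogue.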