Splunking Pitchfork album reviews
One of my favorite sites is the record review and music news site Pitchfork Media. The site publishes a bunch of interesting lists, like the top records for each decade and year, but these are obviously more subjective than what you would get by crunching the raw review scores. For example, their #1 album of the nineties is Radiohead’s “OK Computer” (rated 10.0), while #15 is “The Bends”, also by Radiohead, which isn’t reviewed on the site at all. I was interested in crunching the data provided by their wealth of reviews, so I downloaded all the record reviews using a simple Python script and parsed out the description, rating, label, reviewer, release year, title and artist using the following regex:
.*?<h2 class="fn">\s*(.*?):<br />([^\n]*)\n.*?<div class="info">\n\[([^<;]*);?\s*(\d*)\]?.*?<span class="rating">(.*?)<.*?<div class="content description">(.*?)</div>.*? - <span class="reviewer"><span class="vcard"><span class="fn">(.*?)</span>.*?title="\d+">(.*?)<
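To make the capture groups easier to follow, here is a minimal Python sketch that applies the regex above to a single page. Several things in it are my assumptions, not facts from the original setup: the `re.DOTALL` flag (so that `.` can cross line breaks), the field names and their order (artist before title, with the eighth group simply called `trailing` because the list above only names seven fields), and the `SAMPLE_PAGE` markup, which is a made-up fragment shaped to fit the pattern with placeholder label, reviewer and date values. It is not real Pitchfork HTML.

```python
import re

# The regex from the post, written with straight quotes. It is split across
# adjacent raw-string literals for readability; Python joins them into one.
REVIEW_PATTERN = re.compile(
    r'.*?<h2 class="fn">\s*(.*?):<br />([^\n]*)\n'
    r'.*?<div class="info">\n\[([^<;]*);?\s*(\d*)\]?'
    r'.*?<span class="rating">(.*?)<'
    r'.*?<div class="content description">(.*?)</div>'
    r'.*? - <span class="reviewer"><span class="vcard"><span class="fn">(.*?)</span>'
    r'.*?title="\d+">(.*?)<',
    re.DOTALL,  # assumption: "." has to match newlines for multi-line pages
)

# Hypothetical names for the eight capture groups, in pattern order.
FIELD_NAMES = (
    "artist", "title", "label", "year",
    "rating", "description", "reviewer", "trailing",
)

# Made-up markup that fits the pattern; label, reviewer and date are placeholders.
SAMPLE_PAGE = """
<h2 class="fn">
Radiohead:<br />OK Computer
</h2>
<div class="info">
[Example Label; 1997]
</div>
<span class="rating">10.0</span>
<div class="content description"><p>Review text goes here.</p></div>
<p> - <span class="reviewer"><span class="vcard"><span class="fn">Jane Doe</span></span></span>,
<abbr title="19970101">January 1, 1997</abbr></p>
"""


def parse_review(html):
    """Return a dict of captured fields, or None if the page does not match."""
    match = REVIEW_PATTERN.search(html)
    if match is None:
        return None
    return dict(zip(FIELD_NAMES, match.groups()))


review = parse_review(SAMPLE_PAGE)
print(review)
```

Every captured value comes back as a string, so the rating and year would need converting to numbers before any statistics are run on them.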
I can now run some interesting queries:


