Friday, April 06, 2007

Most popular blog posts

Mine! by Gini~ from flickr (CC-NC-ND)

Popularity tends to follow some sort of power law: a few posts are interesting to many people, and most posts are interesting to just a few (or only to the author). Blogs often show a list of their most popular posts in the sidebar. A visitor can look at that list, and if there's anything on the blog they would be interested in, it's most likely there.

Blogger doesn't support such lists, so I had to hack them on my own. Blogger also doesn't support statistics, so I'm getting them through Google Analytics. Did you notice a pattern here? Having a blog on Blogger means spending a lot of time hacking small tools to get things which WordPress bloggers get for free. That's probably why so many hackers use Blogger instead of WordPress - creating tools is much more fun than using them.

At first I wanted to simply list the most popular posts according to Google Analytics, but the list isn't supposed to faithfully reflect popularity. It's supposed to guess what would be most useful to readers. I got data from three sources:
  • From Google Analytics - number of unique visits
  • From del.icio.us - number of people who saved the post URL
  • From Blogger - number of comments

Comments

This wasn't the first metric I used, but its code is relatively simple.

First, I use wget instead of net/http or open-uri. Ruby's HTTP libraries are weak and hard to use for anything non-trivial, while wget is absolutely awesome - powerful and simple. I also cache everything I download on disk, in meaningfully named files. Caching isn't particularly important in "production", but during development having to wait for things to download is extremely distracting. It destroys productivity almost as badly as waiting for things to compile.

So the following function fetches a URL from the Internet and caches it in a file. If the file is not specified, its name is simply the portion of the URL following the last slash.
def wget(url, file=url.match(/([^\/]*)\Z/)[0])
 # Download only when there is no cached copy yet
 system "wget", "--no-verbose", url, "-O", file unless File.exist?(file)
 File.read(file)
end
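The caching idea doesn't depend on wget itself. Here is a minimal sketch, assuming a hypothetical cached_fetch helper that takes the download step as a block (neither the name nor the block interface comes from the scripts in this post). The block runs only when the cache file is missing:

```ruby
require 'tmpdir'

# Default cache file name: whatever follows the last slash of the URL,
# the same rule as the default argument of wget above.
def cache_name(url)
  url[/[^\/]*\z/]
end

# Hypothetical helper: runs the block only when the cache file is missing,
# stores the block's result on disk, and always returns the file contents.
def cached_fetch(url, file = cache_name(url))
  File.write(file, yield(url)) unless File.exist?(file)
  File.read(file)
end

downloads = 0
Dir.mktmpdir do |dir|
  Dir.chdir(dir) do
    2.times do
      cached_fetch("http://example.com/2007_04_01_archive.html") do |u|
        downloads += 1
        "<html>fake page for #{u}</html>"   # stand-in for a real download
      end
    end
  end
end
puts downloads   # counts how many times the block actually ran
```

The example URL and the fake page body are made up. In a real script the block would shell out to wget or call an HTTP library.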
And the whole script:
require 'rubygems'
require 'hpricot'

def wget(url, file=url.match(/([^\/]*)\Z/)[0])
 # Download only when there is no cached copy yet
 system "wget", "--no-verbose", url, "-O", file unless File.exist?(file)
 File.read(file)
end

url = "http://t-a-w.blogspot.com/"
file_name = "blog-main"

doc = Hpricot(wget(url, file_name))
# Each sidebar archive link leads to a page with all posts from one period
(doc/"div#ArchiveList a.post-count-link").each{|archive_link|
 url = archive_link.attributes["href"]
 # Skip anything that isn't an archive page of this blog
 next unless url =~ %r[\Ahttp://t-a-w\.blogspot\.com/[0-9_]*_archive\.html\Z]
 archive_doc = Hpricot(wget(url))
 (archive_doc/".post").each{|post|
     post_link = post.at("h3.post-title a")
     link = post_link.attributes["href"]
     title = post_link.inner_text
     # The link text reads "N comments"; anything else leaves $1 nil, i.e. 0
     post.at("a.comment-link").inner_text =~ /\A(\d+) comments?\Z/
     comments = $1.to_i

     puts [comments, title, link].join("\t")
 }
}
Its output looks something like this:
6       How to code debuggers   http://t-a-w.blogspot.com/2007/03/how-to-code-debuggers.html
7       Programming in Blub     http://t-a-w.blogspot.com/2006/08/programming-in-blub.html
One more thing - I always use \A and \Z instead of ^ and $ in Ruby regular expressions, unless I explicitly need the latter. In Ruby, ^ and $ anchor to individual lines rather than to the whole string (and they mean different things in other regexp flavors), which can cause weird problems.

The ten most popular posts according to the number-of-comments metric were:

del.icio.us saves

I save all my blog posts on del.icio.us, tagged taw+blog. At first I did it because Blogger didn't have labels, but even now that it has them, it's useful to have all the URLs in one place. Thanks to web 2.0 buttons, it takes a single click to add a blog post to del.icio.us (or reddit or digg).

The number of people who saved a URL on del.icio.us might be an even better indicator of post interestingness than bare view count - these people considered the post interesting enough to bookmark it, instead of just entering, looking at the cat pic, and going away ;-)

del.icio.us is kind enough to show the number of other people who saved a URL. Unfortunately it only does so for logged-in users, so the wget function needs to be slightly modified to make del.icio.us think I'm logged in. It would be too much bother to write login code just for such a simple script - reusing Firefox cookies is way simpler. Firefox uses silly strings in directory names - my cookies are in /home/taw/.mozilla/firefox/g9exa7wa.default/cookies.txt - but Ruby's builtin Dir[] function saves the day.
$cookies = Dir["#{ENV['HOME']}/.mozilla/firefox/*/cookies.txt"][0]
unless $cookies
 STDERR.print "Cannot find cookies\n"
 exit 1
end

def wget(url, file=url.match(/([^\/]*)\Z/)[0])
 # Same cached download as before, now sending the Firefox cookies along
 system "wget", "--no-verbose", url, "--load-cookies", $cookies, "-O", file unless File.exist?(file)
 File.read(file)
end
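To see what the Dir[] glob does without touching a real Firefox profile, here is a small sketch that builds a throwaway home directory first. The find_cookies helper and the abc123.default profile name are made up for illustration:

```ruby
require 'tmpdir'
require 'fileutils'

# Returns the first cookies.txt found under any Firefox profile directory
# below the given home directory, or nil when there is none.
def find_cookies(home)
  Dir[File.join(home, ".mozilla/firefox/*/cookies.txt")].sort.first
end

Dir.mktmpdir do |home|
  profile = File.join(home, ".mozilla/firefox/abc123.default")  # made-up profile name
  FileUtils.mkdir_p(profile)
  FileUtils.touch(File.join(profile, "cookies.txt"))
  puts find_cookies(home)   # path ends in abc123.default/cookies.txt
end
```

The * matches the random-looking profile directory, so the script never has to know its name. With several profiles the glob returns several paths; sorting them first at least makes the choice repeatable.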
And the entire script:
require 'rubygems'
require 'hpricot'

$cookies = Dir["#{ENV['HOME']}/.mozilla/firefox/*/cookies.txt"][0]
unless $cookies
 STDERR.print "Cannot find cookies\n"
 exit 1
end

class String
 # hpricot/text is supposed to be doing this, but it doesn't work
 def unescape_html
     ent = {"quot" => "\"", "apos" => "'", "lt" => "<", "gt" => ">", "amp" => "&"}
     gsub(/&(quot|apos|lt|gt|amp);/) { ent[$1] }
 end
end

def wget(url, file=url.match(/([^\/]*)\Z/)[0])
 # Cached download that sends the Firefox cookies along
 system "wget", "--no-verbose", url, "--load-cookies", $cookies, "-O", file unless File.exist?(file)
 File.read(file)
end

def deli_pages(*tags)
 tags_u = "/#{tags.join('+')}" unless tags.empty?
 url = "http://del.icio.us/taw#{tags_u}?setcount=100"
 page_number = 1
 page = wget(url, "deli_bookmarks_page_#{tags.join('_')}_#{page_number}")
 yield page
 while page =~ /<a rel="prev" accesskey=".*?" href="(.*?)">/
    page_number += 1
    url = "http://del.icio.us#{$1}&setcount=100"
    page = wget(url, "deli_bookmarks_page_#{tags.join('_')}_#{page_number}")
    yield page
 end
end

pages = []
deli_pages(%w[taw blog]){|page|
 doc = Hpricot(page)
 (doc/"li.post").each{|post|
     desc = post.at("h4.desc a")

     title = desc.inner_text.unescape_html
     link = desc.attributes["href"]

     saved_by_others = post.at("a.pop")
     if saved_by_others
         saved_by_others.inner_text =~ /\Asaved by (\d+)/
         popularity = $1.to_i
     else
         popularity = 0
     end

     pages << [popularity, title]
 }
}
pages.sort.each{|popularity, title|
 puts "#{popularity} - #{title}"
}
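The popularity extraction and the ranking can be tried without del.icio.us at all. In this sketch the popularity_from helper is hypothetical, and the "saved by N other people" link texts are an assumption based on the regexp in the script. The titles and counts come from this post's own results:

```ruby
# Turns the text of the "saved by N ..." link into a number.
# A missing link (nil) or unexpected text counts as 0 saves.
def popularity_from(text)
  text.to_s[/\Asaved by (\d+)/, 1].to_i
end

# [link text, title] pairs; the link texts are assumed, not copied from the site.
bookmarks = [
  ["saved by 56 other people", "Atomic Coding with Subversion"],
  [nil,                        "A post nobody else saved"],
  ["saved by 84 other people", "The right to criticize programming languages"],
]
pages = bookmarks.map { |text, title| [popularity_from(text), title] }

# Highest count first, then keep the top two
top = pages.sort_by { |popularity, _title| -popularity }.first(2)
top.each { |popularity, title| puts "#{popularity} - #{title}" }
```

Note that the script above uses a plain pages.sort, which is ascending, so its most-saved posts end up at the bottom of the output.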
The top ten by number of del.icio.us saves:
  • 84 - The right to criticize programming languages
  • 78 - How to code debuggers
  • 56 - Atomic Coding with Subversion
  • 38 - Modern x86 assembly
  • 28 - Prototype-based Ruby
  • 24 - Making Ruby faster
  • 20 - taw's blog
  • 19 - Segfaulting own programs for fun and profit
  • 18 - magic/help for Ruby
  • 17 - Yannis's Law: Programmer Productivity Doubles Every 6 Years

Google Analytics

Google Analytics has no API and is proud of it. Fortunately it provides reports in tab-separated format. The report URLs are not documented, though - to extract one, open the Flash application, view the report, and copy its URL.

I had problems with authorization too. It seems using Firefox's cookies.txt is not enough for Google Analytics to let me view reports. They probably use session cookies or something like that. I copy & pasted the cookie using FireBug and it worked. The modified wget function is now:
$cookie_header = File.read("/home/taw/.google_analytics_cookie").chomp

def wget(url, file=url.match(/([^\/]*)\Z/)[0])
 # Cached download that sends the copied cookie as a raw request header
 system "wget", "--no-verbose", url, "--header", $cookie_header, "-O", file unless File.exist?(file)
 File.read(file)
end
The .google_analytics_cookie file looks like this:
Cookie: AnalyticsUserLocale=en-GB; __utm...
And the whole source:
$cookie_header = File.read("/home/taw/.google_analytics_cookie").chomp

def wget(url, file=url.match(/([^\/]*)\Z/)[0])
 # Cached download that sends the copied cookie as a raw request header
 system "wget", "--no-verbose", url, "--header", $cookie_header, "-O", file unless File.exist?(file)
 File.read(file)
end

class GoogleAnalytics
 def initialize
     @rid        = '1222880'
     @start_date = '20060701'
     @end_date   = '20101231'
 end
 def report_url(vid)
     "https://www.google.com/analytics/home/report?rid=#{@rid}&user=&vid=#{vid}&bd=#{@start_date}&ed=#{@end_date}&ns=10&ss=0&fd=&ft=2&sf=2&sb=1&dow=0&dt=10&dtc=2&dcomp=0&xd=1&x=1"
 end
 def content_by_titles_url
     report_url(1306)
 end
 def referring_source_url
     report_url(1208)
 end
 def content_by_titles
     wget(content_by_titles_url, "content_by_titles")
 end
 def referring_source
     wget(referring_source_url, "referring_source")
 end
 def self.run
     ga = new
     ga.content_by_titles
     ga.referring_source
 end
end

GoogleAnalytics.run
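The script above only downloads the reports. Reading one can be sketched as follows, assuming the layout that the scoring script further down relies on: tab-separated lines with the title first and unique views second, plus blank lines and lines starting with # to skip. The parse_report helper and the sample rows (including the made-up third column) are illustrative only:

```ruby
# Parses a tab-separated report into [unique_views, title] pairs.
# Blank lines and lines starting with '#' are skipped.
def parse_report(text)
  rows = []
  text.each_line do |line|
    line = line.chomp
    next if line.empty? || line.start_with?("#")
    title, unique_views, *_rest = line.split("\t")
    rows << [unique_views.to_i, title]
  end
  rows
end

# Made-up sample in the assumed layout; the third column is a placeholder.
sample = [
  "# made-up header comment",
  "",
  ["taw's blog: Making Ruby faster", 3776, 5000].join("\t"),
  ["taw's blog: Modern x86 assembly", 2631, 3100].join("\t"),
].join("\n")

parse_report(sample).sort.reverse.each { |views, title| puts "#{views} - #{title}" }
```

The real report may carry more columns or a different header. Only the first two fields are used here.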
The Google Analytics top ten is:
  • 6631 - The right to criticize programming languages
  • 3776 - Making Ruby faster
  • 3066 - Yannis's Law: Programmer Productivity Doubles Every 6 Years
  • 2631 - Modern x86 assembly
  • 2562 - Atomic Coding with Subversion
  • 2033 - How to code debuggers
  • 1525 - Prototype-based Ruby
  • 1511 - Segfaulting own programs for fun and profit
  • 1211 - Big O analysis considered harmful
  • 930 - My first impressions of Erlang

The final results

On all criteria "The right to criticize programming languages" took the top spot by far. The lists produced by del.icio.us saves and by Google Analytics unique views both seemed reasonable. Number of comments seemed to be a poor predictor of overall popularity. In the end I added the percentage of views to the percentage of saves to get the final score:
# Uses 3 sources to determine popularity:
# * Number of del.icio.us saves
# * Number of Blogger comments
# * Number of visits according to Google Analytics

# Each helper script is run only when its cached output is missing
unless File.exist?("content_by_titles")
 system "./google_analytics_report"
end

unless File.exist?("blogger_comments_cnt")
 system "./blogger_comments >blogger_comments_cnt"
end

unless File.exist?("../del.icio.us/deli_popularity")
 Dir.chdir("../del.icio.us") {
     system "./deli_popularity.rb >deli_popularity"
 }
end

visits   = File.read("content_by_titles")
comments = File.read("blogger_comments_cnt")
saves    = File.read("../del.icio.us/deli_popularity")

statistics = {}

class Article
 attr_reader :comments, :visits, :saves
 @@comments_total = 0
 @@visits_total   = 0
 @@saves_total    = 0
 def initialize(title)
     @title    = title
     @comments = 0
     @visits   = 0
     @saves    = 0
 end
 def comments=(comments)
     @@comments_total += comments - @comments
     @comments         = comments
 end
 def visits=(visits)
     @@visits_total += visits - @visits
     @visits         = visits
 end
 def saves=(saves)
     @@saves_total += saves - @saves
     @saves         = saves
 end
 def comments_perc
     100.0 * @comments / @@comments_total
 end
 def saves_perc
     100.0 * @saves / @@saves_total
 end
 def visits_perc
     100.0 * @visits / @@visits_total
 end
 def to_s
     sprintf "%s (%.0f%% saves, %.0f%% visits, %.0f%% comments)",
         @title, saves_perc, visits_perc, comments_perc
 end
 include Comparable
 def <=>(other)
     (saves_perc + visits_perc) <=>
     (other.saves_perc + other.visits_perc)
 end
end

# blogger_comments output: comments<TAB>title<TAB>url
comments.each_line{|line|
 count, title, url = line.chomp.split(/\t/)
 statistics[title] = Article.new(title)
 statistics[title].comments = count.to_i
}

# deli_popularity output: "saves - title"
saves.each_line{|line|
 count, title = line.chomp.split(/ - /, 2)
 next if title == "taw's blog"
 statistics[title].saves = count.to_i
}

# Google Analytics report: title<TAB>unique views<TAB>..., with blank and # lines
visits.each_line{|line|
 line.chomp!
 next if line == "" or line =~ /\A#/
 title, unique_views, *stuff = line.split(/\t/)
 if title =~ /\Ataw\'s blog: (.*)/ and statistics[$1]
     statistics[$1].visits = unique_views.to_i
 end
}

stats = statistics.values

puts "By saves:"
stats.sort_by{|post| -post.saves}[0,10].each{|post|
 puts "* #{post}"
}
puts ""

puts "By views:"
stats.sort_by{|post| -post.visits}[0,10].each{|post|
 puts "* #{post}"
}
puts ""

puts "By comments:"
stats.sort_by{|post| -post.comments}[0,10].each{|post|
 puts "* #{post}"
}
puts ""

puts "By total score:"
stats.sort.reverse[0,15].each{|post|
 puts "* #{post}"
}
puts ""
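The ranking rule hidden in Article#<=> is just "percent of all saves plus percent of all visits". Here it is in isolation, with made-up counts for three hypothetical posts (none of these numbers come from the real data); the percent helper is also only illustrative:

```ruby
# Share of a total, in percent; 0.0 when the total is zero.
def percent(part, total)
  total.zero? ? 0.0 : 100.0 * part / total
end

# Made-up counts for three hypothetical posts: title => [saves, visits]
posts = { "A" => [30, 100], "B" => [10, 700], "C" => [10, 200] }
saves_total  = posts.values.sum { |saves, _| saves }
visits_total = posts.values.sum { |_, visits| visits }

# Final score = percent of saves + percent of visits (comments are ignored)
scores = posts.map { |title, (saves, visits)|
  [title, percent(saves, saves_total) + percent(visits, visits_total)]
}
ranking = scores.sort_by { |_, score| -score }
ranking.each { |title, score| printf("%s: %.1f\n", title, score) }
```

Because each metric is turned into a share of its own total first, a post with few saves can still win on visits alone, as "B" does in this sample.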
The final top 15 list is:
  • The right to criticize programming languages (17% saves, 16% visits, 11% comments)
  • How to code debuggers (16% saves, 5% visits, 2% comments)
  • Atomic Coding with Subversion (12% saves, 6% visits, 3% comments)
  • Modern x86 assembly (8% saves, 6% visits, 2% comments)
  • Making Ruby faster (5% saves, 9% visits, 3% comments)
  • Yannis's Law: Programmer Productivity Doubles Every 6 Years (4% saves, 7% visits, 4% comments)
  • Prototype-based Ruby (6% saves, 4% visits, 1% comments)
  • Segfaulting own programs for fun and profit (4% saves, 4% visits, 1% comments)
  • magic/help for Ruby (4% saves, 2% visits, 2% comments)
  • Fight to the death between Ruby and Python (3% saves, 2% visits, 5% comments)
  • RLisp - Lisp naturally embedded in Ruby (3% saves, 2% visits, 1% comments)
  • My first impressions of Erlang (2% saves, 2% visits, 4% comments)
  • Big O analysis considered harmful (1% saves, 3% visits, 2% comments)
  • The programming language of 2010 (2% saves, 2% visits, 7% comments)
  • iPod-last.fm bridge (2% saves, 2% visits, 4% comments)

4 comments:

  1. Been looking for 4 or 5 hours now all over the web for a script like this that will create a list of your site's most popular pages based on google analytics stats. Not a hardcore programmer, so can't quite follow what you're doing here, and doesn't look like it's packaged up in a widget or block of PHP code or something that I could deploy. So you're downloading a cookie of the Analytics data down, and using that as the basis of data? Could this be done using live stats info instead? Perhaps you could provide a more generic version with some commenting on using on your own site. It would be really useful, as many people run Analytics, few run databases which I assume is the way "Most Popular" sections are done on sites like YouTube, CNET, etc.

  2. window: The data is available from Google Analytics in tab-separated format under URL like https://www.google.com/analytics/home/report?rid=1222880&user=&vid=1306&bd=20060101&ed=20071231&ns=10&ss=0&fd=&ft=2&sf=2&sb=1&dow=0&dt=3&dtc=2&dcomp=0&xd=1&x=1
    Change rid from 1222880 (my blog's) to the one for your website to get the data; bd and ed are the start and end dates in YYYYMMDD format. You must be logged in to Google Analytics to download it.

    The entries are already sorted by popularity.

    Was it helpful?

  3. Hi taw,

    This is a very useful hack, but it looks like it is described for serious programmers. For lesser mortals, it looks tough going. For example, there is no explanation as to which part of the template the script should go into, and if where we encounter url, we are supposed to leave it as it is or to replace it with the url of our blog. Any chance of you elaborating?

  4. curious: You're right, these scripts were written by a programmer, and while most readers of this blog seem to be programmers, others probably won't be able to use them on their own blogs.

    If you're not a programmer and you want to get a list of most popular blog posts, the easiest way would be to get a Google Analytics account, put Google Analytics Javascript in your blog's template (it's explained by Google Analytics quite well), and then use Google Analytics GUI to get the data.
