The best kittens, technology, and video games blog in the world.

Friday, April 06, 2007

Most popular blog posts

Mine! by Gini~ from flickr (CC-NC-ND)

Popularity tends to follow some sort of a power law - a few posts are interesting to many people, and most posts are interesting to just a few (or only the author).

Blogs often display a list of their most popular posts in the sidebar. A visitor can glance at such a list, and if there's anything on the blog they'd be interested in, it's most likely there.

Blogger doesn't support such lists, so I had to hack them on my own. Blogger also doesn't support statistics, so I'm getting them through Google Analytics. Did you notice a pattern here? Having a blog on Blogger means spending a lot of time hacking small tools to get things which WordPress bloggers get for free. That's probably why so many hackers use Blogger instead of WordPress - creating tools is much more fun than using them.

At first I wanted to simply list the most popular posts according to Google Analytics, but the list isn't really supposed to faithfully reflect raw popularity - it's supposed to guess what would be most useful to readers.

I got data from three sources:
  • number of Blogger comments
  • number of del.icio.us saves
  • number of visits according to Google Analytics

This wasn't the first metric I used, but the code is relatively simple. The first thing - I use wget instead of net/http or open-uri. Ruby HTTP libraries are weak and hard to use for anything non-trivial, while wget is absolutely awesome - powerful and simple. I also cache everything I download on disk, in meaningfully named files. It's not particularly important in "production", but during development having to wait for things to download is extremely distracting. It destroys productivity almost as badly as waiting for things to compile.

So the following function fetches url from the Internet, and caches it in file. If file is not specified, it's simply the portion of url following the last slash.

def wget(url, file=url.match(/([^\/]*)\Z/)[0])
  unless File.exists?(file)
    retval = system "wget", "--no-verbose", url, "-O", file
    raise "Cannot download #{url}" unless retval
  end
  File.read(file)
end
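For illustration, the default-file-name pattern picks out everything after the last slash (and yields an empty file name for URLs ending in a slash, which is worth knowing). The helper name cache_name here is made up for the example:

```ruby
# the default cache file name: the part of the URL after the last slash
def cache_name(url)
  url.match(/([^\/]*)\Z/)[0]
end

puts cache_name("http://t-a-w.blogspot.com/2007_04_01_archive.html")
# => 2007_04_01_archive.html
puts cache_name("http://example.com/").inspect
# => ""
```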
And the whole script:

require 'rubygems'
require 'hpricot'

def wget(url, file=url.match(/([^\/]*)\Z/)[0])
  unless File.exists?(file)
    retval = system "wget", "--no-verbose", url, "-O", file
    raise "Cannot download #{url}" unless retval
  end
  File.read(file)
end

url = "http://t-a-w.blogspot.com/"
file_name = "blog-main"

doc = Hpricot(wget(url, file_name))
(doc/"a").each{|archive_link|
  url = archive_link.attributes["href"]
  next unless url =~ %r[\Ahttp://t-a-w\.blogspot\.com/[0-9_]*_archive\.html\Z]
  archive_doc = Hpricot(wget(url))
  # post container/title selectors reconstructed from Blogger's template
  (archive_doc/"div.post").each{|post|
    post_link = post.at("h3.post-title a")
    next unless post_link
    link = post_link.attributes["href"]
    title = post_link.inner_text
    next unless post.at("a.comment-link")
    post.at("a.comment-link").inner_text =~ /\A(\d+) comments?\Z/
    comments = $1.to_i

    puts [comments, title, link].join("\t")
  }
}
Its output looks something like this:
6       How to code debuggers
7       Programming in Blub
One more thing - I always use \A and \Z instead of ^ and $ in Ruby regular expressions, unless I explicitly need the latter. ^ and $ mean different things in different contexts and can cause weird problems.
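A quick demonstration of the difference: with a stray line in front of the number, the line anchors still match, the string anchors correctly refuse to.

```ruby
str = "oops\n7 comments"

# ^ and $ anchor at line boundaries, so the first line is silently skipped:
per_line = str =~ /^(\d+) comments?$/       # matches at the "7"
# \A and \Z anchor the whole string, so the stray line makes the match fail:
whole    = str =~ /\A(\d+) comments?\Z/     # nil

puts per_line.inspect  # => 5
puts whole.inspect     # => nil
```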

The ten most popular posts according to the number of comments metric were:

del.icio.us saves

I save all the blog posts on del.icio.us, tagged taw+blog. At first I did it because Blogger didn't have labels, but even now that it has labels, it's still useful to have all the URLs in one place. Thanks to web 2.0 buttons, it's just a single click to add a blog post to del.icio.us (or reddit or digg).

The number of people who saved a URL on del.icio.us might be an even better indicator of post interestingness than a bare view count - these people considered the post interesting enough to bookmark it, instead of just dropping in, looking at the cat pic, and going away ;-)

del.icio.us is kind enough to show the number of other people who saved an URL. Unfortunately it only does so for logged-in users, so the wget function needs to be slightly modified to make del.icio.us think I'm logged in. It would be too much bother to write login code just for such a simple script - reusing Firefox cookies is way simpler.

Firefox uses silly strings in directory names - my cookies are in /home/taw/.mozilla/firefox/g9exa7wa.default/cookies.txt, but Ruby builtin Dir[] function saves the day.
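A small self-contained sketch of the trick, using a throwaway directory standing in for the Firefox profile tree (the profile name g9exa7wa.default is just an example):

```ruby
require 'tmpdir'
require 'fileutils'

found = nil
Dir.mktmpdir do |home|
  # fake profile directory with a silly random name, as Firefox creates them
  profile = "#{home}/.mozilla/firefox/g9exa7wa.default"
  FileUtils.mkdir_p(profile)
  File.open("#{profile}/cookies.txt", "w") {|f| f.puts "# cookies"}

  # the glob finds cookies.txt without knowing the profile directory's name
  found = Dir["#{home}/.mozilla/firefox/*/cookies.txt"][0]
end
puts found
```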

$cookies = Dir["#{ENV['HOME']}/.mozilla/firefox/*/cookies.txt"][0]
unless $cookies
  STDERR.print "Cannot find cookies\n"
  exit 1
end

def wget(url, file=url.match(/([^\/]*)\Z/)[0])
  system "wget", "--no-verbose", url, "--load-cookies", $cookies, "-O", file unless File.exists?(file)
  File.read(file)
end

And the entire script:

require 'rubygems'
require 'hpricot'

$cookies = Dir["#{ENV['HOME']}/.mozilla/firefox/*/cookies.txt"][0]
unless $cookies
  STDERR.print "Cannot find cookies\n"
  exit 1
end

class String
  # hpricot/text is supposed to be doing this, but it doesn't work
  def unescape_html
    ent = {"quot" => "\"", "apos" => "'", "lt" => "<", "gt" => ">", "amp" => "&"}
    gsub(/&(quot|apos|lt|gt|amp);/) { ent[$1] }
  end
end

def wget(url, file=url.match(/([^\/]*)\Z/)[0])
  system "wget", "--no-verbose", url, "--load-cookies", $cookies, "-O", file unless File.exists?(file)
  File.read(file)
end

# yields each page of the paginated bookmark list for the given tags
def deli_pages(*tags)
  tags_u = "/#{tags.join('+')}" unless tags.empty?
  url = "http://del.icio.us/taw#{tags_u}?setcount=100"
  page_number = 1
  page = wget(url, "deli_bookmarks_page_#{tags.join('_')}_#{page_number}")
  yield page
  while page =~ /<a rel="prev" accesskey=".*?" href="(.*?)">/
    page_number += 1
    url = "http://del.icio.us#{$1}&setcount=100"
    page = wget(url, "deli_bookmarks_page_#{tags.join('_')}_#{page_number}")
    yield page
  end
end

pages = []
deli_pages(%w[taw blog]){|page|
  doc = Hpricot(page)
  # each bookmark lives in its own li.post element (selector reconstructed)
  (doc/"li.post").each{|post|
    desc = post.at("h4.desc a")
    next unless desc

    title = desc.inner_text.unescape_html
    link = desc.attributes["href"]

    saved_by_others = post.at("a.pop")
    if saved_by_others
      saved_by_others.inner_text =~ /\Asaved by (\d+)/
      popularity = $1.to_i
    else
      popularity = 0
    end

    pages << [popularity, title]
  }
}
pages.sort.each{|popularity, title|
  puts "#{popularity} - #{title}"
}
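The unescape_html helper only handles the five predefined XML entities, which is all the bookmark titles need. A quick check of what it does:

```ruby
class String
  # minimal HTML entity unescaping, as used for bookmark titles above
  def unescape_html
    ent = {"quot" => "\"", "apos" => "'", "lt" => "<", "gt" => ">", "amp" => "&"}
    gsub(/&(quot|apos|lt|gt|amp);/) { ent[$1] }
  end
end

puts "Fish &amp; chips &lt;3".unescape_html
# => Fish & chips <3
```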

The top ten according to the number of saves is:
  • 84 - The right to criticize programming languages
  • 78 - How to code debuggers
  • 56 - Atomic Coding with Subversion
  • 38 - Modern x86 assembly
  • 28 - Prototype-based Ruby
  • 24 - Making Ruby faster
  • 20 - taw's blog
  • 19 - Segfaulting own programs for fun and profit
  • 18 - magic/help for Ruby
  • 17 - Yannis's Law: Programmer Productivity Doubles Every 6 Years

Google Analytics

Google Analytics has no API and is proud of it. Fortunately they provide reports in tab-separated format. The URLs are not documented anywhere - to extract one, open the Flash application, view the report, and copy its URL. I had problems with authorization too. It seems using Firefox's cookies.txt is not enough for Google Analytics to let me view reports. They probably use session cookies or something like that. I copy&pasted the cookie using FireBug and it worked.

The modified wget function is now:

$cookie_header = File.read("/home/taw/.google_analytics_cookie").chomp

def wget(url, file=url.match(/([^\/]*)\Z/)[0])
  system "wget", "--no-verbose", url, "--header", $cookie_header, "-O", file unless File.exists?(file)
  File.read(file)
end

The .google_analytics_cookie file looks like this: Cookie: AnalyticsUserLocale=en-GB; __utm...

and the whole source:

$cookie_header = File.read("/home/taw/.google_analytics_cookie").chomp

def wget(url, file=url.match(/([^\/]*)\Z/)[0])
  system "wget", "--no-verbose", url, "--header", $cookie_header, "-O", file unless File.exists?(file)
  File.read(file)
end

class GoogleAnalytics
  def initialize
    @rid = '1222880'
    @start_date = '20060701'
    @end_date = '20101231'
  end
  # the three *_url methods build the undocumented report URLs
  # (copied out of the Flash interface) from @rid, @start_date and @end_date
  def report_url(vid)
  end
  def content_by_titles_url
  end
  def referring_source_url
  end
  def content_by_titles
    wget(content_by_titles_url, "content_by_titles")
  end
  def referring_source
    wget(referring_source_url, "referring_source")
  end
end

ga = GoogleAnalytics.new
ga.content_by_titles
ga.referring_source

The Google Analytics top ten is:
  • 6631 - The right to criticize programming languages
  • 3776 - Making Ruby faster
  • 3066 - Yannis's Law: Programmer Productivity Doubles Every 6 Years
  • 2631 - Modern x86 assembly
  • 2562 - Atomic Coding with Subversion
  • 2033 - How to code debuggers
  • 1525 - Prototype-based Ruby
  • 1511 - Segfaulting own programs for fun and profit
  • 1211 - Big O analysis considered harmful
  • 930 - My first impressions of Erlang

The final results

By all criteria, "The right to criticize programming languages" got the top spot by far. The lists created by del.icio.us saves and by Google Analytics unique views both seemed reasonable. The number of comments seemed to be a poor predictor of overall popularity. In the end I added percent of views and percent of saves to get the final score:
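The final score is just the sum of a post's share of all saves and its share of all views; a toy calculation with made-up totals (the real totals come from summing every post's statistics):

```ruby
# made-up totals, just to illustrate the arithmetic
saves, saves_total   = 84, 480
visits, visits_total = 6631, 40_000

saves_perc  = 100.0 * saves / saves_total      # share of all saves: 17.5%
visits_perc = 100.0 * visits / visits_total    # share of all views: ~16.6%
score = saves_perc + visits_perc
printf "score = %.1f\n", score
# => score = 34.1
```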
# Uses 3 sources to determine popularity:
# * Number of del.icio.us saves
# * Number of Blogger comments
# * Number of visits according to Google Analytics

unless File.exists?("content_by_titles")
  system "./google_analytics_report"
end

unless File.exists?("blogger_comments_cnt")
  system "./blogger_comments >blogger_comments_cnt"
end

unless File.exists?("../del.icio.us/deli_popularity")
  Dir.chdir("../del.icio.us") {
    system "./deli_popularity.rb >deli_popularity"
  }
end

visits = File.read("content_by_titles")
comments = File.read("blogger_comments_cnt")
saves = File.read("../del.icio.us/deli_popularity")

statistics = {}

class Article
  attr_reader :comments, :visits, :saves
  @@comments_total = 0
  @@visits_total = 0
  @@saves_total = 0
  def initialize(title)
    @title = title
    @comments = 0
    @visits = 0
    @saves = 0
  end
  def comments=(comments)
    @@comments_total += comments - @comments
    @comments = comments
  end
  def visits=(visits)
    @@visits_total += visits - @visits
    @visits = visits
  end
  def saves=(saves)
    @@saves_total += saves - @saves
    @saves = saves
  end
  def comments_perc
    100.0 * @comments / @@comments_total
  end
  def saves_perc
    100.0 * @saves / @@saves_total
  end
  def visits_perc
    100.0 * @visits / @@visits_total
  end
  def to_s
    sprintf "%s (%.0f%% saves, %.0f%% visits, %.0f%% comments)",
            @title, saves_perc, visits_perc, comments_perc
  end
  include Comparable
  def <=>(other)
    (saves_perc + visits_perc) <=>
      (other.saves_perc + other.visits_perc)
  end
end

comments.each_line{|line|
  comments_cnt, title, url = line.chomp.split(/\t/)
  statistics[title] = Article.new(title)
  statistics[title].comments = comments_cnt.to_i
}

saves.each_line{|line|
  saves_cnt, title = line.chomp.split(/ - /, 2)
  next if title == "taw's blog"
  next unless statistics[title]
  statistics[title].saves = saves_cnt.to_i
}

visits.each_line{|line|
  next if line.chomp == "" or line =~ /\A#/
  title, unique_views, *stuff = line.split(/\t/)
  if title =~ /\Ataw\'s blog: (.*)/ and statistics[$1]
    statistics[$1].visits = unique_views.to_i
  end
}

stats = statistics.values

puts "By saves:"
stats.sort_by{|post| -post.saves}[0,10].each{|post|
  puts "* #{post}"
}
puts ""

puts "By views:"
stats.sort_by{|post| -post.visits}[0,10].each{|post|
  puts "* #{post}"
}
puts ""

puts "By comments:"
stats.sort_by{|post| -post.comments}[0,10].each{|post|
  puts "* #{post}"
}
puts ""

puts "By total score:"
stats.sort.reverse[0,15].each{|post|
  puts "* #{post}"
}
puts ""
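The ranking trick in the Article class is worth noting: include Comparable and define <=> on the combined score, and sort, max and min come for free. A toy version (class and numbers made up for illustration):

```ruby
# toy stand-in for Article: comparison on a single combined score
class Scored
  include Comparable
  attr_reader :name, :score
  def initialize(name, score)
    @name, @score = name, score
  end
  # Comparable derives <, >, ==, between? etc. from this one method
  def <=>(other)
    score <=> other.score
  end
end

posts = [Scored.new("a", 12.0), Scored.new("b", 33.0), Scored.new("c", 7.0)]
puts posts.max.name                     # => b
puts posts.sort.map{|p| p.name}.join(",")  # => c,a,b
```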

The final top 15 list is:
  • The right to criticize programming languages (17% saves, 16% visits, 11% comments)
  • How to code debuggers (16% saves, 5% visits, 2% comments)
  • Atomic Coding with Subversion (12% saves, 6% visits, 3% comments)
  • Modern x86 assembly (8% saves, 6% visits, 2% comments)
  • Making Ruby faster (5% saves, 9% visits, 3% comments)
  • Yannis's Law: Programmer Productivity Doubles Every 6 Years (4% saves, 7% visits, 4% comments)
  • Prototype-based Ruby (6% saves, 4% visits, 1% comments)
  • Segfaulting own programs for fun and profit (4% saves, 4% visits, 1% comments)
  • magic/help for Ruby (4% saves, 2% visits, 2% comments)
  • Fight to the death between Ruby and Python (3% saves, 2% visits, 5% comments)
  • RLisp - Lisp naturally embedded in Ruby (3% saves, 2% visits, 1% comments)
  • My first impressions of Erlang (2% saves, 2% visits, 4% comments)
  • Big O analysis considered harmful (1% saves, 3% visits, 2% comments)
  • The programming language of 2010 (2% saves, 2% visits, 7% comments)
  • bridge (2% saves, 2% visits, 4% comments)


window said...

Been looking for 4 or 5 hours now all over the web for a script like this that will create a list of your site's most popular pages based on google analytics stats. Not a hardcore programmer, so can't quite follow what you're doing here, and doesn't look like it's packaged up in a widget or block of PHP code or something that I could deploy. So you're downloading a cookie of the Analytics data down, and using that as the basis of data? Could this be done using live stats info instead? Perhaps you could provide a more generic version with some commenting on using on your own site. It would be really useful, as many people run Analytics, few run databases which I assume is the way "Most Popular" sections are done on sites like YouTube, CNET, etc.

taw said...

window: The data is available from Google Analytics in tab-separated format, under a URL like the one I copied out of the Flash interface. Change rid from 1222880 (my blog's) to the one for your website to get the data; bd and ed are the start and end dates in YYYYMMDD format. You must be logged in to Google Analytics to download it.

The entries are already sorted by popularity.

Was it helpful?

curious said...

Hi taw,

This is a very useful hack, but it looks like it is described for serious programmers. For lesser mortals, it looks like tough going. For example, there is no explanation of which part of the template the script should go into, or whether, where we encounter url, we are supposed to leave it as it is or replace it with the URL of our own blog. Any chance of you elaborating?

taw said...

curious: You're right, these scripts were written by a programmer, and while most readers of this blog seem to be programmers, others probably won't be able to use them on their own blogs.

If you're not a programmer and you want to get a list of most popular blog posts, the easiest way would be to get a Google Analytics account, put Google Analytics Javascript in your blog's template (it's explained by Google Analytics quite well), and then use Google Analytics GUI to get the data.