Popularity tends to follow some sort of power law - a few posts are interesting to many people, and most posts are interesting to just a few (or only the author). Blogs often carry a list of their most popular posts on the sidebar. A person visiting such a blog can look at that list, and if there's anything on the blog they would be interested in, it's most likely there.

Blogger doesn't support such lists, so I had to hack them on my own. Blogger doesn't support statistics either, so I'm getting them through Google Analytics. Did you notice a pattern here? Having a blog on Blogger means spending a lot of time hacking small tools to get things which WordPress bloggers get for free. That's probably why so many hackers use Blogger instead of WordPress - creating tools is much more fun than using them.

At first I wanted to simply list the most popular posts according to Google Analytics, but the list isn't supposed to faithfully reflect popularity - it's supposed to guess what would be most useful to readers. So I got data from three sources:
- From Google Analytics - the number of unique visits
- From del.icio.us - the number of people who saved the post's URL
- From Blogger - the number of comments
Comments
This wasn't the first metric I used, but the code is relatively simple. The first thing - I use wget instead of net/http or open-uri. Ruby HTTP libraries are weak and hard to use for anything non-trivial, while wget is absolutely awesome - powerful and simple. I also cache everything I download on disk, in meaningfully named files. It's not particularly important in "production", but during development having to wait for things to download is extremely distracting. It destroys productivity almost as badly as waiting for things to compile.
So the following function fetches url from the Internet, and caches it in file. If file is not specified, it's simply the portion of url following the last slash.
def wget(url, file=url.match(/([^\/]*)\Z/)[0])
  system "wget", "--no-verbose", url, "-O", file unless File.exists?(file)
  File.read(file)
end
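For contrast, a minimal open-uri version of the same helper might look like the sketch below (my comparison, not code the scripts use). It covers the plain-GET case fine; it's the later steps, which need cookie files and extra headers, where wget really earns its keep:

require 'open-uri'

# Hypothetical open-uri version of the same cached fetch - fine for
# plain GETs, but wget's --load-cookies option has no equally
# painless equivalent here
def fetch(url, file=url.match(/([^\/]*)\Z/)[0])
  File.open(file, "w"){|f| f.write(open(url).read)} unless File.exists?(file)
  File.read(file)
end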
And the whole script:
require 'rubygems'
require 'hpricot'

def wget(url, file=url.match(/([^\/]*)\Z/)[0])
  system "wget", "--no-verbose", url, "-O", file unless File.exists?(file)
  File.read(file)
end

url = "http://t-a-w.blogspot.com/"
file_name = "blog-main"
doc = Hpricot(wget(url, file_name))

# Walk every monthly archive page linked from the main page
(doc/"div#ArchiveList a.post-count-link").each{|archive_link|
  url = archive_link.attributes["href"]
  next unless url =~ %r[\Ahttp://t-a-w\.blogspot\.com/[0-9_]*_archive\.html\Z]
  archive_doc = Hpricot(wget(url))
  (archive_doc/".post").each{|post|
    post_link = post.at("h3.post-title a")
    link = post_link.attributes["href"]
    title = post_link.inner_text
    post.at("a.comment-link").inner_text =~ /\A(\d+) comments?\Z/
    comments = $1.to_i
    puts [comments, title, link].join("\t")
  }
}
Its output looks something like this:

6 How to code debuggers http://t-a-w.blogspot.com/2007/03/how-to-code-debuggers.html
7 Programming in Blub http://t-a-w.blogspot.com/2006/08/programming-in-blub.html
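The output isn't sorted - posts come out in archive order. Getting the ranking below is just a numeric sort on the first column; a throwaway filter like this one (not part of the original script) does it:

# Sort blogger_comments output (comments<TAB>title<TAB>link) by
# comment count, highest first, and keep the top ten
rows = STDIN.readlines.map{|line| line.chomp.split(/\t/, 3)}
rows.sort_by{|comments, title, link| -comments.to_i}[0, 10].each{|row|
  puts row.join("\t")
}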
One more thing - I always use \A and \Z instead of ^ and $ in Ruby regular expressions, unless I explicitly need the latter. In Ruby, ^ and $ match at every line boundary, not just the ends of the string, and that mismatch can cause weird problems.
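A minimal demonstration of the difference:

# ^ and $ match at every line boundary, while \A and \Z
# anchor to the whole string - a big difference on multiline input
str = "junk\nhttp://t-a-w.blogspot.com/"
str =~ /^http:/   # matches - ^ anchors to the start of the second line
str =~ /\Ahttp:/  # nil - \A anchors only to the start of the string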
The ten most popular posts according to the comment count metric were:
- 40 - The right to criticize programming languages
- 24 - The programming language of 2010
- 17 - Fight to the death between Ruby and Python
- 15 - My first impressions of Erlang
- 14 - iPod-last.fm bridge
- 14 - Yannis's Law: Programmer Productivity Doubles Every 6 Years
- 13 - Why Perl Is a Great Language for Concurrent Programming
- 13 - List of things that suck in Ruby
- 12 - Making Ruby faster
- 12 - Atomic Coding with Subversion
del.icio.us saves
I save all the blog posts on del.icio.us, tagged taw+blog. At first I did it because Blogger didn't have labels, but even now that it has labels, it's still useful to have all the URLs in one place. Thanks to web 2.0 buttons, adding a blog post to del.icio.us (or reddit or digg) is a single click. The number of people who saved a URL on del.icio.us might be an even better indicator of a post's interestingness than the bare view count - these people considered the post interesting enough to bookmark it, instead of just dropping in, looking at the cat pic, and going away ;-)

del.icio.us is kind enough to show the number of other people who saved an URL. Unfortunately it only does so for logged-in users, so the wget function needs to be slightly modified to make del.icio.us think I'm logged in. It would be too much bother to write login code just for such a simple script - reusing Firefox cookies is way simpler.
Firefox uses silly strings in directory names - my cookies are in /home/taw/.mozilla/firefox/g9exa7wa.default/cookies.txt, but Ruby's builtin Dir[] function saves the day.
$cookies = Dir["#{ENV['HOME']}/.mozilla/firefox/*/cookies.txt"][0]
unless $cookies
  STDERR.print "Cannot find cookies\n"
  exit 1
end

# Same wget as before, plus Firefox cookies so del.icio.us thinks I'm logged in
def wget(url, file=url.match(/([^\/]*)\Z/)[0])
  system "wget", "--no-verbose", url, "--load-cookies", $cookies, "-O", file unless File.exists?(file)
  File.read(file)
end
And the entire script:

require 'rubygems'
require 'hpricot'

$cookies = Dir["#{ENV['HOME']}/.mozilla/firefox/*/cookies.txt"][0]
unless $cookies
  STDERR.print "Cannot find cookies\n"
  exit 1
end

class String
  # hpricot/text is supposed to be doing this, but it doesn't work
  def unescape_html
    ent = {"quot" => "\"", "apos" => "'", "lt" => "<", "gt" => ">", "amp" => "&"}
    gsub(/&(quot|apos|lt|gt|amp);/) { ent[$1] }
  end
end

def wget(url, file=url.match(/([^\/]*)\Z/)[0])
  system "wget", "--no-verbose", url, "--load-cookies", $cookies, "-O", file unless File.exists?(file)
  File.read(file)
end

# Yield every page of bookmarks for the given tags, following
# rel="prev" links until the last page
def deli_pages(*tags)
  tags_u = "/#{tags.join('+')}" unless tags.empty?
  url = "http://del.icio.us/taw#{tags_u}?setcount=100"
  page_number = 1
  page = wget(url, "deli_bookmarks_page_#{tags.join('_')}_#{page_number}")
  yield page
  while page =~ /<a rel="prev" accesskey=".*?" href="(.*?)">/
    page_number += 1
    url = "http://del.icio.us#{$1}&setcount=100"
    page = wget(url, "deli_bookmarks_page_#{tags.join('_')}_#{page_number}")
    yield page
  end
end

pages = []
deli_pages(%w[taw blog]){|page|
  doc = Hpricot(page)
  (doc/"li.post").each{|post|
    desc = post.at("h4.desc a")
    title = desc.inner_text.unescape_html
    link = desc.attributes["href"]
    # The "saved by N other people" link is missing when nobody else saved it
    saved_by_others = post.at("a.pop")
    if saved_by_others
      saved_by_others.inner_text =~ /\Asaved by (\d+)/
      popularity = $1.to_i
    else
      popularity = 0
    end
    pages << [popularity, title]
  }
}

pages.sort.each{|popularity, title|
  puts "#{popularity} - #{title}"
}
The top ten according to the del.icio.us saves metric is:
- 84 - The right to criticize programming languages
- 78 - How to code debuggers
- 56 - Atomic Coding with Subversion
- 38 - Modern x86 assembly
- 28 - Prototype-based Ruby
- 24 - Making Ruby faster
- 20 - taw's blog
- 19 - Segfaulting own programs for fun and profit
- 18 - magic/help for Ruby
- 17 - Yannis's Law: Programmer Productivity Doubles Every 6 Years
Google Analytics
Google Analytics has no API and is proud of it. Fortunately it provides reports in tab-separated format. The report URLs are not documented either - to extract one, open the Flash application, view the report, and copy its URL. I had problems with authorization too. It seems using Firefox's cookies.txt is not enough for Google Analytics to let me view reports - they probably use session cookies or something like that. So I copy&pasted the cookie header using FireBug and it worked.
The modified wget function is now:
$cookie_header = File.read("/home/taw/.google_analytics_cookie").chomp

def wget(url, file=url.match(/([^\/]*)\Z/)[0])
  system "wget", "--no-verbose", url, "--header", $cookie_header, "-O", file unless File.exists?(file)
  File.read(file)
end
The .google_analytics_cookie file looks like this:

Cookie: AnalyticsUserLocale=en-GB; __utm...

And the whole source:
$cookie_header = File.read("/home/taw/.google_analytics_cookie").chomp

def wget(url, file=url.match(/([^\/]*)\Z/)[0])
  system "wget", "--no-verbose", url, "--header", $cookie_header, "-O", file unless File.exists?(file)
  File.read(file)
end

class GoogleAnalytics
  def initialize
    @rid = '1222880'         # report id, taken from this blog's report URL
    @start_date = '20060701'
    @end_date = '20101231'
  end
  def report_url(vid)
    "https://www.google.com/analytics/home/report?rid=#{@rid}&user=&vid=#{vid}&bd=#{@start_date}&ed=#{@end_date}&ns=10&ss=0&fd=&ft=2&sf=2&sb=1&dow=0&dt=10&dtc=2&dcomp=0&xd=1&x=1"
  end
  def content_by_titles_url
    report_url(1306)
  end
  def referring_source_url
    report_url(1208)
  end
  def content_by_titles
    wget(content_by_titles_url, "content_by_titles")
  end
  def referring_source
    wget(referring_source_url, "referring_source")
  end
  def self.run
    ga = new
    ga.content_by_titles
    ga.referring_source
  end
end

GoogleAnalytics.run
The Google Analytics top ten is:
- 6631 - The right to criticize programming languages
- 3776 - Making Ruby faster
- 3066 - Yannis's Law: Programmer Productivity Doubles Every 6 Years
- 2631 - Modern x86 assembly
- 2562 - Atomic Coding with Subversion
- 2033 - How to code debuggers
- 1525 - Prototype-based Ruby
- 1511 - Segfaulting own programs for fun and profit
- 1211 - Big O analysis considered harmful
- 930 - My first impressions of Erlang
The final results
On all criteria "The right to criticize programming languages" got the top spot by far. The lists created by del.icio.us saves and by Google Analytics unique views both seemed reasonable, while the number of comments seemed to be a poor predictor of overall popularity. In the end I added the percent of views and the percent of saves to get the final score - so a post with 5% of all saves and 9% of all visits scores 14.

# Uses 3 sources to determine popularity:
# * Number of del.icio.us saves
# * Number of Blogger comments
# * Number of visits according to Google Analytics

# Run the other scripts first if their cached output isn't there yet
unless File.exists?("content_by_titles")
  system "./google_analytics_report"
end
unless File.exists?("blogger_comments_cnt")
  system "./blogger_comments >blogger_comments_cnt"
end
unless File.exists?("../del.icio.us/deli_popularity")
  Dir.chdir("../del.icio.us") {
    system "./deli_popularity.rb >deli_popularity"
  }
end

visits = File.read("content_by_titles")
comments = File.read("blogger_comments_cnt")
saves = File.read("../del.icio.us/deli_popularity")

statistics = {}

class Article
  attr_reader :comments, :visits, :saves
  @@comments_total = 0
  @@visits_total = 0
  @@saves_total = 0
  def initialize(title)
    @title = title
    @comments = 0
    @visits = 0
    @saves = 0
  end
  # The setters keep class-wide totals in sync, so each article
  # can report its share of all comments/visits/saves
  def comments=(comments)
    @@comments_total += comments - @comments
    @comments = comments
  end
  def visits=(visits)
    @@visits_total += visits - @visits
    @visits = visits
  end
  def saves=(saves)
    @@saves_total += saves - @saves
    @saves = saves
  end
  def comments_perc
    100.0 * @comments / @@comments_total
  end
  def saves_perc
    100.0 * @saves / @@saves_total
  end
  def visits_perc
    100.0 * @visits / @@visits_total
  end
  def to_s
    sprintf "%s (%.0f%% saves, %.0f%% visits, %.0f%% comments)",
      @title, saves_perc, visits_perc, comments_perc
  end
  # Final score - percent of saves plus percent of visits
  include Comparable
  def <=>(other)
    (saves_perc + visits_perc) <=>
      (other.saves_perc + other.visits_perc)
  end
end

comments.each_line{|line|
  comments, title, url = line.chomp.split(/\t/)
  statistics[title] = Article.new(title)
  statistics[title].comments = comments.to_i
}
saves.each_line{|line|
  saves, title = line.chomp.split(/ - /, 2)
  next if title == "taw's blog"
  statistics[title].saves = saves.to_i
}
visits.each_line{|line|
  line.chomp!
  next if line == "" or line =~ /\A#/
  title, unique_views, *stuff = line.split(/\t/)
  if title =~ /\Ataw\'s blog: (.*)/ and statistics[$1]
    statistics[$1].visits = unique_views.to_i
  end
}

stats = statistics.values

puts "By saves:"
stats.sort_by{|post| -post.saves}[0,10].each{|post|
  puts "* #{post}"
}
puts ""
puts "By views:"
stats.sort_by{|post| -post.visits}[0,10].each{|post|
  puts "* #{post}"
}
puts ""
puts "By comments:"
stats.sort_by{|post| -post.comments}[0,10].each{|post|
  puts "* #{post}"
}
puts ""
puts "By total score:"
stats.sort.reverse[0,15].each{|post|
  puts "* #{post}"
}
puts ""
The final top 15 list is:
- The right to criticize programming languages (17% saves, 16% visits, 11% comments)
- How to code debuggers (16% saves, 5% visits, 2% comments)
- Atomic Coding with Subversion (12% saves, 6% visits, 3% comments)
- Modern x86 assembly (8% saves, 6% visits, 2% comments)
- Making Ruby faster (5% saves, 9% visits, 3% comments)
- Yannis's Law: Programmer Productivity Doubles Every 6 Years (4% saves, 7% visits, 4% comments)
- Prototype-based Ruby (6% saves, 4% visits, 1% comments)
- Segfaulting own programs for fun and profit (4% saves, 4% visits, 1% comments)
- magic/help for Ruby (4% saves, 2% visits, 2% comments)
- Fight to the death between Ruby and Python (3% saves, 2% visits, 5% comments)
- RLisp - Lisp naturally embedded in Ruby (3% saves, 2% visits, 1% comments)
- My first impressions of Erlang (2% saves, 2% visits, 4% comments)
- Big O analysis considered harmful (1% saves, 3% visits, 2% comments)
- The programming language of 2010 (2% saves, 2% visits, 7% comments)
- iPod-last.fm bridge (2% saves, 2% visits, 4% comments)
4 comments:
Been looking for 4 or 5 hours now all over the web for a script like this that will create a list of your site's most popular pages based on google analytics stats. Not a hardcore programmer, so can't quite follow what you're doing here, and doesn't look like it's packaged up in a widget or block of PHP code or something that I could deploy. So you're downloading a cookie of the Analytics data down, and using that as the basis of data? Could this be done using live stats info instead? Perhaps you could provide a more generic version with some commenting on using on your own site. It would be really useful, as many people run Analytics, few run databases which I assume is the way "Most Popular" sections are done on sites like YouTube, CNET, etc.
window: The data is available from Google Analytics in tab-separated format under URL like https://www.google.com/analytics/home/report?rid=1222880&user=&vid=1306&bd=20060101&ed=20071231&ns=10&ss=0&fd=&ft=2&sf=2&sb=1&dow=0&dt=3&dtc=2&dcomp=0&xd=1&x=1
Change rid from 1222880 (my blog's) to the one for your website to get the data; bd and ed are start and end dates in YYYYMMDD format. You must be logged in to Google Analytics to download it.
The entries are already sorted by popularity.
Was it helpful?
Hi taw,
This is a very useful hack, but it looks like it is described for serious programmers. For lesser mortals, it looks tough going. For example, there is no explanation as to which part of the template the script should go into, or whether, where we encounter url, we are supposed to leave it as it is or replace it with the url of our blog. Any chance of you elaborating?
curious: You're right, these scripts were written by a programmer, and while most readers of this blog seem to be programmers, others probably won't be able to use them on their own blogs.
If you're not a programmer and you want to get a list of most popular blog posts, the easiest way would be to get a Google Analytics account, put Google Analytics Javascript in your blog's template (it's explained by Google Analytics quite well), and then use Google Analytics GUI to get the data.