The best kittens, technology, and video games blog in the world.

Wednesday, June 06, 2007

Extracting backlinks to all posts in the blog

Be careful honey.... by JennyHuang from flickr (CC-BY)
Blogger software gathers information on backlinks to blog posts, but the only way of accessing it seems to be looking at individual post pages. I was kinda interested in who's linking to my blog - that's a good way of finding other cool blogs. Of course I wasn't going to look at 160 pages by hand. The task looked trivial - just download the 160 pages, hpricot the links out of them, and put them in a single HTML document. Unfortunately something was missing from the HTML sources:

<span class='post-backlinks post-comment-link'></span>


I guess that's to make the links invisible to search engines and to discourage spam. Unfortunately it also makes them invisible to all kinds of useful bots - Web 2.0 feels so wrong sometimes. Without FireBug I'd probably have gone "oh, screw that"; fortunately, thanks to FireBug, I was able to find out what was going on quite fast - JavaScript was POSTing a request, and the results executed themselves:
try {
  _WidgetManager._HandleControllerResult('Blog1', 'backlinks', {'numBacklinks': 1, 'backlinksLabel': 'Links to this post', 'authorLabel': 'Posted by', 'timestampLabel': 'at', 'createLinkLabel': 'Create a Link', 'createLinkUrl': '', 'backlinks': [{'url': '', 'author': '', 'title': 'taw\'s blog: How to test glue code ?', 'snippet': 'taw\'s blog: How to test glue code ? Indeed. How? ', 'timestamp': '2:58 AM', 'deleteUrl': '\u003d27488238&postID\u003d7234750665665792269&backlinkURL\', 'adminClass': 'pid-695331528'}]});
} catch (e) {
  if (typeof log != 'undefined') {
    log('HandleControllerResult failed: ' + e);
  }
}
That's bad, as the code is full-blown JavaScript, not JSON. Fortunately it was possible to convert it to something JSON-enough with a few regular expressions, and there were no obstacles further on. Here's the code:
require 'rubygems'
require 'hpricot'
require 'json'

def wget(url, file=url.match(/([^\/]*)\Z/)[0])
  # Download the page unless it's already cached locally, then return its contents
  system "wget", "--no-verbose", url, "-O", file unless File.exists?(file)
  File.read(file)
end

post_pages = []

doc = Hpricot(wget("http://t-a-w.blogspot.com/", "blog-main"))
(doc/"a").each{|archive_link|
  url = archive_link.attributes["href"]
  next unless url =~ %r[\Ahttp://t-a-w\.blogspot\.com/[0-9_]*_archive\.html\Z]
  archive_doc = Hpricot(wget(url))
  (archive_doc/".post").each{|post|
    post_link = post.at("h3 a")
    link = post_link.attributes["href"]
    title = post_link.inner_text
    post_id = post.at("a").attributes["name"]
    post_pages << [title, link, post_id]
  }
}

puts "<html><head><meta http-equiv='Content-Type' content='text/html; charset=UTF-8' /><title>Backlinks</title></head><body><ul>"
post_pages.each{|title, url, post_id|
  unless File.exists?("backlinks-#{post_id}.js")
    # post_data - the POST body captured with FireBug - is elided here
    system "wget", url, "-O", "backlinks-#{post_id}.js", "--post-data", post_data
  end
  data = File.read("backlinks-#{post_id}.js")
  # The second line of the reply holds the _HandleControllerResult call
  data =~ /\n(.*)/
  data = $1.sub(/\A_WidgetManager\._HandleControllerResult\(/, "[").sub(/\);/,"]").gsub(/\'/,'"')
  data = JSON.parse(data)[2]["backlinks"]
  if data
    # Self links are not very interesting
    data = data.select{|backlink| backlink['url'] !~ %r[\Ahttp://t-a-w\.blogspot\.com/] }
    next if data.empty?
    puts "<li><a href='#{url}'>#{title}</a><ul>"
    data.each{|backlink|
      puts "<li><a href='#{backlink['url']}'>#{backlink['title']}</a> - #{backlink['snippet']}</li>"
    }
    puts "</ul></li>"
  end
}
puts "</ul></body></html>"
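
The JavaScript-to-JSON conversion step can be sketched in isolation. This is a minimal sketch, assuming the payload contains no apostrophes or escaped quotes (the full script above doesn't make that assumption); `js_to_json` and the sample payload are made-up names for illustration:

```ruby
require 'json'

# Turn Blogger's JavaScript callback into a JSON array:
# _WidgetManager._HandleControllerResult(a, b, {...});  =>  [a, b, {...}]
def js_to_json(line)
  line.sub(/\A_WidgetManager\._HandleControllerResult\(/, "[")
      .sub(/\);\z/, "]")
      .gsub("'", '"')
end

sample = "_WidgetManager._HandleControllerResult('Blog1', 'backlinks', " +
         "{'numBacklinks': 1, 'backlinks': [{'url': 'http://example.com/', " +
         "'title': 'Example post'}]});"

backlinks = JSON.parse(js_to_json(sample))[2]["backlinks"]
puts backlinks.first["url"]   # prints "http://example.com/"
```

The naive `gsub` of quotes is exactly why the approach is only "JSON enough" - a quote inside a title or snippet would break it, which is fine for a one-off script.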


taw said...

Divide: Wow, it's great. It needs some heavy filtering, but that should be easy as it's plain CSV.