If you asked people what Perl is good for, they would talk about things like web programming, data processing, system scripting and duct-taping things together. But concurrency? Can Perl seriously beat languages designed with concurrency in mind, like Java (well, kinda) and Erlang? Surprisingly it can, and here's the story.
Wikipedia contains a lot of links to other websites - about one per article. It has no control whatsoever over the servers where they are hosted, and every day many of the links die. Some websites move somewhere else, others are removed completely, and some were incorrect in the first place due to typos and the like. Nobody can seriously expect people to check millions of links by hand every day - this is something a bot should do. And that's what tawbot does.
The first thing we need is to extract the list of external links from the Wikipedia database. There are basically two ways. The first is to get the XML dump of Wikipedia and parse the wiki markup to extract links. It's easy to get about half of the links this way, but it's pretty much impossible to get close to all of them. There are at least three formats of external links, and even worse - links can also be assembled from pieces by template hackery. Links to IMDb in the film infobox, for example, are not specified explicitly: only the IMDb code of a movie is in the article, and the template turns the code into a URL.
The alternative is to use the MediaWiki engine. Reparsing every single page would be really slow. That's not a new problem - every Wikipedia mirror needs to do the same thing. Fortunately it's already solved: Wikipedia dumps include SQL dumps of the tables that are technically derivable from the XML database, but a bit too expensive to recompute. One of them is the "externallinks" table (yay). Another is the "page" table with page metainformation (id, title, access restrictions and so on).
The first problem is to extract data from the MySQL dumps somehow. I could run a MySQL server, but I hate using database servers for standalone applications. I would need to install a MySQL server, set up user accounts, and create a special database on every computer I wanted to run the script on - and then deal with usernames, passwords and the rest of the ugly stuff. So I thought I'd use Sqlite 3 instead. It was to be expected that the table creation syntax would differ - it isn't the same in any two RDBMSs (SQL is as much of "a standard" as the Democratic People's Republic of Korea is of "a democracy"). That part was easy. But Sqlite didn't even accept MySQL INSERTs!
Inserting multiple rows with a single INSERT statement is a MySQL extension of SQL (a pretty useful one):
INSERT INTO `externallinks` VALUES (a, b, c), (d, e, f), (g, h, i);
and Sqlite doesn't support it. Now come on, MySQL is extremely popular, so being compatible with some of its extensions would really increase Sqlite's value as duct tape. Anyway, I had to write a pre-parser to split multi-record INSERTs into series of single INSERTs. It took some regexp hackery to get right, and Sqlite started to accept the input. But it was so unbelievably slow that Perl 6 would have been released by the time Sqlite finished importing the dump. I guess it was doing something as braindead as fsync'ing the database after every insert; I couldn't find anything about it in the man pages or the Sqlite FAQ.

But then I thought - the code that splits multi-value INSERTs into single-valued ones is already almost parsing the data, so what the heck do I need Sqlite for? A few more regexps, a lot of time (regexping hundreds of MBs is pretty slow, but still way faster than Sqlite), and I had the list of links to check extracted. So far I hadn't used any Perl or concurrency - it was all unthreaded Ruby.
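A minimal sketch of that splitting step (the original was Ruby; this is the same idea in Perl, simplified to assume no value contains an unescaped closing parenthesis - real dumps do, which is exactly where the regexp hackery comes in):

# Split MySQL multi-row INSERTs into single-row INSERTs, one per line.
while (my $line = <>) {
    next unless $line =~ /^INSERT INTO `(\w+)` VALUES (.*);\s*$/;
    my ($table, $values) = ($1, $2);
    # Naive tuple match - breaks on ')' inside quoted strings (e.g. in URLs)
    for my $tuple ($values =~ /\(([^)]*)\)/g) {
        print "INSERT INTO `$table` VALUES ($tuple);\n";
    }
}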
Now let's get back to Perl. There were 386705 links to check, and some time ago I had written a Perl program for verifying links. It is so simple that I'm pasting it almost whole here (headers and the code to preload the cache omitted):
sub link_status {
    my ($ua, $link) = @_;
    # Return the cached status if we already checked this URL
    if (defined $CACHE{$link}) {
        return $CACHE{$link};
    }
    my $request = HTTP::Request->new('HEAD', $link);
    my $response = $ua->request($request);
    my $status = $response->code;
    $CACHE{$link} = $status;
    return $status;
}
my $ua = LWP::UserAgent->new();
while (<>) {
    /^(.*)\t(.*)$/ or die "Bad input format of line $.: `$_'";
    my ($title, $link) = ($1, $2);
    my $result = link_status($ua, $link);
    print "$result\t$title\t$link\n";
}
Basically it reads title/link pairs from stdin, runs an HTTP HEAD request on each URL, and prints the URL's status on stdout. As the same URLs often appear in multiple articles, and we want to be able to stop the program and restart it later without remembering where it finished, a simple cache is used.
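The omitted preloading code just rebuilds %CACHE before the main loop starts. A minimal sketch of the idea, assuming the cache is rebuilt from a previous run's output (the file name results.tsv is hypothetical):

# Hypothetical cache preload: replay "status\ttitle\tlink" lines from an
# earlier run into %CACHE so already-checked links are skipped.
if (open my $fh, '<', 'results.tsv') {
    while (<$fh>) {
        chomp;
        my ($status, $title, $link) = split /\t/, $_, 3;
        $CACHE{$link} = $status;
    }
    close $fh;
}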
The program was getting the work done, but it was insanely slow. Most HTTP requests are fast, so it doesn't hurt much to run them serially. But every now and then there is a dead server that doesn't return any answer and simply times out. When I came back the next day to check how the script was doing, I found it was spending most of its time waiting for dead servers. There is no reason to wait like that - it should be possible to run multiple HTTP connections in parallel. Basically, it was in need of some sort of concurrency.
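A partial mitigation (not what I did, just an illustration): LWP::UserAgent gives up on a connection only after 180 seconds by default, so shortening the timeout makes each dead server hurt less:

my $ua = LWP::UserAgent->new(timeout => 10); # give up after 10 seconds instead of 180

But the requests still run one at a time, so the real answer is parallelism.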
Concurrent programming in most languages is ugly, and people tend to avoid it unless they really have no other choice. Well, I had no other choice. The first thing that came to my mind was Erlang (really). The concurrency part should be simple enough, and it would be highly educational. But does it have something like LWP? It would really suck to implement cool concurrency and then have to work with some lame half-complete HTTP library. So I didn't even try. Ruby has only interpreter threads, so concurrent I/O is not going to work - and it would be an understatement to call its HTTP library incomplete. But wait a moment - Perl has real threads. They are pretty weird, sure, but I don't need anything fancy. And I can keep using LWP.
Perl (since 5.6.0) uses a threading model that is quite different from most other languages'. Each thread runs as a separate virtual machine, and only data explicitly marked as shared is shared. It also has syntax-based locks and a few other unusual ideas. The real reason it was done this way was retrofitting threading onto a fundamentally unthreadable interpreter. That's the real rationale behind Perl OO too. Of course it sucks for some things, but for some problems it happens to work very nicely.
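A tiny illustration of that model (not from the original program) - unshared variables are copied into each new thread, so changes made there are invisible to everyone else:

use threads;
use threads::shared;

my $private = 0;           # each spawned thread gets its own copy
my $shared : shared = 0;   # one copy, visible to all threads

threads->new(sub { $private++; $shared++ })->join for 1 .. 5;
print "$private $shared\n";   # prints "0 5" - the private increments are lost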
The design I used was very simple - one master thread and a group of worker threads. The master keeps pushing tasks onto @work_todo, and workers pop items from it and push results onto @work_done. When the master doesn't have anything else to do, it sets $finished to true and waits for the workers to finish.
Before we spawn threads, we need to declare shared variables:
my @work_todo : shared = ();
my @work_done : shared = ();
my $finished : shared = 0;
Now let's spawn 20 workers (I originally had 5, but it wasn't enough):
my @t;
push @t, threads->new(\&worker) for 1..20;
Workers basically keep taking tasks from the todo list until the master says it's over:
sub worker {
    my $ua = LWP::UserAgent->new(); # Initialize LWP object
    while (1) {
        my $cmd;
        {
            lock @work_todo;
            $cmd = pop @work_todo; # get a command
        }
        if ($cmd) { # did we get any command?
            my ($link, $title) = @$cmd;
            # do the computations
            my $status = link_status($ua, $link);
            {
                # @done_item needs to be shared because threads are otherwise independent
                my @done_item : shared = ($status, $title, $link);
                lock @work_done;
                push @work_done, \@done_item;
            }
        } elsif ($finished) { # No command, is it over already?
            last; # Get out of the loop
        } else { # It's not over, wait for new commands from the master
            sleep 1;
        }
    }
}
The link checker got even simpler, as the cache logic moved to the master:
sub link_status {
    my ($ua, $link) = @_;
    my $request = HTTP::Request->new('HEAD', $link);
    my $response = $ua->request($request);
    my $status = $response->code;
    return $status;
}
After spawning the threads, the master preloads the cache from disk (boring code). Then it loops. The first action is draining @work_done: the done items get saved to disk and entered into the cache. It is important that only one thread writes to a given file - under Unix, writes are not atomic, so if one process does print("abc") and another does print("def"), it is possible to end up with adbefc. Unix is again guilty of being totally broken, inconsistent and hard to use, for the sake of a marginally simpler implementation.
while (1) {
    {
        lock @work_done;
        for (@work_done) {
            my ($result, $title, $link) = @$_;
            $CACHE{$link} = $result;
            print "$result\t$title\t$link\n";
        }
        @work_done = ();
    }
    # If the workers already have plenty to do, don't bother adding more yet
    if (@work_todo > 100) {
        sleep 1;
        next;
    }
    ...
    # More code
}
And finally the code to put new items on the todo list:
my $cmd = <>;
last unless defined $cmd;
$cmd =~ /^(.*)\t(.*)$/ or die "Bad input format of line $.: `$cmd'";
my ($link, $title) = ($1, $2);
next if defined $CACHE{$link};
{
    # Explicit sharing of @todo_item again
    my @todo_item : shared = ($link, $title);
    lock @work_todo;
    push @work_todo, \@todo_item;
}
And the code to wait for the threads after we have nothing more to do:
$finished = 1;
$_->join for @t;
Now what's so great about Perl concurrency?
- It works
- Access to all Perl libraries
- By getting rid of the really stupid "share everything by default" paradigm, it avoids a lot of the errors that plague concurrent programs
The most important are the first two points; the third is just a small bonus. Perl concurrency works a lot better than Ruby's or Python's (at least last time I checked). We get access to all the libraries that Erlang will never ever get its hands on. And while "shared-nothing pure message passing" may be more convenient, explicit sharing and locking-by-scope are at least better than the "shared everything" model. And we get the basic stuff like Thread::Queue (which I should have used instead of managing lists by hand, whatever).
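For the record, a minimal sketch of what a Thread::Queue version could look like (not the code I ran; link checking is stubbed out, and items are tab-joined scalars to sidestep reference sharing):

use threads;
use Thread::Queue;

my $todo = Thread::Queue->new();
my $done = Thread::Queue->new();

sub worker {
    # dequeue() blocks until an item arrives - no lock/sleep polling needed
    while (defined(my $cmd = $todo->dequeue())) {
        my ($title, $link) = split /\t/, $cmd, 2;
        my $status = 200; # link_status($ua, $link) would go here
        $done->enqueue("$status\t$title\t$link");
    }
}

my @t = map { threads->new(\&worker) } 1 .. 20;
$todo->enqueue("Example article\thttp://www.example.com/");
$todo->enqueue(undef) for @t; # one undef per worker means "no more work"
$_->join for @t;
print $done->dequeue_nb() . "\n" while $done->pending();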
Could such a link checker be written in Erlang as quickly? I don't think so. In most programs that need concurrency, or would benefit from it, the "concurrency" part is a very small part of the program. It's usually far better to have lame concurrency support and a good rest-of-the-language than an absolutely awesome concurrency part and a lame rest-of-the-language. Every now and then you simply have to write a massively concurrent program (like a phone switch), and then a special language can be really great. For everything else, there are "business as usual, with concurrency bolted on" languages like Perl and Java.