The best kittens, technology, and video games blog in the world.

Wednesday, October 04, 2006

Why Perl Is a Great Language for Concurrent Programming

Cable Beach sunset by marj k from flickr (CC-BY-NC)

If you asked people what Perl is good for, they would talk about things like web programming, data processing, system scripting, and duct-taping things together. But concurrency? Can Perl seriously beat languages designed with concurrency in mind, like Java (well, kinda) and Erlang? Surprisingly it can, and here's the story.

Wikipedia contains a lot of links to other websites - about one per article. It has no control whatsoever over the servers where those sites are hosted, and every day many of the links die. Some websites move somewhere else, others are removed completely, and some were incorrect in the first place due to typos and the like. Nobody can seriously expect people to check millions of links by hand every day - this is a job for a bot. And that's what tawbot does.

The first thing we need is to extract the list of external links from the Wikipedia database. There are basically two ways. The first is to get an XML dump of Wikipedia and parse the wiki markup to extract links. It's easy to get about half of the links this way, but it's pretty much impossible to get close to all of them. There are at least three formats of external links, and even worse - links can also be assembled from pieces by template hackery. For example, links to IMDb in a film infobox are not specified explicitly; only the IMDb code of the movie is in the article, and a template turns that code into a URL.

The alternative is to use the MediaWiki engine. Reparsing every single page would be really slow. That's not a new problem - every Wikipedia mirror needs to do the same thing. Fortunately it is already solved: Wikipedia dumps include SQL dumps of the tables that are technically derivable from the XML database but are a bit too expensive to recompute. One of them is the "externallinks" table (yay). Another is the "page" table with page metainformation (id, title, access restrictions, etc.).

The first problem is to extract data from the MySQL dumps somehow. I could run a MySQL server, but I hate using database servers for standalone applications. I would need to install a MySQL server, set up user accounts, and create a special database on every computer I wanted to run the script on - and then deal with usernames, passwords, and the rest of the ugly stuff. So I thought I'd use SQLite 3 instead. It was to be expected that the table creation syntax would be different - no two RDBMSs agree on it (SQL is as much of "a standard" as the Democratic People's Republic of Korea is of "a democracy"). That part was easy. But SQLite didn't even accept MySQL INSERTs!

Inserting multiple rows with a single INSERT statement is a MySQL extension of SQL (a pretty useful one):
INSERT INTO `externallinks` VALUES (a, b, c), (d, e, f), (g, h, i);
and SQLite doesn't support it. Come on - MySQL is extremely popular, so being compatible with some of its extensions would really increase SQLite's value as duct tape. Anyway, I had to write a pre-parser to split multi-row INSERTs into series of single-row inserts. It took some regexp hackery to get right, and SQLite started to accept the input. But it was so unbelievably slow that Perl 6 would have been released by the time SQLite finished importing the dump. I guess it was doing something as braindead as fsync'ing the database after every insert; I couldn't find anything about it in the man pages or the SQLite FAQ. But then I thought: the code that splits multi-row INSERTs into single-row inserts is already almost parsing the data, so what the heck do I need SQLite for? A few more regexps and a lot of time later (regexping hundreds of MBs is pretty slow, but still way faster than SQLite), I had the list of links to check extracted. So far no Perl and no concurrency - this part was unthreaded Ruby.
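The original pre-parser isn't shown, but the idea can be sketched in a few lines of Ruby (the language this phase was written in). This is a minimal, hypothetical version - the function name and the simplifying assumptions (one statement per string; values are numbers or single-quoted strings with backslash escapes) are mine, not the original code's:

```ruby
# Split a MySQL multi-row INSERT into single-row INSERTs.
# A sketch only: real dump parsing has to cope with more syntax.
def split_insert(stmt)
  m = stmt.match(/\AINSERT INTO (\S+) VALUES (.*);\s*\z/m) or return [stmt]
  table, tuples = m[1], m[2]
  # Match one (...) tuple; quoted strings may contain parens and commas.
  tuples.scan(/\(((?:'(?:[^'\\]|\\.)*'|[^()'])*)\)/).map do |(row)|
    "INSERT INTO #{table} VALUES (#{row});"
  end
end

puts split_insert("INSERT INTO `externallinks` VALUES (1,'a'),(2,'b,c');")
```

The key trick is matching quoted strings as a unit, so commas and parentheses inside values don't confuse the tuple splitter.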

Now let's get back to Perl. There were 386705 links to check, and some time ago I had written a Perl program for verifying links. It is so simple that I'm pasting almost all of it here (headers and the code to preload the cache omitted):

sub link_status {
    my ($ua, $link) = @_;
    if (defined $CACHE{$link}) {
        return $CACHE{$link};
    }
    my $request = new HTTP::Request('HEAD', $link);
    my $response = $ua->request($request);
    my $status = $response->code;
    $CACHE{$link} = $status;
    return $status;
}

my $ua = LWP::UserAgent->new();
while (<>) {
    /^(.*)\t(.*)$/ or die "Bad input format of line $.: `$_'";
    my ($title, $link) = ($1, $2);
    my $result = link_status($ua, $link);
    print "$result\t$title\t$link\n";
}
Basically it reads title/link pairs from stdin, runs an HTTP HEAD request on each URL, and prints the URL status on stdout. As the same URLs often appear in multiple articles, and we want to be able to stop the program and restart it later without remembering where it finished, a simple cache is used.

The program was getting the work done, but it was insanely slow. Most HTTP requests are very fast, so running those serially doesn't hurt much. But every now and then there is a dead server that does not return any answer and simply times out. When I came back the next day to check how the script was doing, I found it was spending most of its time waiting for dead servers. There is no reason to wait like that - it should be possible to run multiple HTTP connections in parallel. The program was in need of some sort of concurrency.

Concurrent programming in most languages is ugly, and people tend to avoid it unless they really have no other choice. Well, I had no other choice. The first thing that came to my mind was Erlang (really). The concurrency part should be simple enough, and it would be highly educational. But does it have something like LWP? It would really suck to implement cool concurrency and then have to work with some lame half-complete HTTP library. So I didn't even try. Ruby has only interpreter threads, so concurrent I/O is not going to work - and it would be an understatement to call its HTTP library incomplete. But wait a moment - Perl has real threads. They are pretty weird, sure, but I don't need anything fancy. And I can keep using LWP.

Perl (since 5.6.0) uses a threading model that is quite different from most other languages. Each thread runs as a separate virtual machine, and only data explicitly marked as shared is actually shared. It also has syntax-based locks and a few other unusual ideas. The real reason it was done this way was retrofitting threading onto a fundamentally unthreadable interpreter - that's the real rationale behind Perl OO too. Of course it sucks for some things, but for some problems it happens to work very nicely.
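To see what "separate virtual machine per thread" means in practice, here is a minimal sketch (my own example, not from the original program): ordinary variables are copied into each new thread, and only `: shared` data is visible to all of them.

```perl
use strict;
use warnings;
use threads;
use threads::shared;

my $shared : shared = 0;   # one variable, visible to every thread
my $private = 0;           # each thread gets its own copy of this

my @t = map {
    threads->new(sub {
        $private++;        # increments this thread's private copy only
        lock $shared;      # syntax-based lock, released at end of scope
        $shared++;
    });
} 1 .. 4;
$_->join for @t;

print "shared = $shared, private = $private\n";   # shared = 4, private = 0
```

The parent's `$private` is untouched because each thread incremented its own copy - exactly the isolation the post describes.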

The design I used was very simple - one master thread and a group of worker threads. The master keeps pushing tasks onto @work_todo; workers pop items from it and push results onto @work_done. When the master has nothing else to do, it sets $finished to true and waits for the workers to finish.

Before we spawn threads, we need to declare shared variables:
my @work_todo : shared = ();
my @work_done : shared = ();
my $finished : shared = 0;
Now let's spawn 20 workers (I originally had 5, but it wasn't enough):
my @t;
push @t, threads->new(\&worker) for 1..20;
Workers basically keep taking tasks from the todo list until the master says it's over:
sub worker {
    my $ua = LWP::UserAgent->new(); # Initialize LWP object
    while (1) {
        my $cmd;
        {
            lock @work_todo;
            $cmd = pop @work_todo; # get a command
        }
        if ($cmd) { # did we get any command?
            my ($link, $title) = @$cmd;
            # do the computations
            my $status = link_status($ua, $link);
            {
                # @done_item needs to be shared because threads are otherwise independent
                my @done_item : shared = ($status, $title, $link);
                lock @work_done;
                push @work_done, \@done_item;
            }
        } elsif ($finished) { # No command, is it over already?
            last; # Get out of the loop
        } else { # It's not over, wait for new commands from the master
            sleep 1;
        }
    }
}
The link checker got even simpler, as the cache logic moved to the master.
sub link_status {
    my ($ua, $link) = @_;
    my $request = new HTTP::Request('HEAD', $link);
    my $response = $ua->request($request);
    my $status = $response->code;
    return $status;
}
After spawning the threads, the master preloads the cache from disk (boring code). Then it loops. The first action is draining @work_done: the done items are saved to disk and entered into the cache. It is important that only one thread writes to a given file - under Unix, writes are not guaranteed to be atomic, so if one process does print("abc") and another does print("def"), it is possible to end up with adbefc. Unix is again guilty of being totally broken, inconsistent, and hard to use for the sake of a marginally simpler implementation.
while (1) {
    {
        lock @work_done;
        for (@work_done) {
            my ($result, $title, $link) = @$_;
            $CACHE{$link} = $result;
            print "$result\t$title\t$link\n";
        }
        @work_done = ();
    }
    if (@work_todo > 100) {
        sleep 1;
        next;
    }
    ...
    # More code
}
And finally the code to put new items on the todo list:
my $cmd = <>;
last unless defined $cmd;
$cmd =~ /^(.*)\t(.*)$/ or die "Bad input format of line $.: `$_'";
my ($link, $title) = ($1, $2);
next if defined $CACHE{$link};
{
    # Explicit sharing of @todo_item again
    my @todo_item : shared = ($link, $title);
    lock @work_todo;
    push @work_todo, \@todo_item;
}
And the code to wait for the threads after we have nothing more to do:
$finished = 1;
$_->join for @t;
Now, what's so great about Perl concurrency?
  • It works
  • Access to all Perl libraries
  • By getting rid of really stupid "shared everything by default" paradigm, it avoids a lot of errors that plague concurrent programs
The most important are the first two points; the third is just a small bonus. Perl's concurrency works a lot better than Ruby's or Python's (at least last time I checked). We get access to all the libraries that Erlang will never get its hands on. And while "shared-nothing pure message passing" may be more convenient, explicit sharing and locking-by-scope are at least better than the "shared everything" model. And we get basic tools like Thread::Queue (which I should have used instead of managing the lists by hand, whatever).
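Since the post mentions Thread::Queue as the tool it should have used, here is a hedged sketch of the same master/worker design rebuilt around it. The length-of-URL "status" is a stand-in for the real link_status() HTTP check, and the URLs are made up:

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $todo = Thread::Queue->new;
my $done = Thread::Queue->new;

my @workers = map {
    threads->new(sub {
        # dequeue blocks until an item is available; undef means "no more work"
        while (defined(my $link = $todo->dequeue)) {
            my $status = length $link;   # stand-in for the real HTTP check
            $done->enqueue("$status\t$link");
        }
    });
} 1 .. 3;

$todo->enqueue($_) for qw(http://example.org/a http://example.org/b);
$todo->enqueue(undef) for @workers;   # one termination marker per worker
$_->join for @workers;

my @results;
while (defined(my $r = $done->dequeue_nb)) {
    push @results, $r;
}
print "$_\n" for @results;
```

Thread::Queue handles the sharing and locking internally, so the hand-rolled `lock @work_todo` blocks, the explicit `: shared` items, and the `sleep 1` polling all disappear.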

Could such a link checker be written in Erlang as quickly? I don't think so. In most programs that need concurrency, or would benefit from it, the "concurrency" part is a very small part of the program. It's usually far better to have lame concurrency support and a good rest-of-the-language than an absolutely awesome concurrency part and a lame rest-of-the-language. Every now and then you simply have to write a massively concurrent program (like a phone switch), and a special language can be really great there. For everything else, there are "business as usual, with concurrency bolted on" languages like Perl and Java.

28 comments:

Anonymous said...

Erlang has an HTTP client (and server!) built into the standard distribution, or you can opt for something like ibrowse.

taw said...

How good is Erlang's HTTP support? Could you write code for checking a single link? It's only 3 lines in Perl, so it shouldn't be hard.

Anonymous said...

how about 1 line?

{ok, {{_Version, 200, _Reason}, _Headers, _Body}} = http:request("http://google.com/").

taw said...

That's not exactly what it's supposed to be doing. This is pretty close (I think):

{ok, {{_, Code, _}, _, _}} = http:request(head, {"http://google.com/",[]}, [], []).

Anyway, what does "Normaly the application using this API should have started inets application. This is also true for the ssl application when using https." mean? Are all requests going through a single "inets" process?

Anonymous said...

Pretty cool code. Not yet convinced it beats Erlang, which can happily run a hundred thousand threads at once and shares nothing at all.

Admittedly it won't have CPAN anytime soon, but OTP has some pretty slick stuff that ain't necessarily in CPAN either, and some people are working hard on adding interesting libraries (see yarivsblog.com).

Jason said...

It means you should really call Erlang with -run inets, or inets:start(), or application:start(inets), to start the inets application. You can use the client in sync or async mode.

Here is a quick example that handles 20 requests at a time like your example, but I'm sure more experienced Erlangers could make it much cleaner:

-module(checker).
-compile(export_all).

start() ->
    Queue = read_urls(),
    master_loop(Queue, []).

read_urls() ->
    {ok, Contents} = file:read_file("urls.txt"),
    string:tokens(binary_to_list(Contents), "\n").

master_loop([], []) ->
    done;
master_loop(Queue, []) ->
    {NewQ, Pending} = enqueue(20, Queue),
    master_loop(NewQ, Pending);
master_loop(Queue, Pending) ->
    receive
        {http, {RequestId, Result}} ->
            {value, {_R, Url}} = lists:keysearch(RequestId, 1, Pending),
            {{_, Status, _}, _, _} = Result,
            io:format("~s ~b~n", [Url, Status]),
            {NewQ, NewIds} = enqueue(1, Queue),
            Merged = [ {Id, U} || {Id, U} <- Pending, Id /= RequestId ] ++ NewIds,
            master_loop(NewQ, Merged);
        M ->
            throw(M)
    end.

enqueue(_Num, []) ->
    {[], []};
enqueue(0, Q) ->
    {Q, []};
enqueue(Num, [H|T]) ->
    {ok, RequestId} = http:request(head, {H, []}, [{autoredirect, false}], [{sync, false}]),
    {NewQ, Ids} = enqueue(Num - 1, T),
    {NewQ, [{RequestId, H}] ++ Ids}.

taw said...

Anonymous: There's concurrent programming and there's concurrent programming. Perl is good at "mildly concurrent, with real need for some library from CPAN" kind of concurrent programming. Erlang is good at "massively concurrent, with no need for fancy libraries" kind of concurrent programming (and Perl isn't even pretending to try here). The former is much more common than the latter - so Perl is a language of choice for a lot more concurrent programming tasks than Erlang.

Jason: Your program kills my machine. I guess it's simply reading the whole urls file into memory at once (the Perl program reads it line by line), then converting it to a linked list of integers (memory usage x8), and then tokenizing it (memory usage x2).

Even if we converted the program to read lines one at a time, the x8 memory explosion for storing strings troubles me a lot. We cannot store the result cache in memory any more, as it would swap massively (the cache in the Perl program is pretty big already, and I just don't have 8x as much memory here). Does Erlang have some sort of packed string object?

Jason said...

Like I said earlier, it was just a quick proof of concept.

Here's one that doesn't slurp it all into memory, and the result cache isn't needed if you prefilter the URL list with sort | uniq... If you really wanted it, you could add a third state variable to the loop or throw it in an ets table, and to reduce the memory used by strings you can pack them into a binary with <<>>, which brings it back down to 1 byte per char:



-module(checker).
-compile(export_all).

start() ->
    {ok, Fd} = file:open("urls.txt", [read]),
    Pending = enqueue(20, Fd),
    master_loop(Fd, Pending),
    file:close(Fd).

master_loop(_, []) ->
    done;
master_loop(Fd, Pending) ->
    receive
        {http, {RequestId, Result}} ->
            {value, {_R, Url}} = lists:keysearch(RequestId, 1, Pending),
            {{_, Status, _}, _, _} = Result,
            io:format("~s ~b~n", [Url, Status]),
            NewIds = enqueue(1, Fd),
            Merged = [ {Id, U} || {Id, U} <- Pending, Id /= RequestId ] ++ NewIds,
            master_loop(Fd, Merged);
        M ->
            throw(M)
    end.

enqueue(0, _) ->
    [];
enqueue(Num, Fd) ->
    case io:get_line(Fd, "") of
        eof ->
            [];
        Line ->
            Url = string:strip(Line, both, $\n),
            {ok, RequestId} = http:request(head, {Url, []}, [{autoredirect, false}], [{sync, false}]),
            Ids = enqueue(Num - 1, Fd),
            [{RequestId, Url} | Ids]
    end.

Quickshot said...

Just a small question, but does it work under Perl 6 as well? Or does that seriously change the problem?

taw said...

Quickshot: I don't think threads in Perl 6 are even implemented. Or LWP. Perl 6 is pre-alpha software right now.

Anonymous said...

#!/usr/bin/env ruby

require 'open-uri'
require 'thread'

def http_status l
  begin
    open(l) do |f|
      f.status.join(' ')
    end
  rescue OpenURI::HTTPError => e
    e.message
  rescue => e
    "ERROR -- #{e.message}"
  end
end

def deq1 q
  thr, l, m = q.pop
  thr.join
  puts "#{m}: #{l}"
end

q = Queue.new
nthr = 0

ARGF.each do |l|
  next unless l =~ /^(http:\/\/.*)$/
  Thread.new(l) do |l|
    q << [Thread.current, l, http_status(l)]
  end
  if nthr >= 20
    deq1 q
  else
    nthr += 1
  end
end
nthr.times { deq1 q }

Anonymous said...

Saying "Perl is great at ..." is the same as saying "I don't know much about ...."

ketvin said...

Your code is very cool; it gives me a lot of inspiration.

Question: I don't really understand how you call your threads concurrently.

For example, if I have a few threads, each doing a different thing, how do I make them run concurrently?

eg.

my $thr0 = threads->create('one');
my $result0 = $thr0->join();

my $thr1 = threads->create('two');
my $result1 = $thr1->join();

my $thr2 = threads->create('three');
my $result2 = $thr2->join();

and I have different subs one, two, and three; the way I call join() will just make them run in serial.

Just asking for opinion

taw said...

ketvin: You need to create the threads first, and only join them later:

my $thr0 = threads->create('one');

# main + $thr0 are running now

my $thr1 = threads->create('two');

# main + $thr0 + $thr1 are running now

my $thr2 = threads->create('three');

# main + $thr0 + $thr1 + $thr2 are running now

my $result0 = $thr0->join();

my $result1 = $thr1->join();

my $result2 = $thr2->join();

ketvin said...

it works amazingly, thx so much!!

Bagfields said...
This comment has been removed by the author.
Bagfields said...

Hi Taw,

I read your blog with great interest, as I have failed to get Perl threads running and had started to look at Erlang as an alternative.

I have experienced the same problem with slow SQLite updating; on that occasion I was using Class::DBI::SQLite. The solution to increase speed when creating a database was to turn off "autoupdate":
"FFF::group::member->autoupdate(0);"

When I had finished creating all the records, I turned autoupdate back on and everything was saved as expected: "...->autoupdate(1)"

With that small change in my code SQLite worked like a charm.

Now I will get Perl threads running and shelve the idea of Erlang, as there is not even an SQLite binding for Erlang.

Thanks

Stefan

Bagfields said...

For everyone else who wants to use threads for "concurrent programming", the following two lines are mandatory:

use threads;
use threads::shared;

Jon said...

Hi

I came across your blog and believe it might have a bearing on a problem I'm having. Basically I have an output table produced by a Perl script that contains the result of a formula in each cell - a correlation, in fact. However, each cell is produced sequentially and this takes a lot of time - especially when hundreds of correlations are run. And the mathematical procedure required for each cell is identical. A method of producing the result for each cell in parallel would be much faster, I believe. I'm not a programmer myself, so any thoughts on how to go about this - or find someone who could help with the problem - would be appreciated.

Thnx

Jon

taw said...

Jon: Threads and other kinds of parallelism are only useful for latency-bound problems, not for throughput-bound problems.

In latency-bound problems the program spends a lot of time waiting and doing nothing - for example, it asks one website, waits, then asks another, waits, and so on. Doing such things in parallel can make the program much faster, because many things can be done during each wait, and it doesn't cost anything extra to wait for more than one thing at the same time.

In throughput-bound problems the program already works as fast as it can and simply cannot do things any faster. Making such programs parallel won't reduce the total running time at all, and may even make it slower.

Most problems are somewhere in between. Yours sounds more like a throughput-bound problem, so I don't think parallelism is going to help you much.

Jon said...

I think that is probably a useful way to look at it, and no doubt I have a throughput problem. The question is how to solve it.

The script can work out one correlation very quickly. But to do 2 correlations, the script has to be used twice; for 3 correlations, 3 times; and so on. So how do you make all this happen in parallel? Would one solution perhaps be to make the script (or the necessary component) reproduce itself? So that if you need 3 correlations done, you'd have 3 scripts.

Would that be a possibility?

taw said...

Jon: How do you run the script to produce 3 correlations? It's probably easy to automate.

Jon said...

You mean how are the correlations done? I have a script functioning for that. You can see it on: http://psychonomics.com/stockscan2.html

It's not fully functional - but if you use any sector with an x at the beginning of the symbol, e.g. XBD or XAL, and choose a low enough threshold and the periods you want, you should get a table out with correlations. Datafeed must be set to yahoo.

As you'll see, it's very slow - precisely because it's a sequential process. And we always know in advance how many correlations it has to do - the script tells you on the output page. So theoretically, if it has to do 66 pairwise correlations, we would need 66 parallel processes... I guess.

Jon said...

I wondered if you'd had any more thoughts about this?

taw said...

Jon: Not really, sorry.

Anonymous said...

Surprised nobody mentioned POE. It's an event-driven multitasking environment for Perl. There's an example of a parallel link-checker in POE here: http://www.stonehenge.com/merlyn/LinuxMag/col41.html

Terrence Brannon said...

AnyEvent::HTTP is the most promising thing I have found, having looked high and low and posted here for help:

http://perlmonks.org/?node_id=758739