I hate waiting. And all too often when I tell the computer to do stuff, the only reply I get is a progress bar with an ETA far in the future.
Fortunately plenty of such things are highly parallelizable. For this post let's focus on the simple task of downloading every single Questionable Content strip ever released, because QC has really simple URLs. It's really simple to do serially, but the waiting time of about 23 minutes is killing me:
(1..1665).each{|i|
  system "wget", "-q", "http://www.questionablecontent.net/comics/#{i}.png"
}
Notice how I didn't even bother with shell, and went straight for a real programming language. Anyway, Ruby has a very easy threading system - just use Thread.new{ ... } to spawn, Thread#join to join, avoid global mutable state, and you're done.
We can get it running in no time:
module Enumerable
  def in_parallel
    map{|x| Thread.new{ yield(x) } }.each{|t| t.join}
  end
end
(1..10).in_parallel{|i|
  system "wget", "-q", "http://www.questionablecontent.net/comics/#{i}.png"
}
The only things that changed were the addition of a one-liner helper method and the replacement of Enumerable#each with Enumerable#in_parallel - so far really good.
On the other hand, we don't want to spawn 1665 threads all at once - neither our network connection nor QC's servers would appreciate that. Let's write some code which creates 10 threads and keeps them all busy instead.
Exception handling interlude
A very common problem with Ruby multi-threaded programming is that we don't want the program (or even a thread) to die just because of some error - usually we want to capture, print, and otherwise ignore exceptions. Here Ruby fails hard - Exception has no method to generate the user-friendly message of the kind Ruby prints when an exception falls out of the main loop. There are some gems that handle that, but here I'm rather unconcerned with this - instead I'll just use a really simple catch-print-ignore method, which does not even bother printing the backtrace.

def Exception.ignoring_exceptions
  begin
    yield
  rescue Exception => e
    STDERR.puts e.message
  end
end
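If you do want the backtrace, a minimal variant might look like this - the output format below only approximates Ruby's default error report, it's not an exact reproduction:

def Exception.ignoring_exceptions_with_backtrace
  yield
rescue Exception => e
  # Print message, class, and backtrace, roughly like the top-level handler does
  STDERR.puts "#{e.message} (#{e.class})"
  e.backtrace.each{|line| STDERR.puts "\tfrom #{line}"}
end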
Execute N tasks in parallel
How do we communicate with N threads? A simple and wrong way would be to create N queues and feed them in a round-robin manner (sketched below). Unfortunately, unless tasks always take nearly the same amount of time (which is almost never even close to true), this results in a lot of threads waiting aimlessly.
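Here's roughly what that wrong version would look like, just to show the shape of the problem (download_comic is a hypothetical stand-in for the actual work):

require 'thread'

queues = (1..10).map{ Queue.new }
threads = queues.map{|q|
  Thread.new{
    while x = q.deq
      download_comic(x) # hypothetical work method
    end
  }
}
# Round-robin feeding: if one task is slow, its queue backs up
# while the other threads run dry
(1..1665).each_with_index{|x, i| queues[i % queues.size] << x}
queues.each{|q| q << nil}
threads.each{|t| t.join}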
We need exactly one todo queue. Now it would be better if this queue had a size limit - but SizedQueue was giving me trouble, so I did it with the plainest possible Queue (a bounded sketch appears at the end of this section).
The main thread is really simple:
require 'thread'
module Enumerable
  def in_parallel_n(n)
    todo = Queue.new       # Create queue
    ts = (1..n).map{       # Start threads
      Thread.new{
        # Do stuff in threads
      }
    }
    each{|x| todo << x}    # Push things into queue
    ts.each{|t| t.join}    # Wait until threads finish
  end
end
Now the main question is how we tell the threads there are no more tasks. Unfortunately queues have only two states:
- Queue not empty - dequeue one element, do work, go back for more
- Queue empty - sleep and wait for the queue
What we really need is three states:
- Queue not empty - dequeue one element, do work, go back for more
- Queue empty, producer not finished - sleep and wait for the queue
- Queue empty, producer finished - finish the thread
The trick is to push one nil per thread as an end-of-work marker. And because I don't want the program to break when the object we're iterating over contains legitimate nils, I'm passing whatever we get in a one-element array (Ruby probably has an equivalent of OCaml's 'a option / Haskell's Maybe monad etc. - but for this the simplest solution will work just fine).
require 'thread'
module Enumerable
  def in_parallel_n(n)
    todo = Queue.new
    ts = (1..n).map{
      Thread.new{
        # nil marks end of work; real elements arrive wrapped in arrays
        while x = todo.deq
          Exception.ignoring_exceptions{ yield(x[0]) }
        end
      }
    }
    each{|x| todo << [x]}  # wrap each element so legitimate nils survive
    n.times{ todo << nil } # one end-of-work marker per thread
    ts.each{|t| t.join}
  end
end
And now we can happily get the comics. It took 4 minutes 11 seconds instead of 23 minutes - about a 5.5x speedup!
(1..1665).in_parallel_n(10){|i|
  system "wget", "-q", "http://www.questionablecontent.net/comics/#{i}.png"
}
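By the way, here's roughly what the bounded variant mentioned earlier might look like, assuming SizedQueue cooperates (it was giving me trouble) - the producer blocks once 2*n items are queued, so it can't run arbitrarily far ahead of the workers:

require 'thread'
module Enumerable
  def in_parallel_n_bounded(n)
    todo = SizedQueue.new(2 * n) # << blocks when the queue is full
    ts = (1..n).map{
      Thread.new{
        while x = todo.deq
          Exception.ignoring_exceptions{ yield(x[0]) }
        end
      }
    }
    each{|x| todo << [x]}
    n.times{ todo << nil }
    ts.each{|t| t.join}
  end
end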
What about other comics?
Now unfortunately far too many comics don't have pretty URLs like that. You can always (well, nearly always) resort to wget --mirror or find a Greasemonkey script which makes the comic's website bearable. Fortunately quite often we can handle even somewhat more complicated URL schemes.
First, for comicses like Sinfest, which use dates in URLs, it's only marginally more difficult:
require 'date'
(Date.parse('2000-01-17')..Date.parse('2010-05-17')).in_parallel_n(10){|d|
  system "wget", "-q", d.strftime("http://www.sinfest.net/comikaze/comics/%Y-%m-%d.gif")
}
3536 files, 7 minutes 52 seconds.
Because Ruby dates are first-class objects and the Range class works very sanely with them, it took nearly no effort to code that, other than trying to remember wtf all those strftime %-codes were (or you could use Date#year etc. and plain string interpolation - it would probably be better for your sanity actually).
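That string interpolation alternative might look something like this (assuming %04d/%02d zero-padding matches Sinfest's URL scheme):

require 'date'
(Date.parse('2000-01-17')..Date.parse('2010-05-17')).in_parallel_n(10){|d|
  url = "http://www.sinfest.net/comikaze/comics/%04d-%02d-%02d.gif" %
        [d.year, d.month, d.day]
  system "wget", "-q", url
}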
And wget will quietly ignore proper 404s, so you don't need to wonder if they're updating on weekends too, or only on weekdays, or what.
For other comicses like xkcd you might need some hpricoting (and put the alt text in metadata, then teach your image viewer to read it? Or use ImageMagick to put it into the image? choices, oh all the choices...). Or you could use this awesome Webcomics Reader Greasemonkey script. Unfortunately it only knows about a handful of the most popular ones.
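A minimal hpricoting sketch, assuming the comic image lives under div#comic (true for xkcd when I checked, but the selector will vary from site to site):

require 'open-uri'
require 'hpricot'

doc = Hpricot(open("http://xkcd.com/614/"))
img = doc.at("div#comic img")      # find the comic <img> itself
system "wget", "-q", img["src"]
puts img["title"]                  # the alt text mentioned above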
Now this technique is applicable to a lot of uses other than downloading funnies - like handling S3 backups, encoding large numbers of MP3s and so on. Everything where it makes sense to run multiple threads in parallel, but not too insanely many, will be really easy to do now thanks to Enumerable#in_parallel_n or similar code.
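For instance, a sketch of the MP3 case, assuming lame is installed and the WAVs sit in the current directory:

Dir["*.wav"].in_parallel_n(4){|wav|
  # Encode each file, writing foo.wav -> foo.mp3
  system "lame", "--quiet", wav, wav.sub(/\.wav\z/, ".mp3")
}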
It works with both Ruby 1.8 and 1.9 equally well, in spite of their completely different threading systems.
Enjoy.
14 comments:
Simple but definitely very usable. I'm stealing this for my pingback client.
Oh, and I *love* your cat images.
There's also a Ruby gem to ease the burden of managing threads: http://rubygems.org/gems/work_queue
Any idea how to go about less friendly names, like the later episodes of Order of the Stick? http://www.giantitp.com/comics/oots0723.html
Notthinking: OOTS, like most such comicses, has very easy HTML URLs. You need to get all the HTMLs (easy), extract image URLs from them (it's usually fairly straightforward Hpricot, or you can even use regexps if you don't know any Hpricot, but that's just ugly), and get those.
If even that is not possible, there are two more ways:
* get first page, extract both image and next link.
* get archive page, extract links from there etc.
All that requires a lot more effort.
For OOTS, just get all IMG tags, and follow the one with src starting with "/comics/images/" - it's simple enough.
Seen this? It's a shell utility for running parallel jobs, written in Perl. I use it all the time. Aaron Crane improved the UI.
Thread#value is another decent tool for cheap, ad hoc parallelization. On its first invocation, it joins a thread and returns the value produced by the thread's computation. Later invocations return the value immediately.
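For example, a parallel map built on it might look like:

module Enumerable
  def parallel_map
    # Each thread computes one element; Thread#value joins the thread
    # and returns its block's result, so results come back in order.
    map{|x| Thread.new{ yield(x) } }.map{|t| t.value}
  end
end

(1..5).parallel_map{|i| i * i} # => [1, 4, 9, 16, 25]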
Mark Dominus: Using Unix shell as a programming language is a horrible horrible idea, and I would never use it.
Try implementing this sinfest downloader - which is essentially a Ruby one-liner - in shell to see why; and that's even before you run into shell escaping issues, which you are certainly aware of.
Shell makes trivial things trivial to do half-right (and nearly impossible to do fully correctly). Even minimally non-trivial things require monstrous horrible code to get running at all.
Also possible without threads, with EventMachine and em-http-request:
http://github.com/igrigorik/em-http-request
See the multi request example there.
dubek: As far as I can tell from that link, EventMachine would require significantly more coding, and is difficult to extend to things like disk and S3 access, which threads handle easily.
Parallel.each(xxx) / Parallel.map(xxx) does the same, with processes or threads. http://github.com/grosser/parallel
Ara Howard put together two gems to handle this very easily with either threads or processes (fork).
http://github.com/ahoward/threadify
http://github.com/ahoward/forkoff
Awesome! Thanks!
I was able to use this in my program 5 years later in 2015. Nice!
Anonymous: Congratulations :-D