There wasn't that much code in my blog recently. So today I decided to change that by coding something cool. And what could be simpler than taking a popular language and making it much much faster ?
So first I downloaded
Ruby 1.8.5-p12, and
some benchmarks. The benchmarks are the same as used in the
blog post comparing different implementations of Ruby.
I don't like the benchmarks much - they test microfunctionality and they cannot be realistically extrapolated to real programs, but I'm too lazy to make better ones just yet.
I wanted to speed up Ruby. Some rules:
- Under no circumstance can it break Ruby functionality (hacks like Ruby/DL may need updating).
- Speed-up must be observable in benchmarks.
- Speed-up should be significant (more than just 1-2%).
- It must be simple to code - a few hours for a working patch. Programmer productivity is supposed to double every six years after all.
- It should not make code much more complex than it currently is.
I'm not trying to compete with YARV - the point is a major speed-up
right now.
First I modified the benchmark driver to run everything 5 times, and report average of 3 middle results, like they do in many sports competitions. I ran everything on an old Duron 1400MHz box with 512MB memory. I limited set of benchmarks to just 7 "app" benchmarks (all except for pentomino, which was too slow and factorial, which raised stack overflow) to get faster feedback.
CFLAGS
The first idea was to simply play with CFLAGS. I have no idea why people aren't doing that. Isn't the very idea of coding in C about performance ?
I didn't do anything fancy - just the standard
-O3 -march=athlon -mtune=athlon -fomit-frame-pointer
. Results show how much extra time it needs compared to the fastest implementation.
Benchmark | Ruby 1.8.4 from Ubuntu | Ruby 1.8.5-p12 | Ruby 1.8.5-p12 with nice CFLAGS |
---|
app_answer | +86% | +0% | +10% |
---|
app_fib | +72% | +4% | +0% |
---|
app_mandelbrot | +44% | +0% | +1% |
---|
app_raise | +42% | +5% | +0% |
---|
app_strconcat | +49% | +7% | +0% |
---|
app_tak | +77% | +6% | +0% |
---|
app_tarai | +69% | +10% | +0% |
Two surprises here. First, messing with CFLAGS brought just a few percent speed-up, and in one case even a 10% slow-down. Second, either 1.8.5 is massively faster than 1.8.4 or Ubuntu uses some really horrible compilation options. Assuming the former, that's a great thing.
Anyway, that's kinda disappointing, as I was expecting a 5%-10% improvement pretty much for free. So, time to look for another easy target. Running the benchmarks with a profiler (
CFLAGS=-O3 -march=athlon -mtune=athlon -pg
), and compiling everything to assembly seemed like a good idea. A candidate I remembered from earlier was
st.c
- hash tables called functions like these many times during each operation, and used multiple pointer dereferences for it, instead of inlining them (and simplifying to pretty much no code):
static int
numcmp(x, y)
long x, y;
{
return x != y;
}
static int
numhash(n)
long n;
{
return n;
}
It seemed straighforward to simply specialize the hash table to numeric version, but first I wanted to check if it's really likely to be worthwhile.
And indeed,
st_lookup
and
st_foreach
were important, at least in some benchmarks. What I didn't expect were
fix_minus
,
fix_plus
, and other
Fixnum
functions. These also seemed easy to improve (more about it later). Unfortunately the five most performance-critical functions were
rb_eval
,
rb_call0
,
rb_call
,
rb_newobj
and
call_cfunc
, that I had no clue about.
So two candidates giving at best a minor improvement, and 5 candidates I have no clue about, sounds like fun.
Optimizing rb_call0
rb_call0
seemed like the easiest way of making a difference. It contained a big switch which was compiled into a binary search tree of comparisons and jumps. That's usually a good idea for sparse switches, but in
rb_call0
two cases are far more common than the other six. They are -
NODE_CFUN (most common) and
NODE_SCOPE (significantly less so). Everything else is far behind the two. So the switch was replaced by "if first else if second else switch" construct. This is only likely to be worthwhile because
rb_call0
is called so often. By the way a good JIT compiler should be able to optimize such cases on its own, but compilers for static languages like C need manual help.
And the results:
Benchmark | Ruby 1.8.4 from Ubuntu | Ruby 1.8.5-p12 | Ruby 1.8.5-p12 with nice CFLAGS | Nice CFLAGS + rb_call0 |
---|
app_answer | +86% | +0% | +10% | +7% |
---|
app_fib | +77% | +6% | +3% | +0% |
---|
app_mandelbrot | +46% | +2% | +3% | +0% |
---|
app_raise | +43% | +6% | +0% | +0% |
---|
app_strconcat | +49% | +7% | +0% | +2% |
---|
app_tak | +80% | +8% | +2% | +0% |
---|
app_tarai | +72% | +12% | +2% | +0% |
---|
So Ruby got some 2%-3% faster already.
Specializing hash tables
There are 3 kinds of hash tables used in Ruby - number, string and object hash tables. It seems pretty excessive to triple amount of hash-related code in hope of optimizing it a bit, and it wouldn't really be necessary if Ruby was coded in C++, as templates would probably do the trick. I'm going to do something a bit less radical, and only specialize number tables.
The optimization required quite a bit of code duplication and patching, and the speed-up is relatively modest.
Benchmark | Ruby 1.8.4 from Ubuntu | Ruby 1.8.5-p12 | Ruby 1.8.5-p12 with nice CFLAGS | Nice CFLAGS + rb_call0 | Nice CFLAGS + rb_call0 + st_num_table |
---|
app_answer | +86% | +0% | +10% | +7% | +7% |
---|
app_fib | +77% | +6% | +3% | +0% | +0% |
---|
app_mandelbrot | +49% | +3% | +4% | +2% | +0% |
---|
app_raise | +43% | +6% | +0% | +0% | +2% |
---|
app_strconcat | +54% | +10% | +3% | +5% | +0% |
---|
app_tak | +81% | +8% | +2% | +1% | +0% |
---|
app_tarai | +74% | +14% | +3% | +1% | +0% |
---|
Normally I wouldn't feel too good about patches like that, but having separate data structure for numeric hashes makes it possible to use more efficient data structure like
Judy.
Making Fixnum
operations faster
Fixnum
operations are executed a huge number of times even in short programs. Unlike most programming languages (like let's say - Java),
Fixnum
s in Ruby are just normal objects, and the virtual machine has no special support for making them fast.
Fixing the virtual machine would take too much work, so let's just make sure the methods perform well. Unfortunately not all of them do. Take a look at this simple method:
static VALUE
fix_equal(x, y)
VALUE x, y;
{
if (FIXNUM_P(y)) {
return (FIX2LONG(x) == FIX2LONG(y))?Qtrue:Qfalse;
}
else {
return num_equal(x, y);
}
}
static VALUE
num_equal(x, y)
VALUE x, y;
{
if (x == y) return Qtrue;
return rb_funcall(y, id_eq, 1, x);
}
Unfortunately what it really means is:
fix_equal:
subl $28, %esp
movl 36(%esp), %edx
movl 32(%esp), %ecx
testb $1, %dl ; Check if y is a fixnum
jne .L610
cmpl %ecx, %edx ; Check if x == y
je .L605 ; If yes, return Qtrue
movl id_eq, %eax ; If not, call rb_funcall(y, id_eq, 1, x)
movl %ecx, 12(%esp)
movl $1, 8(%esp)
movl %edx, (%esp)
movl %eax, 4(%esp)
call rb_funcall
.L607:
addl $28, %esp
ret
.L610:
sarl %ecx ; Shift x by 1 bit
sarl %edx ; Shift y by 1 bit
xorl %eax, %eax
cmpl %edx, %ecx ; Check if shifted x == shifted y
jne .L607 ; If not, return Qfalse
.p2align 4,,7
.L605:
movl $2, %eax ; Return Qtrue
addl $28, %esp
ret
The that two shifts are completely unnecessary.
x
and
y
are equal after shift if and only if they were equal before shift. And if
y
isn't a
Fixnum
, it's impossible for it to equal
x
.
static VALUE
fix_equal(x, y)
VALUE x, y;
{
if (FIXNUM_P(y))
return (x == y)?Qtrue:Qfalse;
else
return rb_funcall(y, id_eq, 1, x);
}
fix_equal:
subl $28, %esp
movl 36(%esp), %edx
movl 32(%esp), %eax
testb $1, %dl
je .L2
cmpl %eax, %edx
sete %al
addl $28, %esp
movzbl %al, %eax
addl %eax, %eax
ret
.p2align 4,,7
.L2:
movl %eax, 12(%esp)
movl id_eq, %eax
movl $1, 8(%esp)
movl %edx, (%esp)
movl %eax, 4(%esp)
call rb_funcall
addl $28, %esp
ret
And suddenly we have one conditional jump instead of three.
Other
Fixnum
operations could be made faster too. They're common enough to be worth it.
Summary
Here's
the patch. It passes
make test-all
and does well in benchmarks.
rb_call0
and
Fixnum#equal
improvements seem pretty useful, and are too simple to break anything. I'm not sure about
st_num_table
- Ruby C code uses random type casts all over the place, and the code is pretty ugly. So it's quite likely I missed some
st_table
vs
st_num_table
thing.
What's left to do:
- Running more benchmarks, on more machines.
- Verifying
st_num_table
If it works well, the patch could be included in the next
stable version of Ruby.
And to improve performance even more without much work, one could:
- Convert
rb_eval
from node hopping to real "bytecode". - Convert
st_num_table
to something faster and more memory-efficient (like Judy). - Two extensions seem to be using old numeric
st_table
. I converted syck
, but I cannot even compile win32ole.c
to verify whether it works or not. - Replace
call_cfunc
with compiled ABI trampolines (move to stack, move to stack, move to stack, unconditional jump). This sounds terrifying, but it should be pretty fast and simple.
Thanks for showing that compiler flags don't matter much :)
ReplyDeleteYour tiny slowdowns might've been due to -O3 instead of -O2. If you like the labour, you might wanna try -O and -Os, too.
Ruby isn't stable with -fomit-frame-pointer. The GC needs the frame pointer to properly walk and mark the stack. Expect strange crashes when running with that flag. (Search google for references.)
ReplyDeleteIf you want to try to squeze some performance out with compiler flags, a tool like Acovea might help out a lot.
ReplyDeleteAnonymous: The only compiler option that I expected to make a serious difference was -march=athlon -mtune=athlon. This often makes a big difference, especially in number-crunching and codec software. I guess it doesn't help interpretters much.
ReplyDeleteEric Hodel: Thanks for the info, the next time I'll benchmark it with frame pointers.
Anonymous: Acovea looks cool for optimizing software that has precise benchmarks, I just don't think Ruby interpretter is one of them. It seems to me that the benchmarks we have are just too unreliable, the flags probably wouldn't even be applicable to other kinds of x86s, and compile+benchmark takes way too much time to try even 10s of configuration (at least on my box). Of course if you feel like running it, just do so. I'd be glad to hear that I was wrong.
I was able to get big performance improvements (30-40%) out of perl by using icc instead of gcc a few years ago. Sun's compiler offered similar improvements on SPARC processors over gcc. While gcc's optimizer has improved, I'd still expect icc to be noticeably better.
ReplyDeleteseems like your patch is committed: http://svn.ruby-lang.org/cgi-bin/viewvc.cgi/trunk/numeric.c?r1=11807&r2=11806
ReplyDeleteThat's a really interesting post.
ReplyDeleteI'm using your (great) magic/xml wrapper on Ubuntu and could do with some more speed. The progress bar is currently saying another 40 hours to go on my 10GB XML file...
Have you tried recompiling ruby-1.8.4 from hand and comparing it to the version supplied by Ubuntu?
I probably should have run this job on my Gentoo boxes :-)
I should probably also ask if you've had any success moving from REXML to libxml/expat for magic/xml at the same time!
ReplyDeleteJanek: Thanks for the info.
ReplyDeleteGiles: Not yet, but that's definitely something I'm going to try.
Anyone tried http://nekovm.org ? (... or is it necessary to wait for http://rubyforge.org/projects/ruccola/ )
ReplyDeleteSmall pedantic comment...
ReplyDelete-march implies -mtune
From the gcc info docs...
`-march=CPU-TYPE'
Generate instructions for the machine type CPU-TYPE. The choices
for CPU-TYPE are the same as for `-mtune'. Moreover, specifying
`-march=CPU-TYPE' implies `-mtune=CPU-TYPE'.
A good area to play with are the builtins and the attributes.
ReplyDeleteC is ambiguous about the programmers intent leaving the optimizer to often make worst case decisions.
You may use `__builtin_expect' to provide the compiler with branch
prediction information. In general, you should prefer to use
actual profile feedback for this (`-fprofile-arcs'), as
programmers are notoriously bad at predicting how their programs
actually perform. However, there are applications in which this
data is hard to collect
__builtin_prefetch (const void *ADDR, ...)
This function is used to minimize cache-miss latency by moving
data into a cache before it is accessed. You can insert calls to
`__builtin_prefetch' into code for which you know addresses of
data in memory that is likely to be accessed soon. If the target
supports them, data prefetch instructions will be generated. If
the prefetch is done early enough before the access then the data
will be in the cache by the time it is accessed.
Function attributes...
`malloc'
The `malloc' attribute is used to tell the compiler that a function
may be treated as if any non-`NULL' pointer it returns cannot
alias any other pointer valid when the function returns. This
will often improve optimization.
`pure'
Many functions have no effects except the return value and their
return value depends only on the parameters and/or global
variables. Such a function can be subject to common subexpression
elimination and loop optimization just as an arithmetic operator
would be. These functions should be declared with the attribute
`pure'. For example,
int square (int) __attribute__ ((pure));
hmm maybe once you double check and not use -fomit-frame-pointer you can submit a patch! Thank you!!
ReplyDelete