The best kittens, technology, and video games blog in the world.

Wednesday, February 21, 2007

Making Ruby faster

Red Kitty by Bob.Fornal from flickr (CC-NC-SA)

There wasn't that much code in my blog recently. So today I decided to change that by coding something cool. And what could be simpler than taking a popular language and making it much much faster ?

So first I downloaded Ruby 1.8.5-p12, and some benchmarks. The benchmarks are the same as used in the blog post comparing different implementations of Ruby.

I don't like the benchmarks much - they test microfunctionality and they cannot be realistically extrapolated to real programs, but I'm too lazy to make better ones just yet.

I wanted to speed up Ruby. Some rules:

  • Under no circumstance can it break Ruby functionality (hacks like Ruby/DL may need updating).
  • Speed-up must be observable in benchmarks.
  • Speed-up should be significant (more than just 1-2%).
  • It must be simple to code - a few hours for a working patch. Programmer productivity is supposed to double every six years after all.
  • It should not make code much more complex than it currently is.


I'm not trying to compete with YARV - the point is a major speed-up right now.

First I modified the benchmark driver to run everything 5 times, and report average of 3 middle results, like they do in many sports competitions. I ran everything on an old Duron 1400MHz box with 512MB memory. I limited set of benchmarks to just 7 "app" benchmarks (all except for pentomino, which was too slow and factorial, which raised stack overflow) to get faster feedback.

CFLAGS

The first idea was to simply play with CFLAGS. I have no idea why people aren't doing that. Isn't the very idea of coding in C about performance ?

I didn't do anything fancy - just the standard -O3 -march=athlon -mtune=athlon -fomit-frame-pointer. Results show how much extra time it needs compared to the fastest implementation.

BenchmarkRuby 1.8.4 from UbuntuRuby 1.8.5-p12Ruby 1.8.5-p12 with nice CFLAGS
app_answer+86%+0%+10%
app_fib+72%+4%+0%
app_mandelbrot+44%+0%+1%
app_raise+42%+5%+0%
app_strconcat+49%+7%+0%
app_tak+77%+6%+0%
app_tarai+69%+10%+0%


Two surprises here. First, messing with CFLAGS brought just a few percent speed-up, and in one case even a 10% slow-down. Second, either 1.8.5 is massively faster than 1.8.4 or Ubuntu uses some really horrible compilation options. Assuming the former, that's a great thing.

Anyway, that's kinda disappointing, as I was expecting a 5%-10% improvement pretty much for free. So, time to look for another easy target. Running the benchmarks with a profiler (CFLAGS=-O3 -march=athlon -mtune=athlon -pg), and compiling everything to assembly seemed like a good idea. A candidate I remembered from earlier was st.c - hash tables called functions like these many times during each operation, and used multiple pointer dereferences for it, instead of inlining them (and simplifying to pretty much no code):
static int
numcmp(x, y)
long x, y;
{
return x != y;
}

static int
numhash(n)
long n;
{
return n;
}

It seemed straighforward to simply specialize the hash table to numeric version, but first I wanted to check if it's really likely to be worthwhile.

And indeed, st_lookup and st_foreach were important, at least in some benchmarks. What I didn't expect were fix_minus, fix_plus, and other Fixnum functions. These also seemed easy to improve (more about it later). Unfortunately the five most performance-critical functions were rb_eval, rb_call0, rb_call, rb_newobj and call_cfunc, that I had no clue about.

So two candidates giving at best a minor improvement, and 5 candidates I have no clue about, sounds like fun.

Optimizing rb_call0



rb_call0 seemed like the easiest way of making a difference. It contained a big switch which was compiled into a binary search tree of comparisons and jumps. That's usually a good idea for sparse switches, but in rb_call0 two cases are far more common than the other six. They are - NODE_CFUN (most common) and NODE_SCOPE (significantly less so). Everything else is far behind the two. So the switch was replaced by "if first else if second else switch" construct. This is only likely to be worthwhile because rb_call0 is called so often. By the way a good JIT compiler should be able to optimize such cases on its own, but compilers for static languages like C need manual help.

And the results:


BenchmarkRuby 1.8.4 from UbuntuRuby 1.8.5-p12Ruby 1.8.5-p12 with nice CFLAGSNice CFLAGS + rb_call0
app_answer+86%+0%+10%+7%
app_fib+77%+6%+3%+0%
app_mandelbrot+46%+2%+3%+0%
app_raise+43%+6%+0%+0%
app_strconcat+49%+7%+0%+2%
app_tak+80%+8%+2%+0%
app_tarai+72%+12%+2%+0%


So Ruby got some 2%-3% faster already.

Specializing hash tables



There are 3 kinds of hash tables used in Ruby - number, string and object hash tables. It seems pretty excessive to triple amount of hash-related code in hope of optimizing it a bit, and it wouldn't really be necessary if Ruby was coded in C++, as templates would probably do the trick. I'm going to do something a bit less radical, and only specialize number tables.

The optimization required quite a bit of code duplication and patching, and the speed-up is relatively modest.

BenchmarkRuby 1.8.4 from UbuntuRuby 1.8.5-p12Ruby 1.8.5-p12 with nice CFLAGSNice CFLAGS + rb_call0Nice CFLAGS + rb_call0 + st_num_table
app_answer+86%+0%+10%+7%+7%
app_fib+77%+6%+3%+0%+0%
app_mandelbrot+49%+3%+4%+2%+0%
app_raise+43%+6%+0%+0%+2%
app_strconcat+54%+10%+3%+5%+0%
app_tak+81%+8%+2%+1%+0%
app_tarai+74%+14%+3%+1%+0%


Normally I wouldn't feel too good about patches like that, but having separate data structure for numeric hashes makes it possible to use more efficient data structure like Judy.

Making Fixnum operations faster



Fixnum operations are executed a huge number of times even in short programs. Unlike most programming languages (like let's say - Java), Fixnums in Ruby are just normal objects, and the virtual machine has no special support for making them fast.

Fixing the virtual machine would take too much work, so let's just make sure the methods perform well. Unfortunately not all of them do. Take a look at this simple method:
static VALUE
fix_equal(x, y)
VALUE x, y;
{
if (FIXNUM_P(y)) {
return (FIX2LONG(x) == FIX2LONG(y))?Qtrue:Qfalse;
}
else {
return num_equal(x, y);
}
}

static VALUE
num_equal(x, y)
VALUE x, y;
{
if (x == y) return Qtrue;
return rb_funcall(y, id_eq, 1, x);
}


Unfortunately what it really means is:
fix_equal:
subl $28, %esp
movl 36(%esp), %edx
movl 32(%esp), %ecx
testb $1, %dl ; Check if y is a fixnum
jne .L610
cmpl %ecx, %edx ; Check if x == y
je .L605 ; If yes, return Qtrue
movl id_eq, %eax ; If not, call rb_funcall(y, id_eq, 1, x)
movl %ecx, 12(%esp)
movl $1, 8(%esp)
movl %edx, (%esp)
movl %eax, 4(%esp)
call rb_funcall
.L607:
addl $28, %esp
ret
.L610:
sarl %ecx ; Shift x by 1 bit
sarl %edx ; Shift y by 1 bit
xorl %eax, %eax
cmpl %edx, %ecx ; Check if shifted x == shifted y
jne .L607 ; If not, return Qfalse
.p2align 4,,7
.L605:
movl $2, %eax ; Return Qtrue
addl $28, %esp
ret


The that two shifts are completely unnecessary. x and y are equal after shift if and only if they were equal before shift. And if y isn't a Fixnum, it's impossible for it to equal x.

static VALUE
fix_equal(x, y)
VALUE x, y;
{
if (FIXNUM_P(y))
return (x == y)?Qtrue:Qfalse;
else
return rb_funcall(y, id_eq, 1, x);
}


fix_equal:
subl $28, %esp
movl 36(%esp), %edx
movl 32(%esp), %eax
testb $1, %dl
je .L2
cmpl %eax, %edx
sete %al
addl $28, %esp
movzbl %al, %eax
addl %eax, %eax
ret
.p2align 4,,7
.L2:
movl %eax, 12(%esp)
movl id_eq, %eax
movl $1, 8(%esp)
movl %edx, (%esp)
movl %eax, 4(%esp)
call rb_funcall
addl $28, %esp
ret
And suddenly we have one conditional jump instead of three.

Other Fixnum operations could be made faster too. They're common enough to be worth it.

Summary



Here's the patch. It passes make test-all and does well in benchmarks. rb_call0 and Fixnum#equal improvements seem pretty useful, and are too simple to break anything. I'm not sure about st_num_table - Ruby C code uses random type casts all over the place, and the code is pretty ugly. So it's quite likely I missed some st_table vs st_num_table thing.

What's left to do:
  • Running more benchmarks, on more machines.
  • Verifying st_num_table


If it works well, the patch could be included in the next stable version of Ruby.

And to improve performance even more without much work, one could:
  • Convert rb_eval from node hopping to real "bytecode".
  • Convert st_num_table to something faster and more memory-efficient (like Judy).
  • Two extensions seem to be using old numeric st_table. I converted syck, but I cannot even compile win32ole.c to verify whether it works or not.
  • Replace call_cfunc with compiled ABI trampolines (move to stack, move to stack, move to stack, unconditional jump). This sounds terrifying, but it should be pretty fast and simple.

13 comments:

Anonymous said...

Thanks for showing that compiler flags don't matter much :)

Your tiny slowdowns might've been due to -O3 instead of -O2. If you like the labour, you might wanna try -O and -Os, too.

Eric Hodel said...

Ruby isn't stable with -fomit-frame-pointer. The GC needs the frame pointer to properly walk and mark the stack. Expect strange crashes when running with that flag. (Search google for references.)

Anonymous said...

If you want to try to squeze some performance out with compiler flags, a tool like Acovea might help out a lot.

taw said...

Anonymous: The only compiler option that I expected to make a serious difference was -march=athlon -mtune=athlon. This often makes a big difference, especially in number-crunching and codec software. I guess it doesn't help interpretters much.

Eric Hodel: Thanks for the info, the next time I'll benchmark it with frame pointers.

Anonymous: Acovea looks cool for optimizing software that has precise benchmarks, I just don't think Ruby interpretter is one of them. It seems to me that the benchmarks we have are just too unreliable, the flags probably wouldn't even be applicable to other kinds of x86s, and compile+benchmark takes way too much time to try even 10s of configuration (at least on my box). Of course if you feel like running it, just do so. I'd be glad to hear that I was wrong.

Anonymous said...

I was able to get big performance improvements (30-40%) out of perl by using icc instead of gcc a few years ago. Sun's compiler offered similar improvements on SPARC processors over gcc. While gcc's optimizer has improved, I'd still expect icc to be noticeably better.

Janek said...

seems like your patch is committed: http://svn.ruby-lang.org/cgi-bin/viewvc.cgi/trunk/numeric.c?r1=11807&r2=11806

Giles said...

That's a really interesting post.

I'm using your (great) magic/xml wrapper on Ubuntu and could do with some more speed. The progress bar is currently saying another 40 hours to go on my 10GB XML file...

Have you tried recompiling ruby-1.8.4 from hand and comparing it to the version supplied by Ubuntu?

I probably should have run this job on my Gentoo boxes :-)

Giles said...

I should probably also ask if you've had any success moving from REXML to libxml/expat for magic/xml at the same time!

taw said...

Janek: Thanks for the info.

Giles: Not yet, but that's definitely something I'm going to try.

Anonymous said...

Anyone tried http://nekovm.org ? (... or is it necessary to wait for http://rubyforge.org/projects/ruccola/ )

Anonymous said...

Small pedantic comment...

-march implies -mtune

From the gcc info docs...

`-march=CPU-TYPE'
Generate instructions for the machine type CPU-TYPE. The choices
for CPU-TYPE are the same as for `-mtune'. Moreover, specifying
`-march=CPU-TYPE' implies `-mtune=CPU-TYPE'.

Anonymous said...

A good area to play with are the builtins and the attributes.

C is ambiguous about the programmers intent leaving the optimizer to often make worst case decisions.

You may use `__builtin_expect' to provide the compiler with branch
prediction information. In general, you should prefer to use
actual profile feedback for this (`-fprofile-arcs'), as
programmers are notoriously bad at predicting how their programs
actually perform. However, there are applications in which this
data is hard to collect


__builtin_prefetch (const void *ADDR, ...)
This function is used to minimize cache-miss latency by moving
data into a cache before it is accessed. You can insert calls to
`__builtin_prefetch' into code for which you know addresses of
data in memory that is likely to be accessed soon. If the target
supports them, data prefetch instructions will be generated. If
the prefetch is done early enough before the access then the data
will be in the cache by the time it is accessed.


Function attributes...
`malloc'
The `malloc' attribute is used to tell the compiler that a function
may be treated as if any non-`NULL' pointer it returns cannot
alias any other pointer valid when the function returns. This
will often improve optimization.

`pure'
Many functions have no effects except the return value and their
return value depends only on the parameters and/or global
variables. Such a function can be subject to common subexpression
elimination and loop optimization just as an arithmetic operator
would be. These functions should be declared with the attribute
`pure'. For example,

int square (int) __attribute__ ((pure));

Roger Pack said...

hmm maybe once you double check and not use -fomit-frame-pointer you can submit a patch! Thank you!!