A few days ago I tried to
make Ruby a bit faster. Optimizing
Fixnum
operations and tweaking
rb_call0
seemed pretty straighforward, but specializing
st_num_table
must have seemed pretty weird.
Every non-trivial program uses a lot of collections. Collections must basically support the following operations:
- Getting element by key (
Collection#[]
) - Setting element by key (
Collection#[]=
) - Iterating over all elements (
Collection#each
) - all kinds of collection-specific and helper methods.
In many ways, a single collection type is enough for a language.
PHP has only array
. Perl has arrays and hashes, but hashes can be indexed only by strings - that's a very serious limitation. Ruby also has two kinds of collections -
Array
s and
Hash
es. Ruby Hashes are indexed by arbitrary objects, so they could in principle replace Arrays, at cost of some API changes, any possibly somewhat inferior performance (with good implementation, the difference wouldn't be high).
Below the surface, Ruby uses many more collection kinds:
RArray
- dense array indexed by small positive integersst_table
- a general purpose hash, which is specialized into three kinds:- *
st_num_table
- indexed by numbers - *
st_str_table
- indexed by strings - *
st_obj_table
- indexed by arbitrary Ruby objects
The specialization is done in a very inefficient way - using hand-coded vtable:
struct st_hash_type {
int (*compare)();
int (*hash)();
};
struct st_table {
struct st_hash_type *type;
int num_bins;
int num_entries;
struct st_table_entry **bins;
};
st_obj_table
is used in implementation of
Hash
es. It isn't very efficient, calling
#eql?
method to check object equality, but a few special cases let it handle
Fixnum
s and
String
s more efficiently.
Because of
st_obj_table
's extra performance cost, and possibly also for API reasons, Ruby interpretter rarely uses it internally. It uses
st_str_table
quite a bit (mostly in uninteresting places), and
st_num_table
a lot.
Why
st_num_table
instead of
RArray
? Basically because
st_num_table
is sparse. When Java sees the following code:
double do_stuff(Complex x) {
return x.real + x.imaginary;
}
It knows
Complex
has two fields:
protected double imaginary;
protected double real;
so the code get compiled to something like:
double do_stuff(Complex x) {
return get_field(x, 1) + get_field(x, 0);
}
It's more difficult for duck-typed languages like Ruby. They don't even know internal representation of
self
, so the code:
def do_stuff
return @real + @image
end
gets compiled to something like:
def do_stuff
return get_field(self, real_id) + get_field(self, image_id)
end
where
real_id
and
image_id
equal
:@real.to_i
and
:@image.to_i
. In a particular session of
irb
I tried they were 26314 and 26322:
irb> a = Complex.new(1.0, 2.0)
=> Complex(1.0, 2.0)
irb> a.instance_variables
=> ["@image", "@real"]
irb> a.instance_variables.map{|s| s.to_sym.to_i}
=> [26314, 26322]
irb> a.instance_variable_get(:@real)
=> 1.0
irb> a.instance_variable_get(26322.to_sym)
=> 2.0
Ruby method calls work in a very similar way. So Ruby needs efficient collections indexed by small integers. Obviously, they need to be sparse -
Array
containing something at 26314 and 26322 would allocate memory for all elements from 0 to 26322. That's clearly not acceptable. Hash tables like
st_num_table
don't have this kind of overhead, so they work much better. However,
st_num_table
is used very often in Ruby, and hash tables have pretty high constant overhead.
To demonstrate this I wrote an instance variable table printer (works with 1.8.5):
require 'complex'
require 'dl'
require 'dl/struct'
include DL::Importable
# Some bits copied from evil.rb
typealias "VALUE", nil, nil, nil, "unsigned long"
Basic = ["long flags", "VALUE klass"]
RBasic = struct Basic
RObject = struct(Basic + [
"st_table *iv_tbl"
])
RClass = struct(Basic + [
"st_table *iv_tbl",
"st_table *m_tbl",
"VALUE super"
])
St_table = struct([
"st_hash_type *type",
"int num_bins",
"int num_entries",
"st_table_entry **bins",
])
St_table_entry = struct([
"uint hash",
"long key",
"VALUE record",
#"st_table_entry *next",
"long next",
])
def value_to_obj(value)
return false if value == 0
return true if value == 2
return nil if value == 4
# 6 is undef placeholder
if (value & 3) == 0
# Object
ObjectSpace._id2ref((value - 2**32)/2)
elsif (value & 3) == 2
# Symbols
(value >> 8).to_sym
else
# Integers
return ObjectSpace._id2ref(value)
end
end
def print_object_iv_tbl(obj)
puts "IV Table of object #{obj.inspect}"
addr = obj.object_id * 2
obj_data = RObject.new(DL::PtrData.new(addr))
puts "Flags = #{obj_data.flags}"
puts "Class = #{value_to_obj(obj_data.klass).inspect}"
iv_tbl = St_table.new(obj_data.iv_tbl)
bins = iv_tbl.bins
num_bins = iv_tbl.num_bins
# x86-ish, assumes normal Object implemented as RObject, nothing fancy
object_memory_use = 12 + 4 * num_bins
puts "IV_TBL:"
puts "* Bins: #{num_bins}"
puts "* Entries: #{iv_tbl.num_entries}"
bins.size = num_bins * 4
bins.to_str.unpack("L*").each_with_index{|addr, i|
if addr == 0
puts "* Bin #{i}: empty"
else
puts "* Bin #{i}:"
while addr != 0
object_memory_use += 16
table_entry = St_table_entry.new(DL::PtrData.new(addr))
puts " * Entry"
puts " * Hash: #{table_entry.hash}"
puts " * Key: #{table_entry.key.to_sym}"
puts " * Record: #{value_to_obj(table_entry.record).inspect}"
addr = table_entry.next
end
end
}
puts "Memory use: #{object_memory_use} bytes"
end
a = Complex.new(1.0, 2.0)
print_object_iv_tbl(a)
which reports:
IV Table of object Complex(1.0, 2.0)
Flags = 2
Class = Complex
IV_TBL:
* Bins: 11
* Entries: 2
* Bin 0:
* Entry
* Hash: 10802
* Key: @real
* Record: 1.0
* Bin 1: empty
* Bin 2: empty
* Bin 3: empty
* Bin 4: empty
* Bin 5: empty
* Bin 6: empty
* Bin 7: empty
* Bin 8:
* Entry
* Hash: 10810
* Key: @image
* Record: 2.0
* Bin 9: empty
* Bin 10: empty
Memory use: 88 bytes
That's 88 bytes for the object + 2*16 more for storing contents of its fields (Floats take only 16 bytes as they don't have own
iv_tbl
s). 120 bytes instead of 16 needed for storing just two doubles.
Benchmarking memory usage
One problem with benchmarking under Linux is lack of easy way of measuring memory use. Standard Unix
/usr/bin/time
tries to measure it, but as Linux kernel doesn't set the relevant fields of
struct rusage
after call to
wait4
, it always reports 0 kB used. Since 2.6.16 Linux at least provides
/proc/<process_id>/smaps
.
The only thing needed is a convenient wrapper like
time
for measuring execution time. Live monitoring would be complicated, but as most programs do not free any memory before exiting (normally
free()
d or GC-ed memory returns to the memory allocator, but not to the operating system), it's a good enough to measure at exit.
The simplest way of making program do something before exiting is
LD_PRELOAD
ing a shared library, and registering a destructor for it. Parsing text file
/proc/<process_id>/smaps
in C would be sick, so the library simply spawns a Ruby program which does all the work.
Process.ppid
is pid of parent process, so
/proc/#{Process.ppid}/smaps
is file containing parent process's memory use information.
system("exec ./memory_benchmarker.rb")
is necessary as libc
system("./memory_benchmarker.rb")
would spawn
/bin/sh
, which would in turn spawn
./memory_benchmarker.rb
, and
Process.ppid
would return shell's pid, not the actual process's.
So the
Makefile
:
memory_benchmarker.so: memory_benchmarker.c
gcc -Wall -W -fPIC -shared memory_benchmarker.c -o memory_benchmarker.so
memory_benchmarker.c
:
#include <stdlib.h>
#include <unistd.h>
void memory_benchmarker_fini() __attribute__ ((destructor));
void memory_benchmarker_fini()
{
unsetenv("LD_PRELOAD");
system("exec ./memory_benchmarker.rb");
}
/* If you don't want to trace memory use of child process, unset LD_PRELOAD here.
Unfortunately doing so also affects exec()ed process.
void memory_benchmarker_init() __attribute__ ((constructor));
void memory_benchmarker_init()
{
unsetenv("LD_PRELOAD");
}
*/
and
memory_benchmarker.rb
:
#!/usr/bin/ruby
smaps = File.read("/proc/#{Process.ppid}/smaps")
sz = smaps.scan(/^Size:\s+(\d+) kB/).map{|x| x[0].to_i}.inject{|a,b| a+b}
rss = smaps.scan(/^Rss:\s+(\d+) kB/).map{|x| x[0].to_i}.inject{|a,b| a+b}
puts "Memory use: #{sz} kB"
puts "Rss: #{rss} kB"
Judy
Judy is an efficient sparse array library. It implements a few data structures, of which the most relevant is
JudyL
- a sparse array with word (32 or 64 bit) indexes and values. To use
JudyL
as replacement for
st_num_array
and new implementation of
iv_tbl
,
m_tbl
etc., its indexes will be interpretted as integers, and its keys as
VALUEs (arbitrary Ruby objects).
The replacement is pretty straightforward. There are two minor technical issues:
JudyL
assumes there's only one pointer to it, so it can reallocate itself without complicating the API. The reallocating macros simply modify the passed pointer. To quickly retrofit Ruby with Judy, I used struct st_num_table
containing just this pointer. It should be simple to get rid of this extra level of indirection later.- Unlike
st_num_table
, JudyL
doesn't store number of elements. A lot of code assumes number of elements is easily accessible, and calls slow st_num_count_entries()
even if it's just to check whether the table is empty or not. I replaced such calls by st_num_is_empty()
and other faster calls.
Here's a
patch against
Ruby 1.8.5-p12. It contains all optimizations:
rb_call0
,
Fixnum#equal
, and Judy (and separate
st_num_table
required by it).
Results
Everything was run with CFLAS
-O3 -march=athlon-xp
. As someone pointed out, Ruby GC becomes unstable with
-fomit-frame-pointer
and
-mtune
is redundant.
Numbers represent overhead compared to the most efficient implementation, in terms of memory:
| Plain 1.8.5-p12 | rb_call0 | rb_call0 + st_num_table | rb_call0 + Judy |
---|
app_answer | +1% | +1% | +1% | +0% |
---|
app_fib | +2% | +1% | +2% | +0% |
---|
app_mandelbrot | +13% | +14% | +14% | +0% |
app_raise | +6% | +7% | +6% | +0% |
---|
app_strconcat | +0% | +0% | +1% | +0% |
app_tak | +2% | +1% | +2% | +0% |
---|
app_tarai | +1% | +1% | +1% | +0% |
---|
and time:
| Plain 1.8.5-p12 | rb_call0 | rb_call0 + st_num_table | rb_call0 + Judy |
---|
app_answer | +1% | +1% | +0% | +0% |
---|
app_fib | +2% | +0% | +1% | +1% |
---|
app_mandelbrot | +7% | +5% | +4% | +0% |
---|
app_raise | +6% | +5% | +0% | +5% |
---|
app_strconcat | +4% | +3% | +0% | +2% |
---|
app_tak | +1% | +0% | +1% | +1% |
---|
app_tarai | +0% | +0% | +0% | +0% |
---|
The only benchmark showing significant improvement is
app_mandelbrot
. It's also the only one that uses anything object-oriented (
require 'complex'
), others jush crunch numbers, concatenate strings and raise exceptions in loops. It's no wonder they show little improvement.
I thought that real Ruby applications would make many more
iv_tbl
/
m_tbl
operations than the benchmarks. The only problem was finding better benchmarks. For no particular reason I took Ruby unit tests (ones that get run if you do
make test-all
) as a "representative" app.
The final results:
| Plain 1.8.5-p12 | rb_call0 | rb_call0 + st_num_table | rb_call0 + Judy |
---|
CPU | +5% | +2% | +2% | +0% |
Memory | +4% | +4% | +4% | +0% |
Conclusions
I'm not really sure how "representative" this benchmark is. All performance improvements so far resulted in Ruby being about 5% faster and taking 4% less memory. Performance of optimized subsystems got significantly improved, but the true bottleneck seems to be somewhere else - in slow
rb_eval
and related functions.
It's up to Ruby maintainers whether they want to use my patch or not. On the one hand - performance improvement is not negligible and if they ever speed up
rb_eval
,
st_num_table
will need to be dealt with too. On the other hand, introducing dependency on just another library (Judy) is something more developers learn to appoach with caution. The code passes all unit tests, but it still requires more testing (including performance testing on different architectures) and significant clean-up.
The light-gray text color for code samples makes the article hard to read.
ReplyDeleteLast summer I had a play with trying to fix call dispatch to use a monomorphic inline cache, but ran out of time and C knowledge.
ReplyDeleteIt looked as though the biggest problem was that there's not quite enough room in a Ruby opnode to stash the required information, but it also looked as if it wouldn't be too much work to extend the node structure.
Is this something you've looked into?
Piers Cawley: I didn't consider this direction. Right now I think that to get Ruby perform much beter, the best first action would be coding better ways of measuring performance.
ReplyDeleteRight now benchmark are slow, very far from being representative, have high standard deviations of results, and many outliers (I reject highest and lowest 20% to obtain more robust results), and the result is just a single number without obvious interpretation.
We also need good correctness tests. When I ported Judy to Ruby, the only correctness measurement tool I had was running make test-all and looking whether it segfaults or not. Static typing regimen of Ruby interpreter wasn't high enough, and there's no such thing as unit tests for segfaulting.
It seems simple and promising to take Ruby NODE tree, convert it to something with simpler operations and simpler control flow (basically VM "bytecode"), and compile it to native bytcode by plain memcpy. So @foo = @bar would actually do tmp = st_num_fetch(self_iv_tbl, :@foo); st_num_store(self_iv_tbl, :@bar, tmp) instead of doing all the extra work. Supporting flexible C APIs makes writing extensions easier, but it also makes them inherenly slow. And as Ruby stdlib is implemented as C extenstion, it make Ruby inherently slow (compared to about equally dynamic Smalltalk). It would be much more efficient to define just one or two function call formats, and implement others as macros or ABI helper functions.
And better caching would be useful no matter what.
"It seems simple and promising to take Ruby NODE tree, convert it to something with simpler operations and simpler control flow"
ReplyDelete>> this is what YARV does (getting rid of rb_eval and the tree walking).
mfp: Yes, this is pretty much what YARV does. But I think it would be far easier to only change that without touching anything else, instead of recreating major part of the interpretter from scratch like YARV.
ReplyDeleteIt's atomic coding + continuous refactoring vs waterfall-style total rewrite thing.
nice thanks for that patch! Now onto fixing the 'real bottleneck' whatever that is...
ReplyDelete