
A few days ago I tried to make Ruby a bit faster. Optimizing Fixnum operations and tweaking rb_call0 seemed pretty straighforward, but specializing st_num_table must have seemed pretty weird.
Every non-trivial program uses a lot of collections. Collections must basically support the following operations:
- Getting element by key (
Collection#[]) - Setting element by key (
Collection#[]=) - Iterating over all elements (
Collection#each) - all kinds of collection-specific and helper methods.
In many ways, a single collection type is enough for a language. PHP has only
array. Perl has arrays and hashes, but hashes can be indexed only by strings - that's a very serious limitation. Ruby also has two kinds of collections - Arrays and Hashes. Ruby Hashes are indexed by arbitrary objects, so they could in principle replace Arrays, at cost of some API changes, any possibly somewhat inferior performance (with good implementation, the difference wouldn't be high).Below the surface, Ruby uses many more collection kinds:
RArray- dense array indexed by small positive integersst_table- a general purpose hash, which is specialized into three kinds:- *
st_num_table- indexed by numbers - *
st_str_table- indexed by strings - *
st_obj_table- indexed by arbitrary Ruby objects
The specialization is done in a very inefficient way - using hand-coded vtable:
struct st_hash_type {
int (*compare)();
int (*hash)();
};
struct st_table {
struct st_hash_type *type;
int num_bins;
int num_entries;
struct st_table_entry **bins;
};st_obj_table is used in implementation of Hashes. It isn't very efficient, calling #eql? method to check object equality, but a few special cases let it handle Fixnums and Strings more efficiently.Because of
st_obj_table's extra performance cost, and possibly also for API reasons, Ruby interpretter rarely uses it internally. It uses st_str_table quite a bit (mostly in uninteresting places), and st_num_table a lot.Why
st_num_table instead of RArray ? Basically because st_num_table is sparse. When Java sees the following code:double do_stuff(Complex x) {
return x.real + x.imaginary;
}It knows
Complex has two fields:
protected double imaginary;
protected double real;
so the code get compiled to something like:
double do_stuff(Complex x) {
return get_field(x, 1) + get_field(x, 0);
}It's more difficult for duck-typed languages like Ruby. They don't even know internal representation of
self, so the code:def do_stuff
return @real + @image
end
gets compiled to something like:
def do_stuff
return get_field(self, real_id) + get_field(self, image_id)
end
where
real_id and image_id equal :@real.to_i and :@image.to_i. In a particular session of irb I tried they were 26314 and 26322:irb> a = Complex.new(1.0, 2.0)
=> Complex(1.0, 2.0)
irb> a.instance_variables
=> ["@image", "@real"]
irb> a.instance_variables.map{|s| s.to_sym.to_i}
=> [26314, 26322]
irb> a.instance_variable_get(:@real)
=> 1.0
irb> a.instance_variable_get(26322.to_sym)
=> 2.0
Ruby method calls work in a very similar way. So Ruby needs efficient collections indexed by small integers. Obviously, they need to be sparse -
Array containing something at 26314 and 26322 would allocate memory for all elements from 0 to 26322. That's clearly not acceptable. Hash tables like st_num_table don't have this kind of overhead, so they work much better. However, st_num_table is used very often in Ruby, and hash tables have pretty high constant overhead.To demonstrate this I wrote an instance variable table printer (works with 1.8.5):
require 'complex'
require 'dl'
require 'dl/struct'
include DL::Importable
# Some bits copied from evil.rb
typealias "VALUE", nil, nil, nil, "unsigned long"
Basic = ["long flags", "VALUE klass"]
RBasic = struct Basic
RObject = struct(Basic + [
"st_table *iv_tbl"
])
RClass = struct(Basic + [
"st_table *iv_tbl",
"st_table *m_tbl",
"VALUE super"
])
St_table = struct([
"st_hash_type *type",
"int num_bins",
"int num_entries",
"st_table_entry **bins",
])
St_table_entry = struct([
"uint hash",
"long key",
"VALUE record",
#"st_table_entry *next",
"long next",
])
def value_to_obj(value)
return false if value == 0
return true if value == 2
return nil if value == 4
# 6 is undef placeholder
if (value & 3) == 0
# Object
ObjectSpace._id2ref((value - 2**32)/2)
elsif (value & 3) == 2
# Symbols
(value >> 8).to_sym
else
# Integers
return ObjectSpace._id2ref(value)
end
end
def print_object_iv_tbl(obj)
puts "IV Table of object #{obj.inspect}"
addr = obj.object_id * 2
obj_data = RObject.new(DL::PtrData.new(addr))
puts "Flags = #{obj_data.flags}"
puts "Class = #{value_to_obj(obj_data.klass).inspect}"
iv_tbl = St_table.new(obj_data.iv_tbl)
bins = iv_tbl.bins
num_bins = iv_tbl.num_bins
# x86-ish, assumes normal Object implemented as RObject, nothing fancy
object_memory_use = 12 + 4 * num_bins
puts "IV_TBL:"
puts "* Bins: #{num_bins}"
puts "* Entries: #{iv_tbl.num_entries}"
bins.size = num_bins * 4
bins.to_str.unpack("L*").each_with_index{|addr, i|
if addr == 0
puts "* Bin #{i}: empty"
else
puts "* Bin #{i}:"
while addr != 0
object_memory_use += 16
table_entry = St_table_entry.new(DL::PtrData.new(addr))
puts " * Entry"
puts " * Hash: #{table_entry.hash}"
puts " * Key: #{table_entry.key.to_sym}"
puts " * Record: #{value_to_obj(table_entry.record).inspect}"
addr = table_entry.next
end
end
}
puts "Memory use: #{object_memory_use} bytes"
end
a = Complex.new(1.0, 2.0)
print_object_iv_tbl(a)
which reports:
IV Table of object Complex(1.0, 2.0)
Flags = 2
Class = Complex
IV_TBL:
* Bins: 11
* Entries: 2
* Bin 0:
* Entry
* Hash: 10802
* Key: @real
* Record: 1.0
* Bin 1: empty
* Bin 2: empty
* Bin 3: empty
* Bin 4: empty
* Bin 5: empty
* Bin 6: empty
* Bin 7: empty
* Bin 8:
* Entry
* Hash: 10810
* Key: @image
* Record: 2.0
* Bin 9: empty
* Bin 10: empty
Memory use: 88 bytes
That's 88 bytes for the object + 2*16 more for storing contents of its fields (Floats take only 16 bytes as they don't have own
iv_tbls). 120 bytes instead of 16 needed for storing just two doubles.Benchmarking memory usage
One problem with benchmarking under Linux is lack of easy way of measuring memory use. Standard Unix
/usr/bin/time tries to measure it, but as Linux kernel doesn't set the relevant fields of struct rusage after call to wait4, it always reports 0 kB used. Since 2.6.16 Linux at least provides /proc/<process_id>/smaps.The only thing needed is a convenient wrapper like
time for measuring execution time. Live monitoring would be complicated, but as most programs do not free any memory before exiting (normally free()d or GC-ed memory returns to the memory allocator, but not to the operating system), it's a good enough to measure at exit.The simplest way of making program do something before exiting is
LD_PRELOADing a shared library, and registering a destructor for it. Parsing text file /proc/<process_id>/smaps in C would be sick, so the library simply spawns a Ruby program which does all the work.Process.ppid is pid of parent process, so /proc/#{Process.ppid}/smaps is file containing parent process's memory use information. system("exec ./memory_benchmarker.rb") is necessary as libc system("./memory_benchmarker.rb") would spawn /bin/sh, which would in turn spawn ./memory_benchmarker.rb, and Process.ppid would return shell's pid, not the actual process's.So the
Makefile:memory_benchmarker.so: memory_benchmarker.c
gcc -Wall -W -fPIC -shared memory_benchmarker.c -o memory_benchmarker.so
memory_benchmarker.c:#include <stdlib.h>
#include <unistd.h>
void memory_benchmarker_fini() __attribute__ ((destructor));
void memory_benchmarker_fini()
{
unsetenv("LD_PRELOAD");
system("exec ./memory_benchmarker.rb");
}
/* If you don't want to trace memory use of child process, unset LD_PRELOAD here.
Unfortunately doing so also affects exec()ed process.
void memory_benchmarker_init() __attribute__ ((constructor));
void memory_benchmarker_init()
{
unsetenv("LD_PRELOAD");
}
*/
and
memory_benchmarker.rb:#!/usr/bin/ruby
smaps = File.read("/proc/#{Process.ppid}/smaps")
sz = smaps.scan(/^Size:\s+(\d+) kB/).map{|x| x[0].to_i}.inject{|a,b| a+b}
rss = smaps.scan(/^Rss:\s+(\d+) kB/).map{|x| x[0].to_i}.inject{|a,b| a+b}
puts "Memory use: #{sz} kB"
puts "Rss: #{rss} kB"
Judy
Judy is an efficient sparse array library. It implements a few data structures, of which the most relevant is
JudyL - a sparse array with word (32 or 64 bit) indexes and values. To use JudyL as replacement for st_num_array and new implementation of iv_tbl, m_tbl etc., its indexes will be interpretted as integers, and its keys as VALUEs (arbitrary Ruby objects).
The replacement is pretty straightforward. There are two minor technical issues:JudyLassumes there's only one pointer to it, so it can reallocate itself without complicating the API. The reallocating macros simply modify the passed pointer. To quickly retrofit Ruby with Judy, I usedstruct st_num_tablecontaining just this pointer. It should be simple to get rid of this extra level of indirection later.- Unlike
st_num_table,JudyLdoesn't store number of elements. A lot of code assumes number of elements is easily accessible, and calls slowst_num_count_entries()even if it's just to check whether the table is empty or not. I replaced such calls byst_num_is_empty()and other faster calls.
Here's a patch against Ruby 1.8.5-p12. It contains all optimizations:
rb_call0, Fixnum#equal, and Judy (and separate st_num_table required by it).Results
Everything was run with CFLAS
-O3 -march=athlon-xp. As someone pointed out, Ruby GC becomes unstable with -fomit-frame-pointer and -mtune is redundant.Numbers represent overhead compared to the most efficient implementation, in terms of memory:
| Plain 1.8.5-p12 | rb_call0 | rb_call0 + st_num_table | rb_call0 + Judy | |
|---|---|---|---|---|
| app_answer | +1% | +1% | +1% | +0% |
| app_fib | +2% | +1% | +2% | +0% |
| app_mandelbrot | +13% | +14% | +14% | +0% |
| app_raise | +6% | +7% | +6% | +0% |
| app_strconcat | +0% | +0% | +1% | +0% |
| app_tak | +2% | +1% | +2% | +0% |
| app_tarai | +1% | +1% | +1% | +0% |
and time:
| Plain 1.8.5-p12 | rb_call0 | rb_call0 + st_num_table | rb_call0 + Judy | |
|---|---|---|---|---|
| app_answer | +1% | +1% | +0% | +0% |
| app_fib | +2% | +0% | +1% | +1% |
| app_mandelbrot | +7% | +5% | +4% | +0% |
| app_raise | +6% | +5% | +0% | +5% |
| app_strconcat | +4% | +3% | +0% | +2% |
| app_tak | +1% | +0% | +1% | +1% |
| app_tarai | +0% | +0% | +0% | +0% |
The only benchmark showing significant improvement is
app_mandelbrot. It's also the only one that uses anything object-oriented (require 'complex'), others jush crunch numbers, concatenate strings and raise exceptions in loops. It's no wonder they show little improvement.I thought that real Ruby applications would make many more
iv_tbl/m_tbl operations than the benchmarks. The only problem was finding better benchmarks. For no particular reason I took Ruby unit tests (ones that get run if you do make test-all) as a "representative" app.The final results:
| Plain 1.8.5-p12 | rb_call0 | rb_call0 + st_num_table | rb_call0 + Judy | |
|---|---|---|---|---|
| CPU | +5% | +2% | +2% | +0% |
| Memory | +4% | +4% | +4% | +0% |
Conclusions
I'm not really sure how "representative" this benchmark is. All performance improvements so far resulted in Ruby being about 5% faster and taking 4% less memory. Performance of optimized subsystems got significantly improved, but the true bottleneck seems to be somewhere else - in slow
rb_eval and related functions.It's up to Ruby maintainers whether they want to use my patch or not. On the one hand - performance improvement is not negligible and if they ever speed up
rb_eval, st_num_table will need to be dealt with too. On the other hand, introducing dependency on just another library (Judy) is something more developers learn to appoach with caution. The code passes all unit tests, but it still requires more testing (including performance testing on different architectures) and significant clean-up.






