The best kittens, technology, and video games blog in the world.

Tuesday, February 27, 2007

Ruby and Judy

DUCK AND COVER! by jiva from flickr (CC-NC)

A few days ago I tried to make Ruby a bit faster. Optimizing Fixnum operations and tweaking rb_call0 seemed pretty straighforward, but specializing st_num_table must have seemed pretty weird.

Every non-trivial program uses a lot of collections. Collections must basically support the following operations:

  • Getting element by key (Collection#[])
  • Setting element by key (Collection#[]=)
  • Iterating over all elements (Collection#each)
  • all kinds of collection-specific and helper methods.


In many ways, a single collection type is enough for a language. PHP has only array. Perl has arrays and hashes, but hashes can be indexed only by strings - that's a very serious limitation. Ruby also has two kinds of collections - Arrays and Hashes. Ruby Hashes are indexed by arbitrary objects, so they could in principle replace Arrays, at cost of some API changes, any possibly somewhat inferior performance (with good implementation, the difference wouldn't be high).

Below the surface, Ruby uses many more collection kinds:
  • RArray - dense array indexed by small positive integers
  • st_table - a general purpose hash, which is specialized into three kinds:
  • * st_num_table - indexed by numbers
  • * st_str_table - indexed by strings
  • * st_obj_table - indexed by arbitrary Ruby objects


The specialization is done in a very inefficient way - using hand-coded vtable:

 struct st_hash_type {
int (*compare)();
int (*hash)();
};

struct st_table {
struct st_hash_type *type;
int num_bins;
int num_entries;
struct st_table_entry **bins;
};


st_obj_table is used in implementation of Hashes. It isn't very efficient, calling #eql? method to check object equality, but a few special cases let it handle Fixnums and Strings more efficiently.

Because of st_obj_table's extra performance cost, and possibly also for API reasons, Ruby interpretter rarely uses it internally. It uses st_str_table quite a bit (mostly in uninteresting places), and st_num_table a lot.

Why st_num_table instead of RArray ? Basically because st_num_table is sparse. When Java sees the following code:

double do_stuff(Complex x) {
return x.real + x.imaginary;
}


It knows Complex has two fields:

protected double imaginary;
protected double real;

so the code get compiled to something like:
double do_stuff(Complex x) {
return get_field(x, 1) + get_field(x, 0);
}


It's more difficult for duck-typed languages like Ruby. They don't even know internal representation of self, so the code:
def do_stuff
return @real + @image
end


gets compiled to something like:
def do_stuff
return get_field(self, real_id) + get_field(self, image_id)
end


where real_id and image_id equal :@real.to_i and :@image.to_i. In a particular session of irb I tried they were 26314 and 26322:

irb> a = Complex.new(1.0, 2.0)
=> Complex(1.0, 2.0)
irb> a.instance_variables
=> ["@image", "@real"]
irb> a.instance_variables.map{|s| s.to_sym.to_i}
=> [26314, 26322]
irb> a.instance_variable_get(:@real)
=> 1.0
irb> a.instance_variable_get(26322.to_sym)
=> 2.0


Ruby method calls work in a very similar way. So Ruby needs efficient collections indexed by small integers. Obviously, they need to be sparse - Array containing something at 26314 and 26322 would allocate memory for all elements from 0 to 26322. That's clearly not acceptable. Hash tables like st_num_table don't have this kind of overhead, so they work much better. However, st_num_table is used very often in Ruby, and hash tables have pretty high constant overhead.

To demonstrate this I wrote an instance variable table printer (works with 1.8.5):
require 'complex'
require 'dl'
require 'dl/struct'

include DL::Importable

# Some bits copied from evil.rb

typealias "VALUE", nil, nil, nil, "unsigned long"

Basic = ["long flags", "VALUE klass"]
RBasic = struct Basic

RObject = struct(Basic + [
"st_table *iv_tbl"
])

RClass = struct(Basic + [
"st_table *iv_tbl",
"st_table *m_tbl",
"VALUE super"
])

St_table = struct([
"st_hash_type *type",
"int num_bins",
"int num_entries",
"st_table_entry **bins",
])

St_table_entry = struct([
"uint hash",
"long key",
"VALUE record",
#"st_table_entry *next",
"long next",
])

def value_to_obj(value)
return false if value == 0
return true if value == 2
return nil if value == 4
# 6 is undef placeholder

if (value & 3) == 0
# Object
ObjectSpace._id2ref((value - 2**32)/2)
elsif (value & 3) == 2
# Symbols
(value >> 8).to_sym
else
# Integers
return ObjectSpace._id2ref(value)
end
end

def print_object_iv_tbl(obj)
puts "IV Table of object #{obj.inspect}"
addr = obj.object_id * 2
obj_data = RObject.new(DL::PtrData.new(addr))
puts "Flags = #{obj_data.flags}"
puts "Class = #{value_to_obj(obj_data.klass).inspect}"
iv_tbl = St_table.new(obj_data.iv_tbl)
bins = iv_tbl.bins
num_bins = iv_tbl.num_bins

# x86-ish, assumes normal Object implemented as RObject, nothing fancy
object_memory_use = 12 + 4 * num_bins

puts "IV_TBL:"
puts "* Bins: #{num_bins}"
puts "* Entries: #{iv_tbl.num_entries}"

bins.size = num_bins * 4
bins.to_str.unpack("L*").each_with_index{|addr, i|
if addr == 0
puts "* Bin #{i}: empty"
else
puts "* Bin #{i}:"
while addr != 0
object_memory_use += 16
table_entry = St_table_entry.new(DL::PtrData.new(addr))
puts " * Entry"
puts " * Hash: #{table_entry.hash}"
puts " * Key: #{table_entry.key.to_sym}"
puts " * Record: #{value_to_obj(table_entry.record).inspect}"
addr = table_entry.next
end
end
}
puts "Memory use: #{object_memory_use} bytes"
end

a = Complex.new(1.0, 2.0)

print_object_iv_tbl(a)


which reports:
IV Table of object Complex(1.0, 2.0)
Flags = 2
Class = Complex
IV_TBL:
* Bins: 11
* Entries: 2
* Bin 0:
* Entry
* Hash: 10802
* Key: @real
* Record: 1.0
* Bin 1: empty
* Bin 2: empty
* Bin 3: empty
* Bin 4: empty
* Bin 5: empty
* Bin 6: empty
* Bin 7: empty
* Bin 8:
* Entry
* Hash: 10810
* Key: @image
* Record: 2.0
* Bin 9: empty
* Bin 10: empty
Memory use: 88 bytes


That's 88 bytes for the object + 2*16 more for storing contents of its fields (Floats take only 16 bytes as they don't have own iv_tbls). 120 bytes instead of 16 needed for storing just two doubles.

Benchmarking memory usage


One problem with benchmarking under Linux is lack of easy way of measuring memory use. Standard Unix /usr/bin/time tries to measure it, but as Linux kernel doesn't set the relevant fields of struct rusage after call to wait4, it always reports 0 kB used. Since 2.6.16 Linux at least provides /proc/<process_id>/smaps.

The only thing needed is a convenient wrapper like time for measuring execution time. Live monitoring would be complicated, but as most programs do not free any memory before exiting (normally free()d or GC-ed memory returns to the memory allocator, but not to the operating system), it's a good enough to measure at exit.

The simplest way of making program do something before exiting is LD_PRELOADing a shared library, and registering a destructor for it. Parsing text file /proc/<process_id>/smaps in C would be sick, so the library simply spawns a Ruby program which does all the work.

Process.ppid is pid of parent process, so /proc/#{Process.ppid}/smaps is file containing parent process's memory use information. system("exec ./memory_benchmarker.rb") is necessary as libc system("./memory_benchmarker.rb") would spawn /bin/sh, which would in turn spawn ./memory_benchmarker.rb, and Process.ppid would return shell's pid, not the actual process's.

So the Makefile:
memory_benchmarker.so: memory_benchmarker.c
gcc -Wall -W -fPIC -shared memory_benchmarker.c -o memory_benchmarker.so


memory_benchmarker.c:
#include <stdlib.h>
#include <unistd.h>

void memory_benchmarker_fini() __attribute__ ((destructor));
void memory_benchmarker_fini()
{
unsetenv("LD_PRELOAD");
system("exec ./memory_benchmarker.rb");
}

/* If you don't want to trace memory use of child process, unset LD_PRELOAD here.
Unfortunately doing so also affects exec()ed process.
void memory_benchmarker_init() __attribute__ ((constructor));
void memory_benchmarker_init()
{
unsetenv("LD_PRELOAD");
}
*/


and memory_benchmarker.rb:
#!/usr/bin/ruby

smaps = File.read("/proc/#{Process.ppid}/smaps")
sz = smaps.scan(/^Size:\s+(\d+) kB/).map{|x| x[0].to_i}.inject{|a,b| a+b}
rss = smaps.scan(/^Rss:\s+(\d+) kB/).map{|x| x[0].to_i}.inject{|a,b| a+b}
puts "Memory use: #{sz} kB"
puts "Rss: #{rss} kB"


Judy


Judy is an efficient sparse array library. It implements a few data structures, of which the most relevant is JudyL - a sparse array with word (32 or 64 bit) indexes and values. To use JudyL as replacement for st_num_array and new implementation of iv_tbl, m_tbl etc., its indexes will be interpretted as integers, and its keys as VALUEs (arbitrary Ruby objects).

The replacement is pretty straightforward. There are two minor technical issues:
  • JudyL assumes there's only one pointer to it, so it can reallocate itself without complicating the API. The reallocating macros simply modify the passed pointer. To quickly retrofit Ruby with Judy, I used struct st_num_table containing just this pointer. It should be simple to get rid of this extra level of indirection later.
  • Unlike st_num_table, JudyL doesn't store number of elements. A lot of code assumes number of elements is easily accessible, and calls slow st_num_count_entries() even if it's just to check whether the table is empty or not. I replaced such calls by st_num_is_empty() and other faster calls.


Here's a patch against Ruby 1.8.5-p12. It contains all optimizations: rb_call0, Fixnum#equal, and Judy (and separate st_num_table required by it).

Results


Everything was run with CFLAS -O3 -march=athlon-xp. As someone pointed out, Ruby GC becomes unstable with -fomit-frame-pointer and -mtune is redundant.

Numbers represent overhead compared to the most efficient implementation, in terms of memory:


Plain 1.8.5-p12rb_call0rb_call0 + st_num_tablerb_call0 + Judy
app_answer+1%+1%+1%+0%
app_fib+2%+1%+2%+0%
app_mandelbrot+13%+14%+14%+0%
app_raise+6%+7%+6%+0%
app_strconcat+0%+0%+1%+0%
app_tak+2%+1%+2%+0%
app_tarai+1%+1%+1%+0%


and time:


Plain 1.8.5-p12rb_call0rb_call0 + st_num_tablerb_call0 + Judy
app_answer+1%+1%+0%+0%
app_fib+2%+0%+1%+1%
app_mandelbrot+7%+5%+4%+0%
app_raise+6%+5%+0%+5%
app_strconcat+4%+3%+0%+2%
app_tak+1%+0%+1%+1%
app_tarai+0%+0%+0%+0%


The only benchmark showing significant improvement is app_mandelbrot. It's also the only one that uses anything object-oriented (require 'complex'), others jush crunch numbers, concatenate strings and raise exceptions in loops. It's no wonder they show little improvement.

I thought that real Ruby applications would make many more iv_tbl/m_tbl operations than the benchmarks. The only problem was finding better benchmarks. For no particular reason I took Ruby unit tests (ones that get run if you do make test-all) as a "representative" app.

The final results:


Plain 1.8.5-p12rb_call0rb_call0 + st_num_tablerb_call0 + Judy
CPU+5%+2%+2%+0%
Memory+4%+4%+4%+0%


Conclusions


I'm not really sure how "representative" this benchmark is. All performance improvements so far resulted in Ruby being about 5% faster and taking 4% less memory. Performance of optimized subsystems got significantly improved, but the true bottleneck seems to be somewhere else - in slow rb_eval and related functions.

It's up to Ruby maintainers whether they want to use my patch or not. On the one hand - performance improvement is not negligible and if they ever speed up rb_eval, st_num_table will need to be dealt with too. On the other hand, introducing dependency on just another library (Judy) is something more developers learn to appoach with caution. The code passes all unit tests, but it still requires more testing (including performance testing on different architectures) and significant clean-up.

6 comments:

Anonymous said...

The light-gray text color for code samples makes the article hard to read.

Piers Cawley said...

Last summer I had a play with trying to fix call dispatch to use a monomorphic inline cache, but ran out of time and C knowledge.

It looked as though the biggest problem was that there's not quite enough room in a Ruby opnode to stash the required information, but it also looked as if it wouldn't be too much work to extend the node structure.

Is this something you've looked into?

taw said...

Piers Cawley: I didn't consider this direction. Right now I think that to get Ruby perform much beter, the best first action would be coding better ways of measuring performance.

Right now benchmark are slow, very far from being representative, have high standard deviations of results, and many outliers (I reject highest and lowest 20% to obtain more robust results), and the result is just a single number without obvious interpretation.

We also need good correctness tests. When I ported Judy to Ruby, the only correctness measurement tool I had was running make test-all and looking whether it segfaults or not. Static typing regimen of Ruby interpreter wasn't high enough, and there's no such thing as unit tests for segfaulting.

It seems simple and promising to take Ruby NODE tree, convert it to something with simpler operations and simpler control flow (basically VM "bytecode"), and compile it to native bytcode by plain memcpy. So @foo = @bar would actually do tmp = st_num_fetch(self_iv_tbl, :@foo); st_num_store(self_iv_tbl, :@bar, tmp) instead of doing all the extra work. Supporting flexible C APIs makes writing extensions easier, but it also makes them inherenly slow. And as Ruby stdlib is implemented as C extenstion, it make Ruby inherently slow (compared to about equally dynamic Smalltalk). It would be much more efficient to define just one or two function call formats, and implement others as macros or ABI helper functions.

And better caching would be useful no matter what.

mfp said...

"It seems simple and promising to take Ruby NODE tree, convert it to something with simpler operations and simpler control flow"

>> this is what YARV does (getting rid of rb_eval and the tree walking).

taw said...

mfp: Yes, this is pretty much what YARV does. But I think it would be far easier to only change that without touching anything else, instead of recreating major part of the interpretter from scratch like YARV.

It's atomic coding + continuous refactoring vs waterfall-style total rewrite thing.

Roger Pack said...

nice thanks for that patch! Now onto fixing the 'real bottleneck' whatever that is...