The best kittens, technology, and video games blog in the world.

Tuesday, June 05, 2007

Mock C backend for RLisp and its performance

!!!!the fifth element!!!! by gallmese from flickr (CC-NC-SA)
When RLisp was first created it was purely interpreted, with an interpreter written in Ruby. It was about 50x slower than Ruby. Then it went through unsuccessful stage of mixed interpreted/compiled-to-Ruby mode, finally getting a real compiler with Ruby back-end. That's where it is now, and it is about 5x slower than Ruby.

The code generated by RLisp compiler doesn't need Ruby power, it just needs a few simple operations and access to Ruby runtime. That is - it could be RubyC code for matz's Ruby, or Java code for JRuby. If using C/Java compiler was too slow, the RLisp compiler could generate LLVM, x86 asm, or JVM code directly. It would be a little bit slower than going through C/Java, as we wouldn't run the optimizations, but the overhead is in Ruby runtime anyway, and I doubt the difference would be more than a few percent.

Before implementing the RubyC backend I wanted to measure how fast would it be and what would it look like. For that I used the Ackermann benchmark from the Great Programming Language Shootout, which Ruby by the way drastically loses. Ruby and RLisp implementations looks like that:

def ack(m, n)
if m == 0 then
n + 1
elsif n == 0 then
ack(m - 1, 1)
else
ack(m - 1, ack(m, n - 1))
end
end

NUM = Integer(ARGV.shift || 1)
print "Ack(3,", NUM, "): ", ack(3, NUM), "\n"

(defun ack (m n)
(cond (eq? m 0) (+ n 1)
(eq? n 0) (ack (- m 1) 1)
(ack (- m 1) (ack m (- n 1)))
)
)
(let num [[ARGV shift] to_i])
(print "Ack(3," num "): " (ack 3 num))

Compiled RLisp code, after some reformatting or readability:
module ::RLispC7054184
def self.ack(globals)
Proc.new do |*args|
m, n, = *args
t2 = m.send(:==, 0)
if t2
t4 = n.send(:+, 1)
t3 = t4
else
t5 = n.send(:==, 0)
if t5
t7 = globals[:ack]
t8 = m.send(:-, 1)
t9 = t7.call(t8, 1)
t6 = t9
else
t10 = globals[:ack]
t11 = m.send(:-, 1)
t12 = globals[:ack]
t13 = n.send(:-, 1)
t14 = t12.call(m, t13)
t15 = t10.call(t11, t14)
t6 = t15
end
t3 = t6
end
t3
end
end
end
t16 = ::RLispC7054184::ack(globals)
globals[:ack] = t16

t17 = ::ARGV.send(:shift)
t18 = t17.send(:to_i)
globals[:num] = t18
t19 = globals[:print]
t20 = globals[:num]
t21 = globals[:ack]
t22 = globals[:num]
t23 = t21.call(3, t22)
t24 = t19.call("Ack(3,", t20, "): ", t23)

It should be pretty straightforward globals is a Hash containing RLisp global variables, the three-level module/def/Proc.new procedure creation is used to
  • Make sure procedures with the same name won't accidentally overwrite each other.
  • Make sure stack backtraces contain name ack (instead of ack_CHECKSUM.
  • To build a closure if necessary. As ack is global the only captured variable is globals.


I compiled the ack function to RubyC by hand. It follows generated Ruby code statement-by-statement, except that I didn't bother compiling things outside the ack function, which are executed only once and don't affect performance anyway. So the code consists of two files, Ruby file ackermann_driver.rb:
require 'ackermann'

globals = {}

closure_ack = RLispCompiledCode.build_closure_ack_CHECKSUM(globals)
globals[:ack] = lambda{|*args| RLispCompiledCode.ack_CHECKSUM(closure_ack, self, args, nil)}

num = (ARGV.shift || 1).to_i
res = globals[:ack].call(3, num)
print "Ack(3,", num, "): ", res, "\n"

and RubyC file ackermann.c:
#include <ruby.h>
#include <intern.h>

int ID_EQEQ, ID_PLUS, ID_MINUS, ID_call;
VALUE SYM_ack, SYM_globals;

static VALUE ack_CHECKSUM(VALUE rlisp_compiled_module, VALUE closure, VALUE self, VALUE args, VALUE block_arg)
{
VALUE globals, m, n, t2, t3, t4, t5, t6, t7, t8, t9, t10, t11, t12, t13, t14, t15;

globals = rb_hash_aref(closure, SYM_globals);

m = rb_ary_entry(args, 0);
n = rb_ary_entry(args, 1);

t2 = rb_funcall(m, ID_EQEQ, 1, INT2FIX(0));

if(t2) {
t4 = rb_funcall(n, ID_PLUS, 1, INT2FIX(1));
t3 = t4;
} else {
t5 = rb_funcall(n, ID_EQEQ, 1, INT2FIX(0));
if(t5) {
t7 = rb_hash_aref(globals, SYM_ack);
t8 = rb_funcall(m, ID_MINUS, 1, INT2FIX(1));
t9 = rb_funcall(t7, ID_call, 2, t8, INT2FIX(1));
t6 = t9;
} else {
t10 = rb_hash_aref(globals, SYM_ack);
t11 = rb_funcall(m, ID_MINUS, 1, INT2FIX(1));
t12 = rb_hash_aref(globals, SYM_ack);
t13 = rb_funcall(n, ID_MINUS, 1, INT2FIX(1));
t14 = rb_funcall(t12, ID_call, 2, m, t13);
t15 = rb_funcall(t10, ID_call, 2, t11, t14);
t6 = t15;
}
t3 = t6;
}
return t3;
}

static VALUE build_closure_ack_CHECKSUM(VALUE rlisp_compiled_module, VALUE globals)
{
VALUE closure = rb_hash_new();
rb_hash_aset(closure, SYM_globals, globals);
return closure;
}

void Init_ackermann()
{
VALUE mRLispCompiledCode;
ID_EQEQ = rb_intern("==");
ID_PLUS = rb_intern("+");
ID_MINUS = rb_intern("-");
ID_call = rb_intern("call");
SYM_ack = ID2SYM(rb_intern("ack"));
SYM_globals = ID2SYM(rb_intern("globals"));

mRLispCompiledCode = rb_define_module("RLispCompiledCode");
rb_define_module_function(mRLispCompiledCode, "ack_CHECKSUM", ack_CHECKSUM, 4);
rb_define_module_function(mRLispCompiledCode, "build_closure_ack_CHECKSUM", build_closure_ack_CHECKSUM, 1);
}


The code is so scary that I don't even know where to start with explanations. The huge problem is that the compiled function ack_CHECKSUM needs to take closure argument somehow, but Ruby is not going to provide it. So we need an ugly wrapper globals[:ack] = lambda{...} in the driver.

As it turns out the performance is horrible, still over 2x slower than Ruby:
Ruby: 1.18s
RLisp: 5.93s (x5.0)
Mock binary RLisp: 2.48s (x2.1)

Many optimizations are possible, but I thought the guilty party is the ugly wrapper lambda. This or some other wrapper needs to be called every time we call RLisp function from Ruby, but what if we somehow optimized compile RLisp to compiled RLisp calls ? To make it really work we would probably subclass Proc to CompiledRLispProc and in the compiled code check wheather a Proc is of this particular kind, and if so, call it directly instead of using a wrapper. In this code we simply do that for calls do ack and nothing else. Here's the new driver ackermann_driver2.rb:
require 'ackermann2'

globals = {}

closure_ack = RLispCompiledCode.build_closure_ack_CHECKSUM(globals)
globals[:ack] = lambda{|*args| RLispCompiledCode.ack_CHECKSUM(closure_ack, self, args, nil)}

num = (ARGV.shift || 1).to_i
res = globals[:ack].call(3, num)
print "Ack(3,", num, "): ", res, "\n"

and the new hand-compiled code:
#include <ruby.h>
#include <intern.h>

int ID_EQEQ, ID_PLUS, ID_MINUS, ID_call;
VALUE SYM_ack, SYM_globals;

static VALUE ack_CHECKSUM(VALUE rlisp_compiled_module, VALUE closure, VALUE self, VALUE args, VALUE block_arg)
{
VALUE globals, m, n, t2, t3, t4, t5, t6, t8, t9, t11, t13, t14, t15;
VALUE new_args;

globals = rb_hash_aref(closure, SYM_globals);

m = rb_ary_entry(args, 0);
n = rb_ary_entry(args, 1);

t2 = rb_funcall(m, ID_EQEQ, 1, INT2FIX(0));

if(t2) {
t4 = rb_funcall(n, ID_PLUS, 1, INT2FIX(1));
t3 = t4;
} else {
t5 = rb_funcall(n, ID_EQEQ, 1, INT2FIX(0));
if(t5) {
t8 = rb_funcall(m, ID_MINUS, 1, INT2FIX(1));

new_args = rb_ary_new();
rb_ary_push(new_args, t8);
rb_ary_push(new_args, INT2FIX(1));
t9 = ack_CHECKSUM(rlisp_compiled_module, closure, self, new_args, Qnil);

t6 = t9;
} else {
t11 = rb_funcall(m, ID_MINUS, 1, INT2FIX(1));
t13 = rb_funcall(n, ID_MINUS, 1, INT2FIX(1));

new_args = rb_ary_new();
rb_ary_push(new_args, m);
rb_ary_push(new_args, t13);
t14 = ack_CHECKSUM(rlisp_compiled_module, closure, self, new_args, Qnil);

new_args = rb_ary_new();
rb_ary_push(new_args, t11);
rb_ary_push(new_args, t14);
t15 = ack_CHECKSUM(rlisp_compiled_module, closure, self, new_args, Qnil);;

t6 = t15;
}
t3 = t6;
}
return t3;
}

static VALUE build_closure_ack_CHECKSUM(VALUE rlisp_compiled_module, VALUE globals)
{
VALUE closure = rb_hash_new();
rb_hash_aset(closure, SYM_globals, globals);
return closure;
}

void Init_ackermann2()
{
VALUE mRLispCompiledCode;
ID_EQEQ = rb_intern("==");
ID_PLUS = rb_intern("+");
ID_MINUS = rb_intern("-");
ID_call = rb_intern("call");
SYM_ack = ID2SYM(rb_intern("ack"));
SYM_globals = ID2SYM(rb_intern("globals"));

mRLispCompiledCode = rb_define_module("RLispCompiledCode");
rb_define_module_function(mRLispCompiledCode, "ack_CHECKSUM", ack_CHECKSUM, 4);
rb_define_module_function(mRLispCompiledCode, "build_closure_ack_CHECKSUM", build_closure_ack_CHECKSUM, 1);
}

It's over 15 times faster than current RLisp and over 3 times faster than Ruby:
Ruby: 1.18s
RLisp: 5.93s (x5.0)
Mock binary RLisp: 2.48s (x2.1)
Mock binary RLisp 2: 0.38s (x0.3)

The road from a mock RubyC code to real RubyC back-end is a long one, but basically it proves what I always knew - RLisp can be much faster than Ruby.

Oh and the Rakefile:
task :default => ["ackermann.so", "ackermann2.so"]

def time_cmd(cmd)
cmd = "/usr/bin/time -f '%U+%S' #{cmd} 2>&1 >/dev/null"
raw_time = `#{cmd}`.chomp
raw_time =~ /\A(\d+\.\d*)\+(\d+\.\d*)\Z/ or raise "Cannot parse time: #{raw_time}"
$1.to_f + $2.to_f
end

desc "Run benchmarks"
task :benchmark do
ruby_time = time_cmd "ruby ./ackermann.ruby 6"
rlisp_time = time_cmd "./rlisp.rb ./ackermann.rl 6"
binrl_time = time_cmd "./ackermann_driver.rb 6"
binrl2_time = time_cmd "./ackermann_driver2.rb 6"

print "Ruby: #{ruby_time}s\n"
print "RLisp: #{rlisp_time}s (x#{sprintf "%0.1f", (rlisp_time.to_f/ruby_time)})\n"
print "Mock binary RLisp: #{binrl_time}s (x#{sprintf "%0.1f", (binrl_time.to_f/ruby_time)})\n"
print "Mock binary RLisp 2: #{binrl2_time}s (x#{sprintf "%0.1f", (binrl2_time.to_f/ruby_time)})\n"
print "\n"
end

desc "Compile ackermann.c"
file "ackermann.so" => "ackermann.c" do
sh "gcc -Wall -W -O6 -g -I/usr/lib/ruby/1.8/i486-linux/ -fPIC -shared ackermann.c -o ackermann.so"
end

desc "Compile ackermann2.c"
file "ackermann2.so" => "ackermann2.c" do
sh "gcc -Wall -W -O6 -g -I/usr/lib/ruby/1.8/i486-linux/ -fPIC -shared ackermann2.c -o ackermann2.so"
end

1 comment:

Anonymous said...

Just stumbled upon this blog via googling.

Although I am a bit confused, I appreciate it a lot that people put up the effort to try and explain how ruby internals work :-)

After writing ruby code for some years, it tends to become natural to dig deeper into the Ruby-C world (and as often is the case, documentation is a bit sparse)