When RLisp was first created it was purely interpreted, with an interpreter written in Ruby. It was about 50x slower than Ruby. Then it went through an unsuccessful stage of mixed interpreted/compiled-to-Ruby mode, finally getting a real compiler with a Ruby back-end. That's where it is now, and it is about 5x slower than Ruby.
The code generated by the RLisp compiler doesn't need the full power of Ruby; it just needs a few simple operations and access to the Ruby runtime. That is, it could be RubyC code for matz's Ruby, or Java code for JRuby. If using a C/Java compiler was too slow, the RLisp compiler could generate LLVM, x86 asm, or JVM code directly. It would be a little bit slower than going through C/Java, as we wouldn't run the optimizations, but the overhead is in the Ruby runtime anyway, and I doubt the difference would be more than a few percent.
Before implementing the RubyC back-end I wanted to measure how fast it would be and what it would look like. For that I used the Ackermann benchmark from the Great Programming Language Shootout, which Ruby, by the way, drastically loses. The Ruby and RLisp implementations look like this:
def ack(m, n)
  if m == 0 then
    n + 1
  elsif n == 0 then
    ack(m - 1, 1)
  else
    ack(m - 1, ack(m, n - 1))
  end
end

NUM = Integer(ARGV.shift || 1)
print "Ack(3,", NUM, "): ", ack(3, NUM), "\n"
(defun ack (m n)
  (cond (eq? m 0) (+ n 1)
        (eq? n 0) (ack (- m 1) 1)
        (ack (- m 1) (ack m (- n 1)))
  )
)

(let num [[ARGV shift] to_i])
(print "Ack(3," num "): " (ack 3 num))
Compiled RLisp code, after some reformatting for readability:
module ::RLispC7054184
  def self.ack(globals)
    Proc.new do |*args|
      m, n, = *args
      t2 = m.send(:==, 0)
      if t2
        t4 = n.send(:+, 1)
        t3 = t4
      else
        t5 = n.send(:==, 0)
        if t5
          t7 = globals[:ack]
          t8 = m.send(:-, 1)
          t9 = t7.call(t8, 1)
          t6 = t9
        else
          t10 = globals[:ack]
          t11 = m.send(:-, 1)
          t12 = globals[:ack]
          t13 = n.send(:-, 1)
          t14 = t12.call(m, t13)
          t15 = t10.call(t11, t14)
          t6 = t15
        end
        t3 = t6
      end
      t3
    end
  end
end
t16 = ::RLispC7054184::ack(globals)
globals[:ack] = t16
t17 = ::ARGV.send(:shift)
t18 = t17.send(:to_i)
globals[:num] = t18
t19 = globals[:print]
t20 = globals[:num]
t21 = globals[:ack]
t22 = globals[:num]
t23 = t21.call(3, t22)
t24 = t19.call("Ack(3,", t20, "): ", t23)
It should be pretty straightforward. globals is a Hash containing RLisp global variables. The three-level module/def/Proc.new procedure creation is used to:
- Make sure procedures with the same name won't accidentally overwrite each other.
- Make sure stack backtraces contain the name ack (instead of ack_CHECKSUM).
- Build a closure if necessary. As ack is global, the only captured variable is globals.
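To see the pattern in isolation, here is a minimal hand-written sketch of the same layering; the module name RLispCExample and the trivial inc function are invented for illustration and are not names the RLisp compiler actually generates:

# Sketch only: RLispCExample and inc are made up for illustration.
module ::RLispCExample
  def self.inc(globals)
    # Proc.new builds the closure; globals is captured from the method
    # argument, exactly as in the generated ack above.
    Proc.new do |*args|
      n, = *args
      n.send(:+, 1)
    end
  end
end

globals = {}
globals[:inc] = ::RLispCExample.inc(globals)
p globals[:inc].call(41)   # prints 42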
I compiled the ack function to RubyC by hand. It follows the generated Ruby code statement-by-statement, except that I didn't bother compiling things outside the ack function, which are executed only once and don't affect performance anyway. So the code consists of two files, the Ruby file ackermann_driver.rb:

require 'ackermann'
globals = {}
closure_ack = RLispCompiledCode.build_closure_ack_CHECKSUM(globals)
globals[:ack] = lambda{|*args| RLispCompiledCode.ack_CHECKSUM(closure_ack, self, args, nil)}
num = (ARGV.shift || 1).to_i
res = globals[:ack].call(3, num)
print "Ack(3,", num, "): ", res, "\n"
and the RubyC file ackermann.c:

#include <ruby.h>
#include <intern.h>

/* rb_intern returns ID, so declare these as ID rather than int */
ID ID_EQEQ, ID_PLUS, ID_MINUS, ID_call;
VALUE SYM_ack, SYM_globals;

static VALUE ack_CHECKSUM(VALUE rlisp_compiled_module, VALUE closure, VALUE self, VALUE args, VALUE block_arg)
{
    VALUE globals, m, n, t2, t3, t4, t5, t6, t7, t8, t9, t10, t11, t12, t13, t14, t15;
    globals = rb_hash_aref(closure, SYM_globals);
    m = rb_ary_entry(args, 0);
    n = rb_ary_entry(args, 1);
    t2 = rb_funcall(m, ID_EQEQ, 1, INT2FIX(0));
    if(t2) {
        t4 = rb_funcall(n, ID_PLUS, 1, INT2FIX(1));
        t3 = t4;
    } else {
        t5 = rb_funcall(n, ID_EQEQ, 1, INT2FIX(0));
        if(t5) {
            t7 = rb_hash_aref(globals, SYM_ack);
            t8 = rb_funcall(m, ID_MINUS, 1, INT2FIX(1));
            t9 = rb_funcall(t7, ID_call, 2, t8, INT2FIX(1));
            t6 = t9;
        } else {
            t10 = rb_hash_aref(globals, SYM_ack);
            t11 = rb_funcall(m, ID_MINUS, 1, INT2FIX(1));
            t12 = rb_hash_aref(globals, SYM_ack);
            t13 = rb_funcall(n, ID_MINUS, 1, INT2FIX(1));
            t14 = rb_funcall(t12, ID_call, 2, m, t13);
            t15 = rb_funcall(t10, ID_call, 2, t11, t14);
            t6 = t15;
        }
        t3 = t6;
    }
    return t3;
}

static VALUE build_closure_ack_CHECKSUM(VALUE rlisp_compiled_module, VALUE globals)
{
    VALUE closure = rb_hash_new();
    rb_hash_aset(closure, SYM_globals, globals);
    return closure;
}

void Init_ackermann()
{
    VALUE mRLispCompiledCode;
    ID_EQEQ  = rb_intern("==");
    ID_PLUS  = rb_intern("+");
    ID_MINUS = rb_intern("-");
    ID_call  = rb_intern("call");
    SYM_ack = ID2SYM(rb_intern("ack"));
    SYM_globals = ID2SYM(rb_intern("globals"));
    mRLispCompiledCode = rb_define_module("RLispCompiledCode");
    rb_define_module_function(mRLispCompiledCode, "ack_CHECKSUM", ack_CHECKSUM, 4);
    rb_define_module_function(mRLispCompiledCode, "build_closure_ack_CHECKSUM", build_closure_ack_CHECKSUM, 1);
}
The code is so scary that I don't even know where to start with explanations. The huge problem is that the compiled function ack_CHECKSUM needs to receive the closure argument somehow, but Ruby is not going to provide it. So we need an ugly wrapper globals[:ack] = lambda{...} in the driver.

As it turns out the performance is horrible, still over 2x slower than Ruby:
Ruby: 1.18s
RLisp: 5.93s (x5.0)
Mock binary RLisp: 2.48s (x2.1)
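To get a feel for where the time goes, here is a rough Ruby micro-benchmark, only an illustration of the wrapper cost rather than the Shootout program, with made-up method and variable names, comparing a plain method call with the same call routed through a splat-args lambda wrapper like the one in the driver:

require 'benchmark'

# Plain method vs. the same method reached through a splat-args wrapper,
# similar in shape to globals[:ack] = lambda{|*args| ...} above.
def add(m, n)
  m + n
end

wrapped = lambda{|*args| add(*args)}

n = 1_000_000
Benchmark.bm(10) do |b|
  b.report("direct")  { n.times { add(1, 2) } }
  b.report("wrapped") { n.times { wrapped.call(1, 2) } }
end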
Many optimizations are possible, but I thought the guilty party was the ugly wrapper lambda. This or some other wrapper needs to be called every time we call an RLisp function from Ruby, but what if we somehow optimized compiled-RLisp-to-compiled-RLisp calls? To make it really work we would probably subclass Proc as CompiledRLispProc and, in the compiled code, check whether a Proc is of this particular kind, and if so, call it directly instead of going through a wrapper (a rough Ruby sketch of this check appears near the end of the post). In this code we simply do that for calls to ack and nothing else. Here's the new driver ackermann_driver2.rb:

require 'ackermann2'
globals = {}
closure_ack = RLispCompiledCode.build_closure_ack_CHECKSUM(globals)
globals[:ack] = lambda{|*args| RLispCompiledCode.ack_CHECKSUM(closure_ack, self, args, nil)}
num = (ARGV.shift || 1).to_i
res = globals[:ack].call(3, num)
print "Ack(3,", num, "): ", res, "\n"
and the new hand-compiled code, ackermann2.c:
#include <ruby.h>
#include <intern.h>

/* rb_intern returns ID, so declare these as ID rather than int */
ID ID_EQEQ, ID_PLUS, ID_MINUS, ID_call;
VALUE SYM_ack, SYM_globals;

static VALUE ack_CHECKSUM(VALUE rlisp_compiled_module, VALUE closure, VALUE self, VALUE args, VALUE block_arg)
{
    VALUE globals, m, n, t2, t3, t4, t5, t6, t8, t9, t11, t13, t14, t15;
    VALUE new_args;
    globals = rb_hash_aref(closure, SYM_globals);
    m = rb_ary_entry(args, 0);
    n = rb_ary_entry(args, 1);
    t2 = rb_funcall(m, ID_EQEQ, 1, INT2FIX(0));
    if(t2) {
        t4 = rb_funcall(n, ID_PLUS, 1, INT2FIX(1));
        t3 = t4;
    } else {
        t5 = rb_funcall(n, ID_EQEQ, 1, INT2FIX(0));
        if(t5) {
            t8 = rb_funcall(m, ID_MINUS, 1, INT2FIX(1));
            new_args = rb_ary_new();
            rb_ary_push(new_args, t8);
            rb_ary_push(new_args, INT2FIX(1));
            t9 = ack_CHECKSUM(rlisp_compiled_module, closure, self, new_args, Qnil);
            t6 = t9;
        } else {
            t11 = rb_funcall(m, ID_MINUS, 1, INT2FIX(1));
            t13 = rb_funcall(n, ID_MINUS, 1, INT2FIX(1));
            new_args = rb_ary_new();
            rb_ary_push(new_args, m);
            rb_ary_push(new_args, t13);
            t14 = ack_CHECKSUM(rlisp_compiled_module, closure, self, new_args, Qnil);
            new_args = rb_ary_new();
            rb_ary_push(new_args, t11);
            rb_ary_push(new_args, t14);
            t15 = ack_CHECKSUM(rlisp_compiled_module, closure, self, new_args, Qnil);
            t6 = t15;
        }
        t3 = t6;
    }
    return t3;
}

static VALUE build_closure_ack_CHECKSUM(VALUE rlisp_compiled_module, VALUE globals)
{
    VALUE closure = rb_hash_new();
    rb_hash_aset(closure, SYM_globals, globals);
    return closure;
}

void Init_ackermann2()
{
    VALUE mRLispCompiledCode;
    ID_EQEQ  = rb_intern("==");
    ID_PLUS  = rb_intern("+");
    ID_MINUS = rb_intern("-");
    ID_call  = rb_intern("call");
    SYM_ack = ID2SYM(rb_intern("ack"));
    SYM_globals = ID2SYM(rb_intern("globals"));
    mRLispCompiledCode = rb_define_module("RLispCompiledCode");
    rb_define_module_function(mRLispCompiledCode, "ack_CHECKSUM", ack_CHECKSUM, 4);
    rb_define_module_function(mRLispCompiledCode, "build_closure_ack_CHECKSUM", build_closure_ack_CHECKSUM, 1);
}
It's over 15 times faster than current RLisp and over 3 times faster than Ruby:
Ruby: 1.18s
RLisp: 5.93s (x5.0)
Mock binary RLisp: 2.48s (x2.1)
Mock binary RLisp 2: 0.38s (x0.3)
The road from mock RubyC code to a real RubyC back-end is a long one, but basically it proves what I always knew - RLisp can be much faster than Ruby.
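One piece of that road is the compiled-to-compiled call optimization sketched earlier. In Ruby terms it might look roughly like this; the class and method names are invented, and in a real RubyC back-end both the check and the direct call would live in the generated C code:

# Sketch only: a marker subclass lets generated code recognize Procs
# produced by the RLisp compiler and skip the generic wrapper path.
class CompiledRLispProc < Proc
end

# Hypothetical call site the compiler could emit for (f x y):
def rlisp_call(fn, *args)
  if fn.is_a?(CompiledRLispProc)
    fn.call(*args)   # stand-in for jumping straight into the compiled C function
  else
    fn.call(*args)   # generic path: wrapper lambda or any other Proc
  end
end

fast = CompiledRLispProc.new{|m, n| m + n}
slow = lambda{|*args| fast.call(*args)}
rlisp_call(fast, 1, 2)   # => 3, direct path
rlisp_call(slow, 1, 2)   # => 3, wrapper path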
Oh, and the Rakefile:

task :default => ["ackermann.so", "ackermann2.so"]
def time_cmd(cmd)
  cmd = "/usr/bin/time -f '%U+%S' #{cmd} 2>&1 >/dev/null"
  raw_time = `#{cmd}`.chomp
  raw_time =~ /\A(\d+\.\d*)\+(\d+\.\d*)\Z/ or raise "Cannot parse time: #{raw_time}"
  $1.to_f + $2.to_f
end

desc "Run benchmarks"
task :benchmark do
  ruby_time = time_cmd "ruby ./ackermann.ruby 6"
  rlisp_time = time_cmd "./rlisp.rb ./ackermann.rl 6"
  binrl_time = time_cmd "./ackermann_driver.rb 6"
  binrl2_time = time_cmd "./ackermann_driver2.rb 6"
  print "Ruby: #{ruby_time}s\n"
  print "RLisp: #{rlisp_time}s (x#{sprintf "%0.1f", (rlisp_time.to_f/ruby_time)})\n"
  print "Mock binary RLisp: #{binrl_time}s (x#{sprintf "%0.1f", (binrl_time.to_f/ruby_time)})\n"
  print "Mock binary RLisp 2: #{binrl2_time}s (x#{sprintf "%0.1f", (binrl2_time.to_f/ruby_time)})\n"
  print "\n"
end

desc "Compile ackermann.c"
file "ackermann.so" => "ackermann.c" do
  sh "gcc -Wall -W -O6 -g -I/usr/lib/ruby/1.8/i486-linux/ -fPIC -shared ackermann.c -o ackermann.so"
end

desc "Compile ackermann2.c"
file "ackermann2.so" => "ackermann2.c" do
  sh "gcc -Wall -W -O6 -g -I/usr/lib/ruby/1.8/i486-linux/ -fPIC -shared ackermann2.c -o ackermann2.so"
end