taw's blog: February 2007

Tuesday, February 27, 2007

Ruby and Judy

A few days ago I tried to make Ruby a bit faster. Optimizing Fixnum operations and tweaking rb_call0 seemed pretty straighforward, but specializing st_num_table must have seemed pretty weird. Every non-trivial program uses a lot of collections. Collections must basically support the following operations:

Getting element by key (Collection#[])
Setting element by key (Collection#[]=)
Iterating over all elements (Collection#each)
all kinds of collection-specific and helper methods.

In many ways, a single collection type is enough for a language. PHP has only array. Perl has arrays and hashes, but hashes can be indexed only by strings - that's a very serious limitation. Ruby also has two kinds of collections - Arrays and Hashes. Ruby Hashes are indexed by arbitrary objects, so they could in principle replace Arrays, at cost of some API changes, any possibly somewhat inferior performance (with good implementation, the difference wouldn't be high). Below the surface, Ruby uses many more collection kinds:

RArray - dense array indexed by small positive integers
st_table - a general purpose hash, which is specialized into three kinds:
* st_num_table - indexed by numbers
* st_str_table - indexed by strings
* st_obj_table - indexed by arbitrary Ruby objects

The specialization is done in a very inefficient way - using hand-coded vtable:

 struct st_hash_type {
   int (*compare)();
   int (*hash)();
};

struct st_table {
   struct st_hash_type *type;
   int num_bins;
   int num_entries;
   struct st_table_entry **bins;
};

st_obj_table is used in implementation of Hashes. It isn't very efficient, calling #eql? method to check object equality, but a few special cases let it handle Fixnums and Strings more efficiently. Because of st_obj_table's extra performance cost, and possibly also for API reasons, Ruby interpretter rarely uses it internally. It uses st_str_table quite a bit (mostly in uninteresting places), and st_num_table a lot. Why st_num_table instead of RArray ? Basically because st_num_table is sparse. When Java sees the following code:

double do_stuff(Complex x) {
   return x.real + x.imaginary;
}

It knows Complex has two fields:


   protected double imaginary;
   protected double real;

so the code get compiled to something like:

double do_stuff(Complex x) {
   return get_field(x, 1) + get_field(x, 0);
}

It's more difficult for duck-typed languages like Ruby. They don't even know internal representation of self, so the code:

def do_stuff
   return @real + @image
end

gets compiled to something like:

def do_stuff
   return get_field(self, real_id) + get_field(self, image_id)
end

where real_id and image_id equal :@real.to_i and :@image.to_i. In a particular session of irb I tried they were 26314 and 26322:

irb> a = Complex.new(1.0, 2.0)
=> Complex(1.0, 2.0)
irb> a.instance_variables
=> ["@image", "@real"]
irb> a.instance_variables.map{|s| s.to_sym.to_i}
=> [26314, 26322]
irb> a.instance_variable_get(:@real)
=> 1.0
irb> a.instance_variable_get(26322.to_sym)
=> 2.0

Ruby method calls work in a very similar way. So Ruby needs efficient collections indexed by small integers. Obviously, they need to be sparse - Array containing something at 26314 and 26322 would allocate memory for all elements from 0 to 26322. That's clearly not acceptable. Hash tables like st_num_table don't have this kind of overhead, so they work much better. However, st_num_table is used very often in Ruby, and hash tables have pretty high constant overhead. To demonstrate this I wrote an instance variable table printer (works with 1.8.5):

require 'complex'
require 'dl'
require 'dl/struct'

include DL::Importable

# Some bits copied from evil.rb

typealias "VALUE", nil, nil, nil, "unsigned long"

Basic = ["long flags", "VALUE klass"]
RBasic = struct Basic

RObject = struct(Basic + [
   "st_table *iv_tbl"
])

RClass = struct(Basic + [
 "st_table *iv_tbl",
 "st_table *m_tbl",
 "VALUE super"
])

St_table = struct([
   "st_hash_type *type",
   "int num_bins",
   "int num_entries",
   "st_table_entry **bins",
])

St_table_entry = struct([
   "uint hash",
   "long key",
   "VALUE record",
   #"st_table_entry *next",
   "long next",
])

def value_to_obj(value)
   return false if value == 0
   return true  if value == 2
   return nil   if value == 4
   # 6 is undef placeholder

   if (value & 3) == 0
       # Object
       ObjectSpace._id2ref((value - 2**32)/2)
   elsif (value & 3) == 2
       # Symbols
       (value >> 8).to_sym
   else
       # Integers
       return ObjectSpace._id2ref(value)
   end
end

def print_object_iv_tbl(obj)
   puts "IV Table of object #{obj.inspect}"
   addr = obj.object_id * 2
   obj_data = RObject.new(DL::PtrData.new(addr))
   puts "Flags = #{obj_data.flags}"
   puts "Class = #{value_to_obj(obj_data.klass).inspect}"
   iv_tbl = St_table.new(obj_data.iv_tbl)
   bins = iv_tbl.bins
   num_bins = iv_tbl.num_bins

   # x86-ish, assumes normal Object implemented as RObject, nothing fancy
   object_memory_use = 12 + 4 * num_bins

   puts "IV_TBL:"
   puts "* Bins: #{num_bins}"
   puts "* Entries: #{iv_tbl.num_entries}"

   bins.size = num_bins * 4
   bins.to_str.unpack("L*").each_with_index{|addr, i|
       if addr == 0
           puts "* Bin #{i}: empty"
       else
           puts "* Bin #{i}:"
           while addr != 0
               object_memory_use += 16
               table_entry = St_table_entry.new(DL::PtrData.new(addr))
               puts "  * Entry"
               puts "    * Hash: #{table_entry.hash}"
               puts "    * Key: #{table_entry.key.to_sym}"
               puts "    * Record: #{value_to_obj(table_entry.record).inspect}"
               addr = table_entry.next
           end
       end
   }
   puts "Memory use: #{object_memory_use} bytes"
end

a = Complex.new(1.0, 2.0)

print_object_iv_tbl(a)

which reports:

IV Table of object Complex(1.0, 2.0)
Flags = 2
Class = Complex
IV_TBL:
* Bins: 11
* Entries: 2
* Bin 0:
 * Entry
   * Hash: 10802
   * Key: @real
   * Record: 1.0
* Bin 1: empty
* Bin 2: empty
* Bin 3: empty
* Bin 4: empty
* Bin 5: empty
* Bin 6: empty
* Bin 7: empty
* Bin 8:
 * Entry
   * Hash: 10810
   * Key: @image
   * Record: 2.0
* Bin 9: empty
* Bin 10: empty
Memory use: 88 bytes

That's 88 bytes for the object + 2*16 more for storing contents of its fields (Floats take only 16 bytes as they don't have own iv_tbls). 120 bytes instead of 16 needed for storing just two doubles.

Benchmarking memory usage

One problem with benchmarking under Linux is lack of easy way of measuring memory use. Standard Unix /usr/bin/time tries to measure it, but as Linux kernel doesn't set the relevant fields of struct rusage after call to wait4, it always reports 0 kB used. Since 2.6.16 Linux at least provides /proc/<process_id>/smaps. The only thing needed is a convenient wrapper like time for measuring execution time. Live monitoring would be complicated, but as most programs do not free any memory before exiting (normally free()d or GC-ed memory returns to the memory allocator, but not to the operating system), it's a good enough to measure at exit. The simplest way of making program do something before exiting is LD_PRELOADing a shared library, and registering a destructor for it. Parsing text file /proc/<process_id>/smaps in C would be sick, so the library simply spawns a Ruby program which does all the work. Process.ppid is pid of parent process, so /proc/#{Process.ppid}/smaps is file containing parent process's memory use information. system("exec ./memory_benchmarker.rb") is necessary as libc system("./memory_benchmarker.rb") would spawn /bin/sh, which would in turn spawn ./memory_benchmarker.rb, and Process.ppid would return shell's pid, not the actual process's. So the Makefile:

memory_benchmarker.so: memory_benchmarker.c
       gcc -Wall -W -fPIC -shared memory_benchmarker.c -o memory_benchmarker.so

memory_benchmarker.c:

#include <stdlib.h>
#include <unistd.h>

void memory_benchmarker_fini() __attribute__ ((destructor));
void memory_benchmarker_fini()
{
   unsetenv("LD_PRELOAD");
   system("exec ./memory_benchmarker.rb");
}

/* If you don't want to trace memory use of child process, unset LD_PRELOAD here.
  Unfortunately doing so also affects exec()ed process.
void memory_benchmarker_init() __attribute__ ((constructor));
void memory_benchmarker_init()
{
   unsetenv("LD_PRELOAD");
}
*/

and memory_benchmarker.rb:

#!/usr/bin/ruby

smaps = File.read("/proc/#{Process.ppid}/smaps")
sz  = smaps.scan(/^Size:\s+(\d+) kB/).map{|x| x[0].to_i}.inject{|a,b| a+b}
rss = smaps.scan(/^Rss:\s+(\d+) kB/).map{|x| x[0].to_i}.inject{|a,b| a+b}
puts "Memory use: #{sz} kB"
puts "Rss: #{rss} kB"

Judy

Judy is an efficient sparse array library. It implements a few data structures, of which the most relevant is JudyL - a sparse array with word (32 or 64 bit) indexes and values. To use JudyL as replacement for st_num_array and new implementation of iv_tbl, m_tbl etc., its indexes will be interpretted as integers, and its keys as

VALUEs (arbitrary Ruby objects).

The replacement is pretty straightforward. There are two minor technical issues:

JudyL assumes there's only one pointer to it, so it can reallocate itself without complicating the API. The reallocating macros simply modify the passed pointer. To quickly retrofit Ruby with Judy, I used struct st_num_table containing just this pointer. It should be simple to get rid of this extra level of indirection later.
Unlike st_num_table, JudyL doesn't store number of elements. A lot of code assumes number of elements is easily accessible, and calls slow st_num_count_entries() even if it's just to check whether the table is empty or not. I replaced such calls by st_num_is_empty() and other faster calls.

Here's a patch against Ruby 1.8.5-p12. It contains all optimizations: rb_call0, Fixnum#equal, and Judy (and separate st_num_table required by it).

Results

Everything was run with CFLAS -O3 -march=athlon-xp. As someone pointed out, Ruby GC becomes unstable with -fomit-frame-pointer and -mtune is redundant. Numbers represent overhead compared to the most efficient implementation, in terms of memory:

	Plain 1.8.5-p12	`rb_call0`	`rb_call0` + `st_num_table`	`rb_call0` + Judy
app_answer	+1%	+1%	+1%	+0%
app_fib	+2%	+1%	+2%	+0%
app_mandelbrot	+13%	+14%	+14%	+0%
app_raise	+6%	+7%	+6%	+0%
app_strconcat	+0%	+0%	+1%	+0%
app_tak	+2%	+1%	+2%	+0%
app_tarai	+1%	+1%	+1%	+0%

and time:

	Plain 1.8.5-p12	`rb_call0`	`rb_call0` + `st_num_table`	`rb_call0` + Judy
app_answer	+1%	+1%	+0%	+0%
app_fib	+2%	+0%	+1%	+1%
app_mandelbrot	+7%	+5%	+4%	+0%
app_raise	+6%	+5%	+0%	+5%
app_strconcat	+4%	+3%	+0%	+2%
app_tak	+1%	+0%	+1%	+1%
app_tarai	+0%	+0%	+0%	+0%

The only benchmark showing significant improvement is app_mandelbrot. It's also the only one that uses anything object-oriented (require 'complex'), others jush crunch numbers, concatenate strings and raise exceptions in loops. It's no wonder they show little improvement. I thought that real Ruby applications would make many more iv_tbl/m_tbl operations than the benchmarks. The only problem was finding better benchmarks. For no particular reason I took Ruby unit tests (ones that get run if you do make test-all) as a "representative" app. The final results:

	Plain 1.8.5-p12	`rb_call0`	`rb_call0` + `st_num_table`	`rb_call0` + Judy
CPU	+5%	+2%	+2%	+0%
Memory	+4%	+4%	+4%	+0%

Conclusions

I'm not really sure how "representative" this benchmark is. All performance improvements so far resulted in Ruby being about 5% faster and taking 4% less memory. Performance of optimized subsystems got significantly improved, but the true bottleneck seems to be somewhere else - in slow rb_eval and related functions. It's up to Ruby maintainers whether they want to use my patch or not. On the one hand - performance improvement is not negligible and if they ever speed up rb_eval, st_num_table will need to be dealt with too. On the other hand, introducing dependency on just another library (Judy) is something more developers learn to appoach with caution. The code passes all unit tests, but it still requires more testing (including performance testing on different architectures) and significant clean-up.

Wednesday, February 21, 2007

Atomic Coding with Subversion

Better programming languages, more extensive libraries, and faster hardware are the most visible drivers of increase in programmer productivity. People are obviously more productive in Ruby than in Fortran, with CPAN than with C standard library, and on a Dual-Core Opteron with 1GB RAM than on a punched card big iron. There are also many factors improving productivity that are less in-your-face. For example in the 2000s any two programs on any two machines can communicate using TCP/IP, HTTP and XML. In the 1980s one would need to code application-specific data encodings, network protocols, and add explicit support for every possible network. People got so used to the Internet that they don't think much about it any more. Back in the old days programmers didn't know about refactoring, unit testing, test-driven development, YAGNI, or even Object-Oriented Programming. Another productivity factor are revision control software and atomic coding. There are a few people who still don't use revision control, some even used to be pretty well known for that. Back when everybody used CVS, I viewed revision control as necessary evil - CVS was getting in the way more often than helping. It all changed with Subversion. It's not perfect for every possible development model, but for personal repositories and small teams it's simply awesome. This post will be about two things - subversion basics for those who aren't using it much yet, and Atomic Coding. If you know SVN well, just jump to the Atomic Coding section, as it's a very important concept that will help your productivity stay ahead of the Yannis's Law.

Subversion basics

Subversion supports multiple protocols, but the sanest one is HTTPS. With HTTPS anonymous access requires no authorization, and you don't even need any SVN software - wget -r --no-parent or any browser are good enough. There are guides for quickly setting up SVN repository with HTTPS for Debian, Ubunut, Gentoo, FreeBSD, and maybe even Windows. The first important SVN command you need is svn help and svn help some_command. It's pretty useful when you forget details of some command. When you use subversion you first need to checkout with svn co URL. Then you can edit checked-out files, add new ones with svn add FILES (works with adding directories too), move them around (svn mv FROM TO or svn mv FILES DESTDIR), and occasionally detele them (svn rm FILE). After you're done you can commit. svn ci -m "Message" will commit all changes in current directory and its subdirectories. Very often you want to commit less than that, for example svn ci -m "Some quick change in README" README while leaving rest of the code uncommitted. You can take a look at changes since the last version commit by svn diff, or with some older revision by svn diff -r REVISION. To look at changes from other perspective (list of added/deleted/modified/not-in-repository files) use svn status. A few more things. If you keep PDF files in repository, you don't want to see binaries in output of svn diff. SVN tries to determine what's binary and what's text (it matters only for diff viewing, it won't "convert line endings" in your binaries like some broken revision control systems), but PDFs often start with some text commands before going binary and confuse SVN. To tell SVN they're really binaries, run svn propset svn:mime-type application/octet-stream *.pdf. I never had such problems with other binary formats like pictures. To get your copy of repository in sync with master repository run svn up. To get a log of changes affecting current directory and its subdirectories run svn log. Every now and then you'll need to revert one file (svn revert FILE) or everything in the directory svn revert. If you added something but didn't commit, revert the file instead of svn rming it. If you want to script SVN, most commands accept --xml modifier, and will output machine-readable markup instead of the default human-readable plain text. Another useful switch is -v to increase verbosity. I think it's much easier to run SVN from command line and parse XML than use language-specific SVN libraries. In SVN branching and tagging is done by copying (svn cp), and merging by svn merge, but most of the time I work on just the main branch and merge by plain Unix diff and patch.

Atomic Coding

But I wasn't going to write just another SVN tutorial. I want to say something about Atomic Coding. The idea is basically committing (to the branch you're working on - usually the main branch) as soon as you have a complete change, no matter how small. You've written a single unit test ? Commit. Made one unit test pass that didn't ? Commit. Changed README file ? Commit. Fixed a typo in some comments ? Commit. Breaking your work into small pieces is one of the most fundamental ideas in productivity ever. Programming han Unit Testing, project management has Getting Things Done's Next Actions and so on - they're all about breaking big and complex things into small and simple ones. It's pretty straightforward to apply the same principle to repository management, but it's much easier to talk about something when it has a name, so let's refer to it as "Atomic Coding". So why do Atomic Coding ? First because committing is so easy. Just say svn ci -m "Unit test for Kitten#meow" and you're done. Many people insist on meaningful commit messages, but if you commit very often you're going to end up with messages like "Some comments added", "Minor code cleanup". Don't feel bad about them - repository management is there to help you, not to oppress you. The most low-level benefit is you can use svn commands to do something useful. When you do Atomic Coding, svn revert will revert to the last working state without losing any useful modifications, svn diff will tell you what are you doing. On the medium-level, by keeping your code up to date, you will be able to get away from coding and get back to it much more easily. Interruptions happen many times a day - phone calls, instant messaging, people passing by, lunch time, and so on. Atomic Coding will save you a few minutes on every interruption, and it's going to add up to a huge productivity boost. On the high-level, Atomic Coding works really great with Unit Testing, Test-Driver Development, Getting Things Done and so on. They reinforce each other. Programming is really much more effective if you break it into small pieces instead of trying to do everything at once, and all your habits should support this instead of trying to get you away from it. If you want to see Atomic Coding in action, Bitscribe has a cool screencast about it. I used vague words like "small", "atomic", but let's get more specific. Ask yourself:

What's the typical (median) time between your commits ?

If the answer is anything more than 1-2 hours, you're not doing Atomic Coding. It's often difficult to get the right answer, so I wrote a short script that extracted the answer from SVN repository (the script isn't that short, but it's mostly because of pretty-printing).


require 'enumerator'
require 'magic_xml'
require 'time'

class Numeric
    def time_pp
        s = self.to_i
        return "#{self}s" if s < 60

        m = (s / 60).to_i
        s -= m*60
        return "#{m}m#{s}s" if m < 60

        h = (m / 60).to_i
        m -= h*60
        return "#{h}h #{m}m#{s}s" if h < 24

        d = (h / 24).to_i
        h -= d*24
        return "#{d}d #{h}h #{m}m#{s}s"
    end
end

log = XML.parse(STDIN)

summaries_by_author = Hash.new{|ht,k| ht[k] = {:dates => [], :sizes => []}}

log.descendants(:logentry) {|e|
    summaries_by_author[e[:@author]][:dates] << Time.parse(e[:@date])
    summaries_by_author[e[:@author]][:sizes] << e.descendants(:path).size
}

summaries_by_author.to_a.sort.each{|author, summary|
    dates = summary[:dates].enum_for(:each_cons, 2).map{|a,b| a-b}.sort
    sizes = summary[:sizes].sort

    puts "Activity of #{author}:"
    puts "Time between commits distribution:"
    puts "* 10% - #{dates[dates.size/10].time_pp}"
    puts "* 25% - #{dates[dates.size/4].time_pp}"
    puts "* median - #{dates[dates.size/2].time_pp}"
    puts "* 75% - #{dates[dates.size*3/4].time_pp}"
    puts "* 90% - #{dates[dates.size*9/10].time_pp}"
    puts "Median number of affected files: #{sizes[sizes.size/2]}"

    sizes_summary = Hash.new(0)
    sizes.each{|sz| sizes_summary[sz] += 1}
    sizes_summary.to_a.sort.each{|k,v|
        puts "* #{k} file#{(k == 1) ? '' : 's'} - #{v} time#{(v == 1) ? '' : 's'}"
    }
}

To run it do svn log --xml -v | svn_log_summary.rb (it requires magic/xml). The results for me are: Activity of taw: Time between commits distribution:

10% - 2m40s
25% - 11m17s
median - 38m13s
75% - 4h 45m34s
90% - 1d 1h 19m45s

Median number of affected files: 2

1 file - 520 times
2 files - 287 times
3 files - 156 times
4 files - 84 times
5 files - 47 times
6 files - 33 times
7 files - 14 times
8 files - 22 times
9 files - 8 times
10 files - 7 times

...

102 files - 1 time
107 files - 1 time
127 files - 1 time
198 files - 1 time
1274 files - 1 time
2743 files - 1 time

So half of the commits were less than 38m13s before the previous commit, and a quarter were less than 11m17s before the previous one. A few hour breaks most likely represents getting away from coding, as it's very rare for me to code for hours without committing. Most commits were on just a few files, and the big ones are most likely import, or automated changes (like "Tabs replaced by spaces" 46-file commit), not results of long coding sessions. It took me some time to get used to Atomic Coding, but just like with Unit Testing and Getting Things Done - I'm never going back.

Making Ruby faster

There wasn't that much code in my blog recently. So today I decided to change that by coding something cool. And what could be simpler than taking a popular language and making it much much faster ? So first I downloaded Ruby 1.8.5-p12, and some benchmarks. The benchmarks are the same as used in the blog post comparing different implementations of Ruby. I don't like the benchmarks much - they test microfunctionality and they cannot be realistically extrapolated to real programs, but I'm too lazy to make better ones just yet. I wanted to speed up Ruby. Some rules:

Under no circumstance can it break Ruby functionality (hacks like Ruby/DL may need updating).
Speed-up must be observable in benchmarks.
Speed-up should be significant (more than just 1-2%).
It must be simple to code - a few hours for a working patch. Programmer productivity is supposed to double every six years after all.
It should not make code much more complex than it currently is.

I'm not trying to compete with YARV - the point is a major speed-up right now. First I modified the benchmark driver to run everything 5 times, and report average of 3 middle results, like they do in many sports competitions. I ran everything on an old Duron 1400MHz box with 512MB memory. I limited set of benchmarks to just 7 "app" benchmarks (all except for pentomino, which was too slow and factorial, which raised stack overflow) to get faster feedback.

CFLAGS

The first idea was to simply play with CFLAGS. I have no idea why people aren't doing that. Isn't the very idea of coding in C about performance ? I didn't do anything fancy - just the standard -O3 -march=athlon -mtune=athlon -fomit-frame-pointer. Results show how much extra time it needs compared to the fastest implementation.

Benchmark	Ruby 1.8.4 from Ubuntu	Ruby 1.8.5-p12	Ruby 1.8.5-p12 with nice CFLAGS
app_answer	+86%	+0%	+10%
app_fib	+72%	+4%	+0%
app_mandelbrot	+44%	+0%	+1%
app_raise	+42%	+5%	+0%
app_strconcat	+49%	+7%	+0%
app_tak	+77%	+6%	+0%
app_tarai	+69%	+10%	+0%

Two surprises here. First, messing with CFLAGS brought just a few percent speed-up, and in one case even a 10% slow-down. Second, either 1.8.5 is massively faster than 1.8.4 or Ubuntu uses some really horrible compilation options. Assuming the former, that's a great thing. Anyway, that's kinda disappointing, as I was expecting a 5%-10% improvement pretty much for free. So, time to look for another easy target. Running the benchmarks with a profiler (CFLAGS=-O3 -march=athlon -mtune=athlon -pg), and compiling everything to assembly seemed like a good idea. A candidate I remembered from earlier was st.c - hash tables called functions like these many times during each operation, and used multiple pointer dereferences for it, instead of inlining them (and simplifying to pretty much no code):

static int
numcmp(x, y)
    long x, y;
{
    return x != y;
}

static int
numhash(n)
    long n;
{
    return n;
}

It seemed straighforward to simply specialize the hash table to numeric version, but first I wanted to check if it's really likely to be worthwhile. And indeed, st_lookup and st_foreach were important, at least in some benchmarks. What I didn't expect were fix_minus, fix_plus, and other Fixnum functions. These also seemed easy to improve (more about it later). Unfortunately the five most performance-critical functions were rb_eval, rb_call0, rb_call, rb_newobj and call_cfunc, that I had no clue about. So two candidates giving at best a minor improvement, and 5 candidates I have no clue about, sounds like fun.

Optimizing `rb_call0`

rb_call0 seemed like the easiest way of making a difference. It contained a big switch which was compiled into a binary search tree of comparisons and jumps. That's usually a good idea for sparse switches, but in rb_call0 two cases are far more common than the other six. They are - NODE_CFUN (most common) and NODE_SCOPE (significantly less so). Everything else is far behind the two. So the switch was replaced by "if first else if second else switch" construct. This is only likely to be worthwhile because rb_call0 is called so often. By the way a good JIT compiler should be able to optimize such cases on its own, but compilers for static languages like C need manual help. And the results:

Benchmark	Ruby 1.8.4 from Ubuntu	Ruby 1.8.5-p12	Ruby 1.8.5-p12 with nice CFLAGS	Nice CFLAGS + `rb_call0`
app_answer	+86%	+0%	+10%	+7%
app_fib	+77%	+6%	+3%	+0%
app_mandelbrot	+46%	+2%	+3%	+0%
app_raise	+43%	+6%	+0%	+0%
app_strconcat	+49%	+7%	+0%	+2%
app_tak	+80%	+8%	+2%	+0%
app_tarai	+72%	+12%	+2%	+0%

So Ruby got some 2%-3% faster already.

Specializing hash tables

There are 3 kinds of hash tables used in Ruby - number, string and object hash tables. It seems pretty excessive to triple amount of hash-related code in hope of optimizing it a bit, and it wouldn't really be necessary if Ruby was coded in C++, as templates would probably do the trick. I'm going to do something a bit less radical, and only specialize number tables. The optimization required quite a bit of code duplication and patching, and the speed-up is relatively modest.

Benchmark	Ruby 1.8.4 from Ubuntu	Ruby 1.8.5-p12	Ruby 1.8.5-p12 with nice CFLAGS	Nice CFLAGS + `rb_call0`	Nice CFLAGS + `rb_call0` + `st_num_table`
app_answer	+86%	+0%	+10%	+7%	+7%
app_fib	+77%	+6%	+3%	+0%	+0%
app_mandelbrot	+49%	+3%	+4%	+2%	+0%
app_raise	+43%	+6%	+0%	+0%	+2%
app_strconcat	+54%	+10%	+3%	+5%	+0%
app_tak	+81%	+8%	+2%	+1%	+0%
app_tarai	+74%	+14%	+3%	+1%	+0%

Normally I wouldn't feel too good about patches like that, but having separate data structure for numeric hashes makes it possible to use more efficient data structure like Judy.

Making `Fixnum` operations faster

Fixnum operations are executed a huge number of times even in short programs. Unlike most programming languages (like let's say - Java), Fixnums in Ruby are just normal objects, and the virtual machine has no special support for making them fast. Fixing the virtual machine would take too much work, so let's just make sure the methods perform well. Unfortunately not all of them do. Take a look at this simple method:

static VALUE
fix_equal(x, y)
    VALUE x, y;
{
    if (FIXNUM_P(y)) {
        return (FIX2LONG(x) == FIX2LONG(y))?Qtrue:Qfalse;
    }
    else {
        return num_equal(x, y);
    }
}

static VALUE
num_equal(x, y)
    VALUE x, y;
{
    if (x == y) return Qtrue;
    return rb_funcall(y, id_eq, 1, x);
}

Unfortunately what it really means is:

fix_equal:
        subl    $28, %esp
        movl    36(%esp), %edx
        movl    32(%esp), %ecx
        testb   $1, %dl        ; Check if y is a fixnum
        jne     .L610
        cmpl    %ecx, %edx     ; Check if x == y
        je      .L605          ; If yes, return Qtrue
        movl    id_eq, %eax    ; If not, call rb_funcall(y, id_eq, 1, x)
        movl    %ecx, 12(%esp)
        movl    $1, 8(%esp)
        movl    %edx, (%esp)
        movl    %eax, 4(%esp)
        call    rb_funcall
.L607:
        addl    $28, %esp
        ret
.L610:
        sarl    %ecx           ; Shift x by 1 bit
        sarl    %edx           ; Shift y by 1 bit
        xorl    %eax, %eax
        cmpl    %edx, %ecx     ; Check if shifted x == shifted y
        jne     .L607          ; If not, return Qfalse
        .p2align 4,,7
.L605:
        movl    $2, %eax       ; Return Qtrue
        addl    $28, %esp
        ret

The that two shifts are completely unnecessary. x and y are equal after shift if and only if they were equal before shift. And if y isn't a Fixnum, it's impossible for it to equal x.

static VALUE
fix_equal(x, y)
    VALUE x, y;
{
    if (FIXNUM_P(y))
        return (x == y)?Qtrue:Qfalse;
    else
        return rb_funcall(y, id_eq, 1, x);
}

fix_equal:
        subl    $28, %esp
        movl    36(%esp), %edx
        movl    32(%esp), %eax
        testb   $1, %dl
        je      .L2
        cmpl    %eax, %edx
        sete    %al
        addl    $28, %esp
        movzbl  %al, %eax
        addl    %eax, %eax
        ret
        .p2align 4,,7
.L2:
        movl    %eax, 12(%esp)
        movl    id_eq, %eax
        movl    $1, 8(%esp)
        movl    %edx, (%esp)
        movl    %eax, 4(%esp)
        call    rb_funcall
        addl    $28, %esp
        ret

And suddenly we have one conditional jump instead of three. Other Fixnum operations could be made faster too. They're common enough to be worth it.

Summary

Here's the patch. It passes make test-all and does well in benchmarks. rb_call0 and Fixnum#equal improvements seem pretty useful, and are too simple to break anything. I'm not sure about st_num_table - Ruby C code uses random type casts all over the place, and the code is pretty ugly. So it's quite likely I missed some st_table vs st_num_table thing. What's left to do:

Running more benchmarks, on more machines.
Verifying st_num_table

If it works well, the patch could be included in the next stable version of Ruby. And to improve performance even more without much work, one could:

Convert rb_eval from node hopping to real "bytecode".
Convert st_num_table to something faster and more memory-efficient (like Judy).
Two extensions seem to be using old numeric st_table. I converted syck, but I cannot even compile win32ole.c to verify whether it works or not.
Replace call_cfunc with compiled ABI trampolines (move to stack, move to stack, move to stack, unconditional jump). This sounds terrifying, but it should be pretty fast and simple.

Monday, February 19, 2007

Yannis's Law: Programmer Productivity Doubles Every 6 Years

Everybody and their dog have heard about Moore's Law. Yet few know about a far more important Yannis's Law, which states:

Programmer productivity doubles every 6 years

There's no question that today it's possible to get a lot more functionality in less time and with smaller teams than in the past. The really shocking part is magnitude of the difference. In 1972 paper "On the Criteria to Be Used in Decomposing Systems into Modules" David Parnas wrote:

The KWIC index system accepts an ordered set of lines, each line is an ordered set of words, and each word is an ordered set of characters. Any line may be "circularly shifted" by repeatedly removing the first word and appending it at the end of the line. The KWIC index system outputs a listing of all circular shifts of all lines in alphabetical order. This is a small system. Except under extreme circumstances (huge data base, no supporting software), such a system could be produced by a good programmer within a week or two.

Fast forward to 2003, Yannis Smaragdakis wrote:

I would not consider a programmer to be good if they cannot produce the KWIC system within an hour or two.

"As hour or two" seemed seriously excessive, so in 2007 I timed myself coding such a system. It took me 4 minutes and 11 seconds to create and test the following code:

res_lines = []
STDIN.each{|line|
    line.chomp!
    words = line.split(" ")
    n = words.size
    (0...n).each{|i|
        res_lines << (words[(i+1)..n] + words[0..i]).join(" ")
    }
}
res_lines.uniq!
res_lines.sort!
puts res_lines

That's 500 to 1200 times faster than a "good 1972 programmer". Maybe my Perl/Ruby background makes me a bit unrepresentative here, but would anybody call a person who cannot implement it in 15 minutes (like during a job interview) a "good programmer" ? The progress is not over - you can get a huge productivity boost by moving from Java to Ruby, from PHP or J2EE to Ruby on Rails, or by seriously doing test-driven development. That can get you from 1990s to 2000s. And there's plenty of people who are still in the productivity 1980s - not using SVN, not knowing regular expressions and Unix, manually recompiling their programs, some are even doing things are insane as coding application-specific network protocols and data storage formats! By the way, there's a cool movie by some NASA guy, which times development of a small website in multiple web frameworks - the difference is really significant. Some people actively deny this progress. The most famous is Fred Brooks's 1986 essay "No Silver Bullet - essence and accidents of software engineering", which claimed that there's simply no way 10x productivity increase could possibly ever happen. Guess what - since 1986 programmer productivity already increased by about 11x (according to Yannis's Law figure) ! So programmer productivity doubles every 6 years. Every 6 years it takes only half the time to do the same thing. How many other professions can claim something like that ? Is there any human activity for which it is happening consistently for decades ? (Actually I wouldn't be surprised if the answer was true, even if with somewhat slower pace of improvement.) This means a few things. First, because fewer and fewer people will be necessary to do today's work, programming must expand to new areas - and maybe you should also look there. Second, drop the things keeping you in the past and move on. Do you really want to waste your time coding C if you can code many times more effectively in Ruby ? Whether it's an Open Source project, a job, or your personal thing, it just makes no damn sense to accept this kind of productivity loss. And third - if you are some sort of person making technology decisions for your company or organization - "because we have always done it that way" is terrible reason for choosing any programming-related technology. What they're making today is without doubt better, and programmers are really good at adapting, even if they don't want to admit it.

Saturday, February 10, 2007

The right to criticize programming languages

There are words that I've heard many years ago, which really affected my thinking about discussions and criticism. I cannot recall them exactly, but they went something along the lines of:

You have no right to criticize a programming language, unless you are able to point three things in which it excels compared to your favourite language.

I feel they're pretty deep. For one, if something really sucks so much that you can't name anything good in it, why are you even wasting time talking about it ? Or maybe it has a few good points, and learning them is a better use of time than ignorant ranting ?

It's no secret that I hate every single programming language ever created, and probably most of the ones to come. Right now the one which I hate least is Ruby, and that by no means should imply that given chance I wouldn't change half of it.

Anyway, with this post I want to prove to myself that I can criticize every single programming language out there.

Python

The first language to get the praise is Python. It is 80% Ruby, and even though I miss the other 20%, I'd take Python over pretty much everything.

The first thing where Python is far better than Ruby is whitespace sensitivity. Getting rid of ends and }s makes the code a pleasure to look at, and saves lines that would get wasted otherwise. Higher code density, and more readability. Python syntax is overall cleaner. Now I don't really fancy the way whitespace sensitivity is implemented, as they had to get rid of blocks for it, but the idea is brilliant, and I'd love to see other languages use it.

Python's second win is PyGame. PyGame was the reason I even learned Python, and it's a pleasure to use it. I even tried to get Google fund Ruby version of it in the Summer of Code ?

Python's third win is simplicity. Most of the complex things about Ruby - syntactic, semantic, or whatever - are exceedingly important. Still, Python by being much simpler would be far better language for teaching peolpe to code. Sure, I know all the arguments for using Scheme or Java for it, and by legal overlook starting with C or C++ isn't in the Criminal Code yet, but if you love kids, you surely agree that Python is simply The Right First Language.

Perl

Now it's Perl's time for the praises. Perl is awesome. Without doubt it is historically at least as important as Smalltalk, Fortran, and the original Lisp. The very idea that language should try to be friendly to the programmer, and flexible enough to accommodate to various ways of thinking, was far ahead of its time. Even in 2007, twenty years after Perl's first release, most languages would rather enforce one true worldview. Some people are willing to accept languages' worldview and become Java code monkeys, or incurable C old-timers. Others value their freedom of thought and choose Perl (or nowadays, Ruby).

History aside, Perl still excels in many ways. For one it is pretty much the only language which got Unicode right. Ruby pretends the problem doesn't exist, Python tried it and screwed it, everything else is useless for text processing. Perl regexp engine is still far better than what most other languages have. I do use the features marked experimental, a lot. If you do i18n, defining your own Unicode character classes in regexps is useful more often than you think.

The second thing is LWP. Ruby's net/http isn't anywhere near it, and will never be. Of course, nobody uses net/http for Ruby. They connect to an actual browser with watir or firewatir, or run system "wget", ....

You probably guesed what the third thing is going to be. That's of course CPAN. Ruby with RubyGems is getting there, but it's going to be a very long way. Oh, and I think RubyGems, not Ruby on Rails, is the main reason of Ruby's surge in popularity in the last few years.

Java

Java - the most trendy language to flame these days. It's also the most popular language these days, and these two things seem to correlate. I'm not even going to repeat the list of sucking points now, just go to any other programming blog.

So the cool things about Java. How can I not say portability ? There was never a language as portable as Java. I can take any Java program, put it on any machine, from a mobile phone up, and it will Just Work. Ever tried that with C ? Good luck getting configure not to segfault on you. Even with semi-portable languages like Ruby and Python, you'd still have to tweak it a lot if it uses some non-standard library (as most programs do), is more than six months old, or (heavens forbid), you're trying to run it under a non-Unix box.

The second reason why I'm grateful to people who created Java, is saving the world from C++. C++ is crime against humanity, and its creator is the programming equivalent of Saddam Hussein. Java managed to get rid of C++ without destroying a small country, how not to be thankful for that ? Java managed to get the mainstream a few steps in direction of Smalltalk, Lisp, and sanity. If you look at all currently available languages, it is pretty clear none would be even close. It had to be free from vendor lock-in, actually portable (in practice, not just theory), perform reasonably well, and have familiar syntax and semantics. It's a short list, but in 2007 Java is still the only language on it.

And JVM. It's mind-boggling what people do with JVM bytecode. Sure, every other language has a virtual machine, but JVM is actually portable, supports plenty of languages, is fast, and there are tons of utilities doing cool things with JVM bytecode. Just take a look at AspectJ.

JavaScript

JavaScript is another language I kinda like, even if getting it to work consistently across browsers isn't fun, and the development tools aren't exactly what people learned to expect nowadays.

The greatest thing about JavaScript - it enabled the Web 2.0 revolution. JavaScript is pretty much the only practical way of creating cool portable GUI apps. People tried using Java and Flash and other languages for it, but for some reason it just doesn't work anywhere as well.

That's related thing, but XUL is just wonderful. Firefox plugins are the second way of creating cool portable GUI apps, and they're also JavaScript-based. Do you even realize how much work would it take to create something even as simple as a GMail notifier in anything other than JavaScript ?

Everyone loves JavaScript because it's so important for the Web, but it's also a pretty cool language on its own. It uses very elegant object-oriented system based on prototypes. Ruby singleton classes and some evilness can get halfway there, but no further

PHP

It feels like a faux pas to say there are things I like in PHP. I certainly don't like its SQL injection friendliness, and main namespace with 5121 (and counting) inconsistently named functions.

Still, there are a few cool things about it, like arrays ! Perl introduced arrays and hashes as separate data types, and every single language followed. Perl had little choice, as only context determines whether "1" and "01" are the same thing or not. In languages that don't follow Perl's braindead string-number unification, there's really little reason for keeping arrays and hashes conceptually separate. They're both just keyed containers, so why do we need two ?

PHP applications are very easy to deploy compared to Ruby on Rails, Java, Perl or any other web apps. It's partially because of PHP's ubiquity, and good integration with standard webservers and databases. Of course I'm assuming admin didn't tweak with global language rules (like register_globals).

PHP is very scalable. It is actually used in massively popular web applications like Wikipedia. Ruby on Rails isn't there yet.

Haskell

I don't like Haskell. It's a bondage and discipline language - either you think the way it thinks, or it will make your live a living hell.

The best part of Haskell is syntax. It got whitespace sensitivity right, so unlike Python it still has full power blocks. Laziness and pattern matching simplify programs quite a bit, so they often look like specifications, not like code.

Haskell is also actually extremely expressive. Just take a look at "Functional Specification of JPEG Decompression and an Implementation for Free" paper. Ruby does quite well too, especially with a good library like ActiveRecord, but I don't think it's that good.

Software Transactional Memory. I'm really more into concurrency based on message passing, but STM is cool.

Scheme

Macros. Do I really have to say more ? Macros are great. This point also includes simple syntax tree, and no - CL does not have a simple syntax tree if every code walker needs to understand (loop ...). Ever tried working with Ruby's ParseTree ? It does work, but it's really ugly.

It is just unbelievably semantically elegant. I don't know any other language with this level of power and such elegance and simplicity. It makes even Python looks complex and ugly (with bolted-on object-orientation it's hard not to look ugly, but anyway). Other Lisps are usually really horrible here.

Symbolic computations in Scheme are very easy. Just a single main data structure (linked list with symbols), macros, simple pattern matching, tail-call optimization (it makes many things simpler), etc. Ruby surely beats languages like Java here, but is far behind Scheme.

Erlang

Erlang seems pretty cool. Just look at Erlang: The movie at Google Video.

The absolutely coolest thing about it is builtin support for reloading code without losing state. I routinely code half-way solutions for this problem in Ruby, Perl, Python, or whatever language I'm using. In many cases it speeds up development by a factor of ten.

It can also do massively distributed computing based on message passing. Ruby barely does multithreading, and relies on RDBMS-based solutions for scalability.

If you expect Smalltalk/Ruby-style "everything is an object", Erlang is disappointing. But if "objects" are redefined to mean "lightweight processes", Erlang starts to make sense. I'm not really convinced "one object per process with strict message passing" model is better than Smalltalk/Ruby model, but I'd like to experiment with it a bit some day.

Smalltalk

Smalltalk program is a living being, not a dead object. I keep trying to make my programs a bit less dead, and Smalltalk is at the end of the road. Either that, or Erlang.

Smalltalk is object-oriented without any compromises or complications. The only operation is sending a message. Ruby makes a lot of compromises to feel more familiar. The compromises made it less flexible, more complex, less reflective, and also much slower. Do you know how exactly String#=== sets $1 ? Can your class do so to ? (It can, but the method must be coded in C, not Ruby). Can you take object's method, unbind it, rebind, and call ? (Try it on a cloned singleton) What are semantics of protected visibility ? Why so many kinds of class variables ? The complexity means it's much harder to metaprogram without accidentally breaking something in process.

Smalltalk is based on images, not on a bunch of text files. Text files are cool, but needing a few irb sessions, and cut and pasting code between irb and text editor is seriously annoying. And to think that some people have to manually compile programs and don't use svn.

C

The coolest thing about C is Unix. C makes Unix possible. If you don't know why Unix is so cool, just put aside anything you're doing right now, take a couple weeks' break, and learn it well. You're going to thank me later.

C is very pretty language for low-level memory manipulation. Providing only low-level memory access is of course one of its worst weaknesses, but at least they got it right. With C it's just so easy to create fancy data structures, and use every last bit of them. Other languages usually go the other way and provide only high-level memory access. For 99% of programs, it's a better choice. For the rest, we have C. A related issue is fairly simple semantics (at least without aggressive optimizations or multithreading, then it gets nasty). It lets you experiment with many things that are simply not possible in other languages. And few other languages have decent inline assembly support.

C can access a really huge number of libraries. It takes a lot of work to get them work well together (every non-trivial C library of them defines own string type, own pseudo-oo system, many even include own memory manager), they're not portable, they're not secure, and they tend to segfault at random, but the sheer number of available libraries is about as high as Perl's.

C++

C++ is easily the worst language ever. It's one thing to say good things about a language when language in question is Python, it's a completely different thing when it's C++. What next ? Praising Bush's handling of Iraq ? Still, I think I can say something good about it.

It can be used as better C. Low-level, fast (with template hackery easily faster than plain C), backwards compatible, and still significantly better. Few people use C++ that way. More often than not they create monstrous APIs and programs that break under weight of C++'s semantic complexity. But C++ has potential of being "better C".

Template metahackery is fun. Most of the time it's totally useless, and plain C would more than suffice, but who wouldn't want a new programming toy ?

C++ still seems like the only way of creating fast low-level data structures and algorithms with half-decent, and zero-overhead interfaces. C is just too ugly for that (you cannot even have a decent complex_t in C), and everything else adds a pretty high overhead.

Summary

I said a lot of good things about many languages, and I hate them all anyway. I tried to keep true to my feelings, and avoid simply repeating the party line why something is cool. I could write about Objective Caml, Prolog, C#, CL, and all other languages, but I wanted to keep it reasonably short. Maybe one day I'll write another part.

If you want to join the fun, go ahead and write some really cool things about languages you hate most in the comments. I'm not going to delete anything, even if you praise COBOL :-)

Translations

Alyona Lompar translated this post to Ukrainian.

Maria Ramos from Webhostinghub.com translated this post to Spanish.

taw's blog

Tuesday, February 27, 2007

Ruby and Judy

Benchmarking memory usage

Judy

Results

Conclusions

Wednesday, February 21, 2007

Atomic Coding with Subversion

Subversion basics

Atomic Coding

Making Ruby faster

CFLAGS

Optimizing `rb_call0`

Specializing hash tables

Making `Fixnum` operations faster

Summary

Monday, February 19, 2007

Yannis's Law: Programmer Productivity Doubles Every 6 Years

Saturday, February 10, 2007

The right to criticize programming languages

Python

Perl

Java

JavaScript

PHP

Haskell

Scheme

Erlang

Smalltalk

C

C++

Summary

Translations

About Me

Popular Posts

Blog Archive

Labels

My software

Creative Commons

taw's blog

Tuesday, February 27, 2007

Ruby and Judy

Benchmarking memory usage

Judy

Results

Conclusions

Wednesday, February 21, 2007

Atomic Coding with Subversion

Subversion basics

Atomic Coding

Making Ruby faster

CFLAGS

Optimizing rb_call0

Specializing hash tables

Making Fixnum operations faster

Summary

Monday, February 19, 2007

Yannis's Law: Programmer Productivity Doubles Every 6 Years

Saturday, February 10, 2007

The right to criticize programming languages

Python

Perl

Java

JavaScript

PHP

Haskell

Scheme

Erlang

Smalltalk

C

C++

Summary

Translations

About Me

Popular Posts

Blog Archive

Labels

My software

Creative Commons

Optimizing `rb_call0`

Making `Fixnum` operations faster