taw's blog: June 2007

Wednesday, June 27, 2007

Regular expressions and strings with embedded objects

Regular expressions are among the most powerful programming techniques ever invented. Real world "regular expressions" are only loosely related to Computer Science "regular expressions". Computer Science "regular expressions" can only provide yes/no answers to "does this string match that regular expression" type of questions. We are usually interested in much more than that - we want to extract data from strings, convert strings to other strings and so on. We call expressions used for this purpose "regular" too for historical reasons.

Regular expressions are extremely concise, but sometimes they don't suffice and we need to write a "real parser". Unfortunately even with the best parser generating tools parsers tend to be many times more complex and error prone than equivalent regular expressions. And if the problem is too complex for regular expressions, very often it is also too complex for whichever parser generator you're using, and needs a lot of nontrivial hacking around limitation of it, or even writing a parser by hand.

Fortunately for many problems there's an alternative to parsers and parser generating tools - regular expressions plus a few tricks. This blog post is about one of such trick.

For my bot which verifies links in Wikipedia I needed to extract data from SQL dumps. SQL dumps look something like that:

INSERT INTO `page` VALUES (1,0,'Astronomia','',1800,0,0,0.600461925007833,'20070601091320',8076762,8584,0), (2,0,'AWK','',329,0,0,0.487812640599732,'20070530195555',8058046,4265,0), (4,0,'Alergologia','',108,0,0,0.580574716050713,'20070520093413',7912844,292,0), ...
INSERT INTO `page` VALUES (14880,0,'Dźwignica_linotorowa','',26,0,0,0.597327036408081,'20060814072401',4282357,727,0), (14881,0,'Urządzenia_transportowe','',91,0,0,0.176666489966834,'20070527090143',2976610,1041,0), ...

Basically a bunch of INSERT statements with multiple tuples each. I wanted to iterate over the tuples. Extracting tuple data from the sql dump is a simple next unless line =~ /\AINSERT INTO `page` VALUES (.*)\Z/. Spliting this blob into tuples and tuple fields is almost trivial - /\),$/ separates tuples and , separates tuple fields. Unfortunately there are many strings inside, and some of them might contain commas.

Sure, it's posisble to write a single monster regular expression to do just that, but it would be quite error-prone. Wouldn't it be easier to simply treat whole SQL strings as single "objects" inside the string ? That's exactly what we're going to do. First we need to get rid of \-escapes. That's not really necessary, as a regular expression for matching strings with \-escapes inside wouldn't really be that complicated, but we can make it even simpler this way. So every /\\(.)/ becomes "\x00" + esc[$1], where values of esc[...] are "safe" characters which won't interfere with further parsing, like A, B, C, and so on.

At this point every ' marks either a beginning or an end of some string. So we can replace all strings by object ids like "\x01<1234>", where 1234 is a suitable number. After we do that we can split on /$,\(/ and , just like we wanted. Afterwards we need to convert embedded objects (like \x01<1234> and \x00A) back to their original form.

The complete code is here:

def hash_or_die(kw)
  Hash.new{|ht,k| raise "Unknown key: #{k}"}.merge(kw)
end

def parse(data)
  esc = hash_or_die "\\" => "A", "\"" => "B", "n" => "C", "'" => "D"
  rev_esc = hash_or_die "A" => "\\", 'B' => "\"", "C" => "n", "D" => "'"
  data = data.gsub(/\\(.)/) {"\x00" + esc[$1]}
  strs = []
  data = data.gsub(/('[^']*')/) { # '
    strs << $1
    "\x01<#{strs.size-1}>"
  }
  records = []
  data.scan(/\((.*?)\)/) {
    records << $1.split(/,/).map{|field|
      field.gsub(/\x01<(\d+)>/) {
        strs[$1.to_i]}.gsub(/\x00(.)/){ rev_esc[$1]
      }
    }
  }
  records
end

def sql_str_unquote(str)
  str =~ /\A'(.*)'\Z/ or raise "SQL string format is wrong: #{str}"
  $1.gsub(/\\(.)/) {$1}
end

page_fn = Dir["plwiki-*-page.sql"].sort[-1]
externallinks_fn = Dir["plwiki-*-externallinks.sql"].sort[-1]

pages = {}

File.open(page_fn).each{|line|
  next unless line =~ /\AINSERT INTO `page` VALUES (.*)\Z/
  parse($1).each{|id,ns,title,*stuff|
    next unless ns == "0"
    title = sql_str_unquote(title)
    pages[id] = title
  }
}

File.open(externallinks_fn).each{|line|
  next unless line =~ /\AINSERT INTO `externallinks` VALUES (.*)\Z/
  parse($1).each{|from,to,index|
    title = pages[from]
    next unless title
    to = sql_str_unquote(to)
    next unless to =~ /\Ahttp:\/\//
    puts "#{title}\t#{to}"
  }
}

The same technique can be used to parse many other things like parsing Lisp code:

require 'pp'

lisp_code = '(a (b c) (d (e) f g) (((h))))'
nodes = []

lisp_code.gsub!(/([a-z]+)/) {
  nodes << [:atom, $1]
  "<#{nodes.size-1}>"
}
lisp_code.gsub!(/\s/,"")
true while lisp_code.gsub!(/\(((?:<\d+>)*)\)/) {
  nodes << [:app, *$1.scan(/<(\d+)>/).map{|x,| nodes[x.to_i]}]
  "<#{nodes.size-1}>"
}
lisp_code =~ /<(\d+)>/
pp nodes[$1.to_i]
# Output:
# [:app,
#  [:atom, "a"],
#  [:app, [:atom, "b"], [:atom, "c"]],
#  [:app, [:atom, "d"], [:app, [:atom, "e"]], [:atom, "f"], [:atom, "g"]],
#  [:app, [:app, [:app, [:atom, "h"]]]]]

and math expressions:

require 'pp'

math_code = '(2 + 2 * 2) / ((2 + 2) * 2)'
nodes = []

math_code.gsub!(/(\d+)/) {
  nodes << $1.to_i
  "<#{nodes.size-1}>"
}
math_code.gsub!(/\s/,"")

until math_code =~ /\A<(\d+)>\Z/
  next if math_code.gsub!(/\((<\d+>)\)/) { $1 }
  next if math_code.gsub!(/<(\d+)>([\*\/])<(\d+)>/) {
    nodes << [$2, nodes[$1.to_i], nodes[$3.to_i]]
    "<#{nodes.size-1}>"
  }
  next if math_code.gsub!(/<(\d+)>([\+\-])<(\d+)>/) {
    nodes << [$2, nodes[$1.to_i], nodes[$3.to_i]]
    "<#{nodes.size-1}>"
  }
end
pp nodes[$1.to_i]
# Output:
# ["/", ["+", 2, ["*", 2, 2]], ["*", ["+", 2, 2], 2]]

Technique of embedding objects in strings and matching such strings with regular expressions is simple and very powerful. Objects can be represented in many ways. If numbers and some sort of brackets are not relevant for parsing, and "\x00" doesn't occur in the input (it almost never does), "\x00<ID>" is a good idea. Fancy Unicode private characters can be used too if the regular expression engine can handle them. If you want to treat different objects in different way you can represent them in some convenient form like \x00<CLASS;PROPERTY_A;PROPERTY_B;ID>. The technique works best in languages with integrated regular expression engines like Ruby, Perl, and RLisp. In others like Python and Java it's going to be somewhat uglier, but still better than full-blown "parsers".

Tuesday, June 26, 2007

Healthcare in Poland

Currently in Poland there's a lot of discussion about the healthcare system. Poland has a mostly state-run healthcare. Not only the health insurance is public, even the hospitals are operated by the government. The former is probably a good idea, considering how horrible the healthcare system is in the only developed country where it is privately run. The latter is a relic of the Communist era which refuses to die.

In the last few months doctors and other healthcare workers were constantly protesting, demanding increased government spending on health care - primarily demanding better salaries. The media completely failed to do their job and didn't provide any real data on health care or health care financing whatsoever. We bloggers need to fill in the gap.

The data wasn't particularly hard to find. It's available from WHO European health for all database (HFA-DB). The comparison is for all 12 "New EU" countries in group and individually, plus "Old EU" and "EU total". I didn't include old EU countries here, as their situation is significantly different, so it wouldn't be particularly meaningful.

Country	Life expectancy (2005)	Infant mortality per 1000 births (2005)	Healthcare spending as % of GDP (2004)	Healthcare spending in USD PPP (2004)
Bulgaria	72.6*	11.65*	8	671
Cyprus	79.54*	3.01*	5.8	1128
Czech Republic	76.19	3.39	7.3	1412
Estonia	72.89	5.44	5.3	752
Hungary	73.02	6.23	7.9	1308
Latvia	71.06	7.8	7.1	852
Lithuania	71.33	6.84	6.5	843
Malta	79.44	5.96	9.2	1733
Poland	75.12	6.42	6.2	814
Romania	71.88*	16.84*	5.1	433
Slovakia	74.3	7.2	7.2	1061
Slovenia	77.58	4.15	8.7	1815
Whole EU	78.44	5.17	8.7	2334.32
Old EU	79.63*	4.34*	9.29	2729.1
New EU	73.96	8.71	6.51	869.61

* - data for 2004

The spending figures are total spending - that is public + private. As you can see healthcare in Poland is somewhat better and somewhat cheaper than "new EU" average. It's still far behind the old EU, but it's not really a crisis situation, especially since the results are improving and the spending is increasing - between 2000 and 2005 life expectancy increased 73.95 to 75.12, infant mortality fell 8.11 to 6.42. Between 2000 and 2004 spending increased as percent of GDP from 5.7 to 6.2, and in USD PPP from 587 to 814.

So the protests seem to be primarily politically motivated. The situation is fairly good compared to other countries in similar situation and is steadily improving. Another thing strongly implying political motivation is the sudden support for the protesting doctors from the normally vehemently anti-union opposition party Platforma Obywatelska.

Making it easy for users to write quality bug reports

One of the coolest things about making your programs public is the user feedback you will get. Some is going to be "Awsum thx, your program just saved my life" and "I tried to run it on Amiga 500 and it crashed, you suck", but in my experience most of the feedback consists of very helpful suggestions and bug reports.

This is very valuable, and as a developer you have a great influence on quality of the feedback. For most Unix console programs and Ruby libraries nothing needed to be done, as normal stack traces get printed to the terminal, and Unix users tend to know how to write good bug reports, but for jrpg it wasn't that simple - many bugs were highly nondeterministic, and as most users ran it on Windows there was no console to print stack traces to. At first I kept telling users to retry with some sort of console turned on, but after two or three such cases I wrote the following code. In util.py:

def save_errormsg(trace_back):
    (tp,v,tb) = trace_back
    tbf = traceback.format_exception(tp,v,tb)
    f = open("errormsg.txt", "a")
    f.write("== ")
    f.write(time.asctime(time.localtime()))
    f.write(" ==\n")
    for line in tbf:
        f.write(line)
    f.write("\n")
    f.close()

And in jrpg.py:

try:
    mistakes = Mistakes()
    book = demonsoul.Book_of_demons()
    mhc  = Main_Hero_Controller()
    wm   = World_model()
    wv   = World_view()
    ui   = UI()

    main_hero = Chara("female-blue", position=(0, 0))
    main_hero.is_main = True

    wm.switch_map("world", (14,25))

    ui.change_text([U"Welcome to the 日本語RPG", U"", U"Press F1 for quick help"])

    ui.main_loop()
except SystemExit:
    pass
except Exception:
    util.save_errormsg(sys.exc_info())
    raise

All exceptions get saved to errormsg.txt. When user reports a bug, I can ask them to attach this file, and thanks to stack backtraces bugs are much easier to fix. For boring technical reasons we don't really want to capture SystemExit, so we let it through.

Another thing that increases feedback quality (and also user count) is good packaging and really good documentation on how to get started - that's where most of the problems seem to be. Listening to users also helps - as you wrote the program it's too easy for you to convince them that they're wrong, and very few users are going to argue with that. Often you're right and their ideas wouldn't work for some reason, but not always. What I found very helpful was explaining the rationale behind the things being the way they are in detail every time an user suggests a change. A couple of times it let me find out that the user was actually right. For example backspace on jrpg repeated significantly too fast - I did some calculations which showed the speed to be just right, and I got used to the way it worked so I didn't notice, but when I tried to explain that to an user I found out that there was a mistake in my calculations and it should really be slowed down a bit.

Being nice to users and replying quickly also improves feedback.

Toy C backend for RLisp compiler

I wrote a toy RLisp C "backend". It's not yet connected to the real compiler - the only thing it compiles is Ackermann function - of course you can make it compile something else by issuing right opcodes by hand.

First some support code. Symbol#<=> is already in rlisp_support.rb, String#ord is supposed to be a wrapper around different behaviour of Strings in 1.8 and 1.9. Symbol#mangle_c and Symbol#stringify_c convert Ruby symbols (which may contain funny characters like :"==\nblah!@#") to C variable names and strings. For brevity I'm not pasting the tests here.

class Symbol
  include Comparable
  def <=>(other)
    to_s <=> other.to_s
  end
  protected :<=>
end

class String
  # FIXME: 1.8 specific, add support for 1.9
  def ord
    self[0]
  end
end

class Symbol
  def mangle_c
    to_s.gsub(/([^a-zA-Z0-9])/) { sprintf "_%02x", $1.ord }
  end
  def stringify_c
    '"' + to_s.gsub(/([\\"])/) { "\\#{$1}" } + '"'
  end
end

Here's the code which calls the code generator, and then the ack function. It uses Ruby->C trampoline to support closure variables. The interface used by code generator is similar to one used by normal Ruby-backed RLisp code generator, but not identical. I'm going to refactor them both to match, so RLisp frontend can talk with either backend. I think C-compiling at least RLisp stdlib is a good idea. C-compiling REPL less so. Code generators are so simple that I hope to be able to maintain both without too much work.

class Stuff
  cg = RLispCodegenC.new(:ack, [:m, :n])
  t2, t3, t4, t5, t6, t7, t8, t9, t10, t11, t12, t13, t14, t15 = cg.tmps(14)

  cg.funcall(t2, :m, :"==", 0)

  cg.x_if(t2) {
    cg.funcall(t4, :n, :+, 1);
    cg.asg(t3, t4)
  }
  cg.x_else {
    cg.funcall(t5, :n, :==, 0);

    cg.x_if(t5) {
      cg.global_get(t7, :ack)
      cg.funcall(t8, :m, :-, 1);
      cg.funcall(t9, t7, :call, t8, 1);
      cg.asg(t6, t9)
    }
    cg.x_else {
      cg.global_get(t10, :ack)
      cg.funcall(t11, :m, :-, 1)
      cg.global_get(t12, :ack)
      cg.funcall(t13, :n, :-, 1)
      cg.funcall(t14, t12, :call, :m, t13)
      cg.funcall(t15, t10, :call, t11, t14)
      cg.asg(t6, t15)
    }
    cg.asg(t3, t6)
  }
  cg.ret(t3)

  inline do |builder|
    result = builder.c cg.build
  end
end

ack_m = Stuff.new.method(:ack)
globals = {}
closure = { :globals => globals }
globals[:ack] = lambda{|*args| ack_m.call(closure, args) }

print "retval = ", globals[:ack].call(3, 3), "\n"

RLispCodegenC builds one function at time. To use it you need to create a new RLispCodegenC instance, call some opcode methods, call build on the RLispCodegenC object, and compile it using inline. inline caches .so objects, so C compiler is called only when something changed.

require 'inline'

class RLispCodegenC
  def initialize(name, args)
    @name  = name
    @args  = args
    @syms  = {:globals => true}
    @ids   = {}
    @temps = []
    @code  = ""
  end

  def build
    temps = [:globals] + @args + @temps
    src = <<EOF
VALUE #{@name}(VALUE closure, VALUE args) {
  VALUE #{temps.join(", ")};
  #{static_init}
  #{arg_init}
  #{@code}
}
EOF
    src
  end

  def arg_init
    res = "globals = rb_hash_aref(closure, SYM_globals);\n"
    @args.each_with_index{|arg, i|
      res << "#{arg.mangle_c} = rb_ary_entry(args, #{i});\n"
    }
    res
  end

  # This code should be executed just once in Init_*, not every time the method is called
  def static_init
    ids = @ids.keys.sort
    syms = @syms.keys.sort
    res = ""
    res << %Q[int #{ids.map{|i| "ID_#{i.mangle_c}"}.join(", ")};\n] unless ids.empty?
    res << %Q[VALUE #{syms.map{|s| "SYM_#{s.mangle_c}"}.join(", ")};\n] unless syms.empty?
    ids.each{|i| res << "ID_#{i.mangle_c} = rb_intern(#{i.stringify_c});\n" }
    syms.each{|s| res << "SYM_#{s.mangle_c} = ID2SYM(rb_intern(#{s.stringify_c}));\n" }
    res
  end

  def tmps(sz)
    res = []
    sz.times{
      t = :"t#{@temps.size}"
      res << t
      @temps << t
    }
    res
  end

  def to_c(x)
    if x.is_a? Fixnum
      "INT2FIX(#{x})"
    # It means a C temporary, ***NOT*** Ruby symbol
    # FIXME: Mangling ?
    elsif x.is_a? Symbol
      x.mangle_c
    else
      raise "Don't know how to convert #{x.class}: #{x.inspect}"
    end
  end

  def funcall(asg_to, recv, mid, *args)
    @ids[mid] = true
    mid_s = "ID_#{mid.mangle_c}"
    args_m = args.map{|a| ", " + to_c(a)}.join
    @code << "#{to_c(asg_to)} = rb_funcall(#{to_c(recv)}, #{mid_s}, #{args.size}#{args_m});\n"
  end
  def x_if(tmp)
    @code << "if(RTEST(#{tmp})) {\n"
    yield
  end
  def x_else
    @code << "} else {\n"
    yield
    @code << "}\n"
  end
  def asg(to, from)
    @code << "#{to_c(to)} = #{to_c(from)};\n"
  end
  def global_get(to, sym)
    @syms[sym] = true
    sym_s = "SYM_#{sym.mangle_c}"
    @code << "#{to} = rb_hash_aref(globals, #{sym_s});\n"
  end
  def ret(val)
    @code << "return #{to_c(val)};\n"
  end
  def debug(msg, val=nil)
    @code << %Q[rb_funcall(self, rb_intern("print"), 1, rb_str_new2(#{msg.inspect})); /* DEBUG */\n]
    @code << %Q[rb_funcall(self, rb_intern("p"), 1, #{val}); /* DEBUG */\n] unless val.nil?
  end
end

Wednesday, June 20, 2007

Object-Oriented dialects of Lisp

Lisp is much older than object oriented programming. When OOP got popular people wanted to do OOP in Lisp too, but it wasn't obvious how to retrofit Lisp to make it object oriented. CLOS for Common Lisp became somewhat popular, but its quite far from what Smalltalkers would call "object oriented", and some people feel it wasn't very Lispy either. Many Lispers like Paul Graham simply gave up on OOP. Others like me decided to code their own object-oriented dialects of Lisp. Solutions they came up with are very different. I checked the following Object-Oriented dialects of Lisp:

RLisp - Lisp integrated with Ruby
e7 - Lisp integrated with Python
goo - Object-Oriented Lisp inspired by Scheme
CLOS - Object-Oriented system built on top of Common Lisp
I also wanted to check Coke, but it segfaulted at me, and the last thing I felt like doing was debugging broken C code.

In all four I tried to write the same snippet:

Define class Point representing 2D points of vectors
Define initializer for this class, which takes 2 arguments x and y and returns a Point instance
Define a method for stringifying Points, like Ruby's to_s and Python's __str__. If possible, I wanted it to be automatically called by REPL and (print a_point) or its equivalent.
Define a method for adding two Points. I wanted to call it + if possible.
Create two points, add them, and print the result.

RLisp code:

(let Point [Class new])
(class Point
  (attr-reader 'x)
  (attr-reader 'y)
  (method initialize (x y)
    (set! @x x)
    (set! @y y))
  (method to_s ()
    "<#{@x},#{@y}>")
  (method + (other)
    [Point new (+ [self x] [other x]) (+ [self x] [other y])]))

(let a [Point new 1 5])
(let b [Point new -2 9])
(print [a + b])

e7 code:

(class Point ()
  (def (init self x y)
    (set-self x x y y)))
(defmethod (+ (self Point) (other Point))
  (Point
    (+ self.x other.x)
    (+ self.y other.y)))
(defmethod (print (self Point) f)
  (fwrite f (format "<%s,%s>" self.x self.y)))

(def a (Point 1 5))
(def b (Point -2 9))

(print (+ a b))

goo code:

(dc <point> (<any>))
  (dp point-x (<point> => <num>))
  (dp point-y (<point> => <num>))

(dm point-new (x|<num> y|<num>)
  (new  point-x x point-y y))

(dm point-add (p1|<point> p2|<point> => <point>)
  (point-new
    (+ (point-x p1) (point-x p2))
    (+ (point-y p1) (point-y p2))))

(dm write-point (port|<out-port> x|<point>)
  (msg port "<%s,%s>" (point-x x) (point-y x)))

(dv a (point-new 1 5))
(dv b (point-new -2 9))

(write-point out (point-add a b))
(newline out)

CLOS code:

(defclass point ()
  ((x :reader point-x :initarg :x)
   (y :reader point-y :initarg :y)))
(defmethod make-point (x y)
  (make-instance 'point :x x :y y))
(defmethod point-add (a b)
  (make-point
    (+ (point-x a) (point-x b))
    (+ (point-y a) (point-y b))))

(defmethod point-print ((p point))
  (format t
    "<~s,~s>" (point-x p) (point-y p)))

(setf a (make-point 1 5))
(setf b (make-point -2 9))
(point-print (point-add a b))

The first thing I noticed is how different these snippets look in spite of doing pretty much the same thing. The second is that RLisp and e7 are much more concise than goo and CLOS. RLisp and e7 have more syntactic extensions for OO. RLisp supports [ ] syntax for method calls, self for message receiver, and @ivar for instance variables. e7 doesn't go that far and limits itself to obj.attr syntax for attribute access and method calls. In e7 like in Python all attributes are public. In RLisp like in Ruby all attributes are private, and (attr-reader 'x) must be explicitly called.

In RLisp and e7 attributes are dynamic and aren't predeclared anywhere. In goo and CLOS list of attributes is part of class definition. goo even requires attribute types, but will happily accept <any>. Only RLisp seemed to provide a clear way of stringifying objects. recurring-write in goo converts standard objects to strings, but write for some reason ignored extensions of it. In e7 the standard way is overloading print. It seems more limited than stringification method, but it should be possible to print to string buffer instead of a real file. I have no idea what CLOS uses to stringify objects.

The most important ideological difference is treatment of method calls. RLisp makes message sends and function calls separate operations much like Smalltalk-style OOP languages. All others define methods as some kind of generic functions. This means weaker encapsulation and some deep philosophical differences. Coke also separates function calls and message sends, but like I said it was segfaulting, so I was unable to take a closer look at it.

Only in RLisp creating a class and defining stuff inside it are separate operations. Ruby is ambiguous as class Foo can either define a new class or reopen existing class, and RLisp tried to avoid this ambiguity. Of course it's possible to do both with a simple macro. It increased verbosity of code somewhat.

In e7 class is also a constructor. In RLisp it's just an object. CLOS and goo seem to treat classes differently from other objects.

goo and CL provided only REPL environment by default, and didn't like running scripts. RLisp and e7 supported REPL and script mode without any extra hassle.

From a purely subjective point of view, I liked RLisp way most, what's not particularly surprising coming from RLisp's author ;-). Coding in e7 also felt good. On the other hand goo and especially CLOS felt rather unelegant and unpleasant.

Tuesday, June 19, 2007

COBOL and Fortran evidence for Yannis's Law

Many people find it hard to believe how much programmer productivity increased. For them I have two good examples - CGI script in Fortran, and Sudoku solver in COBOL.

People actually used to code stuff like:

      Write(6,189)
      Write(6,190) QS(7:7),QS(8:8),QS(9:9),QS(10:10),QS(11:11),QS(12:12)

 189  Format('<Table border=0><TR><TD bgcolor=' '#FFFFFF' '>')
 190  Format('Submitted color= 'AAAAAA' </TD></TR></Table>')

      Write(6,193)
 193  Format('<BR>*</TD></TR>')
      Go to 320
        else if(j.ne.0) then

and even:

           DISPLAY 'PT3-W700      = ' PT3-W700.
           DISPLAY 'WAGON-CNT-W700= ' WAGON-CNT-W700.

           DISPLAY 'ARRAY-C:'.

           PERFORM VARYING SUB17 FROM 1 BY 1
                 UNTIL SUB17 > 81
                   OR  ARRAY-C-NUM (SUB17) = ZERO
               DISPLAY ARRAY-C-NUM (SUB17) SPACE
                       ARRAY-C-POSS (SUB17)
           END-PERFORM.

It's not just Ruby vs Python vs Java vs Haskell vs Erlang vs Arc thing. There are things far worse than Java. I'm not sure about COBOL vs C++ though.

Sunday, June 17, 2007

Crosscompiling Python packages with VirtualBox and py2exe

A few weeks ago I complained that jrpg development pretty much ceased because Windows packaging became such a pain in the ass. Fortunately I managed to setup a VirtualBox environment in which I can cross-compile jrpg and other Python packages and create convenient zip files containing Windows executables. This means Windows users don't need to install Python and Pygame, and I don't need a separate Windows box.

The setup process was far from trivial, as it was my first contact with VirtualBox, and I knew very little about Cygwin and Windows in general. The first thing was to install VirtualBox. There are many freely downloadable virtual machines, this one just happened to have easily googlable Ubuntu package and made it perfectly clear what is it supposed to do, so I didn't look any further. VMware in contrast has many different packages (Server, Player, Workstation, whatever), and to a person who never used it before it isn't clear which one is meant for what.

VirtualBox packages for all major Linux distributions are available from here. Setting up VirtualBox for the first time is not simple, but the documentation (also /opt/VirtualBox-1.4.0/UserManual.pdf) described it very well. Documentation usable for non-experts is one of the things which corporation-backed software often gets better than Open Source.

After installing VirtualBox you will also need to install uml-utilities and
bridge-utils, add yourself to groups vboxusers and uml-net, log out, and in again. It will only be needed later, but it's easier to do it right now.

The next step is creating the VirtualBox. Run command VirtualBox for a pretty GUI. There's also VBoxManage CLI interface, but it's probably better to use GUI if it's your first time. Click on "New", and follow the creation wizard. Don't worry much about allocating diskspace - empty virtual diskspace doesn't take any space on the host hard drive. I gave it virtual 5GB, and it actually uses 1.5GB.

So the next step is installation of the operating system. Grab Windows XP install CD, or any other Windows install CD lying around, or torrent one in the unlikely case you cannot find any. You can tell VirtualBox to either use actual CD or an image file. Start the machine, and enjoy Windows installation. It's much less painful than real Windows installation, as you don't need any drivers or service packs. Oh, and for your sshing convenience select the same username as you have on your normal box.

After you're done start Internet Explorer on the VirtualBox. Go directly to http://www.opera.com/download/ and get a real browser (Firefox will work too). With a real browser install Python, Pygame, py2exe, WinSCP, and Cygwin. Neither Opera/Firefox nor WinSCP are strictly necessary, but I prefer Windows toasters that at least somewhat resemble real computers.

You will need to install a few more Cygwin packages - at least openssh and zip. So run the Cygwin shell. Does it work ? Great, now you need to configure ssh. The process is somewhat painful, but it's documented step by step. To verify that everything works, try sshing from Cygwin shell to localhost, and from Cygwin shell to your regular box. If it works, add ssh key from your regular box to authorized keys on the virutal box. scp real_box:~/.ssh/id_rsa.pub real_box_key; cat real_box_key >>.ssh/authorized_keys. If not, refer to Cygwin documentation. I'd describe it in detail, but .bash_history on the VirtualBox somehow evaporated, and I don't remember every single step, sorry ;-).

We are still not ready to ssh from the real machine to the VirtualBox. VirtualBox by default sets up networking in NAT mode, and connections are only possible from it, not to it. Run VirtualBox command, go to Settings, Network, switch from "NAT" to "host interface", and set interface to tap0. Now we actually need to setup some interface on the host. Make /dev/net/tun readable and writable to you, for example 0660 root.uml-net (you should have already added yourself to uml-net - remember that you need to logout and login again after you do that).

Now add to /etc/network/interfaces:

auto tap0
iface tap0 inet manual
  up ifconfig $IFACE 0.0.0.0 up
  down ifconfig $IFACE down
  tunctl_user taw

auto br0
iface br0 inet dhcp
  bridge_ports all tap0

And start both interfaces ifup tap0 br0. They are set to auto so they will start automatically the next time you start your computer. This part is very Ubuntu-specific, if you use another distribution refer to VirtualBox documentation.

Start the virtual machine again from GUI or from shell command VBoxManage startvm 'Machine Name'. Check its IP using command ipconfig. In my case host machine has IP 10.0.0.12 and the virtual machine has IP 10.0.0.6. Try logging in from the host - ssh 10.0.0.6. If everything went well we're almost done.

I use the following Rakefile task to build jrpg:

task :jrpg_windows_package do
  magicbox = "10.0.0.6"
  jrpg_package_path = FileList["/home/taw/everything/website/packages/jrpg-*.tar.gz"].sort[-1]
  jrpg_package_fn = File.basename(jrpg_package_path)
  jrpg_package_fn =~ /(\d{4}-\d{2}-\d{2})/
  windows_package_fn = "jrpg-windows-#{$1}.zip"
  windows_package_path = "/home/taw/everything/website/packages/#{windows_package_fn}"

  unless File.exists?(windows_package_path)
    # Please turn on VBox:
    # $ VBoxManage startvm 'Magic XP box
    sh "scp", jrpg_package_path, "#{magicbox}:"
    sh "ssh", magicbox, "rm -rfv jrpg/"
    sh "ssh", magicbox, "tar -xvzf '#{jrpg_package_fn}'"
    sh "scp", "/home/taw/everything/jrpg/setup.py", "#{magicbox}:jrpg/"
    sh "ssh", magicbox, "cd jrpg; c:/Python25/python.exe setup.py py2exe -b 2 -O2"
    sh "ssh", magicbox, "cd jrpg; mv dist jrpg"
    sh "ssh", magicbox, "cd jrpg; zip -r '../#{windows_package_fn}' jrpg/"
    sh "scp", "#{magicbox}:#{windows_package_fn}", windows_package_path
    # You can safely turn off the VBox
  end
end

And the following setup.py:

#!/usr/bin/python

# Run under cmd: setup.py py2exe -b 2 -O2

from distutils.core import setup
import py2exe
import glob

setup(
  windows=["jrpg.py"],
  data_files=[
    ("maps", glob.glob("maps/*")),
    ("data", glob.glob("data/*")),
    ("images", glob.glob("images/*")),
    ("font", glob.glob("font/*")),
    "README",
    "DESIGN",
  ]
)

It should be easy to customize them both to your needs. I haven't yet found how to conveniently turn the VirtualBox on and off - it can supposedly use snapshots, but they don't seem to work well with the openssh server. In any case the pain of packaging became much less unbearable. If you have any questions, go ahead and ask.

Tuesday, June 12, 2007

How to be productive

Personal productivity, Getting Things Done, or whatever you want to call it, became one of the biggest things on the Internet in the last few years.

Everybody seems to feel they're not achieving as much as they could. Most people feel overwhelmed by having too many things to do. They procrastinate on important things and feel guilty later. They would like to do something about it, but aren't exactly sure what can be done. Maybe they even tried something in the past and failed.

Now would be a great time to try again. The Internet contains all the producticity resources you will ever need, and so many new fun distractions that at some point you will simply have to act.

There are many productivity resources, each telling a somewhat different story. To a large extend they simply focus on different things, but some techniques recommended by some and strongly discouraged by others. This situation is not really much different from other fields. For example in software engineering pretty much everybody agrees about using revision control and test-driven development, but about things like which web framework to use or wheather to write specifications or not, there are as many opinions as people asked.

This post is not attempting to create The Grand Unified Theory of Productivity. It wouldn't really work, just like no Grand Unified Theory of Software Development is possible at this point. I think live and work of web developers, brain surgeons, and newspaper cartoonists differ too much, people's personalities differ too much, and even if a single "right" answer exists, we are likely not to know it yet.

Here's a quick walkthrough on things that people all seem to agree on.

You can do something about it. Being productive is a skill, not a innate ability.
Simple tricks may bring you some improvement, but if you want a major improvement, systematic approach is needed.
Never keep anything in your head. Human brain is great at thinking but sucks as a storage place. Paper is better. Plain text files are better. Personal wikis are better. Everything is better as a storage place than your head.
Always think on paper. Drawing mindmaps using colored pens works particularly well. Just like nobody creates software or does math in their heads, planning without some writable surface to put your plans on is just horribly ineffective. Personally I think that paper is far better than any computer-based writing system so far, but computer-based systems at least beat doing it in your head.
When in doubt, toss it.
Throw away reference material that you aren't sure if you will need.
Throw away good ideas. Write about them on your blog so someone else may use them or just toss them. Throwing away good ideas creates mental space for great ideas. Throw away bad ideas too of course.
Simple flat reference systems work better than complicated hierarchical systems. If you forget when exactly you put something in flat system there are only a few places to check. In a hierarchical system number of places is much larger.
Measuring efficiency in percents and trying to be "100% efficient" is one of the worst ideas you could have. Even if you spend all your time doing your tasks with the maximum humanly possible efficiency, are they the best things you could do ? Is it really impossible to have more impact, do more important things, earn more money, or whatever ? For example programmers are getting twice as productive every 6 years. How can you assign percent value to their current productivity ?
Perfectionism leads to a huge waste of time. 20% of effort generates 80% of value. So the other 80% of effort generates just 20% of value. Anything worth doing is worth doing poorly, getting it done fast, and improving it with time.
Things show in your life as amorphous "stuff". For each thing which shows up decide up front what is it, if you want to do something about it, what is your desired outcome, and what's the very next action leading there.
After you decide it, put it on some lists which you will look at often enough to keep it out of your head. Your world and your priorities change very quickly, so review this list often, at least once a week.
"Important" and "urgent" are not the same thing. Many things are very important but never get urgent - like learning new skills. Other things seem urgent but have very little long-term value. Spend more time doing the important, and less time doing the not so important but urgent things.
Every single productivity expert agrees that single-tasking is far more effective than multi-tasking. I'm not really convinced here. Maybe it's a personality issue, with some peolpe having much shorter attention spans than others. Even if you're not going to single-task it's a good idea to limit number of things you're doing at the same time to only a few.
If something can be done in less than 2 minutes, do it when it shows up or toss it, even if it's not very important. Managing such tasks takes more time more than doing them up front.
Some things are boring, but you want to do them anyway, and they don't take that much time. You know, like packaging and documenting software. So just do them right now and be done with them.
Detailed plans, deadlines, ABC priority coding and so on would only work well if things weren't changing fast. In other words - they don't well work for anyone these days.
Complex tasks should be split into small steps. "Make Ruby faster than Java" is a nice goal, but how do you even start ? "Create some benchmarks measuring performance of Ruby", "Profile Ruby interpretter to identify bottlenecks", and "Replace Ruby hash table implementation by Judy" are something you can actually do.
Whatever system you have, it will need regular review. Without it even the best systems would break and you'd be back where you started.
Take a lot of notes, throw away 80% of them later. Making too many notes it better than not making enough. Always have some pen and paper with you.
Don't accept too many commitments. If you don't care about something, just say no.
Eliminate floating information. Every information should have a "home", and be easy to retrieve.
Have a single system.
If you have something on your list of things to do, but avoid doing it, ask yourself why is it on your list.
Humans cannot work productively more than about 40 hours a week.
Feedback is important. Unit testing results, money, number of people who visit your blog, any feedback.
People give negative feedback much more often than positive feedback. It might be a good idea to change it.

The best known productivity book these days is Getting Things Done by David Allen. It's available on paper, PDF, and audio, and the GTD system is described on thousands of websites and blogs. Another popular one is First Things First by Stephen Covey.

High productivity is achievable for everyone, in every context, including contexts not usually associated with "productivity", what you can see on the following video by Melissa Gira from Sexerati (CC-NC-ND):

Click To Play

Code Geass sucks. Most of anime sucks.

I've seen the first few episodes of Code Geass on Baka Y2K6 and I loved it. There was a dystopian world with a pretty facade, and a guy who somehow acquired special powers and decided to use them to completely rebuild everything. It felt just like Death Note except with mecha.

Unfortunately instead of a morally provocative anime with an interesting plot, Code Geass quickly turned into just another awfully boring mecha anime with every single character being a walking cliché, and the plot consisting mostly of filler material and randomly generated events. Things happen not because of some world logic or actions of characters, but because the whole thing would collapse under its own lameness otherwise. Having run out of plot ideas the authors should at least have the decency to finish the series, but no - apparently the second season is already scheduled.

What's happening with anime anyway ? Other than Death Note I haven't seen anything interesting lately. There's a lot of generic "cute" and "funny" stuff like REC and School Rumble, but almost no series is trying to do anything original. And the competition in form of live action series is far better than in the 1990s - almost each new series manages to create a new genre or radically redefine an existing one, just like Heroes did with the superheroes genre.

Anime needs to change fast or it will become irrelevant outside of Japan and a narrow Western otaku circles.

Monday, June 11, 2007

How the boring stuff gets done in Open Source ? It doesn't !

Writing code is fun. Debugging used to be painful, especially with all the segfaults, but unit tests made it quite fun too, or at least bearable. Not only are these activities fun, there's also an instant payoff - something that didn't work or didn't work right is now doing its job. There's just one issue:

How the hell do people make themselves package software and write documentation ? They simply don't.

It's easy to consider packaging and documenting a non-issue. If you think of the few most popular programs like Apache and Firefox, they have plenty of packagers and documenters. These however are not typical programs. The vast majority of Open Source programs are small, have just a few users, are barely documented, and require manual installation.

Just SourceForge has a whooping 150,442 registered projects, and there's probably an order of magnitude more programs whose authors didn't even bother to publish them on any widely available website. The most popular Linux distribution Ubuntu contains 21,428 packages in all repositories. That's not the number of programs, as many programs are split into multiple packages - like mozilla-firefox-locale-all (which isn't even a separate program, just part of Firefox) which generates 47 packages - mozilla-firefox-locale-pl-pl, mozilla-firefox-locale-ga-ie, mozilla-firefox-locale-ca, mozilla-firefox-locale-sl-si, mozilla-firefox-locale-es-ar, mozilla-firefox-locale-nso, mozilla-firefox-locale-es-es, mozilla-firefox-locale-af, mozilla-firefox-locale-ar, mozilla-firefox-locale-fi-fi, mozilla-firefox-locale-gu-in, mozilla-firefox-locale-he-il, mozilla-firefox-locale-pt-pt, mozilla-firefox-locale-da-dk, mozilla-firefox-locale-fr-fr, mozilla-firefox-locale-hu-hu, mozilla-firefox-locale-eu, mozilla-firefox-locale-zh-tw, mozilla-firefox-locale-sk, mozilla-firefox-locale-en-gb, mozilla-firefox-locale-zu, mozilla-firefox-locale-mn, mozilla-firefox-locale-bg-bg, mozilla-firefox-locale-pt-br, mozilla-firefox-locale-cs-cz, mozilla-firefox-locale-sv-se, mozilla-firefox-locale-hr, mozilla-firefox-locale-ja-jp, mozilla-firefox-locale-el, mozilla-firefox-locale-fy-nl, mozilla-firefox-locale-de-de, mozilla-firefox-locale-ku, mozilla-firefox-locale-tr-tr, mozilla-firefox-locale-nb-no, mozilla-firefox-locale-ru-ru, mozilla-firefox-locale-lt, mozilla-firefox-locale-it-it, mozilla-firefox-locale-pa-in, mozilla-firefox-locale-ro-ro, mozilla-firefox-locale-bn-bd, mozilla-firefox-locale-nn-no, mozilla-firefox-locale-mk-mk, mozilla-firefox-locale-zh-cn, mozilla-firefox-locale-nl-nl, mozilla-firefox-locale-ko, mozilla-firefox-locale-ka-ge, and mozilla-firefox-locale-bn-in. Counting different source packages reduces the count of packaged programs to 12,768.

That's just the most popular Linux distribution. If by "adequate packaging" we expected support for at least the top 5 Linux distros, Windows, and Mac OS X, the number of "adequately packaged" programs would be a really tiny fraction of all Open Source. And documentation ? I don't have any hard numbers but we all know that the majority of software is barely documented, and the existing documentation is usually out of date anyway. The problem of inadequate packaging and documentation affects even some big programs like OpenSolaris and Omniscient Debugger - I'm pretty sure they would be much more popular if only someone bothered to package, document, and advertise them.

Why people don't do it ? It's a lot of hard work. It's unbelievably boring. It brings little short-term results - if someone managed to make themselves do it, most likely the next user would use a different distro, the next update would make the documentation obsolete, and they would give up soon. Unlikely coding these activities aren't even considered cool or respected by most people, so why bother ?

So what can be done ? One thing which alleviates the problem somewhat is Ubuntu and other Debian spin-offs becoming dominant Linux distributions. If they manage to push everything else out of the mainstream (not very likely, but weirder things happened) and stay reasonably compatible with each other, maybe .deb will become the one Linux package format. Unfortunately .deb s are very painful to create, Linux distributions fork faster than they die, and the previous candidate for the one Linux package format .rpm failed to be universally adopted.

Another partial solution are packaging formats on top of whatever distributions are using like RubyGems. They don't exactly mix well with normal distribution code, but at least they are far easier to build than .debs and work on everything including Mac OS X and Windows.

Or someone might finally make packaging "Not Boring", just like Google managed to do with Internet searching and Web email. It's a well-understood problem, so it really shouldn't be that hard to make a single program which can be used to create reasonable packages in 10 or so different formats in just a few minutes for 99% of programs.

As for documentation, we should simply accept that it will never be compete and up to date. One solution is striving to make unit tests readable - they have a much greater chance of being complete and up to date than the docs. Unfortunately I've seen programs in which tests made absolutely zero sense unless you knew the codebase very well. Even better solution is not needing the docs. Do 99% of programs really need detailed installation instructions ? Why can't the Rakefile / Makefile / gemspec / debian/rules / whatever do it all for us ? The only documentation people really desperately need is how to get started, and if getting started is simply and the program is prepackaged it won't be that much. And as long as you follow the conventions, what many programs unfortunately don't do, a lot less documentation is needed.

There are also a few solutions for embedding some of the most vital documentation in the program, in hope of making it less likely to go out of date. One that I particularly like is Ruby optparse and similar packages for other languages - output of --help is one of the most important pieces of documentation, and it's a great idea to make it less likely to go out of date.

Any other ideas ?

Thursday, June 07, 2007

Solutions to Kingmaker "Leadership Test"

The 1994 abandonware strategy game Kingmaker contains a hugely annoying "Leadership Test", which is supposed to test that you really have the paper manual around. There's probably a crack available somewhere, but I made a cheat sheet by screenshoting the game and some Ruby+ImageMagic scripts. Here it is, for all nostalgia gamers:

Wednesday, June 06, 2007

Ruby and methods with weird names

RLisp identifiers don't have to be Ruby identifiers - field-get and .. are perfectly valid in RLisp but not in Ruby. Normally that's not a problem, Ruby Symbols can contain arbitrary strings, names of RLisp variables and functions are used only inside the compiler, and define_method(:method, ...) and send(:method, ...) couldn't care less about what :method looks like stringified.

There's just one tiny problem - using them breaks Delegator and other classes which use text-based runtime code generation.

class Delegator
 def initialize(obj)
   preserved = ::Kernel.public_instance_methods(false)
   preserved -= ["to_s","to_a","inspect","==","=~","==="]
   for t in self.class.ancestors
     preserved |= t.public_instance_methods(false)
     preserved |= t.private_instance_methods(false)
     preserved |= t.protected_instance_methods(false)
     break if t == Delegator
   end
   preserved << "singleton_method_added"
   for method in obj.methods
     next if preserved.include? method
     begin
       eval <<-EOS
         def self.#{method}(*args, &block)
           begin
             __getobj__.__send__(:#{method}, *args, &block)
           rescue Exception
             $@.delete_if{|s| /:in `__getobj__'$/ =~ s} #`
             $@.delete_if{|s| /^\\(eval\\):/ =~ s}
             ::Kernel::raise
           end
         end
       EOS
     rescue SyntaxError
       raise NameError, "invalid identifier %s" % method, caller(4)
     end
   end
 end
end

Delegator iterates over all methods of an object (obj.methods), and for each of them generates and compiles a simple wrapper:

def self.#{method}(*args, &block)
 begin
   __getobj__.__send__(:#{method}, *args, &block)
 rescue Exception
   $@.delete_if{|s| /:in `__getobj__'$/ =~ s} #`
   $@.delete_if{|s| /^\\(eval\\):/ =~ s}
   ::Kernel::raise
 end
end

If Ruby had real macros like RLisp it wouldn't be a problem. Unfortunately Ruby code is generated by simple string substitution, and if method is anything else than a Ruby identifier it breaks.

In Ruby def is not accessible in any way other than through eval. We cannot use define_method(method, ...) as in Ruby 1.8 it cannot define methods which take blocks. And it has different argument-handling semantics than def, so it wouldn't work anyway.

Every use of text-based runtime code generation is a failure of language's reflection model.

It's weird how Ruby can be so well-knows for its metaprogrammability, and at the same time have most of it inaccessible through any means other than text-based eval. I think improving metaprogrammability is a much more important for future viability of Ruby than improving performance or other popular whining points.

So what can RLisp do:

Accept that Delegator and everything that uses it like tmpfile and webrick is broken.
Mangle all non-Ruby symbols into valid Ruby symbols. So (method foo-bar ...) would really define foo_bar etc.
Monkey-patch Delegator to ignore such methods instead of raising an exception. Ugly.
Monkey-patch Delegator to delegate such methods with slightly incorrect define_method instead of raising an exception.
Improve Ruby metaprogrammability.

Extracting backlinks to all posts in the blog

Blogger software gathers information on backlinks to blog posts, but the only way of accessing it seems to be looking at pages of individual posts. I was kinda interested who's linking to my blog - that's a good way of finding out other cool blogs. Of course I wasn't going to look at 160 pages. The task looked trivial - just download 160 pages, hpricot links out of them, and put them in a single HTML document. Unfortunately something was missing in the HTML sources:

<span class='post-backlinks post-comment-link'>

</span>

I guess that's to make the links invisible to search engines and to discourage spam. Unfortunately it also makes them invisible to all kinds of useful bots, Web 2.0 feels so wrong sometimes. Without FireBug I'd probably go "oh, screw that", fortunately thanks to FireBug I was able to find out what's going on quite fast - JavaScript was POSTing a request, and the results executed themselves:

try {
_WidgetManager._HandleControllerResult('Blog1', 'backlinks',{'numBacklinks': 1, 'backlinksLabel': 'Links to this post', 'authorLabel': 'Posted by', 'timestampLabel': 'at', 'createLinkLabel': 'Create a Link', 'createLinkUrl': 'http://www.blogger.com/blog-this.g', 'backlinks': [{'url': 'http://bytesized.labnotes.org/post/1570490', 'author': '', 'title': 'taw's blog: How to test glue code ?', 'snippet': 'taw's blog: How to test glue code ? Indeed. How? ', 'timestamp': '2:58 AM', 'deleteUrl': 'http://www.blogger.com/delete-backlink.g?blogID\u003d27488238&postID\u003d7234750665665792269&backlinkURL\u003dhttp%3A%2F%2Fbytesized.labnotes.org%2Fpost%2F1570490', 'adminClass': 'pid-695331528'}]});
} catch (e) {
  if (typeof log != 'undefined') {
    log('HandleControllerResult failed: ' + e);
  }
}

That's bad, as the code is in full-blown JavaScript, not JSON. Fortunately it was possible to convert it to something JSON enough with a few regular expresions, and there were no obstacles further on. Here's the code:

require 'rubygems'
require 'hpricot'
require 'json'

def wget(url, file=url.match(/([^\/]*)\Z/)[0])
   retval = system "wget", "--no-verbose", url, "-O", file unless File.exists?(file)
   File.read(file)
end

post_pages = []

doc = Hpricot(wget("http://t-a-w.blogspot.com/", "blog-main"))
(doc/"div#ArchiveList a.post-count-link").each{|archive_link|
  url = archive_link.attributes["href"]
  next unless url =~ %r[\Ahttp://t-a-w\.blogspot\.com/[0-9_]*_archive.html\Z]
  archive_doc = Hpricot(wget(url))
  (archive_doc/".post").each{|post|
    post_link   = post.at("h3.post-title a")
    link = post_link.attributes["href"]
    title = post_link.inner_text
    post_id = post.at("a").attributes["name"]
    post_pages << [title, link, post_id]
  }
}

puts "<html><head><meta http-equiv='Content-Type' content='text/html; charset=UTF-8' /><title>Backlinks</head><body><ul>"
post_pages.each{|title, url, post_id|
  unless File.exists?("backlinks-#{post_id}.js")
    system "wget", url, "-O", "backlinks-#{post_id}.js", "--post-data",
           "action=backlinks&postID=#{post_id}&ampresponseType=js&widgetId=Blog1&ampwidgetType=Blog"
  end
  data = File.read("backlinks-#{post_id}.js")
  data =~ /\n(.*)/
  data = $1.sub(/\A_WidgetManager\._HandleControllerResult\(/, "[").sub(/\);/,"]").gsub(/\'/,'"')
  data = JSON.parse(data)[2]["backlinks"]
  if data
    # Self links are not very interesting
    data = data.select{|backlink| backlink['url'] !~ %r[\Ahttp://t-a-w.blogspot.com/] }
    next if data.empty?
    puts "<li><a href='#{url}'>#{title}</a><ul>"
    data.each{|backlink|
      # backlink['snippet']
      puts "<li><a href='#{backlink['url']}'>#{backlink['title']}</a> - #{backlink['snippet']}</li>"
    }
    puts "</ul></li>"
  end
}
puts "</ul></body></html>"

Tuesday, June 05, 2007

Mock C backend for RLisp and its performance

When RLisp was first created it was purely interpreted, with an interpreter written in Ruby. It was about 50x slower than Ruby. Then it went through unsuccessful stage of mixed interpreted/compiled-to-Ruby mode, finally getting a real compiler with Ruby back-end. That's where it is now, and it is about 5x slower than Ruby.

The code generated by RLisp compiler doesn't need Ruby power, it just needs a few simple operations and access to Ruby runtime. That is - it could be RubyC code for matz's Ruby, or Java code for JRuby. If using C/Java compiler was too slow, the RLisp compiler could generate LLVM, x86 asm, or JVM code directly. It would be a little bit slower than going through C/Java, as we wouldn't run the optimizations, but the overhead is in Ruby runtime anyway, and I doubt the difference would be more than a few percent.

Before implementing the RubyC backend I wanted to measure how fast would it be and what would it look like. For that I used the Ackermann benchmark from the Great Programming Language Shootout, which Ruby by the way drastically loses. Ruby and RLisp implementations looks like that:

def ack(m, n)
  if m == 0 then
    n + 1
  elsif n == 0 then
    ack(m - 1, 1)
  else
    ack(m - 1, ack(m, n - 1))
  end
end

NUM = Integer(ARGV.shift || 1)
print "Ack(3,", NUM, "): ", ack(3, NUM), "\n"

(defun ack (m n)
  (cond (eq? m 0) (+ n 1)
        (eq? n 0) (ack (- m 1) 1)
                  (ack (- m 1) (ack m (- n 1)))
  )
)
(let num [[ARGV shift] to_i])
(print "Ack(3," num "): " (ack 3 num))

Compiled RLisp code, after some reformatting or readability:

module ::RLispC7054184
  def self.ack(globals)
    Proc.new do |*args|
      m, n, = *args
      t2 = m.send(:==, 0)
      if t2
        t4 = n.send(:+, 1)
        t3 = t4
      else
        t5 = n.send(:==, 0)
        if t5
          t7 = globals[:ack]
          t8 = m.send(:-, 1)
          t9 = t7.call(t8, 1)
          t6 = t9
        else
          t10 = globals[:ack]
          t11 = m.send(:-, 1)
          t12 = globals[:ack]
          t13 = n.send(:-, 1)
          t14 = t12.call(m, t13)
          t15 = t10.call(t11, t14)
          t6 = t15
        end
        t3 = t6
      end
      t3
    end
  end
end
t16 = ::RLispC7054184::ack(globals)
globals[:ack] = t16

t17 = ::ARGV.send(:shift)
t18 = t17.send(:to_i)
globals[:num] = t18
t19 = globals[:print]
t20 = globals[:num]
t21 = globals[:ack]
t22 = globals[:num]
t23 = t21.call(3, t22)
t24 = t19.call("Ack(3,", t20, "): ", t23)

It should be pretty straightforward globals is a Hash containing RLisp global variables, the three-level module/def/Proc.new procedure creation is used to

Make sure procedures with the same name won't accidentally overwrite each other.
Make sure stack backtraces contain name ack (instead of ack_CHECKSUM.
To build a closure if necessary. As ack is global the only captured variable is globals.

I compiled the ack function to RubyC by hand. It follows generated Ruby code statement-by-statement, except that I didn't bother compiling things outside the ack function, which are executed only once and don't affect performance anyway. So the code consists of two files, Ruby file ackermann_driver.rb:

require 'ackermann'

globals = {}

closure_ack = RLispCompiledCode.build_closure_ack_CHECKSUM(globals)
globals[:ack] = lambda{|*args| RLispCompiledCode.ack_CHECKSUM(closure_ack, self, args, nil)}

num = (ARGV.shift || 1).to_i
res = globals[:ack].call(3, num)
print "Ack(3,", num, "): ", res, "\n"

and RubyC file ackermann.c:

#include <ruby.h>
#include <intern.h>

int ID_EQEQ, ID_PLUS, ID_MINUS, ID_call;
VALUE SYM_ack, SYM_globals;

static VALUE ack_CHECKSUM(VALUE rlisp_compiled_module, VALUE closure, VALUE self, VALUE args, VALUE block_arg)
{
  VALUE globals, m, n, t2, t3, t4, t5, t6, t7, t8, t9, t10, t11, t12, t13, t14, t15;

  globals = rb_hash_aref(closure, SYM_globals);

  m = rb_ary_entry(args, 0);
  n = rb_ary_entry(args, 1);

  t2 = rb_funcall(m, ID_EQEQ, 1, INT2FIX(0));

  if(t2) {
    t4 = rb_funcall(n, ID_PLUS, 1, INT2FIX(1));
    t3 = t4;
  } else {
    t5 = rb_funcall(n, ID_EQEQ, 1, INT2FIX(0));
    if(t5) {
      t7 = rb_hash_aref(globals, SYM_ack);
      t8 = rb_funcall(m, ID_MINUS, 1, INT2FIX(1));
      t9 = rb_funcall(t7, ID_call, 2, t8, INT2FIX(1));
      t6 = t9;
    } else {
      t10 = rb_hash_aref(globals, SYM_ack);
      t11 = rb_funcall(m, ID_MINUS, 1, INT2FIX(1));
      t12 = rb_hash_aref(globals, SYM_ack);
      t13 = rb_funcall(n, ID_MINUS, 1, INT2FIX(1));
      t14 = rb_funcall(t12, ID_call, 2, m, t13);
      t15 = rb_funcall(t10, ID_call, 2, t11, t14);
      t6 = t15;
    }
    t3 = t6;
  }
  return t3;
}

static VALUE build_closure_ack_CHECKSUM(VALUE rlisp_compiled_module, VALUE globals)
{
  VALUE closure = rb_hash_new();
  rb_hash_aset(closure, SYM_globals, globals);
  return closure;
}

void Init_ackermann()
{
  VALUE mRLispCompiledCode;
  ID_EQEQ     = rb_intern("==");
  ID_PLUS     = rb_intern("+");
  ID_MINUS    = rb_intern("-");
  ID_call     = rb_intern("call");
  SYM_ack     = ID2SYM(rb_intern("ack"));
  SYM_globals = ID2SYM(rb_intern("globals"));

  mRLispCompiledCode = rb_define_module("RLispCompiledCode");
  rb_define_module_function(mRLispCompiledCode, "ack_CHECKSUM", ack_CHECKSUM, 4);
  rb_define_module_function(mRLispCompiledCode, "build_closure_ack_CHECKSUM", build_closure_ack_CHECKSUM, 1);
}

The code is so scary that I don't even know where to start with explanations. The huge problem is that the compiled function ack_CHECKSUM needs to take closure argument somehow, but Ruby is not going to provide it. So we need an ugly wrapper globals[:ack] = lambda{...} in the driver.

As it turns out the performance is horrible, still over 2x slower than Ruby:

Ruby: 1.18s
RLisp: 5.93s (x5.0)
Mock binary RLisp: 2.48s (x2.1)

Many optimizations are possible, but I thought the guilty party is the ugly wrapper lambda. This or some other wrapper needs to be called every time we call RLisp function from Ruby, but what if we somehow optimized compile RLisp to compiled RLisp calls ? To make it really work we would probably subclass Proc to CompiledRLispProc and in the compiled code check wheather a Proc is of this particular kind, and if so, call it directly instead of using a wrapper. In this code we simply do that for calls do ack and nothing else. Here's the new driver ackermann_driver2.rb:

require 'ackermann2'

globals = {}

closure_ack = RLispCompiledCode.build_closure_ack_CHECKSUM(globals)
globals[:ack] = lambda{|*args| RLispCompiledCode.ack_CHECKSUM(closure_ack, self, args, nil)}

num = (ARGV.shift || 1).to_i
res = globals[:ack].call(3, num)
print "Ack(3,", num, "): ", res, "\n"

and the new hand-compiled code:

#include <ruby.h>
#include <intern.h>

int ID_EQEQ, ID_PLUS, ID_MINUS, ID_call;
VALUE SYM_ack, SYM_globals;

static VALUE ack_CHECKSUM(VALUE rlisp_compiled_module, VALUE closure, VALUE self, VALUE args, VALUE block_arg)
{
  VALUE globals, m, n, t2, t3, t4, t5, t6, t8, t9, t11, t13, t14, t15;
  VALUE new_args;

  globals = rb_hash_aref(closure, SYM_globals);

  m = rb_ary_entry(args, 0);
  n = rb_ary_entry(args, 1);

  t2 = rb_funcall(m, ID_EQEQ, 1, INT2FIX(0));

  if(t2) {
    t4 = rb_funcall(n, ID_PLUS, 1, INT2FIX(1));
    t3 = t4;
  } else {
    t5 = rb_funcall(n, ID_EQEQ, 1, INT2FIX(0));
    if(t5) {
      t8 = rb_funcall(m, ID_MINUS, 1, INT2FIX(1));

      new_args = rb_ary_new();
      rb_ary_push(new_args, t8);
      rb_ary_push(new_args, INT2FIX(1));
      t9 = ack_CHECKSUM(rlisp_compiled_module, closure, self, new_args, Qnil);

      t6 = t9;
    } else {
      t11 = rb_funcall(m, ID_MINUS, 1, INT2FIX(1));
      t13 = rb_funcall(n, ID_MINUS, 1, INT2FIX(1));

      new_args = rb_ary_new();
      rb_ary_push(new_args, m);
      rb_ary_push(new_args, t13);
      t14 = ack_CHECKSUM(rlisp_compiled_module, closure, self, new_args, Qnil);

      new_args = rb_ary_new();
      rb_ary_push(new_args, t11);
      rb_ary_push(new_args, t14);
      t15 = ack_CHECKSUM(rlisp_compiled_module, closure, self, new_args, Qnil);;

      t6 = t15;
    }
    t3 = t6;
  }
  return t3;
}

static VALUE build_closure_ack_CHECKSUM(VALUE rlisp_compiled_module, VALUE globals)
{
  VALUE closure = rb_hash_new();
  rb_hash_aset(closure, SYM_globals, globals);
  return closure;
}

void Init_ackermann2()
{
  VALUE mRLispCompiledCode;
  ID_EQEQ     = rb_intern("==");
  ID_PLUS     = rb_intern("+");
  ID_MINUS    = rb_intern("-");
  ID_call     = rb_intern("call");
  SYM_ack     = ID2SYM(rb_intern("ack"));
  SYM_globals = ID2SYM(rb_intern("globals"));

  mRLispCompiledCode = rb_define_module("RLispCompiledCode");
  rb_define_module_function(mRLispCompiledCode, "ack_CHECKSUM", ack_CHECKSUM, 4);
  rb_define_module_function(mRLispCompiledCode, "build_closure_ack_CHECKSUM", build_closure_ack_CHECKSUM, 1);
}

It's over 15 times faster than current RLisp and over 3 times faster than Ruby:

Ruby: 1.18s
RLisp: 5.93s (x5.0)
Mock binary RLisp: 2.48s (x2.1)
Mock binary RLisp 2: 0.38s (x0.3)

The road from a mock RubyC code to real RubyC back-end is a long one, but basically it proves what I always knew - RLisp can be much faster than Ruby.

Oh and the Rakefile:

task :default => ["ackermann.so", "ackermann2.so"]

def time_cmd(cmd)
  cmd = "/usr/bin/time -f '%U+%S' #{cmd} 2>&1 >/dev/null"
  raw_time = `#{cmd}`.chomp
  raw_time =~ /\A(\d+\.\d*)\+(\d+\.\d*)\Z/ or raise "Cannot parse time: #{raw_time}"
  $1.to_f + $2.to_f
end

desc "Run benchmarks"
task :benchmark do
  ruby_time = time_cmd "ruby ./ackermann.ruby 6"
  rlisp_time = time_cmd "./rlisp.rb ./ackermann.rl 6"
  binrl_time = time_cmd "./ackermann_driver.rb 6"
  binrl2_time = time_cmd "./ackermann_driver2.rb 6"

  print "Ruby: #{ruby_time}s\n"
  print "RLisp: #{rlisp_time}s (x#{sprintf "%0.1f", (rlisp_time.to_f/ruby_time)})\n"
  print "Mock binary RLisp: #{binrl_time}s (x#{sprintf "%0.1f", (binrl_time.to_f/ruby_time)})\n"
  print "Mock binary RLisp 2: #{binrl2_time}s (x#{sprintf "%0.1f", (binrl2_time.to_f/ruby_time)})\n"
  print "\n"
end

desc "Compile ackermann.c"
file "ackermann.so" => "ackermann.c" do
  sh "gcc  -Wall -W -O6 -g -I/usr/lib/ruby/1.8/i486-linux/ -fPIC -shared ackermann.c -o ackermann.so"
end

desc "Compile ackermann2.c"
file "ackermann2.so" => "ackermann2.c" do
  sh "gcc  -Wall -W -O6 -g -I/usr/lib/ruby/1.8/i486-linux/ -fPIC -shared ackermann2.c -o ackermann2.so"
end

Monday, June 04, 2007

Some thoughts on RLisp style

Established languages tend to have more or less official style guides. Here are links to a few random style guides - Ruby, Python, Perl, Common Lisp, Smalltalk [PDF]. Style guides would be pointless for languages like C and Java, as coding in them requires utter disregard for aesthetics.

RLisp is still too young for a definite style guide, but I have a few tips. They're not "binding" in any way, not even for patch submitters for RLisp, and I am likely change my mind many times before 1.0 (if there will ever be 1.0).

Use fn/hd/tl instead of lambda/car/cdr
Use 2-space indentation. Or maybe 4-space indentation. In any case don't use tabs.
Function and variable naming should mostly follow Ruby and Scheme style:
- Functions which return boolean values should end with ? , like (empty? obj).
- "Dangerous" functions should end with !. Like sort for returning sorted container versus sort! for sorting it in-place.
Message passing should usually be done with [obj method arguments] instead of (send obj 'method arguments). In cases where it's possible to use either a message send [a == b] or a function call (eq? a b), pick your choice.
I consider strings of car/cdrs a bad style. Usually you can pattern match, or use something like (nth a_list index), or (ntl a_list index), or [a_list get index], or [a_list get [index .. -1]]. I think any of these look better than a string of cars and cdrs.
RLisp doesn't support non-ASCII (that is UTF-8, other character sets would definitely not be supported) characters in identifiers. I'm not decided yet if it would be a good idea to support them or not.
There's no need to use excessive indentation. Because of different (let ...) semantics RLisp programs can often be indented a lot less than Scheme/Common Lisp programs.
RLisp doesn't do tail call optimization yet, and Ruby stack isn't that big. That means Scheme-style excessive recursion is not a good idea. It should be fixed in the future.
There's a simple code formatter for RLisp, but I don't think you should trust it too much.
I sometimes place )s and ]s on their own lines, and sometimes together with the last expression line. A rough guide would be - place ) on its own line if you'd use end in Ruby, and place it together with expression line if in Ruby you'd use }. That's not very helpful, as end vs } is a matter of personal style. Here's an example from the RLisp test suite, which might make it a bit more clear. Only ) for (test-suite ... got its own line, and I think it would look weird if it didn't. Other )s don't really need own lines.

(test-suite Text_Processing
  (test parse_ip
    (assert (parse-ip "1.2.3.4") == '(1 2 3 4))
    (assert (parse-ip "64.233.183.104") == '(64 233 183 104)))
  (test parse_ip_2
    (assert (parse-ip-2 "1.2.3.4") == '(1 2 3 4))
    (assert (parse-ip-2 "64.233.183.104") == '(64 233 183 104)))
  (test parse_ip_3
    (assert (parse-ip-3 "1.2.3.4") == '(1 2 3 4))
    (assert (parse-ip-3 "64.233.183.104") == '(64 233 183 104)))
)