The best kittens, technology, and video games blog in the world.

Tuesday, July 20, 2010

We need syntax for talking about Ruby types

Koteczek by kemcio from flickr (CC-NC)

All this is about discussing types in blog posts, documentation etc. None of that goes anywhere near actual code (except possibly in comments). Ruby never sees that.

Statically typed languages have all this covered, and we need it too. Not static typing of course - just an expressive way to talk about what types things are - as plain English fails here very quickly. As far as I know nothing like that exists yet, so here's my proposal.

This system of type descriptions is meant for humans, not machines. It focuses on the most important distinctions, and ignores details that are not important, or very difficult to keep track of. Type descriptions should only be as specific as necessary in given context. If it makes sense, there rules should be violated.

In advance I'll say I totally ignored all the covariance / contravariance / invariance business - it's far to complicated, and getting too deeply into such issues makes little sense in a language where everything can be redefined.

Basic types

Types of simple values can be described by their class name, or any of its superclasses or mixins. So some ways to describe type of 15 would be Fixnum (actual class), Integer (superclass), Comparable (mixin), or Object (superclass all the way up).

In context of describing types, everything is considered an Object, and existence of Kernel, BasicObject etc. is ignored.

So far, it should all be rather obvious. Examples:
  • 42 - Integer
  •  - Time
  • Dir.glob("*") - Enumerable
  • STDIN - IO

nil and other ignored issues

nil will be treated specially - as if it was of every possible type. nil means absence of value, and doesn't indicate what type the value would have if it was present. This is messy, but most explicitly typed languages follow this path.

Distinction between situations that allow nils and those that don't will be treated as all other value range restrictions (Integer must be posibile, IO must be open for writing etc.) - as something outside the type system.

For cases where nil means something magical, and not just absence of value, it should probably be mentioned.

Checked exceptions and related non-local exits in Ruby would be a hopeless thing to even attempt. There's syntax for exceptions and catches used as control structures if they're really necessary.


We will also pretend that Boolean is a common superclass of TrueClass and FalseClass.

We will also normally ignore distinction between situations where real true/false are expected, and situations where any object goes, but acts identically to its boolean conversion. Any method that acts identically on x and !!x can be said to take Boolean.

On the other hand if some values are treated differently than their double negation, that's not really Boolean and it deserves a mention. Especially if nil and false are not equivalent - like in Rails's #in_groups_of (I don't think Ruby stdlib ever does thing like that).

Duck typing

If something quacks like a Duck convincingly enough, it can be said to be of type Duck, it being object's responsibility that its cover doesn't get blown.

In particular, Ruby uses certain methods for automatic type conversion. In many contexts objects implementing #to_str like Pathnames will be treated as Strings, objects implementing #to_ary as Arrays, #to_hash as Hashes, and to_proc as Procs - this can be used for some amazing things like Symbol#to_proc.

This leads to a big complication for us - C code implementing Ruby interpreter and many libraries is normally written in a way that calls these conversion functions automatically, so in such contexts Symbol really is a Proc, Pathname really is a String and so on. On the other hand, in Ruby code these methods are not magical, and such conversions will only happen if explicitly called - for them Pathname and String are completely unrelated types. Unless Ruby code calls C code, which then autoconverts.

Explicitly differentiating between contexts which expect a genuine String and those which expect either that or something with a valid #to_str method would be highly tedious, and I doubt anyone would get it exactly right.

My recommendation would be to treat everything that autoconverts to something as if it subclassed it. So we'll pretend Pathname is a subclass of String, even though it's not really. In some cases this will be wrong, but it's not really all that different from subclassing something and then introducing incompatible changes.

This all doesn't extend to #to_s, #to_a etc - nothing can be described as String just because it has to_s method - every object has to_s but most aren't really strings.

Technical explanation of to_str and friends

This section is unrelated to post's primary subject - skip if uninterested.

Ruby uses special memory layout for basic types like strings and arrays. Performance would be abysmal if string methods had to actually call Ruby code associated with whatever [] happened to be redefined to for every character - instead they ask for a certain C data structure, and access that directly (via some macros providing extra safety and convenience to be really exact).

By the way this is a great example of C being really slow - if Ruby was implemented on a platform with really good JIT, it could plausibly have every single string function implemented in term of calls to [], []=, size, and just a few others, with different subclasses of String providing different implementations, and JIT compiling inlining all that to make it really fast.

It would make it really simple to create class representing a text file, and =~ /regexp/ that directly without reading anything more than required to memory, or maybe even gsub! it in a way that would read it in small chunks, saving them to another file as soon as they're ready, and then renaming in one go. All that without regexp library knowing anything about it all. It's all just my fantasy, I'm not saying any such JIT actually exists.

Anyway, strings and such are implemented specially, but we still want these types to be real objects, not like what they've done in Java. To make it work, all C functions requiring access to underlying storage call a special macro which automatically calls a method like to_str or to_ary if necessary - so such objects can pretend to be strings very effectively. For example if you alias method to_str to path on File code like system"/bin/hostname") will suddenly start working. It really makes sense only for things which are "essentially strings" like Pathname, URI, Unicode-enhanced strings, proxies for strings in third party libraries like Qt etc.

To complicate things further objects of all classes inheriting from String automatically use String's data representation - and C code will access that, never calling to_str. This leaves objects which duck type as Strings two choices:
  • Subclass String and every time anything changes update C string data. This can be difficult - if you implement an URI and keep query part as a hash instance variable - you need to somehow make sure that your update code gets run every time that hash changes - like by not exposing it at all and only allowing query updates via your direct methods, or wrapping it in a special object that calls you back.
  • Don't subclass String, define to_str the way you want. Everything works - except your class isn't technically a String so it's not terribly pretty OO design.
You probably won't be surprised that not subclassing is the more popular choice. As it's all due to technical limitations not design choices, it makes sense to treat such objects as if they were properly subclassed.

Pussy by tripleigrek from flickr (CC-SA)


Back to the subject. For collections we often want to describe types of their elements. For simple collections yielding successive elements on #each, syntax for type description is CollectionType[MemberType]. Examples:
  • [42.0, 17.5] - Array[Float]
  • Set["foo","bar"] - Set[String]
  • 5..10 - Range[Integer]
When we don't care about collection type, only about element types, descriptions like Enumerable[ElementType] will do.

Syntax for types of hashtables is Hash[KeyType, ValueType] - in general collections which yield multiple values to #each can be described as CollectionType[Type1, Type2, ..., TypeN].

For example {:foo => "bar"} is of type Hash[Symbol, String].

This is optional - type descriptions like Hash or Enumerable are perfectly valid - and often types are unrelated, or we don't care.

Not every Enumerable should be treated as collection of members like that - File might technically be File[String] but it's usually pointless to describe it this way. In 1.8 String is Enumerable, yielding successive lines when iterated - but String[String] make no sense (no longer a problem in 1.9).

Classes other than Enumerable like Delegator might need type parameters, and they should be specified with the same syntax. Their order and meaning depends on particular class, but usually should be obvious.

Literals and tuples

Ruby doesn't make distinction between Arrays and tuples. What I mean here is a kind of Array which shouldn't really be treated as a collection, and in which different members have unrelated type and meaning depending on their position.

Like method arguments. It really wouldn't be useful to say that every method takes Array[Object] (and an optional Proc) - types and meanings of elements in this array should be specified.

Syntax I want for this is [Type1, Type2, *TypeRest] - so for example Hash[Date, Integer]'s #select passes [Date, Integer] to the block, which should return a Boolean result, and then returns either Array[[Date, Integer]] (1.8) or Hash[Date, Integer] (1.9). Notice double [[]]s here - it's an Array of pairs. In many contexts Ruby automatically unpacks such tuples, so Array[[Date,Integer]] can often be treated as Array[Date,Integer] - but it doesn't go deeper than one level, and if you need this distinction it's available.

Extra arguments can be specified with *Type or ... which is treated here as *Object. If you want to specify some arguments as optional suffix their types with ? (the most obvious [] having too many uses already, and = not really fitting right).

In this syntax [*Foo] is pretty much equivalent to Array[Foo], or possibly Enumerable[Foo] (with some duck typing) - feel free to use that if it makes things clearer.

Basic literals like true, false, nil stand for themselves - and for entire TrueClass, FalseClass, NilClass classes too as they're their only members. Other literals such as symbols, strings, numbers etc. can be used too when needed.

To describe keyword arguments and hashes used in similar way, syntax is {Key1=>Type1, Key2=>Type2} - specifying exact key, and type of value like {:noop=>Boolean, :force=>Boolean}.

It should be assumed that keys other than those listed are ignored, cause exception, or are otherwise not supported. If they're meaningful it should be marked with ... like this {:query=>String, ...}. Subclasses often add extra keyword arguments, and this issue is ignored.


Everything so far was just a prelude to the most important part of any type system - types for functions. Syntax I'd propose it: ArgumentTypes -> ReturnType (=> being already used by hashes).

I cannot decide if blocks should be specified in Ruby-style notation or a function notation, so both  & {|BlockArgumentTypes| BlockReturnType} and &(BlockArgumentTypes->BlockReturnType) are valid. & is necessary, as block are passed separately from normal arguments, however strong the temptation to reuse -> and let the context disambiguate might be.

Blocks that don't take any arguments or don't return anything can drop that part, leaving only something like &{|X|}, &{Y}, &{}, or in more functional notation &(X->), &(Y), &().

Because of all the [] unpacking, using [] around arguments, tuple return values etc. is optional - and just like in Ruby () can be used instead in such contexts.

If function doesn't take any arguments, or returns no values, these parts can be left - leaving perhaps as little as ->.

  • In context of %w[Hello world !].group_by(&:size) method #group_by has type Array[String]&{|String| Integer}->Hash[Integer,String]
  • has type Numeric -> Time
  • String#tr has type [String, String] -> String
  • On a collection of Floats, #find would have type Float?&(Float->Boolean)->Float
  • Function which takes no arguments and returns no values has type []->nil
If you really need to specify exceptions and throws, you can add raises Type, or throws :kind after return value.  Use only for control structure exceptions, not for actual errors exceptions. It might actually be useful if actual data gets passed around.
  • Find.find has type [String*]&(String->nil throws :prune)->nil

A standalone Proc can be described as (ArgumentsTypes->ReturnType) just as with notation for functions. There is no ambiguity between Proc arguments and block arguments, as blocks are always marked with |.

Type variable and everything else

In addition to names of real classes, any name starting with an uppercase letter should be consider a type. Unless it's specified otherwise in context, all such unknown  names should be considered class variables with big forall quantifier in front of it all.

  • Enumerable[A]#partition has type &(B->Boolean)->[Array[A], Array[A]]
  • Hash[A,B]#merge has type Hash[A,B]&(A,B,B->B)->Hash[A,B]
  • Array[A]#inject has either type B&(B,A->B)->B or &(A,A)->A. This isn't just a usual case of missing argument being substituted by nil - these are two completely different functions.
To specify that multiple types are allowed (usually implying that behaviour will be different, otherwise there should be a superclass somewhere, or we could treat it as common duck typing and ignore it) join them with |. If there's ambiguity between this use and block arguments, parenthesize. It binds more tightly than ,, so it only applies to one argument. Example:
  • String#index in 1.8 has type (String|Integer|Regexp, Integer?)->Integer (and notice how I ignored Fixnums here).
For functions that can be called in multiple unrelated ways, just list them separately - | and parentheses will work, but they are usually top level, and not needed anywhere deeper.

If you want to specify type of self, prefix function specification with Type#:
  • #sort has type like Enumerable[A]#()&(A,A->1|0|-1)->Array[A]

To specify that something takes range of values not really corresponding to a Ruby class, just define such extra names somewhere and then use like this:
  • File#chown has type (UnixUserId, UnixUserId)->0 - with UnixUserId being a pretend subclass of Integer, and 0 is literal value actually returned.

To specify that something needs a particular methods just make up a pretend mixin like Meowable for #meow.

Any obvious extensions to this notation can be used, like this:
  • Enumerable[A]#zip has type (Enumerable[B_1], *Enumerable[B_i])->Array[A, B_1, *B_i] - with intention that B_is will be different for each argument understood from context. (I don't think any static type system handles cases like this one reasonably - most require separate case for each supported tuple length, and you cannot use arrays if you mix types. Am I missing something?)

The End

Well, what I really wanted to do what talk about Ruby collection system, and how 1.9 doesn't go far enough in its attempts at fixing it. And without notation for types talking about high order functions that operate on collections quickly turns into a horrible mess. So I started with a brief explanation of notation I wanted to use, and then I figured out I can as well do it right and write something that will be reusable in other contexts too.

Most discussion of type systems concerns issues like safety and flexibility, which don't concern me at all, and limit themselves to type systems usable by machines.

I need types for something else - as statements about data flow. Type signature like Enumerable[A]#()&(A->B)->Hash[A,B] doesn't tell you exactly what such function does but narrows set of possibilities extremely quickly. What it describes is a function which iterates over collection in order while building a Hash to be returned, using collection's elements as keys, and values returned by the block as values. Can you guess the function I was thinking about here?

Now a type like that is not a complete specification - a function that returns an empty hash fits it. As does one which skips every 5th element. And one that only keeps entries with unique block results. And for that matter also one that sends your email password to NSA - at least assuming it returns that Hash afterwards.

It was still pretty useful. How about some of those?
  • Hash[A,B] -> Hash[B, Array[A]]
  • Hash[A,B] &(A,B->C) -> Hash[A,C]
  • Hash[A, Hash[B,C]] -> Hash[[A,B], C]
  • Hash[A,B] &(A,B->C) -> Hash[C, Hash[A,B]]
  • Enumerable[Hash[A,B]] &(A,B,B->B) -> Hash[A,B]
  • Hash[A,Set[B]] -> Hash[Set[A], Set[B]]

Even these short snippets should give a pretty good idea what these are all about.

That's it for now. Hopefully it won't be long until that promised 1.9 collections post.

Sunday, July 18, 2010

If only Ruby had macros

Kicius Gustaw Czarowny by K0P from flickr (CC-NC-ND)

Blogger will most likely totally destroy code formatting again, sorry about that.

Ruby annoys me a lot - the code gets so close to being Just Right, with only that last little bit of wrongness that won't go away no matter what. With everything except Ruby at least I know it will be crap no matter what, so I never get this.

For example it's so easy to make a function generating successive values on each call:

def counter(v)
  return counter(v, &:succ) unless block_given?
  proc{ v = yield(v) }

But you must give it value before the first - and sometimes such a thing doesn't exist, like with generating successive labels "a", "b", "c" ... A counter starting from the first value passed isn't exactly difficult, it just doesn't feel right:

def counter(v)
  return counter(v, &:succ) unless block_given?
  proc{ old, v = v, yield(v); old }

Useless variables like old that only indicate control flow just annoy me. Not to mention lack of default block argument. I'm undecided if this tap makes things better or worse.

def counter(v)
  return counter(v, &:succ) unless block_given?
  proc{v.tap{ v = yield(v) }}

Another example. This wrapper for Ruby executable makes rubygems and -r compatible. It's so close to being able to use Array#map, and yet so far away:

args = []
while arg = ARGV.shift
  if arg =~ /\A-r(.*)\z/
    lib = $1.empty? ? ARGV.shift : $1
    args << "-e" << "require 'rubygems'; require '#{lib}'"
    args << arg
exec "ruby", *args

Yes, these are tiny things, but it's frustrating to get almost there. By the way, -r should just call require, another thing which is almost right but no.

I could go on with these small examples, but I want to talk about something bigger. A very common pattern in all programming languages is something like this:

  if item.test_1
  elsif item.test_2
  elsif item.test_3

Or a very similar:

case item
  when pattern_1
  when pattern_2
  when pattern_3

Tests and actions are all next to each other, where they belong. But what if instead of executing an action on a single item at a time, we wanted to do so on all matching items together?

If Ruby had proper macros it would be totally trivial - unfortunately Ruby forces us to choose one of bad options. First, the most straightforward:

yes1, no1 = collection.partition{|item| item.test_1}
yes2, no12 = no1.partition{|item| item.test_2}
yes3, no123 = no12.partition{|item| item.test_3}


Rather awful. Or perhaps this?

groups = collection.group_by{|item|
if item.test_1 then 1
  elsif item.test_2 then 2
  elsif item.test_3 then 3
  else 4

By the way we cannot use a series of selects here - action_3 should apply only to items which pass test_3 but not test_1 or test_2.

We can imagine adding extra methods to Enumerable to get syntax like this:

proc{|item| item.test_1}, proc{|group| group.action_1},
  proc{|item| item.test_2}, proc{|group| group.action_2},
  proc{|item| item.test_3}, proc{|group| group.action_3},
                            proc{|group| group.otherwise})

Or maybe like this (looks even worse if you need to assign groups to a variable before performing the relevant action):

tmp = collection.dup
tmp.destructive_select!{|item| item.test_1}.action_1
tmp.destructive_select!{|item| item.test_2}.action_2
tmp.destructive_select!{|item| item.test_3}.action_3

#destructive_select! being a method in style of Perl's splice - removing some items from collection, and returning removed values.

Possibly wrapping it in something like:

collection.filter{|item| item.test_1}.action{|group| group.action_1}.
          .filter{|item| item.test_2}.action{|group| group.action_2}.
          .filter{|item| item.test_3}.action{|group| group.action_3}.
                                     .action{|group| group.otherwise}

It's Kicius by starlightexpress from flickr (CC-NC-ND)

A few more bad ideas (David Allen says the way you can tell a highly creative person is that they generate bad ideas faster than anyone else). With instance_eval we could do something like this, with item and group being appropriate method calls.

  rule{ item.test_1 }
  action{ group.action_1 }

  rule{ item.test_2 }
  action{ group.action_2 }

  rule{ item.test_3 }
  action{ group.action_3 }

  action{ group.otherwise }

It would be pretty hard to do that while still being able to have inner blocks with your current object's context. By the way trying this out I found out that it's impossible to call a block specifying self, and call a block passing arguments at the same time - it's only one or the other - and no combination of the two makes it work. Those tiny limitations are just infuriating.

I also tried overriding ===. Now that would only work for a small subset of cases but was worth a try:

collection.run_for_each_group{|item, group|
  case item
  when pattern_1
  when pattern_2
  when pattern_3

This item would actually be a special object, calling === on which would callcc, partition collection in two, and resume twice modifying group variable (initially set to the entire collection). That would be pretty cool - except Ruby doesn't use double dispatch, so === is not a CLOS style generic function - it's a method, set on pattern objects, and while adding new pattern types is easy, making old patterns match new kinds of objects is hard. It would require manually finding out every pattern, and manually overriding it to handle our magic item type - and then a lot of hackery to make Regexp#=== work, and then it would fail anyway, as Range#=== and such seem to be handled specially by Ruby.

There was a related possibility of not doing anything weird to item, but requiring special patterns:

collection.run_for_each_group{|item, group, all|
  case item
  when all[pattern_1]
  when all[pattern_2]
  when all[pattern_3]

We're not actually using item here all, so we don't really need to pass it:

collection.run_for_each_group{|group, all|
  if all[pattern_1]
  elsif all[pattern_2]
  elsif all[pattern_3]

Totally implementable, only somewhat ugly with all these all[]s. There are two good ways to implement it - all function would test all items, and if all returned the same value it would just return. Otherwise, it would divide the collection, and in one implementation use callcc, or in alternative implementation, throw something, and restart the whole block twice - this assumes tests are cheap and deterministic.

It looks good, but it doesn't make me happy, as I want all kinds of tests, not just pattern matches. And eventually I came up with this:

collection.run_for_each_group{|item, group, all|
  if all[item.test_1]
  elsif all[item.test_2]
  elsif all[item.test_3]

This way, you can do any test on item you want - just pass the result to all[] before proceeding.

How is it implemented? I could callcc for every element, but unlike Scheme's, Ruby's callcc is rather expensive. And not every version of Ruby has it. So it's the naive throw-and-restart-twice instead. This means tests on each item can be rerun many times, so they better be cheap. Determinism is also advised, even though my implementation caches the first value returned to avoid troubles.

Well, first some usage example you can actually run:

require "pathname"
files = Pathname("/etc").children
  if all[]
    puts "Subdirectories: #{xs*' '}"
  elsif all[x.symlink?]
    puts "Symlinks: #{xs*' '}"
  elsif all[x.size > 2**16]
    puts "Big files: #{xs*' '}"
    puts "The rest: #{xs.size} files"

Doesn't it look a lot lot better than a long cascade of #partitions?

And now #run_for_in_group:

module Enumerable 
  def run_for_each_group(expected=[], &blk)
    return if empty?
    xst, xsf = [], []
      answers = expected.dup
      catch :item_tested do
        yield(it, self, proc{|v|
          if answers.empty?
            (v ? xst : xsf) << it
            throw :item_tested
    xst.run_for_each_group([true, *expected], &blk)
    xsf.run_for_each_group([false, *expected], &blk)

It shouldn't be that difficult to understand. expected tracks the list of expected test results for all items in current collection. Now we iterate, passing each element, the entire group, and all callback function.

The first few times all is called, it just returns recorded answers - they're the same for every element. If after all recorded answers all is called again - we record its result, throw out of the block, and rerun it twice with expanded expectations.

On the other hand if we didn't get any calls to all other than those already recorded, it means we reached the action - group it sees is every element with the same test history. This must only happen once for group, so we return from function.

Total number of block calls is - 1x for each action, 2x for directories, 3x for symlinks, 4x for big files, and also 4x for everything else. Avoiding these reruns would be totally possible with callcc - but it's rather ugly, and often these tests aren't an issue.

So problem solved? Not really. I keep finding myself in situations where a new control structure would make a big difference, and there just doesn't seem to be any way of making it work in Ruby without enough boilerplate code to make it not worthwhile.

I'll end this post with some snippets of code which are just not quite right. Any ideas for making them suck less?

urls = Hash[{|line| id, url = line.split; [id.to_i, url]}] 
each_event{|type, *args| 
  case type
  when :foo
    one, two = *args
    # ...
  when :bar
    one, = *args
    # ...

if dir
  Dir.chdir(dir){ yield(x) }

Saturday, July 17, 2010

Ruby is now more popular than C++

啊?! by bc cf from flickr (CC-ND)

There are many ways to measure popularity, and I'll take one proposed by creator of C++ Bjarne Stroustrup. According to him there are only two kinds of languages - those that nobody uses and those that everybody bitches about - so counting google results for "<language> sucks" is a perfectly good way of measuring popularity.

I did an identical experiment exactly 4 years ago, so it's interesting to see what changed since then.
  • D 147000 - mostly bogus matches for "3D sucks" etc.
  • Java 56300
  • C 48900 - possibly also many bogus matches
  • PHP 34500
  • Ruby 25900
  • Ruby on Rails 18100 - included for comparison only
  • Scheme 14900 - my blog is #1, also many bogus matches
  • C++ 14000
  • Visual Basic 11600
  • Python 8930 
  • Perl 5450
  • Lisp 3510
  • C# 3310
  • Ada 1240
  • OCaml 1070
  • SML 784
  • Erlang 750
  • Cobol 641
  • Fortran 476
  • Haskell 416
  • Smalltalk 176
  • Prolog 161
Some things just don't change. Ignoring queries with too many false positives, the list is dominated by Java and PHP, just as four years ago - with Java lead now being a lot stronger. Most niche languages like OCaml, Smalltalk, and Prolog are still niche languages - although many get a lot of bitching these days (like OCaml's 17x increase).

On the other hand some things changed. Perl used to be very high in the sucking charts - at about 15x as many sucks as Ruby and Python, but isn't anywhere close to the top now, losing almost half of the sucks in that time, as old ones die in link rot, and new ones stop being generated.

The second biggest success story is Python, which sucks 12x as much now, finally overtaking Perl.

But the biggest success is a spectacular explosion of popularity of Ruby. My first list was released only half a year after Rails 1.0, when many people were intrigued by Ruby, but few were actually using it. In those four years Ruby suckiness levels exploded 43x - not even counting Rails bitching, a lot of which is as much about Ruby as about Rails themselves.

Ruby is now a lot more popular than C++ - according to the very metric endorsed by C++ creator. What alternative explanation is there? That C++ is used a lot, it just happens to suck less? Come on.

People complaining about scientific validity of this post are sent here.

Another example of Ruby being awesome - %W

cuteness by sparkleice from flickr (CC-NC-ND)

And there I was thinking I knew everything about Ruby, at least as far as its syntax goes...

As you might have figured out from my previous posts, I'm totally obsessed about string escaping hygiene - I would never send "SELECT * FROM reasons_why_mysql_sucks WHERE reason_id = #{id}" to an sql server even if I was absolutely totally certain that id is a valid integer and nothing can possibly go wrong here. Sure, I might be right 99% of time, but it only takes a single such mistake to screw up the system. And not only with SQL - it's the same with generated HTML, generated shell commands and so on.

And speaking of shell commands - system function accepts either a string which it then evaluates according to shell rules (big red flag), or a list of arguments which it uses to fork+exec right away. Of course we want to do that - except it's really goddamn ugly. Faced with a choice between this insecure but reasonably looking way of starting MongoDB shard servers:

system "mongod --shardsvr --port '#{port}' --fork --dbpath '#{data_dir}' \
--logappend --logpath '#{logpath}' --directoryperdb"

And this secure but godawful (to_s is necessary as port is an integer, and system won't take that):

system *["mongod", "--shardsvr", "--port", port, "--fork",
"--dbpath", data_dir, "--logappend",
"--logpath", logpath, "--directoryperdb"].map(&:to_s)

Even I have my doubts.

And then I found something really cool in Ruby syntax that totally solves the problem. Now I was totally aware of %w[foo bar] syntax Ruby copied from Perl's qw[foo bar], and while useful occasionally, is really little more than constructing a string, and then calling #split on that.

And I though I was also aware of %W - which obviously would work just like %w except evaluating code inside. Except that's not what it does! %W[foo #{bar}] is not "foo #{bar}".split - it's ["foo", "#{bar}"]! And using a real parser of course, so you can use as many spaces inside that code block as you want.

system *%W[mongod --shardsvr --port #{port} --fork --dbpath #{data_dir}
--logappend --logpath #{logpath} --directoryperdb]

There's nothing in Perl able to do that. Not only it's totally secure, it looks every better than the original insecure version as you don't need to insert all those 's around arguments (which only half-protected them anyway, but were better than nothing), and you can break it into multiple lines without \s.

%W always does the right thing - %W[scp #{local_path} #{user}@#{host}:#{remote_path}] will keep the whole remote address together - and if the code block returns an empty string or nil, you'll get an empty string there in the resulting array. I sort of wish there was some way of adding extra arguments with *args-like syntax like in other contexts, but %W[...] + args does exactly that, so it's not a big deal.

By the way, it seems to me that all % constructors undeservingly get a really bad reputation as some sort of ugly Perl leftover in Ruby community. This is so wrong - what's ugly is excessive escaping with \ which they help avoid. Which regexp for Ruby executables looks less bad, the one with way too many \/s - /\A(\/usr|\/usr\/local|\/opt|)\/bin\/j?ruby[\d.]*\z/, or one which avoids them all thanks to %r - %r[\A(/usr|/usr/local|/opt|)/bin/j?ruby[\d.]*\z]?

By the way - yes I used []s inside even though they were the big demarcator. That's another great beauty of % constructions - if you demarcate with some sort of braces like [], (), <>, or {} - it will only close once every matched pair inside is closed - so unlike traditional singly and doubly quoted strings % can be nested infinitely deep without a single escape character! (Perl could do that one as well)

And speaking of things that Ruby copied from Perl, and then made them much more awesome, here's a one-liner to truncate a bunch of files after 10 lines, with optional backups. Which language gets even close to matching that? ($. in both Perl and Ruby will keep increasing from file to file, so you cannot use that)

ruby -i.bak -ple 'ARGF.skip if ARGF.file.lineno > 10' files*.txt

Friday, July 16, 2010

Arrays are not integer-indexed Hashes

Cabooki by elycefeliz from flickr (CC-NC-ND)

We use a separate Array type even though Ruby Hashes can be indexed by integers perfectly well (unlike Perl hashes which implicitly convert all hash keys to strings, and array keys to integers). Hypothetically, we could get rid of them altogether and treat ["foo", "bar"] as syntactic sugar for {0=>"foo", 1=>"bar"}.

Now there are obviously some performance reasons for this - these are mostly fixable and a single data structure can perform well in both roles. And it would break backwards compatibility rather drastically, but let's ignore all that and imagine we're designing a completely fresh language which simply looks a lot like Ruby.

What would work

First, a lot of things work right away like [], []=, ==, size, clear, replace, and zip.

The first incompatibility is with each - for hashes it yields both keys and values, for arrays only values, and we'd need to decide one way or the other - I think yielding both makes more sense, but then there are all those non-indexable enumerables which won't be able to follow this change, so there are good reasons to only yield values as well. In any case, each_pair, each_key, and each_value would be available.

Either way, one more change would be necessary here - each and everything else would need to yield elements sorted by key. There are performance implications, but they're not so bad, and it would be nicer API.

Hash's methods keys, values, invert, and update all make perfect sense for Arrays. With keys sorted, first, last, and pop would work quite well. push/<< would be slightly nontrivial - but making it add #succ of the last key (or 0 for empty hashes) would work well enough.

Collection tests like any?, all?, one?, none? are obvious once we decide each, and so is count. map/collect adapts to hashes well enough (yielding both key and value, and returning new value).

Array methods like shuffle, sort, sample, uniq, and flatten which ignore indexes (but not their relative positions) would do likewise for hashes, so flattening {"a"=>[10,20], "b"=>30} would result in [10,20,30] ("a" yields before "b").

Enumerable methods like min/max/min_by/max_by, find, find_index, inject would do likewise.

include? checks values for Arrays and keys for hashes - we can throw that one out (or decide one way or the other, values make more sense to me), and use has_key?/has_value? when it matters.

reverse should just return values, but reverse_each should yield real keys.

I could go on like this. My point is - a lot of this stuff can be made to work really well. Usually there's a single behavior sensible for both Arrays, and Hashes, and if you really need something different then keys, values, or pairs would usually be a suitable solution.

What doesn't work

Unfortunately some things cannot be made to work. Consider this - what should be the return value of {0 => "zero", 1 => "one"}.select{|k,v| v == "one"}?

If we treat it as a hash - let's say a mapping of numbers to their English names, there is only one correct answer, and everything else is completely wrong - {1=>"one"}.

On the other hand if we treat it as an array - just an ordered list of words - there is also only one correct answer, and everything else is completely wrong - {0=>"one"}.

These two are of course totally incompatible. And an identical problem affects a lot of essential methods. Deleting an element renumbers items for an array, but not for a hash. shift/unshift/drop/insert/slice make so sense for hashes, and methods like group_by and partition have two valid and conflicting interpretations. It is, pretty much, unfixable.

So what went wrong? Thinking that Arrays are indexed by integers was wrong!

In {0=>"zero",1=>"one"} association between keys and values is extremely strong - key 0 is associated with value "zero", and key 1 with value "one". They exist as a pair and everything that happens to the hash happens to pairs, not to keys or values separately - there are no operations like insert_value, delete_value which would just shift remaining values around from one key to another. This is the nature of hashes.

Arrays are not at all like that. In ["zero", "one"] association between 0 and "zero" is very weak. The real keys are not 0, and 1 - they're two objects devoid of any external meaning, whose only property is their relative partial order.

To implement array semantics on top of hashes, we need a class like, less_than=nil). Then a construction like this would have semantics we desire.

arr = {}

arr[, nil)] = "zero"

arr[, nil)] = "one"

If we use these instead of integers, hashes can perform all array operations correctly.

# shift


# unshift

arr[, arr.first_key)] = "minus one"

# select - indexes for "zero" and "two" in result have correct order

["zero", "one", "two"].select{|key, value| value != "one"}

# insert - nth_key only needs each

arr[, arr.nth_key(1))] = "one and half"

And so the theory is satisfied. We have a working solution, even if highly impractical one. Of course all these Index objects are rather hard to use, so the first thing we'd do is subclassing Hash so that arr[i] would really mean arr[arr.nth_key(i)] and so on, and there's really no point yielding them in #each and friends... oh wait, that's exactly where we started.

In other words, unification of arrays and hashes is impossible - at least unless you're willing to accept a monstrosity like PHP where numerical and non-numerical indexes are treated differently, and half of array functions accept a boolean flag asking if you'd rather have it behave like an array or like a hash.

Random sampling or processing data streams in Ruby

7 14 10 by BernieG10 from flickr (CC-NC-ND)

It might sound like I'm tackling a long solved problem here - sort_by{rand}[0, n] is a well known idiom, and in more recent versions of Ruby you can use even simpler shuffle[0, n] or sample(n).

They all suffer from two problems. The minor one is that quite often I want elements in the sample to be in the same relative order as in the original collection (this in no way implies sorted) - what can be dealt with by a Schwartzian transform to [index, item] space, sampling that, sorting results, and transforming out to just item.

The major problem is far worse - for any of these to work, the entire collection must be loaded to memory, and if that was possible, why even bother with random sampling? More often than not, the collection I'm interested in sampling is something disk-based that I can iterate only once with #each (or twice if I really really have to), and I'm lucky if I even know its #size in advance.

By the way - this is totally unrelated, but I really hate #length method with passion - collections have sizes, not "lengths" - for a few kinds of collections we can imagine them arranged in a neat ordered line, and so their size is also length, but it's really lame to name a method after special case instead of far more general "size" - hashtables have sizes not lengths, sets have sizes not lengths, and so on - #length should die in fire!

When size is known

So we have a collection we can only iterate once - for now let's assume we're really lucky and we know exactly how many elements it has - this isn't all that common, but it happens every now and then. As we want n elements out of size, probability of each element being included is n/size, and so select{ n > rand(size) } will nearly do the trick - even keeping samples in the right order... except it will only return approximately n elements.

If we're sampling 1000 out of a billion we might not really care all that much, but it turns out it's not so difficult to do better than that. Sampling n elements out of [first, *rest] collection neatly reduces to: [first, *rest.sample(n-1)] with n/size probability, or rest.sample(n) otherwise. Except Ruby doesn't have decent tail-call optimization, so we'll use counters for it.

module Enumerable
  def random_sample_known_size(wanted, remaining=size)
    if block_given?
        if wanted > rand(remaining) 
          wanted -= 1
        remaining -= 1
      rv = []
      random_sample_known_size(wanted, remaining){|it| rv.push(it) }

This way of sampling has an extra feature that it can yield samples one at a time and never needs to store any in memory - something you might appreciate if you want to take a couple million elements out of 10 billions or so, and you will not only avoid loading them to memory, you will be able to use the results immediately, instead of only when the entire input finishes.

This is only possible if collection size is known - if we don't know if there's 1 element ahead or 100 billion, there's really no way of deciding what to put in the sample.

If you cannot fit even the sample in memory at once, and don't know collection size in advice - it might be the easiest thing to iterate twice, first to compute the size, and then to yield random records one at a time (assuming collection size doesn't change between iterations at least). CPU and sequential I/O are cheap, memory and random I/O are expensive.

Russian Blue by Adam Zdebel from flickr (CC-NC-ND)

When size is unknown

Usually we don't know collection size in advance, so we need to keep a running sample - initialize it with the first n elements, and then for each element that arrives replace a random one from the sample with probability n / size_so_far.

The first idea would be something like this:

module Enumerable
  def random_sample(wanted)
    rv = []
    size_so_far = 0
      size_so_far += 1
      j = rand(size_so_far)
      rv.delete_at(j) if wanted == rv.size and wanted > j
      rv.push(it) if wanted > rv.size

It suffers from a rather annoying performance problem - we're keeping the sample in a Ruby Array, and while they're optimized for adding and removing elements at both ends, deleting something from the middle is a O(size) memmove.

We could replace rv.delete_at(j); rv.push(it) with rv[j] = it to gain performance at cost of item order in the sample... or we could do that plus Schwarzian transform into [index, item] space to get correctly ordered results fast. This only matters once sample size reaches tens of thousands, before that brute memmove is simply faster than evaluating extra Ruby code.

module Enumerable
  def random_sample(wanted)
    rv = []
    size_so_far = 0
      size_so_far += 1
      j = wanted > rv.size ? rv.size : rand(size_so_far)
      rv[j] = [size_so_far, it] if wanted > j
    }{|idx, it| it}

This isn't what stream processing looks like!

The algorithms are as good as they'll get, but API is really not what we want. When we actually do have an iterate-once collection, we usually want to do more than just collect a sample. So let's encapsulate such continuously updated sample into Sample class:

class Sample
  def initialize(wanted)
    @wanted = wanted
    @size_so_far = 0
    @sample = []
  def add(it)
    @size_so_far += 1
    j = @wanted > @sample.size ? @sample.size : rand(@size_so_far)
    @sample[j] = [@size_so_far, it] if @wanted > j
  def each
    @sample.sort.each{|idx, it| yield(it)}
  def total_size
  include Enumerable

It's a fully-featured Enumerable, so it should be really easy to use. #total_size will return count of all elements seen so far - calling that #size would conflict with the usual meaning of number of times #each yields. You can even nondestructively access the sample, and then keep updating it - usually you wouldn't want that, but it might be useful for scripts that run forever and periodically save partial results.

To see how it can be used, here's a very simple script, which reads a possibly extremely long list of URLs, and prints a sample of 3 by host. By the way notice autovivification of Samples inside the Hash - it's a really useful trick, and Ruby's autovivification can do a lot more than Perl's.

require "uri"
sites ={|ht,k| ht[k] =}
  host = URI.parse(url).host rescue next
sites.sort.each{|host, url_sample|
  puts "#{host} - #{url_sample.total_size}:"
  url_sample.each{|u| puts "* #{u}"}

So enjoy your massive data streams.

Thursday, July 15, 2010

Synchronized compressed logging the Unix way

tiny tiny kitten 4 weeks old by GeorgeH23 from flickr (CC-NC-ND)
In good Unix tradition if a program generates some data, in general it should write it to STDOUT, and you'll redirect it to the right file yourself.

There are two problems with that, both easily solvable in separation:
  • If it's a lot of data, you want to store it compressed. It would be bad Unix to put compression directly in the program - the right way is to pipe its output through gzip with program | gzip >logfile.gz. gzip is really fast, and usually adequate.
  • You want to be able to see what were the last lines written out by the program at any time. Especially if it appears frozen. Sounds trivial, but thanks to a horrible misdesign of libc, and everything else based on it, data you write gets buffered before being actually written - a totally reasonable thing - and there are no limits whatsoever how long it can stay in buffers! Fortunately it is possible to turn this misfeature off with a single line of STDOUT.sync=true or equivalent in other languages.
Unfortunately while both fixes involve a single line of obvious code - there's no easy way to solve them together. Even if you flushed all data from the program to gzip, gzip can hold onto it indefinitely. Now unlike libc which is simply broken, gzip has a good reason - compression doesn't work on one byte at a time - it takes a big chunk, compresses it, and only then writes it all out.

Still, even if it has good reasons not to flush data as soon as possible, it can and very much should flush it every now and then - with flushing every few seconds reduction in compression ratio will be insignificant, and it will be possible to find out why the program frozen almost right away. The underlying zlib library totally has this feature - unfortunately command line gzip utility doesn't expose it.

So I wrote this:

#!/usr/bin/env ruby

require 'thread'
require 'zlib'

def gzip_stream(io_in, io_out, flush_freq)
  fh = Zlib::GzipWriter.wrap(io_out)
  lock ={
    while true
        return if fh.closed?
        fh.flush if fh.pos > 0
      sleep flush_freq

gzip_stream(STDIN, STDOUT, 5)

It reads lines on stdin, writes them to stdout, and flushes every 5 seconds (or whatever you configure) in a separate Ruby thread. Ruby green threads are little more than a wrapper over select() in case you're wondering. The check that fh.pos is non-zero is required as flushing before you write something seems to result in invalid output.

Now you can program | gzip_stream >logfile.gz without worrying about data getting stuck on the way (if you flush in your program that is).

Friday, July 02, 2010

Palestinian problem fixed

good Shabbos by anomalous4 from flickr (CC-BY)

This solution is really really simple, has plenty of precedent in history, and yet I haven't heard it proposed seriously by anyone before. So here is goes:
Palestinians should convert to Judaism

According to Israeli laws, anybody who converts to Judaism can trivially get Isreali citizenship. Once sufficient number of Palestinians gets citizenship, not only will they no longer be persecuted as much (Israeli Arabs are still treated as second class citizen, but level of discrimination is much much less) - together with existing non-hawkish Israeli citizens they will constitute majority of voters who will ensure that Israel will become a country that's much more peaceful and friendly toward its neighbours. As a bonus on top of that such mass conversion would immensely piss off radicals on both sides.

Now you will very likely complain about freedom of religion and such stuff. But why are these people Muslim in the first place? Because their ancestors converted to religion of previous conquerors of these lands - some were forced to, others out of opportunism, perhaps some even out of genuine conviction - it doesn't matter. Is following religious choices of ancestors really worth all the death and suffering that exists in Palestine right now? I say no. Especially since the most painful part of such conversion - genital mutilation - they all underwent already.

This wouldn't be without precedent - not only religions like Islam and Christianity, but also languages like Latin, English, and Spanish, national identities, and cultures have all spread largely by members of the losing group adapting habits of members of the winning group. In fact it would be a lot more difficult to point to any major religion or nationality which didn't spread in such a way.

It's a topic for whole another post, but ignoring this is the primary reason why I hate most analysis of humans in terms of evolutionary biology so much - variance of human behaviour is almost entirely culture-driven, not gene-driven like they all pretend, and culture doesn't only spread from parents to children.

Anyway, it wouldn't be necessary for everyone to convert - as long as the portion of Palestinians who do is high enough - let's say half - it would completely alter current conflict dynamics enough to solve it. The die-hard few who care too much about Islam can then safety practice their faith in the emerging unified single state (with a huge Jewish majority, even if a lot of that would be rather unenthusiastic converts). As the law doesn't forbid it - children of Palestinian converts could choose whichever religion they wanted - Judaism, Islam, Catholicism, Baha'i, Spaghetti-Monsterism, or whatever. In practice vast majority would simply follow religion of their parents.

jewish cat by Shira Golding from flickr (CC-NC)


Everything else having been tried and having failed already - there is really only one alternative to such mass conversion - waiting a few decades for both Israel and Palestine (and if possible all other countries of the region) to undergo demographic transition from short lives and big families to long lives and small families. Countries after demographic transition are a lot less belligerent.

This is very much happening - but very slowly. According to CIA World Factbook total fertility (number of children per women) between 1990 and 2010 fell from 7.0 to 4.9 for Gaza Strip (still ridiculously high), from 5.0 to 3.12 for West Bank, and from 2.9 to 2.72 for Israel - moving them away from values typical for the worlds' war zones like Afghanistan's 5.50 and Democratic Republic of Congo's 6.11 and more like civilized world's 2.1 and less.

One day it will very likely happen - all will undergo demographic transition, and while they might still hate each other as much as they do now - they will express it like Japanese and Koreans - by flaming each other on the Internet - not in the good old way of bombs and bullets.

This will of course take many decades. My solution solves the whole problem overnight. Any takers?