Friday, July 16, 2010

Arrays are not integer-indexed Hashes

Cabooki by elycefeliz from flickr (CC-NC-ND)

We use a separate Array type even though Ruby Hashes can be indexed by integers perfectly well (unlike Perl hashes which implicitly convert all hash keys to strings, and array keys to integers). Hypothetically, we could get rid of them altogether and treat ["foo", "bar"] as syntactic sugar for {0=>"foo", 1=>"bar"}.

Now there are obviously some performance reasons for this - these are mostly fixable and a single data structure can perform well in both roles. And it would break backwards compatibility rather drastically, but let's ignore all that and imagine we're designing a completely fresh language which simply looks a lot like Ruby.

What would work


First, a lot of things work right away like [], []=, ==, size, clear, replace, and zip.



The first incompatibility is with each - for hashes it yields both keys and values, for arrays only values, and we'd need to decide one way or the other - I think yielding both makes more sense, but then there are all those non-indexable enumerables which won't be able to follow this change, so there are good reasons to only yield values as well. In any case, each_pair, each_key, and each_value would be available.


Either way, one more change would be necessary here - each and everything else would need to yield elements sorted by key. There are performance implications, but they're not so bad, and it would be nicer API.


Hash's methods keys, values, invert, and update all make perfect sense for Arrays. With keys sorted, first, last, and pop would work quite well. push/<< would be slightly nontrivial - but making it add #succ of the last key (or 0 for empty hashes) would work well enough.


Collection tests like any?, all?, one?, none? are obvious once we decide each, and so is count. map/collect adapts to hashes well enough (yielding both key and value, and returning new value).


Array methods like shuffle, sort, sample, uniq, and flatten which ignore indexes (but not their relative positions) would do likewise for hashes, so flattening {"a"=>[10,20], "b"=>30} would result in [10,20,30] ("a" yields before "b").


Enumerable methods like min/max/min_by/max_by, find, find_index, inject would do likewise.

include? checks values for Arrays and keys for hashes - we can throw that one out (or decide one way or the other, values make more sense to me), and use has_key?/has_value? when it matters.

reverse should just return values, but reverse_each should yield real keys.


I could go on like this. My point is - a lot of this stuff can be made to work really well. Usually there's a single behavior sensible for both Arrays, and Hashes, and if you really need something different then keys, values, or pairs would usually be a suitable solution.

What doesn't work


Unfortunately some things cannot be made to work. Consider this - what should be the return value of {0 => "zero", 1 => "one"}.select{|k,v| v == "one"}?


If we treat it as a hash - let's say a mapping of numbers to their English names, there is only one correct answer, and everything else is completely wrong - {1=>"one"}.


On the other hand if we treat it as an array - just an ordered list of words - there is also only one correct answer, and everything else is completely wrong - {0=>"one"}.

These two are of course totally incompatible. And an identical problem affects a lot of essential methods. Deleting an element renumbers items for an array, but not for a hash. shift/unshift/drop/insert/slice make so sense for hashes, and methods like group_by and partition have two valid and conflicting interpretations. It is, pretty much, unfixable.


So what went wrong? Thinking that Arrays are indexed by integers was wrong!


In {0=>"zero",1=>"one"} association between keys and values is extremely strong - key 0 is associated with value "zero", and key 1 with value "one". They exist as a pair and everything that happens to the hash happens to pairs, not to keys or values separately - there are no operations like insert_value, delete_value which would just shift remaining values around from one key to another. This is the nature of hashes.


Arrays are not at all like that. In ["zero", "one"] association between 0 and "zero" is very weak. The real keys are not 0, and 1 - they're two objects devoid of any external meaning, whose only property is their relative partial order.


To implement array semantics on top of hashes, we need a class like Index.new(greater_that=nil, less_than=nil). Then a construction like this would have semantics we desire.

arr = {}

arr[Index.new(arr.last_key, nil)] = "zero"

arr[Index.new(arr.last_key, nil)] = "one"



If we use these instead of integers, hashes can perform all array operations correctly.

# shift

arr.delete(arr.first_key)

# unshift

arr[Index.new(nil, arr.first_key)] = "minus one"

# select - indexes for "zero" and "two" in result have correct order

["zero", "one", "two"].select{|key, value| value != "one"}

# insert - nth_key only needs each

arr[Index.new(arr.nth_key(0), arr.nth_key(1))] = "one and half"


And so the theory is satisfied. We have a working solution, even if highly impractical one. Of course all these Index objects are rather hard to use, so the first thing we'd do is subclassing Hash so that arr[i] would really mean arr[arr.nth_key(i)] and so on, and there's really no point yielding them in #each and friends... oh wait, that's exactly where we started.

In other words, unification of arrays and hashes is impossible - at least unless you're willing to accept a monstrosity like PHP where numerical and non-numerical indexes are treated differently, and half of array functions accept a boolean flag asking if you'd rather have it behave like an array or like a hash.

3 comments:

  1. Anonymous12:21

    Great article and I agree with you. The distinction between Arrays and Hashs is very valuable.

    But I was wondering about one part:

    {0 => "zero", 1 => "one"}.select{|k,v| v == "one"}

    Shouldn't this return (for a hash)

    {0 => {1 => "one}}

    Because select always returns Arrays — or Hashes in our case.

    ReplyDelete
  2. Anonymous: Enumerable#select is supposed to return part of collection that matches your block.

    As Enumerable interface requires only #each to exist, generic #select cannot really do much more than put selected results in an Array. It's a perfectly good fallback.

    On the other hand Hash#select should return another Hash, as it has a well defined concept of a sub-collection.

    And in 1.9 they fixed it.

    On the other hand Set#select still returns an Array.

    ReplyDelete
  3. Anonymous13:24

    taw: Oh, you're right. I'm sorry. I was somehow confused about the behavior of select.

    ReplyDelete