All this is about discussing types in blog posts, documentation etc. None of that goes anywhere near actual code (except possibly in comments). Ruby never sees that.
Statically typed languages have all this covered, and we need it too. Not static typing of course - just an expressive way to talk about what types things are - as plain English fails here very quickly. As far as I know nothing like that exists yet, so here's my proposal.
This system of type descriptions is meant for humans, not machines. It focuses on the most important distinctions, and ignores details that are not important, or very difficult to keep track of. Type descriptions should only be as specific as necessary in given context. If it makes sense, there rules should be violated.
In advance I'll say I totally ignored all the covariance / contravariance / invariance business - it's far to complicated, and getting too deeply into such issues makes little sense in a language where everything can be redefined.
Basic types
Types of simple values can be described by their class name, or any of its superclasses or mixins. So some ways to describe type of
15 would be
Fixnum (actual class),
Integer (superclass),
Comparable (mixin), or
Object (superclass all the way up).
In context of describing types, everything is considered an
Object, and existence of
Kernel,
BasicObject etc. is ignored.
So far, it should all be rather obvious. Examples:
- 42 - Integer
- Time.now - Time
- Dir.glob("*") - Enumerable
- STDIN - IO
nil and other ignored issues
nil will be treated specially - as if it was of every possible type.
nil means absence of value, and doesn't indicate what type the value would have if it was present. This is messy, but most explicitly typed languages follow this path.
Distinction between situations that allow
nils and those that don't will be treated as all other value range restrictions (
Integer must be posibile,
IO must be open for writing etc.) - as something outside the type system.
For cases where
nil means something magical, and not just absence of value, it should probably be mentioned.
Checked exceptions and related non-local exits in Ruby would be a hopeless thing to even attempt. There's syntax for exceptions and catches used as control structures if they're really necessary.
Booleans
We will also pretend that
Boolean is a common superclass of
TrueClass and
FalseClass.
We will also normally ignore distinction between situations where real
true/
false are expected, and situations where any object goes, but acts identically to its boolean conversion. Any method that acts identically on
x and
!!x can be said to take
Boolean.
On the other hand if some values are treated differently than their double negation, that's not really
Boolean and it deserves a mention. Especially if
nil and
false are not equivalent - like in Rails's
#in_groups_of (I don't think Ruby stdlib ever does thing like that).
Duck typing
If something quacks like a
Duck convincingly enough, it can be said to be of type
Duck, it being object's responsibility that its cover doesn't get blown.
In particular, Ruby uses certain methods for automatic type conversion. In many contexts objects implementing
#to_str like
Pathnames will be treated as
Strings, objects implementing
#to_ary as
Arrays,
#to_hash as
Hashes, and
to_proc as
Procs - this can be used for some amazing things like
Symbol#to_proc.
This leads to a big complication for us - C code implementing Ruby interpreter and many libraries is normally written in a way that calls these conversion functions automatically, so in such contexts
Symbol really is a
Proc,
Pathname really is a
String and so on. On the other hand, in Ruby code these methods are not magical, and such conversions will only happen if explicitly called - for them
Pathname and
String are completely unrelated types. Unless Ruby code calls C code, which then autoconverts.
Explicitly differentiating between contexts which expect a genuine
String and those which expect either that or something with a valid
#to_str method would be highly tedious, and I doubt anyone would get it exactly right.
My recommendation would be to treat everything that autoconverts to something as if it subclassed it. So we'll pretend
Pathname is a subclass of
String, even though it's not really. In some cases this will be wrong, but it's not really all that different from subclassing something and then introducing incompatible changes.
This all doesn't extend to
#to_s,
#to_a etc - nothing can be described as
String just because it has
to_s method - every object has
to_s but most aren't really strings.
Technical explanation of to_str and friends
This section is unrelated to post's primary subject - skip if uninterested.
Ruby uses special memory layout for basic types like strings and arrays. Performance would be abysmal if string methods had to actually call Ruby code associated with whatever
[] happened to be redefined to for every character - instead they ask for a certain C data structure, and access that directly (via some macros providing extra safety and convenience to be really exact).
By the way this is a great example of C being really slow - if Ruby was implemented on a platform with really good JIT, it could plausibly have every single string function implemented in term of calls to
[],
[]=,
size, and just a few others, with different subclasses of
String providing different implementations, and JIT compiling inlining all that to make it really fast.
It would make it really simple to create class representing a text file, and
=~ /regexp/ that directly without reading anything more than required to memory, or maybe even
gsub! it in a way that would read it in small chunks, saving them to another file as soon as they're ready, and then renaming in one go. All that without regexp library knowing anything about it all. It's all just my fantasy, I'm not saying any such JIT actually exists.
Anyway, strings and such are implemented specially, but we still want these types to be real objects, not like what they've done in Java. To make it work, all C functions requiring access to underlying storage call a special macro which automatically calls a method like
to_str or
to_ary if necessary - so such objects can pretend to be strings very effectively. For example if you alias method
to_str to
path on
File code like
system File.open("/bin/hostname") will suddenly start working. It really makes sense only for things which are "essentially strings" like
Pathname,
URI, Unicode-enhanced strings, proxies for strings in third party libraries like Qt etc.
To complicate things further objects of all classes inheriting from
String automatically use
String's data representation - and C code will access that, never calling
to_str. This leaves objects which duck type as
Strings two choices:
- Subclass String and every time anything changes update C string data. This can be difficult - if you implement an URI and keep query part as a hash instance variable - you need to somehow make sure that your update code gets run every time that hash changes - like by not exposing it at all and only allowing query updates via your direct methods, or wrapping it in a special object that calls you back.
- Don't subclass String, define to_str the way you want. Everything works - except your class isn't technically a String so it's not terribly pretty OO design.
You probably won't be surprised that not subclassing is the more popular choice. As it's all due to technical limitations not design choices, it makes sense to treat such objects as if they were properly subclassed.
Collections
Back to the subject. For collections we often want to describe types of their elements. For simple collections yielding successive elements on
#each, syntax for type description is
CollectionType[MemberType]. Examples:
- [42.0, 17.5] - Array[Float]
- Set["foo","bar"] - Set[String]
- 5..10 - Range[Integer]
When we don't care about collection type, only about element types, descriptions like
Enumerable[ElementType] will do.
Syntax for types of hashtables is
Hash[KeyType, ValueType] - in general collections which yield multiple values to
#each can be described as
CollectionType[Type1, Type2, ..., TypeN].
For example
{:foo => "bar"} is of type
Hash[Symbol, String].
This is optional - type descriptions like
Hash or
Enumerable are perfectly valid - and often types are unrelated, or we don't care.
Not every
Enumerable should be treated as collection of members like that -
File might technically be
File[String] but it's usually pointless to describe it this way. In 1.8
String is
Enumerable, yielding successive lines when iterated - but
String[String] make no sense (no longer a problem in 1.9).
Classes other than
Enumerable like
Delegator might need type parameters, and they should be specified with the same syntax. Their order and meaning depends on particular class, but usually should be obvious.
Literals and tuples
Ruby doesn't make distinction between
Arrays and tuples. What I mean here is a kind of
Array which shouldn't really be treated as a collection, and in which different members have unrelated type and meaning depending on their position.
Like method arguments. It really wouldn't be useful to say that every method takes
Array[Object] (and an optional
Proc) - types and meanings of elements in this array should be specified.
Syntax I want for this is
[Type1, Type2, *TypeRest] - so for example
Hash[Date, Integer]'s
#select passes
[Date, Integer] to the block, which should return a
Boolean result, and then returns either
Array[[Date, Integer]] (1.8) or
Hash[Date, Integer] (1.9). Notice double
[[]]s here - it's an
Array of pairs. In many contexts Ruby automatically unpacks such tuples, so
Array[[Date,Integer]] can often be treated as
Array[Date,Integer] - but it doesn't go deeper than one level, and if you need this distinction it's available.
Extra arguments can be specified with
*Type or
... which is treated here as
*Object. If you want to specify some arguments as optional suffix their types with
? (the most obvious
[] having too many uses already, and
= not really fitting right).
In this syntax
[*Foo] is pretty much equivalent to
Array[Foo], or possibly
Enumerable[Foo] (with some duck typing) - feel free to use that if it makes things clearer.
Basic literals like
true,
false,
nil stand for themselves - and for entire
TrueClass,
FalseClass,
NilClass classes too as they're their only members. Other literals such as symbols, strings, numbers etc. can be used too when needed.
To describe keyword arguments and hashes used in similar way, syntax is
{Key1=>Type1, Key2=>Type2} - specifying exact key, and type of value like
{:noop=>Boolean, :force=>Boolean}.
It should be assumed that keys other than those listed are ignored, cause exception, or are otherwise not supported. If they're meaningful it should be marked with
... like this
{:query=>String, ...}. Subclasses often add extra keyword arguments, and this issue is ignored.
Functions
Everything so far was just a prelude to the most important part of any type system - types for functions. Syntax I'd propose it:
ArgumentTypes -> ReturnType (
=> being already used by hashes).
I cannot decide if blocks should be specified in Ruby-style notation or a function notation, so both
& {|BlockArgumentTypes| BlockReturnType} and
&(BlockArgumentTypes->BlockReturnType) are valid.
& is necessary, as block are passed separately from normal arguments, however strong the temptation to reuse
-> and let the context disambiguate might be.
Blocks that don't take any arguments or don't return anything can drop that part, leaving only something like
&{|X|},
&{Y},
&{}, or in more functional notation
&(X->),
&(Y),
&().
Because of all the
[] unpacking, using
[] around arguments, tuple return values etc. is optional - and just like in Ruby
() can be used instead in such contexts.
If function doesn't take any arguments, or returns no values, these parts can be left - leaving perhaps as little as
->.
Examples:
- In context of %w[Hello world !].group_by(&:size) method #group_by has type Array[String]&{|String| Integer}->Hash[Integer,String]
- Time.at has type Numeric -> Time
- String#tr has type [String, String] -> String
- On a collection of Floats, #find would have type Float?&(Float->Boolean)->Float
- Function which takes no arguments and returns no values has type []->nil
If you really need to specify exceptions and throws, you can add
raises Type, or
throws :kind after return value. Use only for control structure exceptions, not for actual errors exceptions. It might actually be useful if actual data gets passed around.
- Find.find has type [String*]&(String->nil throws :prune)->nil
A standalone
Proc can be described as
(ArgumentsTypes->ReturnType) just as with notation for functions. There is no ambiguity between
Proc arguments and block arguments, as blocks are always marked with
|.
Type variable and everything else
In addition to names of real classes, any name starting with an uppercase letter should be consider a type. Unless it's specified otherwise in context, all such unknown names should be considered class variables with big forall quantifier in front of it all.
Examples:
- Enumerable[A]#partition has type &(B->Boolean)->[Array[A], Array[A]]
- Hash[A,B]#merge has type Hash[A,B]&(A,B,B->B)->Hash[A,B]
- Array[A]#inject has either type B&(B,A->B)->B or &(A,A)->A. This isn't just a usual case of missing argument being substituted by nil - these are two completely different functions.
To specify that multiple types are allowed (usually implying that behaviour will be different, otherwise there should be a superclass somewhere, or we could treat it as common duck typing and ignore it) join them with
|. If there's ambiguity between this use and block arguments, parenthesize. It binds more tightly than
,, so it only applies to one argument. Example:
- String#index in 1.8 has type (String|Integer|Regexp, Integer?)->Integer (and notice how I ignored Fixnums here).
For functions that can be called in multiple unrelated ways, just list them separately -
| and parentheses will work, but they are usually top level, and not needed anywhere deeper.
If you want to specify type of
self, prefix function specification with
Type#:
- #sort has type like Enumerable[A]#()&(A,A->1|0|-1)->Array[A]
To specify that something takes range of values not really corresponding to a Ruby class, just define such extra names somewhere and then use like this:
- File#chown has type (UnixUserId, UnixUserId)->0 - with UnixUserId being a pretend subclass of Integer, and 0 is literal value actually returned.
To specify that something needs a particular methods just make up a pretend mixin like
Meowable for
#meow.
Any obvious extensions to this notation can be used, like this:
- Enumerable[A]#zip has type (Enumerable[B_1], *Enumerable[B_i])->Array[A, B_1, *B_i] - with intention that B_is will be different for each argument understood from context. (I don't think any static type system handles cases like this one reasonably - most require separate case for each supported tuple length, and you cannot use arrays if you mix types. Am I missing something?)
The End
Well, what I really wanted to do what talk about Ruby collection system, and how 1.9 doesn't go far enough in its attempts at fixing it. And without notation for types talking about high order functions that operate on collections quickly turns into a horrible mess. So I started with a brief explanation of notation I wanted to use, and then I figured out I can as well do it right and write something that will be reusable in other contexts too.
Most discussion of type systems concerns issues like safety and flexibility, which don't concern me at all, and limit themselves to type systems usable by machines.
I need types for something else - as statements about data flow. Type signature like
Enumerable[A]#()&(A->B)->Hash[A,B] doesn't tell you exactly what such function does but narrows set of possibilities extremely quickly. What it describes is a function which iterates over collection in order while building a
Hash to be returned, using collection's elements as keys, and values returned by the block as values. Can you guess the function I was thinking about here?
Now a type like that is not a complete specification - a function that returns an empty hash fits it. As does one which skips every 5th element. And one that only keeps entries with unique block results. And for that matter also one that sends your email password to NSA - at least assuming it returns that
Hash afterwards.
It was still pretty useful. How about some of those?
- Hash[A,B] -> Hash[B, Array[A]]
- Hash[A,B] &(A,B->C) -> Hash[A,C]
- Hash[A, Hash[B,C]] -> Hash[[A,B], C]
- Hash[A,B] &(A,B->C) -> Hash[C, Hash[A,B]]
- Enumerable[Hash[A,B]] &(A,B,B->B) -> Hash[A,B]
- Hash[A,Set[B]] -> Hash[Set[A], Set[B]]
Even these short snippets should give a pretty good idea what these are all about.
That's it for now. Hopefully it won't be long until that promised 1.9 collections post.