The best kittens, technology, and video games blog in the world.

Saturday, May 12, 2007

Syntactic tradeoffs in functional languages

Bear & Avery by `Pacdog from flickr (CC-BY)It's difficult to design a syntax for functions (blocks, closures, anonymous inner classes) as first-class values which looks natural in traditional (aka Algol-like) syntax. There have been three solutions:

  • Use non-traditional syntax - like Scheme, Smalltalk, OCaml, Haskell
  • Accept cumbersome syntax for function values to keep traditional syntax - Python, Java
  • Use traditional syntax, but add some sugar for functions which take just one functional argument - Ruby
Languages from the first group were never too popular, and languages from the second were never too "functional" (I'm not talking about purity, just being able to use high-order functions and stuff naturally). Right now Ruby is the most popular kinda-functional language.

SyntaxNo block argumentsOne block argumentMultiple block arguments
Scheme-likeOKOKOK
Ruby-likeGreatGreatUgly
Python-likeGreatUglyUgly
So Ruby-like syntax is a pure win over Python-like syntax - better for one-block functions, and equally ugly for no-block and multiple-block functions. It is generally considered a win over nontraditional syntax for no-block and one-block functions.

Compared to non-traditional syntax, there's a trade-off. It looks better for no-block and one-block functions, but loses for multiple-block functions.

To find out how common functions taking various numbers of blocks are I grepped OCaml library interface files. I took OCaml because as language with non-traditional syntax it wouldn't have anti-functional bias typical in Algol-like languages, and unlike Scheme/Smalltalk/etc. it is statically typed so it wouldn't take too much work to find out how many blocks functions take. SML and Haskell would probably work too here.

I copied /usr/lib/ocaml/3.09.2/ and extracted inferred type information.
$ ruby -e 'Dir["**/*.mli"].each{|fn| system %Q[ocamlc -i #{fn} >#{fn.sub(/\.mli$/,".t")}]}'

I was only interested in functions so I preprocessed it a bit:

$ cat *.t */*.t |
ruby -e 'print STDIN.read.gsub(/(:|->|=)\n\s*/, "\\1 ")' |
gr -v '^\s*(type|mutable|\}|\||module|end|exception|and|")' |
ruby -pe '$_.sub!(/=.*\n/,"")' |
gr -- '->' >FUNCTIONS

$ cat FUNCTIONS | wc -l
2239
$ cat FUNCTIONS | gr '\([^\)]*->[^(]*\)' | wc -l
286
$ cat FUNCTIONS | gr '\([^\)]*->[^(]*\).*\([^\)]*->[^(]*\)' | wc -l
16
So among 2239 functions defined in OCaml standard library (and whatever libraries I have installed), 87.2% take no blocks, 12.1% take one block, and 0.7% take two or more blocks.

This means:
  • Functions which take one block are common enough for Ruby syntax to be a major win over Python syntax
  • Functions which take multiple blocks are so rare that better syntax for them won't make enough different to offset non-traditional syntax in other cases
  • Ruby-style syntax wins (at least unless other languages have macros)
Here are the functions taking multiple blocks found. In camlinternalOO.mli:
  • val make_class : string array -> (table -> Obj.t -> t) -> t * (table -> Obj.t -> t) * (Obj.t -> t) * Obj.t
  • val dummy_class : string * int * int -> t * (table -> Obj.t -> t) * (Obj.t -> t) * Obj.t
In format.mli
  • val set_formatter_output_functions : (string -> int -> int -> unit) -> (unit -> unit) -> unit
  • val get_formatter_output_functions : unit -> (string -> int -> int -> unit) * (unit -> unit)
  • val set_all_formatter_output_functions : out:(string -> int -> int -> unit) -> flush:(unit -> unit) -> newline:(unit -> unit) -> spaces:(int -> unit) -> unit
  • val get_all_formatter_output_functions : unit -> (string -> int -> int -> unit) * (unit -> unit) * (unit -> unit) * (int -> unit)
  • val make_formatter : (string -> int -> int -> unit) -> (unit -> unit) -> formatter
  • val pp_set_formatter_output_functions : formatter -> (string -> int -> int -> unit) -> (unit -> unit) -> unit
  • val pp_get_formatter_output_functions : formatter -> unit -> (string -> int -> int -> unit) * (unit -> unit)
  • val pp_set_all_formatter_output_functions : formatter -> out:(string -> int -> int -> unit) -> flush:(unit -> unit) -> newline:(unit -> unit) -> spaces:(int -> unit) -> unit
  • val pp_get_all_formatter_output_functions : formatter -> unit -> (string -> int -> int -> unit) * (unit -> unit) * (unit -> unit) * (int -> unit)
In printf.mli:
  • external index_of_int : int -> index val scan_format : string -> 'a array -> index -> int -> (index -> string -> int -> 'b) -> (index -> 'c -> 'd -> int -> 'b) -> (index -> 'e -> int -> 'b) -> (index -> int -> 'b) -> (index -> ('f, 'g, 'h, 'i) format4 -> int -> 'b) -> 'b
  • val sub_format : (string -> int) -> (string -> int -> char -> int) -> char -> string -> int -> int
In equeue/equeue.mli:
  • val create : ?string_of_event:('a -> string) -> ('a t -> unit) -> 'a t
In netstring/cgi.mli:
  • val init_mt : (unit -> unit) -> (unit -> unit) -> unit
And the last one was false positive of regular expression ((unit -> (unit -> unit) * (unit -> unit)) -> unit), but I kept it in the statistics, as filtering only false positives and not false negatives would bias the results more than no filtering (and it doesn't change the conclusion).

Perl-compatible regular expressions

In the code above I used gr which isn't a standard Unix command. I'm simply sick of non-Perl-compatible regular expressions. grep should die, die, die horrible death. Unfortunately pcregrep program has annoyingly long name, so I aliased it gr. Do yourself a favour, run:
$ sudo apt-get install pcregrep
$ echo "alias gr=pcregrep" >>~/.bashrc
and never use grep again.

22 comments:

Anonymous said...

Hi

You can run "grep -P" to use perl regexes.

taw said...

andre: -P doesn't work on Debian/Ubuntu.

$ grep -P '\d+'

grep: The -P option is not supported

Slava Pestov said...

Why should one strive for 'traditional' syntax? I never learned Python and Ruby, never will, and I don't see why their syntax offers anything over Scheme or Ocaml.

Unknown said...

You can put Erlang in the second group too, I think. I just wrote about this today.

Natalie said...

Slava: If you haven't learned them, no wonder you can't see what's good about their syntax.

Paddy3118 said...

Any examples for your initial table on syntaxes?
- Paddy.

Anonymous said...

Smalltalk:

seq size
seq at: x put: y
seq map: [...]
seq map: [...] reduce: [...]

Note:

Space instead of dot: seq size == seq.size()

at:put: is one method of two arguments, so is map:reduce:

[...] == lambda{...}

Anonymous said...

As a previous poster mentioned, Smalltalk probably deserves some consideration here. Multi-block methods seem more common than in other languages. Especially if instead of "number defined", you look at "number of calls."

Not to mention, Smalltalk syntax scales very elegantly from the no-block to the one-block and finally to the multi-block.

Mike McNally said...

The claim that Ruby is the most popular "kinda functional" language is, I believe, not true. Javascript is almost trivially demonstrable as being more "popular", and Javascript supports true first-class functions.

Now, it's probably the case that only a small minority of those who make Javascript so popular know that it is in fact a "kinda functional" programming language. And perhaps it gets "ugly" marks across the board ...

Anonymous said...

Why is Lisp-like syntax only "OK"? (I won't even ask why you call it "Scheme-like"...)

I think the most elegant for this would be:
Lisp-like
Smalltalk-like
Haskell-like

These win anything Python and Ruby have to offer.
And they're more elegant ;)

Anthony said...

This is some great analysis, but, despite being a big fan, I think Ruby's syntax still leaves something to be desired. For example, binding precedence can be confusing due to optional parentheses.

I often find myself missing the simplicity of Lisp syntax when dealing with a lot of higher order functions. In fact, I don't think my experience is that Ruby's syntax is ideal for 99% of circumstances as your OCaml investigation might suggest, but I'm not sure what percentage it is.

taw said...

Mike McNally: First-class function values are certainly required for the language to be "kinda-functional", but that's a very low standard of being "kinda-functional" - even Perl has them. Javascript code I've seen tends to contain rather few high-order functions, so I don't really count it here.

And I'm not sure if it's still that much more popular than Ruby. It used to be so not that long ago, but TIOBE thinks the difference is rather small today (not that I trust TIOBE, but flawed data is better than no data).

Anonymous said...

I dislike your whole "traditional syntax" view point. There's no such thing, it's a popular syntax, but calling it traditional implies it's better and older, neither is the case.

You attach too much value to "traditional syntax" as if it's worth a lot to try and keep it. It's not. The C'ish syntax mostly sucks, is totally inelegant, full of exceptions and stuff you have to memorize, and hard to parse and generate.

Smalltalk and Lisps syntax on the other hand, are elegant, simple, easy to parse and generate, and simply better in every objective measure of merit in comparison to C'ish languages.

"Traditional syntax" doesn't deserve your adulation, rather, it needs to be abolished in favor of the superior alternatives already available.

taw said...

Anonymous: I call it "Scheme-like" to be precise. Scheme code tends to be much more functional than code written in other Lisps like Emacs Lisp and CL.

Elegance is one thing, but languages with non-traditional syntax very rarely get popular.

taw said...

Anthony: I compared only one point - syntax for block arguments. It's a very important one, but not the only.

I think most future languages should follow Ruby here - make sure one-block case is handled well, and not sacrifice familiarity just to handle very rare multi-block case.

taw said...

Ramon Leon: We can call alternative syntaxes "superior" or "more elegant", but languages are much more likely to become popular if they follow established traditions.

Traditions can be broken if you get something more valuable in exchange - like macros. I'm just pointing out that supporting block arguments and functional-style programming is not one of such things, as one-block case is handled by fairly traditional Ruby syntax, and multiple-blocks case is too rare to matter.

Anthony said...

taw, I understand the focus of your analysis, and agree with where you're coming from. What I'm saying is, that agreement already granted, I think Ruby syntax has some potentially rough patches with nested invocations of functions that accept a single block.

A more idle consideration is whether or not functions that take multiple blocks will become more common. The (somewhat) interesting point being that such functions are less necessary (using that word loosely) if you have elegant support for nested invocation of functions that accept a single block.

On another note, I disagree with your characterization of Javascript as not containing many higher order functions. Callback-type designs are so common in Javascript that higher order functions are all over the place. There are also a good number of functional programming academics porting techniques to Javascript, which I think will encourage a continued uptake of functional style in the Javascript community.

Anonymous said...

andre: grep -P is non-standard, not everyone uses bloated crapware filled with excess "features". And there's absolutely no need to use PCR for this (not PCRE, they are not regular expressions at all), grep is fine. Perl's regexs are very bad, you should learn actual regular expressions instead:
http://swtch.com/~rsc/regexp/regexp1.html

And taw, you win the useless use of cat award. Why is it that nobody can be bothered to learn the basics of unix anymore? You aren't concatenating anything, so don't invoke cat!

taw said...

Anonymous: There are some extreme cases where grep or pcregrep is much faster than another, but nobody cares because they never happen in real life.

On the other hand limitations of grep are a big problem all the time.

"cat FILE | complicated_command" is idiomatic Unix usage. Shells should have learned to handle such simple cats internally by now. If they haven't (and they haven't), that's a bug. I'm hopeful, however - someone finally started caring about shell usability. Recently shells started telling which package needs to be installed if some unknown command was issued. Some day maybe cp, mv and > will stop overwriting files without being asked to. (you can get there today with alias cp='cp -i', libtrash and so on). It's not like 1980s Unix is the best we can do.

Anonymous said...

What limitations of grep? Give an actual example. There is no case where pcregrep is significantly faster than grep, ever. Naive backtracking regex implimentations perform exponentially worse in many situations, and if you've had any serious experience with these bad regex implimentations you would be familiar with the common workarounds you need to use to create equivilent but non-pathalogical patterns so performance doesn't deteriorate exponentially as the input grows. You should actually read that link instead of dismissing it with nonsense like that.

cat somefile | some_command is NOT idiomatic unix, its idiotic unix. grep takes a file as its argument. grep 'pattern' file, not cat file| grep 'pattern'. And if something doesn't take a file as an argument, then redirect its stdin: grep 'pattern' < somefile.

What do you want your shell to do about the fact that you run commands for no reason? Its not a bug, its doing exactly what you asked it to do. Its not the shell's fault you told it to do something dumb. Unix is very user friendly and easy to learn and use. Just because you don't like the way it works doesn't mean unix needs to change, perhaps you just need to find an OS that suits your desires better. If you want needless questioning when you command (look up the definition of command) the system to do something, then tell it to ask you. mv means move the file. That's what it does. mv -i means move the file but ask me questions first. And that's what it does. 1980's unix isn't bad, its far better than the unfortunate messes that linux distros are. plan9 is even nicer, than unix, but its mouse-centric interface is a real disapointment.

taw said...

Anonymous: Some limitations of grep are not supporting character classes like \d and \p{Hiragana}, ()s, backreferences, and all other useful PCRE features. It's not true that GNU grep is always faster - GNU grep used to be painfully slow in UTF-8 mode, something like 10x slower than pcregrep, but they apparently fixed that already.

"mv" should mean "move file". Unfortunately Unix is braindead enough to interpret it as "remove whatever was there first and move file". The removing part is intended behaviour in maybe 1% of cases - they you could confirm or use -f. In 99% if something is there it means a mistake was made, and you want to cancel. It's one of the worst cases of Unix brain damage ever - causing severe loss of user data just because the design fuckup is old and Unix fanboys consider all old design fackups sacrosanct.

In any case if you want 1980s Unix, you know where to find it. The rest of us moved on since then.

Anonymous said...

Of the "limitations" you mentioned, only one is real, and that's backreferences. The rest are just you not wanting to learn regular expressions. Not co-incidently, backreferences are the reason pcre is so horribly slow. Of course, you didn't use backreferences, and they are very rarely needed. If you really wanted a good grep that has backreferences, you could install one. There are grep implimentations that support backreferences, but still impliment any regexes without backreferences in the smart, effecient way. But hey, just use the lowest common denominator shitty software written by morons who still haven't figured out how to copy 40 year old algorithms, that sounds smart.

And while it may be nice to try to dismiss anyone who disagrees with you as a fanboy, it makes you look pretty dumb when you ignore what I've said while you do it. If I thought unix's flaws were sacrosanct, I wouldn't suggest that the modern replacement for unix which fixes many of those flaws is superior would I? I guess the cynics are right, you're not really a troll, you actually are that dumb.

Just because *you* think 99% of the time mv should behave like you are an idiot doesn't mean everyone thinks that. 99% of the time *I* want mv to behave just as it does, and the times I don't I use mv -i. I've never seen anyone think otherwise except for linux noobs like yourself. You know, the ones who pretend they know unix, and that unix is so great, but they have't figured out the basics of how to use their shell yet.

If want a horrible, bloated shitware OS that pretends to be unix, but ignores every single fundamental aspect of unix, then you should really be looking forward to the hurd so your kernel can suck as badly as your userland. Hopefully it will impliment nagging directly in the filesystem so you have an extra layer of stupid confirmations to protect you from yourself.