The best kittens, technology, and video games blog in the world.

Monday, October 20, 2008

Put your variables on a diet

Svanspervot on a Chair by Steffe from flickr (CC-NC-SA)

There are all kinds of categorization schemes for programming languages, by paradigms or by checklists of supported features. Categorization criteria tend to be highly subjective (e.g. "builtin support for regular expressions"), useless (e.g. "significant indentation"), or both (e.g. "does it have a standard").

I want to propose a new categorization - objective, easy to evaluate, and at the same time exposing something very deep about programming languages.

I will divide languages into:

  • thin variable languages - where variables refer to data.
  • fat variable languages - where variables contain data. Variables can also contain references to data, but there's a distinction between direct and indirect access.
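
The distinction can be sketched in Python, which is itself a thin variable language. The fat behavior is only emulated here with an explicit `copy.copy`, as an assumption about what assignment would do if variables contained their data; the names `thin_alias` and `fat_copy` are made up for illustration.

```python
import copy

original = [1, 2, 3]

# Thin variable: assignment makes a second name refer to the same data.
thin_alias = original
thin_alias.append(4)

# Emulated fat variable: assignment copies the data into a new container.
# (Python never does this implicitly; copy.copy stands in for it here.)
fat_copy = copy.copy(original)
fat_copy.append(5)

print(original)                # [1, 2, 3, 4]
print(thin_alias is original)  # True
print(fat_copy)                # [1, 2, 3, 4, 5]
```

The append through `thin_alias` is visible through `original`, while the append to `fat_copy` is not.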

This division is very old. Assembly language is obviously a fat variable language, even though its variable system is very simple - registers and memory locations contain stuff directly, or contain references (memory addresses) to stuff. As languages need to be compiled to assembly, plenty of high performance languages follow this road. Fortran and C variables are just assembly variables plus types. C++ didn't break with this tradition; it made container variables much fatter and much more complicated - RAII, copy constructors, assignment operators, and all the related mess. Java, in spite of its superficial similarity to C++, is definitely a thin variable language.

Thin variable languages are also very old. The original Lisp was the first language with thin variables, and all Lisp dialects, just like all ML and Haskell dialects, are thin variable languages. I don't think a single seriously functional language uses fat variables.

All object-oriented languages - the most popular of them being Smalltalk, Ruby, JavaScript, and (let's be charitable) Java - use thin variables too. There are some fat languages like C++ and Perl with objects bolted on top of them, but they are definitely not object-oriented; they just support some limited object functionality.

Scripting languages are interesting. Old languages like Unix shell have very fat variables, even though all variables are simple strings. Perl and PHP continue this tradition, but Python and Ruby are firmly in the thin variable camp.


Having categorized all popular languages, let's make some observations.

  • All (pure and impure, strict and lazy) functional languages are thin variable.
  • All honestly object-oriented languages are thin variable.
  • Thin variable languages and garbage collected languages are very closely related categories. There are some reference counted languages in both camps (Perl and PHP on the fat side, Python on the thin side), but there seems to be no thin language with manual memory allocation and no fat language with full GC.
  • All segfaulting languages (assembly, C, C++) use fat variables, but many fat variable languages are non-segfaulting (Fortran, bash, Perl, PHP).
  • Dynamic typing is on both sides (Perl/PHP vs Lisp/Ruby).
  • Explicit static typing is also on both sides (C/C++ vs Java).
  • Implicit static typing is only on the thin side (ML/Haskell) and is actually quite popular there.
  • Almost all thin languages have closures. A few languages like Python and Java have less than full closures, in the form of named inner functions or anonymous inner classes. In both cases it's a syntactic, not semantic, limitation.
  • Almost no fat language has closures. A big exception is Perl, which has full closures.
  • Lexical and global scope exists on both sides, in almost every language.
  • Dynamic scope is unusual, but it is supported on both the fat side (Perl) and the thin side (some Lisp dialects, including the original Lisp, Emacs Lisp, and Common Lisp, but not Scheme).
  • Rich literal notation is supported (Perl vs Python/Ruby/Lisp/ML/etc.) and not supported (C/C++ vs Java) on both sides.
  • Macros exist only on the thin side (Lisps, Dylan, Nemerle). There doesn't seem to be any obvious reason for it.
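
The closure point can be sketched in Python, where a named inner function captures the enclosing variable itself rather than a copy of its value. This is a minimal sketch assuming Python 3, whose `nonlocal` statement is needed to rebind a captured variable; `make_counter` is a hypothetical name.

```python
def make_counter():
    count = 0  # stays alive after make_counter returns, because increment refers to it

    def increment():
        nonlocal count  # rebind the captured variable, not a new local
        count += 1
        return count

    return increment


counter_a = make_counter()
counter_b = make_counter()

print(counter_a())  # 1
print(counter_a())  # 2
print(counter_b())  # 1, each call to make_counter creates a fresh variable
```

The inner function has to be named only because Python's `lambda` is limited to a single expression, which is a restriction of syntax rather than of what the function can capture.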

I could go on. It actually surprises me how many semantic differences follow the thin vs fat divide, with Perl and Java being the biggest outliers (and also their derivatives like PHP and C#). These outliers are very interesting. Java's lack of power is definitely syntactic, not semantic, and there are plenty of JVM languages which are little more than fully compatible alternative syntaxes for Java with more expressive power. Nothing like that ever happened to popular fat variable languages like C/C++, which fail for semantic, not syntactic, reasons.

The biggest outlier on the fat side is Perl. While Perl was able to get an almost 100% score on the supported features checklist, it seems to be an evolutionary dead end. Every new thin variable language steals ideas from Perl, but the Perl 6 effort was never able to transform the language, and Perl programmers have been leaving for Ruby and Python for years now.

If you're writing a new language today, and every programmer should do that, just forget about fat variables. They have one big advantage of allowing explicit memory management, which can still result in more memory efficient programs, but that's about it. The expressiveness of fat variable languages has been pushed to the limits by Perl, and it seems it cannot be pushed any further. The thin side is already far ahead, with Ruby, Scheme, Haskell and all the small research languages you've never heard of.


David R. MacIver said...

I think D and BitC are both examples of fat variable languages with type inference.

David R. MacIver said...

For that matter, given that it supports structs C# probably counts too.

Anonymous said...

You said all object-oriented languages are thin variable languages, but Objective-C is clearly a fat variable language (seeing as how it's a superset of C). Of course, the object-oriented part is entirely thin - every object reference is a reference. But the entire language is fat variable.

taw said...

Objective C is one language embedded in another. "Inner Objective C" is thin variable and object-oriented, "Outer Objective C" is fat variable and not really object-oriented.

Java has primitive variable types, but as they're all immutable they behave thinly.
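
The reasoning is that sharing and copying only differ once somebody mutates the data. A small Python sketch of that argument, using a tuple as a stand-in for an immutable primitive (the variable names are illustrative only):

```python
point = (1, 2)                     # tuples are immutable
same_point = point                 # shares the data, as a thin variable would
copied_point = tuple(list(point))  # forces a separate copy, as a fat variable would hold

# No operation can change the tuple in place, so sharing and copying
# cannot be told apart by looking at the values.
print(same_point == copied_point)  # True

# "Modifying" a name builds a new tuple and rebinds that name only.
same_point += (3,)
print(point)       # (1, 2)
print(same_point)  # (1, 2, 3)
```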

C# seems to do the same, except it also extends the primitive variable types to user-defined mutable structs. It seems to me that if you never change their fields (except in constructor or when you build one manually) you can pretend they're thin too. I'm pretty sure it will badly bite you if you try modifying them. It seems like a very interesting case.

BitC has Scheme syntax, but indeed it has fat variables and type inference. I wonder how well it works in practice with functional idioms or macros.

Iwan said...

The JVM itself does not support true closures since there's no way to preserve stack frames when they go out of scope.

An inner class is not a closure. Javac compiles it to a regular class which takes the values it needs to access as arguments to the constructor, which copies them to final member variables. It does not actually access the outer (closed) scope.
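
A rough Python analogy of that difference, not actual javac output: the hypothetical `CapturedValue` class copies the value in through its constructor, while the inner function reads the enclosing variable itself. Java hides the difference by requiring captured local variables to be final.

```python
class CapturedValue:
    """Stands in for a compiled inner class: the value is copied in at construction."""

    def __init__(self, x):
        self._x = x  # plays the role of the final member variable

    def get(self):
        return self._x


def make_snapshot():
    x = 1
    snapshot = CapturedValue(x)  # copies the current value of x
    x = 2                        # invisible to the snapshot
    return snapshot


def make_closure():
    x = 1

    def get():
        return x  # reads the enclosing variable itself

    x = 2         # visible to the closure
    return get


print(make_snapshot().get())  # 1
print(make_closure()())       # 2
```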

AFAIK, all JVM languages that do support closures do so by implementing a synthetic stack which is used instead of the JVM call stack. This imposes a performance hit compared to regular Java.

Also, C has macros, so this isn't limited to 'thin' languages.

taw said...

Iwan: Obviously you cannot have full closures on the stack. But the stack is just an optimization hack, so it doesn't bother me at all.

As for C "macros", if you think C preprocessor has anything to do with genuine macros, then you seriously need to spend some time playing with real macros.