The best kittens, technology, and video games blog in the world.

Thursday, June 11, 2015

Ruby 3 should merge Strings and Symbols

my_pic211 by takenzen from flickr (CC-NC-ND)

It seems that every programming language has idiosyncratic ways to divide what everybody else considers one data type into a bunch of variants. Python has lists and tuples, Java splits everything into primitive/object versions, C++ has like a billion versions of what are essentially String and Array, every database ever made has at least twenty timestamp types, and even Ruby does that with String and Symbol.

The excuse is usually some low-level optimization or finer points of semantics every other language somehow manages without, but let's be honest - pretty much none of these distinctions are worth the added complexity, and somehow other languages never "see the light" and rush to copy such subtypes.

In Ruby 1.x strings were just String objects. Sure, it exposed this Symbol kinda-string thing, because Ruby generally exposes a ton of stuff it doesn't really need to, but it was just internal interpreter stuff only people who wrote C extensions needed to know about. Even the metaprogramming API used String objects only. #methods and friends returned lists of Strings, #define_method could take either a String or a Symbol, and most people just passed regular String objects. It was good.

Then suddenly people noticed Ruby has two string types, and started using both, without much logic to it. Ruby 1.9 changed some low level APIs from returning String to Symbol, which was reasonable enough, but you can still pass either just fine.
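Here's a quick sketch of what that looks like on a modern Ruby. The Kitten class and its methods are made up for illustration: definition and dispatch accept either string type, while reflection hands back Symbols.

```ruby
# Hypothetical class, just for illustration.
class Kitten
  define_method("purr") { "purr" }   # String name works
  define_method(:meow)  { "meow" }   # Symbol name works too
end

# Reflection APIs return Symbols since Ruby 1.9.
p Kitten.instance_methods(false).sort == [:meow, :purr]   # prints true

# Dispatch APIs still accept either type.
p Kitten.new.respond_to?("purr")   # prints true
p Kitten.new.send("meow")          # prints "meow"
p Kitten.new.send(:purr)           # prints "purr"
```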

Unfortunately it only got worse as people decided that Symbol is totally awesome to use for internal identifiers in their programs, not just for interfacing with the Ruby interpreter - and Ruby enabled this behaviour by adding JSON-style hash syntax (later also used for keyword arguments) - even though in JSON foo: "bar" really just means "foo" => "bar", not :foo => "bar".
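To make the mismatch concrete, here's a minimal example. The constant names are mine, picked only for this illustration.

```ruby
# The JSON-looking literal builds a Symbol key, not a String key.
SHORTHAND   = { foo: "bar" }
WITH_SYMBOL = { :foo => "bar" }
WITH_STRING = { "foo" => "bar" }

p SHORTHAND == WITH_SYMBOL         # prints true
p SHORTHAND == WITH_STRING         # prints false
p SHORTHAND.keys.first.class       # prints Symbol
p SHORTHAND["foo"]                 # prints nil, the String key is not there
```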

That of course instantly led to memory leaks, DoS attacks, and so on, as the Symbol type was never meant for anything like that. It ended up with Ruby being hacked to GC Symbols - a data type originally meant just for low-level stuff that lived forever, not for any kind of user data.

Look at any significant Ruby codebase today and you'll see to_s and to_sym scattered everywhere, nasty nonsense like HashWithIndifferentAccess and friends, and endless tests that miserably fail to match live behaviour, because saving objects to a database or JSON or whatever and loading them back (something tests often don't bother to do) transparently converts Symbols to Strings, and the two are equivalent in some contexts but not others.
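A small sketch of that round-trip problem, using the standard json library. The hash contents are invented for the example.

```ruby
require "json"

# What the test typically builds in memory: Symbol keys and a Symbol value.
ORIGINAL = { status: :active, name: "kitten" }

# What live code sees after a save and load: Strings all the way down.
RELOADED = JSON.parse(JSON.generate(ORIGINAL))

p RELOADED == { "status" => "active", "name" => "kitten" }   # prints true
p ORIGINAL == RELOADED                                        # prints false
p RELOADED[:status]                                           # prints nil
p RELOADED["status"] == ORIGINAL[:status]                     # prints false, "active" vs :active
```

A test that only ever touches ORIGINAL will happily pass while the code path that reads RELOADED gets nil.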

Let's just stop this. As long as the easy foo: "bar" syntax exists, people will keep using Symbols where Strings would be more appropriate, and I'm aware you can only pry this syntax from people's cold dead hands, so this can't realistically be changed by better practices.

The only way this can be fixed is if Ruby just unifies both data types. Just like Ruby 1.9 made ?x mean "x", which caused some temporary issues but was a great simplification, Ruby 3 should just make :foo mean "foo".freeze.

This seemingly radical unification would be surprisingly backwards compatible - String#to_sym can just return a frozen copy (or itself if already frozen), and String#to_s can mean "#{self}" in all contexts. It will probably take a few performance hacks here and there to cache some kind of unique key or hash on a frozen String so comparing two frozen Strings is fast, but it's a conceptually simple problem, and such strings tend to be very small anyway, so it's no big deal if the unique key sometimes misses, forcing a fallback to a slower comparison routine.
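Here's a rough sketch of those semantics in today's Ruby. The helpers unified_to_sym and unified_to_s are hypothetical stand-ins I made up for this post, not real or proposed APIs, and they ignore the performance side entirely.

```ruby
# Hypothetical stand-in for the proposed to_sym: a frozen copy,
# or the receiver itself when it is already frozen.
def unified_to_sym(str)
  str.frozen? ? str : str.dup.freeze
end

# Hypothetical stand-in for the proposed to_s: always a fresh, unfrozen String.
def unified_to_s(str)
  "#{str}"
end

mutable = String.new("foo")            # explicitly unfrozen String
key     = unified_to_sym(mutable)

p key.frozen?                          # prints true
p key.equal?(mutable)                  # prints false, it is a copy
p unified_to_sym(key).equal?(key)      # prints true, already frozen so returned as-is
p key == "foo"                         # prints true, no more String/Symbol mismatch
p({ key => 1 }["foo"])                 # prints 1, hash lookup works by content
p unified_to_s(key).frozen?            # prints false
```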

Only the code that relied on distinction between Strings and Symbols for things like sentinel values would be affected, but so many libraries magically convert from one string type to the other or back that I'd not recommend doing that anyway. Just bite the bullet and use sentinel objects. In any case that's like 1% of uses of Symbols.
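For the record, a sentinel object is about three lines of code. This is just a minimal sketch: NOT_GIVEN, fetch_setting, and SETTINGS are names I invented for the example.

```ruby
# Hypothetical sentinel: a unique object that no String or Symbol can ever equal.
NOT_GIVEN = Object.new.freeze

# Hypothetical helper that tells "no default passed" apart from an explicit nil.
def fetch_setting(settings, key, default = NOT_GIVEN)
  return settings[key] if settings.key?(key)
  raise KeyError, "missing setting: #{key}" if default.equal?(NOT_GIVEN)
  default
end

SETTINGS = { "colour" => "tabby" }

p fetch_setting(SETTINGS, "colour")       # prints "tabby"
p fetch_setting(SETTINGS, "size", nil)    # prints nil, explicit default wins
```

Calling fetch_setting(SETTINGS, "size") with no default raises KeyError, and that behaviour survives any amount of String/Symbol conversion because identity is checked with equal?.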

I know status quo bias will make many people dislike this idea, but let's reverse this - if it was already working like I describe, would any sane person suggest introducing Symbol type? Of course not.

Oh and while we're on the subject of Ruby 3 wishlist, it could really get some kind of #deep_frozen_copy and #deep_unfrozen_copy, with whichever names people end up using. This however is a subject for another post.


Anonymous said...

puts 'string hello'
10.times do puts 'hello'.object_id end
puts 'symbol hello'
10.times do puts :hello.object_id end
puts 'done'

funny_falcon said...

They tried to merge them when planning 1.9. But it broke too many libraries.
Do you want Ruby 3 to have the same adoption as Python 3?

Well, Ruby could be improved/simplified in many ways. But it will be another language that should build its community and ecosystem from ground.

Look at the Dart: Google made it several years ago, and it is really convenient language with cool standard library. But still there is no adoption.

Anonymous said...

Pure idiocy.

So much code would break and after fixing it, it would run slower