The best kittens, technology, and video games blog in the world.

Friday, December 10, 2010

My Christmas wishlist - redcar and desktop Linux

And we are back by fofurasfelinas from flickr (CC-NC-ND)

Who would have thought my blog would rank so high on OSX vs Linux. Well, this subject never gets old, so here's another instalment.

For personal reasons which Facebook aptly calls "It's complicated" I haven't had much time to update this blog recently.

I have so many things I'd like to write about - but the marginal utility of time keeps getting in the way. With a lot of my previously free time no longer available, I have to decide which of the too many things I'd like to do I should drop or postpone, hopefully starting from the least important of them.

And just like nearly everybody else, I keep confusing important with urgent. This blog is never really urgent - no cat will die if I leave writing for another day or week - but it is surely not the least important of all the things I could be doing, or have been doing in the last month or so. So quadrant two, here I come!

Macs suck

The GPU in my MacBook Pro failed, so I'm temporarily back on Linux, on a desktop box which I was using mostly for gaming.

I will probably stay on Linux for quite some time, even after I get my laptop back. And they're really slow with repairs - I will forgive them only if they keep my hard drive undisturbed like they promised. My backups were not terribly up to date (about a month old) and, what's worse, didn't include everything (just imagine - I don't have a kitten picture downloader compatible with the most recent flickr API and I have to download kittens manually!). These days I backup everything to two independent external USB disks just to be sure, so I was prepared for this kind of failure, but I don't backup as often as I should - mostly because dual backup is such a pain.

Anyway, all the cost of switching from one system to another must be paid up front - all the time necessary to set up and upgrade the new system, to change my habits and so on. Once the cost is paid (and with this post it mostly will be), I may as well continue using it - so I might very well wipe out OSX and install Ubuntu on the laptop once I get it back - I haven't decided yet.

Well, first here's a list of the most important ways Linux (Kubuntu, an old install upgraded to 10.04) is way better than OSX:
  • Polish Dvorak keyboard, thanks to divide. I kept meaning to write one for OSX, but never quite did it. You can write keyboard layouts for OSX as XMLs these days, but sadly there seems to be no way to get XMLs of existing layouts as a starting point.
  • apt-get. Seriously, MacPorts is such a pile of fail compared to apt-get, it's just ridiculous. Imagine that - with apt-get you can upgrade the system without breaking anything. It just works! I know, it's crazy!
  • I mostly develop for EC2, so quite a few things are easier on Linux than on OSX.
  • konsole and virtual desktops completely destroy their OSX counterparts and Spaces.
Sadly from this point it only gets worse.

Rum Tum Tugger by Trish Hamme from flickr (CC-NC-ND)

Desktop Linux sucks

When you keep using the same system for a long time, you stop noticing some problems - you just get used to them, just like the Congolese got used to malaria.

And this is actually a big fucking deal. People "in charge" of desktop Linux - be it GNOME, KDE, or whatever - are mostly old-timers. They got so used to all the fail present in desktop Linux that it doesn't seem like a big deal to them.

The last non-Linux system they used regularly is probably Windows XP or something even older - they have only a very faint idea what progress the closed source world has made with OSX and even Windows 7. Did I ever mention I sort of like Windows 7? Not for development of course, but you just don't see fail jumping at you all the time like you used to on XP and Vista.

Anyway, the first impression I had once I switched from OSX to KDE was:
Are they fucking trying to reimplement Windows XP here?

I am deeply disturbed by what I've seen. KDE not only lacks all the awesome new stuff you get in OSX, it also lacks all the awesome old Unix stuff you get in OSX.

Let's start with the basics - a text editor. Control-A and Control-E don't work. A shortcut as convenient as Control-A instead does something as incredibly useless as selecting the entire document. Control-E seems to do nothing. Just like Control-D. Control-K deletes the entire line, which is broken, but at least only halfway so.

Fine, whatever, I thought - Windows noobs also use KDE, and if they want such broken shortcuts it's their problem - I'll just go to keyboard settings and switch to Emacs mode.

It's such a shame I haven't been recording myself, as the very moment I realized they removed Emacs mode I produced such an awesome stream of Polish and English swear-words that you could write a research paper on bilingual cursing based on it.

And amazingly, even mcedit lacks Emacs keybindings these days. How is it possible that the supposedly most noob-friendly operating system, OSX, has standard Emacs keybindings, but Linux doesn't even support them as an option any more? It doesn't even seem possible to set them up shortcut by shortcut.

At least they still work in konsole+bash, and in gtk-based programs if you add gtk-key-theme-name = "Emacs" to your ~/.gtkrc-2.0.

Unfortunately even that much barely works. In some contexts Control-F means forward, in some it means search. Control-W changes meaning between close tab and delete previous word - it's an unreliable mess.

The way OSX does it is simple - one modifier key (Control) is used for text editing, another (Command) for application control. I understand Linux cannot really follow this exact scheme, but would it really be that difficult to have a single global mapping between key presses and actions?

When you think about it, most programs use the same shortcuts:
  • text editing (control-A, control-E, ...)
  • copy&paste (command-C, command-V, command-X)
  • undo stack control (command-Z, command-Y)
  • document/tab control - save (command-S), load (command-L), close document/tab/window (command-W), new (command-N) etc.
  • switching between tabs - command-1/2/3/..., control-pageup/pagedown etc.
  • application control - let's not get into application/window/sdi/mdi mess, there's time for everything and it isn't now - mostly quit (command-Q)
  • window control - maximize, minimize etc.
  • system control - like switching between windows/applications, making screenshots, locking screen, some dashboard, start menu, volume up/down/mute - the list is pretty long, but they're global by definition
There are very few applications that really need more than a few shortcuts beyond this list. Some of them are important applications (Office, Photoshop, Emacs), but you'll mostly be using these anyway.
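The single-global-mapping idea really is tiny. A hypothetical sketch in Ruby - all the names and chords below are made up for illustration, this is not any existing GNOME/KDE API:

```ruby
# Hypothetical system-wide keymap - every application would consult this
# table instead of hardcoding its own bindings. All names and chords
# here are made up for illustration; no desktop environment ships this.
GLOBAL_KEYMAP = {
  # text editing
  "move-to-line-start" => "Ctrl+A",
  "move-to-line-end"   => "Ctrl+E",
  # copy&paste
  "copy"  => "Meta+C",
  "paste" => "Meta+V",
  "cut"   => "Meta+X",
  # document/tab control
  "save"      => "Meta+S",
  "close-tab" => "Meta+W",
  # application control
  "quit" => "Meta+Q",
}

# An application resolves an action through the global table, overriding
# only the few bindings it genuinely needs (Emacs, Photoshop, ...).
def chord_for(action, local_overrides = {})
  local_overrides[action] || GLOBAL_KEYMAP[action]
end
```

So chord_for("copy") is the same everywhere, and an Emacs-like application passes its own overrides for a handful of actions without breaking the rest of the desktop.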

I would be surprised if GNOME/KDE lacked infrastructure for this (what's gtk-key-theme if not half of it?) - it's just that nobody bothered doing this consistently. Or if they did, their efforts were blocked by some old-timer whose last recollection of life outside desktop Linux is Windows 3.11.
My Christmas wish: Good keybindings for Linux that won't make me miss OSX, or Linux from a decade ago.

It cannot be that hard, as Linux used to have those before losing them very recently.
Kočka z Mincovní by abejorro34 from flickr (CC-NC)


Let's continue the story. Kate and gedit were various levels of unusable due to keybindings, as was mcedit in Emacs mode, which I mostly use over ssh instead of real Emacs. I could try actual Emacs, but after a few years of TextMate it just isn't the same. Not even close.

Fortunately the totally awesome Daniel Lucraft wrote redcar, a TextMate-like editor in JRuby. Every time I check it, it gets more and more awesome, and it's safe to say it's already far better than any other editor you can get on Linux.

Of course I wouldn't be myself if I didn't start bitching. So here's my quick list of easy improvements to redcar.

Any of these will make a great Christmas gift for me:
  • The context menu absolutely needs a TextMate-like "Filter through command" feature with all its options (mostly "input selection / output selection", "input selection / output create new document", and "input document / output create new document", but others are occasionally useful as well). This is easily one of the top five most useful features of TextMate ever, and it's really not that hard.
  • Getting a lot of common bundles/plugins/snippets/whatever-else-it-is-using with a single command. Batteries included you know.
  • Tab size control really needs to be visible by default. I'm sure this is one of the first things requested by half of redcar's users, so why not just do it?
  • Support for Control-A and Control-E, seriously. Patch on github.
  • Syntax highlighting often breaks. Making it unbreakable would require massive architectural changes, but can we at least get an easy refresh shortcut, or a refresh on switching tabs (if edited since the last switch)? Most documents are not that huge, so it shouldn't be a major performance problem, right?
  • Does it really need to create .redcar directories everywhere? It creates ~/.redcar/ already, won't that be enough?
  • Unixy interface like mate command. Right now starting redcar ~/some/dir from console makes it stay there and keep producing java.lang.IllegalArgumentException stack traces. It's definitely a good thing for developing redcar, but would it be possible to get some command line switches to make redcar fork away like mate command does?
  • TextMate-compatible Control-K - it should delete everything between cursor and end of line (right), or delete linebreak if cursor is already at end of line. Right now in redcar pressing Control-K multiple times doesn't do anything useful. Patch on github.
  • Can we get some sort of 2D selections? Even Firefox has them for tables with control-mouse. I understand they might be too hard architecturally, but it would be pretty awesome. Ctrl+B does exactly that already.
  • Control-T (search file) should ignore spaces like in TextMate. I know this behaviour is irregular, but it just makes so much more sense to me to search for "login ctr test" like in TextMate than for the "loginctrtest" redcar wants. By the way, history for Control-T is pretty awesome. Patch on github.
  • Search and replace within selection.
  • Not in TextMate but would be awesome: support for filtering by full path with Control-T, like login/index.
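For the curious, the core of "Filter through command" really is just a few lines. A sketch of the "input selection / output selection" variant (not redcar's actual code) using IO.popen:

```ruby
# The heart of a TextMate-style "Filter through command": pipe text
# through a shell command and return whatever it prints. A sketch of the
# "input selection / output selection" variant - not redcar's actual code.
def filter_through_command(command, input)
  IO.popen(command, "r+") do |io|
    io.write(input)
    io.close_write   # send EOF so the command can finish
    io.read          # the block's value becomes popen's return value
  end
end
```

The editor would call this with the current selection and splice the result back in - e.g. filter_through_command("sort", selection) to sort the selected lines.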
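And the space-ignoring Control-T matching is similarly tiny - roughly, turn the query into an in-order pattern. A sketch, not redcar's implementation:

```ruby
# Space-insensitive Control-T matching, TextMate style: "login ctr test"
# matches any path containing those fragments in order. A sketch, not
# redcar's implementation.
def fuzzy_match?(query, path)
  parts = query.strip.split(/\s+/).map { |part| Regexp.escape(part) }
  !!(path =~ Regexp.new(parts.join(".*")))
end
```

So "login ctr test" finds login_ctr_test.rb without having to type it as one glued-together word.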
That was a lot of bitching - but then isn't this exactly what this blog is made for?

Yes, I'm aware of the official redcar wishlist and I might convert some of this bitching into requests and maybe even - imagine that - patches (I've been doing quite a lot of JRuby lately, so I'm no longer scared of the JVM). Just not now. Marginal utility of time, "it's complicated", delegate if possible and so on... You know, the usual excuses for not contributing.

EDIT: I fixed a few easy ones at github.

I'm not done yet

Silly me, I thought I would be able to include my Ruby standard library wishlist in this post - it will have to wait. There are two weeks left, so I might find an evening for that as well - and it will be a ridiculously long list - and judging from reactions to my preliminary post about Ruby stdlib, it will make everybody hate me even more than they hate me now.

So just a few minor additions to my Christmas wishlist, just in case robot Santa or the spirit of Kwanzaa or your personification of gift giving of choice happens to be deluded enough to think I was nice enough that my wishlist so far is somehow not yet long enough:
  • Is there anything like OmniFocus for Linux?
  • The sound system on Linux is still broken in 2010. Skype mysteriously cannot find my microphone, even after I spent far too much time trying various alsamixer and Skype options. Rebooting to Windows on the same hardware - zero problems, works out of the box. Can we please get this fixed?
  • System settings are just dreadful. Why can't we have everything in one place, instead of some options in General, some in Advanced, some inside various programs, and some who knows where? This has always been a big weakness of Linux, but the total lack of progress is disturbing. Even something as simple as telling Linux to automatically mount an external disk is nontrivial - in dolphin there's nothing like it in the context menu either on the disk icon in the sidebar, or on the disk mountpoint directory in /media. Searching "usb" or "external" in system settings also returned nothing. It turns out I need to switch to the "Advanced" tab and search for "removable". I seem to recall this used to work far better than that, but maybe that was GNOME. Is anybody working to improve this situation?
By the way, is there any major reason for using KDE versus GNOME these days? It used to be that GNOME had much worse Windows-itis and hypernoobemia than KDE, but these days I'm getting the impression that the noobness rankings changed from Mac > Windows > GNOME > KDE to Mac > Windows > KDE = GNOME > Mac. Yes - somehow Macs dominate both the total noob and the expert user ends of the spectrum.

Or actually - the person who does the most to fulfil my wishlist gets to decide KDE vs GNOME for me.

    Tuesday, October 12, 2010

    Blogging about Robin Hanson is not about blogging about Robin Hanson

    Fancy, snug in my bed by Hairlover from flickr (CC-BY)

    This post is a semi-confession, and the difference between "I find X" statements and "X" statements is significant.

    When I don't like something, I can of course come up with some reasonable-sounding arguments specific to the particular subject, but usually "X is not really about X, it's just a status seeking game gone awry" has no trouble getting onto such a list. Sometimes high, sometimes not, but it's rarely missing.

    When I like something, I tend to believe "X is not about X" type of arguments don't really apply - maybe there's some status seeking involved, but it surely cannot be that significant.

    So here's my main point - "X is not about X" is not about X not being about X. It is a cheap and hard to refute shot at X to lower its status - essentially the "X is not about X" argument is really about raising the speaker's status relative to X.

    Science is not about science

    For example I find it intuitively obvious that academic science is mostly a silly status game, with just a tiny fraction of effort directed towards seeking important truths about reality.

    How else can you explain:
    • persistent disregard for statistics, reliance on cargo cult statistics like p-values and PCA
    • vast majority of papers not getting even a single replication attempt
    • most research being about irrelevant minutiae
    • nobody bothered by research results being not publicly available online
    • review by "peers", not by outsiders
    • obsession about citations
    • obsession regarding where something gets published
    • routinely shelving results you don't like
    This is exactly what we'd expect if it was just one big status seeking game! Truth seeking would involve:
    • everything being public from even before experiments start
    • all results being published online immediately, most likely on researcher's blog
    • results being routinely reviewed by outsiders with most clue about statistics and methodology at least as much as by insiders with domain knowledge
    • journals wouldn't exist at all
    • citations wouldn't matter much, hyperlinks work just as well - and nobody would bother with silliness like adding classics from 1980s to bibliography like it's still commonly done in many fields
    • most research focusing on the most important issues
    • vast majority of effort being put towards reproducing others' research - if it was worth studying, it's worth verifying; and if it's not worth verifying, why did anyone bother with the original research in the first place?
    • serious statistical training as part of curriculum in place of current sham in most disciplines
    It's a miracle that anything ever gets discovered at all! It would be a very cheap shot, but it's really easy to finish this line of reasoning by attributing any genuine discovery to an attempt at acquiring funding from outsiders so the status game can continue. And isn't there a huge correlation between commercial funding of science and the rate of discovery?

    The Not-So-Fat One - playing with water by jeff-o-matic from flickr (CC-NC-ND)

    Medicine is definitely about health

    And here's something I like - modern medicine. It's just obvious to me it's about health and I find claims to the contrary ridiculous.

    Yes, plenty of medical interventions have only weak evidence behind them, but this is true of nearly everything in the universe. Hard evidence is an exception (also see the previous section), and among the few things that have hard evidence, medical interventions are unusually well represented.

    And yes, a few studies show that medical spending might not be very beneficial on the margin - mostly in short term, mostly in United States, mostly on small samples, always without replication, and I could go on like this.

    Anyway, modern medicine has enough defenders without me getting involved, so I'll just stop here. To me it just looks like what you'd get if it was about health, and it doesn't look like what you'd get if it was about status seeking, even if a tiny bit of it gets into it somehow.

    But notice how the argument goes. Conclusions first, and you can always fit or refute status seeking in arguments later. This bothered me until I saw an even bigger point - it all works just as well when you replace "seeking status" with "making money" or any other motivator!

    Compare these semi-silly examples:
    • Writing books is not about content, it's about money
    • Writing blog posts is not about content, it's about status
    • Food industry is not really about food, it's about money
    • Organic food is not really about food, it's about status
    • Commercial software is not really about software, it's about money
    • Open Source is not really about software, it's about status
    This is just universal, and an argument like that can be applied to nearly everything. But in all these cases money and status are only mechanisms of compensation - people optimize for money or status because that's what people are good at, but the only reason there's a connection between money/status and a given action or product is the underlying connection with a desirable outcome.

    To pull an Inception - "X is about status, not X" is about status, not about X being about status, not X.

    PS. Here's a paper claiming that 2/3 of highly cited papers had serious problems getting published at all because of peer review. How many breakthroughs didn't get published? How many got published and then ignored? How many people didn't even bother, limiting themselves to standard useless minutiae, or staying away from academic science entirely?

    PPS. Here are a few more relevant links:
    Notice how except for medicine nobody seems bothered by this.

    Saturday, September 25, 2010

    Neolithic Counter-Revolution in Diet

    Cat grass om nom nom by chris.jervis from flickr (CC-NC)

    All people have silly things to feel proud of, like their country or their soccer team - one unusual thing I feel very proud of is the Neolithic Revolution. We fucking made it! We're the only beings in the whole damn universe who broke free from the bondage of evolution, and that was the main act.

    Like all irrationally proud people, I get easily irritated by all the misguided criticism the Neolithic Revolution has been getting - by the amount of pure hatred it gets you'd think it was run by Justin Bieber and involved kitten sacrifice. Not that there would be any kittens without the Neolithic Revolution - it's a true fact!

    The most common thread of criticism is that somehow, with literally over 9000 years of lag, Neolithic food is making us unhealthy and overweight, and if only we could abandon civilization and go back to what was eaten in the Paleolithic, we'd all be happy monkeys once more.

    This is plainly ridiculous. There's not a single group anywhere in the world even remotely living an Upper Paleolithic life any more - even those that get most of their food by hunting and gathering have a long history of contact, trade, and interbreeding with agriculturalists and pastoralists, and every time you look more closely you'll see they don't really shy away from a bit of farming here and there themselves.

    In other words - our knowledge of Paleolithic life and diet is about as good as our knowledge of mating habits of Hogwarts students - a lot of real fun speculation, and very little hard data.

    And no matter where you look - the vast majority of people have been unhealthy, with huge infant mortality and all kinds of other severe health problems. True, cardiovascular problems and obesity are a fairly recent thing, but it is just ridiculous to focus exclusively on them and blame modern diet and lifestyle, while totally disregarding the sheer count of much more debilitating diseases it saves us from.

    It's a lot like complaining about Internet leading to more privacy violations, while ignoring how helpful it is in fixing a far more severe problem of not having fucking Internet access in the first place. Whiners gonna whine.
    Isis is an ouroboros om nom nom cat macro by benchilada from flickr (CC-NC-SA)
    Isis as an Ouroboros

    Diet by era

    Getting back on track, I took FAO data, and classified all food consumed into three big groups:
    • Paleolithic - vegetables, fruit (excluding wine), treenuts, meat, eggs, fish, seafood, other aquatic products, offal
    • Neolithic - cereals, pulses, alcoholic beverages, milk, butter, animal fats, starchy roots, spices, stimulants, miscellaneous
    • Industrial - sugar, sweeteners, sugarcrops, vegetable oils, oilcrops
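    Mechanically, this classification is just bucketed summing and normalizing. A Ruby sketch with an abbreviated category table - the numbers in the usage example are made up, the real input is the FAO per-country calorie data:

```ruby
# Sum calories per group and normalize to percentage shares. The
# category->group assignment follows the lists above (abbreviated here);
# example numbers are made up, the real input is FAO calorie data.
GROUP = {
  "vegetables" => :paleolithic, "fruit" => :paleolithic, "meat" => :paleolithic,
  "cereals"    => :neolithic,   "milk"  => :neolithic,   "pulses" => :neolithic,
  "sugar"      => :industrial,  "vegetable oils" => :industrial,
}

def diet_shares(calories_by_category)
  totals = Hash.new(0.0)
  calories_by_category.each do |category, kcal|
    group = GROUP[category] or next  # ignore categories left unclassified
    totals[group] += kcal
  end
  sum = totals.values.inject(:+)
  totals.each_key { |g| totals[g] = (100.0 * totals[g] / sum).round(2) }
  totals
end
```

    For example, diet_shares("meat" => 300, "cereals" => 600, "sugar" => 100) comes out as 30%:60%:10% Paleolithic:Neolithic:Industrial.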

    For nitpickers who want to pick nits, here are exact definitions.

    There are a few borderline cases I'll explain before proceeding:
    • Sugars and sweeteners (HFCS) are undoubtedly an extremely recent introduction to the diet.
    • A few vegetable oils like palm kernel oil and olive oil were common in the traditional Neolithic diet, but they're a few percent of the total, which is dominated by the extremely recent soy, corn, sunflower etc., and they get far, far more refining (including partial oxidation) than they ever used to. Traditional vegetable oils exist, but counting them separately would just complicate matters without changing conclusions.
    • The starchy roots category is almost all Neolithic crops like potatoes, sweet potatoes, and so on.
    • Minor categories without bold font are rather insignificant, and only included to make percents add up nicely.
    • Fans of the Paleolithic will undoubtedly whine that modern vegetables, meat, etc. are not really Paleo due to all the changes in modern agriculture, but by the same logic Neolithic foods are very rapidly disappearing as well. More seriously - in the grand scheme of things an egg is an egg, a pig is a pig, and a nut is a nut. A modern pig is far closer to a wild pig than it is to a can of Diet Mountain Dew.
    Still with me? Can you guess what changed in our diet between 1961 and 2007?
    Our diet became a lot less Neolithic, a lot more Paleolithic, and a lot more Industrial.
    Yes, we eat more Paleolithic, and we're fatter than when we were eating Neolithic. How do you deal with that, Neolithic-haters?

    Here's the full list of countries for which data exists for both 1961 and 2007; percentages are Paleolithic:Neolithic:Industrial by calories. Sadly there's no high quality data predating 1961, and for many countries like the United States that was already halfway through the Neolithic Counter-Revolution in diet.
    • Albania - 11.75%:80.75%:7.51% - 20.52%:65.36%:14.12%
    • Algeria - 9.75%:73.61%:16.65% - 11.11%:67.83%:21.06%
    • Angola - 7.89%:79.36%:12.75% - 9.32%:69.61%:21.07%
    • Antigua and Barbuda - 15.81%:57.22%:26.96% - 31.47%:46.85%:21.67%
    • Argentina - 27.36%:54.76%:17.88% - 24.13%:47.73%:28.13%
    • Australia - 24.36%:54.1%:21.54% - 25.47%:45.74%:28.79%
    • Austria - 18.21%:63.22%:18.57% - 22.02%:50.83%:27.15%
    • Bahamas - 23.15%:61.93%:14.93% - 28.53%:50.39%:21.08%
    • Bangladesh - 3.85%:89.29%:6.86% - 4.28%:84.97%:10.74%
    • Barbados - 13.98%:60.39%:25.63% - 23.63%:46.47%:29.9%
    • Belize - 11.35%:74.1%:14.55% - 19.45%:59.28%:21.26%
    • Benin - 6.94%:81.69%:11.37% - 5.59%:79.67%:14.74%
    • Bermuda - 29.55%:49.89%:20.56% - 26.64%:47.87%:25.49%
    • Bolivia - 16.84%:69.13%:14.03% - 20.5%:63.21%:16.29%
    • Botswana - 8.12%:81.73%:10.15% - 8.33%:68.47%:23.2%
    • Brazil - 10.84%:64.72%:24.44% - 18.5%:52.16%:29.34%
    • Brunei Darussalam - 11.3%:60.84%:27.86% - 18.18%:58.17%:23.65%
    • Bulgaria - 12.09%:73.62%:14.29% - 14.32%:59.42%:26.27%
    • Burkina Faso - 5.7%:82.29%:12.0% - 4.88%:82.2%:12.92%
    • Burundi - 12.63%:86.26%:1.12% - 18.25%:76.82%:4.93%
    • Cambodia - 6.9%:85.41%:7.69% - 10.36%:78.72%:10.92%
    • Cameroon - 13.2%:77.79%:9.02% - 14.16%:69.59%:16.25%
    • Canada - 19.53%:56.58%:23.89% - 20.26%:47.84%:31.9%
    • Cape Verde - 4.7%:81.66%:13.64% - 15.5%:66.54%:17.96%
    • Central African Republic - 7.06%:78.32%:14.62% - 12.51%:62.86%:24.63%
    • Chad - 5.5%:76.12%:18.37% - 4.72%:74.52%:20.76%
    • Chile - 14.27%:69.53%:16.2% - 22.35%:55.36%:22.29%
    • China - 7.84%:85.5%:6.66% - 26.97%:60.24%:12.8%
    • Colombia - 20.28%:58.13%:21.59% - 17.96%:54.0%:28.04%
    • Comoros - 10.92%:69.45%:19.63% - 12.41%:63.33%:24.26%
    • Congo - 9.8%:78.31%:11.89% - 11.48%:65.33%:23.19%
    • Costa Rica - 12.49%:58.25%:29.25% - 13.21%:54.1%:32.69%
    • Cuba - 12.87%:56.72%:30.41% - 17.15%:59.63%:23.22%
    • Cyprus - 18.96%:55.05%:25.99% - 25.71%:44.5%:29.79%
    • Côte d'Ivoire - 22.44%:67.91%:9.65% - 12.63%:71.54%:15.84%
    • Democratic People's Republic of Korea - 10.04%:82.17%:7.78% - 14.28%:76.58%:9.14%
    • Democratic Republic of the Congo - 11.51%:77.38%:11.11% - 5.62%:80.92%:13.46%
    • Denmark - 12.89%:62.25%:24.85% - 22.85%:56.69%:20.46%
    • Djibouti - 10.02%:64.9%:25.08% - 9.22%:64.56%:26.22%
    • Dominica - 17.34%:54.38%:28.27% - 25.55%:53.08%:21.37%
    • Dominican Republic - 30.24%:50.15%:19.62% - 21.13%:45.9%:32.97%
    • Ecuador - 19.72%:57.59%:22.69% - 23.44%:50.79%:25.77%
    • Egypt - 9.99%:76.98%:13.04% - 13.53%:72.27%:14.2%
    • El Salvador - 9.57%:71.64%:18.79% - 12.95%:63.52%:23.53%
    • Fiji - 5.56%:70.69%:23.75% - 13.74%:59.49%:26.77%
    • Finland - 12.06%:71.48%:16.46% - 24.17%:57.14%:18.7%
    • France - 21.62%:63.01%:15.37% - 22.55%:52.99%:24.46%
    • French Polynesia - 14.2%:63.64%:22.16% - 25.66%:52.12%:22.22%
    • Gabon - 32.81%:60.46%:6.72% - 24.44%:58.77%:16.79%
    • Gambia - 4.72%:71.77%:23.51% - 5.43%:62.54%:32.02%
    • Germany - 17.41%:61.61%:20.98% - 19.11%:54.34%:26.55%
    • Ghana - 14.96%:70.04%:14.99% - 15.66%:68.45%:15.88%
    • Greece - 16.44%:62.54%:21.02% - 20.88%:52.17%:26.95%
    • Grenada - 19.96%:51.5%:28.54% - 22.75%:44.22%:33.03%
    • Guatemala - 7.94%:78.14%:13.92% - 11.88%:62.1%:26.03%
    • Guinea - 19.42%:66.22%:14.36% - 12.7%:65.3%:22.0%
    • Guinea-Bissau - 13.84%:66.61%:19.55% - 8.88%:72.67%:18.45%
    • Guyana - 9.91%:63.19%:26.9% - 13.37%:61.33%:25.29%
    • Haiti - 12.78%:71.19%:16.03% - 11.49%:67.65%:20.86%
    • Honduras - 14.6%:70.33%:15.06% - 14.01%:60.13%:25.86%
    • Hungary - 15.99%:73.08%:10.93% - 18.2%:54.28%:27.52%
    • Iceland - 23.18%:52.58%:24.24% - 29.86%:50.0%:20.14%
    • India - 3.91%:79.42%:16.67% - 6.05%:75.19%:18.76%
    • Indonesia - 5.01%:78.85%:16.14% - 10.06%:70.67%:19.28%
    • Iran - 13.05%:71.72%:15.22% - 19.0%:64.28%:16.72%
    • Ireland - 13.96%:68.21%:17.83% - 18.92%:57.09%:23.99%
    • Israel - 18.05%:54.71%:27.25% - 24.11%:46.88%:29.01%
    • Italy - 15.24%:65.83%:18.93% - 22.54%:50.52%:26.95%
    • Jamaica - 17.55%:52.07%:30.37% - 19.78%:50.2%:30.02%
    • Japan - 10.96%:73.67%:15.37% - 20.18%:52.37%:27.45%
    • Jordan - 15.15%:61.34%:23.52% - 11.56%:56.94%:31.51%
    • Kenya - 10.21%:81.37%:8.42% - 10.96%:71.23%:17.8%
    • Kiribati - 15.97%:36.76%:47.27% - 17.87%:41.66%:40.47%
    • Kuwait - 20.35%:53.3%:26.35% - 20.59%:55.01%:24.4%
    • Lao People's Democratic Republic - 5.7%:92.49%:1.81% - 12.54%:80.1%:7.36%
    • Lebanon - 18.25%:63.08%:18.67% - 19.77%:51.64%:28.59%
    • Lesotho - 6.41%:87.67%:5.93% - 5.42%:86.71%:7.87%
    • Liberia - 11.77%:78.59%:9.64% - 7.39%:70.14%:22.47%
    • Libyan Arab Jamahiriya - 12.64%:64.69%:22.67% - 14.19%:57.38%:28.43%
    • Madagascar - 10.93%:82.92%:6.15% - 8.59%:82.55%:8.87%
    • Malawi - 6.94%:80.3%:12.76% - 8.17%:80.37%:11.46%
    • Malaysia - 10.39%:66.13%:23.48% - 17.46%:55.23%:27.31%
    • Maldives - 11.62%:52.79%:35.59% - 31.66%:47.91%:20.43%
    • Mali - 7.59%:84.68%:7.73% - 7.61%:78.81%:13.58%
    • Malta - 11.15%:68.35%:20.5% - 20.93%:58.91%:20.16%
    • Mauritania - 12.88%:75.94%:11.18% - 8.49%:68.07%:23.43%
    • Mauritius - 3.87%:65.63%:30.5% - 11.67%:60.76%:27.58%
    • Mexico - 10.63%:71.76%:17.61% - 17.84%:57.94%:24.21%
    • Mongolia - 39.69%:58.38%:1.94% - 21.64%:66.94%:11.42%
    • Morocco - 7.48%:72.86%:19.66% - 11.14%:67.73%:21.13%
    • Mozambique - 4.49%:86.97%:8.54% - 4.86%:81.05%:14.09%
    • Myanmar - 7.85%:79.08%:13.07% - 13.94%:68.05%:18.01%
    • Namibia - 11.65%:72.43%:15.92% - 9.47%:73.61%:16.93%
    • Nepal - 2.92%:92.98%:4.1% - 6.57%:83.19%:10.24%
    • Netherlands - 15.62%:54.88%:29.5% - 23.71%:48.54%:27.75%
    • Netherlands Antilles - 22.11%:56.63%:21.26% - 19.48%:57.6%:22.92%
    • New Caledonia - 18.0%:59.27%:22.73% - 20.3%:55.11%:24.58%
    • New Zealand - 25.31%:57.12%:17.57% - 26.86%:46.42%:26.72%
    • Nicaragua - 10.28%:69.71%:20.01% - 7.62%:66.84%:25.54%
    • Niger - 6.39%:88.74%:4.87% - 8.12%:82.06%:9.82%
    • Nigeria - 10.03%:69.04%:20.93% - 8.41%:72.24%:19.35%
    • Norway - 19.78%:61.74%:18.48% - 21.85%:55.4%:22.75%
    • Pakistan - 4.75%:81.31%:13.94% - 6.8%:69.01%:24.19%
    • Panama - 15.72%:66.74%:17.54% - 16.01%:62.74%:21.24%
    • Paraguay - 23.41%:66.75%:9.83% - 16.12%:59.2%:24.68%
    • Peru - 12.89%:68.83%:18.28% - 15.12%:69.75%:15.13%
    • Philippines - 19.21%:66.67%:14.12% - 20.62%:65.71%:13.68%
    • Poland - 10.91%:76.75%:12.35% - 18.45%:60.73%:20.82%
    • Portugal - 16.91%:66.28%:16.81% - 23.23%:56.13%:20.64%
    • Republic of Korea - 5.32%:90.37%:4.32% - 22.31%:52.44%:25.25%
    • Romania - 9.45%:82.0%:8.55% - 15.07%:67.21%:17.73%
    • Rwanda - 25.5%:73.72%:0.78% - 19.72%:73.29%:6.99%
    • Saint Kitts and Nevis - 8.93%:50.93%:40.14% - 20.29%:48.07%:31.65%
    • Saint Lucia - 25.06%:50.34%:24.6% - 29.78%:51.14%:19.08%
    • Saint Vincent and the Grenadines - 9.56%:60.38%:30.06% - 21.06%:53.12%:25.82%
    • Samoa - 23.99%:36.67%:39.34% - 30.0%:36.99%:33.01%
    • Sao Tome and Principe - 8.04%:60.04%:31.92% - 16.54%:55.14%:28.31%
    • Saudi Arabia - 15.92%:75.97%:8.11% - 18.27%:59.75%:21.98%
    • Senegal - 7.95%:71.51%:20.54% - 8.12%:67.61%:24.28%
    • Seychelles - 8.56%:67.95%:23.49% - 19.51%:55.67%:24.82%
    • Sierra Leone - 8.22%:60.19%:31.59% - 7.24%:69.23%:23.53%
    • Solomon Islands - 11.34%:76.01%:12.65% - 8.8%:74.13%:17.07%
    • South Africa - 11.65%:68.29%:20.06% - 13.0%:65.02%:21.98%
    • Spain - 14.51%:65.32%:20.17% - 26.26%:44.51%:29.23%
    • Sri Lanka - 7.11%:66.81%:26.08% - 7.77%:65.04%:27.2%
    • Sudan - 12.25%:69.05%:18.7% - 8.99%:73.13%:17.88%
    • Suriname - 10.5%:62.34%:27.17% - 17.27%:50.4%:32.33%
    • Swaziland - 9.94%:72.8%:17.26% - 11.05%:66.46%:22.48%
    • Sweden - 15.71%:55.94%:28.35% - 21.33%:52.34%:26.33%
    • Switzerland - 18.89%:56.97%:24.14% - 22.51%:47.44%:30.06%
    • Syrian Arab Republic - 15.76%:64.68%:19.57% - 12.54%:59.61%:27.85%
    • Thailand - 13.35%:75.37%:11.28% - 16.74%:57.68%:25.57%
    • Timor-Leste - 24.68%:71.22%:4.09% - 9.99%:75.86%:14.15%
    • Togo - 4.07%:86.81%:9.12% - 4.06%:78.59%:17.35%
    • Trinidad and Tobago - 10.2%:63.93%:25.88% - 14.56%:53.62%:31.82%
    • Tunisia - 9.36%:70.51%:20.13% - 13.45%:62.22%:24.33%
    • Turkey - 15.4%:74.39%:10.2% - 13.6%:62.28%:24.12%
    • Uganda - 17.44%:68.31%:14.25% - 22.27%:61.07%:16.66%
    • United Arab Emirates - 17.52%:70.51%:11.98% - 23.45%:57.72%:18.82%
    • United Kingdom - 20.41%:56.05%:23.54% - 22.54%:53.49%:23.97%
    • United Republic of Tanzania - 11.34%:80.15%:8.51% - 10.55%:75.45%:14.0%
    • United States of America - 20.36%:50.72%:28.92% - 20.51%:43.43%:36.06%
    • Uruguay - 28.54%:52.68%:18.78% - 15.05%:62.14%:22.82%
    • Vanuatu - 19.77%:57.07%:23.16% - 17.79%:52.8%:29.41%
    • Venezuela - 18.93%:54.48%:26.59% - 17.04%:53.45%:29.51%
    • Viet Nam - 10.49%:85.45%:4.06% - 19.59%:70.94%:9.46%
    • Yemen - 7.35%:85.13%:7.53% - 8.83%:69.29%:21.88%
    • Zambia - 6.0%:85.56%:8.44% - 5.53%:80.41%:14.05%
    • Zimbabwe - 5.85%:83.99%:10.16% - 5.64%:69.28%:25.07%
Almost universally, the good old Neolithic foods that served us so well for the entire history of human civilization are being abandoned, and the two primary foods of the industrial era - sugar and vegetable oil - have become the new basis of the diet. What people rarely mention is that at the same time Paleolithic foods - meat, fish, fruit, vegetables and so on - doubled in popularity in many countries. And some combination of Paleolithic and Industrial is destroying everyone's hearts, livers, thyroids, and attractiveness outside certain narrow niches.

    To save you some eye strain, here's the list of Top Ten Least Neolithic Countries. I don't even need to mention how it correlates with obesity rankings:
    1. Samoa - 23.99%:36.67%:39.34% - 30.0%:36.99%:33.01%
    2. Kiribati - 15.97%:36.76%:47.27% - 17.87%:41.66%:40.47%
    3. United States of America - 20.36%:50.72%:28.92% - 20.51%:43.43%:36.06%
    4. Grenada - 19.96%:51.5%:28.54% - 22.75%:44.22%:33.03%
    5. Cyprus - 18.96%:55.05%:25.99% - 25.71%:44.5%:29.79%
    6. Spain - 14.51%:65.32%:20.17% - 26.26%:44.51%:29.23%
    7. Australia - 24.36%:54.1%:21.54% - 25.47%:45.74%:28.79%
    8. Dominican Republic - 30.24%:50.15%:19.62% - 21.13%:45.9%:32.97%
    9. New Zealand - 25.31%:57.12%:17.57% - 26.86%:46.42%:26.72%
    10. Barbados - 13.98%:60.39%:25.63% - 23.63%:46.47%:29.9%
      But no worries - The Neolithic Counter-Revolution seems to be reaching everyone who can afford it, so soon the entire planet will consist of walking blobs of omega-6 PUFA.
      OM NOMNOMNOM NOM by katherine.a from flickr (CC-NC-SA)

      Anonymous Heroes of Science

      You might now think that I'm some ridiculous kind of ultra-Conservative, who would like to move back not just to the Founding Fathers or their local equivalent but all the way to ancient Sumer. Not at all.

Food of the Industrial era is mostly horrible crap, but it's also very plentiful, very cheap, and some of it is actually quite tasty. I'm sure we'll figure it out eventually - a bit more Omega-3 here, a bit less Bisphenol A there - and step by step it might get just as good or even better than what farming came up with. Food eaten by early farmers was also horrible and caused severe health problems, but it was plentiful and cheap, and it took only a few millennia to figure out the matters of health.

The thing is - I just don't feel a particular need to be an early adopter of the Industrial Diet. Early adopters are essential: as our scientific ethics committees are too chickenshit to approve genuine scientific experiments on humans, we're doing the second best thing by pretending it's all perfectly safe and letting nature run the experiment. Which trial of trans fats could even hope to compete with feeding ridiculous amounts of them to hundreds of millions of people for nearly a century now? Without even telling them, to avoid the placebo effect! Now that's proper science on a grand scale!

      And so I'd like to thank all people who unlike me bravely volunteer to let food industry run scientific experiments on their bodies by purchasing industrial food, and not stopping even after suffering severe side effects including but not limited to disfiguration, heart failure, liver failure, depression, diabetes, cancer, osteoarthritis, sleep apnea, and many others.

      Your noble sacrifice will not be forgotten. Actually it totally will, but keep going please? The science needs you!

      Tuesday, July 20, 2010

      We need syntax for talking about Ruby types

      Koteczek by kemcio from flickr (CC-NC)

      All this is about discussing types in blog posts, documentation etc. None of that goes anywhere near actual code (except possibly in comments). Ruby never sees that.

      Statically typed languages have all this covered, and we need it too. Not static typing of course - just an expressive way to talk about what types things are - as plain English fails here very quickly. As far as I know nothing like that exists yet, so here's my proposal.

This system of type descriptions is meant for humans, not machines. It focuses on the most important distinctions, and ignores details that are not important, or very difficult to keep track of. Type descriptions should only be as specific as necessary in a given context. If it makes sense, these rules should be violated.

In advance I'll say I totally ignored all the covariance / contravariance / invariance business - it's far too complicated, and getting too deeply into such issues makes little sense in a language where everything can be redefined.

      Basic types

      Types of simple values can be described by their class name, or any of its superclasses or mixins. So some ways to describe type of 15 would be Fixnum (actual class), Integer (superclass), Comparable (mixin), or Object (superclass all the way up).

      In context of describing types, everything is considered an Object, and existence of Kernel, BasicObject etc. is ignored.

      So far, it should all be rather obvious. Examples:
      • 42 - Integer
• Time.now - Time
      • Dir.glob("*") - Enumerable
      • STDIN - IO
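These ancestry-based descriptions can be sanity-checked directly in Ruby - any class, superclass, or mixin along the ancestor chain is a valid description:

```ruby
# One value, several valid type descriptions at different levels of the chain.
# (Note: 42.class is Integer on modern Rubies; on 1.8 it was Fixnum.)
p 42.class
p 42.is_a?(Integer)                # superclass counts
p 42.is_a?(Comparable)             # mixin counts
p 42.is_a?(Object)                 # everything is an Object
p Dir.glob("*").is_a?(Enumerable)  # Array mixes in Enumerable
p STDIN.is_a?(IO)
```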

      nil and other ignored issues

      nil will be treated specially - as if it was of every possible type. nil means absence of value, and doesn't indicate what type the value would have if it was present. This is messy, but most explicitly typed languages follow this path.

Distinction between situations that allow nils and those that don't will be treated like all other value range restrictions (Integer must be positive, IO must be open for writing etc.) - as something outside the type system.

      For cases where nil means something magical, and not just absence of value, it should probably be mentioned.

Tracking checked exceptions and related non-local exits in Ruby would be a hopeless thing to even attempt. There's syntax for exceptions and catches used as control structures if they're really necessary.


      We will also pretend that Boolean is a common superclass of TrueClass and FalseClass.

      We will also normally ignore distinction between situations where real true/false are expected, and situations where any object goes, but acts identically to its boolean conversion. Any method that acts identically on x and !!x can be said to take Boolean.

On the other hand if some values are treated differently than their double negation, that's not really Boolean and it deserves a mention. Especially if nil and false are not equivalent - like in Rails's #in_groups_of (I don't think Ruby stdlib ever does things like that).
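For example, a method that only ever branches on its argument treats x and !!x identically, so by this convention it can be said to take Boolean (on_or_off is a made-up illustration, not a real API):

```ruby
# Acts identically on any object and its double negation -
# only nil and false take the "off" branch.
def on_or_off(flag)
  flag ? "on" : "off"
end

p on_or_off("anything") == on_or_off(!!"anything")  # truthy objects act alike
p on_or_off(nil) == on_or_off(false)                # nil and false equivalent here
```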

      Duck typing

If something quacks like a Duck convincingly enough, it can be said to be of type Duck, it being the object's responsibility that its cover doesn't get blown.

      In particular, Ruby uses certain methods for automatic type conversion. In many contexts objects implementing #to_str like Pathnames will be treated as Strings, objects implementing #to_ary as Arrays, #to_hash as Hashes, and to_proc as Procs - this can be used for some amazing things like Symbol#to_proc.

This leads to a big complication for us - C code implementing the Ruby interpreter and many libraries is normally written in a way that calls these conversion functions automatically, so in such contexts Symbol really is a Proc, Pathname really is a String and so on. On the other hand, in Ruby code these methods are not magical, and such conversions will only happen if explicitly called - for them Pathname and String are completely unrelated types. Unless Ruby code calls C code, which then autoconverts.

      Explicitly differentiating between contexts which expect a genuine String and those which expect either that or something with a valid #to_str method would be highly tedious, and I doubt anyone would get it exactly right.

      My recommendation would be to treat everything that autoconverts to something as if it subclassed it. So we'll pretend Pathname is a subclass of String, even though it's not really. In some cases this will be wrong, but it's not really all that different from subclassing something and then introducing incompatible changes.

This all doesn't extend to #to_s, #to_a etc. - nothing can be described as String just because it has a to_s method - every object has to_s but most aren't really strings.
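A tiny illustration with a made-up FakePath class (hypothetical, not a real library): C-implemented methods like String#+ call #to_str automatically, while in plain Ruby code the object remains a non-String.

```ruby
# Hypothetical Pathname-like class that quacks as a String via #to_str.
class FakePath
  def initialize(path); @path = path; end
  def to_str; @path; end
end

# String#+ is implemented in C and autoconverts via #to_str:
p "prefix/" + FakePath.new("file.txt")   # => "prefix/file.txt"
# ...but as far as Ruby code is concerned, it's still not a String:
p FakePath.new("file.txt").is_a?(String) # => false
```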

      Technical explanation of to_str and friends

This section is unrelated to the post's primary subject - skip it if uninterested.

      Ruby uses special memory layout for basic types like strings and arrays. Performance would be abysmal if string methods had to actually call Ruby code associated with whatever [] happened to be redefined to for every character - instead they ask for a certain C data structure, and access that directly (via some macros providing extra safety and convenience to be really exact).

By the way this is a great example of C being really slow - if Ruby was implemented on a platform with a really good JIT, it could plausibly have every single string function implemented in terms of calls to [], []=, size, and just a few others, with different subclasses of String providing different implementations, and the JIT inlining all that to make it really fast.

It would make it really simple to create a class representing a text file, and match it with =~ /regexp/ directly without reading anything more than required into memory, or maybe even gsub! it in a way that would read it in small chunks, saving them to another file as soon as they're ready, and then renaming in one go. All that without the regexp library knowing anything about it all. It's all just my fantasy, I'm not saying any such JIT actually exists.

Anyway, strings and such are implemented specially, but we still want these types to be real objects, not like what they've done in Java. To make it work, all C functions requiring access to underlying storage call a special macro which automatically calls a method like to_str or to_ary if necessary - so such objects can pretend to be strings very effectively. For example if you alias method to_str to path on File, code like system(File.open("/bin/hostname")) will suddenly start working. It really makes sense only for things which are "essentially strings" like Pathname, URI, Unicode-enhanced strings, proxies for strings in third party libraries like Qt etc.

      To complicate things further objects of all classes inheriting from String automatically use String's data representation - and C code will access that, never calling to_str. This leaves objects which duck type as Strings two choices:
      • Subclass String and every time anything changes update C string data. This can be difficult - if you implement an URI and keep query part as a hash instance variable - you need to somehow make sure that your update code gets run every time that hash changes - like by not exposing it at all and only allowing query updates via your direct methods, or wrapping it in a special object that calls you back.
      • Don't subclass String, define to_str the way you want. Everything works - except your class isn't technically a String so it's not terribly pretty OO design.
      You probably won't be surprised that not subclassing is the more popular choice. As it's all due to technical limitations not design choices, it makes sense to treat such objects as if they were properly subclassed.

      Pussy by tripleigrek from flickr (CC-SA)


      Back to the subject. For collections we often want to describe types of their elements. For simple collections yielding successive elements on #each, syntax for type description is CollectionType[MemberType]. Examples:
      • [42.0, 17.5] - Array[Float]
      • Set["foo","bar"] - Set[String]
      • 5..10 - Range[Integer]
      When we don't care about collection type, only about element types, descriptions like Enumerable[ElementType] will do.

      Syntax for types of hashtables is Hash[KeyType, ValueType] - in general collections which yield multiple values to #each can be described as CollectionType[Type1, Type2, ..., TypeN].

      For example {:foo => "bar"} is of type Hash[Symbol, String].

      This is optional - type descriptions like Hash or Enumerable are perfectly valid - and often types are unrelated, or we don't care.

Not every Enumerable should be treated as a collection of members like that - File might technically be File[String] but it's usually pointless to describe it this way. In 1.8 String is Enumerable, yielding successive lines when iterated - but String[String] makes no sense (no longer a problem in 1.9).

      Classes other than Enumerable like Delegator might need type parameters, and they should be specified with the same syntax. Their order and meaning depends on particular class, but usually should be obvious.

      Literals and tuples

Ruby doesn't make a distinction between Arrays and tuples. What I mean here is the kind of Array which shouldn't really be treated as a collection, and in which different members have unrelated types and meanings depending on their position.

      Like method arguments. It really wouldn't be useful to say that every method takes Array[Object] (and an optional Proc) - types and meanings of elements in this array should be specified.

The syntax I want for this is [Type1, Type2, *TypeRest] - so for example Hash[Date, Integer]'s #select passes [Date, Integer] to the block, which should return a Boolean result, and then returns either Array[[Date, Integer]] (1.8) or Hash[Date, Integer] (1.9). Notice the double [[]]s here - it's an Array of pairs. In many contexts Ruby automatically unpacks such tuples, so Array[[Date,Integer]] can often be treated as Array[Date,Integer] - but it doesn't go deeper than one level, and if you need this distinction it's available.
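On a modern Ruby the #select behaviour described here - tuples auto-unpacked into the block, Hash in and Hash out - can be checked directly (on 1.8 the same call returned an Array of pairs):

```ruby
require "date"

# Hash[Date, Integer] - #select passes [Date, Integer] tuples to the block,
# auto-unpacked into |date, n|
h = {Date.civil(2010, 1, 1) => 1, Date.civil(2010, 2, 1) => 2}
result = h.select{|date, n| n.odd?}
p result  # 1.9+: Hash[Date, Integer]; 1.8 returned Array[[Date, Integer]]
```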

Extra arguments can be specified with *Type or ..., which is treated here as *Object. If you want to specify some arguments as optional, suffix their types with ? (the most obvious [] having too many uses already, and = not really fitting right).

      In this syntax [*Foo] is pretty much equivalent to Array[Foo], or possibly Enumerable[Foo] (with some duck typing) - feel free to use that if it makes things clearer.

      Basic literals like true, false, nil stand for themselves - and for entire TrueClass, FalseClass, NilClass classes too as they're their only members. Other literals such as symbols, strings, numbers etc. can be used too when needed.

      To describe keyword arguments and hashes used in similar way, syntax is {Key1=>Type1, Key2=>Type2} - specifying exact key, and type of value like {:noop=>Boolean, :force=>Boolean}.

It should be assumed that keys other than those listed are ignored, cause an exception, or are otherwise not supported. If they're meaningful it should be marked with ... like this: {:query=>String, ...}. Subclasses often add extra keyword arguments, and this issue is ignored.
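A minimal sketch of a method taking such an options hash - remove is a hypothetical helper invented for this example, not a real API:

```ruby
# Hypothetical method of type (String, {:noop=>Boolean, :force=>Boolean}?) -> String
def remove(path, opts={})
  raise "no such file" if !File.exist?(path) && !opts[:force]
  File.delete(path) unless opts[:noop]
  path
end

# :noop suppresses the deletion, :force suppresses the existence check
p remove("config.bak", :noop => true, :force => true)
```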


Everything so far was just a prelude to the most important part of any type system - types for functions. The syntax I'd propose is: ArgumentTypes -> ReturnType (=> being already used by hashes).

I cannot decide if blocks should be specified in Ruby-style notation or function notation, so both &{|BlockArgumentTypes| BlockReturnType} and &(BlockArgumentTypes->BlockReturnType) are valid. & is necessary, as blocks are passed separately from normal arguments, however strong the temptation to reuse -> and let the context disambiguate might be.

      Blocks that don't take any arguments or don't return anything can drop that part, leaving only something like &{|X|}, &{Y}, &{}, or in more functional notation &(X->), &(Y), &().

      Because of all the [] unpacking, using [] around arguments, tuple return values etc. is optional - and just like in Ruby () can be used instead in such contexts.

If a function doesn't take any arguments, or returns no values, these parts can be left out - leaving perhaps as little as ->.

• In context of %w[Hello world !].group_by(&:size) method #group_by has type Array[String]&{|String| Integer}->Hash[Integer,Array[String]]
• Time.at has type Numeric -> Time
      • String#tr has type [String, String] -> String
      • On a collection of Floats, #find would have type Float?&(Float->Boolean)->Float
• A function which takes no arguments and returns no values has type []->nil
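The #group_by example can be run directly, and it shows why the values side of the result is Arrays of Strings:

```ruby
# Array[String] & {|String| Integer} -> keys are the block's Integer results,
# values collect the Strings that produced each key
result = %w[Hello world !].group_by(&:size)
p result  # => {5=>["Hello", "world"], 1=>["!"]}
```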
If you really need to specify exceptions and throws, you can add raises Type, or throws :kind after the return value. Use these only for control structure exceptions, not for actual errors. It might actually be useful if actual data gets passed around.
• Find.find has type [*String]&(String->nil throws :prune)->nil

      A standalone Proc can be described as (ArgumentsTypes->ReturnType) just as with notation for functions. There is no ambiguity between Proc arguments and block arguments, as blocks are always marked with |.

      Type variable and everything else

In addition to names of real classes, any name starting with an uppercase letter should be considered a type. Unless specified otherwise in context, all such unknown names should be considered type variables, with a big forall quantifier in front of it all.

• Enumerable[A]#partition has type &(A->Boolean)->[Array[A], Array[A]]
      • Hash[A,B]#merge has type Hash[A,B]&(A,B,B->B)->Hash[A,B]
• Array[A]#inject has either type B&(B,A->B)->B or &(A,A->A)->A. This isn't just the usual case of a missing argument being substituted by nil - these are two completely different functions.
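Both #inject types in action - with a seed the accumulator can be a different type than the elements; without one the first element becomes the accumulator:

```ruby
# B & (B,A->B) -> B: seed 10 is the accumulator's starting value
p [1, 2, 3].inject(10){|acc, x| acc + x}  # => 16

# &(A,A->A) -> A: no seed, folds the elements among themselves
p [1, 2, 3].inject{|a, b| a + b}          # => 6
```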
      To specify that multiple types are allowed (usually implying that behaviour will be different, otherwise there should be a superclass somewhere, or we could treat it as common duck typing and ignore it) join them with |. If there's ambiguity between this use and block arguments, parenthesize. It binds more tightly than ,, so it only applies to one argument. Example:
      • String#index in 1.8 has type (String|Integer|Regexp, Integer?)->Integer (and notice how I ignored Fixnums here).
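The alternatives of that signature, runnable on a modern Ruby (the Integer character-code form was 1.8-only, so it's left commented out):

```ruby
p "hello".index("l")      # String argument       => 2
p "hello".index(/l+/)     # Regexp argument       => 2
p "hello".index("l", 3)   # optional Integer? start position => 3
# "hello".index(?l)       # Integer character code - 1.8 only
```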
      For functions that can be called in multiple unrelated ways, just list them separately - | and parentheses will work, but they are usually top level, and not needed anywhere deeper.

      If you want to specify type of self, prefix function specification with Type#:
      • #sort has type like Enumerable[A]#()&(A,A->1|0|-1)->Array[A]

      To specify that something takes range of values not really corresponding to a Ruby class, just define such extra names somewhere and then use like this:
• File#chown has type (UnixUserId, UnixUserId)->0 - with UnixUserId being a pretend subclass of Integer, and 0 being the literal value actually returned.

To specify that something needs a particular method, just make up a pretend mixin like Meowable for #meow.

      Any obvious extensions to this notation can be used, like this:
• Enumerable[A]#zip has type (Enumerable[B_1], *Enumerable[B_i])->Array[A, B_1, *B_i] - with the intention that the B_i's will be different for each argument, understood from context. (I don't think any static type system handles cases like this one reasonably - most require a separate case for each supported tuple length, and you cannot use arrays if you mix types. Am I missing something?)
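The #zip case run on actual data - each position in the resulting tuples has its own unrelated type:

```ruby
# (Enumerable[String], Enumerable[Symbol]) applied to Array[Integer] -
# result tuples mix Integer, String, Symbol by position
p [1, 2].zip(%w[a b], [:x, :y])  # => [[1, "a", :x], [2, "b", :y]]
```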

      The End

Well, what I really wanted to do was talk about the Ruby collection system, and how 1.9 doesn't go far enough in its attempts at fixing it. And without notation for types, talking about higher order functions that operate on collections quickly turns into a horrible mess. So I started with a brief explanation of the notation I wanted to use, and then I figured out I could as well do it right and write something that will be reusable in other contexts too.

      Most discussion of type systems concerns issues like safety and flexibility, which don't concern me at all, and limit themselves to type systems usable by machines.

I need types for something else - as statements about data flow. A type signature like Enumerable[A]#()&(A->B)->Hash[A,B] doesn't tell you exactly what such a function does, but it narrows the set of possibilities extremely quickly. What it describes is a function which iterates over a collection in order while building a Hash to be returned, using the collection's elements as keys, and values returned by the block as values. Can you guess the function I was thinking about here?

      Now a type like that is not a complete specification - a function that returns an empty hash fits it. As does one which skips every 5th element. And one that only keeps entries with unique block results. And for that matter also one that sends your email password to NSA - at least assuming it returns that Hash afterwards.

It's still pretty useful. How about some of these?
      • Hash[A,B] -> Hash[B, Array[A]]
      • Hash[A,B] &(A,B->C) -> Hash[A,C]
      • Hash[A, Hash[B,C]] -> Hash[[A,B], C]
      • Hash[A,B] &(A,B->C) -> Hash[C, Hash[A,B]]
      • Enumerable[Hash[A,B]] &(A,B,B->B) -> Hash[A,B]
      • Hash[A,Set[B]] -> Hash[Set[A], Set[B]]

      Even these short snippets should give a pretty good idea what these are all about.

      That's it for now. Hopefully it won't be long until that promised 1.9 collections post.

      Sunday, July 18, 2010

      If only Ruby had macros

      Kicius Gustaw Czarowny by K0P from flickr (CC-NC-ND)

      Blogger will most likely totally destroy code formatting again, sorry about that.

Ruby annoys me a lot - the code gets so close to being Just Right, with only that last little bit of wrongness that won't go away no matter what. With everything except Ruby at least I know it will be crap no matter what, so I never get this feeling.

      For example it's so easy to make a function generating successive values on each call:

def counter(v)
  return counter(v, &:succ) unless block_given?
  proc{ v = yield(v) }
end

But you must give it the value before the first one - and sometimes such a thing doesn't exist, like with generating successive labels "a", "b", "c" ... A counter starting from the first value passed isn't exactly difficult, it just doesn't feel right:

def counter(v)
  return counter(v, &:succ) unless block_given?
  proc{ old, v = v, yield(v); old }
end

      Useless variables like old that only indicate control flow just annoy me. Not to mention lack of default block argument. I'm undecided if this tap makes things better or worse.

def counter(v)
  return counter(v, &:succ) unless block_given?
  proc{ v.tap{ v = yield(v) } }
end
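To see the difference between the two behaviours, here are both variants under hypothetical distinct names (the post uses the same name counter for all of them):

```ruby
# Starts *after* the seed: first call returns seed.succ
def counter_after(v)
  return counter_after(v, &:succ) unless block_given?
  proc{ v = yield(v) }
end

# Starts *from* the seed: first call returns the seed itself
def counter_from(v)
  return counter_from(v, &:succ) unless block_given?
  proc{ old, v = v, yield(v); old }
end

c = counter_after(0)
p c.call  # => 1
p c.call  # => 2

labels = counter_from("a")
p labels.call  # => "a"
p labels.call  # => "b"
```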

      Another example. This wrapper for Ruby executable makes rubygems and -r compatible. It's so close to being able to use Array#map, and yet so far away:

args = []
while arg = ARGV.shift
  if arg =~ /\A-r(.*)\z/
    lib = $1.empty? ? ARGV.shift : $1
    args << "-e" << "require 'rubygems'; require '#{lib}'"
  else
    args << arg
  end
end
exec "ruby", *args

      Yes, these are tiny things, but it's frustrating to get almost there. By the way, -r should just call require, another thing which is almost right but no.

      I could go on with these small examples, but I want to talk about something bigger. A very common pattern in all programming languages is something like this:

if item.test_1
  item.action_1
elsif item.test_2
  item.action_2
elsif item.test_3
  item.action_3
end

      Or a very similar:

case item
when pattern_1
  item.action_1
when pattern_2
  item.action_2
when pattern_3
  item.action_3
end

      Tests and actions are all next to each other, where they belong. But what if instead of executing an action on a single item at a time, we wanted to do so on all matching items together?

If Ruby had proper macros it would be totally trivial - unfortunately Ruby forces us to choose one of several bad options. First, the most straightforward:

      yes1, no1 = collection.partition{|item| item.test_1}
      yes2, no12 = no1.partition{|item| item.test_2}
      yes3, no123 = no12.partition{|item| item.test_3}

      Rather awful. Or perhaps this?

groups = collection.group_by{|item|
  if item.test_1 then 1
  elsif item.test_2 then 2
  elsif item.test_3 then 3
  else 4
  end
}

      By the way we cannot use a series of selects here - action_3 should apply only to items which pass test_3 but not test_1 or test_2.

      We can imagine adding extra methods to Enumerable to get syntax like this:

collection.run_rules(
  proc{|item| item.test_1}, proc{|group| group.action_1},
  proc{|item| item.test_2}, proc{|group| group.action_2},
  proc{|item| item.test_3}, proc{|group| group.action_3},
                            proc{|group| group.otherwise})

      Or maybe like this (looks even worse if you need to assign groups to a variable before performing the relevant action):

      tmp = collection.dup
      tmp.destructive_select!{|item| item.test_1}.action_1
      tmp.destructive_select!{|item| item.test_2}.action_2
      tmp.destructive_select!{|item| item.test_3}.action_3

#destructive_select! being a method in the style of Perl's splice - removing some items from the collection, and returning the removed values.
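A sketch of how such a hypothetical #destructive_select! could look (on Array for simplicity; a real version would presumably live lower in the collection hierarchy):

```ruby
class Array
  # Hypothetical, Perl-splice style: remove items matching the block,
  # return the removed items, mutating the receiver in place.
  def destructive_select!
    removed, kept = partition{|x| yield(x)}
    replace(kept)
    removed
  end
end

a = [1, 2, 3, 4]
p a.destructive_select!(&:even?)  # => [2, 4]
p a                               # => [1, 3]
```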

      Possibly wrapping it in something like:

collection.filter{|item| item.test_1}.action{|group| group.action_1}
          .filter{|item| item.test_2}.action{|group| group.action_2}
          .filter{|item| item.test_3}.action{|group| group.action_3}
          .action{|group| group.otherwise}

      It's Kicius by starlightexpress from flickr (CC-NC-ND)

      A few more bad ideas (David Allen says the way you can tell a highly creative person is that they generate bad ideas faster than anyone else). With instance_eval we could do something like this, with item and group being appropriate method calls.

collection.run_rules{
  rule{ item.test_1 }
  action{ group.action_1 }
  rule{ item.test_2 }
  action{ group.action_2 }
  rule{ item.test_3 }
  action{ group.action_3 }
  action{ group.otherwise }
}

      It would be pretty hard to do that while still being able to have inner blocks with your current object's context. By the way trying this out I found out that it's impossible to call a block specifying self, and call a block passing arguments at the same time - it's only one or the other - and no combination of the two makes it work. Those tiny limitations are just infuriating.

      I also tried overriding ===. Now that would only work for a small subset of cases but was worth a try:

collection.run_for_each_group{|item, group|
  case item
  when pattern_1
    group.action_1
  when pattern_2
    group.action_2
  when pattern_3
    group.action_3
  end
}

      This item would actually be a special object, calling === on which would callcc, partition collection in two, and resume twice modifying group variable (initially set to the entire collection). That would be pretty cool - except Ruby doesn't use double dispatch, so === is not a CLOS style generic function - it's a method, set on pattern objects, and while adding new pattern types is easy, making old patterns match new kinds of objects is hard. It would require manually finding out every pattern, and manually overriding it to handle our magic item type - and then a lot of hackery to make Regexp#=== work, and then it would fail anyway, as Range#=== and such seem to be handled specially by Ruby.

      There was a related possibility of not doing anything weird to item, but requiring special patterns:

collection.run_for_each_group{|item, group, all|
  case item
  when all[pattern_1]
    group.action_1
  when all[pattern_2]
    group.action_2
  when all[pattern_3]
    group.action_3
  end
}

We're not actually using item here at all, so we don't really need to pass it:

collection.run_for_each_group{|group, all|
  if all[pattern_1]
    group.action_1
  elsif all[pattern_2]
    group.action_2
  elsif all[pattern_3]
    group.action_3
  end
}

Totally implementable, only somewhat ugly with all these all[]s. There are two good ways to implement it - the all function would test all items, and if they all returned the same value it would just return it. Otherwise, it would divide the collection, and in one implementation use callcc, or in an alternative implementation, throw something and restart the whole block twice - this assumes tests are cheap and deterministic.

      It looks good, but it doesn't make me happy, as I want all kinds of tests, not just pattern matches. And eventually I came up with this:

collection.run_for_each_group{|item, group, all|
  if all[item.test_1]
    group.action_1
  elsif all[item.test_2]
    group.action_2
  elsif all[item.test_3]
    group.action_3
  end
}

      This way, you can do any test on item you want - just pass the result to all[] before proceeding.

      How is it implemented? I could callcc for every element, but unlike Scheme's, Ruby's callcc is rather expensive. And not every version of Ruby has it. So it's the naive throw-and-restart-twice instead. This means tests on each item can be rerun many times, so they better be cheap. Determinism is also advised, even though my implementation caches the first value returned to avoid troubles.

Well, first a usage example you can actually run:

require "pathname"
files = Pathname("/etc").children
files.run_for_each_group{|x, xs, all|
  if all[x.directory?]
    puts "Subdirectories: #{xs*' '}"
  elsif all[x.symlink?]
    puts "Symlinks: #{xs*' '}"
  elsif all[x.size > 2**16]
    puts "Big files: #{xs*' '}"
  else
    puts "The rest: #{xs.size} files"
  end
}

      Doesn't it look a lot lot better than a long cascade of #partitions?

And now #run_for_each_group itself:

module Enumerable
  def run_for_each_group(expected=[], &blk)
    return if empty?
    xst, xsf = [], []
    each{|it|
      answers = expected.dup
      catch :item_tested do
        yield(it, self, proc{|v|
          if answers.empty?
            (v ? xst : xsf) << it
            throw :item_tested
          else
            answers.shift
          end
        })
        return
      end
    }
    xst.run_for_each_group([true, *expected], &blk)
    xsf.run_for_each_group([false, *expected], &blk)
  end
end

It shouldn't be that difficult to understand. expected tracks the list of expected test results for all items in the current collection. Now we iterate, passing each element, the entire group, and the all callback function.

The first few times all is called, it just returns the recorded answers - they're the same for every element. If all is called again after the recorded answers run out, we record its result, throw out of the block, and rerun it twice with expanded expectations.

On the other hand if we didn't get any calls to all other than those already recorded, it means we reached the action - the group it sees is every element with the same test history. This must only happen once per group, so we return from the function.

      Total number of block calls is - 1x for each action, 2x for directories, 3x for symlinks, 4x for big files, and also 4x for everything else. Avoiding these reruns would be totally possible with callcc - but it's rather ugly, and often these tests aren't an issue.

      So problem solved? Not really. I keep finding myself in situations where a new control structure would make a big difference, and there just doesn't seem to be any way of making it work in Ruby without enough boilerplate code to make it not worthwhile.

      I'll end this post with some snippets of code which are just not quite right. Any ideas for making them suck less?

      urls = Hash[{|line| id, url = line.split; [id.to_i, url]}]

      each_event{|type, *args|
        case type
        when :foo
          one, two = *args
          # ...
        when :bar
          one, = *args
          # ...
        end
      }

      if dir
        Dir.chdir(dir){ yield(x) }
      else
        yield(x)
      end

      Saturday, July 17, 2010

      Ruby is now more popular than C++

      啊?! by bc cf from flickr (CC-ND)

      There are many ways to measure popularity, and I'll take one proposed by C++ creator Bjarne Stroustrup. According to him there are only two kinds of languages - those that nobody uses and those that everybody bitches about - so counting Google results for "<language> sucks" is a perfectly good way of measuring popularity.

      I did an identical experiment exactly 4 years ago, so it's interesting to see what changed since then.
      • D 147000 - mostly bogus matches for "3D sucks" etc.
      • Java 56300
      • C 48900 - possibly also many bogus matches
      • PHP 34500
      • Ruby 25900
      • Ruby on Rails 18100 - included for comparison only
      • Scheme 14900 - my blog is #1, also many bogus matches
      • C++ 14000
      • Visual Basic 11600
      • Python 8930 
      • Perl 5450
      • Lisp 3510
      • C# 3310
      • Ada 1240
      • OCaml 1070
      • SML 784
      • Erlang 750
      • Cobol 641
      • Fortran 476
      • Haskell 416
      • Smalltalk 176
      • Prolog 161
      Some things just don't change. Ignoring queries with too many false positives, the list is dominated by Java and PHP, just as four years ago - with Java's lead now being a lot stronger. Most niche languages like OCaml, Smalltalk, and Prolog are still niche languages - although many get a lot of bitching these days (like OCaml's 17x increase).

      On the other hand some things changed. Perl used to be very high in the sucking charts - at about 15x as many sucks as Ruby and Python, but isn't anywhere close to the top now, losing almost half of the sucks in that time, as old ones die in link rot, and new ones stop being generated.

      The second biggest success story is Python, which sucks 12x as much now, finally overtaking Perl.

      But the biggest success is a spectacular explosion of popularity of Ruby. My first list was released only half a year after Rails 1.0, when many people were intrigued by Ruby, but few were actually using it. In those four years Ruby suckiness levels exploded 43x - not even counting Rails bitching, a lot of which is as much about Ruby as about Rails itself.

      Ruby is now a lot more popular than C++ - according to the very metric endorsed by C++ creator. What alternative explanation is there? That C++ is used a lot, it just happens to suck less? Come on.

      People complaining about scientific validity of this post are sent here.

      Another example of Ruby being awesome - %W

      cuteness by sparkleice from flickr (CC-NC-ND)

      And there I was thinking I knew everything about Ruby, at least as far as its syntax goes...

      As you might have figured out from my previous posts, I'm totally obsessed about string escaping hygiene - I would never send "SELECT * FROM reasons_why_mysql_sucks WHERE reason_id = #{id}" to an sql server even if I was absolutely totally certain that id is a valid integer and nothing can possibly go wrong here. Sure, I might be right 99% of the time, but it only takes a single such mistake to screw up the system. And not only with SQL - it's the same with generated HTML, generated shell commands and so on.

      And speaking of shell commands - the system function accepts either a string, which it then evaluates according to shell rules (big red flag), or a list of arguments which it uses to fork+exec right away. Of course we want the second one - except it's really goddamn ugly. Faced with a choice between this insecure but reasonable-looking way of starting MongoDB shard servers:

      system "mongod --shardsvr --port '#{port}' --fork --dbpath '#{data_dir}' \
      --logappend --logpath '#{logpath}' --directoryperdb"

      And this secure but godawful (to_s is necessary as port is an integer, and system won't take that):

      system *["mongod", "--shardsvr", "--port", port, "--fork",
      "--dbpath", data_dir, "--logappend",
      "--logpath", logpath, "--directoryperdb"].map(&:to_s)

      Even I have my doubts.

      And then I found something really cool in Ruby syntax that totally solves the problem. Now, I was totally aware of the %w[foo bar] syntax Ruby copied from Perl's qw[foo bar], which, while occasionally useful, is really little more than constructing a string and then calling #split on it.

      And I thought I was also aware of %W - which would obviously work just like %w except evaluating code inside. Except that's not what it does! %W[foo #{bar}] is not "foo #{bar}".split - it's ["foo", "#{bar}"]! And it uses a real parser of course, so you can use as many spaces inside that code block as you want.

      system *%W[mongod --shardsvr --port #{port} --fork --dbpath #{data_dir}
      --logappend --logpath #{logpath} --directoryperdb]

      There's nothing in Perl able to do that. Not only is it totally secure, it looks even better than the original insecure version, as you don't need to insert all those 's around arguments (which only half-protected them anyway, but were better than nothing), and you can break it into multiple lines without \s.

      %W always does the right thing - %W[scp #{local_path} #{user}@#{host}:#{remote_path}] will keep the whole remote address together - and if the code block returns an empty string or nil, you'll get an empty string there in the resulting array. I sort of wish there was some way of adding extra arguments with *args-like syntax like in other contexts, but %W[...] + args does exactly that, so it's not a big deal.
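      Here's a quick sanity check of those semantics (the port value is made up):

```ruby
port = 27017

plain  = %w[mongod --port #{port}]           # %w - no interpolation, '#{port}' stays literal
interp = %W[mongod --port #{port}]           # %W - interpolates, one word per element
spacey = %W[scp #{"my file.txt"} host:/tmp]  # interpolation never re-splits on spaces
```

      The last line is the crucial one - a space inside interpolated code stays inside a single array element, which is exactly what makes %W safe for shell arguments.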

      By the way, it seems to me that all % constructors undeservedly get a really bad reputation in the Ruby community as some sort of ugly Perl leftover. This is so wrong - what's ugly is the excessive escaping with \ which they help avoid. Which regexp for Ruby executables looks less bad - the one with way too many \/s - /\A(\/usr|\/usr\/local|\/opt|)\/bin\/j?ruby[\d.]*\z/, or the one which avoids them all thanks to %r - %r[\A(/usr|/usr/local|/opt|)/bin/j?ruby[\d.]*\z]?

      By the way - yes, I used []s inside even though they were also the outer delimiter. That's another great beauty of % constructions - if you delimit with some sort of braces like [], (), <>, or {} - they will only close once every matched pair inside is closed - so unlike traditional singly and doubly quoted strings, % literals can be nested infinitely deep without a single escape character! (Perl could do that one as well)
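      A quick illustration of the nesting (the regexp is the one from above):

```ruby
# brackets pair up inside %w[...], so no escaping is needed
nested = %w[plain [nested] words]

# same for %r[...] - the character class closes its own inner bracket
ruby_re = %r[\A(/usr|/usr/local|/opt|)/bin/j?ruby[\d.]*\z]
ruby_re =~ "/usr/bin/ruby1.9"   # matches at offset 0
```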

      And speaking of things that Ruby copied from Perl, and then made them much more awesome, here's a one-liner to truncate a bunch of files after 10 lines, with optional backups. Which language gets even close to matching that? ($. in both Perl and Ruby will keep increasing from file to file, so you cannot use that)

      ruby -i.bak -ple 'ARGF.skip if ARGF.file.lineno > 10' files*.txt

      Friday, July 16, 2010

      Arrays are not integer-indexed Hashes

      Cabooki by elycefeliz from flickr (CC-NC-ND)

      We use a separate Array type even though Ruby Hashes can be indexed by integers perfectly well (unlike Perl hashes which implicitly convert all hash keys to strings, and array keys to integers). Hypothetically, we could get rid of them altogether and treat ["foo", "bar"] as syntactic sugar for {0=>"foo", 1=>"bar"}.
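      Today's Ruby already lets you try the sugar-free version directly:

```ruby
arr  = ["foo", "bar"]
hash = {0=>"foo", 1=>"bar"}

hash[0]          # indexing works identically to arr[0]
hash[2] = "baz"  # and so does appending, by explicit key
```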

      Now there are obviously some performance reasons for this - these are mostly fixable and a single data structure can perform well in both roles. And it would break backwards compatibility rather drastically, but let's ignore all that and imagine we're designing a completely fresh language which simply looks a lot like Ruby.

      What would work

      First, a lot of things work right away like [], []=, ==, size, clear, replace, and zip.

      The first incompatibility is with each - for hashes it yields both keys and values, for arrays only values, and we'd need to decide one way or the other - I think yielding both makes more sense, but then there are all those non-indexable enumerables which won't be able to follow this change, so there are good reasons to only yield values as well. In any case, each_pair, each_key, and each_value would be available.
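      The current divergence is easy to see (hashes iterate in insertion order since Ruby 1.9):

```ruby
yielded_by_hash = []
{0=>"a", 1=>"b"}.each{|k, v| yielded_by_hash << [k, v]}  # yields pairs

yielded_by_array = []
["a", "b"].each{|v| yielded_by_array << v}               # yields values only
```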

      Either way, one more change would be necessary here - each and everything else would need to yield elements sorted by key. There are performance implications, but they're not so bad, and it would be a nicer API.

      Hash's methods keys, values, invert, and update all make perfect sense for Arrays. With keys sorted, first, last, and pop would work quite well. push/<< would be slightly nontrivial - but making it add #succ of the last key (or 0 for empty hashes) would work well enough.
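      A minimal sketch of that push rule on a plain Hash - hash_push is a made-up helper name, not part of Ruby:

```ruby
# push appends at succ of the largest key, or at 0 for an empty hash
def hash_push(h, value)
  h[h.empty? ? 0 : h.keys.max.succ] = value
  h
end

h = {}
hash_push(h, "foo")
hash_push(h, "bar")   # h is now {0=>"foo", 1=>"bar"}
```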

      Collection tests like any?, all?, one?, none? are obvious once we decide each, and so is count. map/collect adapts to hashes well enough (yielding both key and value, and returning new value).

      Array methods like shuffle, sort, sample, uniq, and flatten which ignore indexes (but not their relative positions) would do likewise for hashes, so flattening {"a"=>[10,20], "b"=>30} would result in [10,20,30] ("a" yields before "b").
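      Today's Hash#flatten keeps the keys, but the proposed keys-dropped behavior is a one-liner to emulate (flat_map needs Ruby 1.9+):

```ruby
h = {"a"=>[10, 20], "b"=>30}

# sort by key, drop keys, splice values - "a" yields before "b"
flat = h.sort.flat_map{|k, v| v}
# => [10, 20, 30]
```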

      Enumerable methods like min/max/min_by/max_by, find, find_index, inject would do likewise.

      include? checks values for Arrays and keys for hashes - we can throw that one out (or decide one way or the other, values make more sense to me), and use has_key?/has_value? when it matters.

      reverse should just return values, but reverse_each should yield real keys.

      I could go on like this. My point is - a lot of this stuff can be made to work really well. Usually there's a single behavior sensible for both Arrays, and Hashes, and if you really need something different then keys, values, or pairs would usually be a suitable solution.

      What doesn't work

      Unfortunately some things cannot be made to work. Consider this - what should be the return value of {0 => "zero", 1 => "one"}.select{|k,v| v == "one"}?

      If we treat it as a hash - let's say a mapping of numbers to their English names, there is only one correct answer, and everything else is completely wrong - {1=>"one"}.

      On the other hand if we treat it as an array - just an ordered list of words - there is also only one correct answer, and everything else is completely wrong - {0=>"one"}.

      These two are of course totally incompatible. And an identical problem affects a lot of essential methods. Deleting an element renumbers items for an array, but not for a hash. shift/unshift/drop/insert/slice make no sense for hashes, and methods like group_by and partition have two valid and conflicting interpretations. It is, pretty much, unfixable.
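      Current Ruby resolves this by keeping the two types separate, so both answers coexist (Ruby 1.9+ semantics, where Hash#select returns a Hash):

```ruby
h = {0=>"zero", 1=>"one"}.select{|k, v| v == "one"}
# hash semantics: the surviving pair keeps its key

a = ["zero", "one"].select{|v| v == "one"}
# array semantics: the survivor is renumbered down to index 0
```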

      So what went wrong? Thinking that Arrays are indexed by integers was wrong!

      In {0=>"zero",1=>"one"} association between keys and values is extremely strong - key 0 is associated with value "zero", and key 1 with value "one". They exist as a pair and everything that happens to the hash happens to pairs, not to keys or values separately - there are no operations like insert_value, delete_value which would just shift remaining values around from one key to another. This is the nature of hashes.

      Arrays are not at all like that. In ["zero", "one"] association between 0 and "zero" is very weak. The real keys are not 0, and 1 - they're two objects devoid of any external meaning, whose only property is their relative partial order.

      To implement array semantics on top of hashes, we need a class like, less_than=nil). Then a construction like this would have the semantics we desire.

      arr = {}
      arr[, nil)] = "zero"
      arr[, nil)] = "one"

      If we use these instead of integers, hashes can perform all array operations correctly.

      # shift
      arr.delete(arr.first_key)
      # unshift
      arr[, arr.first_key)] = "minus one"
      # select - indexes for "zero" and "two" in result have correct order
      ["zero", "one", "two"].select{|key, value| value != "one"}
      # insert - nth_key only needs each
      arr[, arr.nth_key(2))] = "one and a half"

      And so the theory is satisfied. We have a working solution, even if a highly impractical one. Of course all these Index objects are rather hard to use, so the first thing we'd do is subclass Hash so that arr[i] would really mean arr[arr.nth_key(i)] and so on, and there's really no point yielding them in #each and friends... oh wait, that's exactly where we started.

      In other words, unification of arrays and hashes is impossible - at least unless you're willing to accept a monstrosity like PHP where numerical and non-numerical indexes are treated differently, and half of array functions accept a boolean flag asking if you'd rather have it behave like an array or like a hash.

      Random sampling or processing data streams in Ruby

      7 14 10 by BernieG10 from flickr (CC-NC-ND)

      It might sound like I'm tackling a long solved problem here - sort_by{rand}[0, n] is a well known idiom, and in more recent versions of Ruby you can use even simpler shuffle[0, n] or sample(n).

      They all suffer from two problems. The minor one is that quite often I want elements in the sample to be in the same relative order as in the original collection (which in no way implies sorted) - this can be dealt with by a Schwartzian transform into [index, item] space, sampling that, sorting results, and transforming back to just item.
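      The order-preserving trick looks like this (sample(n) needs Ruby 1.9+; the data is made up):

```ruby
data = %w[a b c d e f g h]

# move to [index, item] space, sample there, sort by index, strip the index
sample = data.each_with_index.map{|it, i| [i, it]}.sample(3).sort.map{|i, it| it}
# random, but always in the original relative order
```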

      The major problem is far worse - for any of these to work, the entire collection must be loaded to memory, and if that was possible, why even bother with random sampling? More often than not, the collection I'm interested in sampling is something disk-based that I can iterate only once with #each (or twice if I really really have to), and I'm lucky if I even know its #size in advance.

      By the way - this is totally unrelated, but I really hate the #length method with passion - collections have sizes, not "lengths" - for a few kinds of collections we can imagine them arranged in a neat ordered line, and so their size is also a length, but it's really lame to name a method after a special case instead of the far more general "size" - hashtables have sizes not lengths, sets have sizes not lengths, and so on - #length should die in a fire!

      When size is known

      So we have a collection we can only iterate once - for now let's assume we're really lucky and we know exactly how many elements it has - this isn't all that common, but it happens every now and then. As we want n elements out of size, probability of each element being included is n/size, and so select{ n > rand(size) } will nearly do the trick - even keeping samples in the right order... except it will only return approximately n elements.
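      A quick empirical check of that - approx_sample is just my name for the idiom above; it preserves order, but only hits n on average:

```ruby
def approx_sample(collection, n, size)
  collection.select{ n > rand(size) }
end

data = (1..1000).to_a
sizes = (1..200).map{ approx_sample(data, 10, data.size).size }
avg = sizes.inject(:+) / 200.0
# avg hovers near 10, but individual runs vary quite a bit
```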

      If we're sampling 1000 out of a billion we might not really care all that much, but it turns out it's not so difficult to do better than that. Sampling n elements out of [first, *rest] collection neatly reduces to: [first, *rest.sample(n-1)] with n/size probability, or rest.sample(n) otherwise. Except Ruby doesn't have decent tail-call optimization, so we'll use counters for it.

      module Enumerable
        def random_sample_known_size(wanted, remaining=size)
          if block_given?
            each{|it|
              if wanted > rand(remaining)
                yield(it); wanted -= 1
              end
              remaining -= 1
            }
          else
            rv = []
            random_sample_known_size(wanted, remaining){|it| rv.push(it) }
            rv
          end
        end
      end

      This way of sampling has an extra feature that it can yield samples one at a time and never needs to store any in memory - something you might appreciate if you want to take a couple million elements out of 10 billion or so, and you will not only avoid loading them to memory, you will be able to use the results immediately, instead of only when the entire input finishes.

      This is only possible if collection size is known - if we don't know if there's 1 element ahead or 100 billion, there's really no way of deciding what to put in the sample.

      If you cannot fit even the sample in memory at once, and don't know collection size in advance - it might be the easiest thing to iterate twice, first to compute the size, and then to yield random records one at a time (assuming collection size doesn't change between iterations at least). CPU and sequential I/O are cheap, memory and random I/O are expensive.

      Russian Blue by Adam Zdebel from flickr (CC-NC-ND)

      When size is unknown

      Usually we don't know collection size in advance, so we need to keep a running sample - initialize it with the first n elements, and then for each element that arrives replace a random one from the sample with probability n / size_so_far.

      The first idea would be something like this:

      module Enumerable
        def random_sample(wanted)
          rv = []
          size_so_far = 0
          each{|it|
            size_so_far += 1
            j = rand(size_so_far)
            rv.delete_at(j) if wanted == rv.size and wanted > j
            rv.push(it) if wanted > rv.size
          }
          rv
        end
      end

      It suffers from a rather annoying performance problem - we're keeping the sample in a Ruby Array, and while they're optimized for adding and removing elements at both ends, deleting something from the middle is a O(size) memmove.

      We could replace rv.delete_at(j); rv.push(it) with rv[j] = it to gain performance at the cost of item order in the sample... or we could do that plus a Schwartzian transform into [index, item] space to get correctly ordered results fast. This only matters once sample size reaches tens of thousands; before that the brute memmove is simply faster than evaluating extra Ruby code.

      module Enumerable
        def random_sample(wanted)
          rv = []
          size_so_far = 0
          each{|it|
            size_so_far += 1
            j = wanted > rv.size ? rv.size : rand(size_so_far)
            rv[j] = [size_so_far, it] if wanted > j
          }
{|idx, it| it}
        end
      end

      This isn't what stream processing looks like!

      The algorithms are as good as they'll get, but the API is really not what we want. When we actually do have an iterate-once collection, we usually want to do more than just collect a sample. So let's encapsulate such a continuously updated sample into a Sample class:

      class Sample
        def initialize(wanted)
          @wanted = wanted
          @size_so_far = 0
          @sample = []
        end
        def add(it)
          @size_so_far += 1
          j = @wanted > @sample.size ? @sample.size : rand(@size_so_far)
          @sample[j] = [@size_so_far, it] if @wanted > j
        end
        def each
          @sample.sort.each{|idx, it| yield(it)}
        end
        def total_size
          @size_so_far
        end
        include Enumerable
      end

      It's a fully-featured Enumerable, so it should be really easy to use. #total_size returns the count of all elements seen so far - calling that #size would conflict with the usual meaning of the number of times #each yields. You can even nondestructively access the sample, and then keep updating it - usually you wouldn't want that, but it might be useful for scripts that run forever and periodically save partial results.

      To see how it can be used, here's a very simple script, which reads a possibly extremely long list of URLs, and prints a sample of 3 by host. By the way, notice the autovivification of Samples inside the Hash - it's a really useful trick, and Ruby's autovivification can do a lot more than Perl's.

      require "uri"
      sites ={|ht,k| ht[k] =}
      STDIN.each{|line|
        url = line.chomp
        host = URI.parse(url).host rescue next
        sites[host].add(url)
      }
      sites.sort.each{|host, url_sample|
        puts "#{host} - #{url_sample.total_size}:"
        url_sample.each{|u| puts "* #{u}"}
      }

      So enjoy your massive data streams.