Wednesday, December 10, 2008

Funny characters in Unicode

This post is a blogified version of a lightning talk I gave at BarCamp London 5. It was inspired by Chris Ball's Favourite Unicode Codepoints post. It's going to be in a weird talk/blogpost hybrid form that I hope my readers will excuse.

First, I want to say that this talk is not going to convey any useful information whatsoever. You won't learn anything about internationalization, or anything else from it. I'm doing it just because it's going to be fun and awesome.



First, the famous mirror trick, where text can be seen upside down, or mirrored left to right. None of these are dedicated Unicode characters like "mirrored e" or "upside down a". They're just ordinary characters that happen to look like that - for example "upside down p" (as in pet) is obviously "d" (as in dog). If there's no good Latin letter, a letter from another script is used, such as Cyrillic or the IPA phonetic alphabet. It will be more or less noticeable depending on your font.
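The trick is easy to sketch in Python: reverse the string, then swap each letter for a lookalike. The FLIP table and the flip function below are my own hand-picked illustration covering only a few lowercase letters, not a complete or standard mapping.

```python
# Hypothetical lookalike table: each key, rotated 180 degrees,
# resembles its value. Only a handful of letters are covered.
FLIP = {
    "a": "\u0250",  # LATIN SMALL LETTER TURNED A (from IPA)
    "e": "\u01dd",  # LATIN SMALL LETTER TURNED E
    "b": "q", "q": "b",
    "d": "p", "p": "d",
    "n": "u", "u": "n",
    "o": "o", "s": "s", "x": "x", "z": "z",
}


def flip(text):
    """Return text as it would look rotated 180 degrees.

    Characters missing from the table are passed through unchanged.
    """
    return "".join(FLIP.get(ch, ch) for ch in reversed(text))


print(flip("pond"))  # puod
```

Letters without an entry are left alone, so a real generator needs a much bigger table, which is where the Cyrillic and IPA borrowings come in.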



Here's a real Unicode character - Skull and Crossbones, arrr! It's used as a danger signal, so it's arguably common enough for inclusion in Unicode.



This one I totally don't get. It's just a random icon that somehow got into Unicode. Unicode is huge, so they have very low standards for inclusion. Maybe it was in Microsoft Wingdings or something like that and they thought that was a good enough reason to include it.



I half-get this one. The top three lines are the Japanese Post symbol. Where the rest of the face comes from and how it got into Unicode is a mystery to me. It was probably included in some JIS standard as a joke, and Unicode copied it, or something along these lines.



Operators from the APL programming language got into Unicode too. APL is like the Perl of the 1960s. This operator doesn't feel too good because it has to program in APL.



It's called Arabic ligature Uighur Kirghiz yeh with hamza above with alef maksura isolated form, and it's exactly what it says it is. It looks rather ordinary for this list, but it might be the character with the longest name.
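The "longest name" claim can be checked with a rough sketch using Python's standard unicodedata module. The helper name longest_named_characters is made up for this example, and the answer depends on the Unicode version bundled with your Python, so a newer release may crown a different winner.

```python
import sys
import unicodedata


def longest_named_characters(limit=1):
    """Return (name, code point) pairs with the longest names, longest first."""
    named = []
    for cp in range(sys.maxunicode + 1):
        # unicodedata.name returns the default ("") for unnamed code points.
        name = unicodedata.name(chr(cp), "")
        if name:
            named.append((len(name), name, cp))
    named.sort(reverse=True)
    return [(name, cp) for _, name, cp in named[:limit]]


UIGHUR = ("ARABIC LIGATURE UIGHUR KIRGHIZ YEH WITH HAMZA ABOVE "
          "WITH ALEF MAKSURA ISOLATED FORM")

print(len(UIGHUR), UIGHUR)
for name, cp in longest_named_characters(3):
    print(f"U+{cp:04X} ({len(name)} chars) {name}")
```

The scan walks every code point, so it takes a moment, but it needs nothing beyond the standard library.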



Another Arabic one. Most ligatures are for just 2 or 3 characters, but the compatibility decomposition of this one is a whopping 18 characters. It means something like "May Allah bless him and grant him peace" and is used when the Prophet Muhammad is mentioned. By the way I had a really funny picture of Muhammad that I wanted to put here, but I somehow cannot find it.
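Here is a small sketch for counting the pieces yourself, assuming the character meant here is U+FDFA (the picture did not survive, so that code point is my assumption). The 18-character expansion is a compatibility decomposition, so NFKD is the normalization form to ask for; the canonical form NFD leaves the ligature untouched.

```python
import unicodedata

# Assumed code point: ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM
LIGATURE = "\ufdfa"

# NFKD applies compatibility decompositions; NFD applies canonical ones only.
expanded = unicodedata.normalize("NFKD", LIGATURE)

print(unicodedata.name(LIGATURE))
print(len(expanded))  # 18
for ch in expanded:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```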



How many loops are there?



This letter is very spidery, so you'd better be careful or it will bite you.



Sometimes it's not enough to be greater than, or even much greater than something else. Oh no, you need to be very much greater than. I think TeX is spoiling mathematicians and they come up with way too many symbols, and then we have to support them.



The polar opposite of the previous character. It's neither greater than nor less than. We kinda have a symbol for that already - U+003D EQUALS SIGN. OK, I know it's about partial orders, and it means that two objects cannot be compared, but it's not any less funny for knowing that.



This is a very sad symbol. Not only is its heart heavy, it's also black. Is it a waste of a codepoint or what? It's just a random icon, not a meaningful "character".



That's my personal favorite for the "worst waste of a codepoint" award. Not only is "Floral Heart Bullet" not a character, they even included a reversed rotated version of it in Unicode. It's an icon, not a character.



We really need a punctuation mark that says "WTF". This entire list is one big interrobang use case, am I right?


The last one is not a character, but the entire Tibetan script. It looks absolutely beautiful.

If you have any questions related to this talk/blogpost, just put them in the comments.


Anonymous said...

Apparently the snowman's from a legacy character set. Which one, I don't know.

Anonymous said...

Also, check out the Arabic Letter Teh (U+062A). Something shocking happens and it turns into Arabic Letter Teh Marbuta (U+0629). Maybe there is a close call in a soccer game, because Arabic Letter Teh Marbuta GOAAAAAL!!! (U+06C3) looks quite similar.

Finally, Arabic Letter Teh With Ring (U+067C) has various uses, even if you don't read and write Pashto.

Anonymous said...

The floral heart bullet is actually called "the Aldus leaf." It's one of the oldest known ornaments used in printing. It has great historical and iconic value, and is by no means a waste. In fact the Aldus leaf can be seen as a symbol for the art of printing itself.

Jordan Bettis said...

One of the design requirements of Unicode is that it be "round-trip compatible" with every crappy legacy encoding ever used seriously.

What that means is that you can take some knee-biting horrible encoding like EBCDIC (take your pick of the variant) and you can take text in that encoding, translate it into Unicode, then translate it *back* into EBCDIC and there will be enough information to reproduce the *exact same* EBCDIC code-points.

To do that, Unicode must include every silly, stupid character ever used in every obscure, local encoding out there. It's somewhat unfortunate, yes, but the alternative is to break round-trip compatibility.

The snowman and the monkey both come from the land of hello-kitty. Which one of Japan's several incompatible legacy encodings, I'm not sure. Probably the JIS family.

Anonymous said...

The funky changes mentioned by an earlier commentator in regards to the Arabic Tah and Tah Marbuta are actually just a functional representation of an Arabic oddity. When there is a "Tah Marbuta" placed at the end of a word, it generally functions as an "Ah" sound. However, if a possessive or other conjunction is added onto the end of the word, that "Tah Marbuta" turns into a normal "Tah" and assumes the normal "Tah" sound and functions. Oh how I love the Arabic.

Anonymous said...

Just wanted to point out that APL is still in daily use and not some dead programming language from the 1960s. The inclusion of the APL character set in Unicode is really necessary for APL programmers - it's our ordinary working alphabet.

taw said...

Anonymous: APL is a dead language from 1960s. There are still a few leftover systems written in APL, just like there are still some vacuum tubes, horse buggies, and typewriters in use, but they're all dead technologies for practical purposes.

Anonymous said...

I'm sorry you think that. It's the number one problem that APL faces - because it's been around a long time people think it's out-dated.

I know and use lots of computer languages - C#, Java, Objective C, Ruby, etc. For some jobs APL is still my language of choice.

You should take time to find out more about APL.

Anonymous said...

No, taw, APL is not a dead language from the 1960s. The character in question was introduced and used in the language from the 1980s. APL may have been in its prime back then, but like Mary Queen of Scots, it's not dead yet.

Anonymous said...

I would appreciate you not mentioning that you found a funny picture of Muhammad. It is very offensive to people as a whole.

Nelson said...

You say "Unicode is huge, so they have very low standards for inclusion.". But that's not true, in fact Unicode has very high standards for inclusion with some remarkably erudite and detailed discussion. It's not a perfect process, but by no means do they just include every goofy character set they find.

Calculator Ftvb said...

U+FDFD ﷽ (a ligature of "ﻢﻴﺣﺮﻟﺍ ﻢﺴﺑ") has to be one of the most awesome Unicode characters. However, it is hard to find a font supporting it. I have found only three fonts supporting it: Nafees Nastaleeq, GNU Unifont, and PakType Naskh. Though GNU Unifont looks really bad with it.

Anonymous said...

I appreciate you mentioning that you found a funny image of Mohammad. It made someone overstate the severity of the infraction by stating that it is "very offensive to people as a whole."

drq said...

FLORAL HEART BULLET, REVERSED ROTATED is actually a typographic symbol used in French writing (mostly academic and in belles-lettres). So it's not as useless as it may appear:)

Anonymous said...

I have been trying to write something that you would understand, as you are non-Muslim, to express to you how disappointed I was when I read your comment about the funny pic you found of the prophet Mohamed.
So, at least out of respect for other people's point of view, don't make such comments again, please.
We Muslims do respect the other prophets and acknowledge them. We don't make funny comments about them or take them as a subject of cartoons and jokes. They are prophets, which means God chose them among all other people to deliver his message. If God chose them to deliver his message to us, how could one underestimate them and make cartoons and jokes about them??!!

taw said...

Anonymous: Making fun of religion has been an established part of Western culture at least since the Enlightenment - people have been making fun of religion, mostly of the Christian religion though others are not spared either, for a very long time.

You should respect our culture, including our custom of making fun of different religions. We don't force you to make or read any cartoons or jokes yourself.

muman613 said...

I want to know what the heck a character with the name of 'allah' is doing in the character set. Islam is a violent religion which has constantly attacked other beliefs since the illiterate 'prophet' mohamud appeared. See how these muslims will complain about pictures of mohamud and yet they kill each other every day, mutilate and rape their own women, and engage in deceit in order to spread their sick religion.

Anonymous said...

grow up, muman. if you don't know what the hell you're talking about, just keep quiet

dabid said...

@muman613 If you really want to know, it's because simple text processors don't know to automatically put a shadda (for gemination) and an alif (for vocalization) on top of the second lam.

dabid said...

Oh and Islam prohibits all pictures of people, so you can imagine exactly how offended they get at a picture of Muhammad.

Anonymous said...

How do you appreciate something that is very offensive to people

Anonymous said...

Aristocratic & educated westerners are known to uphold a level of respect for other religions & cultures. What you said makes no sense at all. Westerners are known to be very respectful people, maybe you are an exception.

Anonymous said...

i am very sorry, but i am missing U+1F4A9 - the pile of Poo
this is too fun to keep it for yourself.