Thursday, 24 October 2019

Unicode, Currency Symbols and Cultural Hegemony

The Unicode character standard defines characters for every written language on the planet, a huge range of emoticons, and many other things besides. And, naturally, symbols for all the currencies in use, and quite a few that aren't.

In the beginning, computers represented characters using various manufacturer-dependent 6-bit codes, allowing for 64 characters. That's enough for a single case of letters, numerals, common punctuation, and a few special characters like space and newline. That's why old computers printed everything in CAPITALS - they had no way to distinguish between upper and lowercase.

In the 1960s, IBM introduced the System/360, and with it the 8-bit byte, which is now completely taken for granted. (Before that, computers came with all kinds of strange word lengths - the Elliott 803 from 1962 or so had a 39-bit word.) That allowed the use of 8-bit characters. But at first, one bit was reserved for error detection (the "parity bit"), so only 7 bits were used, providing for 128 characters. That allowed lower case letters to be added, and some additional punctuation such as '{' and '}'. There was also a range of 32 mysterious control characters with exotic sounding names like End of Transmission Block, most of which were never really used.
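To make the parity bit concrete, here is a minimal Python sketch. It assumes even parity stored in the top bit of each byte; real hardware varied (odd parity was also common), and the function name is mine, purely for illustration.

```python
def with_even_parity(code: int) -> int:
    """Return the 8-bit byte for a 7-bit code, with an even-parity bit on top.

    Assumption: the parity bit is the most significant bit and is chosen so
    that the total number of 1 bits in the byte is even.
    """
    if not 0 <= code < 128:
        raise ValueError("only 7-bit codes (0..127) can carry a parity bit")
    ones = bin(code).count("1")   # number of 1 bits in the 7-bit code
    parity = ones % 2             # 1 if we need an extra 1 bit to make it even
    return code | (parity << 7)


# 'A' is 0x41 (two 1 bits), so the parity bit stays 0.
# 'C' is 0x43 (three 1 bits), so the parity bit is set, giving 0xC3.
for ch in "AC":
    print(ch, hex(with_even_parity(ord(ch))))
```

With the top bit spent this way, only the lower 7 bits carry the character, which is where the limit of 128 comes from.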

The 7-bit character set was rapidly standardised by the US in 1963, as ASCII (American Standard Code for Information Interchange), and adopted internationally by the International Organization for Standardization as ISO 646. Because it was a US standard, it did not allow for the needs of other countries. For example, it included no accented characters - it couldn't, because there wasn't room. Many countries therefore had their own national variant of ISO 646. In the UK, the rarely used character '#' was replaced by the pound symbol, '£'. Several countries replaced lesser-used punctuation with the accented letters they needed. For example, in France, 'é' replaced '{'. (There is a legacy of this in the C programming language, which allows odd-looking trigraphs such as '??<' as an alternative to '{', for people who don't have '{' on their keyboard because it has been replaced by an accented letter.)
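The effect of national variants is easy to simulate. The sketch below is illustrative only: the override tables are hypothetical and deliberately partial, containing just the two substitutions mentioned above, whereas the real national variants changed several more positions.

```python
# Partial, illustrative tables: only the substitutions described in the text.
# Keys are 7-bit code values; anything not listed is treated as plain ASCII.
OVERRIDES = {
    "US": {},
    "UK": {0x23: "£"},   # '#' position shows a pound sign
    "FR": {0x7B: "é"},   # '{' position shows e-acute
}


def decode_iso646(data: bytes, variant: str) -> str:
    """Decode 7-bit bytes as a given (simplified) national variant would."""
    table = OVERRIDES[variant]
    return "".join(table.get(b, chr(b)) for b in data)


raw = b"#5 {x}"  # the same bytes, read in three countries
for variant in ("US", "UK", "FR"):
    print(variant, decode_iso646(raw, variant))
# US #5 {x}
# UK £5 {x}
# FR #5 éx}
```

The bytes never change; only the interpretation does, which is why a C program could look rather odd on a French terminal.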

This was all a bit unsatisfactory if files needed to be moved between countries, which would result in characters changing at random. There was a move to abandon the parity bit and use it to create an additional 128 code points for characters. This resulted in the Latin-1 character set, or ISO 8859-1, which first appeared in 1987. With this, all the major Latin-alphabet languages could be written and, more importantly, moved between countries without turning to gibberish.

None of this was any use for languages which do not use the Latin alphabet. Similar standards were defined for Cyrillic and for Greek. For Chinese and other Asian languages, 8 bits is nowhere near enough to represent all the characters - modern Chinese needs around 6,000 characters, and a great many more if historical documents are to be encoded. So each country - China, Japan and Korea - developed its own encoding for the Chinese-based characters.

Starting in the 1980s, there was a move to come up with a single character set that could represent all languages and other uses. Initially it was thought that a 16-bit character set, allowing 65,536 combinations, would cover this, but in the end representing all languages required more than that.
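One way to see that 16 bits ran out is to look at UTF-16, which encodes characters beyond the first 65,536 code points as a pair of 16-bit units (a "surrogate pair"). This is a small sketch using Python's built-in codecs; the helper name is my own.

```python
def utf16_units(ch):
    """Return the list of 16-bit code units UTF-16 uses for one character."""
    data = ch.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


rupee = "\u20b9"         # INDIAN RUPEE SIGN, fits in 16 bits
grin = "\U0001F600"      # an emoji, code point above 0xFFFF

print([hex(u) for u in utf16_units(rupee)])  # ['0x20b9']
print([hex(u) for u in utf16_units(grin)])   # ['0xd83d', '0xde00']
```

The rupee sign needs a single unit, while the emoji needs two, so a fixed 16-bit character is no longer enough.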

Unicode, or ISO 10646 (the similarity in the number is not a coincidence), is the result of all this. It can represent almost every language with a written representation, and a great many other characters as well. Maybe surprisingly, there is a lot of controversy around it. As one example, each user of Chinese characters has taken its own steps through history to simplify writing. The same ideogram (i.e. a character with a meaning rather than a sound) is often written differently in Taiwan, China and Japan. For example, means 'to speak' in Japanese, but the same ideogram is written in China, with the left radical greatly simplified. Should these be considered as different glyphs representing the same character, determined solely by the font in use, just like Comic Sans and Times Roman Italic, or are they different characters? This is called the Han Unification Controversy.

Anyway, back to currency symbols. ASCII defines just one of them, '$'. Other countries used various code points for their own currency symbol. Latin-1 added a few more: £, ¥, ¢. You may notice a theme here: nearly all currency symbols consist of a (maybe) mnemonic letter, with a line through it. (S for 'scudero', the original currency in the Americas, and L for 'libra', Latin for pound). For example the symbol for the Thai Baht is ฿.

I ran into this on my first visit to India recently. The character for the Indian Rupee is ₹. This is the Hindi character र, pronounced 'ra', with a line through it. The character is actually very recent, adopted only in 2010 following a national competition to replace the traditional representation Rs.

When I saw this I was naturally interested to see if there was a Unicode representation. Obviously there is, or I wouldn't have been able to write the previous paragraph. Where currency symbols show up seems to be somewhat random. The character for Thai Baht is indeed mixed in with the characters for Thai script. But the Rupee symbol, which you might reasonably expect to find with the Devanagari (Hindi) character set, is in a page dedicated to currency symbols.
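You can check where each symbol lives with Python's standard unicodedata module. This sketch assumes the Currency Symbols block runs from U+20A0 to U+20CF; the helper names are mine.

```python
import unicodedata

# Assumed range of the Unicode "Currency Symbols" block.
CURRENCY_BLOCK = range(0x20A0, 0x20D0)


def describe(ch):
    """Return (code point as 'U+XXXX', official Unicode name) for a character."""
    return ("U+%04X" % ord(ch), unicodedata.name(ch))


def in_currency_block(ch):
    """True if the character sits in the dedicated Currency Symbols block."""
    return ord(ch) in CURRENCY_BLOCK


for ch in ("$", "\u00a3", "\u0e3f", "\u20b9"):  # dollar, pound, baht, rupee
    print(describe(ch), in_currency_block(ch))
```

Of the four, only the rupee sign falls inside the currency block. The dollar is in ASCII, the pound is in the Latin-1 range, and the baht sits among the Thai letters.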

The page contains some truly amazing symbols. There are symbols you've probably never seen for currencies you've probably never heard of, like the Ukrainian Hryvnia which looks like a dollar sign adapted for one of those "fill in the missing square" intelligence tests, or the Kazakhstan Tenge which is identical to the Japanese symbol for a public phone. Others have surely never been used, like the Livre Tournois ₶, a currency used in France from the 13th to the 16th century, or the German Penny ₰. One gets the feeling that once they had a whole 256-character code page they were determined to fill it.

My favourites among them, just because they are so visually complex, are these two. They're shown as images because very few fonts have them - inserting the Unicode characters usually results in a blank space or an empty box. The one on the left is the Spesmilo. It's an artificial currency invented alongside Esperanto in the early 20th century, with similar idealist objectives. It has never been implemented as an actual currency of any kind. But they certainly got creative with the symbol.

The second character is the Nordic Mark. This was once a real currency, used in Denmark (which at the time included Norway) in the 17th and 18th centuries. It seems unlikely that the character is much used today, especially since most systems can't display it.

There's a certain amount of cultural hegemony in all this. In theory all Unicode code points are equal. But in practice there is still an important practical advantage to being in the original ASCII 7-bit character set. The Python language copes very badly with Unicode, throwing errors all over the place when you try to use anything outside ASCII. (The excuse for Python 3 was to fix these problems, but actually it is even worse). So there is a distinct advantage to having your currency symbol be there, which is the case only for $. I'm not aware of any practical advantage of being in ISO 8859 (Latin-1), but it's still "less Unicode" than being in some remote code page along with the Spesmilo and the Nordic Mark.
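The pecking order shows up directly in the bytes. This is a small sketch using Python 3's built-in codecs, comparing how the dollar, pound and rupee signs fare under ASCII, Latin-1 and UTF-8; the helper name is mine.

```python
def try_encode(ch, codec):
    """Return the encoded bytes, or None if the codec cannot represent ch."""
    try:
        return ch.encode(codec)
    except UnicodeEncodeError:
        return None


symbols = {"dollar": "$", "pound": "\u00a3", "rupee": "\u20b9"}

for name, ch in symbols.items():
    print(
        name,
        try_encode(ch, "ascii"),     # only the dollar survives
        try_encode(ch, "latin-1"),   # dollar and pound survive
        try_encode(ch, "utf-8"),     # all three, at 1, 2 and 3 bytes
    )
```

The dollar sign is a single byte everywhere. The pound is one byte in Latin-1 and two in UTF-8, and the rupee needs three bytes in UTF-8 and cannot be written in the older encodings at all.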
