Benjamin Waters

Unicode


The Code Tables

Unicode: Every Character of Every Language a Unique Number

UTF-8: These Numbers are Not Stored the Way You Expect


The Code Tables

Each of these links takes you to sixteen tables of 256 codes each, laying out the Unicode standard in order. Numbering is in hexadecimal.

0000–0FFF 1000–1FFF 2000–2FFF 3000–3FFF
4000–4FFF 5000–5FFF 6000–6FFF 7000–7FFF
8000–8FFF 9000–9FFF A000–AFFF B000–BFFF
C000–CFFF D000–DFFF E000–EFFF F000–FFFF

Unicode: Every Character of Every Language a Unique Number

Unicode is the way all computer systems should be assigning numbers to letters.

A computer represents letters by assigning every letter a number. Once upon a time, it was thought convenient to represent each letter with a single byte. Being an 8-digit binary number, a byte can represent at most 2^8 = 256 letters. But in those days, the last bit of the byte was used for error checking, which left just 2^7 = 128 possible positions, out of which the standard ASCII character set was formed. Over the years, those who needed special characters started dropping the error-checking bit and mapping their characters onto the other 128 positions, leading to the rather chaotic situation that the numbers 128–255 meant different things in different places.

Unicode sets itself the task of resolving this chaos by giving a unique number to every character of every language. The original idea was simply to use two bytes for each character rather than one, giving 2^16 = 65 536 code points rather than 256. The situation is in fact more complicated, for two reasons. First, most of these code points have now been filled, thanks largely to Chinese, and the standard is expanding beyond these first 65 536 places. Second, as I will now explain, Unicode letters tend not to be represented simply as two bytes by your computer.
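To make the arithmetic concrete, here is a minimal Python sketch (Python is my choice for illustration, not something the tables depend on). It computes how many values fit in 7, 8, and 16 bits, and looks up the Unicode numbers of three sample characters chosen arbitrarily.

```python
# Distinct values representable with 7, 8, and 16 bits.
capacity = {bits: 2 ** bits for bits in (7, 8, 16)}
print(capacity)  # {7: 128, 8: 256, 16: 65536}

# ord() gives a character's Unicode number; chr() goes the other way.
# Sample characters (arbitrary picks): Latin A, e with acute accent,
# and one Chinese character, written as escapes to avoid encoding surprises.
samples = ["A", "\u00e9", "\u4e2d"]
for ch in samples:
    print(f"hex {ord(ch):04X} = decimal {ord(ch)}")

# Only the first of these fits into ASCII's 128 positions.
print([ord(ch) < 128 for ch in samples])  # [True, False, False]
```

The hexadecimal form is the one used in the code tables above, so the third sample would be found in the 4000–4FFF range.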

UTF-8: These Numbers are Not Stored the Way You Expect

UTF-8 is presently the best way for computer systems to store the numbers that represent the letters.

You would expect a character whose unique number is 40 000 to be represented simply as two bytes which, read in a row, tally to the number 40 000. On basically all Unix systems, and also elsewhere, this is not the case. If we simply started using 2 bytes for every letter instead of 1, every text file created to date using ASCII would have to be converted, and every one of them would double in size. On systems like Unix, where the whole system and all of its source code consist of text, this is unacceptable. Luckily, Ken Thompson devised a method, UTF-8, to represent each of these 65 536 characters in such a way that all of the ASCII could remain unchanged. Basically, we use the first bit of the byte to tell the computer whether this byte is a simple old-fashioned ASCII byte or a newfangled Unicode byte that needs to be seen as part of a multi-byte sequence. The computer then treats old-fashioned ASCII the old-fashioned way, but matches Unicode bytes up with their partners. This means that if you hexdump a file, the ASCII characters will have a very unsurprising representation, but the higher Unicode characters will have quite complicated representations. If you want to understand all of this in real detail, read the UTF-8 and Unicode FAQ.
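A short Python sketch shows what this looks like for the character numbered 40 000. The helper names utf8_hex and is_plain_ascii are my own, invented for this illustration; only str.encode, ord and chr come from Python itself.

```python
def utf8_hex(ch):
    """Return the UTF-8 bytes of one character as space-separated hex."""
    return " ".join(f"{b:02x}" for b in ch.encode("utf-8"))

def is_plain_ascii(byte):
    """True if the first (highest) bit is 0, so the byte stands alone as ASCII."""
    return (byte & 0x80) == 0

ch = chr(40000)  # the character whose number is 40 000 (hex 9C40)

print(f"{40000:04x}")  # 9c40: what two plain bytes would hold
print(utf8_hex(ch))    # e9 b1 80: what UTF-8 actually stores
print(utf8_hex("A"))   # 41: ASCII is left unchanged

# In a mixed string, the first bit separates ASCII bytes from the rest.
mixed = ("A" + ch).encode("utf-8")
print([is_plain_ascii(b) for b in mixed])  # [True, False, False, False]
```

So a hexdump of a file containing this character would show the three bytes e9 b1 80 rather than 9c 40, while every ASCII letter still appears as the single byte it always was.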

Benjamin Waters | 2006-03-15