Let's talk about unicode

18 Dec 2017

[ tech   ]


What is unicode?

According to Google Dictionary:

an international encoding standard for use with different languages and scripts, by which each letter, digit, or symbol is assigned a unique numeric value that applies across different platforms and programs.

To understand what that means, let’s start from the beginning:

We know how computers understand things, by reading bits, 1s and 0s, on and off switches. We can essentially think about computers as huge arrays of lightbulbs which are switching on and off constantly.

In the history of information technology, there have been plenty of heated arguments about which standards we should be collectively using to represent data, and we somehow agreed to define a byte as an array of 8 bits, no more, no less.

So in order to represent a number to a computer, for example the number ‘15’, we would have the sequence ‘00001111’.

Of course working with 1s and 0s is extremely tedious, boring and error-prone especially for us human beings, so we designed several conventions to try and simplify these bytes into lists of numbers/characters easily accessible for humans.

The most popular of these conventions ended up being the American Standard Code for Information Interchange (or ASCII for friends): The standard allows for letters, both small and capital, to be mapped to an ASCII table in our computer, for easy translation from letter to byte.

For example the letter M for Michael, is in bits 01001101 or in ASCII code 077. So if I wanted to encode my full name Michael, I would have to string together a series of ASCII codes as such:

077, 105, 099, 104, 097, 101, 108

Here is an example of a table containing all letters with their binary and ASCII code equivalent.

This is great, but unfortunately ASCII will only encode English language, and others which are very similar to it, as a byte can only hold 256 numbers (0-255 or 00000000 - 11111111). Surely enough, there are a lot more characters than 256 used throughout the entire world’s languages!

So different countries ended up sorting out their own encoding conventions for their languages; this worked, but many encodings could handle only one language. Inserting the title of an american english book in a thai sentence meant we’d have to take into consideration two different encodings.

To solve this problem, a group of people came up with Unicode, a word which is meant to be a ‘universal encoding’ for all languages. The Unicode solution is similar to the ASCII table, but instead of using 8 bits to encode a character, it uses 32. This means we can store up to 4,294,967,295 characters (2^32) compared to the previously available 256. Pretty neat! We can use it to encode all the languages of the world and then some.

Of course for our use in programming, we will hardly need to use all 32 bits to encode our information, so we often use a compressed convention of unicode to use only the most common 8-bit characters, called UTF-8 (which means Unicode Transformation Format, 8 bits); We can then escape into UTF-16 or UTF-32 depending on the use case.

We do this for simplicity’s sake, and to not consume resources for nothing.
Surely though we have space for creating more languages in the future, and encode them with Unicode, or perhaps include one or more alien languages when such an exciting time will come.

In the mean time though, we started using Unicode to represent our beloved emojis.


What a time to be alive…

A special mention to Zed A. Shaw and Charles Petzold, and their respective books, Learn Python the hard way and Code: The hidden language of computer hardware and software, for inspiration into writing this post.