Encoding - the Cyrillic alphabet in your computer

Last updated: September 18th, 2006

The aim of this page is to help anybody who wants to understand what is really happening when the Cyrillic alphabet is displayed on screen - and why sometimes it doesn't work as nicely as it should.

Warning: I'm not an expert on this matter. I learned about the subject while writing this page. All corrections are welcome.

First of all: this A is not an A

It is true that a computer can play music and display pictures. However, all a computer can store is numbers. All the songs and images you have on the computer are, at the end of the day, sequences of numbers written on the disk. After all, you can't expect too much from a simple box full of metal, plastic and silicon.

Not even text files are stored in memory directly as sequences of letters. They are also stored as sequences of numbers. Programs such as text editors or web browsers are programmed to read these numbers from disk and transform them into letters which are displayed on screen.

Wouldn't it be easier to store letters as letters? Not really. It is much easier to store numbers. Numbers are simple and universal. Letters are much more arbitrary.

Maybe you now understand what I mean when I say: this A is not an A. This A, as stored on the hard disk, is really a number (41 in hexadecimal, which is 65 in decimal) that your web browser has translated into an understandable form for you. If you want to know which numbers correspond to the letters of the Latin alphabet, search the web for an ASCII chart. Make sure you see hexadecimal 41 next to the capital letter A. The encoding according to which the number 41 represents the letter A is called ASCII.
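Here is a small sketch in Python that shows this. The variable names are only illustrative; the point is that the letter A turns into a single number when it is written out as ASCII, and that the number turns back into an A when it is read.

```python
# The letter A as ASCII stores it: a single number.
letter = "A"
stored = letter.encode("ascii")   # the bytes that would be written to disk

number = stored[0]                # the first (and only) byte, as an integer
print(number)                     # decimal form
print(hex(number))                # hexadecimal form

# Going the other way: turn the number back into a letter.
back = bytes([number]).decode("ascii")
print(back)
```

The decimal and hexadecimal forms printed here are two spellings of the same stored value.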

Which are the numbers for the Cyrillic letters?

Fortunately, practically everybody nowadays uses ASCII to represent Latin letters in the computer. I mean, nobody has decided that it would be better to make a letter C appear on screen each time a 41 is read from disk, instead of the A everybody else expects. No competing encoding is in common use, and this is good because text files can be moved between computers without much trouble. There is a standard encoding to represent the Latin alphabet.

Here comes the bad news: there is no single standard encoding for representing Cyrillic letters in the computer.

Windows-1251

This is the chart for Windows-1251 encoding.

Letter Hex  Dec   Letter Hex  Dec   Letter Hex  Dec   Letter Hex  Dec   Letter Hex  Dec   Letter Hex  Dec
А      0xC0 192   К      0xCA 202   Х      0xD5 213   а      0xE0 224   к      0xEA 234   х      0xF5 245
Б      0xC1 193   Л      0xCB 203   Ц      0xD6 214   б      0xE1 225   л      0xEB 235   ц      0xF6 246
В      0xC2 194   М      0xCC 204   Ч      0xD7 215   в      0xE2 226   м      0xEC 236   ч      0xF7 247
Г      0xC3 195   Н      0xCD 205   Ш      0xD8 216   г      0xE3 227   н      0xED 237   ш      0xF8 248
Д      0xC4 196   О      0xCE 206   Щ      0xD9 217   д      0xE4 228   о      0xEE 238   щ      0xF9 249
Е      0xC5 197   П      0xCF 207   Ъ      0xDA 218   е      0xE5 229   п      0xEF 239   ъ      0xFA 250
Ё      0xA8 168   Р      0xD0 208   Ы      0xDB 219   ё      0xB8 184   р      0xF0 240   ы      0xFB 251
Ж      0xC6 198   С      0xD1 209   Ь      0xDC 220   ж      0xE6 230   с      0xF1 241   ь      0xFC 252
З      0xC7 199   Т      0xD2 210   Э      0xDD 221   з      0xE7 231   т      0xF2 242   э      0xFD 253
И      0xC8 200   У      0xD3 211   Ю      0xDE 222   и      0xE8 232   у      0xF3 243   ю      0xFE 254
Й      0xC9 201   Ф      0xD4 212   Я      0xDF 223   й      0xE9 233   ф      0xF4 244   я      0xFF 255

Each entry in the table has three columns: the Cyrillic letter, followed by its number in the Windows-1251 encoding written in hexadecimal and in decimal. Windows-1251 is a very common encoding (that's why I've shown you this chart). Decimal numbers are simply numbers as you know them. Hexadecimal numbers can be thought of as the numerical system we would use if we had 16 fingers (search the web for more detailed explanations). Hexadecimal is used far more often than decimal in the encoding world, because it is more closely related to the way a computer stores numbers.
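To see that the hexadecimal and decimal columns really are two spellings of the same number, here is a minimal Python sketch. The helper names to_hex and from_hex are made up for this example.

```python
def to_hex(n):
    """Format a number from 0 to 255 the way the chart does, e.g. 0xC0."""
    return "0x" + format(n, "02X")

def from_hex(text):
    """Read a hexadecimal string such as '0xC0' back into a plain number."""
    return int(text, 16)

print(to_hex(192))        # the chart's entry for А, written in hexadecimal
print(from_hex("0xC0"))   # and back to decimal
print(from_hex("0xFF"))   # the last entry in the chart, я
```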

Look at the decimal column. Each Cyrillic letter, starting from А, has been assigned a number from 192 on. Only Ё and ё interrupt this sequence. Ё is a special letter, isn't it? It is always stressed, but Russians don't always write it with the two dots above, which means that one sometimes has to guess that an е isn't really an е but an ё which has lost its two dots, and has to be read like a soft o. But sometimes an е is really just an е and has to be read like an е... enough about this.
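The sequence, and the way Ё breaks it, can be checked with a short Python sketch. It relies on Python's built-in "cp1251" codec, which is Python's name for Windows-1251; the variable names are only illustrative.

```python
# The first column of the chart, in alphabetical order.
alphabet = "АБВГДЕЁЖЗИЙ"

# Encode each letter as Windows-1251 and keep the single byte it becomes.
codes = {letter: letter.encode("cp1251")[0] for letter in alphabet}

for letter, code in codes.items():
    print(letter, hex(code), code)

# The lowercase ё and the last letter of the alphabet, for comparison.
small_yo = "ё".encode("cp1251")[0]
last_letter = "я".encode("cp1251")[0]
print(small_yo, last_letter)
```

Every letter except Ё should come out one higher than the letter before it.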

Sometimes, Russian looks like anything but Russian. Why?

Perhaps you have downloaded Russian music, and your playlist looks like this:

[Image: playlist with garbled song titles]

Perhaps you've come across pages looking like this while surfing the Russian web:

[Image: web page with garbled text]

With what you know, you can make an educated guess of why this happens. This chart gives even more clues:

Windows-1252 Western alphabet
Letter Hex  Dec   Letter Hex  Dec   Letter Hex  Dec   Letter Hex  Dec   Letter Hex  Dec   Letter Hex  Dec
À      0xC0 192   Ê      0xCA 202   Õ      0xD5 213   à      0xE0 224   ê      0xEA 234   õ      0xF5 245
Á      0xC1 193   Ë      0xCB 203   Ö      0xD6 214   á      0xE1 225   ë      0xEB 235   ö      0xF6 246
Â      0xC2 194   Ì      0xCC 204   ×      0xD7 215   â      0xE2 226   ì      0xEC 236   ÷      0xF7 247
Ã      0xC3 195   Í      0xCD 205   Ø      0xD8 216   ã      0xE3 227   í      0xED 237   ø      0xF8 248
Ä      0xC4 196   Î      0xCE 206   Ù      0xD9 217   ä      0xE4 228   î      0xEE 238   ù      0xF9 249
Å      0xC5 197   Ï      0xCF 207   Ú      0xDA 218   å      0xE5 229   ï      0xEF 239   ú      0xFA 250
¨      0xA8 168   Ð      0xD0 208   Û      0xDB 219   ¸      0xB8 184   ð      0xF0 240   û      0xFB 251
Æ      0xC6 198   Ñ      0xD1 209   Ü      0xDC 220   æ      0xE6 230   ñ      0xF1 241   ü      0xFC 252
Ç      0xC7 199   Ò      0xD2 210   Ý      0xDD 221   ç      0xE7 231   ò      0xF2 242   ý      0xFD 253
È      0xC8 200   Ó      0xD3 211   Þ      0xDE 222   è      0xE8 232   ó      0xF3 243   þ      0xFE 254
É      0xC9 201   Ô      0xD4 212   ß      0xDF 223   é      0xE9 233   ô      0xF4 244   ÿ      0xFF 255

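Read the title of this chart. Do you get it? The numbers are exactly the same as in the Windows-1251 chart, but here they stand for Western letters. A program that reads Windows-1251 text while assuming it is Windows-1252 will show these letters instead of the Cyrillic ones.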
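The mix-up between the two charts can be reproduced in a few lines of Python. This is only a sketch: the function names garble and repair are invented for the example, and "cp1251" and "cp1252" are Python's names for Windows-1251 and Windows-1252.

```python
def garble(text):
    """Write text as Windows-1251 bytes, then misread them as Windows-1252."""
    return text.encode("cp1251").decode("cp1252")

def repair(garbled):
    """Undo garble(): recover the bytes and read them as Windows-1251."""
    return garbled.encode("cp1252").decode("cp1251")

original = "Привет"
broken = garble(original)
print(broken)            # Ïðèâåò
print(repair(broken))    # Привет
```

This works for the Russian letters shown in the charts. Some other Windows-1251 characters fall on numbers that Python's Windows-1252 codec leaves undefined, and for those garble would raise an error instead.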

UTF-8

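UTF-8 is what this site uses to represent Cyrillic. UTF-8 is one way of storing Unicode, a standard that assigns a number (a "code point", written as U+ followed by hexadecimal digits) to every letter. Here is the chart of Unicode code points for the Cyrillic letters; in a UTF-8 file, each of these letters is stored as two bytes: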

Letter Hex    Dec    Letter Hex    Dec    Letter Hex    Dec    Letter Hex    Dec    Letter Hex    Dec    Letter Hex    Dec
А      U+0410 1040   К      U+041A 1050   Х      U+0425 1061   а      U+0430 1072   к      U+043A 1082   х      U+0445 1093
Б      U+0411 1041   Л      U+041B 1051   Ц      U+0426 1062   б      U+0431 1073   л      U+043B 1083   ц      U+0446 1094
В      U+0412 1042   М      U+041C 1052   Ч      U+0427 1063   в      U+0432 1074   м      U+043C 1084   ч      U+0447 1095
Г      U+0413 1043   Н      U+041D 1053   Ш      U+0428 1064   г      U+0433 1075   н      U+043D 1085   ш      U+0448 1096
Д      U+0414 1044   О      U+041E 1054   Щ      U+0429 1065   д      U+0434 1076   о      U+043E 1086   щ      U+0449 1097
Е      U+0415 1045   П      U+041F 1055   Ъ      U+042A 1066   е      U+0435 1077   п      U+043F 1087   ъ      U+044A 1098
Ё      U+0401 1025   Р      U+0420 1056   Ы      U+042B 1067   ё      U+0451 1105   р      U+0440 1088   ы      U+044B 1099
Ж      U+0416 1046   С      U+0421 1057   Ь      U+042C 1068   ж      U+0436 1078   с      U+0441 1089   ь      U+044C 1100
З      U+0417 1047   Т      U+0422 1058   Э      U+042D 1069   з      U+0437 1079   т      U+0442 1090   э      U+044D 1101
И      U+0418 1048   У      U+0423 1059   Ю      U+042E 1070   и      U+0438 1080   у      U+0443 1091   ю      U+044E 1102
Й      U+0419 1049   Ф      U+0424 1060   Я      U+042F 1071   й      U+0439 1081   ф      U+0444 1092   я      U+044F 1103

Observe that Ё and ё are exceptions - again.
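The U+ numbers in the chart are Unicode code points; what UTF-8 actually writes to disk for each of them is a pair of bytes. The following Python sketch prints both for a few letters. The helper name describe is made up for this example.

```python
def describe(letter):
    """Return the code point label, its decimal value and the UTF-8 bytes."""
    code_point = ord(letter)                       # the number from the chart
    utf8_bytes = letter.encode("utf-8")            # what is stored on disk
    as_hex = " ".join(f"{b:02X}" for b in utf8_bytes)
    return f"U+{code_point:04X}", code_point, as_hex

for letter in "АЯЁаяё":
    print(letter, *describe(letter))
```

The code points of Ё and ё fall outside the run from А to я here too, just as they did in the Windows-1251 chart.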

By the way, if you ask "which letters are represented by numbers 192 to 255 (or, in hexadecimal, C0 to FF) in Unicode?" the answer is "Look at the second table above, the Windows-1252 one". (In a UTF-8 file, each of those letters also takes two bytes.)
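This last claim can be checked with a short Python sketch: it builds the letters whose Unicode code points are 192 to 255 and compares them with what the Windows-1252 chart gives for the same numbers. The variable names are only illustrative.

```python
# Letters whose Unicode code points are 192..255.
code_point_letters = "".join(chr(n) for n in range(192, 256))

# Letters that Windows-1252 assigns to the byte values 192..255.
cp1252_letters = bytes(range(192, 256)).decode("cp1252")

print(code_point_letters)
print(code_point_letters == cp1252_letters)
```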