By Mariana Quirino Rodrigues dos Santos
Have you ever come across the character ‘�’ in the middle of a word and wondered why it happened? The answer lies in a common encoding problem. An encoding is nothing more than a mapping between characters and groups of bits.
A few years ago, the most important characters to represent were those of the English alphabet, and the ASCII table, defined over 7 bits, was used. Thus, there were 128 binary sequences, and therefore 128 identifiers available, one for each of 128 characters: the letters of the English alphabet; the digits 0 to 9; some symbols; and a few control characters. So, for example, the word ‘Hello’ is represented in the ASCII table by the bits 1001000 1100101 1101100 1101100 1101111, which, in hexadecimal, is equivalent to 48 65 6C 6C 6F.
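A quick way to verify these values is with a couple of lines of Python 3, using only the standard str.encode and bytes.hex:

hello = 'Hello'.encode('ascii')
print(hello.hex())  # 48656c6c6f
print([format(b, '07b') for b in hello])  # ['1001000', '1100101', '1101100', '1101100', '1101111']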
The computers of that time already supported 8 bits, for a total of 256 binary strings. However, 128 of them were already occupied by the ASCII table. Soon, each company began to fill the remaining 128 in its own way, and that was how the first encoding problem arose.
A computer purchased in Brazil displays the accented letters of the Portuguese language, such as the letter “é”. However, the same byte value on a computer purchased in another country may end up displaying a different letter, and consequently the word formed with that character will be rendered incorrectly. To solve this type of problem, the ANSI code pages emerged: they keep the ASCII table and define, for each country, a code page that assigns the 128 remaining values. For example, Israel had code page 862, while Greece had code page 737, and so on.
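The sketch below illustrates the ambiguity: the same byte is decoded under three code pages and comes out as a different character in each (0xE9 is ‘é’ in Latin-1, taken here just as the Western example; the exact characters printed depend on each code page’s table):

raw = b'\xe9'  # 'é' in Latin-1
for codepage in ['latin_1', 'cp862', 'cp737']:
    # errors='replace' keeps the loop running even if a byte is undefined
    print(codepage, raw.decode(codepage, errors='replace'))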
However, it still wasn’t possible to represent Asian characters in 8 bits. To solve this problem, the DBCS (Double-Byte Character Set) was used. In this scheme a character can be represented by two bytes, which brings a far greater range of possibilities than the 256 normally available: up to 65,536 characters can be represented. The first 128 characters remain reserved for the values of the ASCII table, so that ‘Hello’ in a DBCS has the same byte values as in ASCII. Japanese and Chinese are two examples of languages that use double-byte character sets.
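Shift JIS, one of the classic Japanese double-byte character sets, shows both properties in a short sketch: ASCII text keeps its one-byte values, while each kanji takes two bytes:

print('Hello'.encode('shift_jis').hex())  # 48656c6c6f, the same bytes as ASCII
print('日本'.encode('shift_jis').hex())  # four bytes, two per character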
Another solution to the problems arising from the internationalization of characters was the creation of Unicode, which is also divided into pages (planes), where each page holds a sequence of characters, each represented by a sequence of bits; this means that, basically, there is one identifier (code point) for each character. For the word ‘Hello’, for example, we have the following representation: U+0048 U+0065 U+006C U+006C U+006F. However, the Americans, wanting to reduce the number of zeros, since they rarely used code points greater than U+00FF, quickly invented the UTFs (Unicode Transformation Formats).
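Since Python 3 strings are sequences of Unicode code points, this representation can be reproduced directly with ord:

print(' '.join('U+{:04X}'.format(ord(c)) for c in 'Hello'))
# U+0048 U+0065 U+006C U+006C U+006F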
UTF-8, UTF-16 and UTF-32 are different, but all encode the same Unicode standard. In UTF-8, every code point from 0 to 127 is stored in a single byte; beyond 127, more bytes are used to represent each character. Therefore, while in Unicode the word ‘Hello’ is the sequence of code points U+0048 U+0065 U+006C U+006C U+006F, in UTF-8 it becomes the bytes 48 65 6C 6C 6F. Thus, a text in UTF-8, for Americans, looked the same as in the ASCII table, the ANSI code pages, and every other encoding that follows this same pattern.
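A two-line check makes this compatibility concrete: for code points below 128, the UTF-8 bytes are identical to the ASCII bytes.

print('Hello'.encode('utf-8').hex())  # 48656c6c6f
print('Hello'.encode('utf-8') == 'Hello'.encode('ascii'))  # True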
Given the different encodings, there are three scenarios that can happen when decoding a text:
- You can decode the UTF-8 bytes of the word ‘Olá’ (code points U+004F U+006C U+00E1) in ASCII, or in Greek OEM or Hebrew ANSI, or in any of the hundreds of encodings that have been created so far, and, not finding an equivalent, you can get a question mark (?), or, if you are really lucky, a “?” in a box: ‘�’. As in the following example:
ola = 'Olá'.encode('utf-8')  # the bytes 4F 6C C3 A1
print(ola.decode('ascii', errors='replace'))
# Output: Ol��
When ‘Olá’ is encoded in UTF-8, it is translated into the hexadecimal bytes 4F (O), 6C (l) and C3 A1 (á); but when those bytes are decoded as ASCII, only 4F (O) and 6C (l) are found. Since there’s no character for C3 or A1 in the table, each one is replaced with ‘�’.
- The string is decoded successfully, that is, without any character being replaced by ‘�’, but the word is meaningless because the chosen encoding is not the correct one. As in the following example, which was decoded in KOI8-U when the correct encoding would be UTF-8.
ola = 'Olá'.encode('utf-8')
print(ola.decode('koi8_u', errors='replace'))
# Output: Olц║
- The string encoding is correct. The following code snippet shows a decoded string with no errors, where the word ‘Olá’ is the expected word.
ola = 'Olá'.encode('utf-8')
print(ola.decode('utf-8', errors='replace'))
# Output: Olá
In this way, we can conclude that, for any string we deal with, it’s necessary to know its encoding in order to display it correctly, whether it comes from memory, an e-mail, or any other source. Whenever the encoding of a string is unknown, what can be done is to try candidate encodings and keep the one that produces the smallest number of ‘�’ characters; even so, there is no guarantee that the resulting sentence will make sense, since the true encoding was never declared.
Let’s see an example in the code below, where there’s a list of encodings and an encoded text. A loop tests each encoding and counts how many characters were replaced by ‘�’; if this count is the lowest so far, it’s kept in the variable finalReplace, and the corresponding encoding is stored. Thus, at the end of the loop, we have the encoding that most closely matches the correct one.
encodings = ['ascii', 'utf8', 'cp856', 'iso8859_6']
text = b'Ol\xc3\xa1'  # the UTF-8 bytes of 'Olá'
finalText = ''
finalEncode = ''
finalReplace = None

for encoding in encodings:
    newText = text.decode(encoding, errors='replace')
    replace = newText.count('�')  # count the replacements (count, not find)
    if finalReplace is None or replace < finalReplace:
        finalReplace = replace
        finalEncode = encoding
        finalText = newText

print('Encode: {0}, Replace: {1}, Text: {2}'.format(finalEncode, finalReplace, finalText))
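Running this sketch, UTF-8 should win with zero replacements, printing something like: Encode: utf8, Replace: 0, Text: Olá. Keep in mind that this heuristic only minimizes ‘�’; as noted above, it cannot guarantee that the winning encoding produces meaningful text.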
Conclusion
To conclude, we emphasize the importance of always sending the charset attribute, which indicates the character encoding used in the document, whether in requests (Content-Type: text/plain; charset=UTF-8), in the body of an e-mail (Content-Type: text/plain; charset=UTF-8), or even as a meta tag in HTML (<meta http-equiv="Content-Type" content="text/html; charset=utf-8">). This way, it’ll be easier to ensure the data is processed properly, preventing loss of information.