« Technical interviews - Different kinds and approaches | Main | Why is scaling a web application hard ? »
October 31, 2010
Why you benefit from using UTF-8 Unicode everywhere in your web applications
Character data which consists of the letters, numbers and symbols used in web applications isn't managed by a computer as you see it on a screen. It's rather encoded into a series of 1s and 0s to make management easier.
When you write character data in a text editor or IDE for a web application, it's encoded using a series of numbers. When a user's browser receives a web application's content, these numbers are decoded and placed on a screen. When data is saved to a web application's permanent storage system (e.g. RDBMS), it's encoded using a series of numbers. When a web application's business logic code reads data from a permanent storage system, it's decoded to execute the appropriate logic. This same encoding and decoding process can take place at several other parts of a web application that require reading and writing character data.
This encoding and decoding process in character data operates on the basis of character encodings, also called character sets, charsets, character maps or charmaps. But with over 50 character encodings to choose from in web applications, which one should you choose ?
This entry addresses why you should select UTF-8 Unicode for every part of your web applications and how you benefit from doing so.
| [Entry continues to the left and below ad ] |
Lets start with what makes one character encoding different from the next. The differences in character encodings are due to different representations of letters, numbers and symbols between countries (e.g. Germany, Japan, France and China use certain character representations in their languages not used in languages in other countries), in addition to manufacturers designing software for a particular country market.
Most web applications rely on a small set of character encodings that include: ASCII, ISO-8859-1, Windows-1252 and UTF-8. But even with this small set, there can be several differences in choosing one character set over another.
In terms of performance, character encodings are important because they influence the space needed to represent characters. Some character encodings are known as single-byte encodings, while others are multi-byte encodings. Of the multi-byte kind, variable-width encodings are the more versatile, since they can represent characters using 1-byte, 2-bytes or more bytes depending on the character they're trying to represent.
Choosing a web application's character encoding solely on the basis of performance (i.e. the amount of space it uses to represent characters) is generally a bad idea, since it can seriously impact a web application's usability. With variable-width encodings, you can choose a character encoding that's versatile to single or multiple bytes, which is often a better tradeoff -- favoring usability -- than one of strictly following byte efficiency by using single-byte encodings. To understand this, it's necessary to describe the consequences of using single-byte character encodings in a web application.
If you write web applications targeting a particular region in the world, single-byte character encodings like ISO-8859-1, Windows-1252 and even ASCII, can represent the most efficient character encodings. However, even though you're getting performance efficiency by using a single byte to represent each character, you're limiting a web application's usability because single-byte encodings are only capable of representing up to 256 characters (1 byte = 8 bits = 2^8 positions = 256 characters).
ASCII which was one of the earliest character encodings, in fact only uses 7 bits out of the possible 8 bits in a byte to represent characters. The reasoning behind this logic in ASCII -- made around the 1960's -- was that when you counted all the possible alphanumeric characters used by computers, the letters A to Z upper and lower case, the numbers 0 to 9 and special characters like %, *, ? among others things, the sum came to less than 100 characters. Given that 7 bits or 2^7 positions equals 128, 7-bits for character representations was enough, leaving the remaining 8th bit as a parity bit to detect transmission errors -- this again was the 1960's when network transmissions were in their infancy and not very reliable.
ASCII served its purpose until computers required to represent more characters. This brought the need to either make use of the 8th bit in ASCII to increment the total number of character representations in a byte to 256 or use multi-byte character encodings to break the 256 limit on single-byte character encodings.
The first change came with single-byte/8-bit character encodings. Among these single-byte/8-bit character encodings came ISO-8859-1 and Windows-1252. With 256 positions available to represent characters -- double the amount available in ASCII -- many new characters could be supported. The British were now able to represent their pound sign £ on their applications, the French their cedilla ç on their applications and the Spanish their vocals with accents á, é, í, ó, ú on their own applications.
Everyone happy, right ? Not exactly, even though 256 characters were enough to accommodate web applications targeting British, French and Spanish citizens, web applications requiring to represent characters in Arab, Hebrew or Nordic languages (e.g. Swedish, Danish, Norwegian), got squeezed out of this 256 character space. This led to multiple ISO-8859 regional character encodings. ISO-8859-6 supporting characters from the Arab alphabet, ISO-8859-8 supporting Hebrew letters and ISO-8859-10 supporting Nordic languages. As well as the equivalent Windows character encoding variations, like Windows-1256 supporting Arabic characters and Windows-1255 supporting Hebrew characters.
This need to encode characters in a single byte leads to usability problems. For example, the 232nd position in the ISO-8859-1 character encoding represents the è character (Latin small letter e with grave accent). But since other languages need space for their own characters and can do without this particular character, the 232nd position in the ISO-8859-6 character encoding represents the و character (Arabic Letter waw), where as the 232nd position in the ISO-8859-8 character encoding represents the ט character (Hebrew letter tet) and the 232nd position in the ISO-8859-10 represents the č character (Grapheme or latin c with háček).
Care to guess what happens if a web application's permanent storage system uses the ISO-8859-1 character encoding to store data and the business logic code attempts to read it as Windows-1256 ? Or a web application's HTML content is written with the ISO-8859-6 character encoding but accessed with a browser incapable of detecting this encoding ? You'll see character data as either squiggles (e.g. ), question marks (e.g. �,�,�) or characters with different meaning (e.g. instead of an expected è character, you could see ט, و or č). Yes, you gain performance using a single-byte to represent each character, but this is the potential usability penalty you pay for doing so.
Meanwhile, while this single-byte character encoding fragmentation process took place, users in other parts of the world like China, Japan and Korea found the idea of using a single byte to encode characters laughable. Languages like Chinese, Japanese and Korean rely heavily on ideographs -- symbols representing the meaning of a word, not the sounds of a language -- so with single-byte character encodings limited to representing 256 characters they prove inadequate for these languages that can have thousands of characters. These special symbols used in Chinese, Japanese and Korean are often called CJK characters, with CJK being an acronym made up from each language's initial. To address such needs web applications targeting Chinese, Japanese and Korean speakers require to use multi-byte character encodings.
This brought about a series of multi-byte character encodings, such as IEC-2022-JP and Shift-JIS to support Japanese, IEC-2022-KR and KSX1001 to support Korean, as well as GB-2312 and GBK to support Chinese. By using multiple-bytes, an enormous amount of positions are made available for character representations (e.g. 2 bytes with 8 bits each has a potential for 2^16=65,536 character representations and 3 bytes with 8 bits each has a potential for 2^24=16,777,216 character representations), more than enough to satisfy thousands of ideographs in Chinese, Japanese and Korean.
But would you dare guess what happens if a web application's permanent storage system uses the IEC-2022-JP character encoding to store data and you use a run-time (e.g. Java, Python, Ruby) built using ISO-8859-1 to read this data and do a business process ? Or a web application's HTML content is written with the GBK character encoding but accessed with a browser in Europe incapable of detecting this encoding ? You'll be back to seeing squiggles, question marks or characters with different meaning. This is because, multi-byte character encodings also define meanings for each position in their bytes (e.g. the 232nd position in a multi-byte character encoding is used to represent a meaningful character to that particular encoding, like a kanji, hiragana or katakana character, similar to how the 232nd position in an Arabic focused character encoding is used to represent an Arab letter or the 232nd position in a Hebrew focused character encoding is used to represent Hebrew letters).
So is there a solution to all this character encoding madness ? Yes, it's called Unicode, and it's one of the leading variable-width encodings used in software. There are actually several types of Unicode, but for web applications UTF-8 is the dominant choice. UTF-16 and UTF-32 are the other types of Unicode, but UTF-16 is primarily used in run-time platforms like Java and Windows OS, where as UTF-32 is used in several Unix OS. The differences consist of UTF-8 using one to four 8-bit bytes to represent characters, UTF-16 one or two 16-bit code units to represent characters and UTF-32 using a single 32-bit code unit to represent characters.
Getting back to UTF-8 which is the preferred Unicode choice for web applications. How is it that UTF-8 solves the problem of multiple character encodings not getting in the way of each other ? The answer is simple, in UTF-8 there's only a single character for every byte position that is equally interpreted on every system in the world supporting UTF-8. This means that in UTF-8, the 232nd position in its first byte will always represent the è character (Latin small letter e with grave accent), whether it's accesed on a system in Saudi Arabia, Israel or Norway.
So what happens if a web application requires using characters like و (Arabic Letter waw), ט (Hebrew letter tet) or č (Grapheme or latin c with háček) ? UTF-8 also assigns each of these characters an exclusive byte position to unequivocally represent such characters on any system in the world. UTF-8 can do this because it's designed to use up to 4-bytes to represent characters.
UTF-8 defines these exclusive byte positions for characters used in languages with Latin letters (e.g. English, Spanish, Italian Portuguese, German,etc), as well as Arabic, Hebrew, Nordic languages, Chinese, Japanese, Korean, in addition to other characters used in Ethiopic, Cherokee, Mongolian, Thai and Tibetan, to name a few. UTF-8 also defines exclusive byte positions for special characters like currency symbols, emoticons and even music symbols, also to name a few. In essence, UTF-8 provides software representations for practically every character imaginable that's used in computers. You can take a look at the entire set of UTF-8 characters by consulting references like the Unicode charts or this Unicode character chart with named and HTML entities .
The first issue than can come to mind about supporting this amount of characters in UTF-8 is the average number of bytes needed to represent characters. If encoding a basic character like the letter A requires using 4-bytes, this can translate into exponential byte growth requiring more bandwidth, memory and storage for each character, all this overhead just to support characters in Ethiopic, Cherokee or Mongolian a web application will never use.
You can relax on this particular issue. UTF-8 is a variable-width encoding character set, which means even though it can use up to 4-bytes to represent characters, it doesn't mean it requires using all 4-bytes to do so, certain characters are represented using just 1-byte, which makes the overhead concerns of exponential byte growth unwarranted.
UTF-8 designers took a very clever approach to the way characters are assigned among this possible 4-byte representation. The 1st byte in UTF-8 supports 7 bits for character representations (2^7= 128 characters or code points). Can you guess which characters got the privilege of being represented as a single-byte ? The most common set used in computers of course, the same sub-100 character set defined in ASCII in the 1960's. What this means is that the most prevalent set of characters used in software, even encoded using UTF-8, need just 1 byte per character representation, just like single-byte encoding characters sets like ASCII, ISO-8859-1 and Windows-1252.
So why doesn't UTF-8 use all 8-bits in the 1st byte to represent characters ? Since UTF-8 is a variable-width encoding character set, it needs a way to indicate if a character is made up of a single-byte or multiple-bytes. Therefore one bit of the 1st byte in UTF-8 character representations is reserved for this purpose. Following this same rational, 2 bytes in UTF-8 support 11 bits for character representations (2^11 = 2048 characters or code points), 3 bytes in UTF-8 support 16 bits for character representations (2^16 = 65,536 characters or code points) and 4 bytes support 21 bits for character representations (2^21 = 2,097,152 characters or code points). Here again, the reason UTF-8 doesn't use the entire 8-bit spectrum across the 4 available bytes, is due to UTF-8 reserving bits in each byte to determine if it's a single-byte code point, a multi-byte code point or a continuation of a multi-byte code point.
But wait, does this mean UTF-8 requires 2-bytes to represent characters that in character encodings like ISO-8859 and Windows-1252 could be represented with 1-byte ? Yes, since one bit in the 1st byte of UTF-8 characters is reserved. But don't get hung-up on small details. How many web applications have you written made up entirely of characters in the upper boundary -- above ASCII characters -- of single-byte character sets ? Characters like £, ç, á, é, í, ó, ú ? Not many I would think. Since such characters are used sparingly, the tradeoff between using 2-bytes instead of 1-byte for such characters, is well worth it when you consider the usability benefits of UTF-8.
But wait, what about CJK characters, won't UTF-8 need 3-bytes or even 4-bytes to represent them ? Yes, but again this shouldn't be an issue considering the usability benefits of UTF-8. Web applications that require CJK characters one way or another need multi-byte character encodings, so it's not like UTF-8 adds extra overhead for representing characters that would still need multiple-bytes to be represented.
Even though the maximum character definitions permitted in UTF-8 is 2,097,152, the most recent Unicode standard version 6.0 released in late 2010 defines a little over 100,000 characters or code points -- 109,449 to be exact. Among these 100,000 characters you'll find definitions for the barrage of characters already described earlier like Cherokee and Mongolian, all the way up to Vedic Sanskrit. Considering there's still room to add characters to the tune of nearly 2,000,000 character representations, UTF-8/Unicode is likely to accommodate characters for some civilizations to come.
Use UTF-8 in your web applications and ensure their usability without any performance or storage penalties.
This is an excerpt from a book I'm writing on web application performance and scalability. You can find the entire book's contents at http://www.webforefront.com/performance/
Posted by Daniel at October 31, 2010 7:51 PM




