UTF-8 Everywhere

Manifesto

Purpose of this document

This document contains special characters. Without proper rendering support, you may see question marks, boxes, or other symbols.

Our goal is to promote usage and support of the UTF-8 encoding, and to convince the reader that it should be the default choice of encoding for storing text strings in memory or on disk, for communication and for all other uses. We believe that all other encodings of Unicode (or text, in general) belong to rare edge-cases of optimization and should be avoided by mainstream users.
In particular, we believe that the very popular UTF-16 encoding (mistakenly used as a synonym for ‘widechar’ and ‘Unicode’ in the Windows world) has no place in library APIs (except for specialized libraries, which deal with text).
This document recommends choosing UTF-8 as string storage in Windows applications, despite the fact that this standard is less popular there, due to historical reasons and the lack of native UTF-8 support by the API. Yet, we believe that, even on this platform, the following arguments outweigh the lack of native support. Also, we recommend forgetting forever what ‘ANSI codepages’ are and what they were used for. It is in the customer’s bill of rights to mix any number of languages in any text string.
We recommend avoiding C++ application code that depends on the _UNICODE define. This includes TCHAR/LPTSTR types on Windows and APIs defined as macros, such as CreateWindow. We also recommend alternative ways to reach the goals of these APIs.
We also believe that, if an application is not supposed to specialize in text, the infrastructure must make it possible for the program to be unaware of encoding issues. For instance, a file copy utility should not be written differently to support non-English file names. Joel’s great article on Unicode explains the encodings well for beginners, but it lacks the most important part: how a programmer should proceed, if she does not care what is inside the string.

Background

In 1988, Joseph D. Becker published the first Unicode draft proposal. At the basis of his design was the naïve assumption that 16 bits per character would suffice. In 1991, the first version of the Unicode standard was published, with code points limited to 16 bits. In the following years many systems added support for Unicode and switched to the UCS-2 encoding. It was especially attractive for new technologies, like the Qt framework (1992), Windows NT 3.1 (1993) and Java (1995).
However, it was soon discovered that 16 bits per character would not do for Unicode. In 1996, the UTF-16 encoding was created so existing systems would be able to work with non-16-bit characters. This effectively nullified the rationale behind choosing a 16-bit encoding in the first place, namely being a fixed-width encoding. Currently Unicode spans over 109449 characters, about 74500 of them being CJK ideographs.

A little child playing an encodings game in front of a large poster about encodings.

Nagoya City Science Museum. Photo by Vadim Zlotnik.

Microsoft has, ever since, mistakenly used ‘Unicode’ and ‘widechar’ as synonyms for ‘UCS-2’ and ‘UTF-16’. Furthermore, since UTF-8 cannot be set as the encoding for the narrow string WinAPI, one must compile her code with _UNICODE rather than _MBCS. Windows C++ programmers are taught that Unicode must be done with ‘widechars’. As a result of this mess, they are now among the most confused about what the right thing to do about text is.
At the same time, in the Linux and the Web worlds, there is a silent agreement that UTF-8 is the most correct encoding for Unicode on the planet Earth. Even though it gives a strong preference to English and therefore to computer languages (such as C++, HTML, XML, etc) over any other text, it is seldom less efficient than UTF-16 for commonly used character sets.

The Facts

In both UTF-8 and UTF-16 encodings, code points may take up to 4 bytes (contrary to what Joel says).
UTF-8 is endianness independent. UTF-16 comes in two flavors: UTF-16LE and UTF-16BE (for different byte orders, respectively). Here we name them collectively as UTF-16.
Widechar is 2 bytes in size on some platforms, 4 on others.
UTF-8 and UTF-32 result in the same order when sorted lexicographically. UTF-16 does not.
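The ordering claim can be checked with a small sketch. It hand-encodes two code points, U+FF5E and U+1F428 (our choice for illustration), in all three encoding forms and relies on the ordinary lexicographic comparison of the standard string classes. The variable names are ours.

```cpp
#include <string>

// U+FF5E FULLWIDTH TILDE and U+1F428 KOALA, hand-encoded.
// By code point value, U+FF5E sorts before U+1F428.
const std::string    tilde8  = "\xEF\xBD\x9E";      // UTF-8
const std::string    koala8  = "\xF0\x9F\x90\xA8";  // UTF-8
const std::u16string tilde16 = {0xFF5E};            // UTF-16, one code unit
const std::u16string koala16 = {0xD83D, 0xDC28};    // UTF-16, surrogate pair
const std::u32string tilde32 = {0xFF5E};            // UTF-32
const std::u32string koala32 = {0x1F428};           // UTF-32
```

Here `tilde8 < koala8` and `tilde32 < koala32` both hold, matching code point order, while `tilde16 < koala16` does not: the surrogate code unit 0xD83D compares lower than 0xFF5E.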
UTF-8 favors efficiency for English letters and other ASCII characters (one byte per character) while UTF-16 favors several Asian character sets (2 bytes instead of 3 in UTF-8). This is what made UTF-8 the favorite choice in the Web world, where English HTML/XML tags are intermixed with any-language text. Cyrillic, Hebrew and several other popular Unicode blocks are 2 bytes both in UTF-16 and UTF-8.
In the Linux world, narrow strings are considered UTF-8 by default almost everywhere. This way, for example, a file copy utility would not need to care about encodings. Once tested on ASCII strings for file name arguments, it would certainly work correctly for file names in any language, as arguments are treated as cookies. The code of the file copy utility would not need to change at all to support foreign languages. fopen() would accept Unicode seamlessly, and so would argv.
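A minimal sketch of the core of such a utility, assuming a platform where narrow file names reach the OS untouched (as on Linux). The function name `copy_file_bytes` is ours. Note that nothing in it inspects or converts the names.

```cpp
#include <fstream>

// Copies src to dst byte for byte. The file names are treated as opaque
// cookies: they are handed to the runtime exactly as received.
// Returns true on success.
bool copy_file_bytes(const char* src, const char* dst)
{
    std::ifstream in(src, std::ios_base::binary);
    if (!in) return false;
    std::ofstream out(dst, std::ios_base::binary);
    if (!out) return false;

    char buffer[4096];
    // read() fails on the last, short chunk; gcount() tells how much arrived.
    while (in.read(buffer, sizeof buffer) || in.gcount() > 0) {
        out.write(buffer, in.gcount());
    }
    out.close();
    return in.eof() && !out.fail();
}
```

In a real utility the two arguments would come straight from argv, with no conversion in between.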
On Microsoft Windows, however, making a file copy utility that can accept file names in a mix of several different Unicode blocks requires advanced trickery. First, the application must be compiled as Unicode-aware. In this case, it cannot have a main() function with standard-C parameters. It will then accept UTF-16 encoded argv. To convert a Windows program written with narrow text in mind to support Unicode, one has to refactor deeply and take care of each and every string variable.
On Windows, the SetCodePage() API enables receiving non-ASCII characters, but only from a single ANSI codepage. An unimplemented parameter CP_UTF8 would enable doing the above, on Windows.
The standard library shipped with MSVC is poorly implemented. It forwards narrow-string parameters directly to the OS ANSI API. There is no way to override this. Changing std::locale does not work. It’s impossible to open a file with a Unicode name on MSVC using standard features of C++. The standard way to open a file is:
```
std::fstream fout("abc.txt");
```
The proper way to get around this is by using Microsoft’s own hack that accepts a wide-string parameter, which is a non-standard extension.
There is no way to return Unicode from std::exception::what() other than using UTF-8.
UTF-16 is often misused as a fixed-width encoding, even by Windows native programs themselves: in plain Windows edit control (until Vista), it takes two backspaces to delete a character which takes 4 bytes in UTF-16. On Windows 7 the console displays that character as two invalid characters, regardless of the font used.
Many third-party libraries for Windows do not support Unicode: they accept narrow string parameters and pass them to the ANSI API. Sometimes, even for file names. In the general case, it is impossible to work around this, as a string may not be representable completely in any ANSI code page (if it contains characters from a mix of Unicode blocks). What is normally done on Windows for file names is getting an 8.3 path to the file (if it already exists) and feeding it into such a library. It is not possible if the library is supposed to create a non-existing file. It is not possible if the path is very long and the 8.3 form is longer than MAX_PATH. It is not possible if short-name generation is disabled in OS settings.
UTF-16 is very popular today, even outside the Windows world. Qt, Java, C#, Python, the ICU—they all use UTF-16 for internal string representation.

Our Conclusions

UTF-16 is the worst of both worlds—variable length and too wide. It exists for historical reasons, adds a lot of confusion and will hopefully die out.
Portability, cross-platform interoperability and simplicity are more important than interoperability with existing platform APIs. So, the best approach is to use UTF-8 narrow strings everywhere and convert them back and forth on Windows before calling APIs that accept strings. Performance is seldom an issue of any relevance when dealing with string-accepting system APIs (e.g. UI code and file system APIs), but there is a huge advantage to using the same encoding everywhere, and we see no sufficient reason to do otherwise.
Speaking of performance, machines often use strings to communicate (e.g. HTTP headers, XML). Many see this as a mistake, but regardless of that it is nearly always done in English, giving UTF-8 further advantage there. Using different encodings for different kinds of strings significantly increases complexity and consequent bugs.
In particular, we believe that adding wchar_t to C++ was a mistake, and so are the Unicode additions to C++11. What must be demanded from the implementations, though, is that the basic execution character set would be capable of storing any Unicode data. Then, every std::string or char* parameter would be Unicode-compatible. ‘If this accepts text, it should be Unicode compatible’—and with UTF-8, it is also easy to do.
The standard facets have many design flaws. This includes std::numpunct, std::moneypunct and std::ctype not supporting variable-length encoded characters (non-ASCII UTF-8 and non-BMP UTF-16). They must be fixed:

decimal_point() and thousands_sep() should return a string rather than a single code unit. (By the way C locales do support this, albeit not customizable.)
toupper() and tolower() shall not be phrased in terms of code units, as it does not work in Unicode. For example, the German ß must be converted to SS and ﬄ to FFL.

How to do text on Windows

The following is what we recommend to everyone else for compile-time checked Unicode correctness, ease of use and better multi-platformness of the code. This substantially differs from what is usually recommended as the proper way of using Unicode on Windows. Yet, in-depth research of these recommendations led to the same conclusion. So here it goes:

Do not use wchar_t or std::wstring in any place other than at the point adjacent to APIs accepting UTF-16.
Do not use _T("") or L"" literals in any place other than parameters to APIs accepting UTF-16.
Do not use types, functions, or their derivatives that are sensitive to the _UNICODE constant, such as LPTSTR or CreateWindow().
Yet, always define _UNICODE, so that passing narrow strings to WinAPI does not get silently compiled.
std::strings and char*, anywhere in the program, are considered UTF-8 (if not said otherwise).
Only use Win32 functions that accept widechars (LPWSTR), never those which accept LPTSTR or LPSTR. Pass parameters this way:
```
::SetWindowTextW(hWnd, widen(someStdString).c_str()); // or widen("string literal").c_str()
```
(The policy uses conversion functions described below.)

With MFC strings:

CString someoneElse; // something that arrived from MFC.

// Converted as soon as possible, before passing any further away from the API call:
std::string s = str(boost::format("Hello %s\n") % narrow(someoneElse));
AfxMessageBox(widen(s).c_str(), MB_OK | MB_ICONERROR); // AfxMessageBox takes no caption argument

Working with files, filenames and fstreams on Windows

Never produce text output files with non-UTF-8 content.
Using fopen() should anyway be avoided for RAII/OOD reasons. However, if necessary, use _wfopen() and WinAPI conventions as described above.
Never pass std::string or const char* filename arguments to the fstream family. MSVC CRT does not support UTF-8 arguments, but it has a non-standard extension which should be used as follows:
Convert std::string arguments to std::wstring with widen:
```
std::ifstream ifs(widen("hello"), std::ios_base::binary);
```
We will have to manually remove the conversion, when MSVC’s attitude to fstream changes.
This code is not multi-platform and may have to be changed manually in the future.
Alternatively use a set of wrappers that hide the conversions.

Conversion functions

This guideline uses the conversion functions from the Boost.Nowide library (it is not yet a part of Boost):

std::string narrow(const wchar_t *s);
std::wstring widen(const char *s);
std::string narrow(const std::wstring &s);
std::wstring widen(const std::string &s);

The library also provides a set of wrappers for commonly used standard C and C++ library functions that deal with files, as well as means of reading and writing UTF-8 through iostreams.
These functions and wrappers are easy to implement using Windows’ MultiByteToWideChar and WideCharToMultiByte functions. Any other (possibly faster) conversion routines can be used.
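As one illustration of such "other conversion routines", here is a minimal portable sketch. It is not the Boost.Nowide implementation: the names `utf8_to_utf16` and `utf16_to_utf8` are ours, std::u16string stands in for the UTF-16 std::wstring used on Windows, and throwing on malformed input is our own choice of error policy.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical portable counterpart of widen(): UTF-8 in, UTF-16 out.
std::u16string utf8_to_utf16(const std::string& in)
{
    std::u16string out;
    for (std::size_t i = 0; i < in.size(); ) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t len;
        if      (lead < 0x80)           { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (in.size() - i < len)
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char trail = static_cast<unsigned char>(in[i + k]);
            if ((trail & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 trail byte");
            cp = (cp << 6) | (trail & 0x3F);
        }
        // Reject overlong forms, surrogates and values above U+10FFFF.
        static const char32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
        if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("ill-formed UTF-8 sequence");

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {                        // encode as a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += len;
    }
    return out;
}

// Hypothetical portable counterpart of narrow(): UTF-16 in, UTF-8 out.
std::string utf16_to_utf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {             // high surrogate
            if (i + 1 >= in.size() || in[i + 1] < 0xDC00 || in[i + 1] > 0xDFFF)
                throw std::invalid_argument("unpaired high surrogate");
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::invalid_argument("unpaired low surrogate");
        }
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

On Windows, where wchar_t is 16 bits wide, the resulting code units could be copied one-to-one into a std::wstring before calling a wide API.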

FAQ

Q: Are you a linuxer? Is this a concealed religious fight against Windows?
A: No, I grew up on Windows, and I am a Windows fan. I believe that they made the wrong choice in the text domain, because they made it earlier than others. (Pavel)
Q: Are you an Anglophile? Do you secretly think English alphabet and culture are superior to any other?
A: No, and my country is non-ASCII speaking. I do not think that using a format which encodes ASCII characters in a single byte is Anglo-centrism, or has anything to do with human interaction. Even though one can argue that source codes of programs, web pages and XML files, OS file names and other computer-to-computer text interfaces should never have existed, as long as they do exist, text is not only for human readers.
Q: Why do you guys care? I program in C# and/or Java and I don’t need to care about encodings at all.
A: Not true. Both C# and Java offer a 16 bit char type, which is less than a Unicode character, congratulations. The .NET indexer str[i] works in units of the internal representation, hence a leaky abstraction once again. Substring methods will happily return an invalid string, cutting a non-BMP character in parts.
Furthermore, you have to mind encodings when you are writing your text to files on disk, network communications, external devices, or any place for other programs to read from. Please be so kind as to use System.Text.Encoding.UTF8 (.NET) in these cases, never Encoding.ASCII, UTF-16 or cellphone PDU, regardless of the assumptions about the contents.
Web frameworks like ASP.NET do suffer from the poor choice of internal string representation in the underlying framework: the expected string output (and input) of a web application is nearly always UTF-8, resulting in significant conversion overhead in high-throughput web applications and web services.
Q: Why not just let any programmer use her favorite encoding internally, as long as she knows how to use it?
A: We have nothing against the correct use of any encoding. However, it becomes a problem when the same type, such as std::string, means different things in different contexts. While it is ‘ANSI codepage’ for some, for others, it means ‘this code is broken and does not support non-English text’. In our programs, it means a Unicode-aware UTF-8 string. This diversity is a source of many bugs and much misery: this additional complexity is something that the world does not really need, and the result is much Unicode-broken software, industry-wide.
Q: UTF-16 characters that take more than two bytes are extremely rare in the real world. This practically makes UTF-16 a fixed-width encoding, giving it a whole bunch of advantages. Can’t we just neglect these characters?
A: Are you serious about not supporting all of Unicode in your software design? And, if you are going to support it anyway, how does the fact that non-BMP characters are rare practically change anything, except for making software testing harder? What does matter, however, is that text manipulations are relatively rare in real applications—compared to just passing strings around as-is. This means the "almost fixed width" has little performance advantage (see Performance), while having shorter strings may be significant.
Q: Why do you turn on the _UNICODE define, if you do not intend to use Windows’ LPTSTR/TCHAR/etc macros?
A: This is a precaution against plugging a UTF-8 char* string into ANSI-expecting functions of the Windows API. We want it to generate a compiler error. It is the same kind of hard-to-find bug as passing an argv[] string to fopen() on Windows: it assumes that the user will never pass non-current-codepage filenames. You will be unlikely to find this kind of bug by manual testing, unless your testers are trained to supply Chinese file names occasionally, and yet it is broken program logic. Thanks to the _UNICODE define, you get an error for that.
Q: Isn’t it quite naïve to think that Microsoft will stop using widechars one day?
A: Let’s first see when they start supporting CP_UTF8 as a valid locale. This should not be very hard to do. Then, we see no reason why anybody would continue using the widechar APIs. Also, adding support for CP_UTF8 would ‘unbreak’ some of existing unicode-broken programs and libraries.
Some say that adding CP_UTF8 support would break existing applications that use the ANSI API, and that this was supposedly the reason why Microsoft had to resort to creating the wide string API. This is not true. Even some popular ANSI encodings are variable length (Shift JIS, for example), so no correct code would become broken. The reason Microsoft chose UCS-2 is purely historical. Back then UTF-8 did not yet exist, Unicode was believed to be ‘just a wider ASCII’, and it was considered important to use a fixed-width encoding.
Q: What are characters, code points, code units and grapheme clusters?
A: Here is an excerpt of the definitions according to the Unicode Standard with our comments. Refer to the relevant sections of the standard for more detailed description.
Code point

Any numerical value in the Unicode codespace.^{[§3.4, D10]} For instance: U+3243F.

Code unit

The minimal bit combination that can represent a unit of encoded text.^{[§3.9, D77]} For example, UTF-8, UTF-16 and UTF-32 use 8-bit, 16-bit and 32-bit code units respectively. The above code point will be encoded as ‘f0 b2 90 bf’ in UTF-8, ‘d889 dc3f’ in UTF-16 and ‘0003243f’ in UTF-32. Note that these are just sequences of groups of bits; how they are stored further depends on the endianness of the particular encoding. So, when storing the above UTF-16 code units on octet-oriented media, they will be converted to ‘d8 89 dc 3f’ for UTF-16BE and to ‘89 d8 3f dc’ for UTF-16LE.
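The arithmetic behind these code unit sequences can be sketched as follows. The function names are ours, and the encoders assume a valid Unicode scalar value as input (no surrogates, nothing above U+10FFFF).

```cpp
#include <cstdint>
#include <vector>

// Encodes one Unicode scalar value (assumed valid) as UTF-8 code units.
std::vector<std::uint8_t> encode_utf8(char32_t cp)
{
    if (cp < 0x80)
        return {static_cast<std::uint8_t>(cp)};
    if (cp < 0x800)
        return {static_cast<std::uint8_t>(0xC0 | (cp >> 6)),
                static_cast<std::uint8_t>(0x80 | (cp & 0x3F))};
    if (cp < 0x10000)
        return {static_cast<std::uint8_t>(0xE0 | (cp >> 12)),
                static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)),
                static_cast<std::uint8_t>(0x80 | (cp & 0x3F))};
    return {static_cast<std::uint8_t>(0xF0 | (cp >> 18)),
            static_cast<std::uint8_t>(0x80 | ((cp >> 12) & 0x3F)),
            static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)),
            static_cast<std::uint8_t>(0x80 | (cp & 0x3F))};
}

// Encodes one Unicode scalar value (assumed valid) as UTF-16 code units.
std::vector<std::uint16_t> encode_utf16(char32_t cp)
{
    if (cp < 0x10000)
        return {static_cast<std::uint16_t>(cp)};
    cp -= 0x10000;  // 20 bits remain, split 10/10 over a surrogate pair
    return {static_cast<std::uint16_t>(0xD800 + (cp >> 10)),
            static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF))};
}

// Serializes 16-bit code units to octets in the requested byte order.
std::vector<std::uint8_t> serialize_utf16(const std::vector<std::uint16_t>& units,
                                          bool big_endian)
{
    std::vector<std::uint8_t> bytes;
    for (std::uint16_t u : units) {
        const std::uint8_t hi = static_cast<std::uint8_t>(u >> 8);
        const std::uint8_t lo = static_cast<std::uint8_t>(u & 0xFF);
        bytes.push_back(big_endian ? hi : lo);
        bytes.push_back(big_endian ? lo : hi);
    }
    return bytes;
}
```

Calling `encode_utf8(0x3243F)` and `encode_utf16(0x3243F)` reproduces the code units listed above, and `serialize_utf16` yields the UTF-16BE and UTF-16LE octet sequences.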

Abstract character

A unit of information used for the organization, control, or representation of textual data.^{[§3.4, D7]} The standard further says in §3.1:

For the Unicode Standard, [...] the repertoire is inherently open. Because Unicode is a universal encoding, any abstract character that could ever be encoded is a potential candidate to be encoded, regardless of whether the character is currently known.
The definition is indeed abstract. Whatever one can think of as a character—is an abstract character. For example, tengwar letter ungwe is an abstract character, although it is not yet representable in Unicode.

Encoded character

Coded character
A mapping between a code point and an abstract character.^{[§3.4, D11]} For example, U+1F428 is a coded character which represents the abstract character 🐨 koala.
This mapping is neither total, nor injective, nor surjective:
- Surrogates, noncharacters and unassigned code points do not correspond to abstract characters at all.
- Some abstract characters can be encoded by different code points; U+03A9 greek capital letter omega and U+2126 ohm sign both correspond to the same abstract character ‘Ω’, and must be treated identically.
- Some abstract characters cannot be encoded by a single code point. These are represented by sequences of coded characters. For example, the only way to represent the abstract character ю́ cyrillic small letter yu with acute is by the sequence U+044E cyrillic small letter yu followed by U+0301 combining acute accent.
Moreover, for some abstract characters, there exist representations using multiple code points, in addition to the single coded character form. The abstract character ǵ can be coded by the single code point U+01F5 latin small letter g with acute, or by the sequence <U+0067 latin small letter g, U+0301 combining acute accent>.
User-perceived character

Whatever the end user thinks of as a character. This notion is language dependent. For instance, ‘ch’ is two letters in English and Latin, but considered to be one letter in Czech and Slovak.

Grapheme cluster

A sequence of coded characters that ‘should be kept together’.^{[§2.11]} Grapheme clusters approximate the notion of user-perceived characters in a language independent way. They are used for, e.g., cursor movement and selection.

Character

May mean any of the above. The Unicode Standard uses it as a synonym for coded character.^{[§3.4]}
When some programming language or library documentation says ‘character’, it almost always means a code unit. When an end user is asked about the number of characters in a string, she will count the user-perceived characters. When a programmer tries to count the number of characters, she will count the number of code units, code points, or grapheme clusters, according to the level of her expertise. All this is a source of confusion, as people conclude that, if for the length of the string ‘🐨’ the library returns a value other than one, then it ‘does not support Unicode’.

Q: Why would the Asians give up on UTF-16 encoding, which saves them 50% of the memory per character?

A: It does so only in artificially constructed examples containing only characters in the U+0800 to U+FFFF range. However, computer-to-computer text interfaces dominate any other. This includes XML, HTTP, filesystem paths and configuration files—they all use almost exclusively ASCII characters, and in fact UTF-8 is used just as often in those countries.
For dedicated storage of Chinese books, UTF-16 may still be used as a fair optimization. As soon as the text is retrieved from such storage, it should be converted to the standard compatible with the rest of the world. Anyway, if storage is at a premium, lossless compression will be used. In such cases, UTF-8 and UTF-16 will take roughly the same space. Furthermore, ‘in the said languages, a glyph conveys more information than a [L]atin character so it is justified for it to take more space.’ (Tronic, UTF-16 harmful).
Here are the results of a simple experiment. The space used by the HTML source of some web page (Japan article, retrieved from Japanese Wikipedia on 2012–01–01) is shown in the first column. The second column shows the results for text with markup removed, that is ‘select all, copy, paste into plain text file’.

	HTML Source (Δ UTF-8)	Dense text (Δ UTF-8)
UTF-8	767 KB (0%)	222 KB (0%)
UTF-16	1 186 KB (+55%)	176 KB (−21%)
UTF-8 zipped	179 KB (−77%)	83 KB (−63%)
UTF-16LE zipped	192 KB (−75%)	76 KB (−66%)
UTF-16BE zipped	194 KB (−75%)	77 KB (−65%)

As can be seen, UTF-16 takes about 50% more space than UTF-8 on real data, saves only about 20% for dense Asian text, and hardly competes with general purpose compression algorithms.
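The effect is easy to reproduce on a toy scale. The sketch below computes the encoded size of a sequence of code points in both encodings; the function names and the sample strings are ours and are not the data behind the table.

```cpp
#include <cstddef>
#include <string>

// Bytes needed to store the text in UTF-8 (code points assumed valid).
std::size_t utf8_size(const std::u32string& text)
{
    std::size_t bytes = 0;
    for (char32_t cp : text)
        bytes += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    return bytes;
}

// Bytes needed to store the text in UTF-16 (code points assumed valid).
std::size_t utf16_size(const std::u32string& text)
{
    std::size_t bytes = 0;
    for (char32_t cp : text)
        bytes += cp < 0x10000 ? 2 : 4;
    return bytes;
}
```

For the dense text U"\u65E5\u672C" (two CJK ideographs) this gives 6 bytes in UTF-8 against 4 in UTF-16, but once the same text is wrapped in markup, as in U"<p>\u65E5\u672C</p>", it is 13 bytes in UTF-8 against 18 in UTF-16.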

Q: What do you think about BOMs?
A: They are another reason not to use UTF-16. UTF-8 has a BOM too, even though byte order is not an issue in this encoding. This is to manifest that this is a UTF-8 stream. If UTF-8 remains the only popular encoding (as it already is in the internet world), the BOM becomes redundant. In practice, many UTF-8 text files omit BOMs today. The Unicode Standard does not recommend using BOMs.
Q: What do you think about line endings?
A: All files shall be read and written in binary mode since this guarantees interoperability—a program will always give the same output on any system. Since the C and C++ standards use \n as in-memory line endings, this will cause all files to be written in the POSIX convention. It may cause trouble when the file is opened in Notepad on Windows; however, any decent text viewer understands such line endings.
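A minimal sketch of this convention using standard streams; the helper names `write_text_file` and `read_text_file` are ours. The only essential part is the std::ios_base::binary flag, which switches off any line-ending translation by the runtime.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Writes the text exactly as given. In binary mode the runtime performs no
// "\n" to "\r\n" translation, so the bytes on disk are the same everywhere.
bool write_text_file(const char* path, const std::string& utf8_text)
{
    std::ofstream out(path, std::ios_base::binary);
    out.write(utf8_text.data(), static_cast<std::streamsize>(utf8_text.size()));
    out.close();
    return !out.fail();
}

// Reads the whole file back, again without any line-ending translation.
// Returns an empty string if the file cannot be opened.
std::string read_text_file(const char* path)
{
    std::ifstream in(path, std::ios_base::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

Writing "one\ntwo\n" this way produces the same 8 bytes on every platform.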
Q: But what about performance of text processing algorithms, byte alignment, etc?
A: Is it really better with UTF-16? Maybe so. ICU uses UTF-16 for historical reasons, thus it is quite hard to measure. However, most of the time strings are treated as cookies, not sorted or reversed every second use. A smaller encoding is then favorable for performance.
Q: Isn’t UTF-8 merely an attempt to be compatible with ASCII? Why keep this old fossil?
A: Maybe it was. Today, it is a better and more popular encoding of Unicode than any other.
Q: Is it really a fault of UTF-16 that people misuse it, assuming that it is 16 bits per character?
A: Not really. But yes, safety is an important feature of every design.
Q: If std::string means UTF-8, wouldn’t that get confused with code that stores plain text in std::strings?
A: There is no such thing as plain text. There is no reason for storing codepage-ANSI or ASCII-only text in a class named ‘string’.
Q: Won’t the conversions between UTF-8 and UTF-16 when passing strings to Windows slow down my application?
A: First, you will do some conversion either way. It’s either when calling the system, or when interacting with the rest of the world. Even if your interaction with the system is more frequent in your application, here is a little experiment.
A typical use of the OS is to open files. This function executes in (184 ± 3)μs on my machine:
```
void f(const wchar_t* name)
{
    HANDLE f = CreateFileW(name, GENERIC_WRITE, FILE_SHARE_READ, 0, CREATE_ALWAYS, 0, 0);
    DWORD written;
    WriteFile(f, "Hello world!\n", 13, &written, 0);
    CloseHandle(f);
}
```
While this runs in (186 ± 0.7)μs:
```
void f(const char* name)
{
    HANDLE f = CreateFileW(widen(name).c_str(), GENERIC_WRITE, FILE_SHARE_READ, 0, CREATE_ALWAYS, 0, 0);
    DWORD written;
    WriteFile(f, "Hello world!\n", 13, &written, 0);
    CloseHandle(f);
}
```
(Run with name="D:\\a\\test\\subdir\\subsubdir\\this is the sub dir\\a.txt" in both cases. It was averaged over 5 runs. We used an optimized widen that relies on std::string contiguous storage guarantee given by C++11.)
This is just (1 ± 2)% overhead. Moreover, MultiByteToWideChar is almost surely suboptimal. Better UTF-8↔UTF-16 conversion functions exist.
Q: How do I write a UTF-8 string literal in my C++ code?
A: If you internationalize your software then all non-ASCII strings will be loaded from an external translation database, so it is not a problem.
If you still want to embed a special character you can do it as follows. In C++11 you can do it as:

u8"∃y ∀x ¬(x ≺ y)"
With compilers that do not support ‘u8’ you can hard-code the UTF-8 code units as follows:

"\xE2\x88\x83y \xE2\x88\x80x \xC2\xAC(x \xE2\x89\xBA y)"
However the most straightforward way is to just write the string as-is and save the source file encoded in UTF-8:

"∃y ∀x ¬(x ≺ y)"
Unfortunately, MSVC converts it to some ANSI codepage, corrupting the string. To work around this, save the file in UTF-8 without BOM. MSVC will assume that it is in the correct codepage and will not touch your strings. However, it renders it impossible to use Unicode identifiers and wide string literals (that you will not be using anyway).
Q: How can I check for presence of a specific ASCII character, e.g. apostrophe (') for SQL injection prevention, or HTML markup special characters, etc. in a UTF-8 encoded string?
A: Do as you would for an ASCII string. Every non-ASCII character is encoded in UTF-8 as a sequence of bytes, each of them having value greater than 127. This leaves no place for collision for a naïve algorithm—simple, fast and elegant.
Also, you can search for a UTF-8 encoded substring in a UTF-8 string as if it was a plain byte array—no need to mind code point boundaries. This is a design feature of UTF-8—a leading byte of an encoded code point can never hold value corresponding to one of trailing bytes of any other code point.
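Both properties can be sketched with nothing more than std::string::find. The wrapper names below are ours; they add nothing over find itself and only make the intent explicit.

```cpp
#include <string>

// True if the UTF-8 string contains the given ASCII character.
// A plain byte search is enough: no byte of a multi-byte sequence
// is ever below 0x80, so it cannot be mistaken for ASCII.
bool contains_ascii(const std::string& utf8, char ascii)
{
    return utf8.find(ascii) != std::string::npos;
}

// Byte offset of the first occurrence of needle in haystack (both
// well-formed UTF-8), or std::string::npos. A match always starts on a
// code point boundary, because lead and trail bytes use disjoint ranges.
std::string::size_type find_utf8(const std::string& haystack,
                                 const std::string& needle)
{
    return haystack.find(needle);
}
```

For example, U+0427 is encoded in UTF-8 as the bytes D0 A7, so `contains_ascii("\xD0\xA7", '\'')` is false, even though the code point value itself ends in 0x27, the code of the apostrophe.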
Q: I have a complex large char-based Windows application. What is the easiest way to make it Unicode-aware?
A: Keep the chars. Define _UNICODE to get compiler errors where narrow()/widen() should be used. Find all fstream and fopen() uses, and use wide overloads as described above. By now, you are almost done.
If you use 3rd-party libraries that do not support Unicode, e.g. forwarding file name strings as-is to fopen(), you will have to work around with tools such as GetShortPathName() as shown above.
Q: I already use this approach and I want to make our vision come true. What can I do?
A: Review your code and see what library is most painful to use in portable Unicode-aware code. Open a bug report to the authors.
If you are a C or C++ library author, use char* and std::string with UTF-8 implied, and refuse to support ANSI code pages—since they are inherently Unicode-broken.
If you are a Microsoft employee, push for implementing support of the CP_UTF8 as one of narrow API code pages.

Myths

Note: If you are not familiar with the Unicode terminology, please read this FAQ first.

Note: For the purpose of this discussion, indexing into the string is also a kind of character counting.

Counting characters can be done in constant time with UTF-16.

This is a common mistake by those who think that UTF-16 is a fixed-width encoding. It is not. In fact UTF-16 is a variable length encoding. Refer to this FAQ if you still deny the existence of non-BMP characters.
Many try to fix this statement by switching encodings, and come with the following statement:

Counting characters can be done in constant time with UTF-32.

Now, the truth of this statement depends on the meaning of the ambiguous and overloaded word ‘character’. The only interpretations that would make the claim true are ‘code units’ and ‘code points’, which coincide in UTF-32. However, code points are not characters, neither according to Unicode nor according to the end user. Some of them are non-characters. These should not be interchanged though. So, assuming we can guarantee that the string does not contain non-characters, each code point would represent a single coded character, and we could count them.
But is this such an important achievement? Why does the above concern arise at all?

Counting coded characters or code points is important.

The importance of code points is frequently overstated. This is due to a misunderstanding of the complexity of Unicode, which merely reflects the complexity of human languages. It is easy to tell how many characters there are in ‘Abracadabra’, but it is not so simple for the following string:

Приве́т नमस्ते שָׁלוֹם

The above string consists of 22 (!) code points but only 16 grapheme clusters. So, ‘Abracadabra’ consists of 11 code points, while the above string consists of 22 code points, or of 20 if converted to NFC. Yet, the number of code points is irrelevant to almost any software engineering question, with perhaps the only exception of converting the string to UTF-32. For example:

For cursor movement, text selection and alike, grapheme clusters shall be used.
For limiting the length of a string in input fields, file formats, protocols, or databases, the length is measured in code units of some predetermined encoding. The reason is that any length limit is derived from the fixed amount of memory allocated for the string at a lower level, be it in memory, disk or in a particular data structure.
The size of the string as it appears on the screen is unrelated to the number of code points in the string. One has to communicate with the rendering engine for this. Code points do not occupy one column even in monospace fonts and terminals. POSIX takes this into account.
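To make the difference between code units and code points concrete, here is a small sketch that counts code points in a UTF-8 string. The function name is ours, and it assumes well-formed input; it says nothing about grapheme clusters or display width, which need a full Unicode library or the rendering engine.

```cpp
#include <cstddef>
#include <string>

// Counts code points in a well-formed UTF-8 string by counting every
// byte that is not a continuation byte (bit pattern 10xxxxxx).
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t n = 0;
    for (unsigned char byte : utf8)
        if ((byte & 0xC0) != 0x80)
            ++n;
    return n;
}
```

For the decomposed form of ǵ, the string "g\xCC\x81" (U+0067 followed by U+0301), size() reports 3 code units and count_code_points reports 2 code points, while the user perceives a single character. None of the three numbers is "the" length; each answers a different question.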

In NFC each code point corresponds to one user-perceived character.

No, because the number of user-perceived characters that can be represented in Unicode is virtually infinite. Even in practice, most characters do not have a fully composed form. For example, the NFD string from the example above, which consists of three real words in three real languages, will consist of 20 code points in NFC. This is still far more than the 16 user-perceived characters it has.

The string `length()` operation must count user-perceived or coded characters. If not, it does not support Unicode properly.

Unicode support of libraries and programming languages is frequently judged by the value returned for the ‘length of the string’ operation. According to this evaluation of Unicode support, most popular languages, such as C#, Java, and even the ICU itself, would not support Unicode. For example, the length of the one character string ‘🐨’ will often be reported to be 2 where UTF-16 is used for the internal string representation and 4 in languages that internally use UTF-8. The source of the misconception is that the specifications of these languages use the word ‘character’ to mean a code unit, while the programmer expects it to be something else.
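The same effect can be observed in C++ itself, where size() is documented to count code units. A minimal sketch, with variable names of our choosing and the UTF-8 form written out as explicit bytes:

```cpp
#include <string>

// The one-character string U+1F428 KOALA in three encoding forms.
const std::string    koala_utf8  = "\xF0\x9F\x90\xA8";  // 4 code units of 8 bits
const std::u16string koala_utf16 = u"\U0001F428";       // 2 code units (surrogate pair)
const std::u32string koala_utf32 = U"\U0001F428";       // 1 code unit
```

Here `koala_utf8.size()` is 4, `koala_utf16.size()` is 2 and `koala_utf32.size()` is 1. All three strings hold the same text, and all three answers are correct for what size() promises.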

About the authors

This manifesto was written by Pavel Radzivilovsky, Yakov Galka and Slava Novgorodov, as a result of much experience and research of real-world Unicode issues and mistakes made by real-world programmers. The goal is to improve awareness of text issues and to inspire industry-wide changes to make Unicode-aware programming easier, ultimately improving the experience of users of those programs written by human engineers. None of us is involved in the Unicode consortium.
Much of the text was inspired by discussions on StackOverflow initiated by Artyom Beilis, the author of Boost.Locale. You can leave comments/feedback there. Additional inspiration came from the development conventions at VisionMap and Michael Hartl’s tauday.org.

External links

The Unicode Consortium
International Components for Unicode (ICU)
Joel on Unicode—‘The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets’
Boost.Locale—high quality localization facilities in a C++ way.
Should UTF-16 be considered harmful on StackOverflow, started by Artyom Beilis.

Last modified: 2013-02-15
