Sunday, 29 December 2013

UTF-8 Everywhere (manifesto by Pavel Radzivilovsky, Yakov Galka and Slava Novgorodov)


Manifesto

Purpose of this document

This document contains special characters. Without proper rendering support, you may see question marks, boxes, or other symbols.
To promote usage and support of the UTF-8 encoding and to convince the reader that it should be the default choice of encoding for storing text strings in memory or on disk, for communication and all other uses. We believe that all other encodings of Unicode (or text, in general) belong to rare edge-cases of optimization and should be avoided by mainstream users.
In particular, we believe that the very popular UTF-16 encoding (mistakenly used as a synonym to ‘widechar’ and ‘Unicode’ in the Windows world) has no place in library APIs (except for specialized libraries, which deal with text).
This document recommends choosing UTF-8 as string storage in Windows applications, despite the fact that this standard is less popular there, due to historical reasons and the lack of native UTF-8 support by the API. Yet, we believe that, even on this platform, the following arguments outweigh the lack of native support. Also, we recommend forgetting forever what ‘ANSI codepages’ are and what they were used for. It is in the customer’s bill of rights to mix any number of languages in any text string.
We recommend avoiding C++ application code that depends on _UNICODE define. This includes TCHAR/LPTSTR types on Windows and APIs defined as macros, such as CreateWindow. We also recommend alternative ways to reach the goals of these APIs.
We also believe that, if an application is not supposed to specialize in text, the infrastructure must make it possible for the program to be unaware of encoding issues. For instance, a file copy utility should not be written differently to support non-English file names. Joel’s great article on Unicode explains the encodings well for the beginners, but it lacks the most important part: how a programmer should proceed, if she does not care what is inside the string.

Background

In 1988, Joseph D. Becker published the first Unicode draft proposal. At the basis of his design was the naïve assumption that 16 bits per character would suffice. In 1991, the first version of the Unicode standard was published, with code points limited to 16 bits. In the following years many systems have added support for Unicode and switched to the UCS-2 encoding. It was especially attractive for new technologies, like Qt framework (1992), Windows NT 3.1 (1993) and Java (1995).
However, it was soon discovered that 16 bits per character would not do for Unicode. In 1996, the UTF-16 encoding was created so existing systems would be able to work with non-16-bit characters. This effectively nullified the rationale behind choosing 16-bit encoding in the first place, namely being a fixed-width encoding. Currently Unicode spans over 109449 characters, about 74500 of them being CJK ideographs.
[Photo: a child playing an encodings game in front of a large poster about encodings. Nagoya City Science Museum. Photo by Vadim Zlotnik.]
Microsoft has, ever since, mistakenly used ‘Unicode’ and ‘widechar’ as synonyms for ‘UCS-2’ and ‘UTF-16’. Furthermore, since UTF-8 cannot be set as the encoding for narrow string WinAPI, one must compile her code with _UNICODE rather than _MBCS. Windows C++ programmers are educated that Unicode must be done with ‘widechars’. As a result of this mess, they are now among the most confused ones about what is the right thing to do about text.
At the same time, in the Linux and the Web worlds, there is a silent agreement that UTF-8 is the most correct encoding for Unicode on the planet Earth. Even though it gives a strong preference to English and therefore to computer languages (such as C++, HTML, XML, etc) over any other text, it is seldom less efficient than UTF-16 for commonly used character sets.

The Facts

  • In both UTF-8 and UTF-16 encodings, code points may take up to 4 bytes (contrary to what Joel says).
  • UTF-8 is endianness-independent. UTF-16 comes in two flavors, UTF-16LE and UTF-16BE, one for each byte order. Here we name them collectively as UTF-16.
  • Widechar is 2 bytes in size on some platforms, 4 on others.
  • UTF-8 and UTF-32 yield the same order when sorted lexicographically; UTF-16 does not, because non-BMP code points are encoded with surrogate code units in the range U+D800–U+DFFF, which compare below the ordinary code units U+E000–U+FFFF.
  • UTF-8 favors efficiency for English letters and other ASCII characters (one byte per character) while UTF-16 favors several Asian character sets (2 bytes instead of 3 in UTF-8). This is what made UTF-8 the favorite choice in the Web world, where English HTML/XML tags are intermixed with any-language text. Cyrillic, Hebrew and several other popular Unicode blocks are 2 bytes both in UTF-16 and UTF-8.
  • In the Linux world, narrow strings are considered UTF-8 by default almost everywhere. This way, for example, a file copy utility would not need to care about encodings. Once tested on ASCII strings for file name arguments, it would certainly work correctly for file names in any language, as arguments are treated as cookies. The code of the file copy utility would not need to change at all to support foreign languages. fopen() would accept Unicode seamlessly, and so would argv.
  • On Microsoft Windows, however, making a file copy utility that can accept file names in a mix of several different Unicode blocks requires advanced trickery. First, the application must be compiled as Unicode-aware. In this case, it cannot have a main() function with standard-C parameters. It will then accept UTF-16 encoded argv. To convert a Windows program written with narrow text in mind to support Unicode, one has to refactor deeply and take care of each and every string variable.
  • On Windows, SetCodePage() API enables receiving non-ASCII characters, but only from a single ANSI codepage. An unimplemented parameter CP_UTF8 would enable doing the above, on Windows.
  • The standard library shipped with MSVC is poorly implemented. It forwards narrow-string parameters directly to the OS ANSI API. There is no way to override this. Changing std::locale does not work. It’s impossible to open a file with a Unicode name on MSVC using standard features of C++. The standard way to open a file is:
    std::fstream fout("abc.txt");
    The way to get around this is to use Microsoft’s own hack: a non-standard extension that accepts a wide-string parameter.
  • There is no way to return Unicode from std::exception::what() other than using UTF-8.
  • UTF-16 is often misused as a fixed-width encoding, even by Windows native programs themselves: in plain Windows edit control (until Vista), it takes two backspaces to delete a character which takes 4 bytes in UTF-16. On Windows 7 the console displays that character as two invalid characters, regardless of the font used.
  • Many third-party libraries for Windows do not support Unicode: they accept narrow string parameters and pass them to the ANSI API. Sometimes, even for file names. In the general case, it is impossible to work around this, as a string may not be representable completely in any ANSI code page (if it contains characters from a mix of Unicode blocks). What is normally done on Windows for file names is getting an 8.3 path to the file (if it already exists) and feeding it into such a library. It is not possible if the library is supposed to create a non-existing file. It is not possible if the path is very long and the 8.3 form is longer than MAX_PATH. It is not possible if short-name generation is disabled in OS settings.
  • UTF-16 is very popular today, even outside the Windows world. Qt, Java, C#, Python, the ICU—they all use UTF-16 for internal string representation.

Our Conclusions

UTF-16 is the worst of both worlds—variable length and too wide. It exists for historical reasons, adds a lot of confusion and will hopefully die out.
Portability, cross-platform interoperability and simplicity are more important than interoperability with existing platform APIs. So, the best approach is to use UTF-8 narrow strings everywhere and convert them back and forth on Windows before calling APIs that accept strings. Performance is seldom an issue of any relevance when dealing with string-accepting system APIs (e.g. UI code and file system APIs), but there is a huge advantage to using the same encoding everywhere, and we see no sufficient reason to do otherwise.
Speaking of performance, machines often use strings to communicate (e.g. HTTP headers, XML). Many see this as a mistake, but regardless of that it is nearly always done in English, giving UTF-8 further advantage there. Using different encodings for different kinds of strings significantly increases complexity and consequent bugs.
In particular, we believe that adding wchar_t to C++ was a mistake, and so are the Unicode additions to C++11. What must be demanded from the implementations, though, is that the basic execution character set would be capable of storing any Unicode data. Then, every std::string or char* parameter would be Unicode-compatible. ‘If this accepts text, it should be Unicode compatible’—and with UTF-8, it is also easy to do.
The standard facets have many design flaws. This includes std::numpunct, std::moneypunct and std::ctype not supporting variable-length encoded characters (non-ASCII UTF-8 and non-BMP UTF-16). They must be fixed:
  • decimal_point() and thousands_sep() should return a string rather than a single code unit. (By the way C locales do support this, albeit not customizable.)
  • toupper() and tolower() should not operate on code units, as that does not work in Unicode. For example, the German ß must be converted to SS and the ligature ﬄ to FFL.

How to do text on Windows

The following is what we recommend to everyone for compile-time checked Unicode correctness, ease of use and better cross-platform portability of the code. This differs substantially from what is usually recommended as the proper way of using Unicode on Windows. Yet, in-depth research into the question led us to the same conclusion. So here it goes:
  • Do not use wchar_t or std::wstring anywhere except in code immediately adjacent to APIs accepting UTF-16.
  • Do not use _T("") or L"" literals in any place other than parameters to APIs accepting UTF-16.
  • Do not use types, functions, or their derivatives that are sensitive to the _UNICODE constant, such as LPTSTR or CreateWindow().
  • Yet, keep _UNICODE always defined, so that passing narrow strings to WinAPI produces a compile error instead of being silently accepted.
  • std::strings and char*, anywhere in the program, are considered UTF-8 (if not said otherwise).
  • Only use Win32 functions that accept widechars (LPWSTR), never those which accept LPTSTR or LPSTR. Pass parameters this way:
    ::SetWindowTextW(widen(someStdString or "string literal").c_str())
    (The policy uses conversion functions described below.)
  • With MFC strings:
    CString someoneElse; // something that arrived from MFC.
    
    // Converted as soon as possible, before passing any further away from the API call:
    std::string s = str(boost::format("Hello %s\n") % narrow(someoneElse));
    AfxMessageBox(widen(s).c_str(), L"Error", MB_OK);

Working with files, filenames and fstreams on Windows

  • Never produce text output files with non-UTF-8 content
  • Using fopen() should be avoided anyway, for RAII/OOD reasons. However, if necessary, use _wfopen() and the WinAPI conventions described above.
  • Never pass std::string or const char* filename arguments to the fstream family. The MSVC CRT does not support UTF-8 arguments, but it has a non-standard extension that accepts wide strings. Convert std::string arguments to std::wstring with widen:
    std::ifstream ifs(widen("hello"), std::ios_base::binary);
    We will have to remove the conversion manually when MSVC’s attitude to fstream changes.
  • This code is not multi-platform and may have to be changed manually in the future.
  • Alternatively use a set of wrappers that hide the conversions.

Conversion functions

This guideline uses the conversion functions from the Boost.Nowide library (it is not yet a part of boost):
std::string narrow(const wchar_t *s);
std::wstring widen(const char *s);
std::string narrow(const std::wstring &s);
std::wstring widen(const std::string &s);
The library also provides a set of wrappers for commonly used standard C and C++ library functions that deal with files, as well as means of reading and writing UTF-8 through iostreams.
These functions and wrappers are easy to implement using Windows’ MultiByteToWideChar and WideCharToMultiByte functions. Any other (possibly faster) conversion routines can be used.

FAQ

  1. Q: Are you a linuxer? Is this a concealed religious fight against Windows?

    A: No, I grew up on Windows, and I am a Windows fan. I believe that they made the wrong choice in the text domain, because they made it earlier than others.—Pavel
  2. Q: Are you an Anglophile? Do you secretly think English alphabet and culture are superior to any other?

    A: No, and my country is non-ASCII speaking. I do not think that using a format which encodes ASCII characters in a single byte is Anglo-centrism, or has anything to do with human interaction. Even though one can argue that source codes of programs, web pages and XML files, OS file names and other computer-to-computer text interfaces should never have existed, as long as they do exist, text is not only for human readers.
  3. Q: Why do you guys care? I program in C# and/or Java and I don’t need to care about encodings at all.

    A: Not true. Both C# and Java offer a 16 bit char type, which is less than a Unicode character, congratulations. The .NET indexer str[i] works in units of the internal representation, hence a leaky abstraction once again. Substring methods will happily return an invalid string, cutting a non-BMP character in parts.
    Furthermore, you have to mind encodings when you are writing your text to files on disk, network communications, external devices, or any place for other program to read from. Please be kind to use System.Text.Encoding.UTF8 (.NET) in these cases, never Encoding.ASCII, UTF-16 or cellphone PDU, regardless of the assumptions about the contents.
    Web frameworks like ASP.NET do suffer from the poor choice of internal string representation in the underlying framework: the expected string output (and input) of a web application is nearly always UTF-8, resulting in significant conversion overhead in high-throughput web applications and web services.
  4. Q: Why not just let any programmer use her favorite encoding internally, as long as she knows how to use it?

    A: We have nothing against the correct use of any encoding. However, it becomes a problem when the same type, such as std::string, means different things in different contexts. While it is ‘ANSI codepage’ for some, for others it means ‘this code is broken and does not support non-English text’. In our programs, it means a Unicode-aware UTF-8 string. This diversity is a source of many bugs and much misery: it is additional complexity that the world does not really need, and the result is much Unicode-broken software, industry-wide.
  5. Q: UTF-16 characters that take more than two bytes are extremely rare in the real world. This practically makes UTF-16 a fixed-width encoding, giving it a whole bunch of advantages. Can’t we just neglect these characters?

    A: Are you serious about not supporting all of Unicode in your software design? And, if you are going to support it anyway, how does the fact that non-BMP characters are rare practically change anything, except for making software testing harder? What does matter, however, is that text manipulations are relatively rare in real applications—compared to just passing strings around as-is. This means the "almost fixed width" has little performance advantage (see Performance), while having shorter strings may be significant.
  6. Q: Why do you turn on the _UNICODE define, if you do not intend to use Windows’ LPTSTR/TCHAR/etc macros?

    A: This is a precaution against plugging a UTF-8 char* string into ANSI-expecting functions of Windows API. We want it to generate a compiler error. It is the same kind of a hard-to-find bug as passing an argv[] string to fopen() on Windows: it assumes that the user will never pass non-current-codepage filenames. You will be unlikely to find this kind of a bug by manual testing, unless your testers are trained to supply Chinese file names occasionally, and yet it is a broken program logic. Thanks to _UNICODE define, you get an error for that.
  7. Q: Isn’t it quite naïve to think that Microsoft will stop using widechars one day?

    A: Let’s first see when they start supporting CP_UTF8 as a valid locale. This should not be very hard to do. Then, we see no reason why anybody would continue using the widechar APIs. Also, adding support for CP_UTF8 would ‘unbreak’ some of existing unicode-broken programs and libraries.
    Some say that adding CP_UTF8 support would break existing applications that use the ANSI API, and that this was supposedly the reason why Microsoft had to resort to creating the wide string API. This is not true. Even some popular ANSI encodings are variable length (Shift JIS, for example), so no correct code would become broken. The reason Microsoft chose UCS-2 is purely historical. Back then UTF-8 did not yet exist, Unicode was believed to be ‘just a wider ASCII’, and it was considered important to use a fixed-width encoding.
  8. Q: What are characters, code points, code units and grapheme clusters?

    A: Here is an excerpt of the definitions according to the Unicode Standard with our comments. Refer to the relevant sections of the standard for more detailed description.
    Code point
    Any numerical value in the Unicode codespace.[§3.4, D10] For instance: U+3243F.
    Code unit
    The minimal bit combination that can represent a unit of encoded text.[§3.9, D77] For example, UTF-8, UTF-16 and UTF-32 use 8-bit, 16-bit and 32-bit code units respectively. The above code point will be encoded as ‘f0 b2 90 bf’ in UTF-8, ‘d889 dc3f’ in UTF-16 and ‘0003243f’ in UTF-32. Note that these are just sequences of groups of bits; how they are stored further depends on the endianness of the particular encoding. So, when storing the above UTF-16 code units on octet-oriented media, they will be converted to ‘d8 89 dc 3f’ for UTF-16BE and to ‘89 d8 3f dc’ for UTF-16LE.
    Abstract character
    A unit of information used for the organization, control, or representation of textual data.[§3.4, D7] The standard further says in §3.1:
    For the Unicode Standard, [...] the repertoire is inherently open. Because Unicode is a universal encoding, any abstract character that could ever be encoded is a potential candidate to be encoded, regardless of whether the character is currently known.
    The definition is indeed abstract. Whatever one can think of as a character—is an abstract character. For example, tengwar letter ungwe is an abstract character, although it is not yet representable in Unicode.
    Encoded character
    Coded character
    A mapping between a code point and an abstract character.[§3.4, D11] For example, U+1F428 is a coded character which represents the abstract character 🐨 koala.
    This mapping is neither total, nor injective, nor surjective:
    • Surrogates, noncharacters and unassigned code points do not correspond to abstract characters at all.
    • Some abstract characters can be encoded by different code points; U+03A9 greek capital letter omega and U+2126 ohm sign both correspond to the same abstract character ‘Ω’, and must be treated identically.
    • Some abstract characters cannot be encoded by a single code point. These are represented by sequences of coded characters. For example, the only way to represent the abstract character ю́ cyrillic small letter yu with acute is by the sequence U+044E cyrillic small letter yu followed by U+0301 combining acute accent.
    Moreover, for some abstract characters, there exist representations using multiple code points, in addition to the single coded character form. The abstract character ǵ can be coded by the single code point U+01F5 latin small letter g with acute, or by the sequence <U+0067 latin small letter g, U+0301 combining acute accent>.
    User-perceived character
    Whatever the end user thinks of as a character. This notion is language dependent. For instance, ‘ch’ is two letters in English and Latin, but considered to be one letter in Czech and Slovak.
    Grapheme cluster
    A sequence of coded characters that ‘should be kept together’.[§2.11] Grapheme clusters approximate the notion of user-perceived characters in a language independent way. They are used for, e.g., cursor movement and selection.
    Character
    May mean any of the above. The Unicode Standard uses it as a synonym for coded character.[§3.4]
    When some programming language or library documentation says ‘character’, it almost always means a code unit. When an end user is asked about the number of characters in a string, she will count the user-perceived characters. When a programmer tries to count the number of characters, she will count the number of code units, code points, or grapheme clusters, according to the level of her expertise. All this is a source of confusion, as people conclude that, if for the length of the string ‘🐨’ the library returns a value other than one, then it ‘does not support Unicode’.
  9. Q: Why would the Asians give up on UTF-16 encoding, which saves them 50% the memory per character?

    A: It does so only in artificially constructed examples containing only characters in the U+0800 to U+FFFF range. However, computer-to-computer text interfaces dominate any other. This includes XML, HTTP, filesystem paths and configuration files—they all use almost exclusively ASCII characters, and in fact UTF-8 is used just as often in those countries.
    For a dedicated storage of Chinese books, UTF-16 may still be used as a fair optimization. As soon as the text is retrieved from such storage, it should be converted to the standard compatible with the rest of the world. Anyway, if storage is at premium, a lossless compression will be used. In such cases, UTF-8 and UTF-16 will take roughly the same space. Furthermore, ‘in the said languages, a glyph conveys more information than a [L]atin character so it is justified for it to take more space.’ (Tronic, UTF-16 harmful).
    Here are the results of a simple experiment. The space used by the HTML source of some web page (Japan article, retrieved from Japanese Wikipedia on 2012-01-01) is shown in the first column. The second column shows the results for text with markup removed, that is ‘select all, copy, paste into plain text file’.

    Encoding          HTML source (Δ UTF-8)   Dense text (Δ UTF-8)
    UTF-8             767 KB (0%)             222 KB (0%)
    UTF-16            1 186 KB (+55%)         176 KB (−21%)
    UTF-8 zipped      179 KB (−77%)           83 KB (−63%)
    UTF-16LE zipped   192 KB (−75%)           76 KB (−66%)
    UTF-16BE zipped   194 KB (−75%)           77 KB (−65%)
    As can be seen, UTF-16 takes about 50% more space than UTF-8 on real data, it only saves 20% for dense Asian text, and hardly competes with general purpose compression algorithms.
  10. Q: What do you think about BOMs?

    A: They are another reason not to use UTF-16. UTF-8 has a BOM too, even though byte order is not an issue in this encoding. This is to manifest that this is a UTF-8 stream. If UTF-8 remains the only popular encoding (as it already is in the internet world), the BOM becomes redundant. In practice, many UTF-8 text files omit BOMs today. The Unicode Standard does not recommend using BOMs.
  11. Q: What do you think about line endings?

    A: All files shall be read and written in binary mode since this guarantees interoperability—a program will always give the same output on any system. Since the C and C++ standards use \n as in-memory line endings, this will cause all files to be written in the POSIX convention. It may cause trouble when the file is opened in Notepad on Windows; however, any decent text viewer understands such line endings.
  12. Q: But what about performance of text processing algorithms, byte alignment, etc?

    A: Is it really better with UTF-16? Maybe so. ICU uses UTF-16 for historical reasons, so it is quite hard to measure. However, most of the time strings are treated as cookies, not sorted or reversed on every second use. The smaller encoding is then favorable for performance.
  13. Q: Isn’t UTF-8 merely an attempt to be compatible with ASCII? Why keep this old fossil?

    A: Maybe it was. Today, it is a better and more popular encoding of Unicode than any other.
  14. Q: Is it really a fault of UTF-16 that people misuse it, assuming that it is 16 bits per character?

    A: Not really. But yes, safety is an important feature of every design.
  15. Q: If std::string means UTF-8, wouldn’t that get confused with code that stores plain text in std::strings?

    A: There is no such thing as plain text. There is no reason for storing codepage-ANSI or ASCII-only text in a class named ‘string’.
  16. Q: Won’t the conversions between UTF-8 and UTF-16 when passing strings to Windows slow down my application?

    A: First, you will do some conversion either way. It’s either when calling the system, or when interacting with the rest of the world. Even if your interaction with the system is more frequent in your application, here is a little experiment.
    A typical use of the OS is to open files. This function executes in (184 ± 3)μs on my machine:
    void f(const wchar_t* name)
    {
        HANDLE f = CreateFile(name, GENERIC_WRITE, FILE_SHARE_READ, 0, CREATE_ALWAYS, 0, 0);
        DWORD written;
        WriteFile(f, "Hello world!\n", 13, &written, 0);
        CloseHandle(f);
    }
    While this runs in (186 ± 0.7)μs:
    void f(const char* name)
    {
        HANDLE f = CreateFile(widen(name).c_str(), GENERIC_WRITE, FILE_SHARE_READ, 0, CREATE_ALWAYS, 0, 0);
        DWORD written;
        WriteFile(f, "Hello world!\n", 13, &written, 0);
        CloseHandle(f);
    }
    (Run with name="D:\\a\\test\\subdir\\subsubdir\\this is the sub dir\\a.txt" in both cases. It was averaged over 5 runs. We used an optimized widen that relies on std::string contiguous storage guarantee given by C++11.)
    This is just (1 ± 2)% overhead. Moreover, MultiByteToWideChar is almost surely suboptimal. Better UTF-8↔UTF-16 conversion functions exist.
  17. Q: How do I write UTF-8 string literal in my C++ code?

    A: If you internationalize your software then all non-ASCII strings will be loaded from an external translation database, so it is not a problem.
    If you still want to embed a special character you can do it as follows. In C++11 you can do it as:
    u8"∃y ∀x ¬(x ≺ y)"
    With compilers that do not support ‘u8’ you can hard-code the UTF-8 code units as follows:
    "\xE2\x88\x83y \xE2\x88\x80x \xC2\xAC(x \xE2\x89\xBA y)"
    However the most straightforward way is to just write the string as-is and save the source file encoded in UTF-8:
    "∃y ∀x ¬(x ≺ y)"
    Unfortunately, MSVC converts it to some ANSI codepage, corrupting the string. To work around this, save the file in UTF-8 without BOM. MSVC will assume that it is in the correct codepage and will not touch your strings. However, it renders it impossible to use Unicode identifiers and wide string literals (that you will not be using anyway).
  18. Q: How can I check for presence of a specific ASCII character, e.g. apostrophe (') for SQL injection prevention, or HTML markup special characters, etc. in a UTF-8 encoded string?

    A: Do as you would for an ASCII string. Every non-ASCII character is encoded in UTF-8 as a sequence of bytes, each of them having value greater than 127. This leaves no place for collision for a naïve algorithm—simple, fast and elegant.
    Also, you can search for a UTF-8 encoded substring in a UTF-8 string as if it was a plain byte array—no need to mind code point boundaries. This is a design feature of UTF-8—a leading byte of an encoded code point can never hold value corresponding to one of trailing bytes of any other code point.
  19. Q: I have a complex large char-based Windows application. What is the easiest way to make it Unicode-aware?

    A: Keep the chars. Define _UNICODE to get compiler errors where narrow()/widen() should be used. Find all fstream and fopen() uses, and use wide overloads as described above. By now, you are almost done.
    If you use 3rd-party libraries that do not support Unicode, e.g. forwarding file name strings as-is to fopen(), you will have to work around with tools such as GetShortPathName() as shown above.
  20. Q: I already use this approach and I want to make our vision come true. What can I do?

    A: Review your code and see what library is most painful to use in portable Unicode-aware code. Open a bug report to the authors.
    If you are a C or C++ library author, use char* and std::string with UTF-8 implied, and refuse to support ANSI code pages—since they are inherently Unicode-broken.
    If you are a Microsoft employee, push for implementing support of the CP_UTF8 as one of narrow API code pages.

Myths

Note: If you are not familiar with the Unicode terminology, please read this FAQ first.
Note: For the purpose of this discussion, indexing into the string is also a kind of character counting.

Counting characters can be done in constant time with UTF-16.

This is a common mistake by those who think that UTF-16 is a fixed-width encoding. It is not. In fact UTF-16 is a variable length encoding. Refer to this FAQ if you still deny the existence of non-BMP characters.
Many try to fix this statement by switching encodings, and come with the following statement:

Counting characters can be done in constant time with UTF-32.

Now, whether this statement is true depends on the meaning of the ambiguous and overloaded word ‘character’. The only interpretations that make the claim true are ‘code units’ and ‘code points’, which coincide in UTF-32. However, code points are not characters, neither according to Unicode nor according to the end user; some code points are not characters at all (the noncharacters), and the two notions should not be interchanged. So, assuming we can guarantee that the string does not contain noncharacters, each code point represents a single coded character, and we can count them.
But is this such an important achievement? Why does the above concern arise at all?

Counting coded characters or code points is important.

The importance of code points is frequently overstated. This is due to a misunderstanding of the complexity of Unicode, which merely reflects the complexity of human languages. It is easy to tell how many characters there are in ‘Abracadabra’, but it is not so simple for the following string:
Приве́т नमस्ते שָׁלוֹם
The above string consists of 22 (!) code points but only 16 grapheme clusters. So, ‘Abracadabra’ consists of 11 code points, while the above string consists of 22 code points, or 20 if converted to NFC. Yet, the number of code points is irrelevant to almost any software engineering question, with perhaps the only exception of converting the string to UTF-32. For example:
  • For cursor movement, text selection and alike, grapheme clusters shall be used.
  • For limiting the length of a string in input fields, file formats, protocols, or databases, the length is measured in code units of some predetermined encoding. The reason is that any length limit is derived from the fixed amount of memory allocated for the string at a lower level, be it in memory, disk or in a particular data structure.
  • The size of the string as it appears on the screen is unrelated to the number of code points in the string. One has to communicate with the rendering engine for this. A code point does not necessarily occupy exactly one column, even in monospace fonts and terminals. POSIX takes this into account.

In NFC each code point corresponds to one user-perceived character.

No, because the number of user-perceived characters that can be represented in Unicode is virtually infinite. Even in practice, most characters do not have a fully composed form. For example, the NFD string from the example above, which consists of three real words in three real languages, will consist of 20 code points in NFC. This is still far more than the 16 user-perceived characters it has.

The string length() operation must count user-perceived or coded characters. If not, it does not support Unicode properly.

Unicode support of libraries and programming languages is frequently judged by the value returned for the ‘length of the string’ operation. According to this evaluation of Unicode support, most popular languages, such as C# and Java, and even the ICU itself, would not support Unicode. For example, the length of the one-character string ‘🐨’ is often reported as 2 where UTF-16 is used for the internal string representation, and as 4 by languages that internally use UTF-8. The source of the misconception is that the specifications of these languages use the word ‘character’ to mean a code unit, while the programmer expects it to be something else.

About the authors

This manifesto was written by Pavel Radzivilovsky, Yakov Galka and Slava Novgorodov, as a result of much experience and research of real-world Unicode issues and mistakes made by real-world programmers. The goal is to improve awareness of text issues and to inspire industry-wide changes to make Unicode-aware programming easier, ultimately improving the experience of users of those programs written by human engineers. None of us is involved in the Unicode consortium.
Much of the text was inspired by discussions on StackOverflow initiated by Artyom Beilis, the author of Boost.Locale. You can leave comments/feedback there. Additional inspiration came from the development conventions at VisionMap and Michael Hartl’s tauday.org.

Last modified: 2013-02-15

niedziela, 15 grudnia 2013

C++ Libraries

  1. Dlib

    Dlib is a general purpose cross-platform C++ library designed using contract programming and modern C++ techniques. It is open source software and licensed under the Boost Software License.
  2. Origin

    Origin is a collection of experimental libraries written using the C++11 programming language. The purpose of this project is to foster experimentation with library design and programming techniques using the new version of the language.
The Origin libraries are designed around a minimal set of core facilities that wrap and extend the C++ standard library. These facilities include an extensive type traits framework, support for concept-like type checking, new iterator adaptors, ranges, and support for specification testing.
  3. Folly

    Folly is an open-source C++ library developed and used at Facebook.
  4. Adobe Source Libraries

    The Adobe Source Libraries (ASL) are a collection of C++ libraries building foundation technology to allow the construction of commercial applications by assembling generic algorithms through declarative descriptions.
  5. Cinder

    Cinder is a community-developed, free and open source library for professional-quality creative coding in C++.
  6. JUCE

    JUCE is a wide-ranging C++ class library for building rich cross-platform applications and plugins for all the major operating systems.
  7. OpenFrameworks

    OpenFrameworks is an open source C++ toolkit for creative coding.


piątek, 30 sierpnia 2013

On measuring productivity

funnythingsmustdie(@Reddit)

  • Measuring productivity is easy: first define what results you want, then ask if you achieved those results. If you did, you're productive. Otherwise you're not productive.
  • If you achieve results, then look at how much slack you had, and consider having tighter requirements for the next round.
  • Of course, this requires judgement and intelligence, and you can't replace a manager with a formula in a spreadsheet.

sobota, 17 sierpnia 2013

Designing Qt-Style C++ APIs (by Matthias Ettrich)

Designing Qt-Style C++ APIs
by Matthias Ettrich
We have done substantial research at Trolltech into improving the Qt development experience. In this article, I want to share some of our findings and present the principles we've been using when designing Qt 4, and show you how to apply them to your code.
Designing application programmer interfaces, APIs, is hard. It is an art as difficult as designing programming languages. There are many different principles to choose from, many of which tend to contradict each other.
Computer science education today puts a lot of emphasis on algorithms and data structures, with less focus on the principles behind designing programming languages and frameworks. This leaves application programmers unprepared for an increasingly important task: the creation of reusable components.
Before the rise of object-oriented languages, reusable generic code was mostly written by library vendors rather than by application developers. In the Qt world, this situation has changed significantly. Programming with Qt is writing new components all the time. A typical Qt application has at least some customized components that are reused throughout the application. Often the same components are deployed as part of other applications. KDE, the K Desktop Environment, goes even further and extends Qt with many add-on libraries that implement hundreds of additional classes.
But what constitutes a good, efficient C++ API? What is good or bad depends on many factors -- for example, the task at hand and the specific target group. A good API has a number of features, some of which are generally desirable, and some of which are more specific to certain problem domains.
Six Characteristics of Good APIs
An API is to the programmer what a GUI is to the end-user. The 'P' in API stands for "Programmer", not "Program", to highlight the fact that APIs are used by programmers, who are humans.
We believe APIs should be minimal and complete, have clear and simple semantics, be intuitive, be easy to memorize, and lead to readable code.
  • Be minimal: A minimal API is one that has as few public members per class and as few classes as possible. This makes it easier to understand, remember, debug, and change the API.
  • Be complete: A complete API means the expected functionality should be there. This can conflict with keeping it minimal. Also, if a member function is in the wrong class, many potential users of the function won't find it.
  • Have clear and simple semantics: As with other design work, you should apply the principle of least surprise. Make common tasks easy. Rare tasks should be possible but not the focus. Solve the specific problem; don't make the solution overly general when this is not needed. (For example, QMimeSourceFactory in Qt 3 could have been called QImageLoader and have a different API.)
  • Be intuitive: As with anything else on a computer, an API should be intuitive. Different experience and background leads to different perceptions on what is intuitive and what isn't. An API is intuitive if a semi-experienced user gets away without reading the documentation, and if a programmer who doesn't know the API can understand code written using it.
  • Be easy to memorize: To make the API easy to remember, choose a consistent and precise naming convention. Use recognizable patterns and concepts, and avoid abbreviations.
  • Lead to readable code: Code is written once, but read (and debugged and changed) many times. Readable code may sometimes take longer to write, but saves time throughout the product's life cycle.
Finally, keep in mind that different kinds of users will use different parts of the API. While simply using an instance of a Qt class should be intuitive, it's reasonable to expect the user to read the documentation before attempting to subclass it.
The Convenience Trap
It is a common misconception that the less code you need to achieve something, the better the API. Keep in mind that code is written more than once but has to be understood over and over again. For example,

    QSlider *slider = new QSlider(12, 18, 3, 13, Qt::Vertical,
                                  0, "volume");
    
is much harder to read (and even to write) than
    QSlider *slider = new QSlider(Qt::Vertical);
    slider->setRange(12, 18);
    slider->setPageStep(3);
    slider->setValue(13);
    slider->setObjectName("volume");
    
The Boolean Parameter Trap
Boolean parameters often lead to unreadable code. In particular, it's almost invariably a mistake to add a bool parameter to an existing function. In Qt, the traditional example is repaint(), which takes an optional bool parameter specifying whether the background should be erased (the default) or not. This leads to code such as

    widget->repaint(false);
    
which beginners might read as meaning, "Don't repaint!"
The thinking is apparently that the bool parameter saves one function, thus helping to reduce bloat. In truth, it adds bloat; how many Qt users know by heart what each of the next three lines does?
    widget->repaint();
    widget->repaint(true);
    widget->repaint(false);
    
A somewhat better API might have been
    widget->repaint();
    widget->repaintWithoutErasing();
    
In Qt 4, we solved the problem by simply removing the possibility of repainting without erasing the widget. Qt 4's native support for double buffering made this feature obsolete.
Here are a few more examples:

    widget->setSizePolicy(QSizePolicy::Fixed,
                          QSizePolicy::Expanding, true);
    textEdit->insert("Where's Waldo?", true, true, false);
    QRegExp rx("moc_*.c??", false, true);
    
An obvious solution is to replace the bool parameters with enum types. This is what we've done in Qt 4 with case sensitivity in QString. Compare:
    str.replace("%USER%", user, false);               // Qt 3
    str.replace("%USER%", user, Qt::CaseInsensitive); // Qt 4
    
Static Polymorphism
Similar classes should have a similar API. This can be done using inheritance where it makes sense -- that is, when run-time polymorphism is used. But polymorphism also happens at design time. For example, if you exchange a QListBox with a QComboBox, or a QSlider with a QSpinBox, you'll find that the similarity of APIs makes this replacement very easy. This is what we call "static polymorphism".
Static polymorphism also makes it easier to memorize APIs and programming patterns. As a consequence, a similar API for a set of related classes is sometimes better than perfect individual APIs for each class.
The Art of Naming
Naming is probably the single most important issue when designing an API. What should the classes be called? What should the member functions be called?
General Naming Rules
A few rules apply equally well to all kinds of names. First, as I mentioned earlier, do not abbreviate. Even obvious abbreviations such as "prev" for "previous" don't pay off in the long run, because the user must remember which words are abbreviated.
Things naturally get worse if the API itself is inconsistent; for example, Qt 3 has activatePreviousWindow() and fetchPrev(). Sticking to the "no abbreviation" rule makes it simpler to create consistent APIs.
Another important but more subtle rule when designing classes is that you should try to keep the namespace for subclasses clean. In Qt 3, this principle wasn't always followed. To illustrate this, we will take the example of a QToolButton. If you call name(), caption(), text(), or textLabel() on a QToolButton in Qt 3, what do you expect? Just try playing around with a QToolButton in Qt Designer:
  • The name property is inherited from QObject and refers to an internal object name that can be used for debugging and testing.
  • The caption property is inherited from QWidget and refers to the window title, which has virtually no meaning for QToolButtons, since they usually are created with a parent.
  • The text property is inherited from QButton and is normally used on the button, unless useTextLabel is true.
  • The textLabel property is declared in QToolButton and is shown on the button if useTextLabel is true.
In the interest of readability, name is called objectName in Qt 4, caption has become windowTitle, and there is no longer any textLabel property distinct from text in QToolButton.
Naming Classes
Identify groups of classes instead of finding the perfect name for each individual class. For example, all the Qt 4 model-aware item view classes are suffixed with View (QListView, QTableView, and QTreeView), and the corresponding item-based classes are suffixed with Widget instead (QListWidget, QTableWidget, and QTreeWidget).
Naming Enum Types and Values
When declaring enums, we must keep in mind that in C++ (unlike in Java or C#), the enum values are used without the type. The following example illustrates the dangers of giving too general names to the enum values:
    namespace Qt
    {
        enum Corner { TopLeft, BottomRight, ... };
        enum CaseSensitivity { Insensitive, Sensitive };
        ...
    };
    
    tabWidget->setCornerWidget(widget, Qt::TopLeft);
    str.indexOf("$(QTDIR)", Qt::Insensitive);
    
In the last line, what does Insensitive mean? One guideline for naming enum types is to repeat at least one element of the enum type name in each of the enum values:
    namespace Qt
    {
        enum Corner { TopLeftCorner, BottomRightCorner, ... };
        enum CaseSensitivity { CaseInsensitive,
                               CaseSensitive };
        ...
    };
    
    tabWidget->setCornerWidget(widget, Qt::TopLeftCorner);
    str.indexOf("$(QTDIR)", Qt::CaseInsensitive);
    
When enumerator values can be OR'd together and be used as flags, the traditional solution is to store the result of the OR in an int, which isn't type-safe. Qt 4 offers a template class QFlags<T>, where T is the enum type. For convenience, Qt provides typedefs for the flag type names, so you can type Qt::Alignment instead of QFlags<Qt::AlignmentFlag>.
By convention, we give the enum type a singular name (since it can only hold one flag at a time) and the "flags" type a plural name. For example:
    enum RectangleEdge { LeftEdge, RightEdge, ... };
    typedef QFlags<RectangleEdge> RectangleEdges;
    
In some cases, the "flags" type has a singular name. In that case, the enum type is suffixed with Flag:
    enum AlignmentFlag { AlignLeft, AlignTop, ... };
    typedef QFlags<AlignmentFlag> Alignment;
    
Naming Functions and Parameters
The number one rule of function naming is that it should be clear from the name whether the function has side-effects or not. In Qt 3, the const function QString::simplifyWhiteSpace() violated this rule, since it returned a QString instead of modifying the string on which it is called, as the name suggests. In Qt 4, the function has been renamed QString::simplified().
Parameter names are an important source of information to the programmer, even though they don't show up in the code that uses the API. Since modern IDEs show them while the programmer is writing code, it's worthwhile to give decent names to parameters in the header files and to use the same names in the documentation.
Naming Boolean Getters, Setters, and Properties
Finding good names for the getter and setter of a bool property is always a special pain. Should the getter be called checked() or isChecked()? scrollBarsEnabled() or areScrollBarsEnabled()?
In Qt 4, we used the following guidelines for naming the getter function:
  • Adjectives are prefixed with is-. Examples:
    • isChecked()
    • isDown()
    • isEmpty()
    • isMovingEnabled()
    However, adjectives applying to a plural noun have no prefix:
    • scrollBarsEnabled(), not areScrollBarsEnabled()
  • Verbs have no prefix and don't use the third person (-s):
    • acceptDrops(), not acceptsDrops()
    • allColumnsShowFocus()
  • Nouns generally have no prefix:
    • autoCompletion(), not isAutoCompletion()
    • boundaryChecking()
    Sometimes, having no prefix is misleading, in which case we prefix with is-:
    • isOpenGLAvailable(), not openGL()
    • isDialog(), not dialog()
    (From a function called dialog(), we would normally expect that it returns a QDialog *.)
The name of the setter is derived from that of the getter by removing any is prefix and putting a set at the front of the name; for example, setDown() and setScrollBarsEnabled(). The name of the property is the same as the getter, but without the is prefix.
Pointers or References?
Which is best for out-parameters, pointers or references?

    void getHsv(int *h, int *s, int *v) const
    void getHsv(int &h, int &s, int &v) const
    
Most C++ books recommend references whenever possible, according to the general perception that references are "safer and nicer" than pointers. In contrast, at Trolltech, we tend to prefer pointers because they make the user code more readable. Compare:
    color.getHsv(&h, &s, &v);
    color.getHsv(h, s, v);
    
Only the first line makes it clear that there's a high probability that h, s, and v will be modified by the function call.
Case Study: QProgressBar
To show some of these concepts in practice, we'll study the QProgressBar API of Qt 3 and compare it to the Qt 4 API. In Qt 3:
    class QProgressBar : public QWidget
    {
        ...
    public:
        int totalSteps() const;
        int progress() const;
    
        const QString &progressString() const;
        bool percentageVisible() const;
        void setPercentageVisible(bool);
    
        void setCenterIndicator(bool on);
        bool centerIndicator() const;
    
        void setIndicatorFollowsStyle(bool);
        bool indicatorFollowsStyle() const;
    
    public slots:
        void reset();
        virtual void setTotalSteps(int totalSteps);
        virtual void setProgress(int progress);
        void setProgress(int progress, int totalSteps);
    
    protected:
        virtual bool setIndicator(QString &progressStr,
                                  int progress,
                                  int totalSteps);
        ...
    };
    
The API is quite complex and inconsistent; for example, it's not clear from the naming that reset(), setTotalSteps(), and setProgress() are tightly related.
The key to improving the API is to notice that QProgressBar is similar to Qt 4's QAbstractSpinBox class and its subclasses, QSpinBox, QSlider and QDial. The solution? Replace progress and totalSteps with minimum, maximum and value. Add a valueChanged() signal. Add a setRange() convenience function.
The next observation is that progressString, percentage and indicator really refer to one thing: the text that is shown on the progress bar. Usually the text is a percentage, but it can be set to anything using the setIndicator() function. Here's the new API:
    virtual QString text() const;
    void setTextVisible(bool visible);
    bool isTextVisible() const;
    
By default, the text is a percentage indicator. This can be changed by reimplementing text().
The setCenterIndicator() and setIndicatorFollowsStyle() functions in the Qt 3 API are two functions that influence alignment. They can advantageously be replaced by one function, setAlignment():
    void setAlignment(Qt::Alignment alignment);
    
If the programmer doesn't call setAlignment(), the alignment is chosen based on the style. For Motif-based styles, the text is shown centered; for other styles, it is shown on the right hand side.
Here's the improved QProgressBar API:
    class QProgressBar : public QWidget
    {
        ...
    public:
        void setMinimum(int minimum);
        int minimum() const;
        void setMaximum(int maximum);
        int maximum() const;
        void setRange(int minimum, int maximum);
        int value() const;
    
        virtual QString text() const;
        void setTextVisible(bool visible);
        bool isTextVisible() const;
        Qt::Alignment alignment() const;
        void setAlignment(Qt::Alignment alignment);
    
    public slots:
        void reset();
        void setValue(int value);
    
    signals:
        void valueChanged(int value);
        ...
    };
    
How to Get APIs Right
APIs need quality assurance. The first revision is never right; you must test it. Make use cases by looking at code which uses this API and verify that the code is readable.
Other tricks include having somebody else use the API with or without documentation and documenting the class (both the class overview and the individual functions).
Documenting is also a good way of finding good names when you get stuck: just try to document the item (class, function, enum value, etc.) and use your first sentence as inspiration. If you cannot find a precise name, this is often a sign that the item shouldn't exist. If everything else fails and you are convinced that the concept makes sense, invent a new name. This is, after all, how "widget", "event", "focus", and "buddy" came to be.

This document is licensed under the Creative Commons Attribution-Share Alike 2.5 license.

Copyright © 2005 Trolltech

piątek, 2 sierpnia 2013

Programming tech talks

  1. John Carmack's QuakeCon 2011 Keynote 

  2. John Carmack's QuakeCon 2012 Keynote

  3. John Carmack's QuakeCon 2013 Keynote

    • functional programming
    • strong static type check
    • http://www.pcper.com/reviews/Editorial/John-Carmack-Keynote-Quakecon-2013
  4. GoingNative 2012

  5. GoingNative 2013

    1. Bjarne Stroustrup - The Essence of C++: With Examples in C++84, C++98, C++11, and C++14

    2. C++ Seasoning. (by Sean Parent)

      No Raw Loops
      • Difficult to reason about and difficult to prove post conditions
      • Error prone and likely to fail under non-obvious conditions
      • Introduce non-obvious performance problems
      • Complicates reasoning about the surrounding code

      Alternatives to Raw Loops
      • Use an existing algorithm
      • Prefer standard algorithms if available
      No Raw Synchronization Primitives
      • You Will Likely Get It Wrong
      • They don't scale (Amdahl’s Law)
      No Raw Pointers
      • Prefer value semantic to reference semantic
    3. Writing Quick Code in C++, Quickly (by Andrei Alexandrescu)

      Intuition
      • Ignores aspects of a complex reality
      • Makes narrow/obsolete/wrong assumptions
      • “Fewer instructions = faster code”
      • “Data is faster than computation”
      • “Computation is faster than data”
      • The only good intuition: “I should time this.”

      Measuring gives you a leg up on experts who don’t need to measure

      Data Layout

      • Generally: small is fast
      • First cache line of an object is where it’s at
      • Sort member variables by hotness, descending
      Prefer zero to all other constants

      Returning containers by value is worse than appending

    4. Don’t Help the Compiler (by Stephan T. Lavavej)

      Don't return by const value
      • Inhibits move semantics, doesn't achieve anything useful
      Don't move() when returning a local X by value

      • The NRVO and move semantics are designed to work together
      • NRVO applicable - direct construction is optimal
      • NRVO inapplicable - move semantics is efficient
      Don't return by rvalue reference
      • For experts only, extremely rare
      • Even the Standardization Committee got burned
      • Valid examples: forward, move, declval, get(tuple&&)
      Rely on template argument deduction
      • You control its inputs - the function arguments
      • Change their types/value categories to affect the output
      Avoid explicit template arguments, unless required
      • Required: forward<T>(t), make_shared<T>(a, b, c)
      • Wrong: make_shared<T, A, B, C>(a, b, c)
    5. Keynote: Herb Sutter - One C++
    6. An Effective C++11/14 Sampler (by Scott Meyers)

      Understand std::move and std::forward
      • std::move doesn't move
      • std::forward doesn't forward
      • neither generates code
      • they are simply casts
      • std::move unconditionally casts to rvalue (rvalue_cast)
      • std::forward conditionally casts to rvalue
      Declare functions noexcept whenever possible
      • fun() noexcept - more optimisation possibilities
      • fun() throw() - fewer optimisation possibilities
      • fun() - fewer optimisation possibilities
      • Some code may replace copying only with non-throwing moves
      • noexcept is part of function interface and client may depend on it. Don't use it only because current implementation allows it
      • noexcept is an operator, a bool value evaluated during compilation. Allows conditionally noexcept functions.
    7. The Care and Feeding of C++’s Dragons (by Chandler Carruth)

      Complexity
      • "You can solve every problem with another level of indirection, except for the problem of too many levels of indirection"
      • Complexity: the source and the solution to all programming problems
      • The cost of complexity is exponential.
      • Clever is not a Compliment!
      LLVM Sanitizers:
      • Compiler instrumentation dynamic analysis
      • Address Sanitizer, Thread Sanitizer, Memory Sanitizer, Undefined Behavior Sanitizer
      • Based on shadow memory, can’t be combined
      • Dynamic analysis only works if you test your code!
    8. rand() Considered Harmful (by Stephan T. Lavavej)

      Uniform Random Number Generators
      • random_device
      • mt19937

      Distributions
      • uniform_int_distribution
    9. Inheritance Is The Base Class of Evil (by Sean Parent)

      Hide polymorphic inheritance from user of your API
      • There are no polymorphic types, only a polymorphic use of similar types
      • More flexible: non-intrusive design doesn’t require class wrappers
      • More efficient: polymorphism is only paid for when needed
      • Less error prone: the client doesn’t do any heap allocation or worry about object ownership or lifetimes
      • Exception and thread safe
  6. Exception-Safe Coding in C++ (Jon Kalb, Part 1, Part 2)

  7. The Problem with Time & Timezones - Computerphile

    • Leap second
    • Unix time
    • UTC
    • Astronomic time
  8. Linus Torvalds on git

  9. Unicode in C++ by James McNellis

    • UTF8, UTF-16, UTF-32
    • dynamic composition (Multiple representation)
      A(U+0041) + Umlaut(U+0308)=Ä
    • code unit vs. code point
    • What is length ? (Number of bytes, number of code units, number of code points)
    • Four normalisation forms
      NFC: Canonical Composed
      NFD: Canonical Decomposed
      NFKC: Compatibility Composed
      NFKD: Compatibility Decomposed
  10. Scott Meyers: Better Software — No Matter What

    1. 2\5
      • Inconsistency
    2. 3\5
      • Static analyse
      • Code review
      • Keyhole problem (on arbitrary restrictions - fixed size windows,fixed size hard coded in software)
    3. 4\5
    4. 5\5
      • Retrospectives !
  11. Plain threads are the 'GOTO' of today's computing (by  Hartmut Kaiser) Slides

    • Amdahl’s Law (Strong Scaling)
    • HPX - A General Purpose Runtime System for Applications of Any Scale
  12. CppCon 2014: The Philosophy of Google's C++ Code by Titus Winters

    • Optimize for reader not for the writer
    • Value the standard, but don't idolize it
    • Be consistent
    • Avoid constructs that are dangerous or surprising
    • Avoid tricky and hard to maintain constructs
    • Don't use non-const references
    • Don't use exceptions
  13. Goals for Better Code - Implement Complete Types by Sean Parent

    • Regular type
    • Sometimes the most efficient basis operations are unsafe (violating class invariants)
  • The Silver Bullet Syndrome by Hadi Hariri

    • There is no silver bullet (CORBA, COM, DCOM, WCF, node.js, J2EE, microservices, NoSQL)
    • Don't do "hype oriented programming"
    • Consider technology stability
    • Consider if it's proven technology
    • x86 mov is Turing complete
    • MoVfuscator - a compiler that compiles Brainfuck into mov-only x86 assembly. Combined with a compiler that compiles Basic into Brainfuck, it gives a Basic-to-movs compiler!
    • MoVfuscator 2.0 - a C compiler!!

14. code::dive 2016 conference

15. CppCon 2016 conference

16. Rich Hickey


    • prefer simple over easy
    • data, values, functions are simple
    • declarative data manipulations
    • queues
    • transactions
    • avoid incidental complexity (anything which is not required by the user)
    • Simplicity is the ultimate sophistication (Leonardo da Vinci)
  • Spec-ulation by Rich Hickey
    • on how to manage software dependencies
    • SemVer 2 is broken - Minor and Patch components are irrelevant. Major component change is basically a library name change
    • Better would be to use time based versioning instead of SemVer2 

17. Pushing C# to the limit - Joe Albahari

  • very fast in-memory inter-process pipes
  • simple yet powerful remoting (10x faster than .net remoting)

18. 7 Habits for Success (as explained by the ex-Google tech lead)


  • Take Ownership - stop blaming other people: your boss, your teammates, your family or friends. If the project fails, even if you did your part perfectly, it is also your failure. It is your responsibility to make the project succeed.
  • Accept Failure - do not pretend to be better than anyone else - you are not superhuman - everyone makes mistakes
  • What you do at 8pm matters - what you do in your free time defines your future. If you watch TV you will keep watching TV; if you learn new things you will likely use that knowledge.
  • Success is a lonely road, while failure is a crowded highway
  • The last 10% is the hardest. Finish it - an unfinished project will not have any impact
  • Know why you want it - internal motivation is the key to success
  • Continuously learn


C++ programming style

  1. No naked pointers

    • Keep them inside functions and classes
    • Keep arrays out of interfaces (prefer containers)
    • Pointers are implementation-level artifacts
    • A pointer in a function should not represent ownership
    • Always consider std::unique_ptr and sometimes std::shared_ptr
  2. No naked new or delete

    • They belong in implementations and as arguments to resource handles
  3. Return objects “by-value” (using move rather than copy)

    • Don’t fiddle with pointers, references, or reference arguments for return values

niedziela, 19 maja 2013

Consistent Overhead Byte Stuffing (@Wiki)

Consistent Overhead Byte Stuffing

From Wikipedia, the free encyclopedia
Consistent Overhead Byte Stuffing (COBS) is an algorithm for encoding data bytes that results in efficient, reliable, unambiguous packet framing regardless of packet content, thus making it easy for receiving applications to recover from malformed packets.
Byte stuffing is a process that transforms a sequence of data bytes that may contain 'illegal' or 'reserved' values into a potentially longer sequence that contains no occurrences of those values. The extra length of the transformed sequence is typically referred to as the overhead of the algorithm. The COBS algorithm tightly bounds the worst case overhead, limiting it to no more than one byte in 254. The algorithm is computationally inexpensive and its average overhead is low compared to other unambiguous framing algorithms.[1]


Packet framing and stuffing

When packet data is sent over any serial medium, a protocol is needed by which to demarcate packet boundaries. This is done by using a special bit-sequence or character value to indicate where the boundaries between packets fall. Data stuffing is the process that transforms the packet data before transmission to eliminate any accidental occurrences of that special framing marker, so that when the receiver detects the marker, it knows, without any ambiguity, that it does indeed indicate a boundary between packets.
COBS takes an input consisting of bytes in the range [0,255] and produces an output consisting of bytes only in the range [1,255]. Having eliminated all zero bytes from the data, a zero byte can now be used unambiguously to mark boundaries between packets. This allows the receiver to synchronize reliably with the beginning of the next packet, even after an error. It also allows new listeners, which might join a broadcast stream at any time, to reliably detect the beginning of the first complete packet in the received byte stream.
With COBS, all packets up to 254 bytes in length are encoded with an overhead of exactly one byte. For packets over 254 bytes in length the overhead is at most one byte for every 254 bytes of packet data. The maximum overhead is therefore roughly 0.4% of the packet size, rounded up to a whole number of bytes. COBS encoding has low overhead (on average 0.23% of the packet size, rounded up to a whole number of bytes) and furthermore, for packets of any given length, the amount of overhead is virtually constant, regardless of the packet contents.

Zero Pair Elimination

An optimization that can reduce overhead for common payloads containing pairs of zero bytes is to reduce the maximum encodable sequence length, freeing some codes to encode sequences terminated by pairs of zeros. In this case, bytes in the range [1,223] have the same meaning as in the normal mode, the code 224 is used to encode a sequence of 223 bytes with no zero termination, and the remaining codes [225,255] encode sequences of length [1,30] terminated by a pair of zero bytes. This variation can achieve negative overhead (compression) for some sequences; however, it complicates the encoding and decoding process.

Packet format

COBS encodes the input data as a series of variable length blocks. Each block, which may contain from 1 to 255 bytes, begins with a single byte that specifies the number of bytes in the block (including the length byte).
When decoding, a zero byte is appended to the decoded output after each block. As a special case, no zero is added after a block which begins with 0xFF.
Example encodings (each block of the encoded output begins with its length byte):

   Plaintext                 Encoded with COBS
1. 0x00                      0x01 0x01
2. 0x11 0x22 0x00 0x33       0x03 0x11 0x22 0x02 0x33
3. 0x11 0x00 0x00 0x00       0x02 0x11 0x01 0x01 0x01
4. 0x01 0x02 ... 0xFF        0xFF 0x01 0x02 ... 0xFE 0x02 0xFF
There is one complication in this format: as case 2 above shows, an extra 0x00 appears at the end of the decoded output. The only way to encode a block that is not followed by an implicit zero is for it to have 254 bytes of contents, but the last block of a packet may be shorter than that. To solve this issue, a single trailing zero, if present, is removed by the decoder; if the real plaintext itself ends in a zero, the encoding adds an additional zero after it.

Implementation

/*
 * StuffData byte stuffs "length" bytes of
 * data at the location pointed to by "ptr",
 * writing the output to the location pointed
 * to by "dst".
 */
 
/*
 * FinishBlock completes the current block: it patches the
 * pending length byte, reserves a byte in the output for the
 * next block's length, and resets the block length counter.
 */
#define FinishBlock(X) (*code_ptr = (X), code_ptr = dst++, code = 0x01)
 
void StuffData(const unsigned char *ptr,
               unsigned long length, unsigned char *dst)
{
  const unsigned char *end = ptr + length;
  unsigned char *code_ptr = dst++;   /* where the current length byte will go */
  unsigned char code = 0x01;         /* block length so far, counting itself  */
 
  while (ptr < end)
  {
    if (*ptr == 0)
      FinishBlock(code);
    else
    {
      *dst++ = *ptr;
      code++;
      if (code == 0xFF)
        FinishBlock(code);
    }
    ptr++;
  }
 
  FinishBlock(code);
}
 
/*
 * UnStuffData decodes "length" bytes of
 * data at the location pointed to by "ptr",
 * writing the output to the location pointed
 * to by "dst". Note that the decoded output
 * ends with the extra trailing zero, which
 * the caller is expected to remove.
 */
 
void UnStuffData(const unsigned char *ptr,
                 unsigned long length, unsigned char *dst)
{
  const unsigned char *end = ptr + length;
  while (ptr < end)
  {
    int i, code = *ptr++;
    for (i=1; i<code; i++)
      *dst++ = *ptr++;
    if (code < 0xFF)
      *dst++ = 0;
  }
}

References

[1] Cheshire, Stuart; Baker, Mary. "Consistent Overhead Byte Stuffing". ACM. Retrieved November 23, 2010.