Tuesday, 27 May 2014

A Hacker’s Guide to Git (by Joseph)

http://wildlyinaccurate.com/a-hackers-guide-to-git

A Hacker’s Guide to Git

This post is a work in progress. Please feel free to contact me with any corrections, requests or suggestions.

Introduction

Git is currently the most widely used version control system in the world, mostly thanks to GitHub. By that measure, I’d argue that it’s also the most misunderstood version control system in the world.
This statement probably doesn’t ring true straight away because on the surface, Git is pretty simple. It’s really easy to pick up if you’ve come from another VCS like Subversion or Mercurial. It’s even relatively easy to pick up if you’ve never used a VCS before. Everybody understands adding, committing, pushing and pulling; but this is about as far as Git’s simplicity goes. Past this point, Git is shrouded by fear, uncertainty and doubt.
Once you start talking about branching, merging, rebasing, multiple remotes, remote-tracking branches, detached HEAD states… Git becomes less of an easily-understood tool and more of a feared deity. Anybody who talks about no-fast-forward merges is regarded with quiet superstition, and even veteran hackers would rather stay away from rebasing “just to be safe”.
I think a big part of this is due to many people coming to Git from a conceptually simpler VCS — probably Subversion — and trying to apply their past knowledge to Git. It’s easy to understand why people want to do this. Take Subversion, for example. Subversion is simple, right? It’s just files and folders. Commits are numbered sequentially. Even branching and tagging is simple — it’s just like taking a backup of a folder.
Basically, Subversion fits in nicely with our existing computing paradigms. Everybody understands files and folders. Everybody knows that revision #10 was the one after #9 and before #11. But these paradigms break down when you try to apply them to Git’s advanced features.
That’s why trying to understand Git in this way is wrong. Git doesn’t work like Subversion at all. Which is pretty confusing, right? You can add and remove files. You can commit your changes. You can generate diffs and patches which look just like Subversion’s. How can something which appears so similar really be so different?
Complex systems like Git become much easier to understand once you figure out how they really work. The goal of this post is to shed some light on how Git works under the hood. We’re going to take a look at some of Git’s core concepts including its basic object storage, how commits work, how branches and tags work, and we’ll look at the different kinds of merging in Git including the much-feared rebase. Hopefully at the end of it all, you’ll have a solid understanding of these concepts and will be able to use some of Git’s more advanced features with confidence.
It’s worth noting at this point that this guide is not intended to be a beginner’s introduction to Git. This guide was written for people who already use Git, but would like to better understand it by taking a peek under the hood, and learn a few neat tricks along the way. With that said, let’s begin.

Repositories

At the core of Git, like other VCS, is the repository. A Git repository is really just a simple key-value data store. This is where Git stores, among other things:
  • Blobs, which are the most basic data type in Git. Essentially, a blob is just a bunch of bytes; usually a binary representation of a file.
  • Tree objects, which are a bit like directories. Tree objects can contain pointers to blobs and other tree objects.
  • Commit objects, which point to a single tree object, and contain some metadata including the commit author and any parent commits.
  • Tag objects, which point to a single commit object, and contain some metadata.
  • References, which are pointers to a single object (usually a commit or tag object).
You don’t need to worry about all of this just yet; we’ll cover these things in more detail later.
The important thing to remember about a Git repository is that it exists entirely in a single .git directory in your project root. There is no central repository like in Subversion or CVS. This is what allows Git to be a distributed version control system — everybody has their own self-contained version of a repository.
You can initialize a Git repository anywhere with the git init command. Take a look inside the .git folder to get a glimpse of what a repository looks like.
$ git init
Initialized empty Git repository in /home/demo/demo-repository/.git/
$ ls -l .git
total 32
drwxrwxr-x 2 demo demo 4096 May 24 20:10 branches
-rw-rw-r-- 1 demo demo 92 May 24 20:10 config
-rw-rw-r-- 1 demo demo 73 May 24 20:10 description
-rw-rw-r-- 1 demo demo 23 May 24 20:10 HEAD
drwxrwxr-x 2 demo demo 4096 May 24 20:10 hooks
drwxrwxr-x 2 demo demo 4096 May 24 20:10 info
drwxrwxr-x 4 demo demo 4096 May 24 20:10 objects
drwxrwxr-x 4 demo demo 4096 May 24 20:10 refs
The important directories are .git/objects, where Git stores all of its objects; and .git/refs, where Git stores all of its references.
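To make this a little more concrete: every object is stored under .git/objects in a file named after its hash (the first two characters of the hash become a directory name), and git cat-file will tell you any object's type, or pretty-print its contents with -p. Using the commit and tree we'll create later in this article as an example, it looks something like this:
$ git cat-file -t d409ca7
commit
$ git cat-file -t 9d073fc
tree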
We’ll see how all of this fits together as we learn about the rest of Git. For now, let’s learn a little bit more about tree objects.

Tree Objects

A tree object in Git can be thought of as a directory. It contains a list of blobs (files) and other tree objects (sub-directories).
Imagine we had a simple repository, with a README file and a src/ directory containing a hello.c file.
README
src/
    hello.c
This would be represented by two tree objects: one for the root directory, and another for the src/ directory. Here’s what they would look like.
tree 4da454..
    blob 976165.. README
    tree 81fc8b.. src

tree 81fc8b..
    blob 1febef.. hello.c
If we draw the blobs (in green) as well as the tree objects (in blue), we end up with a diagram that looks a lot like our directory structure.
Git tree graph
Notice how given the root tree object, we can recurse through every tree object to figure out the state of the entire working tree. The root tree object, therefore, is essentially a snapshot of your repository at a given time. Usually when Git refers to “the tree”, it is referring to the root tree object.
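If you want to look at tree objects in a real repository, git ls-tree prints the entries of any tree. For the imaginary repository above it would produce something like this (hashes abbreviated to match the diagram):
$ git ls-tree HEAD
100644 blob 976165..	README
040000 tree 81fc8b..	src
$ git ls-tree HEAD:src
100644 blob 1febef..	hello.c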
Now let’s learn how you can track the history of your repository with commit objects.

Commits

A commit object is essentially a pointer that contains a few pieces of important metadata. The commit itself has a hash, which is built from a combination of the metadata that it contains:
  • The hash of the tree (the root tree object) at the time of the commit. As we learned in Tree Objects, this means that with a single commit, Git can build the entire working tree by recursing into the tree.
  • The hash of any parent commits. This is what gives a repository its history: every commit has a parent commit, all the way back to the very first commit.
  • The author’s name and email address, and the time that the changes were authored.
  • The committer’s name and email address, and the time that the commit was made.
  • The commit message.
Let’s see a commit object in action by creating a simple repository.
 $ git init
Initialized empty Git repository in /home/demo/simple-repository/.git/
 $ echo 'This is the readme.' > README
 $ git add README
 $ git commit -m "First commit"
[master (root-commit) d409ca7] First commit
 1 file changed, 1 insertion(+)
 create mode 100644 README
When you create a commit, Git will give you the hash of that commit. Using git show with the --format=raw flag, we can see this newly-created commit’s metadata.
$ git show --format=raw d409ca7

commit d409ca76bc919d9ca797f39ae724b7c65700fd27
tree 9d073fcdfaf07a39631ef94bcb3b8268bc2106b1
author Joseph Wynn <joseph@wildlyinaccurate.com> 1400976134 -0400
committer Joseph Wynn <joseph@wildlyinaccurate.com> 1400976134 -0400

    First commit

diff --git a/README b/README
new file mode 100644
index 0000000..9761654
--- /dev/null
+++ b/README
@@ -0,0 +1 @@
+This is the readme.
Notice how although we referenced the commit by the partial hash d409ca7, Git was able to figure out that we actually meant d409ca76bc919d9ca797f39ae724b7c65700fd27. This is because the hashes that Git assigns to objects are unique enough to be identified by the first few characters. You can see here that Git is able to find this commit with as few as four characters; with anything shorter, Git will tell you that the reference is ambiguous or unknown.
$ git show d409c
$ git show d409
$ git show d40
fatal: ambiguous argument 'd40': unknown revision or path not in the working tree.

References

In previous sections, we saw how objects in Git are identified by a hash. Since we want to manipulate objects quite often in Git, it’s important to know their hashes. You could run all your Git commands referencing each object’s hash, like git show d409ca7, but that would require you to remember the hash of every object you want to manipulate.
To save you from having to memorize these hashes, Git has references, or “refs”. A reference is simply a file stored somewhere in .git/refs, containing the hash of a commit object.
To carry on the example from Commits, let’s figure out the hash of “First commit” using references only.
$ git status
On branch master
nothing to commit, working directory clean
git status has told us that we are on branch master. As we will learn in a later section, branches are just references. We can see this by looking in .git/refs/heads.
$ ls -l .git/refs/heads/
total 4
-rw-rw-r-- 1 demo demo 41 May 24 20:02 master
We can easily see which commit master points to by reading the file.
$ cat .git/refs/heads/master
d409ca76bc919d9ca797f39ae724b7c65700fd27
Sure enough, master contains the hash of the “First commit” object.
Of course, it’s possible to simplify this process. Git can tell us which commit a reference is pointing to with the show and rev-parse commands.
$ git show --oneline master
d409ca7 First commit
$ git rev-parse master
d409ca76bc919d9ca797f39ae724b7c65700fd27
Git also has a special reference, HEAD. This is a “symbolic” reference which points to the tip of the current branch rather than an actual commit. If we inspect HEAD, we see that it simply points to refs/heads/master.
$ cat .git/HEAD
ref: refs/heads/master
It is actually possible for HEAD to point directly to a commit object. When this happens, Git will tell you that you are in a “detached HEAD state”. We’ll talk a bit more about this later, but really all this means is that you’re not currently on a branch.
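For example, checking out a commit directly (rather than a branch) detaches HEAD, and .git/HEAD then contains a raw commit hash instead of a ref (output abridged):
$ git checkout d409ca7
Note: checking out 'd409ca7'.
You are in 'detached HEAD' state. [...]
HEAD is now at d409ca7... First commit
$ cat .git/HEAD
d409ca76bc919d9ca797f39ae724b7c65700fd27
$ git checkout master
Switched to branch 'master'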

Branches

Git’s branches are often touted as being one of its strongest features. This is because branches in Git are very lightweight, compared to other VCS where a branch is usually a clone of the entire repository.
The reason branches are so lightweight in Git is because they’re just references. We saw in References that the master branch was simply a file inside .git/refs/heads. Let’s create another branch to see what happens under the hood.
$ git branch test-branch
$ cat .git/refs/heads/test-branch 
d409ca76bc919d9ca797f39ae724b7c65700fd27
It’s as simple as that. Git has created a new entry in .git/refs/heads and pointed it at the current commit.
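In fact, you could create a branch yourself with the update-ref plumbing command, which (more or less) just writes the ref file for you. The branch name here is purely for illustration:
$ git update-ref refs/heads/another-branch HEAD
$ cat .git/refs/heads/another-branch
d409ca76bc919d9ca797f39ae724b7c65700fd27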
We also saw in References that HEAD is Git’s reference to the current branch. Let’s see that in action by switching to our newly-created branch.
$ cat .git/HEAD
ref: refs/heads/master
$ git checkout test-branch 
Switched to branch 'test-branch'
$ cat .git/HEAD
ref: refs/heads/test-branch
When you create a new commit, Git simply changes the current branch to point to the newly-created commit object.
$ echo 'Some more information here.' >> README
$ git add README
$ git commit -m "Update README in a new branch"
[test-branch 7604067] Update README in a new branch
 1 file changed, 1 insertion(+)
$ cat .git/refs/heads/test-branch 
76040677d717fd090e327681064ac6af9f0083fb
Later on we’ll look at the difference between local branches and remote-tracking branches.

Tags

There are two types of tags in Git – lightweight tags and annotated tags.
On the surface, these two types of tags look very similar. Both of them are references stored in .git/refs/tags. However, that’s about as far as the similarities go. Let’s create a lightweight tag to see how they work.
$ git tag 1.0-lightweight
$ cat .git/refs/tags/1.0-lightweight 
d409ca76bc919d9ca797f39ae724b7c65700fd27
We can see that Git has created a tag reference which points to the current commit. By default, git tag will create a lightweight tag. Note that this is not a tag object. We can verify this by using git cat-file to inspect the tag.
$ git cat-file -p 1.0-lightweight
tree 9d073fcdfaf07a39631ef94bcb3b8268bc2106b1
author Joseph Wynn <joseph@wildlyinaccurate.com> 1400976134 -0400
committer Joseph Wynn <joseph@wildlyinaccurate.com> 1400976134 -0400

First commit
$ git cat-file -p d409ca7
tree 9d073fcdfaf07a39631ef94bcb3b8268bc2106b1
author Joseph Wynn <joseph@wildlyinaccurate.com> 1400976134 -0400
committer Joseph Wynn <joseph@wildlyinaccurate.com> 1400976134 -0400

First commit
You can see that as far as Git is concerned, the 1.0-lightweight tag and the d409ca7 commit are the same object. That’s because the lightweight tag is only a reference to the commit object.
Let’s compare this to an annotated tag.
$ git tag -a -m "Tagged 1.0" 1.0
$ cat .git/refs/tags/1.0
10589beae63c6e111e99a0cd631c28479e2d11bf
We’ve passed the -a (--annotate) flag to git tag to create an annotated tag. Notice how Git creates a reference for the tag just like the lightweight tag, but this reference is not pointing to the same object as the lightweight tag. Let’s use git cat-file again to inspect the object.
$ git cat-file -p 1.0
object d409ca76bc919d9ca797f39ae724b7c65700fd27
type commit
tag 1.0
tagger Joseph Wynn <joseph@wildlyinaccurate.com> 1401029229 -0400

Tagged 1.0
This is a tag object, separate to the commit that it points to. As well as containing a pointer to a commit, tag objects also store a tag message and information about the tagger. Tag objects can also be signed with a GPG key to prevent commit or email spoofing.
Aside from being GPG-signable, there are a few reasons why annotated tags are preferred over lightweight tags.
Probably the most important reason is that annotated tags have their own author information. This can be helpful when you want to know who created the tag, rather than who created the commit that the tag is referring to.
Annotated tags are also timestamped. Since new versions are usually tagged right before they are released, an annotated tag can tell you when a version was released rather than just when the final commit was made.
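A quick way to tell the two kinds of tag apart is to ask Git what type of object each tag reference resolves to:
$ git cat-file -t 1.0-lightweight
commit
$ git cat-file -t 1.0
tag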

Merging

Merging in Git is the process of joining two histories (usually branches) together. Let’s start with a simple example. Say you’ve created a new feature branch from master, and done some work on it.
$ git checkout -b feature-branch
Switched to a new branch 'feature-branch'
$ vim feature.html
$ git commit -am "Finished the new feature"
[feature-branch 0c21359] Finished the new feature
 1 file changed, 1 insertion(+)
At the same time, you need to fix an urgent bug. So you create a hotfix branch from master, and do some work in there.
$ git checkout master
Switched to branch 'master'
$ git checkout -b hotfix
Switched to a new branch 'hotfix'
$ vim index.html
$ git commit -am "Fixed some wording"
[hotfix 40837f1] Fixed some wording
 1 file changed, 1 insertion(+), 1 deletion(-)
At this point, the history will look something like this.
Branching -- hotfix and feature branch
Now you want to bring the bug fix into master so that you can tag it and release it.
$ git checkout master
Switched to branch 'master'
$ git merge hotfix
Updating d939a3a..40837f1
Fast-forward
 index.html | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
Notice how Git mentions fast-forward during the merge. What this means is that all of the commits in hotfix were directly upstream from master. This allows Git to simply move the master pointer up the tree to hotfix. What you end up with looks like this.
Branching -- after merging hotfix
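If you want to be explicit about this behaviour, Git lets you control it when merging: --ff-only refuses to merge unless a fast-forward is possible, while --no-ff always creates a merge commit, even when a fast-forward would have been possible.
$ git merge --ff-only hotfix    # merge only if master can simply be fast-forwarded
$ git merge --no-ff hotfix      # force a merge commit, even for this history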
Now let’s try and merge feature-branch into master.
$ git merge feature-branch 
Merge made by the 'recursive' strategy.
 feature.html | 1 +
 1 file changed, 1 insertion(+)
This time, Git wasn’t able to perform a fast-forward. This is because feature-branch isn’t directly upstream from master. This is clear on the graph above, where master is at commit D which is in a different history tree to feature-branch at commit C.
So how did Git handle this merge? Taking a look at the log, we see that Git has actually created a new “merge” commit, as well as bringing the commit from feature-branch.
$ git log --oneline
8ad0923 Merge branch 'feature-branch'
0c21359 Finished the new feature
40837f1 Fixed some wording
d939a3a Initial commit
Upon closer inspection, we can see that this is a special kind of commit object — it has two parent commits. This is referred to as a merge commit.
$ git show --format=raw 8ad0923

commit 8ad09238b0dff99e8a99c84d68161ebeebbfc714
tree e5ee97c8f9a4173f07aa4c46cb7f26b7a9ff7a17
parent 40837f14b8122ac6b37c0919743b1fd429b3bbab
parent 0c21359730915c7888c6144aa8e9063345330f1f
author Joseph Wynn <joseph@wildlyinaccurate.com> 1401134489 +0100
committer Joseph Wynn <joseph@wildlyinaccurate.com> 1401134489 +0100

 Merge branch 'feature-branch'
This means that our history graph now looks something like this (commit E is the new merge commit).
Branching -- after merging feature-branch
Some people believe that this sort of history graph is undesirable. In the Rebasing (Continued) section, we’ll learn how to prevent non-fast-forward merges by rebasing feature branches before merging them with master.
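You can see this kind of graph for yourself with git log --graph. For our example repository it would look something like this:
$ git log --graph --oneline
*   8ad0923 Merge branch 'feature-branch'
|\
| * 0c21359 Finished the new feature
* | 40837f1 Fixed some wording
|/
* d939a3a Initial commit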

Rebasing

Rebasing is without a doubt one of Git’s most misunderstood features. For most people, git rebase is a command that should be avoided at all costs. This is probably due to the extraordinary amount of scaremongering around rebasing. “Rebase Considered Harmful”, and “Please, stay away from rebase” are just two of the many anti-rebase articles you will find in the vast archives of the Internet.
But rebase isn’t scary, or dangerous, so long as you understand what it does. But before we get into rebasing, I’m going to take a quick digression, because it’s actually much easier to explain rebasing in the context of cherry-picking.

Cherry-Picking

What git cherry-pick does is take one or more commits, and replay them on top of the current commit. Imagine a repository with the following history graph.
Node graph -- before cherry-pick
If you are on commit D and you run git cherry-pick F, Git will take the changes that were introduced in commit F and replay them as a new commit (shown as F’) on top of commit D.
Node graph -- after cherry-pick
The reason you end up with a copy of commit F rather than commit F itself is due to the way commits are constructed. Recall that the parent commit is part of a commit’s hash. So despite containing the exact same changes, author information and timestamp, F’ will have a different parent to F, giving it a different hash.
A common workflow in Git is to develop features on small branches, and merge the features one at a time into the master branch. Let’s recreate this scenario by adding some branch labels to the graphs.
Node graph -- with branch labels
As you can see, master has been updated since foo was created. To avoid potential conflicts when foo is merged with master, we want to bring master‘s changes into foo. Because master is the base branch, we want to play foo‘s commits on top of master. Essentially, we want to change commit C‘s parent from B to F.
It’s not going to be easy, but we can achieve this with git cherry-pick. First, we need to create a temporary branch at commit F.
$ git checkout master
$ git checkout -b foo-tmp
Node graph -- after creating foo-tmp
Now that we have a base on commit F, we can cherry-pick all of foo‘s commits on top of it.
$ git cherry-pick C D
Node graph -- after cherry-picking C and D
Now all that’s left to do is point foo at commit D’, and delete the temporary branch foo-tmp. We do this with the reset command, which points HEAD (and therefore the current branch) at a specified commit. The --hard flag ensures our working tree is updated as well.
$ git checkout foo
$ git reset --hard foo-tmp
$ git branch -D foo-tmp
This gives the desired result of foo‘s commits being upstream of master. Note that the original C and D commits are no longer reachable because no branch points to them.
Node graph -- after resetting foo

Rebasing (Continued)

While the example in Cherry-Picking worked, it’s not practical. In Git, rebasing allows us to replace our verbose cherry-pick workflow…
$ git checkout master
$ git checkout -b foo-tmp
$ git cherry-pick C D
$ git checkout foo
$ git reset --hard foo-tmp
$ git branch -D foo-tmp
…With a single command.
$ git rebase master foo
With the format git rebase <base> <target>, the rebase command will take all of the commits from <target> and play them on top of <base> one by one. It does this without actually modifying <base>, so the end result is a linear history in which <base> can be fast-forwarded to <target>.
In a sense, performing a rebase is like telling Git, “Hey, I want to pretend that <target> was actually branched from <base>. Take all of the commits from <target>, and pretend that they happened after <base>.”
Let’s take a look again at the example graph from Merging to see how rebasing can prevent us from having to do a non-fast-forward merge.
Branching -- after merging hotfix
All we have to do to enable a fast-forward merge of feature-branch into master is run git rebase master feature-branch before performing the merge.
$ git rebase master feature-branch
First, rewinding head to replay your work on top of it...
Applying: Finished the new feature
This has brought feature-branch directly upstream of master.
Rebasing -- rebase feature-branch with master
Now all that’s left to do is let Git perform the merge.
$ git checkout master
$ git merge feature-branch
Updating 40837f1..2a534dd
Fast-forward
 feature.html | 1 +
 1 file changed, 1 insertion(+)

Remotes

// TODO

Pushing

// TODO

Fetching

// TODO

Pulling

// TODO

Toolkit

With a solid understanding of Git’s inner workings, some of the more advanced Git tools start to make more sense.

git-reflog

Whenever you make a change in Git that affects the tip of a branch, Git records information about that change in what’s called the reflog. Usually you shouldn’t need to look at these logs, but sometimes they can come in very handy.
Let’s say you have a repository with a few commits.
$ git log --oneline
d6f2a84 Add empty LICENSE file
51c4b49 Add some actual content to readme
3413f46 Add TODO note to readme
322c826 Add empty readme
You decide, for some reason, to perform a destructive action on your master branch.
$ git reset --hard 3413f46
HEAD is now at 3413f46 Add TODO note to readme
Since performing this action, you’ve realised that you lost some commits and you have no idea what their hashes were. You never pushed the changes; they were only in your local repository. git log is no help, since the commits are no longer reachable from HEAD.
$ git log --oneline
3413f46 Add TODO note to readme
322c826 Add empty readme
This is where git reflog can be useful.
$ git reflog
3413f46 HEAD@{0}: reset: moving to 3413f46
d6f2a84 HEAD@{1}: commit: Add empty LICENSE file
51c4b49 HEAD@{2}: commit: Add some actual content to readme
3413f46 HEAD@{3}: commit: Add TODO note to readme
322c826 HEAD@{4}: commit (initial): Add empty readme
The reflog shows a list of all changes to HEAD in reverse chronological order. The hash in the first column is the value of HEAD after the change was made. We can see, therefore, that we were at commit d6f2a84 before the destructive change.
How you want to recover commits depends on the situation. In this particular example, we can simply do a git reset --hard d6f2a84 to restore HEAD to its original position. However if we have introduced new commits since the destructive change, we may need to do something like cherry-pick all the commits that were lost.
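Note that reflog entries can be used anywhere Git expects a revision, so the recovery can also be expressed in terms of the reflog itself rather than a raw hash:
$ git reset --hard HEAD@{1}
HEAD is now at d6f2a84 Add empty LICENSE file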
Note that Git’s reflog is only a record of changes for your local repository. If your local repository becomes corrupt or is deleted, the reflog won’t be of any use (if the repository is deleted, the reflog won’t exist at all!).
Depending on the situation, you may find git fsck more suitable for recovering lost commits.

git-fsck

In a way, Git’s object storage works like a primitive file system — objects are like files on a hard drive, and their hashes are the objects’ physical address on the disk. The Git index is exactly like the index of a file system, in that it contains references which point at an object’s physical location.
By this analogy, git fsck is aptly named after fsck (“file system check”). This tool is able to check Git’s database and verify the validity and reachability of every object that it finds.
When a reference (like a branch) is deleted from Git’s index, the objects it refers to usually aren’t deleted, even if they are no longer reachable by any other references. Using a simple example, we can see this in practice.
$ git checkout -b foobar
Switched to a new branch 'foobar'
$ echo 'foobar' > foo.txt 
$ git commit -am "Update foo.txt with foobar"
[foobar bcbaac7] Update foo.txt with foobar
 1 file changed, 1 insertion(+), 1 deletion(-)
$ git checkout master
Switched to branch 'master'
$ git branch -D foobar
Deleted branch foobar (was bcbaac7).
At this point, commit bcbaac7 still exists in our repository, but there are no references pointing to it. By searching through the database, git fsck is able to find it.
$ git fsck --lost-found
Checking object directories: 100% (256/256), done.
dangling commit bcbaac709e0b8abbd3f1f322990d204907be5841
For simple cases, git reflog may be preferred. Where git fsck excels over git reflog, though, is when you need to find objects which you never referenced in your local repository (and therefore would not be in your reflog). An example of this is when you delete a remote branch through an interface like GitHub. Assuming the objects haven’t been garbage-collected, you can clone the remote repository and use git fsck to recover the deleted branch.
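For instance, once git fsck has given you the hash of a dangling commit, one simple way to keep it around is to point a new branch at it (the branch name below is just an example):
$ git branch recovered-foobar bcbaac7
$ git log --oneline -1 recovered-foobar
bcbaac7 Update foo.txt with foobar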

git-stash

// TODO

git-describe

Git’s describe command is summed up pretty neatly in the documentation:
git-describe – Show the most recent tag that is reachable from a commit
This can be helpful for things like build and release scripts, as well as figuring out which version a change was introduced in.
git describe will take any reference or commit hash, and return the name of the most recent tag. If the tag points at the commit you gave it, git describe will return only the tag name. Otherwise, it will suffix the tag name with some information including the number of commits since the tag and an abbreviation of the commit hash.
$ git describe v1.2.15
v1.2.15
$ git describe 2db66f
v1.2.15-80-g2db66f5
If you want to ensure that only the tag name is returned, you can force Git to remove the suffix by passing --abbrev=0.
$ git describe --abbrev=0 2db66f
v1.2.15
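A couple of other flags are worth knowing about. By default git describe only considers annotated tags, so pass --tags if lightweight tags should be matched as well; --dirty appends a marker when the working tree contains uncommitted changes. The output below is illustrative:
$ git describe --tags --dirty
v1.2.15-80-g2db66f5-dirty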

git-rev-parse

git rev-parse is an ancillary plumbing command which takes a wide range of inputs and returns one or more commit hashes. The most common use case is figuring out which commit a tag or branch points to.
$ git rev-parse v1.2.15        
2a46f5e2fbe83ccb47a1cd42b81f815f2f36ee9d
$ git rev-parse --short v1.2.15        
2a46f5e
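rev-parse is also handy in scripts for resolving things other than object hashes, such as the current branch name or the repository’s top-level directory (the output shown is illustrative):
$ git rev-parse --abbrev-ref HEAD
master
$ git rev-parse --show-toplevel
/home/demo/demo-repository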

git-bisect

git bisect is an indispensable tool when you need to figure out which commit introduced a breaking change. The bisect command does a binary search through your commit history to help you find the breaking change as quickly as possible. To get started, simply run git bisect start. Then you need to give Git a couple of important hints: first, tell Git that the commit you’re currently on is broken with git bisect bad; then, give it a commit that you know is working with git bisect good <commit>.
$ git bisect start
$ git bisect bad
$ git bisect good v1.2.15
Bisecting: 41 revisions left to test after this (roughly 5 steps)
[b87713687ecaa7a873eeb3b83952ebf95afdd853] docs(misc/index): add header; general links
Git will check out a commit and ask you to test whether it’s broken or not. If the commit is broken, run git bisect bad; if it’s fine, run git bisect good. After a few goes of this, Git will be able to pinpoint the commit at which the breaking change was first introduced.
$ git bisect bad
e145a8df72f309d5fb80eaa6469a6148b532c821 is the first bad commit
Once the bisect is finished (or when you want to abort it), be sure to run git bisect reset to reset HEAD to where it was before the bisect.
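If you have a script that can detect the breakage automatically, you can let Git drive the whole bisect with git bisect run: the script should exit with 0 when the checked-out commit is good and with a non-zero status (other than 125, which means “skip this commit”) when it is bad. The script name here is hypothetical:
$ git bisect start HEAD v1.2.15
$ git bisect run ./test-for-breakage.sh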



Saturday, 15 March 2014

When to use volatile in C++?


TL;DR: Use volatile only for reading hardware registers; never use it for shared data between threads. For shared data, use locks.


Chandon:
Volatile is intended for poking hardware registers. Using it for threading is almost certainly going to go horribly wrong, because it provides a nearly-but-not-quite useful set of guarantees for that application.
It could do all kinds of things, including fun stuff like requiring that every read or write skip all levels of cache and go out to physical RAM. Needless to say, that's significantly slower than a lock would be.
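A minimal C++ sketch of the advice above (the register address and all names are purely illustrative): shared data gets a mutex (or std::atomic), and volatile is reserved for memory-mapped hardware registers.

#include <cstdint>
#include <mutex>

// Shared data between threads: protect it with a lock (or make it std::atomic).
// volatile gives none of the atomicity or ordering guarantees needed here.
int counter = 0;
std::mutex counter_mutex;

void increment()
{
    std::lock_guard<std::mutex> lock(counter_mutex);
    ++counter;
}

// Hardware registers are what volatile is actually for: every read must really
// reach the memory-mapped register. The address below is purely illustrative.
volatile std::uint32_t* const status_reg =
    reinterpret_cast<volatile std::uint32_t*>(0x40000000);

bool device_ready()
{
    return (*status_reg & 0x1u) != 0;
}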

References:
http://www.reddit.com/r/programming/comments/208x6o/can_i_skip_the_lock_when_reading_an_integer/cg1776e

Sunday, 9 February 2014

Smart Pointer Parameters (by Herb Sutter)

Original article: http://herbsutter.com/2013/06/05/gotw-91-solution-smart-pointer-parameters/

Guidelines:

  • Don’t pass a smart pointer as a function parameter unless you want to use or manipulate the smart pointer itself, such as to share or transfer ownership.
  • Prefer passing objects by value, *, or &, not by smart pointer.
  • Express a “sink” function using a by-value unique_ptr parameter. 
  • Don’t use a const unique_ptr& as a parameter; use widget* instead.
  • Express that a function will store and share ownership of a heap object using a by-value shared_ptr parameter.
  • Use a non-const shared_ptr& parameter only to modify the shared_ptr. Use a const shared_ptr& as a parameter only if you’re not sure whether or not you’ll take a copy and share ownership; otherwise use widget* instead (or if not nullable, a widget&).

GotW #91 Solution: Smart Pointer Parameters

NOTE: Last year, I posted three new GotWs numbered #103-105. I decided leaving a gap in the numbers wasn’t best after all, so I am renumbering them to #89-91 to continue the sequence. Here is the updated version of what was GotW #105.
How should you prefer to pass smart pointers, and why?

Problem

JG Question

1. What are the performance implications of the following function declaration? Explain.
void f( shared_ptr<widget> );

Guru Questions

2. What are the correctness implications of the function declaration in #1? Explain with clear examples.
3. A colleague is writing a function f that takes an existing object of type widget as a required input-only parameter, and trying to decide among the following basic ways to take the parameter (omitting const):
void f( widget* );              (a)
void f( widget& );              (b)
void f( unique_ptr<widget> );   (c)
void f( unique_ptr<widget>& );  (d)
void f( shared_ptr<widget> );   (e)
void f( shared_ptr<widget>& );  (f)
Under what circumstances is each appropriate? Explain your answer, including where const should or should not be added anywhere in the parameter type.
(There are other ways to pass the parameter, but we will consider only the ones shown above.)

Solution

1. What are the performance implications of the following function declaration? Explain.

void f( shared_ptr<widget> );
A shared_ptr stores strong and weak reference counts (see GotW #89). When you pass by value, you have to copy the argument (usually) on entry to the function, and then destroy it (always) on function exit. Let’s dig into what this means.
When you enter the function, the shared_ptr is copy-constructed, and this requires incrementing the strong reference count. (Yes, if the caller passes a temporary shared_ptr, you move-construct and so don’t have to update the count. But: (a) it’s quite rare to get a temporary shared_ptr in normal code, other than taking one function’s return value and immediately passing that to a second function; and (b) besides as we’ll see most of the expense is on the destruction of the parameter anyway.)
When exiting the function, the shared_ptr is destroyed, and this requires decrementing its internal reference count.
What’s so bad about a “shared reference count increment and decrement?” Two things, one related to the “shared reference count” and one related to the “increment and decrement.” It’s good to be aware of how this can incur performance costs for two reasons: one major and common, and one less likely in well-designed code and so probably more minor.
First, the major reason is the performance cost of the “increment and decrement”: Because the reference count is an atomic shared variable (or equivalent), incrementing and decrementing it are internally-synchronized read-modify-write shared memory operations.
Second, the less-likely minor reason is the potentially scalability-bustingly contentious nature of the “shared reference count”: Both increment and decrement update the reference count, which means that at the processor and memory level only one core at a time can be executing such an instruction on the same reference count because it needs exclusive access to the count’s cache line. The net result is that this causes some contention on the count’s cache line, which can affect scalability if it’s a popular cache line being touched by multiple threads in tight loops—such as if two threads are calling functions like this one in tight loops and accessing shared_ptrs that own the same object. “So don’t do that, thou heretic caller!” we might righteously say. Well and good, but the caller doesn’t always know when two shared_ptrs used on two different threads refer to the same object, so let’s not be quick to pile the wood around his stake just yet.
As we will see, an essential best practice for any reference-counted smart pointer type is to avoid copying it unless you really mean to add a new reference. This cannot be stressed enough. This directly addresses both of these costs and pushes their performance impact down into the noise for most applications, and especially eliminates the second cost because it is an antipattern to add and remove references in tight loops.
At this point, we will be tempted to solve the problem by passing the shared_ptr by reference. But is that really the right thing to do? It depends.

2. What are the correctness implications of the function declaration in #1?

The only correctness implication is that the function advertises in a clear type-enforced way that it will (or could) retain a copy of the shared_ptr.
That this is the only correctness implication might surprise some people, because there would seem to be one other major correctness benefit to taking a copy of the argument, namely lifetime: Assuming the pointer is not already null, taking a copy of the shared_ptr guarantees that the function itself holds a strong refcount on the owned object, and that therefore the object will remain alive for the duration of the function body, or until the function itself chooses to modify its parameter.
However, we already get this for free—thanks to structured lifetimes, the called function’s lifetime is a strict subset of the calling function’s call expression. Even if we passed the shared_ptr by reference, our function would as good as hold a strong refcount because the caller already has one—he passed us the shared_ptr in the first place, and won’t release it until we return. (Note this assumes the pointer is not aliased. You have to be careful if the smart pointer parameter could be aliased, but in this respect it’s no different than any other aliased object.)
Guideline: Don’t pass a smart pointer as a function parameter unless you want to use or manipulate the smart pointer itself, such as to share or transfer ownership.
Guideline: Prefer passing objects by value, *, or &, not by smart pointer.
If you’re saying, “hey, aren’t raw pointers evil?”, that’s excellent, because we’ll address that next.

3. A colleague is writing a function f that takes an existing object of type widget as a required input-only parameter, and trying to decide among the following basic ways to take the parameter (omitting const). Under what circumstances is each appropriate? Explain your answer, including where const should or should not be added anywhere in the parameter type.

(a) and (b): Prefer passing parameters by * or &.

void f( widget* );              (a)
void f( widget& );              (b)
These are the preferred way to pass normal object parameters, because they stay agnostic of whatever lifetime policy the caller happens to be using.
Non-owning raw * pointers and & references are okay to observe an object whose lifetime we know exceeds that of the pointer or reference, which is usually true for function parameters. Thanks to structured lifetimes, by default arguments passed to f in the caller outlive f‘s function call lifetime, which is extremely useful (not to mention efficient) and makes non-owning * and & appropriate for parameters.
Pass by * or & to accept a widget independently of how the caller is managing its lifetime. Most of the time, we don’t want to commit to a lifetime policy in the parameter type, such as requiring the object be held by a specific smart pointer, because this is usually needlessly restrictive. As usual, use a * if you need to express null (no widget), otherwise prefer to use a &; and if the object is input-only, write const widget* or const widget&.
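As a small illustration (widget and the calling code are hypothetical, not from the original article), the same function can be handed a widget regardless of whether the caller keeps it on the stack, in a unique_ptr, or in a shared_ptr, because the parameter type says nothing about ownership:

#include <memory>

struct widget { };                          // hypothetical type from the question

void use( const widget& ) { }               // input-only, never null
void maybe_use( const widget* w ) {         // input-only, may be null
    if( w != nullptr ) use( *w );
}

void caller() {
    widget                  on_stack;
    std::unique_ptr<widget> owned( new widget );
    auto                    shared = std::make_shared<widget>();

    use( on_stack );        // works no matter how the caller happens
    use( *owned );          // to manage the widget's lifetime
    use( *shared );
    maybe_use( nullptr );   // * lets the caller say "no widget"
}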

(c) Passing unique_ptr by value means “sink.”

void f( unique_ptr<widget> );   (c)
This is the preferred way to express a widget-consuming function, also known as a “sink.”
Passing a unique_ptr by value is only possible by moving the object and its unique ownership from the caller to the callee. Any function like (c) takes ownership of the object away from the caller, and either destroys it or moves it onward to somewhere else.
Note that, unlike some of the other options below, this use of a by-value unique_ptr parameter actually doesn’t limit the kind of object that can be passed to those managed by a unique_ptr. Why not? Because any pointer can be explicitly converted to a unique_ptr. If we didn’t use a unique_ptr here we would still have to express “sink” semantics, just in a more brittle way such as by accepting a raw owning pointer (anathema!) and documenting the semantics in comments. Using (c) is vastly superior because it documents the semantics in code, and requires the caller to explicitly move ownership.
Consider the major alternative:
// Smelly 20th-century alternative
void bad_sink( widget* p );  // will destroy p; PLEASE READ THIS COMMENT

// Sweet self-documenting self-enforcing modern version (c)
void good_sink( unique_ptr<widget> p );
And how much better (c) is:
// Older calling code that calls the new good_sink is safer, because
// it's clearer in the calling code that ownership transfer is going on
// (this older code has an owning * which we shouldn't do in new code)
//
widget* pw = ... ; 

bad_sink ( pw );             // compiles: remember not to use pw again!

good_sink( pw );             // error: good
good_sink( unique_ptr<widget>{pw} );  // need explicit conversion: good

// Modern calling code that calls good_sink is safer, and cleaner too
//
unique_ptr<widget> pw = ... ;

bad_sink ( pw.get() );       // compiles: icky! doesn't reset pw
bad_sink ( pw.release() );   // compiles: must remember to use this way

good_sink( pw );             // error: good!
good_sink( move(pw) );       // compiles: crystal clear what's going on
Guideline: Express a “sink” function using a by-value unique_ptr parameter.
Because the callee will now own the object, usually there should be no const on the parameter because the const should be irrelevant.

(d) Passing unique_ptr by reference is for in/out unique_ptr parameters.

void f( unique_ptr<widget>& );  (d)
This should only be used to accept an in/out unique_ptr, when the function is supposed to actually accept an existing unique_ptr and potentially modify it to refer to a different object. It is a bad way to just accept a widget, because it is restricted to a particular lifetime strategy in the caller.
Guideline: Use a non-const unique_ptr& parameter only to modify the unique_ptr.
Passing a const unique_ptr<widget>& is strange because it can accept only either null or a widget whose lifetime happens to be managed in the calling code via a unique_ptr, and the callee generally shouldn’t care about the caller’s lifetime management choice. Passing widget* covers a strict superset of these cases and can accept “null or a widget” regardless of the lifetime policy the caller happens to be using.
Guideline: Don’t use a const unique_ptr& as a parameter; use widget* instead.
I mention widget* because that doesn’t change the (nullable) semantics; if you’re tempted to pass a const unique_ptr<widget>&, what you really meant was widget*, which expresses the same information. If you additionally know it can’t be null, though, of course use widget&.

(e) Passing shared_ptr by value implies taking shared ownership.

void f( shared_ptr<widget> );   (e)
As we saw in #2, this is recommended only when the function wants to retain a copy of the shared_ptr and share ownership. In that case, a copy is needed anyway so the copying cost is fine. If the local scope is not the final destination, just std::move the shared_ptr onward to wherever it needs to go.
Guideline: Express that a function will store and share ownership of a heap object using a by-value shared_ptr parameter.
Otherwise, prefer passing a * or & (possibly to const) instead, since that doesn’t restrict the function to only objects that happen to be owned by shared_ptrs.
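For example, a function that keeps the widget around for later use might look something like this (the surrounding class is hypothetical, not from the original article):

#include <memory>
#include <utility>

struct widget { };

class widget_cache {
    std::shared_ptr<widget> last_used_;
public:
    // The by-value parameter advertises that this function shares ownership;
    // moving it into place avoids paying for a second reference-count bump.
    void remember( std::shared_ptr<widget> w ) {
        last_used_ = std::move( w );
    }
};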

(f) Passing shared_ptr& is useful for in/out shared_ptr manipulation.

void f( shared_ptr<widget>& );  (f)
Similarly to (d), this should mainly be used to accept an in/out shared_ptr, when the function is supposed to actually modify the shared_ptr itself. It’s usually a bad way to accept a widget, because it is restricted to a particular lifetime strategy in the caller.
Note that per (e) we pass a shared_ptr by value if the function will share ownership. In the special case where the function might share ownership, but doesn’t necessarily take a copy of its parameter on a given call, then pass a const shared_ptr& to avoid the copy on the calls that don’t need it, and take a copy of the parameter if and when needed.
Guideline: Use a non-const shared_ptr& parameter only to modify the shared_ptr. Use a const shared_ptr& as a parameter only if you’re not sure whether or not you’ll take a copy and share ownership; otherwise use widget* instead (or if not nullable, a widget&).

Acknowledgments

Thanks in particular to the following for their feedback to improve this article: mttpd, zahirtezcan, Jon, GregM, Andrei Alexandrescu.

Sunday, 29 December 2013

UTF-8 Everywhere (manifesto by Pavel Radzivilovsky, Yakov Galka and Slava Novgorodov)

UTF-8 Everywhere

Manifesto

Purpose of this document

To promote usage and support of the UTF-8 encoding, and to convince readers that this should be the default choice of encoding for storing text strings in memory or on disk, for communication and all other uses. We believe that all other encodings of Unicode (or text, in general) belong to rare edge-cases of optimization and should be avoided by mainstream users.
In particular, we believe that the very popular UTF-16 encoding (mistakenly used as a synonym to ‘widechar’ and ‘Unicode’ in the Windows world) has no place in library APIs (except for specialized libraries, which deal with text).
This document recommends choosing UTF-8 as string storage in Windows applications, despite the fact that this standard is less popular there, due to historical reasons and the lack of native UTF-8 support by the API. Yet, we believe that, even on this platform, the following arguments outweigh the lack of native support. Also, we recommend forgetting forever what ‘ANSI codepages’ are and what they were used for. It is in the customer’s bill of rights to mix any number of languages in any text string.
We recommend avoiding C++ application code that depends on _UNICODE define. This includes TCHAR/LPTSTR types on Windows and APIs defined as macros, such as CreateWindow. We also recommend alternative ways to reach the goals of these APIs.
We also believe that, if an application is not supposed to specialize in text, the infrastructure must make it possible for the program to be unaware of encoding issues. For instance, a file copy utility should not be written differently to support non-English file names. Joel’s great article on Unicode explains the encodings well for the beginners, but it lacks the most important part: how a programmer should proceed, if she does not care what is inside the string.

Background

In 1988, Joseph D. Becker published the first Unicode draft proposal. At the basis of his design was the naïve assumption that 16 bits per character would suffice. In 1991, the first version of the Unicode standard was published, with code points limited to 16 bits. In the following years many systems have added support for Unicode and switched to the UCS-2 encoding. It was especially attractive for new technologies, like Qt framework (1992), Windows NT 3.1 (1993) and Java (1995).
However, it was soon discovered that 16 bits per character will not do for Unicode. In 1996, the UTF-16 encoding was created so existing systems would be able to work with non-16-bit characters. This effectively nullified the rationale behind choosing 16-bit encoding in the first place, namely being a fixed-width encoding. Currently Unicode spans over 109449 characters, about 74500 of them being CJK ideographs.
A little child playing an encodings game in front of a large poster about encodings.
Nagoya City Science Museum. Photo by Vadim Zlotnik.
Microsoft has, ever since, mistakenly used ‘Unicode’ and ‘widechar’ as synonyms for ‘UCS-2’ and ‘UTF-16’. Furthermore, since UTF-8 cannot be set as the encoding for narrow string WinAPI, one must compile her code with _UNICODE rather than _MBCS. Windows C++ programmers are educated that Unicode must be done with ‘widechars’. As a result of this mess, they are now among the most confused ones about what is the right thing to do about text.
At the same time, in the Linux and the Web worlds, there is a silent agreement that UTF-8 is the most correct encoding for Unicode on the planet Earth. Even though it gives a strong preference to English and therefore to computer languages (such as C++, HTML, XML, etc) over any other text, it is seldom less efficient than UTF-16 for commonly used character sets.

The Facts

  • In both UTF-8 and UTF-16 encodings, code points may take up to 4 bytes (contrary to what Joel says).
  • UTF-8 is endianness independent. UTF-16 comes in two flavors: UTF-16LE and UTF-16BE (for different byte orders, respectively). Here we name them collectively as UTF-16.
  • Widechar is 2 bytes in size on some platforms, 4 on others.
  • UTF-8 and UTF-32 produce the same order when sorted lexicographically. UTF-16 does not.
  • UTF-8 favors efficiency for English letters and other ASCII characters (one byte per character) while UTF-16 favors several Asian character sets (2 bytes instead of 3 in UTF-8). This is what made UTF-8 the favorite choice in the Web world, where English HTML/XML tags are intermixed with any-language text. Cyrillic, Hebrew and several other popular Unicode blocks are 2 bytes both in UTF-16 and UTF-8.
  • In the Linux world, narrow strings are considered UTF-8 by default almost everywhere. This way, for example, a file copy utility would not need to care about encodings. Once tested on ASCII strings for file name arguments, it would certainly work correctly for file names in any language, as arguments are treated as cookies. The code of the file copy utility would not need to change at all to support foreign languages. fopen() would accept Unicode seamlessly, and so would argv.
  • On Microsoft Windows, however, making a file copy utility that can accept file names in a mix of several different Unicode blocks requires advanced trickery. First, the application must be compiled as Unicode-aware. In this case, it cannot have main() function with standard-C parameters. It will then accept UTF-16 encoded argv. To convert a Windows program written with narrow text in mind to support Unicode, one has to refactor deep and to take care of each and every string variable.
  • On Windows, SetCodePage() API enables receiving non-ASCII characters, but only from a single ANSI codepage. An unimplemented parameter CP_UTF8 would enable doing the above, on Windows.
  • The standard library shipped with MSVC is poorly implemented. It forwards narrow-string parameters directly to the OS ANSI API. There is no way to override this. Changing std::locale does not work. It’s impossible to open a file with a Unicode name on MSVC using standard features of C++. The standard way to open a file is:
    std::fstream fout("abc.txt");
    The proper way to get around is by using Microsoft’s own hack that accepts wide-string parameter, which is a non-standard extension.
  • There is no way to return Unicode from std::exception::what() other than using UTF-8.
  • UTF-16 is often misused as a fixed-width encoding, even by Windows native programs themselves: in plain Windows edit control (until Vista), it takes two backspaces to delete a character which takes 4 bytes in UTF-16. On Windows 7 the console displays that character as two invalid characters, regardless of the font used.
  • Many third-party libraries for Windows do not support Unicode: they accept narrow string parameters and pass them to the ANSI API. Sometimes, even for file names. In the general case, it is impossible to work around this, as a string may not be representable completely in any ANSI code page (if it contains characters from a mix of Unicode blocks). What is normally done on Windows for file names is getting an 8.3 path to the file (if it already exists) and feeding it into such a library. It is not possible if the library is supposed to create a non-existing file. It is not possible if the path is very long and the 8.3 form is longer than MAX_PATH. It is not possible if short-name generation is disabled in OS settings.
  • UTF-16 is very popular today, even outside the Windows world. Qt, Java, C#, Python, the ICU—they all use UTF-16 for internal string representation.

Our Conclusions

UTF-16 is the worst of both worlds—variable length and too wide. It exists for historical reasons, adds a lot of confusion and will hopefully die out.
Portability, cross-platform interoperability and simplicity are more important than interoperability with existing platform APIs. So, the best approach is to use UTF-8 narrow strings everywhere and convert them back and forth on Windows before calling APIs that accept strings. Performance is seldom an issue of any relevance when dealing with string-accepting system APIs (e.g. UI code and file system APIs), but there is a huge advantage to using the same encoding everywhere, and we see no sufficient reason to do otherwise.
Speaking of performance, machines often use strings to communicate (e.g. HTTP headers, XML). Many see this as a mistake, but regardless of that it is nearly always done in English, giving UTF-8 further advantage there. Using different encodings for different kinds of strings significantly increases complexity and consequent bugs.
In particular, we believe that adding wchar_t to C++ was a mistake, and so are the Unicode additions to C++11. What must be demanded from the implementations, though, is that the basic execution character set would be capable of storing any Unicode data. Then, every std::string or char* parameter would be Unicode-compatible. ‘If this accepts text, it should be Unicode compatible’—and with UTF-8, it is also easy to do.
The standard facets have many design flaws. This includes std::numpunct, std::moneypunct and std::ctype not supporting variable-length encoded characters (non-ASCII UTF-8 and non-BMP UTF-16). They must be fixed:
  • decimal_point() and thousands_sep() should return a string rather than a single code unit. (By the way C locales do support this, albeit not customizable.)
  • toupper() and tolower() shall not be phrased in terms of code units, as it does not work in Unicode. For example, the German ß must be converted to SS and ffl to FFL.

How to do text on Windows

The following is what we recommend to everyone else for compile-time checked Unicode correctness, ease of use and better multi-platformness of the code. This substantially differs from what is usually recommended as the proper way of using Unicode on Windows. Yet, an in-depth research of these recommendations resulted in the same conclusion. So here it goes:
  • Do not use wchar_t or std::wstring in any place other than immediately adjacent to APIs accepting UTF-16.
  • Do not use _T("") or L"" literals in any place other than parameters to APIs accepting UTF-16.
  • Do not use types, functions, or their derivatives that are sensitive to the _UNICODE constant, such as LPTSTR or CreateWindow().
  • Yet, _UNICODE is always defined, so that passing narrow strings to WinAPI fails to compile instead of being silently accepted.
  • std::strings and char*, anywhere in the program, are considered UTF-8 (if not said otherwise).
  • Only use Win32 functions that accept widechars (LPWSTR), never those which accept LPTSTR or LPSTR. Pass parameters this way:
    ::SetWindowTextW(widen(someStdString or "string literal").c_str())
    (The policy uses conversion functions described below.)
  • With MFC strings:
    CString someoneElse; // something that arrived from MFC.
    
    // Converted as soon as possible, before passing any further away from the API call:
    std::string s = str(boost::format("Hello %s\n") % narrow(someoneElse));
    AfxMessageBox(widen(s).c_str(), L"Error", MB_OK);

Working with files, filenames and fstreams on Windows

  • Never produce text output files with non-UTF-8 content
  • Using fopen() should anyway be avoided for RAII/OOD reasons. However, if necessary, use _wfopen() and WinAPI conventions as described above.
  • Never pass std::string or const char* filename arguments to the fstream family. MSVC CRT does not support UTF-8 arguments, but it has a non-standard extension which should be used as follows:
  • Convert std::string arguments to std::wstring with widen:
    std::ifstream ifs(widen("hello"), std::ios_base::binary);
    We will have to manually remove the conversion, when MSVC’s attitude to fstream changes.
  • This code is not multi-platform and may have to be changed manually in the future.
  • Alternatively use a set of wrappers that hide the conversions.

Conversion functions

This guideline uses the conversion functions from the Boost.Nowide library (it is not yet a part of boost):
std::string narrow(const wchar_t *s);
std::wstring widen(const char *s);
std::string narrow(const std::wstring &s);
std::wstring widen(const std::string &s);
The library also provides a set of wrappers for commonly used standard C and C++ library functions that deal with files, as well as means of reading and writing UTF-8 through iostreams.
These functions and wrappers are easy to implement using Windows’ MultiByteToWideChar and WideCharToMultiByte functions. Any other (possibly faster) conversion routines can be used.
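For reference, here is roughly what the std::string/std::wstring overloads look like when written directly against the Win32 API. This is only a sketch with error handling omitted, not the actual Boost.Nowide implementation:

#include <string>
#include <windows.h>

std::wstring widen(const std::string& s)
{
    if (s.empty()) return std::wstring();
    // First call computes the required length, second call does the conversion.
    int len = MultiByteToWideChar(CP_UTF8, 0, s.data(), (int)s.size(), NULL, 0);
    std::wstring result(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, s.data(), (int)s.size(), &result[0], len);
    return result;
}

std::string narrow(const std::wstring& s)
{
    if (s.empty()) return std::string();
    int len = WideCharToMultiByte(CP_UTF8, 0, s.data(), (int)s.size(), NULL, 0, NULL, NULL);
    std::string result(len, '\0');
    WideCharToMultiByte(CP_UTF8, 0, s.data(), (int)s.size(), &result[0], len, NULL, NULL);
    return result;
}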

FAQ

  1. Q: Are you a linuxer? Is this a concealed religious fight against Windows?

    A: No, I grew up on Windows, and I am a Windows fan. I believe that they made a wrong choice in the text domain, because they made it earlier than others.—Pavel
  2. Q: Are you an Anglophile? Do you secretly think English alphabet and culture are superior to any other?

    A: No, and my country is non-ASCII speaking. I do not think that using a format which encodes ASCII characters in single byte is Anglo-centrism, or has anything to do with human interaction. Even though one can argue that source codes of programs, web pages and XML files, OS file names and other computer-to-computer text interfaces should never have existed, as long as they do exist, text is not only for human readers.
  3. Q: Why do you guys care? I program in C# and/or Java and I don’t need to care about encodings at all.

    A: Not true. Both C# and Java offer a 16-bit char type, which is less than a Unicode character. Congratulations. The .NET indexer str[i] works in units of the internal representation, hence a leaky abstraction once again. Substring methods will happily return an invalid string, cutting a non-BMP character in parts.
    Furthermore, you have to mind encodings when you write your text to files on disk, to network connections, to external devices, or to any place another program may read from. Please be kind enough to use System.Text.Encoding.UTF8 (.NET) in these cases, and never Encoding.ASCII, UTF-16 or cellphone PDU, regardless of your assumptions about the content.
    Web frameworks like ASP.NET do suffer from the poor choice of internal string representation in the underlying framework: the expected string output (and input) of a web application is nearly always UTF-8, resulting in significant conversion overhead in high-throughput web applications and web services.
  4. Q: Why not just let any programmer use her favorite encoding internally, as long as she knows how to use it?

    A: We have nothing against the correct use of any encoding. However, it becomes a problem when the same type, such as std::string, means different things in different contexts. While it is ‘ANSI codepage’ for some, for others it means ‘this code is broken and does not support non-English text’. In our programs, it means Unicode-aware UTF-8 string. This diversity is a source of many bugs and much misery: it is additional complexity that the world does not really need, and the result is much Unicode-broken software, industry-wide.
  5. Q: UTF-16 characters that take more than two bytes are extremely rare in the real world. This practically makes UTF-16 a fixed-width encoding, giving it a whole bunch of advantages. Can’t we just neglect these characters?

    A: Are you serious about not supporting all of Unicode in your software design? And, if you are going to support it anyway, how does the fact that non-BMP characters are rare practically change anything, except for making software testing harder? What does matter, however, is that text manipulations are relatively rare in real applications—compared to just passing strings around as-is. This means the "almost fixed width" has little performance advantage (see Performance), while having shorter strings may be significant.
  6. Q: Why do you turn on the _UNICODE define, if you do not intend to use Windows’ LPTSTR/TCHAR/etc macros?

    A: This is a precaution against plugging a UTF-8 char* string into ANSI-expecting functions of the Windows API. We want it to generate a compiler error. It is the same kind of hard-to-find bug as passing an argv[] string to fopen() on Windows: it assumes that the user will never pass file names outside the current codepage. You are unlikely to find this kind of bug by manual testing, unless your testers are trained to supply Chinese file names occasionally, and yet it is broken program logic. Thanks to the _UNICODE define, you get a compiler error instead, as illustrated below.
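    A minimal illustration (the helper set_title is an assumption of this sketch; widen() is the conversion function used throughout this guideline):
    #include <windows.h>
    #include <string>

    std::wstring widen(const std::string &s);                // conversion function from this guideline

    void set_title(HWND hwnd, const std::string &utf8_title)
    {
        // ::SetWindowText(hwnd, utf8_title.c_str());        // with _UNICODE this resolves to SetWindowTextW,
        //                                                    // so passing char* is a compile-time error
        ::SetWindowTextW(hwnd, widen(utf8_title).c_str());    // the error forces the explicit, correct conversion
    }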
  7. Q: Isn’t it quite naïve to think that Microsoft will stop using widechars one day?

    A: Let’s first see when they start supporting CP_UTF8 as a valid locale. This should not be very hard to do. After that, we see no reason why anybody would continue using the widechar APIs. Also, adding support for CP_UTF8 would ‘unbreak’ some existing Unicode-broken programs and libraries.
    Some say that adding CP_UTF8 support would break existing applications that use the ANSI API, and that this was supposedly the reason why Microsoft had to resort to creating the wide-string API. This is not true. Even some popular ANSI encodings are variable-length (Shift JIS, for example), so no correct code would become broken. The reason Microsoft chose UCS-2 is purely historical. Back then UTF-8 did not yet exist, Unicode was believed to be ‘just a wider ASCII’, and it was considered important to use a fixed-width encoding.
  8. Q: What are characters, code points, code units and grapheme clusters?

    A: Here is an excerpt of the definitions according to the Unicode Standard with our comments. Refer to the relevant sections of the standard for more detailed description.
    Code point
    Any numerical value in the Unicode codespace.[§3.4, D10] For instance: U+3243F.
    Code unit
    The minimal bit combination that can represent a unit of encoded text.[§3.9, D77] For example, UTF-8, UTF-16 and UTF-32 use 8-bit, 16-bit and 32-bit code units respectively. The above code point will be encoded as ‘f0 b2 90 bf’ in UTF-8, ‘d889 dc3f’ in UTF-16 and ‘0003243f’ in UTF-32. Note that these are just sequences of groups of bits; how they are stored further depends on the endianness of the particular encoding scheme. So, when storing the above UTF-16 code units on octet-oriented media, they will be converted to ‘d8 89 dc 3f’ for UTF-16BE and to ‘89 d8 3f dc’ for UTF-16LE.
    Abstract character
    A unit of information used for the organization, control, or representation of textual data.[§3.4, D7] The standard further says in §3.1:
    For the Unicode Standard, [...] the repertoire is inherently open. Because Unicode is a universal encoding, any abstract character that could ever be encoded is a potential candidate to be encoded, regardless of whether the character is currently known.
    The definition is indeed abstract. Whatever one can think of as a character—is an abstract character. For example, tengwar letter ungwe is an abstract character, although it is not yet representable in Unicode.
    Encoded character
    Coded character
    A mapping between a code point and an abstract character.[§3.4, D11] For example, U+1F428 is a coded character which represents the abstract character 🐨 koala.
    This mapping is neither total, nor injective, nor surjective:
    • Surrogates, noncharacters and unassigned code points do not correspond to abstract characters at all.
    • Some abstract characters can be encoded by different code points; U+03A9 greek capital letter omega and U+2126 ohm sign both correspond to the same abstract character ‘Ω’, and must be treated identically.
    • Some abstract characters cannot be encoded by a single code point. These are represented by sequences of coded characters. For example, the only way to represent the abstract character ю́ cyrillic small letter yu with acute is by the sequence U+044E cyrillic small letter yu followed by U+0301 combining acute accent.
    Moreover, for some abstract characters, there exist representations using multiple code points, in addition to the single coded character form. The abstract character ǵ can be coded by the single code point U+01F5 latin small letter g with acute, or by the sequence <U+0067 latin small letter g, U+0301 combining acute accent>.
    User-perceived character
    Whatever the end user thinks of as a character. This notion is language dependent. For instance, ‘ch’ is two letters in English and Latin, but considered to be one letter in Czech and Slovak.
    Grapheme cluster
    A sequence of coded characters that ‘should be kept together’.[§2.11] Grapheme clusters approximate the notion of user-perceived characters in a language independent way. They are used for, e.g., cursor movement and selection.
    Character
    May mean any of the above. The Unicode Standard uses it as a synonym for coded character.[§3.4]
    When some programming language or library documentation says ‘character’, it almost always means a code unit. When an end user is asked about the number of characters in a string, she will count the user-perceived characters. When a programmer tries to count the number of characters, she will count the number of code units, code points, or grapheme clusters, according to the level of her expertise. All this is a source of confusion, as people conclude that, if for the length of the string ‘🐨’ the library returns a value other than one, then it ‘does not support Unicode’.
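    To make the distinction between code units and code points concrete, here is a minimal sketch (an assumption of this guide, not part of any library) that counts the code points of a valid UTF-8 string by skipping continuation bytes of the form 10xxxxxx; the number of code units is simply s.size():
    #include <cstddef>
    #include <string>

    std::size_t count_code_points(const std::string &utf8)    // assumes valid UTF-8 input
    {
        std::size_t n = 0;
        for (unsigned char c : utf8)
            if ((c & 0xC0) != 0x80)                            // not a continuation byte, so it starts a code point
                ++n;
        return n;
    }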
  9. Q: Why would the Asians give up on UTF-16 encoding, which saves them 50% the memory per character?

    A: It does so only in artificially constructed examples containing only characters in the U+0800 to U+FFFF range. However, computer-to-computer text interfaces dominate all other uses of text. These include XML, HTTP, filesystem paths and configuration files, which use almost exclusively ASCII characters, and in fact UTF-8 is used just as often in those countries.
    For dedicated storage of Chinese books, UTF-16 may still be used as a fair optimization. As soon as the text is retrieved from such storage, it should be converted to the standard compatible with the rest of the world. In any case, if storage is at a premium, lossless compression will be used, and then UTF-8 and UTF-16 take roughly the same space. Furthermore, ‘in the said languages, a glyph conveys more information than a [L]atin character so it is justified for it to take more space.’ (Tronic, UTF-16 harmful).
    Here are the results of a simple experiment. The space used by the HTML source of some web page (Japan article, retrieved from Japanese Wikipedia on 2012–01–01) is shown in the first column. The second column shows the results for text with markup removed, that is ‘select all, copy, paste into plain text file’.

                        HTML Source (Δ UTF-8)     Dense text (Δ UTF-8)
    UTF-8               767 KB (0%)               222 KB (0%)
    UTF-16              1 186 KB (+55%)           176 KB (−21%)
    UTF-8 zipped        179 KB (−77%)             83 KB (−63%)
    UTF-16LE zipped     192 KB (−75%)             76 KB (−66%)
    UTF-16BE zipped     194 KB (−75%)             77 KB (−65%)
    As can be seen, UTF-16 takes about 50% more space than UTF-8 on real data, saves only 20% for dense Asian text, and hardly competes with general-purpose compression algorithms.
  10. Q: What do you think about BOMs?

    A: They are another reason not to use UTF-16. UTF-8 has a BOM too, even though byte order is not an issue in this encoding; its only purpose is to signal that the stream is UTF-8. If UTF-8 remains the only popular encoding (as it already is in the internet world), the BOM becomes redundant. In practice, many UTF-8 text files omit the BOM today. The Unicode Standard does not recommend using BOMs.
  11. Q: What do you think about line endings?

    A: All files shall be read and written in binary mode since this guarantees interoperability—a program will always give the same output on any system. Since the C and C++ standards use \n as in-memory line endings, this will cause all files to be written in the POSIX convention. It may cause trouble when the file is opened in Notepad on Windows; however, any decent text viewer understands such line endings.
  12. Q: But what about performance of text processing algorithms, byte alignment, etc?

    A: Is it really better with UTF-16? Maybe so. ICU uses UTF-16 for historical reasons, so this is quite hard to measure. However, most of the time strings are treated as opaque cookies rather than sorted or reversed at every other use; the smaller encoding is then favorable for performance.
  13. Q: Isn’t UTF-8 merely an attempt to be compatible with ASCII? Why keep this old fossil?

    A: Maybe it was. Today, it is a better and more popular encoding of Unicode than any other.
  14. Q: Is it really a fault of UTF-16 that people misuse it, assuming that it is 16 bits per character?

    A: Not really. But yes, safety is an important feature of every design.
  15. Q: If std::string means UTF-8, wouldn’t that get confused with code that stores plain text in std::strings?

    A: There is no such thing as plain text. There is no reason for storing codepage-ANSI or ASCII-only text in a class named ‘string’.
  16. Q: Won’t the conversions between UTF-8 and UTF-16 when passing strings to Windows slow down my application?

    A: First, you will do some conversion either way: either when calling the system, or when interacting with the rest of the world. Even if your application talks to the system more often than to the rest of the world, here is a little experiment.
    A typical use of the OS is to open files. This function executes in (184 ± 3)μs on my machine:
    void f(const wchar_t* name)
    {
        HANDLE f = CreateFile(name, GENERIC_WRITE, FILE_SHARE_READ, 0, CREATE_ALWAYS, 0, 0);
        DWORD written;
        WriteFile(f, "Hello world!\n", 13, &written, 0);
        CloseHandle(f);
    }
    While this runs in (186 ± 0.7)μs:
    void f(const char* name)
    {
        HANDLE f = CreateFile(widen(name).c_str(), GENERIC_WRITE, FILE_SHARE_READ, 0, CREATE_ALWAYS, 0, 0);
        DWORD written;
        WriteFile(f, "Hello world!\n", 13, &written, 0);
        CloseHandle(f);
    }
    (Run with name="D:\\a\\test\\subdir\\subsubdir\\this is the sub dir\\a.txt" in both cases. It was averaged over 5 runs. We used an optimized widen that relies on std::string contiguous storage guarantee given by C++11.)
    This is just (1 ± 2)% overhead. Moreover, MultiByteToWideChar is almost surely suboptimal. Better UTF-8↔UTF-16 conversion functions exist.
  17. Q: How do I write UTF-8 string literal in my C++ code?

    A: If you internationalize your software then all non-ASCII strings will be loaded from an external translation database, so it is not a problem.
    If you still want to embed a special character, you can do so as follows. In C++11:
    u8"∃y ∀x ¬(x ≺ y)"
    With compilers that do not support ‘u8’ you can hard-code the UTF-8 code units as follows:
    "\xE2\x88\x83y \xE2\x88\x80x \xC2\xAC(x \xE2\x89\xBA y)"
    However the most straightforward way is to just write the string as-is and save the source file encoded in UTF-8:
    "∃y ∀x ¬(x ≺ y)"
    Unfortunately, MSVC converts it to some ANSI codepage, corrupting the string. To work around this, save the file in UTF-8 without a BOM. MSVC will then assume that it is in the correct codepage and will not touch your strings. However, this makes it impossible to use Unicode identifiers and wide string literals (which you will not be using anyway).
  18. Q: How can I check for presence of a specific ASCII character, e.g. apostrophe (') for SQL injection prevention, or HTML markup special characters, etc. in a UTF-8 encoded string?

    A: Do as you would for an ASCII string. Every non-ASCII character is encoded in UTF-8 as a sequence of bytes, each of them having a value greater than 127. This leaves no possibility of collision with a naïve byte-wise algorithm: it stays simple, fast and elegant.
    Also, you can search for a UTF-8 encoded substring in a UTF-8 string as if it were a plain byte array; there is no need to mind code point boundaries. This is a design feature of UTF-8: a leading byte of an encoded code point can never hold a value corresponding to one of the trailing bytes of any other code point. A minimal illustration follows.
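    Both checks below operate on the raw bytes of the UTF-8 string, with no decoding (an illustrative sketch; the function names are assumptions of this guide):
    #include <string>

    bool has_apostrophe(const std::string &utf8)               // an ASCII byte search is safe in UTF-8
    {
        return utf8.find('\'') != std::string::npos;
    }

    bool contains(const std::string &haystack, const std::string &needle)
    {
        return haystack.find(needle) != std::string::npos;     // byte-wise substring search is correct
    }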
  19. Q: I have a complex large char-based Windows application. What is the easiest way to make it Unicode-aware?

    A: Keep the chars. Define _UNICODE to get compiler errors where narrow()/widen() should be used. Find all fstream and fopen() uses and switch to the wide overloads as described above. By now, you are almost done.
    If you use third-party libraries that do not support Unicode, e.g. ones that forward file name strings as-is to fopen(), you will have to work around this with tools such as GetShortPathName() as shown above.
  20. Q: I already use this approach and I want to make our vision come true. What can I do?

    A: Review your code and see what library is most painful to use in portable Unicode-aware code. Open a bug report to the authors.
    If you are a C or C++ library author, use char* and std::string with UTF-8 implied, and refuse to support ANSI code pages—since they are inherently Unicode-broken.
    If you are a Microsoft employee, push for implementing support for CP_UTF8 as one of the narrow-API code pages.

Myths

Note: If you are not familiar with the Unicode terminology, please read this FAQ first.
Note: For the purpose of this discussion, indexing into the string is also a kind of character counting.

Counting characters can be done in constant time with UTF-16.

This is a common mistake by those who think that UTF-16 is a fixed-width encoding. It is not: UTF-16 is a variable-length encoding. Refer to this FAQ if you still deny the existence of non-BMP characters.
Many try to fix this statement by switching encodings, and come up with the following statement:

Counting characters can be done in constant time with UTF-32.

Now, the truth of this statement depends on the meaning of the ambiguous and overloaded word ‘character’. The only interpretations that would make the claim true are ‘code units’ and ‘code points’, which coincide in UTF-32. However, code points are not characters, neither according to the Unicode Standard nor in the eyes of the end user; some of them are even designated noncharacters. So, assuming we can guarantee that the string does not contain noncharacters, each code point represents a single coded character, and we can count them.
But is counting them such an important achievement? And why does this concern arise at all?

Counting coded characters or code points is important.

The importance of code points is frequently overstated. This is due to a misunderstanding of the complexity of Unicode, which merely reflects the complexity of human languages. It is easy to tell how many characters there are in ‘Abracadabra’, but it is not so simple for the following string:
Приве́т नमस्ते שָׁלוֹם
The above string consists of 22 (!) code points but only 16 grapheme clusters, and of 20 code points if converted to NFC (compared to 11 code points for ‘Abracadabra’). Yet, the number of code points is irrelevant to almost any software engineering question, with perhaps the only exception of converting the string to UTF-32. For example:
  • For cursor movement, text selection and alike, grapheme clusters shall be used.
  • For limiting the length of a string in input fields, file formats, protocols, or databases, the length is measured in code units of some predetermined encoding. The reason is that any length limit is derived from the fixed amount of memory allocated for the string at a lower level, be it in memory, disk or in a particular data structure.
  • The size of the string as it appears on the screen is unrelated to the number of code points in the string. One has to communicate with the rendering engine to find it out. A code point does not necessarily occupy exactly one column, even in monospace fonts and terminals. POSIX takes this into account.

In NFC each code point corresponds to one user-perceived character.

No, because the number of user-perceived characters that can be represented in Unicode is virtually infinite, and in practice most of them do not have a fully composed form. For example, the NFD string from the example above, which consists of three real words in three real languages, still consists of 20 code points when converted to NFC. This is still more than the 16 user-perceived characters it contains.

The string length() operation must count user-perceived or coded characters. If not, it does not support Unicode properly.

Unicode support of libraries and programming languages is frequently judged by the value returned for the ‘length of the string’ operation. By this measure of Unicode support, most popular languages, such as C# and Java, and even the ICU itself, would not support Unicode. For example, the length of the one-character string ‘🐨’ is often reported to be 2 by languages that use UTF-16 as the internal string representation, and 4 by languages that internally use UTF-8. The source of the misconception is that the specification of these languages uses the word ‘character’ to mean a code unit, while the programmer expects it to be something else.
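A small C++11 illustration of why each representation reports a different ‘length’ for the same single user-perceived character (a sketch for illustration only, not a recommendation to count anything this way):

    #include <cassert>
    #include <string>

    int main()
    {
        // U+1F428 koala: one user-perceived character, one code point, outside the BMP.
        std::string    utf8  = "\xF0\x9F\x90\xA8";   // UTF-8:  4 code units (bytes)
        std::u16string utf16 = u"\U0001F428";        // UTF-16: 2 code units (a surrogate pair)
        std::u32string utf32 = U"\U0001F428";        // UTF-32: 1 code unit (= 1 code point)

        assert(utf8.size()  == 4);
        assert(utf16.size() == 2);
        assert(utf32.size() == 1);
    }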

About the authors

This manifesto was written by Pavel Radzivilovsky, Yakov Galka and Slava Novgorodov, as a result of much experience and research into real-world Unicode issues and mistakes made by real-world programmers. The goal is to improve awareness of text issues and to inspire industry-wide changes that make Unicode-aware programming easier, ultimately improving the experience of users of programs written by human engineers. None of us is involved with the Unicode Consortium.
Much of the text was inspired by discussions on StackOverflow initiated by Artyom Beilis, the author of Boost.Locale. You can leave comments/feedback there. Additional inspiration came from the development conventions at VisionMap and Michael Hartl’s tauday.org.

Last modified: 2013-02-15

Sunday, 15 December 2013

C++ Libraries

  1. Dlib

    Dlib is a general purpose cross-platform C++ library designed using contract programming and modern C++ techniques. It is open source software and licensed under the Boost Software License.
  2. Origin

    Origin is a collection of experimental libraries written using the C++11 programming language. The purpose of this project is to foster experimentation with library design and programming techniques using the new version of the language.
    The Origin libraries are designed around a minimal set of core facilities that wrap and extend the C++ standard library. These facilities include an extensive type traits framework, support for concept-like type checking, new iterator adaptors, ranges, and support for specification testing.
  3. Folly

    Folly is an open-source C++ library developed and used at Facebook.
  4. Adobe Source Libraries

    The Adobe Source Libraries (ASL) are a collection of C++ libraries building foundation technology to allow the construction of commercial applications by assembling generic algorithms through declarative descriptions.
  5. Cinder

    Cinder is a community-developed, free and open source library for professional-quality creative coding in C++.
  6. JUCE

    JUCE is a wide-ranging C++ class library for building rich cross-platform applications and plugins for all the major operating systems.
  7. OpenFrameworks

    OpenFrameworks is an open source C++ toolkit for creative coding.