About Unicode NSStrings

Last week, this post titled Why is Swift’s String API So Hard? by Mike Ash made the rounds. It explains in depth the kinds of problems that arise when working with strings in different encodings, especially in unicode, and how the Swift’s String API is built with these things in mind.

A real world example

In iZurvive, the users profile displays a list of groups, each of which has amongst other information a name and a color. We want to take the first letter of the group’s name, uppercase it, and put it into a little circle with the group’s color. Simple enough.

This is the line of code that sets the labels text:

label.text = [[name substringToIndex:1] uppercaseString];

Here’s the result just as expected:

But wait, when the groups name starts with an emoji something is not quite right:

Somehow the emoji is not rendered correctly, but its no problem when displaying the full group name and there’s nothing special about this UILabel. The problem is actually related to the way Objective-C handles strings. When writing this code, I assumed the first “character” to be something different than it actually was.

This is how our string is stored in memory:

3D D8 2C DE 65 00 6D 00 6F 00 6A 00 69 00
----------- ----- ----- ----- ----- -----
    😬        e     m     o     j    i

As you can see, the emoji is actually 4 bytes long instead of 2 bytes as all the other characters. Objective-C uses UTF-16 as its canonical representation of strings. Calling [name substringToIndex:1] returns the first UTF-16 character, which consists of the two bytes 0x3D, 0xD8. Thats not a valid unicode symbol anymore (for all the details, go read Mike Ash’s post) and for that reason, our label can not render a correct symbol. We actually need the first two UTF-16 characters or the first four bytes, 0x3D 0xD8 0x2C 0xDE in that case.

How do we know how many characters are necessary in a concrete case? Luckily NSString has you covered: rangeOfComposedCharacterSequenceAtIndex returns the range of the composed character sequence at a given index.

By adjusting the above snippet to this:

label.text = [[name substringWithRange:[name rangeOfComposedCharacterSequenceAtIndex:0]] uppercaseString];

we get the expected result correct in all cases:

Conclusion

NSString might seem much easier than Swift’s String API, but it makes it also very easy to miss thinks like this. I almost fell in this trap today and who knows how many similar bugs are still lurking in our Objecctive-C codebase. I’m certainly glad that I’ve just read the aforementioned post and was thus able to identify the problem relatively easily.

A real world example

Conclusion

Markus Chmelar, MSc