Gavin Grover's Unicode-Ultra Wikiblog

Expanding Unicode, 2017 F.A.Q

Why do the utf-88 trailing surrogates double up as standalone characters?
answered on 26 Jan 2017:

In utf-16, both the leading surrogates (U+d800 to U+dbff) and trailing surrogates (U+dc00 to U+dfff) have no other use, which provides maximal self-synchronization. Some variations on Unicode let some of the 1024 trailing surrogates double as other characters when used standalone (i.e. when not following a leading surrogate), but this isn't official Unicode policy. The encoded sequence is still self-synchronizing, but requires a lookback of one code unit. (Of course, the leading surrogates could never be used that way.)
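
A minimal sketch of the difference in lookback, assuming a list of 16-bit code units as input. The function names are hypothetical, and the "variant" rule is only the generic idea described above, not any particular published scheme.

```python
def is_lead(unit):
    """True for a utf-16 leading surrogate (U+d800 to U+dbff)."""
    return 0xD800 <= unit <= 0xDBFF


def is_trail(unit):
    """True for a utf-16 trailing surrogate (U+dc00 to U+dfff)."""
    return 0xDC00 <= unit <= 0xDFFF


def starts_character_standard(units, i):
    """Standard utf-16: a trailing surrogate never starts a character,
    so the unit at index i can be classified with no lookback at all."""
    return not is_trail(units[i])


def starts_character_variant(units, i):
    """Variant where trailing surrogates may also stand alone: a trailing
    surrogate is the second half of a pair only if the unit just before
    it is a leading surrogate, so one unit of lookback is needed."""
    if is_trail(units[i]) and i > 0 and is_lead(units[i - 1]):
        return False
    return True


# 'A', the surrogate pair d83d de00, then a trailing surrogate on its own.
units = [0x0041, 0xD83D, 0xDE00, 0xDC00]
standard = [starts_character_standard(units, i) for i in range(len(units))]
variant = [starts_character_variant(units, i) for i in range(len(units))]
```

Under the standard rule the final unit is just an unpaired surrogate; under the variant rule it counts as the start of a standalone character.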

The reason utf-88 allows its trailing surrogates (U+100000 to U+10ffff) to double up as standalone characters is that utf-88 isn't intended to be a permanent encoding. It's only an interim encoding until the Unicode Consortium sees sense and reverts to the pre-2003 upper limit for utf-8 and utf-32 of just over 2 billion codepoints. utf-88 is a surrogation scheme over utf-8 that uses half of private use plane U+fxxxx (i.e. U+f8000 to U+fffff) as leading surrogates and all of plane U+10xxxx as trailing surrogates, while still letting plane U+10xxxx continue in use as a private use plane. It does so because that is how the plane will continue to be used after the Unicode Consortium introduces its own second-tier surrogation scheme over utf-16 to accompany the reversion of utf-8 and utf-32 back up to 2 billion codepoints.
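
A rough sketch of how such a pairing could be decoded, working on already-decoded codepoints rather than utf-8 bytes. The surrogate ranges come from the text above, but the bit layout in `combine` (15 bits from the leading surrogate, 16 from the trailing one) is an assumption made for illustration, as are all the function names and the handling of a lone leading surrogate.

```python
LEAD_FIRST, LEAD_LAST = 0xF8000, 0xFFFFF      # half of plane U+fxxxx
TRAIL_FIRST, TRAIL_LAST = 0x100000, 0x10FFFF  # all of plane U+10xxxx


def is_lead(cp):
    return LEAD_FIRST <= cp <= LEAD_LAST


def is_trail(cp):
    return TRAIL_FIRST <= cp <= TRAIL_LAST


def combine(lead, trail):
    # ASSUMPTION: plain concatenation of the two offsets; the actual
    # utf-88 mapping may differ from this.
    return ((lead - LEAD_FIRST) << 16) | (trail - TRAIL_FIRST)


def decode(codepoints):
    """Group codepoints into ('pair', value) and ('single', codepoint).

    A plane U+10xxxx codepoint is a trailing surrogate only when it
    directly follows a leading surrogate; otherwise it stands alone as
    an ordinary private use character.
    """
    out = []
    i = 0
    while i < len(codepoints):
        cp = codepoints[i]
        if is_lead(cp) and i + 1 < len(codepoints) and is_trail(codepoints[i + 1]):
            out.append(("pair", combine(cp, codepoints[i + 1])))
            i += 2
        else:
            # Hypothetical choice: a lone leading surrogate is passed through.
            out.append(("single", cp))
            i += 1
    return out


# Number of distinct pairs: 32768 leading x 65536 trailing = 2**31.
pair_count = (LEAD_LAST - LEAD_FIRST + 1) * (TRAIL_LAST - TRAIL_FIRST + 1)
```

The pair count of 2**31 (2,147,483,648) is the "just over 2 billion" figure, whatever the exact mapping turns out to be.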

utf-88 is an interim surrogation scheme that can be used now, and can be converted to an expanded utf-8 easily when the Consortium finally makes it official.


Why does utf-88 provide 1 million private use codepoints?
answered on 26 Jan 2017:

There are not enough private use codepoints in Unicode: only about 137,000, in 3 blocks (U+e000 to U+f8ff, U+f0000 to U+ffffd, and U+100000 to U+10fffd). If people want to design square-structured scripts like Unihan and Korean Hangul, they'll need many more codepoints. utf-88 extends the third private use block to over 1 million codepoints by redefining U+10fffe and U+10ffff from Nonchar to Private Use, and by adding planes U+11xxxx to U+1fxxxx. It uses these codepoints not only because they're contiguous but also because they're the remaining ones that can be encoded with 4 bytes under the original pre-2003 utf-8 encoding scheme as proposed by Rob Pike and Ken Thompson, thus showing respect for users of Unicode by providing this "prime real estate" for their private use needs.
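
The figures above can be checked with a little arithmetic. This is only a sketch: the block boundaries are the ones quoted in the paragraph, and the variable names are made up for the example.

```python
def block_size(first, last):
    """Number of codepoints in an inclusive range."""
    return last - first + 1


# The three private use blocks as Unicode currently defines them.
current_blocks = [
    (0xE000, 0xF8FF),      # in the basic plane
    (0xF0000, 0xFFFFD),    # plane U+fxxxx, minus its 2 Nonchars
    (0x100000, 0x10FFFD),  # plane U+10xxxx, minus its 2 Nonchars
]
current_total = sum(block_size(first, last) for first, last in current_blocks)

# The third block as extended by utf-88: U+100000 to U+1fffff.
extended_third = block_size(0x100000, 0x1FFFFF)

# A 4-byte sequence in the original utf-8 scheme carries 3 + 6 + 6 + 6
# payload bits, so its highest codepoint is 2**21 - 1.
max_four_byte = (1 << (3 + 6 + 6 + 6)) - 1
```

Here `current_total` comes to 137,468, `extended_third` to 1,048,576, and `max_four_byte` to 0x1fffff, the last codepoint of plane U+1fxxxx.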

The contiguity of this third private use block also suggests how the Unicode Consortium might eventually define a second-tier surrogation scheme for utf-16. Because planes U+0xxxx to U+fxxxx each have Nonchars for their last 2 codepoints, using such a plane in its entirety for trailing surrogates (as utf-88 does with plane U+10xxxx) wouldn't work for encoding the planes above, which won't have any Nonchars. The most likely scheme would use a single plane in its entirety for the leading surrogates (where it doesn't matter if the last two codepoints aren't used), and the first half of another plane for the trailing surrogates. If the special purpose plane U+exxxx is to be used, then the middle half of that plane (U+e4000 to U+ebfff) would be suitable for the trailing surrogates, along with perhaps all of plane U+dxxxx for the leading surrogates.
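
As a quick sanity check on that suggestion, the sketch below counts the pairs such a layout would offer and compares the count with the number of codepoints lying above plane U+10xxxx under a 2**31 limit. The layout is only the speculative one described above, and all names are hypothetical.

```python
# Leading surrogates: all of plane U+dxxxx except its last 2 codepoints (Nonchars).
lead_count = 0xDFFFD - 0xD0000 + 1

# Trailing surrogates: the middle half of plane U+exxxx.
trail_count = 0xEBFFF - 0xE4000 + 1

pair_count = lead_count * trail_count

# Codepoints from U+110000 up to the pre-2003 limit of U+7fffffff.
needed = 2 ** 31 - 0x110000

covers_everything = pair_count >= needed
```

Losing the two Nonchars from the leading plane costs only 2 x 32,768 pairs, so under these assumptions the pairs still outnumber the codepoints to be reached.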


Last edited Nov 3 at 5:03 AM by gavingrover, version 17
