Gavin Grover's Unicode-Ultra Wikiblog


28 March 2014; updated 26 Sep 2017

Building Unicode Ultra...

Since digging into Unicode's background during my reboot of the Groovy Language, I've discovered it isn't finished being built. To be of use in Groovy, I need to finish building Unicode before continuing with the Groovy reboot. The final build will comprise the following components:

  1. the ultra extensions to Unicode reinstating the pre-2003 upper limit to UTF-8 as originally designed by Pike and Thompson, and the ultra surrogate system for where this isn't available
  2. a 6-bit control language embedded in the 10xxxxxx bytes following a 11111111 byte, using characters U+20 to U+5F, to, among other uses, encode decomposition commands for Unihan characters
  3. intermediate decompositions of the 75,000 Unihan characters presently in Unicode (e.g. 丨 and 川 are included in Unicode, but not the two vertical strokes used in many characters, such as under the 人 in 介) to also be included in Unicode
  4. decomposition tables for all the Unihan characters similar to my own first attempt so they can be converted to Ultra Decomposed Normal Form
  5. an algorithm to generate 2.1 billion Unihan characters formulaically into the 5- and 6-byte encodings of UTF-8-Ultra so Unihan characters can be converted to Ultra Composed Normal Form
  6. correct standard fonts and generation code for all such codepoints so users don't have to rely on erroneous fonts such as Microsoft Windows 7's SimSun font which has the wrong pictures for most Unihan extension C and D characters
  7. standard keystrokes for entering Unihan characters pictorially so everyone can easily enter them, irrespective of their native language
The primary purpose of building Unicode Ultra is so Unihan ideographs can be used outside of their spoken language contexts in Asia. When Japanese use a single Kanji to represent a multisyllabic word, they read it visually, not phonetically. By making Unihan tokens visually defined characters in Unicode Ultra, not only Japanese but also Chinese, Westerners, and others can easily use them.

The primary purpose of rebooting the Groovy Language is to showcase Unihan in a programming language. Unicode's ID_Start and ID_Continue properties (and their X- versions) treat Unihan the same as other alphanumerics in identifier names, but Real Groovy will put an implied space before and after every Unihan character so they can only be used as standalone characters, as in Japanese kunyomi Kanji and classical Chinese texts. The programming language grammar can thus be much terser, and readable on a smartphone screen.


27 March 2014 (updated from 22 March)

Unicode Ultra Normal Forms

The proposed UTF-8-Ultra will need two new normal forms, bringing the total to six...
  • NFC: canonical composition
  • NFD: canonical decomposition
  • NFKC: compatibility composition
  • NFKD: compatibility decomposition
  • NFUC: ultra composition
  • NFUD: ultra decomposition

A very rough picture of these forms is:

     NFKD <-------+----> NFKC (=canonically compose NFKD)
                  |
           NFD <--+----------> NFC (=canonically compose NFD)
                  |
NFUD <------------+---------------> NFUC (=ultra-compose NFUD)

Ultra Decomposition Normal Form

NFUD relies on UTF-8-Ultra introducing a 6-bit control language to Unicode, embedded in the 10xxxxxx bytes following a 11111111 byte, as suggested in proposal 3(b) of the draft spec for Ultra Unicode. The characters U+20 to U+5F will be used, i.e. the space, 26 uppercase letters, 10 digits, and 27 punctuation and symbols !?,.:;#$%&@^*=+-_'"/\<>()[]. NFUD will extend NFD, and deprecate NFKD and NFKC.
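
As a rough sketch of the mechanics, here's how a single control character could ride inside a continuation byte; the function names are just my placeholders, and the real framing would of course follow whatever the Ultra spec finally says:

    # Sketch: map one control character (U+20..U+5F) to a 6-bit value
    # carried in a UTF-8 continuation byte (10xxxxxx), and back again.
    def control_char_to_byte(c: str) -> int:
        code = ord(c)
        assert 0x20 <= code <= 0x5F, "only SPACE..'_' are in the 6-bit set"
        return 0x80 | (code - 0x20)              # 10xxxxxx

    def byte_to_control_char(b: int) -> str:
        assert b & 0xC0 == 0x80, "not a continuation byte"
        return chr((b & 0x3F) + 0x20)

    assert control_char_to_byte(' ') == 0x80     # SPACE is the lowest value
    assert control_char_to_byte('_') == 0xBF     # '_' (U+5F) is the highest
    assert byte_to_control_char(0x8C) == ','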

Presently, if a character such as ª is compatibility decomposed (NFKD), it will become a, losing some information, namely that it had decomposition type super. Ultra decomposition (NFUD) will retain the compatibility decomposition types as ultra control commands, e.g. ª would become control command SUPER[, followed by a, followed by control command ]. Case mappings would similarly be decomposed, e.g. A becomes command UPPER[, token a, command ]. We would also decompose all Unihan tokens to a set of core components, e.g. 语 would become command UNIHAN[ACROSS[, token 讠, command DOWN[, token 五, token 口, command ]]]. Because NFUD commands can be nested, we could process UTF-8-Ultra Unicode in NFUD as a tree.

We would also use the control commands for embedding rich text commands besides these, such as BOLD, enabling UTF-8-Ultra higher-level protocols to extend, not envelop, Unicode encodings in a non-breaking manner, so the protocol will more likely become widely used, as UTF-8 did when it extended the ASCII encoding. But until we implement the embedded 6-bit control language in UTF-8, we need an alternative method of specifying such commands in Unicode texts, just as we use ultra surrogates with UTF-8 because the 5- and 6-byte extensions haven't been available since 2003.

We'll therefore use a markup language in our text and Real Groovy programs. If using backtick ` for escaping, we would write:
  • `SUPER[`a`]` in our text for ª,
  • `UPPER[`a`]` for A,
  • and `UNIHAN[ACROSS[`讠`DOWN[`五口`]]]` for 语.
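
To make the nesting and the backtick escaping concrete, here's a minimal sketch; the command names (SUPER, UPPER, UNIHAN, ACROSS, DOWN) are the ones proposed above, but the Python classes and function names are only my placeholders:

    # Sketch: build NFUD-style command trees and serialize them with the
    # backtick markup described above.
    class Cmd:
        def __init__(self, name, *children):
            self.name, self.children = name, list(children)

    def flatten(node):
        """Yield (is_command, text) pieces in document order."""
        if isinstance(node, str):                # a plain token
            yield (False, node)
            return
        yield (True, node.name + '[')
        for child in node.children:
            yield from flatten(child)
        yield (True, ']')

    def to_markup(node):
        """Wrap each run of command text in backticks."""
        pieces = []
        for is_cmd, text in flatten(node):
            if pieces and pieces[-1][0] == is_cmd:
                pieces[-1] = (is_cmd, pieces[-1][1] + text)   # merge runs
            else:
                pieces.append((is_cmd, text))
        return ''.join('`%s`' % t if is_cmd else t for is_cmd, t in pieces)

    print(to_markup(Cmd('SUPER', 'a')))          # `SUPER[`a`]`   (NFUD of ª)
    print(to_markup(Cmd('UPPER', 'a')))          # `UPPER[`a`]`   (NFUD of A)
    print(to_markup(Cmd('UNIHAN',
          Cmd('ACROSS', '讠', Cmd('DOWN', '五', '口')))))
    # `UNIHAN[ACROSS[`讠`DOWN[`五口`]]]`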

Ultra Composition Normal Form

NFUC is an extension of NFC, so all canonical compositions are also ultra compositions. So a Hangul syllable such as 강 will be the NFUC version of the Jamo sequence 강, and ñ will be the NFUC version of the sequence n◌̃, just as they are presently the NFC versions. The additional processing will be that where a formulaically generated version of a Unihan character exists in planes 0x20 to 0x7FFF (i.e. the planes representable by 5 or 6 bytes in UTF-8-Ultra), the formulaically generated version will be substituted.

When converting to NFUC, the character stream will first be converted to NFUD, the canonical reordering algorithm applied, then ultra-composed, leaving any decomposition sequences without formulaic versions as they are in the data. Whereas ultra-composition will process such 6-bit commands, canonical composition will ignore them and compatibility composition will remove them.


9 March 2014

Groovy Unicode UTF-8-Ultra

I've only just begun publicizing the Ultra extensions to Unicode I'll need when the Groovy Language reboot reaches maturity, but initial negative feedback made me think about the best next step. Besides the two main additions (i.e. extending UTF-16 with doubly-indirected surrogates, and removing the U+10FFFF upper limit in UTF-8), the proposal also presents a minor addition that's not recommended, i.e. extending the existing UTF-8 codepoint range with the same ultra surrogates proposed for UTF-16. Although I only put it in for completeness, it could be the best choice for building a reference implementation.

UTF-8-Ultra uses existing private use planes to define "ultra-surrogates". The top half of plane 0xF becomes 32,768 high ultra-surrogates (U+F_8000 to U+F_FFFF) and all of plane 0x10 becomes 65,536 low ultra-surrogates (U+10_0000 to U+10_FFFF). Together, they can access 2,147,483,648 codepoints (i.e. 32,768 planes) from U+0000_0000 to U+7FFF_FFFF. We would keep the first half of plane 0xF (U+F_0000 to U+F_7FFF) as 32,768 private use codepoints alongside the 6400 private use ones in the BMP. Of the additional ultra planes, we would leave another 983,040 codepoints (U+11_0000 to U+1F_FFFF) alone for general private use, they being the codepoints representable by 4 bytes in the original 1993 spec for UTF-8.
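
For concreteness, here's a minimal sketch of the pairing arithmetic; the exact bit layout (high surrogate carrying the plane, low surrogate the offset within it) is my own guess at the obvious mapping rather than anything fixed yet:

    # Sketch: pair a high and a low ultra-surrogate to reach any codepoint
    # up to U+7FFF_FFFF.  32,768 highs x 65,536 lows = 2,147,483,648.
    HIGH_BASE = 0xF8000     # U+F_8000 .. U+F_FFFF   (top half of plane 0xF)
    LOW_BASE  = 0x100000    # U+10_0000 .. U+10_FFFF (all of plane 0x10)

    def to_ultra_pair(cp: int):
        assert 0 <= cp <= 0x7FFFFFFF
        return HIGH_BASE + (cp >> 16), LOW_BASE + (cp & 0xFFFF)

    def from_ultra_pair(high: int, low: int) -> int:
        return ((high - HIGH_BASE) << 16) | (low - LOW_BASE)

    assert to_ultra_pair(0x7FFFFFFF) == (0xFFFFF, 0x10FFFF)
    assert from_ultra_pair(*to_ultra_pair(0x2A6D6)) == 0x2A6D6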

The planes representable by 5 or 6 bytes in that spec, 0x20 to 0x7FFF, we would use to implement a proof-of-concept showing how we can represent many CJK Unihan ideographs using the same formulaic method used to represent Korean Hangul. Each Hangul block is made up of 2 or 3 jamo: a leading consonant, a vowel, and possibly a trailing consonant. There are 19 leading consonants, 21 vowels, and 28 trailing consonants (including none at all), giving a total of 11,172 possible syllable blocks, generated by formula into range U+AC00 to U+D7A3. But only 2350 of them are used often enough to justify their inclusion in South Korea's KS-X-1001 national standard; the other 8822 are only there for completeness.
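
For comparison, the Hangul arithmetic is simple enough to show in a few lines; this is the standard composition formula from the Unicode spec, which is exactly the kind of formula I want for Unihan:

    # The standard Hangul syllable composition formula:
    # 19 leads x 21 vowels x 28 trails = 11,172 blocks from U+AC00.
    S_BASE, L_COUNT, V_COUNT, T_COUNT = 0xAC00, 19, 21, 28

    def compose_hangul(l: int, v: int, t: int = 0) -> str:
        """l: leading-consonant index, v: vowel index, t: trailing index (0 = none)."""
        assert 0 <= l < L_COUNT and 0 <= v < V_COUNT and 0 <= t < T_COUNT
        return chr(S_BASE + (l * V_COUNT + v) * T_COUNT + t)

    assert compose_hangul(0, 0, 0) == '가'               # first block, U+AC00
    assert compose_hangul(18, 20, 27) == chr(0xD7A3)     # last block, U+D7A3
    assert L_COUNT * V_COUNT * T_COUNT == 11172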

There are many more CJK components than Korean jamo: semantic radicals and phonetic components together number 500 to 1000 depending on how you count them, and unlike jamo, they can combine recursively to form Chinese characters (e.g. 懂 is 忄 followed by 董, which is 艹 over 重, which is 千 joined to the top of 里, which is 土 joined under 田, which is 冂 surrounding 土 from above). By discovering the best combination of base components and recursion depths, we can create 2.1 billion new Chinese characters which don't need their own glyphs in a font but can have their images calculated and rendered by formula. Most of them are just there for completeness, but some of them can deprecate a large chunk of the 75,000 already encoded.
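
To get a feel for the numbers, here's a back-of-the-envelope count; the 1000 components and the dozen binary layout operators are only rough guesses plugged into the estimate above, not anything decided:

    # Count distinct composed shapes: C base components combined by k
    # binary layout operators (across, down, surround, ...), nested to
    # depth d.  Each shape is either a bare component or op(left, right).
    def shapes(C: int, k: int, depth: int) -> int:
        n = C
        for _ in range(depth - 1):
            n = C + k * n * n
        return n

    for d in (1, 2, 3):
        print(d, shapes(C=1000, k=12, depth=d))   # 1000, 12_001_000, ~1.7e15
    # The 2.1 billion (2^31) codepoint budget falls between depth 2 and
    # depth 3, so the formula has to ration which deeper shapes get slots.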

Building such a UTF-8-Ultra reference implementation will put to rest some of the negative feedback coming my way:
  • Having conversions from UTF-8 or UTF-32 to UTF-16 fail due to a valid but unrepresentable-in-UTF-16 code point would be an extra headache to deal with (that many would forget to handle)
By encoding UTF-8 with ultra-surrogates we're showcasing the way UTF-16 can be extended so all valid pre-2003 UTF-8 (and UTF-32) codepoints are representable in UTF-16. The ridiculosity of using these surrogates with UTF-8 to access codepoints up to 0x7FFF_FFFF will raise serious questions about why UTF-8 was clipped to only 1.1 million characters back in November 2003, the exact same time Laforge joined a fledgling 3-month-old programming language ultimately intended to showcase Unicode in programming. By severely restricting the lexis of Unicode and the grammar of Groovy, business consultants have been able to ply their fake productivity tools around IT shops as substitutes for the real thing.
  • for zero benefit
By discovering a good formula for building Chinese characters we're showcasing an application with benefit. Because the Korean version of the same application was already encoded in Unicode 2.0, no-one can seriously suggest it's not a viable application. But I'm certain some will try.
  • Good for us nobody will listen to you
Because of this comment and the many downvotes I assume a voting ring was in place and UTF8 Everywhere was responsible. If so, I would've expected a different attitude from them since my proposal makes UTF-8 more useful and UTF-16 less so. I won't really need to use more than 137,000 measly private use codepoints in Unicode for a while, but I'm getting in early on what I expect to be a long decade or two of extolling the virtues of Unicode Ultra to deaf ears. But eventually programmers will code entire programs while riding to work on the subway in Beijing, Hong Kong, and Tokyo using Groovy and Unicode.


11 January 2014

The Groovy Future of Unicode

In last week's Groovy History of Unicode, I alluded to some possible Unicode futures which I'll explore here...

1. UTF-16-extended

1996's UTF-16 can access 17 planes, i.e. 1_114_112 codepoints. In plane 0, i.e. the BMP from U+0000 to U+FFFF, the 1024 high surrogates (U+D800 to U+DBFF) and 1024 low surrogates (U+DC00 to U+DFFF) can access an extra 1_048_576 codepoints (i.e. 16 extra planes) from U+1_0000 to U+10_FFFF. UTF-16 was subsequently adopted by Java and Windows, then in 2003, UTF-8 was accepted by the Unicode Consortium, but with its maximum 2.1 billion codepoints restricted to the same 1.1 million points that UTF-16 can access.
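
For reference, that existing surrogate arithmetic is just:

    # Standard UTF-16 surrogates: 1024 highs x 1024 lows reach the
    # 1,048,576 codepoints from U+1_0000 to U+10_FFFF.
    def to_surrogate_pair(cp: int):
        assert 0x10000 <= cp <= 0x10FFFF
        cp -= 0x10000
        return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)

    assert to_surrogate_pair(0x10000) == (0xD800, 0xDC00)
    assert to_surrogate_pair(0x10FFFF) == (0xDBFF, 0xDFFF)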

The top 2 planes (15 and 16) had been reserved, however, for private use which means we can use them as second-stage "ultra surrogates" to access the remaining 2.1 billion codepoints available in the original UTF-8 proposal. We would keep the first half of plane 15 (U+F_0000 to U+F_7FFF) as 32_768 private use codepoints alongside the 6400 private use ones in the BMP. But we'd reassign the top half of plane 15 to be 32_768 high ultra-surrogates (U+F_8000 to U+F_FFFF) and all of plane 16 as 65_536 low ultra-surrogates (U+10_0000 to U+10_FFFF). Together, they can access 2_147_483_648 codepoints (i.e. 32_768 planes) from U+0000_0000 to U+7FFF_FFFF.

A few points:
  • Because U+10_FFFE and U+10_FFFF are non-characters, not surrogates, the last two points in each plane would be unreachable and by default would also be non-characters.
  • Because U+F_FFFE and U+F_FFFF are non-characters, the entire last two planes (U+7FFE_xxxx and U+7FFF_xxxx) would be unreachable and be non-character planes.
  • We would make the first 17 high ultra-surrogates (U+F_8000 to U+F_8010) illegal so there's no overlap with the 17 already available planes. Though if we only made the first 16 illegal, we could re-use low ultra-surrogate plane 16 to be the first extended UTF-16 plane. Anyone could therefore use it in development mode before they know whether to implement it as UTF-16-extended or just UTF-16.
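
Putting those points together, here's a minimal sketch of the double indirection; the bit layout of the ultra-surrogate pair is my own illustration, not something the proposal has pinned down:

    # Sketch: a codepoint beyond U+10_FFFF becomes an ultra-surrogate pair,
    # and each ultra surrogate (living in planes 0xF and 0x10) is itself
    # emitted as an ordinary surrogate pair -- four 16-bit code units.
    def surrogate_pair(cp):
        cp -= 0x10000
        return [0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)]

    def utf16_extended_units(cp: int):
        if cp <= 0xFFFF:
            return [cp]                          # BMP: one code unit
        if cp <= 0x10FFFF:
            return surrogate_pair(cp)            # ordinary UTF-16
        assert cp <= 0x7FFFFFFF
        high = 0xF8000 + (cp >> 16)              # high ultra-surrogate; the
        low = 0x100000 + (cp & 0xFFFF)           # first 17 highs never occur
        return surrogate_pair(high) + surrogate_pair(low)

    print([hex(u) for u in utf16_extended_units(0x7FFFFFFF)])
    # ['0xdbbf', '0xdfff', '0xdbff', '0xdfff']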

2. Over 280 trillion codepoints

I'm not sure if Corrigendum 9 is really needed for Unicode to break free of its 2.1 billion character straitjacket, but UTF-8 bytes FE and FF would still be vacant even after re-extending UTF-8 back to 2.1 billion codepoints.

The original UTF-8 looks like:

    U+0000 to      U+007F:  0xxxxxxx
    U+0080 to      U+07FF:  110xxxxx 10xxxxxx
    U+0800 to      U+FFFF:  1110xxxx 10xxxxxx 10xxxxxx
  U+1_0000 to   U+1F_FFFF:  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
 U+20_0000 to  U+3FF_FFFF:  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U+400_0000 to U+7FFF_FFFF:  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
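
Here's a minimal sketch of an encoder for that original table, just to make the bit layout concrete (not production code):

    # Encode a codepoint with the original (pre-2003) UTF-8 rules,
    # producing 1- to 6-byte sequences up to U+7FFF_FFFF.
    def utf8_original_encode(cp: int) -> bytes:
        assert 0 <= cp <= 0x7FFFFFFF
        if cp < 0x80:
            return bytes([cp])                   # plain ASCII byte
        limits = [0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
        leads  = [0xC0,  0xE0,   0xF0,     0xF8,      0xFC]
        n = next(i for i, lim in enumerate(limits) if cp <= lim) + 2
        out = [0] * n
        for i in range(n - 1, 0, -1):
            out[i] = 0x80 | (cp & 0x3F)          # 10xxxxxx continuations
            cp >>= 6
        out[0] = leads[n - 2] | cp               # 110xxxxx .. 1111110x lead
        return bytes(out)

    assert utf8_original_encode(0x41) == b'A'
    assert utf8_original_encode(0x10FFFF) == bytes([0xF4, 0x8F, 0xBF, 0xBF])
    assert len(utf8_original_encode(0x7FFFFFFF)) == 6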


Byte form 1111111x is unused. Last time I proposed extending it to 4.4 trillion codepoints, but I don't think that's enough. A better way is to use 11111110 to indicate a 9-byte sequence giving 281 trillion codepoints (281万亿):

U+8000_0000 to U+FFFF_FFFF_FFFF:
    11111110 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
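
A sketch of that 9-byte form, under my reading of it (a bare 11111110 lead carrying no payload bits, then 8 continuation bytes carrying 48 bits):

    # Encode one codepoint beyond the 6-byte range with the proposed
    # 11111110-led form: 8 x 6 = 48 payload bits, 281,474,976,710,656 codes.
    def utf8_fe_encode(cp: int) -> bytes:
        assert 0x80000000 <= cp <= 0xFFFFFFFFFFFF
        out = [0xFE]
        for shift in range(42, -1, -6):          # 8 groups of 6 bits
            out.append(0x80 | ((cp >> shift) & 0x3F))
        return bytes(out)

    assert len(utf8_fe_encode(0x80000000)) == 9
    assert utf8_fe_encode(0xFFFFFFFFFFFF)[1:] == bytes([0xBF]) * 8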


Last time I proposed using this range to formulaically encode Unihan characters not already among the 70_000 presently encoded ones, as we already do for Korean hangul. The Unihan character defined as the simple sequence "又出 across", like billions of other such examples, isn't encoded anywhere in those 70_000. To encode it using an ideographic descriptor would take 9 bytes, each character taking 3 bytes, as ⿰又出, the same as in my proposal (though we'd need to define a control code in the private use area instead of the descriptor to do it properly). But the payoff comes from all characters more complicated than that, if 281 trillion is enough to formulaically generate them to a practical recursion depth. And we could eliminate most characters in the huge Unihan lookup table we presently require.
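
The byte count is easy to check:

    # The ideographic description sequence ⿰又出 is three characters,
    # each needing 3 bytes in UTF-8, i.e. 9 bytes in all.
    s = "\u2FF0\u53C8\u51FA"                     # ⿰ 又 出
    assert s == "⿰又出"
    assert [len(c.encode("utf-8")) for c in s] == [3, 3, 3]
    assert len(s.encode("utf-8")) == 9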

3. A 6-bit embedded language

The remaining unused byte form 11111111 could then be used as an embedded 6-bit control language.

The committee designing Unicode's predecessor, ASCII, decided it was important to support uppercase-only 64-character alphabets, and chose to pattern ASCII so it could easily be reduced to a usable 64-character set of graphic codes. So by using the range U+20 to U+5F only, we can use everything in ASCII except the 33 control codes, the 26 lowercase letters, and the 5 symbols ` ~ { } |. We can embed these tokens within the continuation bytes 10xxxxxx easily, build single-line commands after each 11111111, yet retain UTF-8's self-synchronization. Not only would the Unicode codepoint repertoire be reducible down to ASCII, but so would Unicode text embedded with this control language. But what would we use this control language for?

Not only are the UTF encodings self-synchronizing, but so is the Unicode repertoire, almost. The original Unicode 1.0 and 1.1 weren't so much, having pairs of symbols (U+206A to U+206F) to indicate scopes like which script's digit shapes to use, whether to join Arabic letters together, and whether symmetric reversal should be used for certain symbols. These have since been deprecated by later codepoints and properties. Unicode still has codes (U+FFF9 to U+FFFB) for embedding annotations like Japanese rubies within text, but recommends a higher-level protocol be used. And there are the bidirectional embedding and overriding codes (U+202A to U+202E); although replaced by the simplified isolate codes (U+2066 to U+2069) in 2013's Unicode 6.3, bidirectional control still uses nested scopes.

The 6-bit embedded language has enough characters (i.e. the space, 26 uppercase letters, 10 digits, and 27 punctuation and symbols ! ? , . : ; # $ % & @ ^ * = + - _ ' " / \ < > ( ) [ ]) to design a language to be embedded within Unicode to control all scoped functionality, not just taking over bidirectional control but any scoped function normally requiring a higher-level protocol. The possibilities are endless.
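
Here's a rough sketch of how such a command could be packed after an 11111111 byte and unpacked again; the command text itself (a bidirectional isolate, say) is purely illustrative, since no command vocabulary has been designed yet:

    # Sketch: embed a single-line command after an 11111111 byte, packing
    # each character U+20..U+5F into a 10xxxxxx continuation byte.
    def embed_command(cmd: str) -> bytes:
        assert all(0x20 <= ord(c) <= 0x5F for c in cmd), "outside the 6-bit set"
        return bytes([0xFF] + [0x80 | (ord(c) - 0x20) for c in cmd])

    def extract_command(data: bytes) -> str:
        assert data[0] == 0xFF and all(b & 0xC0 == 0x80 for b in data[1:])
        return ''.join(chr((b & 0x3F) + 0x20) for b in data[1:])

    packet = embed_command("ISOLATE[RTL]")
    assert extract_command(packet) == "ISOLATE[RTL]"
    # Every embedded byte is 10xxxxxx, so a scanner that skips continuation
    # bytes still resynchronizes on the next lead byte, as claimed above.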


6 January 2014

The Groovy History of Unicode

Unicode began life as ASCII, when IBM's Bob Bemer submitted a proposal for a common computer code to the ANSI in May 1961, nine months before I was born. Unicode's history has always been tightly intertwined with my own. In late 1963, when I was moved to Auckland NZ, the ASCII codeset was finally agreed upon by committee heavyweights Bemer and Teletype Corp's John Auwaerter, and the ISO later accepted ASCII. But ASCII almost died young as only Teletype machines and the Univac 1050 used it for the next 18 years.

In 1981, ASCII was reborn when the popular IBM PC used it, and since then virtually every computer made has been ASCII-compliant. Its codeset included what we see on the keyboard, plus 33 control codes, some of which (0x0e and 0x0f) allowed other code pages to be swapped in and out, though the code was no longer self-synchronizing. Hundreds of other code pages were created for other natural languages, as well as separate standards by various East Asian governments to handle thousands of ideographs. On 29 August 1988, Xerox's Joe Becker outlined the original 16-bit design for Unicode 1.0, encoding Latin, Cyrillic, and Greek scripts, the right-to-left scripts Arabic and Hebrew, Devanagari, Thai, Korean Jamo, and punctuation and math symbols, and later expanded to include 20,000 Unihan (unified Chinese and Japanese) characters in 1992. Specs were written to cover case mapping and folding, diacritical marks, bidirectionality, Arabic joining, grapheme clusters and ordering, varying widths in Asian characters, and breaking algorithms.

But Koreans didn't want to encode any of their writing system as separate Jamo that make up the square-shaped Hangul syllables, which would require 6 to 10 bytes for each syllable, so in 1996's Unicode 2.0, over 10,000 codepoints were remapped to Hangul, and generated by formula. Chinese and Japanese weren't happy with only 20,000 Hanzi/Kanji being encoded, so Unicode 2.0 also brought the UTF-16 encoding using surrogates, adding a million 4-byte codes to the 65,000 2-byte ones, and dividing the codespace into 17 planes. Many users switched to Unicode, with two notable exceptions. Many Japanese were unhappy at their characters being unified with those of China, a situation which wouldn't need to have happened if UTF-16 had been decided on earlier. And Americans didn't like the incompatibility with ASCII, and its requirement for twice the space.

So Ken Thompson's 1993 self-synchronizing UTF-8 encoding was adopted by the Unicode Consortium as a supported alternative, its varying length ensuring backwards compatibility with ASCII. In 2003, its maximum 2.1 billion codepoints were restricted to 1.1 million to match the UTF-16 encoding, but the top 2 planes (0xF and 0x10) had been reserved for private use. This ensures anyone can access the full 2.1 billion codepoints using UTF-16 by using those planes as "ultra surrogates", the top half of plane 0xF being high ultra-surrogates and all of plane 0x10 being low ones. We could even re-use plane 0x10 to encode the first "ultra plane" of characters. By decoding through surrogate indirection twice in UTF-16, the maximum rune is `\U7FFFFFFF`, not `\U0010FFFF`.

Unicode started being used in Java, Windows, and Linux, and by 2008, UTF-8 had replaced ASCII to become the most common encoding on the web. The main holdouts against Unicode uptake are Japan's continuing preference for the JIS encoding and, of course, programmers clinging to ASCII-only tokens in their code. But the next major leap came on Unicode's 15th birthday: on 29 August 2003, the Groovy Language was announced by its creator James Strachan, the nascent vehicle which will introduce the full power of Unicode's vocabulary and possible resultant grammars to the world of software creation. A language spec and testkit were initiated in April 2004, and I joined the online community a year later, watching progress, excited at the opportunity to contribute Unicode to the ecosystem, and even create another implementation of the Groovy Language so it would be the best language ever.

But a great darkness has descended upon the Groovy Language: two voldemorts who hijacked its development and greedily turned it into a business venture to generate profits from web framework consulting and conferences. I endured years of being ignored, then handled, and even, though unknown to me at the time, smeared and watched. One example: a few months before G2One, Inc. was incorporated, Laforge said without any discussion (paraphrased), "We will not be doing anything like Unicode symbols in the foreseeable future". The business plan to stall Groovy development had been decided. To this day, little Groovy stands enslaved by the evil giant Grailiath. But because I've faced up to my groovy destiny to bring the full tersity of Unicode to the world of programming through the Groovy Language, I'm now rebooting it from scratch, spec and all. The day will come when programmers code on their mobile phone screens while riding the subway in Beijing, Hong Kong, and Tokyo.

Last year (2013) brought Unicode 6.3, weighing in at around 110,000 encoded tokens, encompassing exactly 100 scripts! But the greatest achievement in Unicode last year wasn't the big version release, but little Corrigendum 9, only 350 words long, changing the definition of non-characters, which are primarily the last two codepoints in each plane, 0xfffe and 0xffff. Unicode's FAQ on Noncharacters had previously said "Noncharacters are in a sense a kind of private-use character, because they are reserved for internal (private) use. However, that internal use is intended as a super private use, not normally interchanged with other users." But Corrigendum 9 will be remembered as the announcement that freed Unicode from its straitjacket of 2.1 billion codepoints possible with both the pre-2003 extended mode of UTF-8 and the doubly-indirected ultra surrogate system of UTF-16. For Corrigendum 9 now allows noncharacters to be interchanged, saying "Noncharacter: A code point that is permanently reserved for internal use", crossing out "and that should never be interchanged", because "for distributed software, it is often very difficult to determine what constitutes an internal versus an external context for any particular software process".

In pre-2003 UTF-8, an ASCII byte begins with `0`, a continuation byte begins with `10`, and a leading multibyte begins with `11`, the number of `1` bits indicating the number of bytes in the sequence. 6 bytes can encode up to U+7FFF_FFFF, which is 2.1 billion codepoints. But the leading multibytes `11111110` and `11111111` weren't used in Thompson and Pike's original UTF-8 spec because they were noncharacters `FE` and `FF`. Because of Corrigendum 9 we can now use them in interchange, and in any way we want! Maybe we could use `FE` to signify, not a 7-byte, but a 9-byte sequence; maybe we could use `FF` to signify program code using 6-bit tokens embedded in the following continuation bytes, which would still leave UTF-8 somewhat self-synchronizing. But even if we just extended the original definition of UTF-8, `11111110` would indicate a 7-byte sequence (68 billion codepoints) and `11111111` an 8-byte sequence, which could encode 4.4 trillion codepoints (=4.4万亿, =2^42, yes, 42), up to U+3FF_FFFF_FFFF, a huge improvement on U+7FFF_FFFF, and giving meaning to Unicode!

"And what would we use all these codepoints for?", you ask? Your destiny isn't intricately woven around Unicode's so you've never imagined it. From Unicode 2.0 on, over 10,000 Korean hangul codepoints are generated via a formula each having 2 to 5 jamo, selected from 24 jamo, non-recursively. A font for Hangul only needs to encode the jamo, not every hangul because they can be generated by the formula. This isn't possible for the 70,000 Unihan codepoints, each composed of 1 to perhaps 10 or more components, selected from hundreds of radicals and phonetic components, perhaps recursively. For now, each glyph must be encoded separately in the font, or a huge lookup table used, and when a possible Chinese character not in the 70,000 encoded is required, the definition sequence takes up many bytes, like with many Korean hangul before Unicode 2.0. But 4.4 trillion codepoints could mean Unihan tokens can be encoded by formula to a practical recursion depth, without requiring huge fonts or superlong encodings. And if 4.4 trillion isn't enough, we can define the meanings of `11111110` and `11111111` differently. Wow!

Great things are coming to Unicode thanks to Corrigendum 9 and the Groovy Language reboot!

