
Gavin Grover's Unicode-Ultra Wikiblog


4 May 2014; updated 26 Sep 2017

2048 volumes of Unicode-Ultra

When extending Unicode's upper codepoint limit from U+10_FFFF back up to U+7FFF_FFFF, we need to introduce a new term: volume. A volume is 16 planes, so whereas presently there's only 1 volume plus 1 extra plane (17 planes in all), Unicode-Ultra will have 2048 volumes, from 0x0 to 0x7FF. We'll also introduce new notation V+xxx to reference volumes and P+xxxx to reference planes, both in hexadecimal. We've previously seen how we can represent the CJK Unihan ideographs using the same formulaic method already used to represent Korean Hangul since Unicode 2.0, but applying it recursively, generating 2 billion possible ideographs. Let's look at other possible uses of the newly available 2047 volumes in Unicode-Ultra...
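As a quick sketch of the arithmetic (the helper name here is ours, not part of any spec): a plane is 0x1_0000 codepoints and a volume is 16 planes, so volume and plane numbers fall out of simple bit shifts:

```python
def volume_plane(cp):
    """Split a Unicode-Ultra codepoint into volume, absolute plane,
    and plane-within-volume. A plane is 0x1_0000 codepoints and a
    volume is 16 planes, i.e. 0x10_0000 codepoints."""
    assert 0 <= cp <= 0x7FFF_FFFF
    volume = cp >> 20            # V+000 .. V+7FF
    plane = cp >> 16             # P+0000 .. P+7FFF
    return volume, plane, plane & 0xF

# U+10_FFFF, the current Unicode ceiling, is the top of volume V+001:
print(volume_plane(0x10_FFFF))    # (1, 16, 0)
print(volume_plane(0x7FFF_FFFF))  # (2047, 32767, 15)
```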

Volume V+000 (i.e. U+0x_xxxx) is the Unicode-controlled volume (UCV), consisting of the 16 planes from P+0000 to P+000F. The top half of P+000F will change from its present Private Use to High Ultra-surrogates. The Unicode Consortium will control use of this volume as at present. Already they've given names to 6 of its planes: 0 is the BMP, 1 the SMP, 2 the SIP, 3 the TIP, 0xE the SSP, and 0xF the PUP. Each V+000 plane will continue to have its final 2 codepoints be noncharacters, as they are presently; subsequent planes (P+0010 and higher) won't, however, so it'll be easier to formulaically generate characters into blocks spanning more than one plane.
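A minimal sketch of the resulting noncharacter rule (the function name is ours, and we ignore here any noncharacters outside V+000, such as the proposal's V+7FF planes):

```python
def is_plane_final_nonchar(cp):
    """True for the two plane-final noncharacters U+xxFFFE / U+xxFFFF.

    Under this proposal the rule applies only inside volume V+000
    (planes P+0000 to P+000F); higher planes free those codepoints
    up for formulaically generated blocks."""
    return (cp & 0xFFFF) >= 0xFFFE and (cp >> 16) <= 0xF
```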

Volume V+001 (i.e. U+1x_xxxx) will be the Private-use volume (PUV), consisting of the further 16 planes representable by 4 bytes in UTF-8-extended. P+0010 will have a dual use as both Low Ultra-surrogates, as well as Private use as at present. Developers would normally use P+0010 (rather than P+000F or the BMP block) for their initial private-use characters, so they can defer deciding whether to encode using the 2048-volume Unicode-Ultra or the presently crippled 1.0625-volume Unicode UTF-8, until they know how many private-use characters they need.

Volumes V+002 to V+03F (doubling 5 more times!) give the 62 further volumes representable by 5 bytes in UTF-8-extended. V+002 could be the Japanese emoji volume (JEV). Simply give it to a consortium of Japanese telecom businesses (NTT, etc.) to manage: it might even sway them into switching en masse from Shift-JIS to Unicode-Ultra.

V+003 could be the Basic syllabic volume (BSV). Korean Hangul gets over 10,000 codepoints in the UCV/BMP (15%) to represent its syllables, and Chinese Hanzi and Japanese Kanji, which are jointly syllabic and ideographic, together get 26,000 in the BMP (40%), plus 50,000 more in the UCV/SIP, but alphabetic scripts such as Latin must represent each sound separately. To even matters up, we can represent syllables in Latin-encoded languages such as English with a single codepoint generated formulaically.

Presently in Korean there are 19 mandatory leading consonants, 21 mandatory vowels, and 27 optional trailing consonants, giving a total of 19 * 21 * 28 == 11,172 possible syllable blocks, generated by formula into the range U+AC00 to U+D7A3. Applying the same calculation to English, we see there are 42 optional leading consonant clusters (e.g. str), 20 mandatory vowels for British English (e.g. e), and 143 optional trailing consonant clusters (e.g. ngth) in a syllable, giving 43 * 20 * 144 == 123,840 possible syllables that can be generated by formula. Many of those will be unused in English, but the codepoints need to be reserved so the glyphs can easily be generated formulaically by the font if desired. We can thus also delegate compatibility decomposition issues, such as fi ligatures, to the font.
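The Hangul half of this is the real Unicode 2.0 composition formula; the English half is purely hypothetical, and its base codepoint BSV_BASE (the start of volume V+003) is our assumption:

```python
def hangul_syllable(L, V, T=0):
    """Compose a precomposed Hangul syllable (Unicode 2.0 formula).
    L: leading consonant index 0..18, V: vowel index 0..20,
    T: trailing consonant index 0..27 (0 = none)."""
    assert 0 <= L < 19 and 0 <= V < 21 and 0 <= T < 28
    return 0xAC00 + (L * 21 + V) * 28 + T

# The same recipe for the hypothetical English syllabary; the post
# fixes no start codepoint, so the base below is an assumption.
BSV_BASE = 0x30_0000  # first codepoint of volume V+003 (assumed)

def english_syllable(lead, vowel, trail=0):
    """lead 0..42 (0 = no onset), vowel 0..19, trail 0..143 (0 = no coda)."""
    assert 0 <= lead < 43 and 0 <= vowel < 20 and 0 <= trail < 144
    return BSV_BASE + (lead * 20 + vowel) * 144 + trail

print(chr(hangul_syllable(0, 0, 0)))   # 가 (U+AC00)
```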

Syllabic English in Latin thus would take up 12% of a volume. By separately encoding syllables in all languages for all alphabetic scripts, including vowel marks in abjads and abugidas, we would use many more of the 5-byte UTF-8-extended volumes from V+004 to V+03F. We would reserve the rest for future use.

Volumes V+040 to V+7FE (doubling another 5 times!) give the codepoints representable by 6 bytes in UTF-8-extended, except for the last volume. Volume V+7FF would have its last two planes (P+7FFE and P+7FFF) consist entirely of noncharacters, to cater for the high ultra-surrogates having noncharacters as their last 2 codepoints. V+7FF would best be reserved as a Non-character volume (NCV), its other 14 planes having a sort of semi-noncharacter status for now.
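These byte counts are just the original pre-2003 UTF-8 design (as specified in RFC 2279), which ran to 6 bytes and U+7FFF_FFFF before RFC 3629 cut it back to 4 bytes. A sketch encoder for that scheme:

```python
def utf8_extended_encode(cp):
    """Encode a codepoint with the original (pre-RFC 3629) UTF-8
    scheme, which allowed up to 6 bytes and U+7FFF_FFFF."""
    if cp < 0x80:
        return bytes([cp])
    for nbytes, limit in ((2, 0x800), (3, 0x1_0000), (4, 0x20_0000),
                          (5, 0x400_0000), (6, 0x8000_0000)):
        if cp < limit:
            break
    else:
        raise ValueError("beyond U+7FFF_FFFF")
    out = bytearray(nbytes)
    for i in range(nbytes - 1, 0, -1):   # fill continuation bytes
        out[i] = 0x80 | (cp & 0x3F)
        cp >>= 6
    out[0] = ((0xFF << (8 - nbytes)) & 0xFF) | cp  # lead byte, e.g. 111110xx
    return bytes(out)

# Volume V+001 tops out at the 4-byte limit, V+03F at the 5-byte limit:
print(utf8_extended_encode(0x1F_FFFF).hex())   # f7bfbfbf
print(utf8_extended_encode(0x3FF_FFFF).hex())  # fbbfbfbfbf
```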

We would use the 2 billion codepoints in V+040 to V+7FE to generate Unihan characters by formula. If we had 210 basic components, each with inbuilt combining behavior, we could formulaically generate all CJK characters of up to 4 components each: 210 ^ 4 ~= 1.9 billion. By integrating some heuristics into our formula, we could generate many more complex characters. But what if even that isn't enough?...
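One way such a formula could number characters is as up to four base-210 digits. Everything in this sketch is our assumption, not the post's spec: the base codepoint, the padding convention, and reserving code 0 as an empty slot:

```python
UNIHAN_BASE = 0x400_0000  # first codepoint of volume V+040 (assumed)

def unihan_codepoint(components):
    """components: 1 to 4 component indices, each 1..209.
    Code 0 is reserved here as the empty-slot padding digit, so
    shorter characters don't collide with longer ones."""
    assert 1 <= len(components) <= 4
    assert all(1 <= c < 210 for c in components)
    index = 0
    for c in tuple(components) + (0,) * (4 - len(components)):
        index = index * 210 + c
    return UNIHAN_BASE + index

print(210 ** 4)  # 1,944,810,000 -- just under the ~2 billion available
```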

The 6-bit Ultracode language we embed in the continuation bytes following each 11111110 byte would, among other tasks, specify the private usage of the 11111111 byte. Users could use the 11111111 byte to head up 8 continuation bytes, encoding codepoints up to U+FFFF_FFFF_FFFF, that is, volumes from V+000 to V+FFF_FFFF, giving about 280 trillion codepoints if needed. Unihan component recursion depth is no longer a problem!
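Decoding such a sequence is straightforward 6-bits-per-byte accumulation. Everything here is a sketch of the proposal, not an existing encoding; the 9-byte frame and function name are ours:

```python
def decode_ff_sequence(buf):
    """Decode the hypothetical 0xFF form: a 11111111 lead byte
    followed by 8 continuation bytes (10xxxxxx), 48 payload bits."""
    assert len(buf) == 9 and buf[0] == 0xFF
    value = 0
    for b in buf[1:]:
        assert (b & 0xC0) == 0x80, "continuation bytes must be 10xxxxxx"
        value = (value << 6) | (b & 0x3F)
    return value  # 0 .. 0xFFFF_FFFF_FFFF

top = decode_ff_sequence(bytes([0xFF] + [0xBF] * 8))
print(hex(top))        # 0xffffffffffff
print(hex(top >> 20))  # 0xfffffff, i.e. volume V+FFF_FFFF
```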

31 May 2014

Unicode-Ultra's embedded 6-bit language: Ultracode

Unicode-Ultra reinstates the 2.1 billion codepoint repertoire originally proposed for Unicode's UTF-8 encoding, but taken away in 2003. Unicode-Ultra also brings "Ultracode", a 6-bit language embedded in the 10xxxxxx bytes following a 11111110 byte, giving 64 possible tokens.

You may ask why Unicode needs a language embedded differently to the characters, instead of just assigning 64 default ignorable codepoints, perhaps from a private use area. (We will in fact be assigning 64 such points for interim implementations, but that's only a temporary measure.) Our embedded 6-bit language will use nested scopes, which makes it different to most of the other control-like codepoints in Unicode. Only a few codepoints provide nested scopes, and most of them have been deprecated or superseded:
  • U+206A to U+206F indicate three nested scopes, all of which have since been deprecated in favor of later codepoints and properties: which script's digit shapes to use, whether to join Arabic letters together, and whether symmetric reversal should be used for certain symbols
  • U+FFF9 to U+FFFB enable embedding annotations like Japanese rubies within text, although the Unicode Consortium recommends a higher-level protocol be used
  • U+202A to U+202E are the bidirectional embedding and override codes, which were superseded by the simplified isolate codes in Unicode 6.3
  • U+2066 to U+2069 are the bidi isolate codes, the only "Unicode recommended" codes that use nested scoping
Unicode has slowly moved away from nested scopes, so the codepoint repertoire is almost as self-synchronizing as its UTF-8 encoding. Bidirectionality is the only Unicode process that can't be represented any other way. But there's still an important difference between Unicode's nested scopes, such as the bidi isolates, and Unicode-Ultra's embedded 6-bit language: processing for all of Unicode's nested scopes is reset by the paragraph break token U+2029, whereas the paragraph break is just another contained token within the 6-bit "Ultracode" language.

Ultracode's tokenset

Whereas Unicode's predecessor ASCII provided the 2nd and 3rd quadrants as a useful subset when a 6-bit code was required, Ultracode will use the 2nd and 4th quadrants...
[Image: ASCII for Ultracode.png]
  • in the 1960s the Latin uppercase letters and underscore were considered more fundamental than the lowercase ones, but nowadays abc-defg is more accepted for variable names than ABC_DEFG
  • the 4th quadrant also provides a Unicode-undefined control character del (U+007F)
  • it's easier to recode ASCII's 0x1yyyyy bit pattern to the continuation byte 10xyyyyy (whereby the 0x1 prefix mutates to 10x, and yyyyy stays the same)
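That recoding is a small bit shuffle: the quadrant-selecting bit x (bit 6 of the ASCII byte) moves down next to the 10 continuation prefix. A sketch (function name ours):

```python
def ascii_to_ultracode(ch):
    """Recode a 2nd/4th-quadrant ASCII character (bit pattern 0x1yyyyy)
    into the continuation byte 10xyyyyy carrying its 6-bit token."""
    b = ord(ch)
    assert b & 0b0010_0000 and b < 0x80, "must be ASCII quadrant 2 or 4"
    x = (b >> 6) & 1                   # 0 = 2nd quadrant, 1 = 4th
    return 0x80 | (x << 5) | (b & 0x1F)

print(hex(ascii_to_ultracode(' ')))     # 0x80 (token 0)
print(hex(ascii_to_ultracode('a')))     # 0xa1
print(hex(ascii_to_ultracode('\x7f')))  # 0xbf (del, token 63)
```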

Ultracode's syntax

We can now work from our restrictions and outline a rough grammar for Ultracode...

The tokens available are the 26 lowercase letters a to z, the 10 digits 0 to 9, space, delete, and the 26 punctuation and symbol characters -.(){},;!&|#+*/=<>"':?$%~`. The keyboard tokens not available, i.e. the uppercase letters A to Z and the symbols []\@^_, can be used as meta-syntax, to be translated into the 64 encodable characters. We will use the syntactic style of CSS, Javascript, and Groovy/Kotlin-style builders.

So a to z and 0 to 9 will be valid in names and numbers, with a letter first for names and a digit first for numbers. A hyphen - and dot . can also be used in both names and numbers. When an uppercase letter (A to Z) appears in some markup text, it will be immediately translated to a hyphen - followed by the lowercase equivalent. The CSS/JS-style syntax will use (){} as delimiters and ,:; as separators. The quotes "'` will be shortcut symbols for commonly used calls such as bolding or italics. The del codepoint will be used for newline. Although newline and space are thus available for encoding, they'll both be superfluous in the syntax, as will # to indicate line comments. !&| could be used for boolean logic, which leaves ?$%~+*/=<> for some other use.
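The uppercase rule above is easy to pin down in code (the function name is ours):

```python
def encode_markup_name(text):
    """Apply the markup rule: an uppercase letter is transmitted as
    hyphen + lowercase, since A to Z aren't in the 64-token set."""
    out = []
    for ch in text:
        out.append('-' + ch.lower() if ch.isupper() else ch)
    return ''.join(out)

print(encode_markup_name("fontWeight"))  # font-weight
```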

Meta-tokens mean we don't need an escape character in Ultracode markup text. We'd use meta-token ] to end the markup text, meaning markup text would generally be indicated in some other context by a string of tokens ending in [. We'd also use nested [ and ] to embed other information in our markup, such as arbitrary private-use information embedded in the 10xxxxxx bytes following a 11111111 byte. We could use meta-token \ to precede a single Unicode character in our markup text, and @ to precede the alias for a non-graphic one. @ is also a good meta-character to indicate other embeddings, such as @" ... " for strings of Unicode characters, and @{ ... } for embedded code in some language for inline execution. This leaves ^_ for some other meta-token use.


Last edited Nov 3, 2017 at 5:10 AM by gavingrover, version 30

