Inheritance Hierarchies for Unicode Characters

The roughly 110,000 characters encoded in Unicode version 6.0 have almost 100 different properties of various types associated with them:
  • an integral one, 'ccc'
  • 5 code point ones, 'bmg', 'isc', 'suc', 'slc', 'stc'
  • 7 which are code point lists, 'dm', 'lc', 'tc', 'uc', 'cf', 'fcnfkc', 'nfkccf'
  • 25 string ones, 'blk', 'sc', 'age', 'namealias', 'ea', 'hst', 'jsn', 'gcb', 'lb', 'wb', 'sb', 'na', 'gc', 'bc', 'na1', 'dt', 'nt', 'nv', 'scf', 'jt', 'jg', 'nfcqc', 'nfdqc', 'nfkcqc', 'nfkdqc'
  • 59 boolean ones, 'bidim', 'ahex', 'bidic', 'ce', 'dash', 'dep', 'dia', 'ext', 'hex', 'hyphen', 'ideo', 'idsb', 'idst', 'joinc', 'loe', 'nchar', 'oalpha', 'odi', 'ogrext', 'oidc', 'oids', 'olower', 'omath', 'oupper', 'patsyn', 'patws', 'qmark', 'radical', 'sd', 'sterm', 'term', 'uideo', 'vs', 'wspace', 'math', 'alpha', 'lower', 'upper', 'cased', 'ci', 'cwl', 'cwu', 'cwt', 'cwcf', 'cwcm', 'ids', 'idc', 'xids', 'xidc', 'di', 'grext', 'grbase', 'grlink', 'compex', 'cwkcf', 'xonfc', 'xonfd', 'xonfkc', 'xonfkd'
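As a rough illustration, a handful of these properties can be inspected with Python's standard `unicodedata` module. This is only a sketch: the module exposes a small subset of the properties listed above, it reflects whatever Unicode version ships with the interpreter rather than 6.0, and the helper name `describe` is hypothetical.

```python
import unicodedata

def describe(ch):
    """Return a few Unicode properties of one character, keyed by short alias."""
    return {
        "na": unicodedata.name(ch, ""),           # character name
        "gc": unicodedata.category(ch),           # general category
        "ccc": unicodedata.combining(ch),         # canonical combining class
        "bc": unicodedata.bidirectional(ch),      # bidirectional class
        "bidim": bool(unicodedata.mirrored(ch)),  # bidi mirrored flag
        "ea": unicodedata.east_asian_width(ch),   # East Asian width
        "dm": unicodedata.decomposition(ch),      # decomposition tag and mapping
    }

info = describe("\u00e9")  # LATIN SMALL LETTER E WITH ACUTE
print(info)
```

The integral property ('ccc') comes back as an int, the boolean one as a bool, and the rest as strings, which loosely mirrors the property types listed above.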

The best way to grapple with this complexity is to use inheritance hierarchies...

1. Codepoint Hierarchy

One hierarchy represents codepoints:
  • based primarily on the 'gc' (general category) property
  • including properties (mostly boolean) which are exclusive to a certain general category, namely 'nchar', 'bidic', 'joinc', 'idsb', 'idst', 'radical', 'uideo', 'loe', 'hst'
  • giving priority to the "Grapheme Breaking" behavior, which is more important than other breaking behaviors

A first cut at this hierarchy, with numbers in square brackets showing how many Unicode characters belong to each class...

+---ReservedPoint //gc==Cn && ! nchar
+---NonCharacterPoint //gc==Cn && nchar [66]
    +---SurrogatePoint //gc==Cs [2048]
        +---PrivateUsePoint //gc==Co [137468]
            +---ControlPoint(abstract) //gc==Cc [65]
            |   +---NewlineControlPoint(abstract)
            |   |   +---CarriageReturnControlPoint //cr:0x000d
            |   |   +---LineFeedControlPoint //lf:0x000a
            |   |   +---OtherNewlineControlPoint //vt:0x000b, ff:0x000c, nel:0x0085
            |   +---OtherControlPoint //gc==Cc && not one of the others
                +---FormatPoint(abstract) //gc oneof Cf[140], Zl[1], Zp[1]
                |   +---BidiControlPoint //gc==Cf && bidic [7]
                |   +---JoinControlPoint(abstract) //gc==Cf && joinc
                |   |   +---ZeroWidthNonJoinerPoint //0x200c
                |   |   +---ZeroWidthJoinerPoint //0x200d
                |   +---LineSeparatorPoint //gc==Zl:0x2028
                |   +---ParagraphSeparatorPoint //gc==Zp:0x2029
                |   +---WordJoinerPoint //0x2060
                |   +---ZeroWidthSpacePoint //0x200b
                |   +---ZeroWidthNoBreakSpacePoint //0xfeff
                |   +---SoftHyphenPoint //0x00ad
                |   +---OtherFormatPoint //gc==Cf && not one of the others
                    |   +---SpacingMarkPoint //gc==Mc [287]
                    |   +---NonSpacingMarkPoint(abstract) //gc oneof Mn, Me
                    |       +---EnclosingMarkPoint //gc==Me [12]
                    |       +---VariationSelectorPoint //gc==Mn && vs [256]
                    |       +---OtherNonSpacingMarkPoint //gc==Mn && ! vs [943]
                        +---SpaceSeparatorPoint //gc==Zs [18]
                        |   +---PunctuationPoint(abstract)
                        |   |   +---ConnectorPunctuationPoint //gc==Pc [10]
                        |   |   +---DashPunctuationPoint //gc==Pd [21]
                        |   |   +---PairedPunctuationPoint(abstract)
                        |   |   |   |                  //[58 matched bidim pairs]
                        |   |   |   +---OpenPunctuationPoint //gc==Ps [72]
                        |   |   |   +---ClosePunctuationPoint //gc==Pe [71]
                        |   |   +---QuotePunctuationPoint(abstract)
                        |   |   |   |                  //[8 matched bidim pairs]
                        |   |   |   +---InitialQuotePunctuationPoint //gc==Pi [12]
                        |   |   |   +---FinalQuotePunctuationPoint //gc==Pf [10]
                        |   |   +---OtherPunctuationPoint //gc==Po [402]
                        |   +---SymbolPoint(abstract)
                        |       +---MathSymbolPoint(abstract) //gc==Sm [948]
                        |       |   +---MirroringMathSymbolPoint(abstract)
                        |       |   |                  //gc==Sm && bidim [411]
                        |       |   |   +---PairedMirroringMathSymbolPoint
                        |       |   |   |              //gc==Sm && bidim && bmg [230]
                        |       |   |   +---SelfMirroringMathSymbolPoint
                        |       |   |                  //gc==Sm && bidim && ! bmg
                        |       |   +---UnmirroredMathSymbolPoint //gc==Sm && ! bidim
                        |       +---CurrencySymbolPoint //gc==Sc [47]
                        |       +---ModifierSymbolPoint //gc==Sk [115]
                        |       +---OtherSymbolPoint(abstract) //gc==So [4398]
                        |           +---IdeographicDescriptorPoint(abstract)
                        |           |   +---BinaryIdeographicDescriptorPoint
                        |           |   |              //gc==So && idsb
                        |           |   +---TrinaryIdeographicDescriptorPoint
                        |           |                  //gc==So && idst
                        |           +---UnihanRadicalPoint //gc==So && radical
                        |           +---RemainingOtherSymbolPoint
                        |                             //gc==So and not one of the others
                            |   +---DigitNumberPoint //gc==Nd [420]
                            |   +---LetterNumberPoint //gc==Nl [224]
                            |   +---OtherNumberPoint //gc==No [456]
                                |   +---LowercaseLetterPoint //gc==Ll [1759]
                                |   +---TitlecaseLetterPoint //gc==Lt [31]
                                |   +---UppercaseLetterPoint //gc==Lu [1436]
                                +---ModifierLetterPoint //gc==Lm [210]
                                +---OtherLetterPoint(abstract) //gc==Lo [97084]
                                    |      //[0xe30,0xe32,0xe33,0xe45,0xeb0,0xeb2,0xeb3]
                                    +---LogicalOrderExceptionPoint //gc==Lo && loe [15]
                                    |       //gc==Lo && hst either LV[399] or LVT[10773]
                                    |   +---HangulLeadingConsonantPoint
                                    |   |                    //gc==Lo && hst==L [125]
                                    |   +---HangulVowelPoint //gc==Lo && hst==V [95]
                                    |   +---HangulTrailingConsonantPoint
                                    |                        //gc==Lo && hst==T [137]
                                    +---UnifiedIdeographPoint //gc==Lo && uideo [74616]
                                                  //gc==Lo && not one of the others
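A minimal sketch of how a code point might be mapped to a leaf class of this tree, assuming Python's `unicodedata` as the data source. The class names come from the tree; the function names and the fallback string are hypothetical. Only classes decidable from 'gc' alone (plus the control and non-character cases) are covered, because `unicodedata` does not expose properties such as 'bidic', 'vs', 'radical', or 'hst'.

```python
import unicodedata

# General categories that map directly to one leaf class in the tree.
GC_TO_CLASS = {
    "Cs": "SurrogatePoint", "Co": "PrivateUsePoint",
    "Zl": "LineSeparatorPoint", "Zp": "ParagraphSeparatorPoint",
    "Zs": "SpaceSeparatorPoint",
    "Mc": "SpacingMarkPoint", "Me": "EnclosingMarkPoint",
    "Pc": "ConnectorPunctuationPoint", "Pd": "DashPunctuationPoint",
    "Ps": "OpenPunctuationPoint", "Pe": "ClosePunctuationPoint",
    "Pi": "InitialQuotePunctuationPoint", "Pf": "FinalQuotePunctuationPoint",
    "Po": "OtherPunctuationPoint",
    "Sc": "CurrencySymbolPoint", "Sk": "ModifierSymbolPoint",
    "Nd": "DigitNumberPoint", "Nl": "LetterNumberPoint", "No": "OtherNumberPoint",
    "Ll": "LowercaseLetterPoint", "Lt": "TitlecaseLetterPoint",
    "Lu": "UppercaseLetterPoint", "Lm": "ModifierLetterPoint",
}

# Control points that the tree singles out as newlines.
NEWLINE_CONTROLS = {
    0x0D: "CarriageReturnControlPoint",
    0x0A: "LineFeedControlPoint",
    0x0B: "OtherNewlineControlPoint",
    0x0C: "OtherNewlineControlPoint",
    0x85: "OtherNewlineControlPoint",
}

def is_noncharacter(cp):
    """The 'nchar' property, computed arithmetically."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def classify(cp):
    """Return the leaf class name for a code point, where 'gc' alone decides it."""
    gc = unicodedata.category(chr(cp))
    if gc == "Cn":
        return "NonCharacterPoint" if is_noncharacter(cp) else "ReservedPoint"
    if gc == "Cc":
        return NEWLINE_CONTROLS.get(cp, "OtherControlPoint")
    # Cf, Mn, Sm, So and Lo need further properties; report them as unmapped.
    return GC_TO_CLASS.get(gc, "unmapped gc=" + gc)

for cp in (0x0D, 0x41, 0x20, 0xFFFE, 0xD800, 0xE000):
    print(hex(cp), classify(cp))
```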

Non-graphic Points

Reserved points (about 865,000 of them) are those not yet assigned for use by Unicode, and shouldn't be present in the character stream for a particular Unicode version.

Non-character points (66 of them) are points that will never be assigned to characters by Unicode, usually because they are used at a lower level in one of Unicode's encodings. They comprise 0xfffe and 0xffff, every other point ending in fffe or ffff (such as 0x2fffe, 0x9ffff, 0x10ffff), and the 32 points 0xfdd0-0xfdef. They are ignored as characters when found in the character stream.
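The count of 66 can be checked with a little arithmetic: two points at the end of each of the 17 planes, plus a contiguous block of 32 points at 0xfdd0-0xfdef. A short sketch (the function name is hypothetical):

```python
def is_noncharacter(cp):
    """True for code points that Unicode will never assign to a character."""
    in_bmp_block = 0xFDD0 <= cp <= 0xFDEF      # 32 contiguous points
    ends_plane = (cp & 0xFFFE) == 0xFFFE       # xxFFFE and xxFFFF in every plane
    return in_bmp_block or ends_plane

noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]
plane_enders = [cp for cp in noncharacters if cp > 0xFDEF]

print(len(noncharacters))   # 66
print(len(plane_enders))    # 34, i.e. 2 per plane across 17 planes
```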

Surrogate points (2048 of them) are also used by a Unicode encoding (UTF-16), where they are interpreted in pairs when encountered. In theory, they shouldn't appear in other Unicode encodings, such as UTF-8, but in practice can.
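To make the surrogate mechanism concrete, here is a small sketch of how UTF-16 represents a supplementary-plane code point as a high and a low surrogate, and how Python's strict UTF-8 encoder rejects a lone surrogate. The helper name `to_surrogates` is hypothetical.

```python
def to_surrogates(cp):
    """Split a supplementary code point into its (high, low) UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only supplementary-plane code points use surrogates")
    offset = cp - 0x10000                      # 20 significant bits
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

high, low = to_surrogates(0x1F600)
utf16 = chr(0x1F600).encode("utf-16-be")       # the same pair, as bytes

# A lone surrogate has no valid UTF-8 form, so the strict encoder refuses it.
try:
    "\ud800".encode("utf-8")
    lone_surrogate_rejected = False
except UnicodeEncodeError:
    lone_surrogate_rejected = True

print(hex(high), hex(low), utf16.hex(), lone_surrogate_rejected)
```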

There are 137468 private use points, 6400 in the BMP (basic multilingual plane), the rest in the supplementary planes, which can have their properties defined by users.
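The surrogate and private use counts can be confirmed by tallying general categories over the whole code space. A sketch using Python's `unicodedata`: these two ranges do not change between Unicode versions, whereas most other category counts depend on the version bundled with the interpreter.

```python
import unicodedata
from collections import Counter

# Tally the general category of every code point from 0x0 to 0x10ffff.
counts = Counter(unicodedata.category(chr(cp)) for cp in range(0x110000))

# Private use points that fall inside the BMP (0x0000-0xffff).
bmp_private = sum(
    1 for cp in range(0x10000) if unicodedata.category(chr(cp)) == "Co"
)

print(counts["Cs"], counts["Co"], bmp_private)
```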

The 65 control points are those defined by the extended ASCII encoding (i.e. 0x00-0x1f, 0x7f-0x9f), which are used unchanged by Unicode. There are 5 newline characters among these, with 0x0d and 0x0a distinguished so that they can act as a single newline when they occur together, as on the Windows platform.
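As an illustration of the newline behavior, Python's built-in `str.splitlines` treats 0x0d followed by 0x0a as one line break and also breaks on the other newline controls. This is a sketch of the idea using an existing library routine, not a statement about how the hierarchy itself would implement it.

```python
# The two control ranges together contain 65 code points.
controls = [cp for cp in range(0x100) if cp <= 0x1F or 0x7F <= cp <= 0x9F]

# CR LF counts as a single break; LF, CR, VT, FF and NEL each break on their own.
text = "one\r\ntwo\nthree\rfour\x0bfive\x0csix\x85seven"
lines = text.splitlines()

print(len(controls))
print(lines)
```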

There are 142 format points. Of these, 103 are deprecated, their function replaced before Unicode 6.0.

Some important format points are ZeroWidthNonJoinerPoint (0x200c), ZeroWidthJoinerPoint (0x200d), LineSeparatorPoint (0x2028), ParagraphSeparatorPoint (0x2029), WordJoinerPoint (0x2060), ZeroWidthSpacePoint (0x200b), ZeroWidthNoBreakSpacePoint (0xfeff), and SoftHyphenPoint (0x00ad).

There are 7 bidirectional control points to control left-to-right and right-to-left ordering when swapping between different scripts in one character stream.

Points for Graphemes

Marking points, which modify a base character point, are divided into spacing marks (287 of them) and non-spacing marks (1211 of them), which include 12 enclosing marks and 256 variation selector points. Variation selectors are used to select between more than one possible glyph that may exist for certain characters, e.g. unified Chinese/Japanese characters.

The remaining Unicode characters are base character points. But these code points in Unicode can be abstracted up a level by considering graphemes, which are what users generally recognize to be "characters". Roughly, a grapheme is made up of a base character point or spacing mark, followed by any number of nonspacing marks or join control points. (There are presently 21 specified exceptions to this rule, as there are with many Unicode rules.)
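The rough rule above can be sketched in a few lines. This is deliberately simplified: it implements only "a base or spacing mark followed by non-spacing marks or join controls", ignores the specified exceptions, and is not the full grapheme cluster algorithm of the Unicode standard. The function name is hypothetical.

```python
import unicodedata

JOIN_CONTROLS = {"\u200c", "\u200d"}  # zero width non-joiner, zero width joiner

def rough_graphemes(text):
    """Group code points into approximate graphemes using the rough rule only."""
    clusters = []
    for ch in text:
        # Non-spacing marks (gc Mn or Me) and join controls extend the previous cluster.
        extends = unicodedata.category(ch) in ("Mn", "Me") or ch in JOIN_CONTROLS
        if extends and clusters:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# 'e' + acute, 'a' + diaeresis + dot below, then a bare 'b'
print(rough_graphemes("e\u0301a\u0308\u0323b"))
```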

When detecting such graphemes in a Unicode character stream, the stream must first be "normalized". This means decomposing the 1491 "canonically composite" characters into their constituent or equivalent forms using the 'dt' and 'dm' properties, then reordering the marks, both non-spacing and spacing, after each base character using the 'ccc' property. (The 'ce' property excludes some of these characters from later recomposition.)
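Both steps, decomposition and mark reordering, can be observed with the canonical decomposition form (NFD) in Python's `unicodedata`. A small sketch:

```python
import unicodedata

# Step 1: a canonically composite character is split into base + mark.
composed = "\u00e9"                                   # e with acute
decomposed = unicodedata.normalize("NFD", composed)   # 'e' followed by U+0301

# Step 2: marks after a base are sorted by ascending 'ccc'.
acute, dot_below = "\u0301", "\u0323"
ccc_acute = unicodedata.combining(acute)              # 230 (above)
ccc_dot_below = unicodedata.combining(dot_below)      # 220 (below)
unordered = "a" + acute + dot_below
reordered = unicodedata.normalize("NFD", unordered)   # dot below now comes first

print([hex(ord(c)) for c in decomposed])
print([hex(ord(c)) for c in reordered])
```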

2. Non-canonical Decomposition Hierarchy

There are 4 other ways characters can be "decomposed" whereby the form of decomposition must be retained:
  • compatibility decomposition
  • case-folding
  • bidirectional mirroring
  • Hangul to Jamo decomposition

The 'dt' and 'dm' properties further specify 673 other ways characters can be decomposed into constituent or equivalent forms using various functional tags:
  • noBreak (5)
  • fraction (20)
  • Arabic initial (171), medial (82), final (240), and isolated (238) forms
  • enclosed by circle (234) or square (205)
  • font (47)
  • small (26)
  • narrow (122), wide (104)
  • super (142), sub (38)
  • vertical (35)
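The functional tag and the mapping can be read from the decomposition field of the character database. A sketch using Python's `unicodedata.decomposition`, which returns the tag (if any) and the hexadecimal code points as one string; the helper name `split_decomposition` is hypothetical.

```python
import unicodedata

def split_decomposition(ch):
    """Return (tag, code_points); tag is None for a canonical decomposition."""
    fields = unicodedata.decomposition(ch).split()
    if not fields:
        return None, []                        # character does not decompose
    tag = None
    if fields[0].startswith("<"):
        tag = fields.pop(0).strip("<>")        # e.g. 'fraction', 'noBreak', 'circle'
    return tag, [int(field, 16) for field in fields]

for ch in ("\u00bd", "\u00a0", "\u2460", "\u00e9"):
    print(hex(ord(ch)), split_decomposition(ch))
```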

The circle and square enclosing decompositions have the same functionality as the CombiningEnclosingCircle(0x20dd) and CombiningEnclosingSquare(0x20de) non-spacing marks, except that most of the circle and square enclosings decompose to 2 or more characters, whereas non-spacing marks only work with 1 base character.

Cased characters are mapped to each other using the 'slc', 'stc', 'suc', 'lc', 'tc', and 'uc' properties. Uppercase and titlecase characters can be "case-folded" to the lowercase form to aid in functions such as case-insensitive pattern matching. Uppercase and titlecase forms of a character can be considered to be two further types of compatibility decomposition:
  • uppercase (1436)
  • titlecase (31)
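Case mapping and case folding are available on Python strings, which makes for a quick illustration. The sketch below uses the titlecase digraph 0x01c5 and a hypothetical helper `same_ignoring_case`; note that folding can yield more than one code point, which is why 'cf' is a code point list property.

```python
import unicodedata

title = "\u01c5"                       # titlecase digraph D with small z with caron
category = unicodedata.category(title) # 'Lt'
lower = title.lower()                  # 0x01c6
upper = title.upper()                  # 0x01c4
folded = title.casefold()              # folds to the lowercase form, 0x01c6

def same_ignoring_case(a, b):
    """Compare two strings after case folding both sides."""
    return a.casefold() == b.casefold()

# Sharp s folds to "ss", so these two spellings compare equal.
print(category, hex(ord(lower)), hex(ord(upper)), hex(ord(folded)))
print(same_ignoring_case("Stra\u00dfe", "STRASSE"))
```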

Many punctuation points and math symbols are paired into bidirectional mirroring glyphs (115 pairs). A further 181 math symbols should have two glyphs each, one mirroring the other, depending on whether it's in left-to-right text or right-to-left text. One character in each pair of mirrored glyphs can be regarded as a compatibility decomposition of the other:
  • bidi-mirroring (115)
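The distinction between paired and self-mirroring characters can be sketched as follows. Python's `unicodedata` exposes only the 'bidim' flag, so the 'bmg' data here is a tiny hand-written excerpt for illustration; the table and function names are hypothetical.

```python
import unicodedata

# Hypothetical excerpt of the 'bmg' (bidi mirroring glyph) data; the real table is far larger.
MIRROR_GLYPH = {"(": ")", ")": "(", "[": "]", "]": "[", "<": ">", ">": "<"}

def mirror_kind(ch):
    """Classify a character by its mirroring behavior, within this excerpt."""
    if not unicodedata.mirrored(ch):           # 'bidim' is false
        return "unmirrored"
    if ch in MIRROR_GLYPH:                     # 'bidim' with a 'bmg' partner
        return "paired"
    return "self-mirroring or not in excerpt"  # 'bidim' without a listed partner

for ch in ("(", "<", "a", "\u2211"):           # 0x2211 is n-ary summation
    print(hex(ord(ch)), mirror_kind(ch))
```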

Finally, the 11172 Hangul syllables can be decomposed into their 357 Jamo points.
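Hangul decomposition is purely arithmetic, and the same arithmetic explains the 11172 total and the LV and LVT counts (399 and 10773) shown in the tree. A sketch using the standard base and count constants; the function name is hypothetical. The algorithm itself draws on 19 leading, 21 vowel, and 27 trailing Jamo.

```python
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28       # T_COUNT includes "no trailing consonant"
N_COUNT = V_COUNT * T_COUNT                  # syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT                  # all precomposed syllables

def decompose_hangul(syllable):
    """Decompose one precomposed Hangul syllable into its Jamo."""
    s_index = ord(syllable) - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    leading = L_BASE + s_index // N_COUNT
    vowel = V_BASE + (s_index % N_COUNT) // T_COUNT
    t_index = s_index % T_COUNT              # 0 means an LV syllable
    jamo = [leading, vowel] + ([T_BASE + t_index] if t_index else [])
    return "".join(chr(cp) for cp in jamo)

lv_count = L_COUNT * V_COUNT                 # syllables without a trailing consonant
lvt_count = lv_count * (T_COUNT - 1)         # syllables with one

print(S_COUNT, lv_count, lvt_count)
print([hex(ord(c)) for c in decompose_hangul("\ud55c")])
```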

The CJK decompositions of the 74616 Unihan characters can use the existing Unicode system of marks and decompositions as their basis. They ultimately decompose to the 36 CJK stroke codepoints.

A first look suggests that by introducing decompositions into Groovy Unicode, we can reduce the 110,000 tokens to only about 22,000 by eliminating 88043 of them (673 compat, 1467 cased, 115 mirrored, 11172 Hangul, 74616 Unihan).
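The arithmetic behind this estimate, using only the figures quoted in this page (the 110,000 total is the rounded figure from the introduction, so the remainder is approximate):

```python
# Characters that could be expressed through a decomposition, by kind.
eliminated = {
    "compat": 673,
    "cased": 1436 + 31,    # uppercase + titlecase
    "mirrored": 115,
    "hangul": 11172,
    "unihan": 74616,
}

total_eliminated = sum(eliminated.values())
remaining = 110_000 - total_eliminated

print(total_eliminated)    # 88043
print(remaining)           # 21957, i.e. about 22,000
```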

Last edited Dec 18, 2011 at 12:41 PM by gavingrover, version 5

