Gavin Grover's GROOVY Wikiblog

6 February 2013

Unicode Pattern Syntax Tokens

Each of the million-plus Unicode code points has about 50 properties associated with it. Most of these can change between versions of Unicode, or be redefined for private-use characters. Six of them cannot:
  • Name (na)
  • Jamo Short Name (jsn)
  • Canonical Combining Class (ccc)
  • Decomposition Mapping (dm)
  • Pattern Syntax (PatSyn)
  • Pattern White Space (PatWS)
The first four do change once, though: from a default to a permanent value when a character is newly assigned. The other two, the Pattern Syntax and Pattern White Space properties, are both boolean and never change, even when a code point is newly assigned. Only 11 characters have the PatWS property, so they're less interesting than the 2760 with the PatSyn property, 296 of which are still unassigned in Unicode 6.1.

Unassigned code points get the PatSyn property because it is defined over every code point in certain blocks, rather than over individually assigned characters:
  • 2190..21FF; Arrows
  • 2200..22FF; Mathematical Operators
  • 2300..23FF; Miscellaneous Technical
  • 2400..243F; Control Pictures
  • 2440..245F; Optical Character Recognition
  • 2500..257F; Box Drawing
  • 2580..259F; Block Elements
  • 25A0..25FF; Geometric Shapes
  • 2600..26FF; Miscellaneous Symbols
  • 2700..27BF; Dingbats
  • 27C0..27EF; Miscellaneous Mathematical Symbols-A
  • 27F0..27FF; Supplemental Arrows-A
  • 2800..28FF; Braille Patterns
  • 2900..297F; Supplemental Arrows-B
  • 2980..29FF; Miscellaneous Mathematical Symbols-B
  • 2A00..2AFF; Supplemental Mathematical Operators
  • 2B00..2BFF; Miscellaneous Symbols and Arrows
  • 2E00..2E7F; Supplemental Punctuation
From the 2640 code points in these blocks, 30 exceptions (0x2776..0x2793) are subtracted, and 150 others from outside the blocks are added, giving 2760. The added characters include the ASCII symbol and punctuation characters often used for syntax in programming languages.
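The arithmetic behind that total can be checked with a short Python sketch. The block bounds are copied from the list above; the figure of 150 additions is taken from the text rather than derived, and the names `BLOCKS`, `block_total`, `excluded` and `added` are only illustrative.

```python
# Block ranges copied from the list above (inclusive code point bounds).
BLOCKS = [
    (0x2190, 0x21FF),  # Arrows
    (0x2200, 0x22FF),  # Mathematical Operators
    (0x2300, 0x23FF),  # Miscellaneous Technical
    (0x2400, 0x243F),  # Control Pictures
    (0x2440, 0x245F),  # Optical Character Recognition
    (0x2500, 0x257F),  # Box Drawing
    (0x2580, 0x259F),  # Block Elements
    (0x25A0, 0x25FF),  # Geometric Shapes
    (0x2600, 0x26FF),  # Miscellaneous Symbols
    (0x2700, 0x27BF),  # Dingbats
    (0x27C0, 0x27EF),  # Miscellaneous Mathematical Symbols-A
    (0x27F0, 0x27FF),  # Supplemental Arrows-A
    (0x2800, 0x28FF),  # Braille Patterns
    (0x2900, 0x297F),  # Supplemental Arrows-B
    (0x2980, 0x29FF),  # Miscellaneous Mathematical Symbols-B
    (0x2A00, 0x2AFF),  # Supplemental Mathematical Operators
    (0x2B00, 0x2BFF),  # Miscellaneous Symbols and Arrows
    (0x2E00, 0x2E7F),  # Supplemental Punctuation
]

# Every code point in a listed block counts, assigned or not.
block_total = sum(hi - lo + 1 for lo, hi in BLOCKS)

# The run 0x2776..0x2793 inside Dingbats is subtracted.
excluded = 0x2793 - 0x2776 + 1

# Number of additions from outside the blocks, as quoted in the text.
added = 150

pattern_syntax_total = block_total - excluded + added
print(block_total, excluded, pattern_syntax_total)  # 2640 30 2760
```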

According to Unicode Standard Annex #31: "With a fixed set of whitespace and syntax code points, a pattern language can then have a policy requiring all possible syntax characters (even ones currently unused) to be quoted if they are literals. Using this policy preserves the freedom to extend the syntax in the future by using those characters. Past patterns on future systems will always work; future patterns on past systems will signal an error instead of silently producing the wrong results."
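A minimal Python sketch of such a quoting policy might look like the following. The toy pattern language (two operators and a backslash quote), along with the names `KNOWN_OPERATORS`, `is_pattern_syntax` and `tokenize`, is hypothetical. The two property tables are written out from my recollection of PropList.txt, so treat them as an assumption and check them against the Unicode Character Database before relying on them.

```python
# Pattern_Syntax ranges as recalled from PropList.txt (an assumption to verify
# against the Unicode Character Database); bounds are inclusive.
PATTERN_SYNTAX_RANGES = [
    (0x0021, 0x002F), (0x003A, 0x0040), (0x005B, 0x005E), (0x0060, 0x0060),
    (0x007B, 0x007E), (0x00A1, 0x00A7), (0x00A9, 0x00A9), (0x00AB, 0x00AC),
    (0x00AE, 0x00AE), (0x00B0, 0x00B1), (0x00B6, 0x00B6), (0x00BB, 0x00BB),
    (0x00BF, 0x00BF), (0x00D7, 0x00D7), (0x00F7, 0x00F7), (0x2010, 0x2027),
    (0x2030, 0x203E), (0x2041, 0x2053), (0x2055, 0x205E), (0x2190, 0x245F),
    (0x2500, 0x2775), (0x2794, 0x2BFF), (0x2E00, 0x2E7F), (0x3001, 0x3003),
    (0x3008, 0x3020), (0x3030, 0x3030), (0xFD3E, 0xFD3F), (0xFE45, 0xFE46),
]

# The 11 Pattern_White_Space code points, likewise recalled from PropList.txt.
PATTERN_WHITE_SPACE = frozenset(
    "\t\n\x0b\x0c\r \x85\u200e\u200f\u2028\u2029"
)

# Hypothetical syntax of a toy pattern language: only these two characters
# currently mean something; a backslash quotes the next character.
KNOWN_OPERATORS = {"*", "?"}


def is_pattern_syntax(ch):
    """Return True if ch falls in one of the Pattern_Syntax ranges above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PATTERN_SYNTAX_RANGES)


def tokenize(pattern):
    """Return (kind, char) tokens, rejecting unquoted unused syntax characters."""
    tokens = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            # A quoted character is always a literal, whatever its properties.
            if i + 1 >= len(pattern):
                raise ValueError("dangling backslash at end of pattern")
            tokens.append(("literal", pattern[i + 1]))
            i += 2
            continue
        if ch in KNOWN_OPERATORS:
            tokens.append(("operator", ch))
        elif is_pattern_syntax(ch):
            # Reserved for future syntax: refuse rather than guess a meaning.
            raise ValueError(f"unquoted syntax character U+{ord(ch):04X}")
        elif ch in PATTERN_WHITE_SPACE:
            pass  # unquoted pattern white space is ignored
        else:
            tokens.append(("literal", ch))
        i += 1
    return tokens
```

Under this sketch `tokenize("a*b")` yields a literal, an operator and a literal, while `tokenize("a→b")` raises an error until the arrow is quoted as `\→`. If a later version of the toy language gave → a meaning, old patterns would keep working, which is the guarantee the annex describes.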

There are, of course, many other symbol and punctuation characters in Unicode that aren't Pattern Syntax characters, and more can be added that can't become Pattern Syntax characters (unless they're encoded in one of the remaining 296 unassigned Pattern Syntax slots). The only language I know of that utilizes Unicode's pattern syntax invariance is XML 1.0, 5th edn. Perhaps the Groovy Language reboot will be the first Turing-complete language to do so.

47 of the 2464 assigned Pattern Syntax characters are canonically equivalent to other forms, so can be ignored. 45 of those decompose to some other Pattern Syntax character followed by the nonspacing mark 0x338( ̸ ), e.g. ≠ is = followed by the mark ̸ . The other two, 0x2329 and 0x232A, are singleton decompositions to 0x3008(〈) and 0x3009(〉), both also Pattern Syntax characters.
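These decompositions can be inspected with Python's standard `unicodedata` module. The helper name `canonical_decomposition` is my own, and the results reflect whichever Unicode version the Python build bundles, although decomposition mappings of assigned characters do not change.

```python
import unicodedata


def canonical_decomposition(ch):
    """Return the canonical decomposition of ch as a list of code points.

    Returns [] when ch has no decomposition mapping, or only a compatibility
    mapping (those are tagged, e.g. "<compat> ...", in the database).
    """
    fields = unicodedata.decomposition(ch).split()
    if not fields or fields[0].startswith("<"):
        return []
    return [int(field, 16) for field in fields]


# NOT EQUAL TO decomposes to "=" followed by the combining overlay U+0338.
print([hex(cp) for cp in canonical_decomposition("\u2260")])  # ['0x3d', '0x338']

# The angle brackets U+2329 and U+232A are singleton decompositions.
print([hex(cp) for cp in canonical_decomposition("\u2329")])  # ['0x3008']

# Normalizing, even to the composed form NFC, replaces a singleton for good.
print(unicodedata.normalize("NFC", "\u2329") == "\u3008")  # True
```

A lexer that normalizes its input first therefore never sees any of these 47 characters as distinct tokens.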

Within the assigned pattern characters, perhaps the second most important distinction is that of the bidi-mirroring glyphs, those with the bmg property, each of which must be swapped for its complement when used within a right-to-left rendering context, such as within Arabic and Hebrew text. There are 144 pairs of them, such as ( and ), or [ and ]. Because such pairs are used extensively in programming languages, with balancing often required to make language syntax more readable, they are an important subset of the Pattern Syntax characters.
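As a rough illustration of the swap, here is a Python sketch using a handful of pairs copied by hand. The complete table lives in BidiMirroring.txt in the Unicode Character Database, which Python's standard library does not expose, so `PAIRS`, `MIRROR` and `mirror_for_rtl` are hypothetical names. A real renderer also reorders the text, which this sketch does not attempt.

```python
# A few bidi-mirroring pairs, hand-copied for illustration only.
PAIRS = [
    ("(", ")"),
    ("[", "]"),
    ("{", "}"),
    ("<", ">"),
    ("\u00AB", "\u00BB"),  # the two double angle quotation marks
]

# Build a lookup that works in both directions.
MIRROR = {}
for left, right in PAIRS:
    MIRROR[left] = right
    MIRROR[right] = left


def mirror_for_rtl(text):
    """Swap each known mirroring character for its complement."""
    return "".join(MIRROR.get(ch, ch) for ch in text)


print(mirror_for_rtl("f(x[0])"))  # f)x]0[(
```

Applying the function twice gives back the original string, since each pair maps onto itself.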

Another important subset of the Pattern Syntax characters is the 161 bidi-mirrored characters, those with the bidim property: unmatched characters which must be mirrored when rendering, e.g. ∁ ∂ ∃ ∑ . A further subset could be characters that would be bidi-mirrored but aren't because they're already symmetrical, e.g. ∀ ∩ ∪ . They're not indicated by any specific property, but must be guessed at based on their physical proximity to related bidi-mirrored characters in the Unicode database, e.g. ∃ is bidi-mirrored so ∀ must be symmetrical. Perhaps a specific property in a future version of Unicode would be nice.

There are many other Pattern Syntax characters that aren't symmetrical but also aren't mirroring or mirrored in a right-to-left rendering context because they're considered to be pictorial or ornate characters, including those for box-drawing and all the arrows, such as 0x2190(←) and 0x2192(→). Unlike the Pattern Syntax property, the bidi properties of a character can change in future versions of Unicode, but for now we can't use ← in programming and expect it to be rendered as → when we change the naming language to Arabic or the thematic ordering of the referents. So languages that use arrows in their syntax, such as Scala, aren't future-proofed. Perhaps a future version of Unicode would add more mirroring pairs.
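The difference between these groups shows up in Python's `unicodedata.mirrored`, which reports the Bidi_Mirrored property. Note that this single flag covers both the paired and the unpaired mirrored characters described above, and that the answers come from the Unicode version bundled with the Python build. The wrapper name `is_bidi_mirrored` is just for illustration.

```python
import unicodedata


def is_bidi_mirrored(ch):
    """True if ch has Bidi_Mirrored=Yes in Python's bundled Unicode data."""
    return unicodedata.mirrored(ch) == 1


# A paired bracket, two unpaired mirrored symbols, two symmetrical symbols,
# and two arrows.
for ch in "(\u2203\u2211\u2200\u222A\u2190\u2192":
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {is_bidi_mirrored(ch)}")
```

The bracket, ∃ and ∑ report True, while ∀, ∪ and both arrows report False, so nothing in the property data distinguishes a symmetrical symbol from an arrow that simply isn't mirrored.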

In designing the Groovy reboot, we do have 305 choices (144 bidi-mirroring pairs plus 161 bidi-mirrored symbols), so not being able to use the arrows isn't a big deal. We just need to be aware of our choices, and not make syntactic mistakes, as did the first bootup of Groovy when its project managers committed to treating every Unicode character above 0xFF as an identifier character :-(

The Groovy Language reboot will utilize the full power of Unicode in both its syntax and its vocabulary.

Last edited Aug 17 at 8:45 PM by gavingrover, version 22

