Unicode Properties in GRegexes

In version 0.1.8 of GRegexes, parsers can access Unicode properties through the magic _c variable inside ~closures.

Blocks

We can access the block property using 'blk'. The block name is written as a lowercase string with hyphens and spaces removed. The ~closure returns 1 for a match that consumes the character, 0 for a peek-match that consumes nothing, or -1 for a failure.

def p1= 'x' & ~{ _c.blk == 'basiclatin'? 1: -1 }
def r= p1.parse("xyz")
assert r.result == ['x', 'y']

r= p1.parse("x中z")
assert r.result == null

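//'x', then one Basic Latin character, then one character outside Basic Latin
def p2= 'x' & ~{ _c.blk == 'basiclatin'? 1: -1 } & ~{ _c.blk != 'basiclatin'? 1: -1 }
assert p2.parse("xyzw").result == null //'z' is Basic Latin
assert p2.parse("x中zw").result == null //'中' is not Basic Latin
assert p2.parse("x中下w").result == null //'中' is not Basic Latin
assert p2.parse("xy下w").result == ['x', 'y', '下']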

Examples of block names used by GRegexes 0.1.7 are:

    (0x0000..0x007f): 'basiclatin',
    (0x0080..0x00ff): 'latin1supplement',
    (0x0100..0x017f): 'latinextendeda',
    (0x0180..0x024f): 'latinextendedb',
    (0x0250..0x02af): 'ipaextensions',
    (0x02b0..0x02ff): 'spacingmodifierletters',
    (0x0300..0x036f): 'combiningdiacriticalmarks',
    (0x0370..0x03ff): 'greekandcoptic',
    (0x0400..0x04ff): 'cyrillic',
    (0x0500..0x052f): 'cyrillicsupplement',
    (0x0530..0x058f): 'armenian',
    (0x0590..0x05ff): 'hebrew',
    (0x0600..0x06ff): 'arabic',
    (0x0700..0x074f): 'syriac',
    (0x0750..0x077f): 'arabicsupplement',
    (0x0780..0x07bf): 'thaana',
    (0x07c0..0x07ff): 'nko',
    //...
    (0x20000..0x2a6df): 'cjkunifiedideographsextensionb',
    (0x2a700..0x2b73f): 'cjkunifiedideographsextensionc',
    (0x2b740..0x2b81f): 'cjkunifiedideographsextensiond',
    (0x2f800..0x2fa1f): 'cjkcompatibilityideographssupplement',
    (0xe0000..0xe007f): 'tags',
    (0xe0100..0xe01ef): 'variationselectorssupplement',
    (0xf0000..0xfffff): 'supplementaryprivateuseareaa',
    (0x100000..0x10ffff): 'supplementaryprivateuseareab',
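
The listing above pairs codepoint ranges with block names. As a minimal sketch of how such a table can be searched, the following Groovy snippet copies a few of those ranges into a list and looks up a codepoint. The names blockTable and blockName are hypothetical and are not part of GRegexes, which may implement the lookup quite differently.

```groovy
// Hypothetical excerpt of a range-to-name table, copied from the listing above.
def blockTable = [
    [0x0000..0x007f, 'basiclatin'],
    [0x0080..0x00ff, 'latin1supplement'],
    [0x0370..0x03ff, 'greekandcoptic'],
    [0x20000..0x2a6df, 'cjkunifiedideographsextensionb']
]

// Return the block name for a codepoint, or null if no listed range contains it.
def blockName = { int cp ->
    def entry = blockTable.find { it[0].contains(cp) }
    entry ? entry[1] : null
}

assert blockName(0x79) == 'basiclatin'       //'y'
assert blockName(0x3b1) == 'greekandcoptic'  //Greek small letter alpha
assert blockName(0x4e2d) == null             //'中' is not in this excerpt
```

A linear scan is enough for a short excerpt; a full table would more likely use a binary search over sorted range starts.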

The next version of GRegexes will support heuristic names.

Scripts

We can access the script property using 'sc', in the same way as for blocks.

def p1= 'x' & ~{ _c.sc == 'latin'? 1: -1 }
def r= p1.parse("xyz")
assert r.result == ['x', 'y']

r= p1.parse("x中z")
assert r.result == null

Common script values include 'common', 'inherited', 'latin', 'greek', 'cyrillic', 'arabic', and 'han'.
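
To check which script a character belongs to outside of GRegexes, the JDK's Character.UnicodeScript class (Java 7 or later) can serve as a stand-in. This is only a sketch: scriptName is a hypothetical helper, and the JDK's Unicode version may differ from the one GRegexes was built against.

```groovy
// Hypothetical helper: the JDK script name, lowercased with underscores removed.
def scriptName = { int cp ->
    Character.UnicodeScript.of(cp).name().toLowerCase(Locale.ROOT).replace('_', '')
}

assert scriptName(0x79) == 'latin'       //'y'
assert scriptName(0x4e2d) == 'han'       //'中'
assert scriptName(0x3b1) == 'greek'      //Greek small letter alpha
assert scriptName(0x20) == 'common'      //space
assert scriptName(0x301) == 'inherited'  //combining acute accent
```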

Other string properties

Property names such as 'blk' and 'sc' are taken from the Unicode property aliases file, lowercased with hyphens and spaces removed. Their values are from the property values file, also lowercased with hyphens and spaces removed.
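
The naming rule can be sketched as a small Groovy helper. The name normalizeName is hypothetical, and stripping underscores as well is an assumption beyond the rule stated above, added because the Unicode alias files separate words with underscores.

```groovy
// Hypothetical helper: lowercase a name and drop hyphens and spaces.
// Underscores are removed too, which is an assumption beyond the text above.
def normalizeName = { String name ->
    name.toLowerCase(Locale.ROOT).replaceAll(/[-_ ]/, '')
}

assert normalizeName('Basic Latin') == 'basiclatin'
assert normalizeName('Latin-1 Supplement') == 'latin1supplement'
assert normalizeName('Greek and Coptic') == 'greekandcoptic'
assert normalizeName('NKo') == 'nko'
```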

The other general string properties in the Unicode property aliases file are:

gc, bc, scf, dt, na, na1, namealias, age, nt, nv, jsn, ea, hst, jg, jt, nfcqc, nfdqc, nfkcqc, nfkdqc, lb, gcb, sb, wb

Boolean properties

Unicode boolean properties can also be used, e.g. the ideographic property (ideo):

def p1= 'x' & ~{ _c.ideo? 1: -1 }
def r= p1.parse("x中z")
assert r.result == ['x', '中']

r= p1.parse("xyz")
assert r.result == null
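
For comparison, the JDK exposes the same property through Character.isIdeographic (Java 7 or later). The sketch below wraps it in the 1 / -1 return convention used by the ~closures above; ideoStep is a hypothetical name, and the snippet does not call GRegexes itself.

```groovy
// Hypothetical stand-in for the ~closure body: 1 on a match, -1 on a failure.
def ideoStep = { int cp -> Character.isIdeographic(cp) ? 1 : -1 }

assert ideoStep(0x4e2d) == 1   //'中' is ideographic
assert ideoStep(0x79) == -1    //'y' is not
```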

Other boolean properties are:

ahex, bidic, bidim, ce, dash, dep, dia, ext, hex, idc, ids, idsb, idst, joinc,
loe, lower, nchar, oalpha, odi, ogrext, oidc, oids, olower, omath, oupper, patsyn, patws,
qmark, radical, sd, sterm, term, uideo, upper, vs, wspace, xidc, xids

Integer properties

Unicode has one integer property, the canonical combining class (ccc):

def p1= 'x' & ~{ _c.ccc == 0? 1: -1 }
assert p1.parse("xyz").result == ['x', 'y']

def p2= 'x' & ~{ _c.ccc == 230? 1: -1 }
assert p2.parse("xyz").result == null

We can also access the character as an integer using 'cp':

def p= 'x' & ~{ _c.cp == 0x61? 1: -1 } //0x61 is 'a'
assert p.parse("xaz").result == ['x', 'a']
assert p.parse("xbz").result == null
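
Codepoints above 0xffff, such as those in the supplementary blocks listed earlier, occupy two chars in a JVM string, so a codepoint is not always the same thing as a char. The following sketch uses only standard java.lang.String methods to show the difference; it does not involve GRegexes.

```groovy
def s = 'x\uD840\uDC00z'   //'x', U+20000 as a surrogate pair, 'z'

assert s.length() == 4                       //four chars
assert s.codePointCount(0, s.length()) == 3  //but three codepoints
assert s.codePointAt(1) == 0x20000
assert 'xaz'.codePointAt(1) == 0x61          //0x61 is 'a'
```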

Codepoint properties

Some Unicode properties map a character to a single related codepoint, e.g. the simple uppercase mapping (suc):

def p= 'x' & ~{ _c.suc == 0x41? 1: -1 } //0x41 is 'A'
assert p.parse("xaz").result == ['x', 'a']
assert p.parse("xbz").result == null

Other codepoint properties are:

bmg, slc, stc
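
The simple case mappings have close counterparts in the JDK, which can be handy for checking expected values. This is a sketch using the int overloads of java.lang.Character, on the assumption that they agree with suc, slc and stc for these characters; GRegexes itself is not involved.

```groovy
assert Character.toUpperCase(0x61) == 0x41     //'a' -> 'A', as suc
assert Character.toLowerCase(0x4a) == 0x6a     //'J' -> 'j', as slc
assert Character.toTitleCase(0x1c6) == 0x1c5   //U+01C6 -> U+01C5, as stc
assert Character.toUpperCase(0x1c6) == 0x1c4   //U+01C6 -> U+01C4, as suc
```

The U+01C6 digraph is one of the few characters whose titlecase and uppercase mappings differ.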

Other properties are a list of codepoints, e.g. the lowercase mapping (lc):

def p= 'x' & ~{ 0x6a in _c.lc? 1: -1 } //0x6a is 'j'
assert p.parse("xJz").result == ['x', 'J']
assert p.parse("xbz").result == null

Other codepoint list properties are:

uc, tc, cf, nfkccf, dm
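
These properties are lists because a full mapping can expand one character into several. The sketch below shows this with JDK string methods; codepoints is a hypothetical helper, and the NFD call is only a stand-in for dm, since it performs full canonical decomposition, which happens to coincide with dm for the character shown.

```groovy
import java.text.Normalizer

// Hypothetical helper: split a string into a list of codepoints.
def codepoints = { String s ->
    def out = []
    for (int i = 0; i < s.length(); i += Character.charCount(s.codePointAt(i))) {
        out << s.codePointAt(i)
    }
    out
}

//'J' lowercases to the single codepoint 'j'
assert codepoints('J'.toLowerCase(Locale.ROOT)) == [0x6a]
//U+00DF (sharp s) uppercases to two codepoints, 'S' 'S'
assert codepoints('\u00df'.toUpperCase(Locale.ROOT)) == [0x53, 0x53]
//U+00E9 (e with acute) decomposes to 'e' plus a combining acute accent
assert codepoints(Normalizer.normalize('\u00e9', Normalizer.Form.NFD)) == [0x65, 0x301]
```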

Last edited Sep 6, 2011 at 3:04 AM by gavingrover, version 2
