Gregexes

Gregexes are to regexes what GStrings are to strings in Groovy: they add lots of extra functionality to regexes. They're a lightweight combinator parsing library for Groovy, inspired by Perl 6 Grammars, providing lazy evaluation of recursive parsing expressions and in-place processing of intermediate results. They can detect left-recursion at the parse phase, providing meaningful error messages. They're designed to only work with string inputs. Their use case is midway between simpler regexes and more sophisticated parsing tools.

Gregex is released under LGPL version 3. The source code and license text are inside the jar file. Version 0.1.5 of the gregexes are written in Groovy++ 0.4.150, and its use has been tested on Groovy 1.7.5 running on Java 1.6.0_21. To use them, drop the gregex.jar in your Groovy classpath, along with the Groovy++ jar. Gregexes are modeled after regexes in that they use two main classes: Parser, similar to regex's Pattern, and ParseMatcher, similar to regex's Matcher. To use them in the longhand way:

import gregex.*
import static gregex.Parser.*
Parser.use{
  Parser p= match('x')
  ParseMatcher r= p.parse("xyz")
  assert r.result == "x" && r.position == 1
}

We must import gregex, and surround our code with Parse.use{}. (An annotation simplifying all this will come in the next version.) We'll omit the import and Parse.use{} statements in the examples in this tutorial. Of course, there's a shorthand way to build and run gregexes, which we'll use from now on.

Below are many short examples showing how to use gregexes. If you prefer, you can learn by building the canonical calculator example usually presented with combinator parser tutorials.

Sequences

Gregexes provide a shorthand way to specify parsing expressions, powered by Groovy's operator overloading. To specify sequencing, we use & between the expressions, which can be a mix of strings, regexes, and other gregexes:

def p1= ~/T./ & "ka"
def p2= p1 & "pu" & ~/n./
def r= p2.parse("Takapuna")
assert r.result == ["Ta", "ka", "pu", "na"] && r.position == 8

The output is a list of results matched by the sequences, including those in the referenced gregex. When the referenced or bracketed gregex is on the right side, though, the resulting list will be embedded:

def p1= ~/El*/ & ('ers' & ~/l../)
assert (p1 << 'Ellerslie').result == ['Ell', ['ers', 'lie']] //result from rightside parser embedded in list

def p2= ~/El*/ & ('er' & 'sl') & ~/../
assert (p2 << 'Ellerslie').result == ['Ell', ['er', 'sl'], 'ie'] //results from middle parser also embedded

def p3= (~/El*/ & 'ers') & ~/l../
assert (p3 << 'Ellerslie').result == ['Ell', 'ers', 'lie'] //but results from leftside parser flattened in list

The above example also shows using the streaming operator << for parsing.

When we want to suppress the result of a parser from the results, we put a - sign in front of it:

def p= -'a' & 'b' & -~/.{2}/ & 'e' & -~/f/ & ~/g/
p<< 'abcdefgh'
assert _r == ['b', 'e', 'g'] && _p == 7

This example also shows two handy magic variables: r is the result of the last parse, or null if it failed, and p is the position in the input string parsed up to.

The results from a parse are returned as a list when there's more than one result, or no results. A single result is always stripped from the list and returned individually:

(-'a' & 'b' & 'c') << 'abc'; assert _r == ['b', 'c']
(-'a' & -'b' & 'c') << 'abc'; assert _r == 'c'
(-'a' & -'b' & -'c') << 'abc'; assert _r == []
(+~/a(b)c/) << 'abc'; assert _r == ['abc', 'b'] //more than one result from regex with groups

The last line above also shows how we can coerce a standalone string or regex into a parser by using the + operator. This is essential when the string or regex isn't part of a larger parsing expression.

To require end-of-input, we put the null string '' in the sequence:

+'xxx'<< 'xxx'; assert _r == 'xxx' && _p == 3
+'xxx'<< 'xxxx'; assert _r == 'xxx' && _p == 3
('xxx' & '')<< 'xxx'; assert _r == 'xxx' && _p == 3
('xxx' & '')<< 'xxxx'; assert _r == null && _p == 0

All these sequencing functions of gregexes can be duplicated by regex sequencing and grouping. What gregexes add to the sequencing paradigm is the ability to use logic in closures for matching, simply by prefixing the closure with ~. We return an integer to show how many characters we want to match in the string, or -1 for failure:

('abc' & ~{it[0..2]=='def'? 3: -1} & 'g') << "abcdefgh"
assert _r == ['abc', 'def', 'g']

We can use a closure that returns 0 to make assertions during the parse.

Lookarounds

If we supply two parameters to the closure, the first is the portion of the string already matched, so we can do lookbehinds:

def p=
  'abc' &
  ~{prev, succ-> prev[-1]=='c' && succ[0..2]=='def'? 3: -1} &
  'g'
p<< "abcdefgh"
assert _r == ['abc', 'def', 'g'] && _p == 7

Of course, we can do lookarounds inside a regex. Regexes, even when disjoint within the gregex expression, always respect the region and use transparent bounds:

def p= ~/a../ & ~/(?<=bc)d.f/ & 'g'
p<< 'abcdefghij'
assert _r == ['abc', 'def', 'g']

We can also use the peek() and peekNot() parsers:

( 'x' & peek(+'yz') & ~/y./ )<< 'xyzw'
assert _r == ['x', 'yz']

( 'x' & peekNot(+'y') )<< 'xxx'
assert _r == 'x' && _p == 1

Inplace Result Modification

Gregexes enable us to modify the result anywhere during the parse using the ^ operator. It's common practice to do so at the end of a sequencing expression, and Groovy's operator precedences allow us to do so without parentheses:

def p= "x" & ~/./ ^{"{" + it[0] + ", " + it[1] + "}"}
assert (p<< 'xyz').result == "{x, y}"

def q= -~/ */ & ~/[0-9]+/ ^{new BigInteger(it)}
assert (q<< '  7 ').result == 7g

Instead of a closure after the ^ operator, we can put a format string:

def p= "2" & "3" ^ 'a%2$2s%1$3s d' //REM: use single-quotes to suppress GString processing
p<< '2345'
assert _r == "a 3  2 d" && _p == 2

If we don't want to process the current result, we can put the new result within brackets , either in the sequence or after the ^ operator:

(["g"] & "x") << 'xyz'; assert _r == ['g', 'x']
(('uv' ^ ["g"]) & ~/w./) << 'uvwxyz'; assert _r == ['g', 'wx']
('uv' & ["g"] & ~/w./) << 'uvwxyz'; assert _r == ['uv', 'g', 'wx']

If we have two or more values within , or an empty list, the list itself will be the new result:

(["g", "h"] & ~/w./) << 'wxyz'; assert _r == [['g', 'h'], 'wx']
('uv' & ["g", "h"] & ~/w./) <<'uvwxyz'; assert _r == ['uv', ['g', 'h'], 'wx']
('uv' & [] & ~/w./) << 'uvwxyz'; assert _r == ['uv', [], 'wx']

Alternation

We show alternation by putting | between the choices. Because Gregex parsers are atomic by default, we must put the longest match first:

def decimalP= ~/[0-9]+\.[0-9]+/ ^{new BigDecimal(it)}
def integerP= ~/[0-9]+/ ^{new BigInteger(it)}
def p= (decimalP | integerP) & 'g'

p<< '123g '; assert _r == [123, 'g']
p<< '123.45g '; assert _r == [123.45, 'g']

We can put the choices in any order if we enable backtracking using the backtrack() parser:

def decimalP= ~/[0-9]+\.[0-9]+/ ^{new BigDecimal(it)}
def integerP= ~/[0-9]+/ ^{new BigInteger(it)}
def p1= (integerP | decimalP) & 'g'

p1<< '123.45g '; assert _r == null //i.e. failed

def p2= backtrack( (integerP | decimalP) & 'g' )

p2<< '123g '; assert _r == [123, 'g']
p2<< '123.45g '; assert _r == [123.45, 'g']

We can use | with any strings, regexes, and gregexes:

def p= (('x' & ~/y/) | 'p') | (-('a' & 'b') | 'f')
p<< 'xyz'; assert _r == ['x', 'y'] && _p == 2
p<< 'abc'; assert _r == [] && _p == 2
p<< 'ijk'; assert _r == null && _p == 0 //i.e. failure

We've seen an example of backtracking using alternation. Gregex also provides packratting, caching intermediate results, but they're mutually exclusive to backtracking. We can only use one of them at a time:

def p=( ('x' & 'y' & 'z') | "xy" | "x" ) & "y" & "z"
def p1= backtrack(p)
p1<<'xyz'; assert _r == ['x', 'y', 'z']

def p2= packrat(p)
p2<<'xyz'; assert _r == null //i.e. fails

Repeating

We can greedily repeat a parser many times by postfixing (Many) to it:

[ xxxxyzw: ['x']*4,
  xyzw: ['x'],
  yzw: [],
].each{str, res->
  ('x')(Many)<< str; assert _r == res
}

(It would be groovy if we could postfix * or + or ? or *? etc as we can with regexes. Although Groovy supplies a great range of left-associative infix operators at different precedences, it's embarrassingly sparce on other kinds of operator to overload. They'll come in a future Groovy, perhaps. For now, we're constrained by Groovy's grammar.)

The (Maybe) parser always returns a list, perhaps empty or perhaps with a single entry:

[ xxxyz: ['x'],
  yyz: [],
].each{str, res->
  ('x')(Maybe)<< str; assert _r == res
}

By putting a number inside the parentheses, we can repeat the parser a certain number of times:

(~/x./)(3)<< 'xaxbxcxd'
assert _r == ['xa', 'xb', 'xc']

We put a range inside the parentheses to specify the minimum and maximum times the parser will execute. Many denotes an infinite maximum:

def p= ~/./
def str= 'abcdefg'

p(0..3)<< str; assert _p == 3
p(3..5)<< str; assert _p == 5
p(5..9)<< str; assert _p == 7
p(4..Many)<< str; assert _p == 7

Like with alternation, repeating is atomic by default. By using the backtrack() parser, though, we can enable backtracking:

def p= backtrack( ('x')(Many) & 'x' )
p<< 'xxxx'; assert [['x']*3, 'x']

By giving a second parameter to a repeat() parser, we can skip text in between. This acts as a splitting mechanism. By setting the third parameter to true, we can also skip the trailing text if it matches:

def p1= (+"x")(Many, -"=")
p1<< "x=x=x=abc"; assert _r == ['x']*3 && _p == 5

def p2= (+"x")(Many, -"=", true)
p2<< "x=x=x=abc"; assert _r == ['x']*3 && _p == 6
p2<< "x=x=xabc"; assert _r == ['x']*3 && _p == 5

Lazy Evaluation of Parsers

The typical pattern for using lazy parser evaluation is to, first, predefine a variable, second, use it on the right side of a parse expression enclosed by a closure, and lastly, assign a parse expression to it. In our example, we prepend the standalone closure with a + to coerce it to a parser:

def u
def p= +{u}
u= ("(" & p & ")") | "a"

p << '(a)'; assert _r == ['(', 'a', ')']
p << '(((a)))'; assert _r == ["(", ["(", ["(", "a", ")"], ")"], ")"]

To learn more about the lazy-evaluation parser, see the calculator example.

Monadic Binding

A bind parser is a closure that processes the result, and returns a new parser. The bind symbol >> is the same one used by Haskell and Scala. The returned parser is often a value parser, enclosing the new result:

def p= 'x' >>{ ['>'+ it +'<'] } & 'y'
p<< 'xyz'; assert _r == ['>x<', 'y'] && _p == 2

Last edited Jan 28, 2011 at 11:42 AM by gavingrover, version 1

Comments

No comments yet.