Precedences, Parentheses, and Path Expressions

Lisp languages like Clojure use S-expressions. Two proposed alternatives were M-expressions and I-expressions, which stand for meta-expressions and indentation-sensitive expressions respectively.

Guy Steele and Richard Gabriel said about M-expressions "because Lisp makes it easy to play with program representations, it is always easy for the novice to experiment with alternative notations. Therefore we expect future generations of Lisp programmers to continue to reinvent Algol-style syntax for Lisp, over and over and over again, and we are equally confident that they will continue, after an initial period of infatuation, to reject it."

Clojure evolves from Common Lisp and Scheme by using three basic aggregate types in its syntax (lists, vectors, and maps) instead of one, each with its own delimiter pair: parens () for lists, brackets [] for vectors, and curlies {} for maps. This syntax makes it more readable than earlier Lisps, but it still falls short of languages with Algol-derived syntax.

C evolved from Algol by using curlies to delimit statement blocks, instead of Algol's wordier begin/end, and C++, C#, Java, and JavaScript copied C in this. These five C-syntax languages make up well over half of all language use as measured by the TIOBE index, and three of them (Java, C#, and JavaScript) are the native languages of Clojure's target platforms.

I believe two features make C-derived syntax more programmer-friendly than Lisp syntax: path expressions and infix operators. Programmers learn infix notation for math in primary school, e.g. a + b + c instead of (+ a b c). They learn path expressions for math in high school, e.g. f(x,y) instead of (f x y). C-derived syntax uses this math notation and appeals to programmers because of it. Clojure is becoming popular because of its advanced concurrency constructs, while retaining the list-based macros of Lisp. Its popularity is in spite of its bare-bones syntax, not because of it.

If the path expressions and infix operators of C-derived languages were available to Clojure programmers, Clojure would increase its appeal to programmers on its target platforms, i.e. those using Java, C#, and JavaScript, as well as to those coming straight from high-school math.

Path expressions

f(x,y) is a simple path expression. We put the first expression, f, before the pair of delimiters, thereby creating a path: the function name followed by its arguments. C also introduces a different delimiter pair, brackets, for referencing array entries, e.g. a[7] instead of (get a 7), as a substitute for subscripting in math notation. C and C++ use . and -> to reference fields of C structs and members of C++ objects, e.g. a.b->f instead of (f a b). Codehaus Groovy extends Java syntax by allowing a curly-delimited closure on the path after the other parameters, e.g. o.f(x, y){s; t} instead of (f o x y (fn [] s t)). With these notations, we can build up long, readable paths in various languages, e.g. a[7]("abc", b)->f(8,9){it*2}.c. While Clojure and its macros provide some syntactic shortcuts, such path expressions are far terser and clearer to read.
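To make the correspondence concrete, here is a minimal Python sketch that rewrites a small subset of path expressions (calls, bracket indexing, and dotted method calls) into the S-expression forms quoted above. The function name path_to_sexpr and the choice to render a bare o.c as (c o) are my own assumptions for illustration; it handles only names and integers, and ignores ->, closures, and operators.

```python
import re

# Tokens: identifiers, integers, and the punctuation used in paths.
TOKEN = re.compile(r"\s*([A-Za-z_]\w*|\d+|[().,\[\]])")

def path_to_sexpr(text):
    """Rewrite a path expression such as o.f(x, y) as (f o x y).

    Hypothetical helper: covers only name/number heads, calls f(...),
    indexing a[...], and dotted members o.name or o.name(...).
    """
    tokens = TOKEN.findall(text)
    # Reject anything the tokenizer silently skipped over.
    if "".join(tokens) != re.sub(r"\s+", "", text):
        raise SyntaxError("unsupported character in input")
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take(expected=None):
        nonlocal pos
        tok = peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        pos += 1
        return tok

    def atom():
        tok = take()
        if tok in "().,[]":
            raise SyntaxError(f"expected a name or number, got {tok!r}")
        return tok

    def arguments(closer):
        # Comma-separated paths up to the closing delimiter.
        args = []
        if peek() != closer:
            args.append(path())
            while peek() == ",":
                take(",")
                args.append(path())
        take(closer)
        return args

    def path():
        form = atom()
        while peek() in ("(", "[", "."):
            tok = take()
            if tok == "(":      # f(x, y)   ->  (f x y)
                form = "(" + " ".join([form] + arguments(")")) + ")"
            elif tok == "[":    # a[7]      ->  (get a 7)
                index = path()
                take("]")
                form = f"(get {form} {index})"
            else:               # o.f(x, y) ->  (f o x y)
                name = atom()
                args = arguments(")") if peek() == "(" and take("(") else []
                form = "(" + " ".join([name, form] + args) + ")"
        return form

    result = path()
    if peek() is not None:
        raise SyntaxError("trailing input")
    return result
```

For instance, path_to_sexpr("a[7](b).c") walks the path left to right and nests each step around the previous one, which is exactly why the S-expression reads inside-out while the path reads in order.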

C-syntax languages also give their control structures path syntax, e.g. while(i<0){i++} has the same syntactic form as a method call taking boolean and closure arguments in Codehaus Groovy. Others have extra embellishments while retaining the basic path format, such as if-else, try-catch-finally, switch-case-default, and for-statements. E.g. the if-else statement can have an optional else "method call" after the first "closure" form in the path.

Most languages allow unary operators. Prefix unaries look like method calls on a single argument, where the parentheses are missing, e.g. ~darg(7,"g").field. Postfix unaries look like member references on a single object, where the dot is missing, e.g. trig("ij").open++. Prefix operators are right associative while postfix operators are left associative.

Infix operators

Infix operators are learnt in primary school around the world, and seem more natural than S-expressions for programming arithmetic. It just feels natural for * and / to bind more tightly than + and -, and so they have different precedences. When enabling this in a programming language, we need some form of grouping symbol for the cases where we need to group against precedence. In both math and most programming languages, parens () are used around such expressions. But because parens are the same symbol as for function calls, we need to separate function arguments with a symbol such as a comma to avoid ambiguity, e.g. o.f(x, (a+b)).

Precedence hierarchies are typically listed beginning from parentheses, which bind tightest, then path expressions, then unary operators, then the infix operators at various precedences, sometimes with many operators at the same precedence. Exponents are represented by superscripts in math. Math subscripts are denoted by brackets [] in programming, but there are no surplus bracketing symbols left on the keyboard to do the same with superscripts, so an infix operator is usually used, e.g. ** or ^. Such exponents bind tighter than multiplication in math, so they are usually the highest-precedence infix operator, and are right-associative as in math. Next come * / and % all at the same precedence, left-associative, followed by + and - at a lower precedence.
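The arithmetic part of this hierarchy can be sketched as a small precedence-climbing parser in Python that turns infix text into S-expressions. The operator table, the function name to_sexpr, and the restriction to names, integers, parens, and these six operators are assumptions made for illustration, not any particular language's grammar.

```python
import re

# (precedence, associativity); a higher number binds tighter.
# Illustrative table only: ** above * / %, which are above + -.
OPS = {
    "+": (1, "left"), "-": (1, "left"),
    "*": (2, "left"), "/": (2, "left"), "%": (2, "left"),
    "**": (3, "right"),
}

TOKEN = re.compile(r"\s*(\*\*|[-+*/%()]|\d+|[A-Za-z_]\w*)")

def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def to_sexpr(text):
    """Parse infix arithmetic and return the equivalent S-expression."""
    tokens = tokenize(text)
    pos = 0

    def primary():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            # Parens bind tightest: restart at the lowest precedence inside.
            inner = expr(1)
            if pos >= len(tokens) or tokens[pos] != ")":
                raise SyntaxError("expected ')'")
            pos += 1
            return inner
        if tok == ")" or tok in OPS:
            raise SyntaxError(f"unexpected {tok!r}")
        return tok

    def expr(min_prec):
        nonlocal pos
        left = primary()
        while pos < len(tokens) and tokens[pos] in OPS:
            op = tokens[pos]
            prec, assoc = OPS[op]
            if prec < min_prec:
                break
            pos += 1
            # Left-assoc: the right operand may only hold tighter operators.
            # Right-assoc: it may also hold the same operator again.
            right = expr(prec + 1 if assoc == "left" else prec)
            left = f"({op} {left} {right})"
        return left

    result = expr(1)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected {tokens[pos]!r}")
    return result
```

With this table, to_sexpr("8 - 3 - 2") groups to the left while to_sexpr("2 ** 3 ** 2") groups to the right; the only difference between the two cases is the minimum precedence passed down for the right operand.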

Some programmers complain about not knowing the precedence hierarchy of languages such as C++ and Java, though many others have no problems. IDEs can't display implicit parenthesizing as visually as they can suggest method names, and I suspect programmers who don't really understand APIs are the ones who have problems with precedence hierarchies. If in doubt, parenthesize it! Some languages reduce the number of levels to help programmers, e.g. Go has only 5 precedence levels for binary operators, putting bitwise AND (&) at the same level as *, and bitwise OR (|) together with +. Haskell has 10 precedence levels (0 to 9), including the $ operator, whose sole job is to force groupings with a single character simply by being at the lowest precedence, in the same way parentheses force groupings with a pair of characters by being of highest precedence.
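As a rough illustration of making implicit parenthesizing visible, the following Python sketch uses the standard ast module to print an expression with every binary grouping spelled out. It shows Python's own hierarchy, not that of C++, Java, or Go, and the function name parenthesize is an assumption of mine. It needs Python 3.9 or later for ast.unparse, and anything other than binary and boolean operators is printed unchanged, including operators nested inside it.

```python
import ast

# Source spelling of each binary operator node in Python's AST.
BINARY = {
    ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/",
    ast.FloorDiv: "//", ast.Mod: "%", ast.Pow: "**", ast.MatMult: "@",
    ast.BitAnd: "&", ast.BitOr: "|", ast.BitXor: "^",
    ast.LShift: "<<", ast.RShift: ">>",
}
BOOLEAN = {ast.And: "and", ast.Or: "or"}

def parenthesize(source):
    """Return `source` with the grouping of its binary operators explicit."""
    def show(node):
        if isinstance(node, ast.BinOp):
            op = BINARY[type(node.op)]
            return f"({show(node.left)} {op} {show(node.right)})"
        if isinstance(node, ast.BoolOp):
            joiner = f" {BOOLEAN[type(node.op)]} "
            return "(" + joiner.join(show(v) for v in node.values) + ")"
        # Names, literals, calls, comparisons, unaries: left as written.
        return ast.unparse(node)

    return show(ast.parse(source, mode="eval").body)
```

Running parenthesize("a | b & c + d") shows that Python, unlike Go, keeps + above & and & above |, which is the kind of difference between hierarchies that trips people up when moving between languages.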

In math, equality and inequality symbols == != < <= > >= come after arithmetic operators, followed by boolean NOT, boolean AND, then boolean OR. Apart from unary NOT (!), these operators in C-syntax languages typically follow the same precedence and associativity as in math. A typical hierarchy below the arithmetic operators is that found in Codehaus Groovy or C#:
  • shift operators (left-assoc) << >> >>> (and range .. ..<)
  • comparison (left-assoc) < <= > >= (and instanceof, as, in, !in)
  • equality (left-assoc) == != <=> === !==
  • match (left-assoc) =~ ==~ !~ !=~
  • boolean (left-assoc, at different precedences) & ^ | && ||
  • ternary conditional (right-assoc) ? : (and null coalescing op)
  • assignment (right-assoc) = **= *= /= %= += -= <<= >>= >>>= &= ^= |= &&= ||= etc
  • arrow (left-assoc) =>

The way to remember such groupings is to remember them as groups and sub-groups, e.g. "Parens, Unaries, Arithmetic, Equality, Boolean, Assignment" as the groups, then sub-groups within each group. Most of it is intuitively obvious.

It can even be extended quite intuitively so programmers can still remember it easily:
  • regex-style postfix unary ops * + ? with the other postfix ops
  • macro expansion ~@ and ` prefix unary ops with the other prefix ops
  • <<< and <<<= at the same respective precedences as >>> and >>>=
  • add range operators <.. <..< with the others
  • extend the boolean operators to nine & ^ | && ^^ || &&& ^^^ ||| each at a different precedence
  • add &^ at the same precedence as & (for Go's bit-clear)
  • other assignments := and <- and <-> at the same precedence as =
  • other arrows -> <~ ~> at the same precedence as =>
  • the comma , and semicolon ; can be considered left-associative operators at suitably low precedences
  • the Haskell-style $ operator at the lowest precedence above the semicolon, very useful for quick ad hoc edits

We can change certain infix operators from left-associative to right-associative by appending a colon, as in Scala, e.g. +: -: *: /: %:
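The effect of the colon rule on grouping can be sketched in a few lines of Python. The helper group_chain is hypothetical and only shows how a chain of one repeated operator would nest; it does not model the fact that Scala also dispatches a colon-suffixed operator as a method of its right operand.

```python
def group_chain(operands, op):
    """Group `a op b op c ...` into an S-expression string.

    Assumed rule: an operator ending in ':' is right-associative,
    every other operator is left-associative.
    """
    if not operands:
        raise ValueError("need at least one operand")
    if op.endswith(":"):
        # a -: b -: c  ->  (-: a (-: b c))
        result = operands[-1]
        for item in reversed(operands[:-1]):
            result = f"({op} {item} {result})"
    else:
        # a - b - c    ->  (- (- a b) c)
        result = operands[0]
        for item in operands[1:]:
            result = f"({op} {result} {item})"
    return result
```

So group_chain(["a", "b", "c"], "-") nests on the left, while the same operands with "-:" nest on the right.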

I-expressions

In C-syntax, statements are separated by semicolons between curlies {}, list items by commas between brackets [], and arguments by commas between parens (). Python replaces the statement delimiters with indentation, which has pros and cons. There are no dangling closing delimiters at the end of statement groups, but whitespace management can be more difficult when pasting code. Haskell allows both styles to be mixed, though it prefers the indentation style. Programmers using C syntax virtually always use indentation anyway for clarity. The Groovy reboot will enable both styles, without even stating a preference. Programmers should always have the choice without being told by busybody "project managers" what to do.
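A minimal Python sketch of the indentation idea follows: it turns an indented outline into S-expressions. The rules are simplified assumptions of mine rather than any published I-expression specification: indentation is counted in spaces only, the tokens on a line plus its more deeply indented children form one list, and a lone token with no children stays an atom.

```python
def indent_to_sexpr(text):
    """Convert indentation-structured text into S-expression strings."""
    lines = [line for line in text.splitlines() if line.strip()]
    pos = 0

    def depth_of(line):
        # Leading spaces only; tabs are not handled in this sketch.
        return len(line) - len(line.lstrip(" "))

    def block(indent):
        nonlocal pos
        forms = []
        while pos < len(lines) and depth_of(lines[pos]) >= indent:
            line = lines[pos]
            depth = depth_of(line)
            pos += 1
            parts = line.split()
            # Lines indented deeper than this one become its children.
            if pos < len(lines) and depth_of(lines[pos]) > depth:
                parts += block(depth_of(lines[pos]))
            if len(parts) == 1:
                forms.append(parts[0])
            else:
                forms.append("(" + " ".join(parts) + ")")
        return forms

    return "\n".join(block(0))
```

Given the four lines "if", "  < x 0", "  - x", and "  x", this yields (if (< x 0) (- x) x): the closing parens that would otherwise pile up at the end are implied by the return to a shallower indent.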

Primary expressions

Both Lisp- and Algol-derived languages can use a common set of literals and other primary expressions:
  • null a.k.a ()
  • booleans true, false
  • numbers in various formats
  • characters
  • strings/regexes
  • arrays/vectors
  • cons-lists
  • associative maps
  • closures
  • object creation

Variables naming such expressions are usually alpha followed by alphanumerics, but modern languages cater for imported names by using strings, e.g. "set-to-date"(1.5). This would be necessary when referencing Clojure names using Groovy syntax.

Unicode has come...

One reason Steele and Gabriel believe M-expression syntax hasn't caught on is "there aren't enough special symbols to go around. When our domain of discourse is limited to numbers or characters, there's only so many operations of interest, so it's not difficult to assign one special character to each and be done with it. But Lisp has a much richer domain of discourse, and a Lisp programmer often approaches an application as yet another exercise in language design; the style typically involves designing new data structures and new functions to operate on them --perhaps dozens or hundreds-- and it's too hard to invent that many distinct symbols (though the APL community certainly has tried). Ultimately we must always fall back on a general function-call notation; it's just that Lisp programmers don't wait until they fail".

Unicode has come, of which the APL symbols are a small subset. There are now more than enough special symbols to go around, but of course they're difficult to enter, a problem which needs to be solved. When it is solved, we won't always need to fall back on a function-call notation: an Algol-like syntax such as Groovy atop Clojure could catch on.

When we introduce the Unicode symbols and punctuation, we'll insert them at the most intuitive places in the operator hierarchy so programmers can remember their precedence and associativity easily.

Another reason Steele and Gabriel give is "more importantly, Algol-style syntax makes programs look less like the data structures used to represent them. In a culture where the ability to manipulate representations of programs is a central paradigm, a notation that distances the appearance of a program from the appearance of its representation as data is not likely to be warmly received".

Perhaps seasoned Lisp programmers wouldn't want to use such syntax, but what about those moving closer to Lisp or Clojure from other blubbier languages?

Last edited Apr 10, 2012 at 8:57 AM by gavingrover, version 2
