Gavin Grover's GROOVY Wikiblog

16 January 2017

Designing Gro

My previous post on Go keywords from 17 December 2016 shows how all keywords could be removed from a future version of Go. The keywords that take first position in a statement, such as if and for, could keep their roles without actually being keywords if bare statements weren't allowed. This would need something like the word do being required before all assignments, definitions, and function calls. The downside is that simple statements such as short variable declarations become a little wordier. So:

func dream () {
	do a := 123 //instead of: a := 123
	do callMe(a) //instead of: callMe(a)
	do package := a //instead of a syntax error
}


Gro has the same dilemma if we want it to use any lower-case-initial name for identifiers, but also if we want to use extra keywords, such as assert, to begin statements in our syntax. It's difficult to modify the syntax to allow, say, assert to be used as a keyword yet continue to allow it as an identifier.

At first sight, we don't have this dilemma for new top-level keywords because only 6 keywords, import, func, type, const, var and package, are allowed to begin top-level declarations. But we do have it here because Gro intends to extend Go in the same way Apache Groovy extends Java, which means we want to allow all statements to be placed at the top level, so they'll be automatically wrapped inside a main () or init () function. If we want to allow new special identifiers, which are "keywords" at the expression level, or new type aliases, it's almost impossible unless we make multiple passes through the parser or fiddle with the AST afterwards.
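As a rough illustration of the wrapping step only, here is a minimal sketch in Go. The helper name wrapInMain and the file name script.go are hypothetical and not part of Gro; the sketch pastes the statements into a main function and uses the standard go/parser package to check that the result at least parses.

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
)

// wrapInMain is a hypothetical helper: it wraps bare statements in a main
// function and returns the resulting source if it parses as a Go file.
func wrapInMain(stmts string) (string, error) {
	src := "package main\n\nfunc main() {\n" + stmts + "\n}\n"
	// Parse only; no type checking is done in this sketch.
	if _, err := parser.ParseFile(token.NewFileSet(), "script.go", src, 0); err != nil {
		return "", err
	}
	return src, nil
}

func main() {
	src, err := wrapInMain("a := 123\nprintln(a)")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Print(src)
}
```

This handles only plain statements. Deciding whether a top-level line is a declaration or a statement is the hard part described above, and the sketch doesn't attempt it.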

Of course we could extend the Go syntax differently by introducing new symbols and punctuation, but we also want Gro to be future-proof against possible future expansions to Go's version 1.x syntax. Go 1.9 expands on the Go syntax by introducing an = in type declarations, and Go 1.10 to Go 1.99 could expand on Go's syntax in many unpredictable ways with new symbols and punctuation. It seems impossible to have a syntax for Gro that both allows existing and plausible future Go 1.x code to be embedded seamlessly in it, and enables many new keywords to begin declarations, statements, built-in functions, and types, let alone one that allows any lowercase-initial name to be used as an identifier.

The next version of Gro will allow files ending in suffix .gr to use English keywords such as assert and source at the expense of prohibiting naked statements. All statements must begin with a keyword which can be do, so Go code, i.e. that conforming to the syntax of .go files, can't be embedded (unless a syntax "go" statement is used). But for actual Gro code, that with suffix .gro, we have that dilemma. Apache Groovy (in its pre-Apache days) solved this for Groovy 1.0 and 1.5 by changing the meaning of small amounts of Java syntax, e.g. ==. Perhaps its backers had already decided at that time they intended for Groovy to later on be a replacement for Java instead of always just a complement to it. Gro will always complement Go, never compete with it. Groovy 1.6 introduced the @Annotation used by Java for adding new functionality, but Golang doesn't have such annotation hooks in its syntax.

Gro solves this dilemma by prohibiting Unihan in identifier names in the grammar. So myName is valid in both Go and Gro, but my名 and 性名 are invalid in Gro. Virtually no-one uses Unihan in identifiers anyway, only inside strings and comments, so in practice this shouldn't be a problem for anyone wanting to program in Gro. (When Unihan is required in an identifier name, use 引, e.g. 引"X世界" to represent X世界.) We thus free up all of the more than 75,000 Unihan in Unicode for all sorts of extra uses in the Gro grammar, and enable Go 1.x code, both present and what's plausible in the future, to be embedded in Gro code.
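The rule can be sketched in a few lines of Go. This is only an illustration, assuming that "Unihan" corresponds to the Han script table in Go's standard unicode package; containsHan is a hypothetical name, not something taken from the Gro source.

```go
package main

import (
	"fmt"
	"unicode"
)

// containsHan reports whether name contains any character from the
// Unicode Han script (an assumed stand-in for "Unihan").
func containsHan(name string) bool {
	for _, r := range name {
		if unicode.Is(unicode.Han, r) {
			return true
		}
	}
	return false
}

func main() {
	for _, name := range []string{"myName", "my名", "性名"} {
		fmt.Printf("%s: allowed as a Gro identifier = %t\n", name, !containsHan(name))
	}
}
```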

One extra use for the Unihan is to enable all possible Gro code to be written without any whitespace at all. Another use is to have a more intuitive mapping between lexical tokens and semantic use. So all lowercase-initials are local variables, all uppercase-initials are exported names, and all Unihan are syntactic control tokens.

The .gro code for the .gr code above is:

功dream(){
	a:=123
	callMe(a)
	做package:=a //做 required here because Go keyword used as identifier
}


which could be stored in the .gro file as:

功dream(){a:=123;callMe(a);做package:=a}


and only formatted with whitespace when it's displayed, at the same time it's colorized.


26 September 2016; updated 17 December 2016

Go keywords

The Syntax section of Golang's first committed draft of its spec says:

The syntax of Go borrows from the C tradition with respect to statements and from the Pascal tradition with respect to declarations. Go programs are written using a lean notation with a small set of keywords, without filler keywords (such as 'of', 'to', etc.) or other gratuitous syntax, and with a slight preference for expressive keywords (e.g. 'function') over operators or other syntactic mechanisms. Generally, "light" language features (variables, simple control flow, etc.) are expressed using a light-weight notation (short keywords, little syntax), while "heavy" language features use a more heavy-weight notation (longer keywords, more syntax). (Emphasis added)

Golang eliminated many keywords that are in other languages such as Java and C# by changing many of them to special identifiers that can be reassigned to:

true := false

so the Go lexer and parser only need 25 keywords. It's good that programmers who don't use clunky IDEs only need to remember not to use 25 names as identifiers, rather than 52 or even 78 of them. But the ideal is no keywords. Because the first draft of the spec says Go has a "slight preference for expressive keywords over operators or other syntactic mechanisms", one wonders if someone intends for the Go 2.x syntax to eliminate all keywords. There are 6 symbols @ # $ ~ ? \ not used by Go 1.x so perhaps they're reserved for replacing keywords in Go 2.x.
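To see the difference between a keyword and a special (predeclared) identifier, here is a small sketch that compiles as ordinary Go 1.x. The function name shadowedTrue is made up for the example.

```go
package main

import "fmt"

// shadowedTrue declares a local variable named true. This compiles because
// true is a predeclared identifier, not a keyword.
func shadowedTrue() bool {
	true := false
	return true // returns the local variable, whose value is false
}

func main() {
	fmt.Println(shadowedTrue())
}
```

Trying the same with a real keyword, e.g. package := 1, is rejected by the parser.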

Gro would be easier to implement if there were no keywords in Go because it wouldn't need to add an underscore to the front of names when generating Go code.
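A minimal sketch of that renaming step, assuming only that colliding names get a leading underscore as described above. The helper name mangle is hypothetical; the keyword test uses the standard go/token package.

```go
package main

import (
	"fmt"
	"go/token"
)

// mangle is a hypothetical helper: it prefixes an underscore to any name
// that collides with a Go keyword and leaves every other name untouched.
func mangle(name string) string {
	if token.Lookup(name).IsKeyword() {
		return "_" + name
	}
	return name
}

func main() {
	for _, name := range []string{"a", "package", "range", "true"} {
		fmt.Printf("%s -> %s\n", name, mangle(name))
	}
}
```

Note that true is left alone, because it's a special identifier rather than one of the 25 keywords.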

Some of Go's 25 keywords are already superfluous, e.g. if map were absent, the correct semantics could still be deduced by the Go parser. Other keywords could be replaced with new operators composed of existing symbols, e.g. <-chan could be replaced by <-, chan<- by ->, and chan by <->. Yet others could be replaced by the new symbols, e.g. struct could be replaced by # and interface by @.

Because the 6 top-level keywords, import, func, type, const, var and package, are the only identifiers that can take first position in a top-level declaration, they could keep their roles without actually being keywords if their other uses, type in a switch statement and func in a function literal, were removed. .(type) is superfluous because it can be inferred if types are used as cases. func in a function literal could be replaced by $. The package keyword could even be made into a directive comment, like the //+build directive comment, because it's not really part of the language semantics. Identifiers could be named, say, const and there'd be no confusion with the use of const in first position of a top-level declaration.

The 11 keywords that take first position in a statement, switch, if, for, select, go, defer, break, continue, return, fallthrough and goto, could similarly keep their roles without actually being keywords if bare statements weren't allowed. Perhaps the word do could be required before all assignments, definitions, and function calls. First position keywords, case and default, could also keep their roles. The mid-statement keywords, if's else and for's range, are superfluous. The if and switch keywords could even be merged into one.

I've only offered a few rough suggestions to show that it's feasible to eliminate all the keywords from Go so identifiers could use any lowercase-initial name. I suspect the core Go developers already have their own ideas for the Go 2.x replacements.


10 October 2016

Groovy's (and Go's?) TIOBE Fraud

Apache Groovy PMC chair Guillaume Laforge boasts in the Health section of his May 2016 report to the quarterly ASF Directors Meeting: "Since the beginning of the year, Groovy has been in the TIOBE programming language index in the top 20 most popular languages. This month of May, Groovy is ranked 17th most popular language."

The TIOBE Index for October 2016 opens "Who are the candidates for the title of programming language of 2016? There are only 2 languages with an increase of more than 1% if compared to the same period last year, i.e. Go and Groovy. Note that Groovy ended 2015 with a bang, so its annual growth will be much less around January 2017. Google's Go language seems to be unrivalled."

The bang Groovy ended 2015 with was a 2-month jump from 0.328% to 1.182%. Go jumped from 0.160% to 1.625% in a recent 2-month period. Are these languages really jumping in popularity?

I've become concerned someone will boast of Go's position in the TIOBE rankings, and later embarrass the Go developers, so I've finally taken 15 minutes to see what's going on by using a subset of search engines and languages in TIOBE's calculation instructions...

search term    bing.com   baidu.com  wikipedia.org "content"  yahoo.com
Groovy           41,500     596,000                       18      7,200
GPATH                 1   5,690,000                        2          0
GSQL                  0       2,300                        2    151,000
Groovy++              6     895,000                       18          2
Groovy TOTAL     41,507   7,183,300                       40    158,202
Go              458,000  18,400,000                       81     53,500
Golang           35,600     340,000                        2      3,230
Go TOTAL        493,600  18,740,000                       83     56,730
Python        1,930,000   5,930,000                      351    358,000
Ruby            549,000   5,320,000                      162     73,300
Java          3,750,000   7,550,000                    1,026    649,000
Scala           133,000   1,630,000                       41    133,000
Clojure          58,500     536,000                       12      6,090
Kotlin           16,200      69,500                        9      2,540


What we see here are some extreme irregularities in TIOBE's calculations for Go and Groovy:
  • +"GPATH programming" giving exorbitantly high figures for Groovy in Baidu
  • +"Groovy programming" in Baidu counted twice (also as +"Groovy++ programming")
  • +"GSQL programming" giving exorbitantly high figures for Groovy in Yahoo
  • +"Go programming" giving exorbitantly high figures for Go in Baidu

The most accurate rankings seem to be in Bing, which puts Groovy a little lower than Clojure. For Laforge to boast of Groovy's Top 20 ranking in TIOBE to the Apache Directors when it's probably below #50-ranked language Clojure is a major distortion. It's unlikely he didn't know about the erroneous calculation, which would make his distorted report a deception. Let's hope the Go developers don't bring a similar shame on themselves by quoting TIOBE.


2 October 2016

Purify Groovy's PMC

When James Strachan founded the Groovy language at Codehaus in 2003, technical people made up 100% of its top leadership. After programmer Jochen Theodorou and manager Guillaume Laforge brought the despotry total to three in 2005, that proportion dropped to 67%. In 2010 when the number of despots increased to five, consecutive non-participators Codehaus admin Ben Walding and Grails rep Graeme Rocher brought the proportion down to 60%.

When Groovy switched to being managed by the Apache Software Foundation (ASF) a year ago (Nov 2015), the number of project managers increased to nine, and only four of them (Theodorou, Paul King, Cedric Champeau, and Pascal Schumacher) have participated in its technical development since then. None of the other five Project Management Committee (PMC) members have any history of participation in the commits or notifications mailing lists (for Github commits or Jira changes) since Groovy moved to Apache. The proportion of technical people in Groovy's top leadership has thus dropped to an all-time low of 44%.

When VMware retrenched its Groovy and Grails developers in March last year (2015), I could sense this travesty coming. The only reason Laforge survived an overthrow by Rocher is because he personally owned the groovy-lang.org domain name at the time. To protect himself against those who were actually doing the technical grunt work building Groovy, Laforge found four like-minded "business" people at the ASF (i.e. Jim Jagielski, Andrew Bayer, Konstantin Boudnik, Roman Shaposhnik) to sponsor Groovy into the Apache incubator. When Groovy became a top-level project, the 5 politicians outnumbered the 4 real contributors in Groovy's PMC, and Laforge got voted in as chairperson.

It was because of that very scenario I issued article 1 of the Groovist Manifesto in March last year, that Groovy should be managed by its technical people. I hereby call on everyone who hasn't contributed a GitHub commit, or even a Jira comment, in the past year to unilaterally leave the Groovy PMC now! You have no business being there. Let the next election for PMC chair be voted on by its technical contributors only, and with candidates who actually do the hard work improving Groovy's codebase.

[Image: Repository timeline.gif]


6 October 2016

Groovy's Comments and Strings

(content originally published in September 2007, but recovered here from gavingrover.blogspot.com)

One of the messiest parts of programming languages is the syntax of comments and strings, which seems to be an afterthought in many cases. Let's look at Apache Groovy's, an extension of Java's, as a case study...

Lexing Logic

Apache Groovy has a UnicodeEscapingReader, which reads \u, then maybe more u's, followed by a 4-digit hex number, eg, \uCDEF, \uuu89AB. In Java, for the \u to be interpreted as a Unicode escape, it may not be immediately preceded by an odd number of other consecutive backslashes. The reason there may be more than one u is so tools can convert Unicode files into ASCII-only files, and back again. When converting to ASCII, an extra u is added to the hex codes, so \uCDEF becomes \uuCDEF, and \uuu89AB becomes \uuuu89AB, and the Unicodes in the text are converted to, say, \uCF73. This makes it straightforward converting such an ASCII file back to Unicode. To represent a supplementary Unicode character, two escaped surrogates are needed.
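The decoding rule can be sketched as follows. This is a simplified illustration written in Go, not a port of Groovy's UnicodeEscapingReader: the function name decodeUnicodeEscapes is hypothetical, and the sketch doesn't combine surrogate pairs into supplementary characters.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// decodeUnicodeEscapes replaces a backslash, one or more u's and four hex
// digits with the character they name, unless the backslash is immediately
// preceded by an odd number of consecutive backslashes.
func decodeUnicodeEscapes(src string) string {
	var out strings.Builder
	backslashes := 0 // consecutive backslashes copied just before position i
	for i := 0; i < len(src); {
		if src[i] != '\\' {
			out.WriteByte(src[i])
			backslashes = 0
			i++
			continue
		}
		j := i + 1 // skip over the run of u's
		for j < len(src) && src[j] == 'u' {
			j++
		}
		if backslashes%2 == 0 && j > i+1 && j+4 <= len(src) {
			if n, err := strconv.ParseUint(src[j:j+4], 16, 16); err == nil {
				out.WriteRune(rune(n))
				backslashes = 0
				i = j + 4
				continue
			}
		}
		// Not an escape: copy the backslash through unchanged.
		out.WriteByte('\\')
		backslashes++
		i++
	}
	return out.String()
}

func main() {
	fmt.Println(decodeUnicodeEscapes(`\u0041 \uuu0042 \\u0043`))
}
```

The third item in main is left alone because its second backslash is preceded by one other backslash.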

Next up, whitespace (formfeed, tab, and space) and comments are stripped out. Single-line comments run from // until end of line or \uFFFF; multi-line comments begin with /* and end with */. The file may have OS-specific characteristics: if the first line begins with #!, it's ignored; if the last line ends with control-Z, it's ignored. The end-of-lines are then removed, unless immediately preceded by backslash. Each line may be ended by a newline, a carriage-return, or both, depending on the operating system. The /* style comments may begin with /** and contain html and @ markups, so another parser, javadoc, can process them.

A string literal has many forms. They can be surrounded by single-quotes, double-quotes, triple-single-quotes, triple-double-quotes, or forward-slashes. The triple-quoted strings can be spread across many lines. Escaped characters inside strings are \n, \r, \t, \f, \b, \\, octal numbers from \0 to \377, \$ inside triple-quoted strings, \" inside single-quoted (or triple single-quoted) strings, and \' inside double-quoted (or triple double-quoted) strings.

When an unescaped ${ is found within a double-quoted string, end-quote-marks in the text following are ignored until the matching } is found, with nested curlies taken into consideration. The embedded blocks in the lexed GStrings are parsed a little later. They are a special case where Apache Groovy code can be embedded in strings, and strings in Apache Groovy code, perhaps in a mutually recursive manner. For example, Apache Groovy code embedding a string embedding Apache Groovy code embedding a string embedding Apache Groovy code embedding a double-quoted string:

"hello, ${def s='pqr'; "wo${s-"pq"}ld"}"


Little Languages

Regexes have a syntax so different from Apache Groovy's that the only way to implement them is to quote them as a string. They can be encoded within the normal quote-marks. The 12 special characters {[().\^$|?*+ in the regex syntax need to be escaped with a backslash if used literally, including the backslash itself. Because the backslash is also a special character needing escaping in strings, we need four backslashes in a normally quoted regex pattern to match one literal backslash. Because of incompatibilities such as this, Apache Groovy has optional special forward-slash delimiters / / for regexes, with their own regex-friendly escaping rules. Only the forward-slash needs to be backslashed, though the regex cannot end with a backslash. However, for now they're only single-line capable, so to embed multi-line regexes in / / delimiters, we need to add a backslash at the end of each line.
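The same doubling of backslashes shows up in other languages. As an analogy only (this is Go, not Groovy), the sketch below writes the regex for one literal backslash twice: once in an interpreted string, where four backslashes are needed, and once in a raw string, which plays a role similar to Groovy's / / delimiters.

```go
package main

import (
	"fmt"
	"regexp"
)

// Both patterns are the same two-character regex \\, which matches one
// literal backslash.
var (
	interpreted = regexp.MustCompile("\\\\") // interpreted string: four backslashes
	raw         = regexp.MustCompile(`\\`)   // raw string: two backslashes
)

func main() {
	path := `C:\temp\file`
	fmt.Println(interpreted.String() == raw.String()) // same pattern either way
	fmt.Println(len(raw.FindAllString(path, -1)))     // number of backslashes in path
}
```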

There are many other little languages within Apache Groovy and Java that must be quoted as strings, eg, the printf / Formatter syntax. Each such string-embedded language introduces more conflicts with Apache Groovy's string delimiting and escaping rules. For example, regex has a replaceAll() method requiring back-references to be prefixed with $, the same symbol GString uses for interpolation, which means they can't be used together elegantly. One of the goals of Apache Groovy is to allow such languages to be embedded as part of the syntax as a DSL (domain-specific language), but many of them are so syntactically different, they must be embedded as strings (eg, regex) or comments (eg, javadoc).

Limitations

Strings are often used instead of comments when programming. When we need to comment out many lines of code, and there's already a /* */ comment in that code, we need to embed it in triple-quotes as an unused string. (Though in fact, if we know the code is syntactically correct, we can also embed it in an uncalled closure.)

We can use triple-quotes to enclose data containing single-quotes, but if the data has backslashes, it often won't compile, eg:

def t='''
abc\defg
'''


We need to backslash-escape the backslashes, which can be quite a nuisance for large items of text. Even when the data has valid backslashed escape codes, such as program code, we still need to backslash-escape the backslashes if we want that program code to keep the same form when assigned to a variable and printed out. It would be nice if we could quote text without needing to go through and add backslashes, just as we can comment it out without doing so.

So in Apache Groovy, strings allow more nesting than comments, while comments inhibit more escaping than strings. What we need is the best of both worlds for comments and strings. Perhaps they could be the same syntactic entity in a programming language, having the same delimiters and same corresponding rules for escaping, with only some extra indicator for comments. Apache Groovy doesn't store comments in its AST (abstract syntax tree), but there have been hints that it may in the future. If this happens, that extra indicator would also indicate what the comment is attached to.

Levels of Embedding and Indenting

What if we need to string up data that contains triple-quotes? We would need septo-quotes (""""""" or ''''''') to enclose it. That may be readable, but to enclose strings containing septo-quotes by using 15 quote marks at each end wouldn't be. We need some other solution to handle such cascading levels of embedding. Such a solution could be tied in with the levels of escaping in the UnicodeEscapingReader.

Often we need to embed a multiline string within code, which upsets our indenting and readability:

def m(){
  try{
    def c={
      def s="""\
some
data
here
"""
    }
  }
}


We can fix this by stripping spaces off the beginning of each line of a string when we run the program, but a lexical solution would be cleaner.
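The run-time workaround might look something like the following sketch, written here in Go. The helper name stripIndent is hypothetical; it counts leading spaces and tabs, and removes the smallest count found on any non-blank line from every line.

```go
package main

import (
	"fmt"
	"strings"
)

// stripIndent removes from each line the number of leading spaces and tabs
// that every non-blank line of s has in common. Blank lines become empty.
func stripIndent(s string) string {
	lines := strings.Split(s, "\n")
	indent := -1 // smallest indent seen so far; -1 means none yet
	for _, line := range lines {
		trimmed := strings.TrimLeft(line, " \t")
		if trimmed == "" {
			continue // blank lines don't constrain the indent
		}
		if n := len(line) - len(trimmed); indent < 0 || n < indent {
			indent = n
		}
	}
	for i, line := range lines {
		if strings.TrimLeft(line, " \t") == "" {
			lines[i] = ""
		} else {
			lines[i] = line[indent:]
		}
	}
	return strings.Join(lines, "\n")
}

func main() {
	data := `
        some
          data
        here
`
	fmt.Print(stripIndent(data))
}
```

With a helper like this the string can be indented along with the surrounding code, at the cost of a call at run time.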

Delimiters

There are many conflicts between the different delimiter syntaxes for strings and comments. For example, if a string delimited by slashes / / includes an asterisk * as its last character, as regexes often do, it can't be embedded within /* */ style comments. There's also redundancy of string delimiters: different delimiters are used in Apache Groovy code for single-line and multi-line strings, even though one delimiter would be enough.

Not only is there redundancy of delimiters, but there's a shortage of functionality also. Just as we have a comment type (//) delimited by end-of-line, it would be useful to have one delimited by end-of-file. I'm forever going to the bottom of my code and adding or removing a */ close-comment marker. Could strings also have a version delimited at end of line or end of file? A common use case for Apache Groovy scripts is to put the program code at the top, and the data to be read as a large string at the bottom, and such a use case would benefit from a string delimited by the end-of-file.

In fact, even characters only need opening quotes, they being only one token. Java never needed to have a close-single-quote for characters: using 'a to indicate the character a would have been enough. Apache Groovy, requiring 'a' as char, has made this more verbose instead of terser. And there's no syntactic reason why escaped characters such as \n need to be enclosed in single-quotes ('\n') instead of standing alone (\n) in the syntax.

Could there be a style of comment sensitive to the syntax? We could comment out a multi-line statement by marking its beginning only, instead of the beginning of every line with //, or both beginning and end with /* and */.


6 October 2016

Symbols for Real Groovy

(content originally published in February 2008, but recovered here from gavingrover.blogspot.com)

Parsing technology has advanced rapidly over the last decade or two, so now programmers can easily define languages using recursive-descent definitions, which perform almost as well as hand-coded bottom-up parsers. Java's Antlr and Haskell's Parsec spring to mind. It will become increasingly common for programmers to use their own custom languages to write code, just as they presently use their own IDE settings and code formatting, and even choose their own IDEs. Program code they write will be generated from its AST-form into a standard syntax and format before being shared.

Real Groovy is a custom programming language I'm writing, for myself to use, that writes to the Apache Groovy API. I intend it to be much terser than Apache Groovy, so all grammar is represented by symbols, such symbols optionally overloadable with names from any natural language, and the JDK/GDK English-language names optionally aliasable by names in any other natural language. When I'm finally using this language myself, I'll release it as open source so any others interested can copy and modify it to define their own custom languages for the Apache Groovy API also.

I've been looking at the syntax of other programming languages to get some ideas on what extra symbols I can put in Real Groovy. I've previously blogged about Real Groovy's symbolic replacements for keywords in Apache Groovy, with some very tentative examples. Let's look now at other possibilities, first considering the existing symbols in Apache Groovy. (I've ignored symbols with alphanumerics here, such as \t or 0xFF, as there'd be too many to list.)

Java has 50 such symbols:
\ various escapes
/* */ comment
// comment until end of line
; separate statements, etc
" " quoted string
' ' character (Java); quoted string (Apache Groovy)
.* all classes in package, or methods in class
- unary minus, binary minus
++ in-place increment, both prefix and postfix
-- in-place decrement, both prefix and postfix
% modulus
/ divide
* multiply
& binary and
^ xor
| binary or
! logical not
~ bitwise negate
!= not equals
>>> big right shift
>> right shift
<< left shift
== equals
= assign
-= assign with minus
+= assign with plus
&= assign with logical and
>>>= assign with big right shift
>>= assign with right shift
<<= assign with left shift
%= assign with mod
/= assign with divide
*= assign with multiply
|= assign with logical or
^= assign with xor
( ) method or closure call, etc
[ ] subscript, in-place list or map def, etc
{ } enclose statements block, closure def, etc
? : conditional expression
: labels; separate map key and value
@ annotation
. path to next element; decimal point
... variable args
+ unary plus, binary plus
> greater than, close generic brackets
<= less than or equal to
< less than, open generic brackets
>= greater than or equal to
&& logical and
|| logical or


Apache Groovy, its developers wanting to avoid "perlishness", only adds another 20:
[:] empty map
=~ regex find
==~ regex match
<=> compare
** power
**= assign with power
.. inclusive range
..< exclusive range
.@ path to field, never to property
-> closure parameters
?: Elvis operator
?. null-safe path to next element
*. spread path to next element
.& path to method pointer
#! comment at beginning
$ interpolate string
/ / quoted string
""" """ quoted multi-line string
''' ''' quoted multi-line string
, separate list elements, etc


Java 1.4, wanting to keep up with other languages, added the standard syntax of regexes. Because they're such an integral component of Apache Groovy, they should be considered part of Apache Groovy's syntax. Regex uses 31 symbols, though many are also used in Apache Groovy proper:
. match any character
{ } match exact number of times
{ , } match in range of times
[ ] character class
( ) capturing group
\ various escapes; quote following character
^ character class not; match beginning of line/input
$ match end of line/input
| alternation
? greedy option
* greedy repetition (zero or more)
+ greedy repetition (one or more)
(? ) set flag/s
(?- ) unset flag/s
# comment
- character class range
&& character class intersection
(?: ) non-capturing group
(?> ) atomic group
?? lazy option
*? lazy repetition (zero or more)
+? lazy repetition (one or more)
{ , }? lazy match in range of times
?+ possessive option
*+ possessive repetition (zero or more)
++ possessive repetition (one or more)
{ , }+ possessive match in range of times
(?= ) lookahead
(?! ) negative lookahead
(?<= ) lookbehind
(?<! ) negative lookbehind

Remember I'm not counting the huge number of mixed symbol/alphanumeric combinations, such as \Z and (?x).

Formatter syntax (aka printf), added in Java 5, gives many more symbols, though most are mixed with alphanumerics:
% substitute value
% $ substitute specified value
%< substitute previous value

There's also a Java internationalization API defined in java.text, with its own little formatting language, though I suspect most Java/Groovy programmers prefer to use the Formatter/printf syntax, being familiar with it already from Unix and C.

There's other miscellaneous places where symbolic syntax is added to Apache Groovy:
/** */ Javadoc comments
@ various extra info in Javadoc
$ match reference in regex replacement string


Some other JVM languages considered to various degrees to be Apache Groovy's peers might have symbols I could use in Real Groovy. The syntax of JavaScript, embedded in Java 6, either copies Java's or is copied by Apache Groovy. The only unique symbols I could find:
=== strict equals
!== strict not equals


Scala, although statically typed, does have inferred typing and is much terser than Java. Most of its symbols are names within its class libraries. The only syntactic symbols are:
<: upper bound generic type
>: lower bound generic type
<% view upper bound generic type
# static path selection
=> call by name; etc
@ bind to following pattern
<- for-comprehensions
_ wildcard pattern

The <: and >: symbols might be nice Real Groovy replacements for Apache Groovy's extends and super keywords within generics.

JRuby has some more symbols:
# comment to end of line
? character
?\ - control and/or meta character
\ various escapes
' ' string without interpolation
" " string with interpolation
` ` string with echoed interpolation
% string/array/regex follows
<< here-doc with interpolation
<<' here-doc without interpolation
<<" here-doc with interpolation
<<- here-doc with indented end-marker
: symbol
:' ' symbol without interpolation
:" " symbol with interpolation
.. inclusive range
... exclusive range
[ , ] arrays
{ => , } hashes
$global global variable
@@class class name
@instance instance variable
$ various predefined globals
$x various predefined globals, for punctuation x
__XXXX__ various predefined variables
:: class::member
!~ similar symbols to Apache Groovy, but also !~
[< ] superclass in class defn

Much programming language syntax has already been catalogued elsewhere. I'm comparing symbols in different programming languages to see which are the generally accepted ones. I believe programming language symbols will merge towards a single standard, just as math symbols have over the centuries. The present C++/Java symbols are in that standard, being the most popular languages and having influenced others such as Apache Groovy and PHP. I only want recognized symbols in Real Groovy.


25 September 2016

Groovy Symbology

(content originally published in November 2008, but recovered here from gavingrover.blogspot.com)

In a previous post on rebuilding Groovy, I suggested we should reserve alphanumeric tokens (Unicode categories L, N, M) for lexical items (e.g. names and numbers) in a programming language, and the other tokens (categories S, P, Z) for grammatical items. That way, people can use any alphanumeric tokens for user-defined names. Programs are also terser because programmers don't need to put whitespace between name and symbol, or (usually) between two symbols.
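That split can be sketched with Go's standard unicode package, which exposes the general categories directly. The function names isLexical and isGrammatical are made up for this illustration.

```go
package main

import (
	"fmt"
	"unicode"
)

// isLexical reports whether r is in Unicode category L, M or N
// (letters, marks, numbers), i.e. usable in names and numbers.
func isLexical(r rune) bool {
	return unicode.IsLetter(r) || unicode.IsMark(r) || unicode.IsNumber(r)
}

// isGrammatical reports whether r is in Unicode category S, P or Z
// (symbols, punctuation, separators), i.e. reserved for the grammar.
func isGrammatical(r rune) bool {
	return unicode.IsSymbol(r) || unicode.IsPunct(r) || unicode.Is(unicode.Z, r)
}

func main() {
	for _, r := range "a9名+; " {
		fmt.Printf("%q lexical=%t grammatical=%t\n", r, isLexical(r), isGrammatical(r))
	}
}
```

Control characters such as tab and newline fall in category C, so they belong to neither group in this sketch.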

The grammatical tokens should have less emphasis than the lexical ones. When a programming language uses names for grammatical items, it's not following the way of natural languages, which can confuse those new to programming. Just as unusual, those same grammatical words are bolded in many IDEs, not the lexical ones. In Fortran these grammatical words were even capitalized, while user-defined names weren't. Programming languages should follow natural languages, or else programmers can't analyze and model the problem space in programs as easily.

Let's look at some of the culprits...

Keywords

Perhaps Smalltalk has the fewest keywords:
true false nil self super thisContext


The first language I ever programmed in was a dialect of Algol60, the forerunner of C. We can see there the origins of today's well-known keywords:
true false boolean integer real string array procedure own value label comment while do for step until if then else switch goto begin end


C had 32 keywords:
auto break case char const continue default do double else enum extern float for goto if int long register return short signed sizeof static struct switch typedef union unsigned void volatile while


C++, besides using those in C, added these many more:
asm bool catch class const_cast delete dynamic_cast explicit export false friend inline mutable namespace new operator private protected public reinterpret_cast static_cast template this throw true try typeid typename using virtual wchar_t


C# has 79 keywords, removing a few of C++'s, but adding all these:
abstract as base byte checked decimal delegate event finally fixed foreach get implicit in interface internal is lock null object out override params readonly ref sbyte sealed set stackalloc string typeof uint ulong unchecked unsafe

though 2 of them, get and set, are only keywords in the right context.

Java has 53 keywords:
abstract assert boolean break byte case catch char class const continue default do double else enum extends final finally float for goto if implements import instanceof int interface long native new package private protected public return short static strictfp super switch synchronized this throw throws transient try void volatile while true false null


Apache Groovy adds in, as, it and def to these.

It's very difficult to extend a language without adding keywords. AspectJ adds many to Java:
aspect aspectOf() issingleton() perthis() pertarget() percflow() percflowbelow() privileged declare precedence parents error warning soft thisJoinPoint thisEnclosingJoinPointStaticPart pointcut before around after returning throwing execution call get set target args initialization preinitialization staticinitialization handler adviceexecution within withincode cflow cflowbelow proceed()

though many are sensitive to the context.

Python has a more frugal attitude to keywords:
False None True class continue def and as assert break finally for from del elif else except is lambda nonlocal global if import in return try while not or pass raise with yield


Likewise, Ruby:
BEGIN END alias and begin break case class def defined do else elsif end ensure false for if in module next nil not or redo rescue retry return self super then true undef unless until when while yield

PHP is far more wasteful of the naming space:
and or xor __FILE__ exception __LINE__ array() as break case class const continue declare default die() do echo() else elseif empty() enddeclare endfor endforeach endif endswitch endwhile eval() exit() extends for foreach function global if include() include_once() isset() list() new print() require() require_once() return() static switch unset() use var while __FUNCTION__ __CLASS__ __METHOD__ final php_user_filter interface implements instanceof public private protected abstract clone try catch throw cfunction old_function this final __NAMESPACE__ namespace goto __DIR__


Perl has so many keywords I can't list them.

And those Fortran keywords I mentioned before:
ACCEPT ASSIGN AUTOMATIC BACKSPACE BLOCK BYTE CALL CHARACTER CLOSE COMMON COMPLEX CONTINUE DATA DECODE DIMENSION DO DOUBLE ELSE ENCODE END ENTRY EQUIVALENCE EXTERNAL FILE FORMAT FUNCTION GOTO IF IMPLICIT INCLUDE INQUIRE INTEGER INTRINSIC LOGICAL MAP NAMELIST OPEN OPTIONS PARAMETER PAUSE POINTER PRAGMA PRECISION PROGRAM REAL RECORD RETURN REWIND SAVE STATIC STOP STRUCTURE SUBROUTINE TYPE UNION VIRTUAL VOLATILE WHILE WRITE


Operators

The original Fortran operators had only a few precedences:
R: **
L: * /
L: + -
.eq. .ne. .lt. .le. .gt. .ge.
.not. .and. .or. .eqv. .neqv.
=
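
The R: and L: markers give each row's associativity, and rows nearer the top bind tighter. As a rough illustration only, Python's ** is likewise right-associative and binds tighter than * and /, so the grouping can be checked directly (this is Python standing in for the idea, not Fortran itself):

```python
# Right-associative: 2 ** 3 ** 2 groups as 2 ** (3 ** 2).
right_grouped = 2 ** 3 ** 2
assert right_grouped == 2 ** (3 ** 2)
assert (2 ** 3) ** 2 == 64

# Left-associative: 100 / 10 / 5 groups as (100 / 10) / 5.
left_grouped = 100 / 10 / 5
assert left_grouped == (100 / 10) / 5
assert 100 / (10 / 5) == 50.0

# Higher precedence binds first: ** before *, and * before +.
mixed = 1 + 2 * 3 ** 2
assert mixed == 1 + (2 * (3 ** 2))

print(right_grouped, left_grouped, mixed)
```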


When choosing symbols and punctuation to use, we don't want any alphanumeric symbols, we should keep the same precedences and associativity, and, at first, we should only use those already existing in other languages.

The Algol60 operators were:
^
* %
+ -
< > <= >= = <>
== => ∨ ∧ ~
:= ; .

Note: % meant divide.

The modern operators were first used by C:
L: ++postfix --postfix ()funcCall [] . ->
R: ++prefix --prefix +unary -unary ! ~ (castType) *deref &addr sizeof
L: *mult / %
L: +binary -binary
L: << >>
L: < <= > >=
L: == !=
L: &bitAnd
L: ^
L: |
L: &&
L: ||
R: ?:conditional
R: = += -= *= /= %= <<= >>= &= ^= |=
L: ,


Apart from alpha token sizeof, all these operators can be used in Real Groovy. We can find another use for dereferencing * and addressing &, but keep their precedence and associativity. And the comma , and -> operators will be reinstated.

C++ adds non-alpha tokens :: ::* .* ->*. The GCC version of C++ adds operators &&label <? >?.

The Java operators are:
L: [] ()methCall .
R: ++prefix ++postfix --prefix --postfix +unary -unary ~ ! (typeCast) new
L: * / %
L: +addition -binary +stringConcat
L: << >> >>>
L: < <= > >= instanceof
L: == !=
L: &bitAnd &boolAnd
L: ^bitXor ^boolXor
L: |bitOr |boolOr
L: &&
L: ||
R: ?:conditional
R: = += -= *= /= %= <<= >>= >>>= &= ^= |=

Apache Groovy adds:
**
<=>
*. ?. .& .@ *.@
as
.. ..< in
?:elvis
=~ ==~

Real Groovy will replace as, in, new and instanceof with symbols.

C# adds non-alpha operators ?? and => at the lowest precedence. JavaScript adds === and !== at the same precedence as ==. Ruby adds =~ and !~ at the same level as ===, plus ... &&= and ||=. There are no further unique operators in PHP.

Perl 5 uses variable prefixes $ @ % & \. We could swipe $ to mean an escape even when we're not inside an interpolated string. Perl 6 plans on adding many more operators: see the Perl 6 Synopsis 3. Once it's out, I'll use it as a source for even more symbols.

As well as those operator symbols from those programming languages, Real Groovy will also add:
!=~ !~ !in !instanceof


For integer division and modulo, returning two values:
/%
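
Python's built-in divmod already has this two-result shape, so it gives a feel for what /% could return. A minimal sketch: the function name div_mod and the mapping of /% onto divmod are assumptions for illustration, not a specification of Real Groovy.

```python
def div_mod(a, b):
    """Return (quotient, remainder), as a hypothetical `a /% b` might."""
    return divmod(a, b)

quotient, remainder = div_mod(17, 5)
print(quotient, remainder)

# The two results always recombine into the original dividend.
assert quotient * 5 + remainder == 17

# Python floors the quotient, so the remainder takes the sign of the divisor.
print(div_mod(-17, 5))
```

Whether the quotient should floor (as here) or truncate towards zero (as in C and Java) is a design choice the operator would have to settle.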


For defining parser combinators:
::= &&& ||| ???
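
To give a feel for what such operators might express, here is a small sketch of parser combinators in Python. It assumes &&& would mean sequence, ||| ordered choice, ??? optional, and ::= rule definition; those readings are guesses, and Python's own &, | and a method named optional stand in for them. The Parser class and lit helper are invented for this example.

```python
class Parser:
    """Wraps a function taking (text, pos) and returning (value, new_pos) or None."""

    def __init__(self, fn):
        self.fn = fn

    def __call__(self, text, pos=0):
        return self.fn(text, pos)

    def __and__(self, other):
        # Sequence (stand-in for &&&): run self, then other from where self stopped.
        def seq(text, pos):
            first = self(text, pos)
            if first is None:
                return None
            second = other(text, first[1])
            if second is None:
                return None
            return (first[0], second[0]), second[1]
        return Parser(seq)

    def __or__(self, other):
        # Ordered choice (stand-in for |||): try self, fall back to other.
        def alt(text, pos):
            result = self(text, pos)
            return result if result is not None else other(text, pos)
        return Parser(alt)

    def optional(self):
        # Zero-or-one (stand-in for ???): succeed with None if self fails.
        def opt(text, pos):
            result = self(text, pos)
            return result if result is not None else (None, pos)
        return Parser(opt)


def lit(s):
    """Parser matching the literal string s at the current position."""
    def match(text, pos):
        return (s, pos + len(s)) if text.startswith(s, pos) else None
    return Parser(match)


# Roughly: sign_digit ::= ("+" ||| "-")??? &&& ("0" ||| "1")
sign = (lit("+") | lit("-")).optional()
digit = lit("0") | lit("1")
sign_digit = sign & digit

print(sign_digit("-1"))
print(sign_digit("0"))
print(sign_digit("x"))
```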


I don't know what these will do yet, but might as well keep symbols balanced:
<<< <<<=

Special Variables

Apache Groovy disallows names containing $, reserving them for its internal use. Because Real Groovy also needs internal variables, it needs to reserve some names, those containing an underscore _. Some pre-existing names in Apache Groovy and Java are coded with all capitals separated by underscores. Real Groovy will code them in camel case with a trailing underscore, e.g. SYMBOLIC_CONSTANT_FOR_IDEC will be coded as the easier-to-type symbolicConstantForIdec_, the trailing underscore showing it must be converted to the all-capitals name.
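
The conversion from the trailing-underscore form back to the all-capitals name can be sketched in a few lines of Python. This assumes the all-capitals name puts an underscore at each camel-case word boundary; the function name to_all_capitals is made up for the example.

```python
import re

def to_all_capitals(name):
    """Convert a trailing-underscore camel-case name to its all-capitals form."""
    if not name.endswith("_"):
        raise ValueError("expected a trailing underscore: " + name)
    camel = name[:-1]
    # Insert an underscore before each capital that follows a lowercase letter or digit.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", camel).upper()

print(to_all_capitals("symbolicConstantForIdec_"))
print(to_all_capitals("maxValue_"))
```

The mapping is only reversible when every word boundary shows up as a case change, so names with runs of capitals or embedded digits would need an extra rule.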

Names beginning with _ will be used as special variables, similar to those in Perl and Ruby. When choosing the meanings of special variables, we'll consider their use in regexes and Format strings, as well as those languages. Some examples from Ruby:
$! $@ $_ $. $& $~ $1 $2 $3 $= $/ $\ $0 $* $$ $?


The _ standing alone will refer to the result of the previous statement, similar to Python's interactive mode. Natural languages have pronouns, so should programming languages.

Numeric Formats

Formats of numbers must also be terse. C and C++ used the well-known syntax:
34 -72
34.1 -23.4 .77 34.1e6 78.9E-2
077 0xFF -0Xff
34U 24L 42UL


Java removes the suffix for unsigned numbers, but adds:
34.7F 22f 34.8d 77D 67.89e-2d
34 -72
Double.NEGATIVE_INFINITY Float.NaN


When supplying a string to a Float object, we can use a special syntax:
new Float('0xDEDEp-3')
new Float('0x.defP-3f')

We'll inline this into Real Groovy's syntax.
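
For readers unfamiliar with the notation, the digits are hexadecimal and the p exponent is a power of two. Python's float.fromhex reads the same form, which makes the values easy to check. This is a sketch for orientation: the trailing f in the second Java example is left off here because it is a type suffix, not part of the number.

```python
a = float.fromhex("0xDEDEp-3")   # 0xDEDE * 2**-3
b = float.fromhex("0x.defP-3")   # (0xDEF / 16**3) * 2**-3

# Both values are exactly representable, so the comparisons are exact.
assert a == 0xDEDE / 8
assert b == 0xDEF / 16**3 / 8

print(a, b)
# float.hex goes the other way and gives a normalized hexadecimal form.
print(a.hex())
```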

Apache Groovy adds 34i 22G 34.8g. C# removes octal numbers, but adds decimal, e.g. 34.7m. We'll keep octal numbers, but require an o, as in 0o377.

Perl allows an underline for legibility, e.g. 1_234_567_890. We'll certainly enable that in Real Groovy.

Ruby mainly copies Perl, but adds ASCII codes and binary:
?a //Ascii code for 'a'
0b01101 //binary


Python has optional decimal parts after the decimal point, and imaginary numbers:
5. 5.e7
10j 11.1j .1j 1.j 7e5j

There's no further unique syntax in JavaScript and PHP.
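
Several of the forms surveyed above can be tried directly in Python 3.6 or later, which accepts the 0o octal prefix, underscore separators, binary literals, a bare trailing decimal point, and the j suffix. A short check, with variable names chosen only for this example:

```python
octal = 0o377            # octal with an explicit o
big = 1_234_567_890      # underscores for legibility
binary = 0b01101         # binary
trailing_dot = 5.        # nothing after the decimal point
scaled = 5.e7            # the same, with an exponent
imaginary = 1.j + 10j    # imaginary literals

assert octal == 255
assert big == 1234567890
assert binary == 13
assert trailing_dot == 5.0 and scaled == 50000000.0
assert imaginary == complex(0, 11)

print(octal, big, binary, trailing_dot, scaled, imaginary)
```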

One source of good ideas is the J language:
2b1001011 //base 2
3b12202102 //base 3, etc
3r5 // 3/5, similar to Scheme's rational numbers
3j5 // 3+5i, terser than Python's complex numbers
3p5 // 3*pi^5
3x5 // 3*e^5
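
To make the meanings concrete, here is a small Python sketch that evaluates literals of these shapes. It is a simplified reading of the forms as described in the comments above, handling only non-negative integer parts, and is not J's full numeric grammar; the function name parse_j_style is invented.

```python
import math
import re
from fractions import Fraction

def parse_j_style(literal):
    """Evaluate one of the J-style numeric forms listed above (simplified)."""
    m = re.fullmatch(r"(\d+)([brjpx])(\w+)", literal)
    if m is None:
        raise ValueError("unsupported literal: " + literal)
    left, kind, right = m.group(1), m.group(2), m.group(3)
    if kind == "b":                       # <base>b<digits>
        return int(right, int(left))
    a, b = int(left), int(right)
    if kind == "r":                       # rational a/b
        return Fraction(a, b)
    if kind == "j":                       # complex a+bi
        return complex(a, b)
    if kind == "p":                       # a * pi**b
        return a * math.pi ** b
    return a * math.e ** b                # kind == "x": a * e**b

for text in ["2b1001011", "3b12202102", "3r5", "3j5", "3p5", "3x5"]:
    print(text, "=>", parse_j_style(text))
```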


Real Groovy will be about extreme tersity in programming. It will be a better choice of syntax than Apache Groovy.


written: May 2016; published: August 2016

Apache Groovy's continuing fabrications

One year ago, Apache Groovy changed its website-based downloads from Codehaus to Apache (which redirects to groovy-lang.org, addressing an IP currently hosted in Germany). Their website says "all downloads (except the source) are hosted in Bintray's Groovy repository"...

DL Fakery 1 groovy-lang.org download.jpg

which means clicking on download at groovy-lang.org forwards the download request to Bintray. One of Groovy's backers claimed anonymously in a reddit comment that this was the reason for the sudden spike in downloads in May 2015 on Bintray...

DL Fakery 2 bintray downloads march 2016.jpg

..., over 80% of which come from an IP address hosted in Germany...

DL Fakery 3 bintray downloads to July 2015.jpg

Of course, we don't really know how many of the Bintray downloads are genuine user requests on groovy-lang.org, or if they're generating ten downloads for every one real user, or even fifty for every one, but the Apache Groovy backers deserve the benefit of the doubt. Until, that is, groovy-lang.org went down for 2 days one recent weekend...

DL Fakery 4 MailList Groovy Down.jpg

Looking at the Bintray download activity for Groovy on those days makes an interesting picture...

DL Fakery 5 Bintray May June 2016 one month.jpg

Not a single dent in the download numbers for that weekend, despite the groovy-lang.org domain name being down for 2 whole days. Wherever did those requests come from? The most likely explanation is that the German-based downloads are being generated by a timer script running on the machine being pointed to by groovy-lang.org, and genuine download requests are few.

Also interesting in that picture is the overall trend for the non-German downloads -- they started trending downwards a month ago, just about when JetBrains and Gradleware announced their partnership to make Kotlin instead of Groovy the preferred language for build scripts and plugins in Gradle 3.

Groovy's "popularity" is a fabrication.


22 August 2016

Groovy Dilemma

(this content originally published in January 2010, but recovered here from gavingrover.blogspot.com)

In chapter 7 of Steven Pinker's 1994 book The Language Instinct, he gives an example of a perfect right-branching sentence:

Remarkable is the rapidity of the motion of the wing of the hummingbird.


This is parsed in the human brain as shown by the parentheses:

(Remarkable (is (the (rapidity (of (the (motion (of (the (wing (of (the (hummingbird))))))))))))).


"Remarkable" is the subject, the remainder is the predicate. "is" is the main verb, the remainder is its object (here, called the complement). "the" is the article, the remainder is its referent. "rapidity" is a phrasal head, the remainder is a prepositional phrase as tail. "of" is a preposition, the remainder is its tail in the phrase. And so on.

Pinker gives another example easy for the brain to parse, one that includes relative and subordinate clauses:

(He gave (the candy (to the girl (that (he met (in New York) while (visiting his parents (for ten days (around Christmas and New Year's)))))))).


He rearranges it so it's far harder for our minds to parse:

(He gave (the girl (that (he met (in New York) while (visiting his parents (for ten days (around Christmas and New Year's)))))) the candy).


The direct object the candy after the many closing parentheses forces our short-term memories to keep track of dangling phrases that need particular words to complete them. It seems our brains, unlike computers, can only remember a few dangled branches when parsing sentences.

Perhaps that's why the Lisp code that's easiest for humans to read ends with many closing parens, such as this recursive sample from chapter 2 of Paul Graham's On Lisp:

(defun our-length (lst)
  (if (null lst)
      0
      (1+ (our-length (cdr lst)))))


Left-branching sentences are also easy for humans to parse. Pinker gives another example with two arrangements, one harder for humans to parse:

((The rapidity (of the motion (of the wing (of the hummingbird)))) is remarkable).


and the other, a perfect left-branching sentence, easy:

(((((The hummingbird)'s wing)'s motion)'s rapidity) is remarkable).


English has just a few left-branching structures, but some languages, such as Japanese, are primarily based on them.

One of the universals in Universal Grammar theory, which both Pinker and Noam Chomsky support, is that if a language has verbs before objects, as English does, then it uses prepositions, while if a language has objects before verbs, as Japanese does, it uses postpositions. Pinker mentions a possible reason this universal holds is so the language can enforce a consistent branching decision, either left-branching or right-branching, so our brains can parse it easily.

Some grammatical English sentences are impossible for our brains to parse simply because there are too many dangling branches. The first of these examples parses in our brains OK, but the other two simply don't parse:

(The rapidity (that the motion has) is remarkable).
(The rapidity (that the motion (that the wing has) has) is remarkable).
(The rapidity (that the motion (that the wing (that the hummingbird has) has) has) is remarkable).


They do parse in computer languages, though. When I discovered closures in Groovy, I started using this type of unreadable embedding, but I now realize I should be making my code either all left-branching or all right-branching to make it more readable.
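
As a rough illustration of the difference in code, here is the same small computation written three ways in Python. The data, the helper functions and the Chain wrapper are all invented for the example, and which shapes count as left- or right-branching is one reading of the analogy.

```python
words = ["wing", "hummingbird", "motion", "rapidity"]

def upper_all(ws):
    return [w.upper() for w in ws]

def keep_long(ws):
    return [w for w in ws if len(w) > 4]

def by_length(ws):
    return sorted(ws, key=len)

# Embedded: the trailing key=len dangles until the inner filter/map expression closes.
embedded = sorted(filter(lambda w: len(w) > 4, map(str.upper, words)), key=len)

# Right-branching: every call ends with the next one, so all parens close at the end.
right = by_length(keep_long(upper_all(words)))

class Chain:
    """Tiny hypothetical wrapper so each step can hang off the result of the last."""
    def __init__(self, value):
        self.value = value
    def then(self, fn):
        return Chain(fn(self.value))

# Left-branching: each step applies to everything written before it.
left = Chain(words).then(upper_all).then(keep_long).then(by_length).value

assert embedded == right == left
print(left)
```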


3 September 2016

Tech blogs 2011-2016

Following on from the summary below of my (now deleted) tech blogs on blogspot from 2007 to 2010, here's my technical progress on building Real Groovy since then...

I ended 2010 targeting the GroovyASTBuilder with a grammar parsed by combinator parsers written in Groovy++, an independent plugin to Groovy. By early 2011, I'd started work on an annotation-based plugin to Groovy, written in the speedier Groovy++, so those combinator parsers could be specified using strings and operators in Groovy's grammar, similar to Groovy's GStrings. I named them GRegexes, and branded the combined distro of Groovy, Groovy++, and GRegexes as Real Groovy. By mid-2011, the Codehaus Groovy despots had responded with two products that cloned Groovy++ and GRegex's functionalities: Grumpy and GParsec. Grumpy later became the static typing facility in Groovy.

In early 2012, I tried rebooting Groovy again by writing it in Clojure, and simply calling it the Groovy Language. With this name, I was challenging Guillaume Laforge's claim that Groovy names the codebase of the Codehaus implementation rather than the language specification begun by James Strachan. By mid-2012, my Clojure-based rewrite of Groovy was testing successfully on both the JVM and ClojureCLR. Then in early 2013, I changed direction by going off the JVM and attempting to reboot Groovy in Haskell.

On 28 April 2013, I'd returned to Clojure and had named it Grojure. On 17 November, I released Real Groovy 0.10 which bundled both Grojure for dynamic typing and Kotlin for static typing. By 2014, I'd abandoned Grojure, and switched from the JVM to Go as the implementation platform for the Groovy reboot. In May, I announced GroovyCode to be an extension to Unicode's UTF-8 encoding, and Ultracode to be a 6-bit language embedded in the 10xxxxxx bytes that follow a 11111110 byte.

In late 2014, I renamed GroovyCode to UTF-88, and released a fully functioning Go package implementing it. By then I'd decided to reboot Groovy module by module, using Go as the platform. In 2015 I released Gro 0.1, incorporating utf-88. I released package kern for combinator parsing in October 2015, and thomp for dynamic typing in February 2016. In April 2016, I released Qu, a recursive descent parser based on that in Go 1.6, enabling Unihan for keywords, special identifiers, and package names. I then rebuilt Gro using Qu as a base, and bundled it all as grolang on its own github.com account.

20 October 2016 update:

I've now finished republishing suitable content embedded in the deleted 2007-2010 blogspot posts as standalones, as well as duplicating my posts at tumblr.

I'm now purging my blog here at Codeplex of all my abandoned technical directions in a bid to rebuild the Groovy Language, for the same reason as my earlier gavingrover.blogspot.com deletions -- to unclutter the longer-lasting entries by Gavin O.O. Grover, future Real Groovy consultant (not to take the mickey out of any churches). So I'm turning my blog here into a wiki!


21 August 2016

Technical Blogs 2007-2010

Lately I deleted some blog entries from gavingrover.blogspot.com, my blog from 2007 to 2010. They were technical entries describing earlier attempts at building Real Groovy, but which cluttered up the longer lasting entries, hence the deletion. You can still see them all if you download the zip of my blogs from here at groovy.codeplex.com, but here's a quick summary of what they said...

On 6 April 2007, I announced Grerl, a preprocessor for the Groovy Language, being "the spirit of Groovy with the clothes of Perl". It was to enable programmers to code in their favorite syntactic shortcuts, natural language, and formatting, but also to convert their code to a standard Groovy source listing when giving it to others to read. It would've provided a context-sensitive lexical macro system. I was using JSE (Java Syntactic Extender) as a starting point to implement such a macro system. Grerl would've also enabled us to use Groovy's MOP to alias names in programs into other natural languages, and even map foreign words and CJK characters to each English word in an identifier, rather than directly to the whole identifier name.

On 9 August, I switched to writing a GroovyASTBuilder to make it easier to generate AST nodes while parsing the program source of any syntax. The end result, being written in dynamically typed Groovy, wouldn't have been nearly as fast as the tools written in static Java, though in future compiling Groovy in Groovy may have become a good benchmark. For the lexer and parser, I switched to JParsec, a context-sensitive combinator parser to allow both more readable code and syntactic mutability, and on 9 November, branded it Vy. On 16 December, I started designing Grerl-Vy's syntax to be aimed not at Groovy programmers but at others, even non-Java programmers.

By then (late 2007), I had realized a cohort made up of parties from the Groovy/Grails community, from Australia/New Zealand, and from mainland China were already surveilling me in my apartment as I worked on Grerl-Vy. My subsequent blog entries changed in tone to reflect this situation. On 19 January 2008, I defined Grerl-Vy to be all the software that fits completely between a graphical editor and the Groovy AST, and so replacing both clunky IDEs and IMEs for CJK characters. On 14 February, I started building Grerl-Vy again from scratch, but using my own combinator parsing library in Groovy, using Ken Barclay's GParsec as a starting point, instead of JParsec. And on 25 March, I renamed Grerl-Vy to GroovyScript.

On 19 May, I started experimenting with using Scheme to create a Groovy-like syntax, called Ghreme. On 28 September, I started looking at the feasibility of porting Groovy to .NET, using the then-nascent DLR as the target. By mid-2008, although I was being encouraged publicly for my contributions to the Groovy ecosystem including documentation, I was getting a different message from somewhere through backchannels, telling me to "stop messing with the brand". I didn't know for sure where the antagonism was coming from, nor could I prove it.

In May 2009, I switched back to the JVM, using Scala for its type-checking and greater functional paradigm, and renamed what I was building Groovier. On 9 July, I switched from the Groovy AST to the Scala parse tree as the target. On 9 September, I called this GroovyScala, and on 27 September, Groovy 2.0. And on 31 October, I announced Strach, an IME (input method editor) to help programmers enter all Unicode characters for Groovy.

On 4 November, I switched back to targeting the Groovy AST, using the new Groovy 1.7 ASTBuilder, but still writing it in Scala. On 1 January 2010, I decided to bust open Groovy by stripping out the added cruft and splitting up core components such as the MOP and DGM (default Groovy methods) so programmers could use Groovy's functionality from other JVM-based languages. On 3 March 2010, I decided to switch to the newly-announced Groovy++ for writing the lexer/parser code targeting Groovy's ASTBuilder. This is how I finished logging my technical progress on gavingrover.blogspot.com at the end of 2010.

Embedded in the blog entries I deleted was some content I would've kept if they'd been standalone entries. I'll republish those here as standalones over the next few months as I get time to sift through them.

Last edited Jan 16 at 12:23 PM by gavingrover, version 22