Gavin Grover's GROOVY Wikiblog

28 March 2014

Introducing GroovyCode...

Since digging into Unicode's background during my reboot of the Groovy Language, I've discovered it isn't finished being built. To be of use in Groovy, I need to finish building Unicode before continuing with the Groovy reboot. The final build will be called GroovyCode, and comprise the following components:

  1. the ultra extensions to Unicode reinstating the pre-2003 upper limit to UTF-8 as originally designed by Pike and Thompson, and the ultra surrogate system for where this isn't available
  2. a 6-bit control language embedded in the 10xxxxxx bytes following a 11111111 byte, using characters U+20 to U+5F, to, among other uses, encode decomposition commands for Unihan characters
  3. intermediate decompositions of the 75,000 Unihan characters presently in Unicode (e.g. 丨 and 川 are included in Unicode, but not two vertical strokes which is used in many characters like in 介 under the 人) to also be included in Unicode
  4. decomposition tables for all the Unihan characters similar to my own first attempt so they can be converted to Ultra Decomposed Normal Form
  5. an algorithm to generate 2.1 billion Unihan characters formulaically into the 5- and 6-byte encodings in UTF-8-Ultra so Unihan characters can be converted to Ultra Composed Normal Form
  6. correct standard fonts and generation code for all such codepoints so users don't have to rely on erroneous fonts such as Microsoft Windows 7's SimSun font which has the wrong pictures for most Unihan extension C and D characters
  7. standard keystrokes for entering Unihan characters pictorially so everyone can easily enter them, irrespective of their native language
The primary purpose of building GroovyCode is so Unihan ideographs can be used outside of their spoken language contexts in Asia. When Japanese use single Kanji to represent multisyllabic words, they read them visually not phonetically. By making Unihan tokens visually defined characters in Unicode/GroovyCode, not only Japanese but Chinese and Westerners and others can also easily use them.

...and GroovyLang

The primary purpose of rebooting the Groovy Language is to showcase Unihan in a programming language. Unicode's ID_Start and ID_Continue properties (and their X- versions) treat Unihan the same as other alphanumerics in identifier names, but GroovyLang will put an implied space before and after every Unihan so they can only be used as standalone characters, like in Japanese kunyomi Kanji and Chinese classical texts. Programming language grammar can thus be much terser, readable on a smartphone screen.

With GroovyCode, Real Groovy is now no longer just a language...

RealGroovy = GroovyCode + GroovyLang

27 March 2014 (updated from 22 March)

Unicode Ultra Normal Forms

The proposed UTF-8-Ultra will need two new normal forms, bringing the total to six...
  • NFC: canonical composition
  • NFD: canonical decomposition
  • NFKC: compatibility composition
  • NFKD: compatibility decomposition
  • NFUC: ultra composition
  • NFUD: ultra decomposition

A very rough picture of these forms is:

     NFKD <-------+----> NFKC (=canonically compose NFKD)
           NFD <--+----------> NFC (=canonically compose NFD)
NFUD <------------+---------------> NFUC (=ultra-compose NFUD)

Ultra Decomposition Normal Form

NFUD relies on UTF-8-Ultra introducing a 6-bit control language to Unicode, embedded in the 10xxxxxx bytes following a 11111111 byte, as suggested in proposal 3(b) of the draft spec for Ultra Unicode. The characters U+20 to U+5F will be used, i.e. the space, 26 uppercase letters, 10 digits, and 27 punctuation and symbols !?,.:;#$%&@^*=+-_'"/\<>()[]. NFUD will extend NFD, and deprecate NFKD and NFKC.

Presently if a character such as ª is compatibility decomposed (NFKD), it will become a, losing some information, namely that it had decomposition type super. Ultra decomposition (NFUD) will retain the compatibility decomposition types as ultra control commands, e.g. ª would become control command SUPER[, followed by a, followed by control command ]. Case mappings would similarly be decomposed, e.g. A becomes command UPPER[, token a, command ]. We would also decompose all Unihan tokens to a set of core components, e.g. 语 would become command UNIHAN[ACROSS[, token 讠, command DOWN[, token 五, token 口, command ]]]. Because NFUD commands can be nested, we could process UTF-8-Ultra Unicode in NFUD as a tree.

We would use the control commands for embedding rich text commands besides these, such as BOLD, enabling UTF-8-Ultra higher level protocols to extend, not envelop, Unicode encodings in a non-breaking manner, so the protocol will more likely become widely used, as did UTF-8 when it extended the ASCII encoding. But until we implement the embedded 6-bit control language in UTF-8, we need an alternative method of specifying such commands in Unicode texts, just as we use ultra surrogates with UTF-8 because the 5- and 6-byte extensions haven't been available since 2003.

We'll therefore use a markup language in our text and Real Groovy programs. If using backtick ` for escaping, we would write:
  • `SUPER[`a`]` in our text for ª,
  • `UPPER[`a`]` for A,
  • and `UNIHAN[ACROSS[`讠`DOWN[`五口`]]]` for 语.
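
Because the commands nest, the backtick markup above can be read straight into a tree. The following is only a minimal sketch: the function name `parse_markup` and the nested-list tree shape are my own assumptions for illustration, and it assumes every command has the form NAME[ ... ] with backticks toggling between command text and ordinary tokens.

```python
def parse_markup(text):
    """Parse backtick-escaped command markup into a nested-list tree.

    Each command scope becomes [NAME, child, child, ...]; ordinary
    characters become one-character string leaves.
    (Hypothetical helper, not part of any published spec.)
    """
    root = ["ROOT"]
    stack = [root]          # innermost open scope is stack[-1]
    in_command = False      # True while between a pair of backticks
    name = ""               # command name collected so far
    for ch in text:
        if ch == "`":
            if in_command and name:
                raise ValueError(f"command {name!r} has no opening bracket")
            in_command = not in_command
        elif not in_command:
            stack[-1].append(ch)            # ordinary token
        elif ch == "[":
            node = [name]                   # open a new scope
            stack[-1].append(node)
            stack.append(node)
            name = ""
        elif ch == "]":
            if len(stack) == 1:
                raise ValueError("unbalanced ]")
            stack.pop()                     # close the innermost scope
        else:
            name += ch
    if in_command or len(stack) != 1:
        raise ValueError("unterminated command or scope")
    return root[1:]


print(parse_markup("`SUPER[`a`]`"))
print(parse_markup("`UNIHAN[ACROSS[`讠`DOWN[`五口`]]]`"))
```

The second call yields one UNIHAN node containing an ACROSS node, which in turn holds the leaf 讠 and a DOWN node with leaves 五 and 口.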

Ultra Composition Normal Form

NFUC is an extension of NFC, so all canonical compositions are also ultra compositions. So Hangul will be the NFUC version of Jamo sequence 가ᄋ, and ñ will be the NFUC version of sequence n◌̃ , just as they are presently the NFC versions. The additional processing will be that where a formulaically-generated version of a Unihan character exists in planes 0x20 to 0x7FFF (i.e. the planes representable by 5 or 6 bytes in UTF-8-Ultra), the formulaically-generated version will be substituted.

When converting to NFUC, the character stream will first be converted to NFUD, the canonical reordering algorithm applied, then converted to NFUC, leaving decomposition sequences not having formulaic versions in the data. Whereas ultra-composition will process such 6-bit commands, canonical composition will ignore them and compatibility composition will remove them.
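
The existing behaviour that NFUD and NFUC would build on can be checked with Python's standard `unicodedata` module. This sketch only exercises today's NFC and NFKD; nothing here implements the proposed ultra forms. The helper name `compatibility_tag` is made up for illustration, and the Hangul syllable 강 is my own choice of example.

```python
import unicodedata


def compatibility_tag(ch):
    """Return the decomposition type tag of ch (e.g. '<super>'), or None.

    This is the information that NFKD discards and that the proposed
    NFUD would keep as a control command.
    """
    fields = unicodedata.decomposition(ch).split()
    if fields and fields[0].startswith("<"):
        return fields[0]
    return None


# The feminine ordinal indicator carries a <super> tag in the Unicode
# Character Database, but NFKD output is a bare 'a'.
print(compatibility_tag("\u00aa"))                     # <super>
print(unicodedata.normalize("NFKD", "\u00aa"))         # a

# Canonical composition, which NFUC would extend:
# n + combining tilde composes to U+00F1.
print(unicodedata.normalize("NFC", "n\u0303") == "\u00f1")                # True
# Three conjoining jamo compose to the single Hangul syllable U+AC15.
print(unicodedata.normalize("NFC", "\u1100\u1161\u11bc") == "\uac15")     # True
```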

9 March 2014

Groovy Unicode UTF-8-Ultra

I've only just begun publicizing the Ultra extensions to Unicode I'll need when the Groovy Language reboot reaches maturity, but initial negative feedback made me think about the best next step. Besides the two main additions (i.e. extending UTF-16 with doubly-indirected surrogates, and removing the U+10FFFF upper limit in UTF-8), the proposal also presents a minor addition that's not recommended, i.e. extending the existing UTF-8 codepoint range with the same ultra surrogates proposed for UTF-16. Although I only put it in for completeness, it could be the best choice for building a reference implementation.

UTF-8-Ultra uses existing private use planes to define "ultra-surrogates". The top half of plane 0xF becomes 32,768 high ultra-surrogates (U+F_8000 to U+F_FFFF) and all of plane 0x10 becomes 65,536 low ultra-surrogates (U+10_0000 to U+10_FFFF). Together, they can access 2,147,483,648 codepoints (i.e. 32,768 planes) from U+0000_0000 to U+7FFF_FFFF. We would keep the first half of plane 0xF (U+F_0000 to U+F_7FFF) as 32,768 private use codepoints alongside the 6400 private use ones in the BMP. Of the additional ultra planes, we would leave another 983,040 codepoints (U+11_0000 to U+1F_FFFF) alone for general private use, they being the codepoints representable by 4 bytes in the original 1993 spec for UTF-8.

The planes representable by 5 or 6 bytes in that spec, 0x20 to 0x7FFF, we would use to implement a proof-of-concept showing how we can represent many CJK Unihan ideographs using the same formulaic method used to represent Korean Hangul. Each Hangul block is made up of 2 or 3 jamo: a leading consonant, a vowel, and possibly a trailing consonant. There are 19 leading consonants, 21 vowels, and 28 trailing consonants (including none at all), giving a total of 11,172 possible syllable blocks, generated by formula into range U+AC00 to U+D7A3. But only 2350 of them are used often enough to justify their inclusion in South Korea's KS-X-1001 national standard; the other 8822 are only there for completeness.
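
The Hangul formula is short enough to write out. This sketch uses the arithmetic defined by the Unicode Standard for precomposed syllables; the function names and the index-based arguments (positions within the 19, 21 and 28 jamo lists, with trailing index 0 meaning no trailing consonant) are my own choices for illustration.

```python
S_BASE = 0xAC00                      # first precomposed syllable
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28


def compose_hangul(lead, vowel, trail=0):
    """Return the syllable for the given jamo indexes (trail 0 = none)."""
    if not (0 <= lead < L_COUNT and 0 <= vowel < V_COUNT and 0 <= trail < T_COUNT):
        raise ValueError("jamo index out of range")
    return chr(S_BASE + (lead * V_COUNT + vowel) * T_COUNT + trail)


def decompose_hangul(syllable):
    """Return (lead, vowel, trail) indexes for a precomposed syllable."""
    index = ord(syllable) - S_BASE
    if not 0 <= index < L_COUNT * V_COUNT * T_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    lead, rest = divmod(index, V_COUNT * T_COUNT)
    vowel, trail = divmod(rest, T_COUNT)
    return lead, vowel, trail


print(L_COUNT * V_COUNT * T_COUNT)                 # 11172
print(hex(ord(compose_hangul(18, 20, 27))))        # 0xd7a3
print(decompose_hangul("\uac15"))                  # (0, 0, 21)
```

No lookup table is involved in either direction, which is the property the Unihan proof-of-concept would try to reproduce.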

There are many more CJK components than Korean jamo: semantic radicals and phonetic components together number 500 to 1000 depending on how you count them, and unlike jamo, they can combine recursively to form Chinese characters (e.g. 懂 is 忄 followed by 董, which is 艹 over 重, which is 千 joined to the top of 里, which is 土 joined under 田, which is 冂 surrounding the top of 土 ). By discovering the best combination of base components and recursion depths, we can create 2.1 billion new Chinese characters which don't need their own glyphs in a font but can have their images calculated and rendered by formula. Most of them are just there for completeness, but some can deprecate a large chunk of the 75,000 already encoded.
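
To make the recursion concrete, the 懂 example can be written as a nested structure and measured. This is only a sketch: ACROSS and DOWN follow the command names used elsewhere on this page, while JOIN_TOP, JOIN_UNDER and SURROUND_TOP are hypothetical operator names I've invented here for the three "joined" and "surrounding" relations.

```python
# Each node is (operator, first_part, second_part); a leaf is a string.
DONG = ("ACROSS", "忄",
        ("DOWN", "艹",
         ("JOIN_TOP", "千",
          ("JOIN_UNDER", "土",
           ("SURROUND_TOP", "冂", "土")))))


def leaves(tree):
    """Return the base components of a decomposition, left to right."""
    if isinstance(tree, str):
        return [tree]
    _operator, first, second = tree
    return leaves(first) + leaves(second)


def depth(tree):
    """Return the recursion depth (0 for a bare component)."""
    if isinstance(tree, str):
        return 0
    _operator, first, second = tree
    return 1 + max(depth(first), depth(second))


print(leaves(DONG))   # ['忄', '艹', '千', '土', '冂', '土']
print(depth(DONG))    # 5
```

A formulaic encoding would have to pick a maximum depth and a component list so that trees like this one fit within the available codepoints.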

Building such a UTF-8-Ultra reference implementation will put to rest some of the negative feedback coming my way:
  • Having conversions from UTF-8 or UTF-32 to UTF-16 fail due to a valid but unrepresentable-in-UTF-16 code point would be an extra headache to deal with (that many would forget to handle)
By encoding UTF-8 with ultra-surrogates we're showcasing the way UTF-16 can be extended so all valid pre-2003 UTF-8 (and UTF-32) codepoints are representable in UTF-16. The ridiculosity of using these surrogates with UTF-8 to access 0x7FFF_FFFF codepoints will raise serious questions about why UTF-8 was clipped to only 1.1 million characters back in November 2003, the exact same time Laforge joined a fledgling 3-month-old programming language ultimately intended to showcase Unicode in programming. By severely restricting the lexis of Unicode and the grammar of Groovy, business consultants have been able to ply their fake productivity tools around IT shops as substitutes for the real thing.
  • for zero benefit
By discovering a good formula for building Chinese characters we're showcasing an application with benefit. Because the Korean version of the same application was already encoded in Unicode 2.0, no-one can seriously suggest it's not a viable application. But I'm certain some will try.
  • Good for us nobody will listen to you
Because of this comment and the many downvotes I assume a voting ring was in place and UTF8 Everywhere was responsible. If so, I would've expected a different attitude from them since my proposal makes UTF-8 more useful and UTF-16 less so. I won't really need to use more than 137,000 measly private use codepoints in Unicode for a while, but I'm getting in early on what I expect to be a long decade or two of extolling the virtues of Unicode Ultra to deaf ears. But eventually programmers will code entire programs while riding to work on the subway in Beijing, Hong Kong, and Tokyo using Groovy and Unicode.

11 January 2014

The Groovy Future of Unicode

In last week's Groovy History of Unicode, I alluded to some possible Unicode futures which I'll explore here...

1. UTF-16-extended

1996's UTF-16 can access 17 planes, i.e. 1_114_112 codepoints. In plane 0, i.e. the BMP from U+0000 to U+FFFF, the 1024 high surrogates (U+D800 to U+DBFF) and 1024 low surrogates (U+DC00 to U+DFFF) can access an extra 1_048_576 codepoints (i.e. 16 extra planes) from U+1_0000 to U+10_FFFF. UTF-16 was subsequently adopted by Java and Windows, then in 2003, UTF-8 was accepted by the Unicode Consortium, but with its maximum 2.1 billion codepoints restricted to the same 1.1 million points that UTF-16 can access.

The top 2 planes (15 and 16) had been reserved, however, for private use which means we can use them as second-stage "ultra surrogates" to access the remaining 2.1 billion codepoints available in the original UTF-8 proposal. We would keep the first half of plane 15 (U+F_0000 to U+F_7FFF) as 32_768 private use codepoints alongside the 6400 private use ones in the BMP. But we'd reassign the top half of plane 15 to be 32_768 high ultra-surrogates (U+F_8000 to U+F_FFFF) and all of plane 16 as 65_536 low ultra-surrogates (U+10_0000 to U+10_FFFF). Together, they can access 2_147_483_648 codepoints (i.e. 32_768 planes) from U+0000_0000 to U+7FFF_FFFF.

A few points:
  • Because U+10_FFFE and U+10_FFFF are non-characters, not surrogates, the last two points in each plane would be unreachable and by default would also be non-characters.
  • Because U+F_FFFE and U+F_FFFF are non-characters, the entire last two planes (U+7FFE_xxxx and U+7FFF_xxxx) would be unreachable and be non-character planes.
  • We would make the first 17 high ultra-surrogates (U+F_8000 to U+F_8010) illegal so there's no overlap with the 17 already available planes. Though if we only made the first 16 illegal, we could re-use low ultra-surrogate plane 16 to be the first extended UTF-16 plane. Anyone could therefore use it in development mode before they know whether to implement it as UTF-16-extended or just UTF-16.
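
The double indirection can be sketched in a few lines. The assumption here, which the text implies but doesn't spell out, is that the plane number maps directly onto the high ultra-surrogate offset (so plane 0x11 uses U+F_8011) and the position within the plane maps onto the low ultra-surrogate. The function names are hypothetical, the first 17 high ultra-surrogates are treated as illegal, and the non-character exclusions from the first two points above are not enforced.

```python
HIGH_ULTRA_BASE = 0xF8000    # U+F_8000, start of the top half of plane 15
LOW_ULTRA_BASE = 0x100000    # U+10_0000, start of plane 16


def to_ultra_pair(cp):
    """Split a codepoint above U+10_FFFF into (high, low) ultra-surrogates."""
    if not 0x110000 <= cp <= 0x7FFFFFFF:
        raise ValueError("only U+11_0000 to U+7FFF_FFFF need ultra-surrogates")
    return HIGH_ULTRA_BASE + (cp >> 16), LOW_ULTRA_BASE + (cp & 0xFFFF)


def from_ultra_pair(high, low):
    """Recombine an ultra-surrogate pair; the first 17 highs are illegal."""
    if not 0xF8011 <= high <= 0xFFFFF:
        raise ValueError("not a legal high ultra-surrogate")
    if not 0x100000 <= low <= 0x10FFFF:
        raise ValueError("not a low ultra-surrogate")
    return ((high - HIGH_ULTRA_BASE) << 16) | (low - LOW_ULTRA_BASE)


def utf16_pair(cp):
    """Ordinary 1996 UTF-16 surrogate pair for U+1_0000 to U+10_FFFF."""
    offset = cp - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


def utf16_extended(cp):
    """Return the 16-bit code units for cp: 1, 2, or (doubly indirected) 4."""
    if cp < 0x10000:
        return [cp]
    if cp < 0x110000:
        return utf16_pair(cp)
    high, low = to_ultra_pair(cp)
    return utf16_pair(high) + utf16_pair(low)


print([hex(u) for u in utf16_extended(0x110000)])
# ['0xdba0', '0xdc11', '0xdbc0', '0xdc00']
```

Every ultra codepoint therefore costs four 16-bit units, i.e. 8 bytes, in UTF-16-extended.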

2. Over 280 trillion codepoints

I'm not sure if Corrigendum 9 is really needed for Unicode to break free of its 2.1 billion character straitjacket, but UTF-8 bytes FE and FF would still be vacant even after re-extending UTF-8 back to 2.1 billion codepoints.

The original UTF-8 looks like:

    U+0000 to      U+007F:  0xxxxxxx
    U+0080 to      U+07FF:  110xxxxx 10xxxxxx
    U+0800 to      U+FFFF:  1110xxxx 10xxxxxx 10xxxxxx
  U+1_0000 to   U+1F_FFFF:  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
 U+20_0000 to  U+3FF_FFFF:  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U+400_0000 to U+7FFF_FFFF:  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
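
The table translates directly into an encoder. This is a minimal sketch of the original 1- to 6-byte scheme only; the function name is made up, and like the table itself it makes no attempt to reject surrogates or other codepoints that later Unicode rules exclude.

```python
def encode_utf8_original(cp):
    """Encode a codepoint using the original 1- to 6-byte UTF-8 layout."""
    if cp < 0:
        raise ValueError("codepoint must be non-negative")
    if cp < 0x80:
        return bytes([cp])
    # (total payload bits, sequence length, lead-byte marker)
    forms = ((11, 2, 0xC0), (16, 3, 0xE0), (21, 4, 0xF0),
             (26, 5, 0xF8), (31, 6, 0xFC))
    for bits, length, lead in forms:
        if cp < (1 << bits):
            first = lead | (cp >> (6 * (length - 1)))
            tail = [0x80 | ((cp >> (6 * i)) & 0x3F)
                    for i in range(length - 2, -1, -1)]
            return bytes([first] + tail)
    raise ValueError("codepoint above U+7FFF_FFFF")


print(encode_utf8_original(0xF1).hex())          # c3b1
print(encode_utf8_original(0x10FFFF).hex())      # f48fbfbf
print(encode_utf8_original(0x200000).hex())      # f888808080
print(encode_utf8_original(0x7FFFFFFF).hex())    # fdbfbfbfbfbf
```

For codepoints up to U+10_FFFF the output matches what today's UTF-8 produces; the 5- and 6-byte forms are the part removed in 2003.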

Byte form 1111111x is unused. Last time I proposed extending it to 4.4 trillion codepoints, but I don't think that's enough. A better way is to use 11111110 to indicate a 9-byte sequence giving 281 trillion codepoints (281万亿):

U+8000_0000 to U+FFFF_FFFF_FFFF:
    11111110 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
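
A sketch of that 9-byte form, assuming the 48 payload bits are simply spread big-endian across the eight continuation bytes in the same way as the shorter forms. The function names are hypothetical.

```python
def encode_fe9(cp):
    """Encode U+8000_0000..U+FFFF_FFFF_FFFF as 0xFE plus 8 continuation bytes."""
    if not 0x80000000 <= cp <= 0xFFFFFFFFFFFF:
        raise ValueError("outside the proposed 9-byte range")
    tail = [0x80 | ((cp >> shift) & 0x3F) for shift in range(42, -1, -6)]
    return bytes([0xFE] + tail)


def decode_fe9(data):
    """Decode a 9-byte 0xFE sequence back to its codepoint."""
    if len(data) != 9 or data[0] != 0xFE:
        raise ValueError("not a 9-byte 0xFE sequence")
    cp = 0
    for byte in data[1:]:
        if byte & 0xC0 != 0x80:
            raise ValueError("expected a 10xxxxxx continuation byte")
        cp = (cp << 6) | (byte & 0x3F)
    return cp


print(2 ** 48)                                   # 281474976710656
print(encode_fe9(0x80000000).hex())              # fe8080828080808080
print(encode_fe9(0xFFFFFFFFFFFF).hex())          # febfbfbfbfbfbfbfbf
```

Eight continuation bytes of 6 bits each give 48 bits, which is where the 281 trillion figure comes from.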

Last time I proposed using this range to formulaically encode Unihan characters not already encoded in the 70_000 presently encoded ones, as we already do Korean hangul. The Unihan character defined as the simple sequence "又出 across", like billions of other such examples, isn't encoded anywhere in those 70_000. To encode it using an ideographic descriptor would take 9 bytes, each character taking 3 bytes, as ⿰又出, the same as in my proposal (though we'd need to define a control code in the private use area instead of the descriptor to do it properly). But the payoff comes from all characters more complicated than that, if 281 trillion is enough to formulaically generate them to a practical recursion depth. And we could eliminate most characters in the huge Unihan lookup table we presently require.

3. A 6-bit embedded language

The remaining unused byte form 11111111 could then be used as an embedded 6-bit control language.

The committee designing Unicode's predecessor, ASCII, decided it was important to support uppercase 64-character alphabets, and chose to pattern ASCII so it could be reduced easily to a usable 64-character set of graphic codes. So by using range U+20 to U+5F only, we can use everything in ASCII except the 33 control codes, 26 lowercase letters, and 5 symbols ` ~ { } |. We can embed these tokens within the continuation characters 10xxxxxx easily, build single-line commands after each 11111111, yet retain UTF-8's self-synchronization. Not only would the Unicode codepoint repertoire be reducible down to ASCII, but so would Unicode embedded with this control language be. But what would we use this control language for?

Not only are the UTF encodings self-synchronizing, but so is the Unicode repertoire, almost. The original Unicode 1.0 and 1.1 weren't so much, having pairs of symbols (U+206A to U+206F) to indicate scopes like which script's digit shapes to use, whether to join Arabic letters together, and whether symmetric reversal should be used for certain symbols. These have since been deprecated by later codepoints and properties. Unicode still has codes (U+FFF9 to U+FFFB) for embedding annotations like Japanese rubies within text, but recommends a higher-level protocol be used. And there are the bidirectional embedding and overriding codes (U+202A to U+202E); although these were replaced by the simplified isolate codes (U+2066 to U+2069) in 2013's Unicode 6.3, bidirectional control still uses nested scopes.

The 6-bit embedded language has enough characters (i.e. space, 26 uppercase letters, 10 digits, and 27 punctuation and symbols ! ? , . : ; # $ % & @ ^ * = + - _ ' " / \ < > ( ) [ ]) to design a language to be embedded within Unicode to control all scoped functionality, not just taking over bidirectional control but any scoped function normally requiring a higher-level protocol. The possibilities are endless.
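
As a rough sketch of how such a command might sit in a byte stream: a 11111111 byte, then one 10xxxxxx byte per command character. The mapping from character to 6-bit value (the codepoint minus 0x20) is my own assumption, as are the function names; the proposal only says the tokens are embedded in the continuation bytes.

```python
def encode_command(command):
    """Encode a command string as 0xFF followed by 10xxxxxx bytes."""
    out = [0xFF]
    for ch in command:
        code = ord(ch)
        if not 0x20 <= code <= 0x5F:
            raise ValueError(f"{ch!r} is outside U+20 to U+5F")
        out.append(0x80 | (code - 0x20))    # assumed mapping: code - 0x20
    return bytes(out)


def decode_command(data):
    """Decode 0xFF plus continuation bytes back to the command string."""
    if not data or data[0] != 0xFF:
        raise ValueError("command must start with byte 0xFF")
    chars = []
    for byte in data[1:]:
        if byte & 0xC0 != 0x80:
            raise ValueError("expected a 10xxxxxx continuation byte")
        chars.append(chr((byte & 0x3F) + 0x20))
    return "".join(chars)


print(encode_command("SUPER[").hex())                  # ffb3b5b0a5b2bb
print(decode_command(encode_command("BOLD[")))         # BOLD[
```

Because every byte after the 0xFF is a continuation byte, a decoder landing mid-command can still skip forward to the next lead byte, which is what keeps the stream self-synchronizing.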

6 January 2014

The Groovy History of Unicode

Unicode began life as ASCII, when IBM's Bob Bemer submitted a proposal for a common computer code to the ANSI in May 1961, nine months before I was born. Unicode's history has always been tightly intertwined with my own. In late 1963, when I was moved to Auckland NZ, the ASCII codeset was finally agreed upon by committee heavyweights Bemer and Teletype Corp's John Auwaerter, and the ISO later accepted ASCII. But ASCII almost died young as only Teletype machines and the Univac 1050 used it for the next 18 years.

In 1981, ASCII was reborn when the popular IBM PC used it, and since then virtually every computer made has been ASCII-compliant. Its codeset included what we see on the keyboard, plus 33 control codes, some of which (0x0e and 0x0f) allowed other code pages to be swapped in and out, though the code was no longer self-synchronizing. Hundreds of other code pages were created for other natural languages, as well as separate standards by various East Asian governments to handle thousands of ideographs. On 29 August 1988, Xerox's Joe Becker outlined the original 16-bit design for Unicode 1.0, encoding Latin, Cyrillic, and Greek scripts, right-to-left scripts Arabic and Hebrew, the Devanagari 9, Thai, Korean Jamo, and punctuation and math symbols, and later expanded to include 20,000 Unihan (unified Chinese and Japanese) characters in 1992. Specs were written to cover case mapping and folding, diacritical marks, bidirectionality, Arabic joining, grapheme clusters and ordering, varying widths in Asian characters, and breaking algorithms.

But Koreans didn't want to encode any of their writing system as separate Jamo that make up the square-shaped Hangul syllables, which would require 6 to 10 bytes for each syllable, so in 1996's Unicode 2.0, over 10,000 codepoints were remapped to Hangul, and generated by formula. Chinese and Japanese weren't happy with only 20,000 Hanzi/Kanji being encoded, so Unicode 2.0 also brought the UTF-16 encoding using surrogates, adding a million 4-byte codes to the 65,000 2-byte ones, and dividing the codespace into 17 planes. Many users switched to Unicode, with two notable exceptions. Many Japanese were unhappy at their characters being unified with those of China, a situation which wouldn't need to have happened if UTF-16 had been decided on earlier. And Americans didn't like the incompatibility with ASCII, and its requirement for twice the space.

So Ken Thompson's 1993 self-synchronizing UTF-8 encoding became a supported alternative by the Unicode Consortium, with its varying length ensuring backwards compatibility with ASCII. In 2003, its maximum 2 billion codepoints were restricted to 1.1 million to match the UTF-16 encoding, but the top 2 planes (0xF and 0x10) had been reserved for private use. This ensures anyone can access the full 2 billion codepoints using UTF-16 by using those planes as "ultra surrogates", the top half of plane 0xF being high ultra-surrogates and all of plane 0x10 being low ones. We could even re-use plane 0x10 to encode the first "ultra plane" of characters. By decoding through surrogate indirection twice in UTF-16, the maximum rune is `\U7FFFFFFF`, not `\U0010FFFF`.

Unicode started being used in Java, Windows, and Linux, and by 2008, UTF-8 had replaced ASCII to become the most common encoding on the web. The main holdouts to uptake of Unicode are Japan's continuing preference for the JIS encoding, and of course programmers clinging to using ASCII-only tokens in their code. But the next major leap came on Unicode's 15th birthday: on 29 August 2003, the Groovy Language was announced by its creator James Strachan, the nascent vehicle which will introduce the full power of Unicode's vocabulary and possible resultant grammars to the world of software creation. A language spec and testkit were initiated in April 2004, and I joined the online community a year later, watching progress, excited at the opportunity to contribute Unicode to the ecosystem, and even create another implementation of the Groovy Language so it would be the best language ever.

But a great darkness has descended upon the Groovy Language: two voldemorts who hijacked its development and greedily turned it into a business venture to generate profits from web framework consulting and conferences. I endured years of being ignored then handled, and even, though unknown to me at the time, smeared and watched. In one example, a few months before G2One, Inc was incorporated, Laforge said without any discussion, (paraphrased) "We will not be doing anything like Unicode symbols in the foreseeable future". The business plan to stall Groovy development had been decided. To this day, little Groovy stands enslaved by the evil giant Grailiath. But because I've faced up to my groovy destiny to bring the full tersity of Unicode to the world of programming through the Groovy Language, I'm now rebooting it from scratch, spec and all. The day will come when programmers code on their mobile phone screens while riding the subway in Beijing, Hong Kong, and Tokyo.

Last year (2013) brought Unicode 6.3, weighing in at around 110,000 encoded tokens, encompassing exactly 100 scripts! But the greatest achievement in Unicode last year wasn't with the big version release, but in little Corrigendum 9, only 350 words long, changing the definition of non-characters, which are primarily the last two codepoints in each plane, 0xfffe and 0xffff. Unicode's F.A.Q on Noncharacters had previously said "Noncharacters are in a sense a kind of private-use character, because they are reserved for internal (private) use. However, that internal use is intended as a super private use, not normally interchanged with other users." But Corrigendum 9 will be remembered as the announcement that freed Unicode from its straitjacket of 2.1 billion codepoints possible with both the pre-2003 extended mode of UTF-8 and the doubly-indirected ultra surrogate system of UTF-16. For Corrigendum 9 now allows noncharacters to be interchanged, saying "Noncharacter: A code point that is permanently reserved for internal use", crossing out "and that should never be interchanged", because "for distributed software, it is often very difficult to determine what constitutes an internal versus an external context for any particular software process".

In pre-2003 UTF-8, an ASCII byte begins with `0`, a continuation byte begins with `10`, and a leading multibyte begins with `11`, the number of `1` bits indicating the number of bytes in the sequence. 6 bytes can encode up to U+7FFF_FFFF, which is 2.1 billion codepoints. But the leading multibytes `11111110` and `11111111` weren't used in Thompson and Pike's original UTF-8 spec because they were noncharacters `FE` and `FF`. Because of Corrigendum 9 we can now use them in interchange, and in any way we want! Maybe we could use `FE` to signify, not a 7-byte, but a 9-byte sequence; maybe we could use `FF` to signify program code using 6-bit tokens embedded in the following continuation bytes, which would still leave UTF-8 somewhat self-synchronizing. But even if we just extended the original definition of UTF-8, `11111110` would indicate a 7-byte sequence (68 billion codepoints) and `11111111` an 8-byte sequence, which could encode 4.4 trillion codepoints (=4.4万亿, =2^42, yes, 42), up to U+3FF_FFFF_FFFF, a huge improvement on U+7FFF_FFFF, and giving meaning to Unicode!
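
The capacity figures follow from counting payload bits: whatever is left in the lead byte after the run of `1` bits and the terminating `0`, plus 6 bits per continuation byte. A small sketch, with a hypothetical function name, assuming the "just extend the pattern" reading where `FE` leads 7 bytes and `FF` leads 8:

```python
def payload_bits(length):
    """Payload bits of a UTF-8-style sequence of the given byte length."""
    if length == 1:
        return 7                          # 0xxxxxxx
    if not 2 <= length <= 8:
        raise ValueError("length must be 1 to 8")
    lead_bits = max(0, 7 - length)        # free bits left in the lead byte
    return lead_bits + 6 * (length - 1)   # plus 6 per continuation byte


for length in range(1, 9):
    bits = payload_bits(length)
    print(length, bits, f"{2 ** bits:,}")
```

The last three rows show 31 bits (2,147,483,648 codepoints) for 6 bytes, 36 bits (68,719,476,736) for 7 bytes, and 42 bits (4,398,046,511,104) for 8 bytes.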

"And what would we use all these codepoints for?", you ask? Your destiny isn't intricately woven around Unicode's so you've never imagined it. From Unicode 2.0 on, over 10,000 Korean hangul codepoints are generated via a formula each having 2 to 5 jamo, selected from 24 jamo, non-recursively. A font for Hangul only needs to encode the jamo, not every hangul because they can be generated by the formula. This isn't possible for the 70,000 Unihan codepoints, each composed of 1 to perhaps 10 or more components, selected from hundreds of radicals and phonetic components, perhaps recursively. For now, each glyph must be encoded separately in the font, or a huge lookup table used, and when a possible Chinese character not in the 70,000 encoded is required, the definition sequence takes up many bytes, like with many Korean hangul before Unicode 2.0. But 4.4 trillion codepoints could mean Unihan tokens can be encoded by formula to a practical recursion depth, without requiring huge fonts or superlong encodings. And if 4.4 trillion isn't enough, we can define the meanings of `11111110` and `11111111` differently. Wow!

Great things are coming to Unicode thanks to Corrigendum 9 and the Groovy Language reboot!

Post removed because it's been superseded.

27 January 2014

Groovy's Lies and Statistics

In Oct 2013 many articles suddenly popped up in the online IT rags boasting of a "surge in Groovy's popularity", citing its recent rise to #18 in the Tiobe index from outside the top 50 in May 2013. Three months later (Jan 2014), Groovy had dropped back out of the top 50 (#32 in Nov, #46 in Dec). According to another online rag: "After a long discussion with one of the Tiobe index readers, it turned out that the data that is produced by one of the Chinese sites that we track is interpreted incorrectly by our algorithms. So this was a bug," Janssen said. "After we had fixed this bug, Groovy lost much of its ratings." Whoever pulled off that trick didn't stop there, but utilized the feedback effect to prolong the afterglow of the deception: typing "Groovy Programming" into Google still gives...
  • 5. Groovy breaks into top 20 list of programming languages
  • 8. Groovy Programming Language Sees Major Boost in Popularity
  • 22. Groovy Programming Language Sees Major Boost in Popularity
  • 24. Groovy makes debut entry into programming language top twenty
  • 30. Interview about Groovy's popularity boost -- Guillaume Laforge's Blog

This has happened before. In December 2010, Groovy began a sudden rise from outside the top 50 when Groovy tech lead Jochen Theodorou "volunteered" his services to Tiobe to help them improve their algorithms. In April 2011, however, Groovy fell from #25 to #65 on Tiobe in a single month after they increased the number of search engines they monitor. Also then, Stack Overflow started being gamed by someone: the number of monthly questions tagged Groovy shot up suddenly, where questions and answers posted seemed to be coordinated. Github also showed signs of similar manipulation (though the ranking queries I describe there are no longer provided by Github, perhaps to deter such gaming), all in an apparent effort to game the Redmonk Programming Language rankings.

And so when yet another IT rag claimed last week that Groovy smashes records with 3 million downloads in 2013, was anyone not sceptical? Let's look at the claims one by one...
  • With 1.7 million downloads in 2012, all the signs were good that last year was going to be a big one for Groovy
Neither the annual nor monthly download numbers are relevant. There were 17 releases during 2012, giving 100k downloads per release from Maven and Codehaus (assuming the figures are correct). But even that number is exaggerated because the Maven downloads are often triggered automatically, and often gamed by bulk-download scripts. The most accurate number showing active users is the Codehaus downloads at 30% of the total, i.e. 30k per release (assuming the figures are correct and not gamed). I've simplified the chart from the original so it's readable...

[Chart: groovy 2012 only.jpg]

  • but not many people would have predicted today’s news of three million downloads of the alternative language in 2013.
There were 23 releases during 2013, giving 130k downloads per release from Maven and Codehaus (again assuming the figures are correct). The releases occur despite being sparse of features, and could even be timed just to drive up the download metric. E.g. Groovy's roadmap just added a version 2.3, featuring Traits, which were originally announced for version 2.2 but postponed. Similarly in June 2012, Codehaus Groovy had also announced a new meta-object protocol for Groovy version 3.0 to come after version 2.1, but that was postponed early 2013 to make way for a version 2.2.

Per release, most of the increase comes from the Maven figures, the most easily gamed, and that from October 2013. I suspect the person responsible for spoofing this apparent increase is the same person behind Groovy's Tiobe Top 20 deception that very same month. Laforge's chart, again made more readable...

[Chart: groovy 2013 only.jpg]

  • Creator Guillaume Laforge, who arrived at this figure by compiling Maven Central statistics, as well as “slicing and dicing the Codehaus Apache logs”, attributes this staggering increase to the “hard work of the Groovy core development team and the friendly community and ecosystem.”
Someone needs to tell the article's author that James Strachan is the creator of the Groovy Language, not Laforge. Where did they get that idea?

  • These figures are for pure Groovy downloads by the way, and don’t account for bespoke versions of the language which come bundled in with programs such as Grails or Gradle.
Why didn't Laforge use the Grails or Gradle numbers in his stats? Aren't they talking to him, or are they hiding something?

  • As Laforge points out in his blog post, a substantial amount of downloads are of Groovy as a library, rather than as an installable binary distribution, due to the fact that Groovy is essentially a “dependency” to add to projects. The peaks on the chart for the most part correspond to major releases - naturally rising in concurrence with monthly downloads, which rise from 200k to 300k a month between January 2012 and December 2013.

The increase from 150k to 200k in Jan 2013 corresponds to 4 releases instead of 2 that month, and again in Oct 2013. Also in January 2013, the Groovy download at Codehaus increased from 3 artifacts to 4.

  • Although it continues to lag behind Scala, with its huge and tenacious community, in terms of popularity, these figures show that slowly but surely, Groovy is edging its way into the mainstream.
Groovy is in decline. I suspect the sharp decline in downloads in Aug and Sep 2013 corresponds to someone not tending to their bulk download scripts. Groovy's "surge in popularity" was probably timed to coincide with a due diligence investigation on Pivotal Inc, or someone there selling a consulting contract, as well as selling seats to their annual December GrailsXchange conference in London. They probably didn't expect their Tiobe exploit to get found out so quickly.

  • For beginners, the language is very simple to pick up - something that users argue actually hamper it in popularity rankings, on the grounds that, if people understand how something works, they won’t be spending hours looking for tutorials and pushing up your search rankings.
In Feb 2013, Dr Dobbs editor Andrew Binstock wrote: "Groovy is a language primed to be a major player. There is the conundrum. The endless variety of features requires considerable documentation, which is simply not available, especially for the advanced features that give Groovy much of its benefit. And so, if you jump in today, you'll find the language is easy to learn, but hard to master."

  • And let’s not forget the all important Java intercompatibility factor. Due to its similarity to Java, it’s relatively painless for these devs to master, the main difference being that it’s dynamically typed, which removes a lot of boilerplate, and adds closures to the language.
Groovy is only simple to pick up if you just use the terser closure and collection syntax when manipulating Java classes for tests. Most users aren't using anything more complex, except Grails users who also use the metaobject protocol.

  • There’s also a huge ecosystem which has grown up around it, and, as the newly emerged static website generator Grain shows, there are plenty of additions in development.
Grails and Gradle are the only users with any traction. The rest of the promoted ecosystem is just an "echo system". Grain, like Griffon and Gaelyk, is just hype.

  • A healthy industry interest in Groovy may well have also contributed to this spike in ranking. Notably last year, the language was featured in Pivotal’s recently released Spring Framework 4.0,
The Spring Framework is separate from Groovy, which together with Grails and Spring Scala sits at the very bottom of the list of Spring Projects. Within the Spring Framework 4.0 Reference, the Spring Expression Language takes up all of chapter 7, whereas Groovy takes up little section 28.3.3 only.

  • as well as the Gradle build automation system, currently being utilised by Google for Android app builds.
Google uses Groovy in Gradle to build Android, but Groovy can't actually run on Android. Developers in other JVM languages already use them to build Android apps. According to Laforge in 2013: "Groovy is not able to run properly on Google's Android mobile platform ... (It takes) 20 seconds to start up a simple Hello World". Laforge was seeking Google Summer of Code students to try to make Groovy run on Android, but Google didn't accept that project, or any projects related to Groovy, perhaps because no students were interested, or maybe because of Codehaus Groovy's history of mishandling GSoC projects. The last Groovy project ever accepted was one in 2011 rewriting Groovy's antiquated parser in Antlr 3.0. The project failed and Google only paid out half the project money. But there are plenty of Scala and Clojure projects on that GSoC list.

But even Groovy's future in Gradle is in doubt. Gradle still ships with Groovy 1.x, and they're looking at bundling alternative build languages such as Scala in Gradle 2.0.

  • Finally, with its last major releases, Groovy resolved some major legacy performance issues, which had long turned many people off the Java alternative. As Andrew Binstock notes, with these developments, the language was finally poised to explode. The community is relatively small, but it’s certainly growing. If the Groovy team can continue the good work they’ve started, 2014 may be yet another bumper year for adoption - though we’re not convinced Team Scala will be jumping ship just yet.
Groovy's not ready to explode. Absent other choices, Java developers will replace Groovy with the speedy JDK8-bundled "Nashorn" Javascript when upgrading from Java 7 to 8. Codehaus Groovy is on a downward spiral, its trajectory tied to JDK7 and before. But I've been designing and building the new, real, Groovy Language to replace the legacy one when it dies so there'll be a real choice available. Then Groovy will explode!

Although I've tried to focus on building the new rather than criticising the old, lies such as these Tiobe and Maven exploits regarding Groovy make it difficult to keep silent. Some lies have been outright childish, such as when on 29 August 2013, Groovy's 10th birthday, someone who's been active for over 3 years updating release info on Groovy's Wikipedia page added Laforge's name to Strachan's as a co-designer of Groovy, giving him 3 titles (designer, project manager, and spec lead), and ignored the rest of their developer team. We don't know who this is: Laforge claimed during a previous undo-redo spat that he thinks he's never done any modifications to Groovy's wikipedia page as far as he can recall.

I really had to undo that addition. Because the JSR spec was changed to dormant in April 2012 after being inactive for 8 years, I removed Laforge's spec lead title also, then added the 3 technical people who are listed in Groovy's Codehaus repository as Despots...

wiki now and then.jpg

Someone should be concerned about celebrating Groovy's 10th birthday by listing the developers who do the actual coding and testing, instead of giving the manager more titles. But perhaps this is a frameup?

19 January 2014

A Groovy Syntax Curse

There are 2 reasons why I'm still interested in the Groovy Language: its groovy name, and its syntax, which always drew me back. The syntax of some other languages has never appealed.

I first tried out Python in 1997, and JPython a year later; I had already used Cobol and VBA at work, and C, C++, and Java in study. I didn't pursue using Python at that time for various reasons: I didn't like its indentation syntax; it wasn't promoted by a big corp like Microsoft, and I was always thinking about CV padding back then; it was in the process of changing its class semantics and I didn't think a language which had to change something important would be worth using; and the occasional funny syntax like (1,) and __THIS__. When I learnt some Haskell more recently, although the indentation syntax is optional, everyone uses it so it's not really optional.

tiobe top 6.jpg

When I program, I want to use the syntactic techniques available to express different types of structures, just like in natural language. In the 6 most popular languages on Tiobe (C, Java, Objective-C, C++, C#, and PHP), semantic structure is indicated by bracketing and separator tokens, and thematic structure by whitespace distribution, naming, and commenting. Indentation syntax languages force me to use whitespace for semantics, taking away the choices I have to express thematic structure in my code.

I came across Ruby in 2001, and was put off by | | being used to bracket instead of a single symbol like -> or =>, but mostly by end everywhere. The most commonly used lexis in natural languages are the shortest, in speech, hand signing, and orthography, so it should also be in programming languages. C used the symbols { } ( ) [ ] in their pairs for good reason, the same reason scientists and writers have been using them, because they're the shortest to express the most frequently occurring ideas, which in programming is block structure. When I encountered Groovy in 2004, I found the syntax more intuitive than Python or Ruby's, especially the builder syntax which Kotlin and Dart now also use.

Around 2007, I tried out J, and later APL. APL seemed easier than J because it uses { } ( ) [ ] in their pairs to parenthesize things, whereas J uses those 6 symbols as standalones, making the code harder to read. Verbs in J are often more than one character but always ASCII, whereas in APL they're single characters but often non-ASCII, so they're harder to enter but, I found, easier to read. I especially like APL's spaceless code. Like newlines, spaces should be optional in the code semantics, freeing developers to use them solely for the thematics.

I tried out Scheme in 2009, then started learning Clojure about 3 years ago. The syntax for writing macros in Clojure is terser than Scheme's, and the use of { } ( ) [ ] for various structures makes Clojure code easier to read. Clojure gives us great ability to vary the word ordering aspect of thematic structure by providing the arrow operators -> and ->> in addition to standard ordering. But even after years of using Clojure as my sole hobbyist language, I still find the S-expression syntax unintuitive to write and difficult to read later. The syntax could be the primary reason programmers are slow to learn and use Clojure. I miss the C/Java/C#/Javascript syntax, even though S-expressions are required to enable Clojure's macros. This is the Groovy Syntax curse, the reason why I keep coming back to Groovy.
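For readers who haven't met the arrow operators, here's a rough sketch in Python of what -> (thread-first) and ->> (thread-last) do to word ordering. It's only an approximation: in Clojure they're macros that rewrite the code before it runs, whereas the hypothetical helpers thread_first and thread_last below are ordinary functions applied at runtime, and their names and the tuple convention for steps are my own assumptions for illustration.

```python
from functools import reduce


def thread_first(value, *steps):
    """Rough analogue of Clojure's -> : the running value becomes the FIRST argument of each step."""
    def apply_step(acc, step):
        if callable(step):          # a bare function takes the value as its only argument
            return step(acc)
        func, *args = step          # a tuple is (function, extra args...)
        return func(acc, *args)
    return reduce(apply_step, steps, value)


def thread_last(value, *steps):
    """Rough analogue of Clojure's ->> : the running value becomes the LAST argument of each step."""
    def apply_step(acc, step):
        if callable(step):
            return step(acc)
        func, *args = step
        return func(*args, acc)
    return reduce(apply_step, steps, value)


def is_odd(n):
    return n % 2 == 1


def square(n):
    return n * n


# Standard ordering: read inside-out, innermost call first.
nested = sum(map(square, filter(is_odd, range(10))))

# Threaded ordering: read left-to-right, in the order the steps happen.
threaded = thread_last(range(10), (filter, is_odd), (map, square), sum)

# Thread-first suits functions whose subject comes first, such as string operations.
words = thread_first("  groovy syntax  ", str.strip, (str.replace, "groovy", "Groovy"), (str.split, " "))
```

Both orderings compute the same thing; the choice between them is purely thematic, which is the point.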

I now believe the best solution is to have both the lispy Clojure syntax and a C/Java-like one as two sides of the same language. This is my motivation for building Grojure, a syntactic layer sitting atop Clojure, decomplecting the underlying functionality of Clojure from the syntactic richness of C/Java. I'm presently reworking its design for the upcoming 0.11.0 release to map more closely to Clojure's core macros and functions, utilizing the arrow operators as much as possible so programmers can give any thematic structure they want to their code. Grojure will continue to use the Kern combinator parsers rather than a faster executing solution because the parsers are composable, part of Clojure's design philosophy.

30 November 2013; truncated 22 October 2016

Clojure Vocab, Grojure Grammar

Version 1.5 of clojure.core adds 444 functions, 75 macros, and 29 refs to the 16 special forms in Clojure. 503 of these functions and macros use alphabetic characters in their names, often with - between words, and often with ?, !, ->, *, or ' at the end. Those symbols, along with +, _, =, <, $, &, and |, can in theory be used anywhere in a name, including user-defined names. #, %, /, ., and : can also be used with certain restrictions in names. The remaining 14 symbols, ", ~, `, ^, @, \, (, ), [, ], {, }, ;, and ,, are reserved by the Clojure grammar.
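The counts in the paragraph above can be checked mechanically. The following Python sketch just re-enters the three groups of symbols as listed there and confirms they split the 32 ASCII punctuation characters with no overlaps or gaps. The group names are my own labels for this illustration, and the grouping is taken from the paragraph, not derived from the Clojure reader itself.

```python
import string

# Symbols the paragraph says can in theory appear anywhere in a name
# (the -> ending contributes both - and >).
FREE_IN_NAMES = set("?!->*'+_=<$&|")

# Symbols usable in names only with certain restrictions.
RESTRICTED_IN_NAMES = set("#%/.:")

# Symbols reserved by the Clojure grammar, as listed in the paragraph.
RESERVED_BY_GRAMMAR = set("\"~`^@\\()[]{};,")

GROUPS = {
    "free": FREE_IN_NAMES,
    "restricted": RESTRICTED_IN_NAMES,
    "reserved": RESERVED_BY_GRAMMAR,
}


def classify(symbol):
    """Return the group label for one ASCII punctuation character."""
    for label, members in GROUPS.items():
        if symbol in members:
            return label
    raise ValueError(f"not an ASCII punctuation character: {symbol!r}")


# Group sizes: 13 free, 5 restricted, 14 reserved.
sizes = {label: len(members) for label, members in GROUPS.items()}

# Together the three groups should cover every ASCII punctuation character exactly once.
covers_all = (FREE_IN_NAMES | RESTRICTED_IN_NAMES | RESERVED_BY_GRAMMAR) == set(string.punctuation)

# Of the 444 functions and 75 macros, 503 have alphabetic names,
# which leaves this many purely symbolic ones.
non_alphabetic_names = 444 + 75 - 503
```

So every punctuation key on an ASCII keyboard is accounted for, and only a handful of the core names are purely symbolic.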

A mere 16 functions and macros in clojure.core, however, are non-alphabetic symbols. While allowing great freedom for developers to use symbols in names, in practice Clojure shunts the expressivity of its core library onto over 500 pre-defined alphabetic names, and Clojure developers follow this naming practice in their own programs. Many of these core functions and macros in particular also use pre-defined alphabetic keywords in parameters to increase this shunting. Because IDE name-completion doesn't work well with Clojure's dynamic types, the burden is on developers to remember all these names, or rely on a long cheat sheet, a burden that would be greatly eased by syntax in other programming languages. Such is the price of providing macros in Clojure.

Grojure attempts to ease this cognitive overload by reducing the number of alphabetic names to a fraction of this. If developers can remember the name or symbol for something, then they can remember it exists. The extra syntax required must also be easy to remember, primarily by copying that of existing programming languages, and adding to it in an intuitive manner.

Post removed because it's been superseded.

16 November 2013

Groovy continues

This blog continues from my previous one, but this time I'll endeavour to be positive, only talking about the positive features of the Groovy reboot, rather than criticising the negative aspects surrounding the legacy version. I shouldn't have let myself be tricked into publicly reciprocating someone else's bile.

I won't be deleting any of those blog entries, though. Someone's been saying he doesn't want to get Codeplex to remove them because he wants to be "reasonable" and give me the "opportunity to return to the IT profession" by removing them myself, so prospective employers won't be negatively influenced by "unprofessional web pages" with my name on them. This tall story is just a cover for the real reason he doesn't want my website forcibly removed, or cracked and defaced.

He owns the Codehaus implementation of Groovy Language but not the brand. His corporate attorneys have told him if he wants the brand, then I must be seen to forfeit my claim to it, primarily by deleting the blog entries where I follow a process to spec Groovy. A cracked and defaced blog or a forcibly removed one won't stand up in court. That's why someone's been using his left hand to entice me to relinquish control of the brand and his right hand to tighten the noose around me via proxies. By keeping both his hands separate, no-one can prove duress.

originally written on 11 November 2013, but posted later

Groovy's Coterie of Sociopaths

1. Pivotal's Rocher

10 years ago while staying at a backpackers in Sydney, I walked into the building and up the stairs. Behind me I heard some males walk in and say to the attendant something like "You shouldn't let people like that guy live here. He lived in Melbourne before and lots of people complained about him." Just one of hundreds of similar little incidents. This time, I heard the attendant reply in the Paddington accent something like "How dare you come in here and slag off one of our guests. We haven't had any specific complaints by anyone who actually lives here." Many people in Sydney and elsewhere know how rumors and innuendo can easily be nurtured by those with specific interests to promote and protect, and don't pay attention to them. That should have been the same response from Pivotal Inc's Graeme Rocher when busy-bodies from Melbourne contacted him in 2006 about me soon after I started occasionally posting on the Groovy users mailing list. Instead, he stirred up similar talk among the Groovy developers and community, and clueless management sorts like Guillaume Laforge dutifully listened and obeyed.

Rocher is just one of a coterie of Sociopaths out there targeting me and the Groovy Language (the real one here at Codeplex, that is, not the fake one over at Codehaus). The 3 interconnected power structures I talked about in March's blog entry are still the most accurate picture of the entrenched interests involved. There's probably more than one Sociopath involved from the US-centered IT industry, but Rocher's the one I know most about, so I'll use him as Silicon Valley's rep in the coterie.

2. Melbourger

Another power structure is the residential property market in Australia and New Zealand, which never crashed when the American and European ones did because of the far greater flow of Asian immigrant house-buyers into the countries. In my last programming job in Melbourne, some Sociopath put a snooper on my PC and set up a display downstairs so they could watch a live feed of me working. Nothing illegal about managers monitoring their workers, maybe not even targeting specific ones, though when the display is in practice viewed by anyone who wanders by, the intent is clearly to ridicule, but who can prove intent? I was probably first surveilled in Auckland around 1998 when some telecoms workers dropped by one evening to take down my dedicated internet line while they did some work outside.

In China, many people are monitored, often including via bugs in the apartments many foreigners live in. It's easy, though, to know when you're being listened to: just talk to yourself a lot, about things both true and absurd, using intonations from serious to comical, then wait. The Sociopaths controlling the surveillance use many Clueless as intermediaries, and the Clueless have an emotional need to show off that they're part of the action. While living in Melbourne I'd sometimes overhear people, even at work, talking about what webpages I'd been accessing from my apartment. When I talk to myself in "private" in China, I never have to wait long. By 2007, my conversations with myself were being hinted at by the Clueless on the Groovy mailing lists from as far afield as France.

There are probably many Sociopaths in the cities I've lived in back home, so I'll just use the most recent as Australia/NZ's rep in the coterie of Sociopaths.

3. Mr Han

After 1500 AD, people started leaving Europe and Asia for the New World, an immigration which continues to this day. If America and Australia are where Chinese and Indians want to live, then Americans and Australians can influence situations in China and India, stringing along people with promises and approval. One big catch on the hook is enough: a fish at a big college in China can lead along more fishes at smaller nearby colleges. As in Australia, there are many Sociopaths here in China but I don't know who they are. Before I arrived, a Chinese in NZ told me that when I got here I would be harassed, and wouldn't know who was doing it because in his culture they keep themselves well hidden. The closest I've got is the name Mr Han. The deal-making between a Sociopath in China and one in Australia follows the same pattern as everywhere, like in the Karate Kid remake where both players benefit: Dre wins the fight and the girl with the violin, and Mr Han wins recognition as the master teacher.

4. Hollywood's Errol

The only name I've managed to rustle up here is "Errol someone", who works in Preaching Placements, the office next door to Product Placements. For years I watched American and British TV shows using a DVD player in China so I could do something else on my internet-connected computer at the same time. Only a small number of certain shows would I watch direct over the internet. And only those shows would have preachments aimed at me, and only after the date I switched from DVD to online viewing. Yes, not content with sarcastic and threatening submissions to Hacker News and other sites I visit, the coterie of Sociopaths and their Clueless useful fools even show off how idle their hands are by sticking snide remarks into the TV I watch.

It's taken years to uncover the existence of these 4 Sociopaths (or perhaps 4 groups of them), and disentangle which is which. For years I was focusing on the Clueless intermediaries instead of discerning the macabre manipulations occurring in the background. I was exposing the Laforge instead of going after the Rocher behind the decoy. Perhaps they don't all know about each other, unlike the UK meeting room in The X-Files, but only communicate in pairs. But I've recently realized that even the 4 power structures represented by these Sociopath reps don't fully explain all of the outside attacks on my effort to respec and rebuild the Groovy Language from scratch. The events only make sense if behind the coterie of Sociopaths is a Supreme Sociopath, who knows and makes deals with all of them based on a bigger agenda.

This agenda is bigger than some frat boy getting higher ratings for his TV show, or some Chinese jockeying for better places for their children in overseas colleges, or some IT company's quarterly earnings, or even keeping house prices high for a whole city of early retirees. This agenda strikes at the very heart of what the Groovy Language reboot is all about, and the Master Sociopath behind it all is obviously associated with the American NSA.

5. Mr X at the NSA

The Groovy Language, as defined by its creator James Strachan, was always destined to be a success. An extended C/Java/C#/JavaScript dynamically-typed syntax running atop the JVM, its behavior standardized, then many implementations being built, even atop other platforms. But he let Laforge come in and take over. Laforge scuttled the spec, turning the reference implementation into the language itself, and sold Groovy and himself out to Rocher in return for Grails being built. Laforge then stifled innovation by making little changes to the AST to break any plugins or doco whose author he didn't approve of, hired a mate to launder the static-compilation code for one of those plugins, and foisted it on users by deceitfully bundling it as part of the same product, even though a static compiler and a dynamic compiler are as different as chalk and cheese.

Now it seems Rocher quickly pimped Grails and himself out to the American NSA (or perhaps one of its clients) in exchange for protection, like Jack Nicholson's character in The Departed who delivered his lieutenant over to the FBI every few years in exchange for being allowed to stay in business. The price Rocher paid: don't let Unicode anywhere near Groovy. For the NSA know a Unicode-enabled programming language will increase the productivity of Chinese and Japanese programmers manyfold, whereas American and European programmers won't bother learning how to make use of it. The Groovy Language reboot is Unicode-enabled, with emphasis on using Unihan (i.e. Chinese/Japanese) characters, though all programs will be viewable using only the ASCII keystrokes used to enter the Unicode, or the English translation of their Unihan name.

I will not be beaten by the American NSA and its clients, but shall continue to rebuild the Groovy Language, respecifying it and rebuilding it atop Clojure. It will utilize every token in the Unicode character set, including all 85,000-odd Hanzi/Kanji and Hangul. With such tersity, programmers will be able to write programs on a smartphone screen while riding the subway in Hong Kong, Tokyo, Seoul, and Beijing. Groovy will prevail!

tiaoqi zai luxiang.jpg

Last edited Oct 22, 2016 at 6:36 AM by gavingrover, version 23

