Gavin Grover's GROOVY Wikiblog

28 March 2014; updated 26 Sep 2017

Building Unicode Ultra...

Since digging into Unicode's background during my reboot of the Groovy Language, I've discovered it isn't finished being built. To be of use in Groovy, I need to finish building Unicode before continuing with the Groovy reboot. The final build will comprise the following components:

  1. the ultra extensions to Unicode reinstating the pre-2003 upper limit to UTF-8 as originally designed by Pike and Thompson, and the ultra surrogate system for where this isn't available
  2. a 6-bit control language embedded in the 10xxxxxx bytes following a 11111111 byte, using characters U+20 to U+5F, to, among other uses, encode decomposition commands for Unihan characters
  3. intermediate decompositions of the 75,000 Unihan characters presently in Unicode (e.g. 丨 and 川 are included in Unicode, but not the two vertical strokes used in many characters, such as under the 人 in 介) to also be included in Unicode
  4. decomposition tables for all the Unihan characters similar to my own first attempt so they can be converted to Ultra Decomposed Normal Form
  5. an algorithm to generate 2.1 billion Unihan characters formulaically into the 5- and 6-bit encodings in UTF-8-Ultra so Unihan characters can be converted to Ultra Composed Normal Form
  6. correct standard fonts and generation code for all such codepoints so users don't have to rely on erroneous fonts such as Microsoft Windows 7's SimSun font which has the wrong pictures for most Unihan extension C and D characters
  7. standard keystrokes for entering Unihan characters pictorially so everyone can easily enter them, irrespective of their native language
The primary purpose of building Unicode Ultra is so Unihan ideographs can be used outside of their spoken language contexts in Asia. When Japanese use single Kanji to represent multisyllabic words, they read them visually not phonetically. By making Unihan tokens visually defined characters in Unicode Ultra, not only Japanese but Chinese and Westerners and others can also easily use them.

The primary purpose of rebooting the Groovy Language is to showcase Unihan in a programming language. Unicode's ID_Start and ID_Continue properties (and their X- versions) treat Unihan the same as other alphanumerics in identifier names, but Real Groovy will put an implied space before and after every Unihan so they can only be used as standalone characters, like in Japanese kunyomi Kanji and Chinese classical texts. Programming language grammar can thus be much terser, readable on a smartphone screen.

27 March 2014 (updated from 22 March)

Unicode Ultra Normal Forms

The proposed UTF-8-Ultra will need two new normal forms, bringing the total to six...
  • NFC: canonical composition
  • NFD: canonical decomposition
  • NFKC: compatibility composition
  • NFKD: compatibility decomposition
  • NFUC: ultra composition
  • NFUD: ultra decomposition

A very rough picture of these forms is:

     NFKD <-------+----> NFKC (=canonically compose NFKD)
           NFD <--+----------> NFC (=canonically compose NFD)
NFUD <------------+---------------> NFUC (=ultra-compose NFUD)

Ultra Decomposition Normal Form

NFUD relies on UTF-8-Ultra introducing a 6-bit control language to Unicode, embedded in the 10xxxxxx bytes following a 11111111 byte, as suggested in proposal 3(b) of the draft spec for Ultra Unicode. The characters U+20 to U+5F will be used, i.e. the space, 26 uppercase letters, 10 digits, and 27 punctuation and symbols !?,.:;#$%&@^*=+-_'"/\<>()[]. NFUD will extend NFD, and deprecate NFKD and NFKC.

Presently, if a character such as ª is compatibility decomposed (NFKD), it becomes a, losing some information, namely that it had decomposition type super. Ultra decomposition (NFUD) will retain the compatibility decomposition types as ultra control commands, e.g. ª would become control command SUPER[, followed by a, followed by control command ]. Case mappings would similarly be decomposed, e.g. A becomes command UPPER[, token a, command ]. We would also decompose all Unihan tokens to a set of core components, e.g. 语 would become command UNIHAN[ACROSS[, token 讠, command DOWN[, token 五, token 口, command ]]]. Because NFUD commands can be nested, we could process UTF-8-Ultra Unicode in NFUD as a tree.

We would also use the control commands for embedding rich text commands besides these, such as BOLD, enabling UTF-8-Ultra higher-level protocols to extend, not envelop, Unicode encodings in a non-breaking manner, so the protocol will more likely become widely used, as UTF-8 did when it extended the ASCII encoding. But until we implement the embedded 6-bit control language in UTF-8, we need an alternative method of specifying such commands in Unicode texts, just as we use ultra surrogates with UTF-8 because the 5- and 6-bit extensions haven't been available since 2003.

We'll therefore use a markup language in our text and Real Groovy programs. If using backtick ` for escaping, we would write:
  • `SUPER[`a`]` in our text for ª,
  • `UPPER[`a`]` for A,
  • and `UNIHAN[ACROSS[`讠`DOWN[`五口`]]]` for 语.
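Because the escaped commands nest, this markup can be read straight into a tree. A rough sketch of such a parser (the grammar here is my own guess at the notation above, nothing specified anywhere):

```python
import re

def parse_ultra(text):
    """Parse backtick-escaped ultra-command markup, e.g. `SUPER[`a`]`,
    into a nested tree of (command name, children) pairs."""
    root = ("TEXT", [])              # document root
    stack = [root]
    for i, seg in enumerate(text.split("`")):
        if i % 2 == 0:               # even segments are literal characters
            stack[-1][1].extend(seg)
        else:                        # odd segments hold command tokens
            for tok in re.findall(r"[A-Z]+\[|\]", seg):
                if tok == "]":       # close the current scope
                    stack.pop()
                else:                # NAME[ opens a nested scope
                    node = (tok[:-1], [])
                    stack[-1][1].append(node)
                    stack.append(node)
    return root

# The examples above parse as:
assert parse_ultra("`SUPER[`a`]`") == ("TEXT", [("SUPER", ["a"])])
assert parse_ultra("`UNIHAN[ACROSS[`X`DOWN[`YZ`]]]`") == \
    ("TEXT", [("UNIHAN", [("ACROSS", ["X", ("DOWN", ["Y", "Z"])])])])
```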

Ultra Composition Normal Form

NFUC is an extension of NFC, so all canonical compositions are also ultra compositions. So the Hangul syllable 가 will be the NFUC version of the Jamo sequence 가, and ñ will be the NFUC version of the sequence n◌̃ , just as they are presently the NFC versions. The additional processing will be that where a formulaically generated version of a Unihan character exists in planes 0x20 to 0x7FFF (i.e. the planes representable by 5 or 6 bytes in UTF-8-Ultra), the formulaically generated version will be substituted.

When converting to NFUC, the character stream will first be converted to NFUD, the canonical reordering algorithm applied, then converted to NFUC, leaving decomposition sequences not having formulaic versions in the data. Whereas ultra-composition will process such 6-bit commands, canonical composition will ignore them and compatibility composition will remove them.
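The canonical half of this pipeline already exists today; for example, Python's unicodedata performs the NFC compositions mentioned above (the ultra half is, of course, still hypothetical):

```python
import unicodedata

# Canonical composition (NFC) already handles the Jamo and ñ cases:
jamo = "\u1100\u1161"                 # conjoining jamo ᄀ + ᅡ
assert unicodedata.normalize("NFC", jamo) == "\uac00"     # syllable 가

n_tilde = "n\u0303"                   # n + combining tilde
assert unicodedata.normalize("NFC", n_tilde) == "\u00f1"  # ñ

# Compatibility decomposition (NFKD) is the lossy step NFUD would replace:
assert unicodedata.normalize("NFKD", "\u00aa") == "a"     # ª loses SUPER
```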

9 March 2014

Groovy Unicode UTF-8-Ultra

I've only just begun publicizing the Ultra extensions to Unicode I'll need when the Groovy Language reboot reaches maturity, but initial negative feedback made me think about the best next step. Besides the two main additions (i.e. extending UTF-16 with doubly-indirected surrogates, and removing the U+10FFFF upper limit in UTF-8), the proposal also presents a minor addition that's not recommended, i.e. extending the existing UTF-8 codepoint range with the same ultra surrogates proposed for UTF-16. Although I only put it in for completeness, it could be the best choice for building a reference implementation.

UTF-8-Ultra uses existing private use planes to define "ultra-surrogates". The top half of plane 0xF becomes 32,768 high ultra-surrogates (U+F_8000 to U+F_FFFF) and all of plane 0x10 becomes 65,536 low ultra-surrogates (U+10_0000 to U+10_FFFF). Together, they can access 2,147,483,648 codepoints (i.e. 32,768 planes) from U+0000_0000 to U+7FFF_FFFF. We would keep the first half of plane 0xF (U+F_0000 to U+F_7FFF) as 32,768 private use codepoints alongside the 6400 private use ones in the BMP. Of the additional ultra planes, we would leave another 983,040 codepoints (U+11_0000 to U+1F_FFFF) alone for general private use, they being the codepoints representable by 4 bytes in the original 1993 spec for UTF-8.
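One plausible arithmetic for that pairing, sketched in Python (the exact offsets are my assumption; the proposal only fixes the ranges):

```python
HIGH_BASE = 0xF8000     # first high ultra-surrogate (top half of plane 0xF)
LOW_BASE = 0x10_0000    # first low ultra-surrogate (all of plane 0x10)

def to_ultra_pair(cp):
    """Map a codepoint in U+0000_0000..U+7FFF_FFFF to an ultra-surrogate
    pair: 32,768 highs x 65,536 lows = 2,147,483,648 codepoints."""
    assert 0 <= cp <= 0x7FFF_FFFF
    return HIGH_BASE + (cp >> 16), LOW_BASE + (cp & 0xFFFF)

def from_ultra_pair(high, low):
    """Invert the mapping above."""
    return ((high - HIGH_BASE) << 16) | (low - LOW_BASE)

assert to_ultra_pair(0x7FFF_FFFF) == (0xFFFFF, 0x10FFFF)  # the extremes line up
assert from_ultra_pair(*to_ultra_pair(0x1234_5678)) == 0x1234_5678
```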

The planes representable by 5 or 6 bytes in that spec, 0x20 to 0x7FFF, we would use to implement a proof-of-concept showing how we can represent many CJK Unihan ideographs using the same formulaic method used to represent Korean Hangul. Each Hangul block is made up of 2 or 3 jamo: a leading consonant, a vowel, and possibly a trailing consonant. There are 19 leading consonants, 21 vowels, and 28 trailing consonants (including none at all), giving a total of 11,172 possible syllable blocks, generated by formula into range U+AC00 to U+D7A3. But only 2350 of them are used often enough to justify their inclusion in South Korea's KS-X-1001 national standard; the other 8822 are only there for completeness.
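The Hangul generation formula is standard Unicode arithmetic and fits in a few lines:

```python
S_BASE = 0xAC00          # first generated syllable, 가
V_COUNT, T_COUNT = 21, 28

def compose_hangul(l, v, t=0):
    """Compose a Hangul syllable from jamo indices: leading consonant l
    (0..18), vowel v (0..20), trailing consonant t (0..27, 0 = none)."""
    return chr(S_BASE + (l * V_COUNT + v) * T_COUNT + t)

# 19 * 21 * 28 = 11,172 syllables, U+AC00 through U+D7A3
assert compose_hangul(0, 0) == "\uac00"        # 가
assert compose_hangul(18, 20, 27) == "\ud7a3"  # the very last syllable
```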

There are many more CJK components than Korean jamo: semantic radicals and phonetic components together number 500 to 1000 depending on how you count them, and unlike jamo, they can combine recursively to form Chinese characters (e.g. 懂 is 忄 followed by 董, which is 艹 over 重, which is 千 joined to the top of 里, which is 土 joined under 田, which is 冂 surrounding the top of 土). By discovering the best combination of base components and recursion depths, we can create 2.1 billion new Chinese characters which don't need their own glyphs in a font but can have their images calculated and rendered by formula. Most of them are just there for completeness, but some could deprecate a large chunk of the 75,000 already encoded.

Building such a UTF-8-Ultra reference implementation will put to rest some of the negative feedback coming my way:
  • Having conversions from UTF-8 or UTF-32 to UTF-16 fail due to a valid but unrepresentable-in-UTF-16 code point would be an extra headache to deal with (that many would forget to handle)
By encoding UTF-8 with ultra-surrogates we're showcasing the way UTF-16 can be extended so all valid pre-2003 UTF-8 (and UTF-32) codepoints are representable in UTF-16. The ridiculosity of using these surrogates with UTF-8 to access 0x7FFF_FFFF codepoints will raise serious questions about why UTF-8 was clipped to only 1.1 million characters back in November 2003, the exact same time Laforge joined a fledgling 3-month-old programming language ultimately intended to showcase Unicode in programming. By severely restricting the lexis of Unicode and the grammar of Groovy, business consultants have been able to ply their fake productivity tools around IT shops as substitutes for the real thing.
  • for zero benefit
By discovering a good formula for building Chinese characters we're showcasing an application with benefit. Because the Korean version of the same application was already encoded in Unicode 2.0, no-one can seriously suggest it's not a viable application. But I'm certain some will try.
  • Good for us nobody will listen to you
Because of this comment and the many downvotes I assume a voting ring was in place and UTF8 Everywhere was responsible. If so, I would've expected a different attitude from them since my proposal makes UTF-8 more useful and UTF-16 less so. I won't really need to use more than 137,000 measly private use codepoints in Unicode for a while, but I'm getting in early on what I expect to be a long decade or two of extolling the virtues of Unicode Ultra to deaf ears. But eventually programmers will code entire programs while riding to work on the subway in Beijing, Hong Kong, and Tokyo using Groovy and Unicode.

11 January 2014

The Groovy Future of Unicode

In last week's Groovy History of Unicode, I alluded to some possible Unicode futures which I'll explore here...

1. UTF-16-extended

1996's UTF-16 can access 17 planes, i.e. 1_114_112 codepoints. In plane 0, i.e. the BMP from U+0000 to U+FFFF, the 1024 high surrogates (U+D800 to U+DBFF) and 1024 low surrogates (U+DC00 to U+DFFF) can access an extra 1_048_576 codepoints (i.e. 16 extra planes) from U+1_0000 to U+10_FFFF. UTF-16 was subsequently adopted by Java and Windows, then in 2003, UTF-8 was accepted by the Unicode Consortium, but with its maximum 2.1 billion codepoints restricted to the same 1.1 million points that UTF-16 can access.
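For reference, the standard surrogate arithmetic being described here:

```python
def decode_surrogate_pair(high, low):
    """Standard UTF-16: a high surrogate (U+D800..U+DBFF) and a low
    surrogate (U+DC00..U+DFFF) together select one of the 1_048_576
    codepoints in U+1_0000..U+10_FFFF."""
    assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
    return 0x1_0000 + ((high - 0xD800) << 10) + (low - 0xDC00)

assert decode_surrogate_pair(0xD800, 0xDC00) == 0x1_0000   # first astral point
assert decode_surrogate_pair(0xDBFF, 0xDFFF) == 0x10_FFFF  # last astral point
```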

The top 2 planes (15 and 16) had been reserved, however, for private use which means we can use them as second-stage "ultra surrogates" to access the remaining 2.1 billion codepoints available in the original UTF-8 proposal. We would keep the first half of plane 15 (U+F_0000 to U+F_7FFF) as 32_768 private use codepoints alongside the 6400 private use ones in the BMP. But we'd reassign the top half of plane 15 to be 32_768 high ultra-surrogates (U+F_8000 to U+F_FFFF) and all of plane 16 as 65_536 low ultra-surrogates (U+10_0000 to U+10_FFFF). Together, they can access 2_147_483_648 codepoints (i.e. 32_768 planes) from U+0000_0000 to U+7FFF_FFFF.

A few points:
  • Because U+10_FFFE and U+10_FFFF are non-characters, not surrogates, the last two points in each plane would be unreachable and by default would also be non-characters.
  • Because U+F_FFFE and U+F_FFFF are non-characters, the entire last two planes (U+7FFE_xxxx and U+7FFF_xxxx) would be unreachable and be non-character planes.
  • We would make the first 17 high ultra-surrogates (U+F_8000 to U+F_8010) illegal so there's no overlap with the 17 already available planes. Though if we only made the first 16 illegal, we could re-use low ultra-surrogate plane 16 to be the first extended UTF-16 plane. Anyone could therefore use it in development mode before they know whether to implement it as UTF-16-extended or just UTF-16.

2. Over 280 trillion codepoints

I'm not sure if Corrigendum 9 is really needed for Unicode to break free of its 2.1 billion character straitjacket, but UTF-8 bytes FE and FF would still be vacant even after re-extending UTF-8 back to 2.1 billion codepoints.

The original UTF-8 looks like:

    U+0000 to      U+007F:  0xxxxxxx
    U+0080 to      U+07FF:  110xxxxx 10xxxxxx
    U+0800 to      U+FFFF:  1110xxxx 10xxxxxx 10xxxxxx
  U+1_0000 to   U+1F_FFFF:  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
 U+20_0000 to  U+3FF_FFFF:  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U+400_0000 to U+7FFF_FFFF:  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
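That table translates directly into code; a sketch of the pre-2003 1-to-6-byte encoder:

```python
def utf8_original_encode(cp):
    """Encode a codepoint with the pre-2003 UTF-8 table above (1-6 bytes)."""
    if cp < 0x80:
        return bytes([cp])                   # 0xxxxxxx
    for nbytes, limit in ((2, 0x800), (3, 0x1_0000), (4, 0x20_0000),
                          (5, 0x400_0000), (6, 0x8000_0000)):
        if cp < limit:
            break
    else:
        raise ValueError("above U+7FFF_FFFF")
    out = []
    for _ in range(nbytes - 1):
        out.append(0x80 | (cp & 0x3F))       # 10xxxxxx continuation bytes
        cp >>= 6
    out.append(((0xFF << (8 - nbytes)) & 0xFF) | cp)  # nbytes leading 1-bits
    return bytes(reversed(out))

assert utf8_original_encode(0x20AC) == "\u20ac".encode("utf-8")  # agrees with today's UTF-8
assert utf8_original_encode(0x7FFF_FFFF) == bytes([0xFD] + [0xBF] * 5)  # 6-byte maximum
```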

Byte forms 1111111x are unused. Last time I proposed extending UTF-8 to 4.4 trillion codepoints, but I don't think that's enough. A better way is to use 11111110 to indicate a 9-byte sequence giving 281 trillion codepoints (281万亿):

U+8000_0000 to U+FFFF_FFFF_FFFF:
    11111110 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
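A sketch of this proposed 9-byte form (illustrative only; no standard defines it):

```python
def encode_fe9(cp):
    """Proposed 9-byte form: a 11111110 (0xFE) lead byte followed by eight
    10xxxxxx continuation bytes carrying 48 bits, i.e. 2**48 (about 281
    trillion) codepoints."""
    assert 0 <= cp < 1 << 48
    tail = [0x80 | ((cp >> shift) & 0x3F) for shift in range(42, -1, -6)]
    return bytes([0xFE] + tail)

assert encode_fe9(0) == bytes([0xFE] + [0x80] * 8)
assert encode_fe9((1 << 48) - 1) == bytes([0xFE] + [0xBF] * 8)
```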

Last time I proposed using this range to formulaically encode Unihan characters not already among the 70_000 presently encoded ones, as we already do for Korean hangul. The Unihan character defined as the simple sequence "又出 across", like billions of other such examples, isn't encoded anywhere in those 70_000. To encode it using an ideographic descriptor would take 9 bytes, each character taking 3 bytes, as ⿰又出, the same as in my proposal (though we'd need to define a control code in the private use area instead of the descriptor to do it properly). But the payoff comes from all characters more complicated than that, if 281 trillion is enough to formulaically generate them to a practical recursion depth. And we could eliminate most characters in the huge Unihan lookup table we presently require.

3. A 6-bit embedded language

The remaining unused byte form 11111111 could then be used as an embedded 6-bit control language.

The committee designing Unicode's predecessor, ASCII, decided it was important to support uppercase-only 64-character alphabets, and chose to pattern ASCII so it could be reduced easily to a usable 64-character set of graphic codes. So by using range U+20 to U+5F only, we can use everything in ASCII except the 33 control codes, 26 lowercase letters, and 5 symbols ` ~ { } |. We can embed these tokens within the continuation characters 10xxxxxx easily, build single-line commands after each 11111111, yet retain UTF-8's self-synchronization. Not only would the Unicode codepoint repertoire be reducible down to ASCII, but so would Unicode embedded with this control language. But what would we use this control language for?
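A sketch of the embedding (my own illustration; the command vocabulary itself is an open question): each character U+20..U+5F maps to a 6-bit value carried in one 10xxxxxx continuation byte after the 11111111 lead.

```python
def embed_command(cmd):
    """Embed a command string (chars U+20..U+5F only) as a 0xFF lead byte
    followed by 10xxxxxx bytes, each carrying one 6-bit token."""
    assert all(0x20 <= ord(c) <= 0x5F for c in cmd)
    return bytes([0xFF] + [0x80 | (ord(c) - 0x20) for c in cmd])

def extract_command(data):
    """Invert embed_command: drop the 0xFF lead, add back the 0x20 offset."""
    assert data[0] == 0xFF
    return "".join(chr(0x20 + (b & 0x3F)) for b in data[1:])

# Every emitted byte is either the FF lead or a 10xxxxxx form,
# so UTF-8's self-synchronization property survives:
assert extract_command(embed_command("SUPER[")) == "SUPER["
```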

Not only are the UTF encodings self-synchronizing, but so is the Unicode repertoire, almost. The original Unicode 1.0 and 1.1 weren't so much, having pairs of symbols (U+206A to U+206F) to indicate scopes like which script's digit shapes to use, whether to join Arabic letters together, and whether symmetric reversal should be used for certain symbols. These have since been deprecated by later codepoints and properties. Unicode still has codes (U+FFF9 to U+FFFB) for embedding annotations like Japanese rubies within text, but recommends a higher-level protocol be used. And there are the bidirectional embedding and overriding codes (U+202A to U+202E); although replaced by the simplified isolate codes (U+2066 to U+2069) in 2013's Unicode 6.3, bidirectionality still uses nested scopes.

The 6-bit embedded language has enough characters (i.e. the space, 26 uppercase letters, 10 digits, and 27 punctuation and symbols ! ? , . : ; # $ % & @ ^ * = + - _ ' " / \ < > ( ) [ ]) to design a language to be embedded within Unicode to control all scoped functionality, not just taking over bidirectional control but any scoped function normally requiring a higher-level protocol. The possibilities are endless.

6 January 2014

The Groovy History of Unicode

Unicode began life as ASCII, when IBM's Bob Bemer submitted a proposal for a common computer code to the ANSI in May 1961, nine months before I was born. Unicode's history has always been tightly intertwined with my own. In late 1963, when I was moved to Auckland NZ, the ASCII codeset was finally agreed upon by committee heavyweights Bemer and Teletype Corp's John Auwaerter, and the ISO later accepted ASCII. But ASCII almost died young as only Teletype machines and the Univac 1050 used it for the next 18 years.

In 1981, ASCII was reborn when the popular IBM PC used it, and since then virtually every computer made has been ASCII-compliant. Its codeset included what we see on the keyboard, plus 33 control codes, some of which (0x0e and 0x0f) allowed other code pages to be swapped in and out, though the code was no longer self-synchronizing. Hundreds of other code pages were created for other natural languages, as well as separate standards by various East Asian governments to handle thousands of ideographs. On 29 August 1988, Xerox's Joe Becker outlined the original 16-bit design for Unicode 1.0, encoding Latin, Cyrillic, and Greek scripts, right-to-left scripts Arabic and Hebrew, Devanagari, Thai, Korean Jamo, and punctuation and math symbols, later expanded to include 20,000 Unihan (unified Chinese and Japanese) characters in 1992. Specs were written to cover case mapping and folding, diacritical marks, bidirectionality, Arabic joining, grapheme clusters and ordering, varying widths in Asian characters, and breaking algorithms.

But Koreans didn't want to encode any of their writing system as separate Jamo that make up the square-shaped Hangul syllables, which would require 6 to 10 bytes for each syllable, so in 1996's Unicode 2.0, over 10,000 codepoints were remapped to Hangul, and generated by formula. Chinese and Japanese weren't happy with only 20,000 Hanzi/Kanji being encoded, so Unicode 2.0 also brought the UTF-16 encoding using surrogates, adding a million 4-byte codes to the 65,000 2-byte ones, and dividing the codespace into 17 planes. Many users switched to Unicode, with two notable exceptions. Many Japanese were unhappy at their characters being unified with those of China, a situation which wouldn't need to have happened if UTF-16 had been decided on earlier. And Americans didn't like the incompatibility with ASCII, and its requirement for twice the space.

So Ken Thompson's 1993 self-synchronizing UTF-8 encoding became a supported alternative by the Unicode Consortium, with its varying length ensuring backwards compatibility with ASCII. In 2003, its maximum 2 billion codepoints were restricted to 1.1 million to match the UTF-16 encoding, but the top 2 planes (0xF and 0x10) had been reserved for private use. This ensures anyone can access the full 2 billion codepoints using UTF-16 by using those planes as "ultra surrogates", the top half of plane 0xF being high ultra-surrogates and all of plane 0x10 being low ones. We could even re-use plane 0x10 to encode the first "ultra plane" of characters. By decoding through surrogate indirection twice in UTF-16, the maximum rune is `\U7FFFFFFF`, not `\U0010FFFF`.

Unicode started being used in Java, Windows, and Linux, and by 2008, UTF-8 had replaced ASCII to become the most common encoding on the web. The main holdouts to the uptake of Unicode are Japan's continuing preference for the JIS encoding, and of course programmers clinging to using ASCII-only tokens in their code. But the next major leap came on Unicode's 15th birthday: on 29 August 2003, the Groovy Language was announced by its creator James Strachan, the nascent vehicle which will introduce the full power of Unicode's vocabulary and possible resultant grammars to the world of software creation. A language spec and testkit were initiated in April 2004, and I joined the online community a year later, watching progress, excited at the opportunity to contribute Unicode to the ecosystem, and even create another implementation of the Groovy Language so it would be the best language ever.

But a great darkness has descended upon the Groovy Language: two voldemorts who hijacked its development and greedily turned it into a business venture to generate profits from web framework consulting and conferences. I endured years of being ignored then handled, and even, though unknown to me at the time, smeared and watched. One example: a few months before G2One, Inc. was incorporated, Laforge said without any discussion, (paraphrased) "We will not be doing anything like Unicode symbols in the foreseeable future". The business plan to stall Groovy development had been decided. To this day, little Groovy stands enslaved by the evil giant Grailiath. But because I've faced up to my groovy destiny to bring the full tersity of Unicode to the world of programming through the Groovy Language, I'm now rebooting it from scratch, spec and all. The day will come when programmers code on their mobile phone screens while riding the subway in Beijing, Hong Kong, and Tokyo.

Last year (2013) brought Unicode 6.3, weighing in at around 110,000 encoded tokens, encompassing exactly 100 scripts! But the greatest achievement in Unicode last year wasn't with the big version release, but in little Corrigendum 9, only 350 words long, changing the definition of non-characters, which are primarily the last two codepoints in each plane, 0xfffe and 0xffff. Unicode's FAQ on Noncharacters had previously said "Noncharacters are in a sense a kind of private-use character, because they are reserved for internal (private) use. However, that internal use is intended as a super private use, not normally interchanged with other users." But Corrigendum 9 will be remembered as the announcement that freed Unicode from its straitjacket of 2.1 billion codepoints possible with both the pre-2003 extended mode of UTF-8 and the doubly-indirected ultra surrogate system of UTF-16. For Corrigendum 9 now allows noncharacters to be interchanged, saying "Noncharacter: A code point that is permanently reserved for internal use", crossing out "and that should never be interchanged", because "for distributed software, it is often very difficult to determine what constitutes an internal versus an external context for any particular software process".

In pre-2003 UTF-8, an ASCII byte begins with `0`, a continuation byte begins with `10`, and a leading multibyte begins with `11`, the number of `1` bits indicating the number of bytes in the sequence. 6 bytes can encode up to U+7FFF_FFFF, which is 2.1 billion codepoints. But the leading multibytes `11111110` and `11111111` weren't used in Thompson and Pike's original UTF-8 spec because they were noncharacters `FE` and `FF`. Because of Corrigendum 9 we can now use them in interchange, and in any way we want! Maybe we could use `FE` to signify, not a 7-byte, but a 9-byte sequence; maybe we could use `FF` to signify program code using 6-bit tokens embedded in the following continuation bytes, which would still leave UTF-8 somewhat self-synchronizing. But even if we just extended the original definition of UTF-8, `11111110` would indicate a 7-byte sequence (68 billion codepoints) and `11111111` an 8-byte sequence, which could encode 4.4 trillion codepoints (=4.4万亿, =2^42, yes, 42), up to U+3FF_FFFF_FFFF, a huge improvement on U+7FFF_FFFF, and giving meaning to Unicode!

"And what would we use all these codepoints for?", you ask? Your destiny isn't intricately woven around Unicode's so you've never imagined it. From Unicode 2.0 on, over 10,000 Korean hangul codepoints are generated via a formula each having 2 to 5 jamo, selected from 24 jamo, non-recursively. A font for Hangul only needs to encode the jamo, not every hangul because they can be generated by the formula. This isn't possible for the 70,000 Unihan codepoints, each composed of 1 to perhaps 10 or more components, selected from hundreds of radicals and phonetic components, perhaps recursively. For now, each glyph must be encoded separately in the font, or a huge lookup table used, and when a possible Chinese character not in the 70,000 encoded is required, the definition sequence takes up many bytes, like with many Korean hangul before Unicode 2.0. But 4.4 trillion codepoints could mean Unihan tokens can be encoded by formula to a practical recursion depth, without requiring huge fonts or superlong encodings. And if 4.4 trillion isn't enough, we can define the meanings of `11111110` and `11111111` differently. Wow!

Great things are coming to Unicode thanks to Corrigendum 9 and the Groovy Language reboot!

Post removed because it's been superseded.

27 January 2014

Groovy's Lies and Statistics

In Oct 2013 many articles suddenly popped up in the online IT rags boasting of a "surge in Groovy's popularity", citing its recent rise to #18 in the Tiobe index from outside the top 50 in May 2013. Three months later (Jan 2014), Groovy had dropped back out of the top 50 (#32 in Nov, #46 in Dec). According to another online rag: "After a long discussion with one of the Tiobe index readers, it turned out that the data that is produced by one of the Chinese sites that we track is interpreted incorrectly by our algorithms. So this was a bug," Janssen said. "After we had fixed this bug, Groovy lost much of its ratings." Whoever pulled off that trick didn't stop there, but utilized the feedback effect to prolong the afterglow of the deception: typing "Groovy Programming" into Google still gives...
  • 5. Groovy breaks into top 20 list of programming languages
  • 8. Groovy Programming Language Sees Major Boost in Popularity
  • 22. Groovy Programming Language Sees Major Boost in Popularity
  • 24. Groovy makes debut entry into programming language top twenty
  • 30. Interview about Groovy's popularity boost -- Guillaume Laforge's Blog

This has happened before. In December 2010, Groovy began a sudden rise from outside the top 50 when Groovy tech lead Jochen Theodorou "volunteered" his services to Tiobe to help them improve their algorithms. In April 2011, however, Groovy fell from #25 to #65 on Tiobe in a single month after they increased the number of search engines they monitor. Also then, Stack Overflow started being gamed by someone: the number of monthly questions tagged Groovy shot up suddenly, and the questions and answers posted seemed to be coordinated. Github also showed signs of similar manipulation (though the ranking queries I describe there are no longer provided by Github, perhaps to deter such gaming), all in an apparent effort to game the Redmonk Programming Language rankings.

And so when yet another IT rag claimed last week that Groovy smashes records with 3 million downloads in 2013, was anyone not sceptical? Let's look at the claims one by one...
  • With 1.7 million downloads in 2012, all the signs were good that last year was going to be a big one for Groovy
Neither the annual nor monthly download numbers are relevant. There were 17 releases during 2012, giving 100k downloads per release from Maven and Codehaus (assuming the figures are correct). But even that number is exaggerated because the Maven downloads are often triggered automatically, and often gamed by bulk-download scripts. The most accurate number showing active users is the Codehaus downloads at 30% of the total, i.e. 30k per release (assuming the figures are correct and not gamed). I've simplified the chart from the original so it's readable...

groovy 2012 only.jpg

  • but not many people would have predicted today’s news of three million downloads of the alternative language in 2013.
There were 23 releases during 2013, giving 130k downloads per release from Maven and Codehaus (again assuming the figures are correct). The releases occur frequently despite being sparse on features, and could even be timed just to drive up the download metric. E.g. Groovy's roadmap just added a version 2.3, featuring Traits, which were originally announced for version 2.2 but postponed. Similarly, in June 2012, Codehaus Groovy had also announced a new meta-object protocol for Groovy version 3.0 to come after version 2.1, but that was postponed in early 2013 to make way for a version 2.2.

Per release, most of the increase comes from the Maven figures, the most easily gamed, and that from October 2013. I suspect the person responsible for spoofing this apparent increase is the same person behind Groovy's Tiobe Top 20 deception that very same month. Laforge's chart, again made more readable...

groovy 2013 only.jpg

  • Creator Guillaume Laforge, who arrived at this figure by compiling Maven Central statistics, as well as “slicing and dicing the Codehaus Apache logs”, attributes this staggering increase to the “hard work of the Groovy core development team and the friendly community and ecosystem.”
Someone needs to tell the article's author that James Strachan is the creator of the Groovy Language, not Laforge. Where did they get that idea?

  • These figures are for pure Groovy downloads by the way, and don’t account for bespoke versions of the language which come bundled in with programs such as Grails or Gradle.
Why didn't Laforge use the Grails or Gradle numbers in his stats? Aren't they talking to him, or are they hiding something?

  • As Laforge points out in his blog post, a substantial amount of downloads are of Groovy as a library, rather than as an installable binary distribution, due to the fact that Groovy is essentially a “dependency” to add to projects. The peaks on the chart for the most part correspond to major releases - naturally rising in concurrence with monthly downloads, which rise from 200k to 300k a month between January 2012 and December 2013.

The increase from 150k to 200k in Jan 2013 corresponds to 4 releases instead of 2 that month, and again in Oct 2013. Also in January 2013, the Groovy download at Codehaus increased from 3 artifacts to 4.

  • Although it continues to lag behind Scala, with its huge and tenacious community, in terms of popularity, these figures show that slowly but surely, Groovy is edging its way into the mainstream.
Groovy is in decline. I suspect the sharp decline in downloads in Aug and Sep 2013 corresponds to someone not tending to their bulk download scripts. Groovy's "surge in popularity" was probably timed to coincide with a due diligence investigation of Pivotal Inc, or someone there selling a consulting contract, as well as selling seats to their annual December GrailsXchange conference in London. They probably didn't expect their Tiobe exploit to get found out so quickly.

  • For beginners, the language is very simple to pick up - something that users argue actually hamper it in popularity rankings, on the grounds that, if people understand how something works, they won’t be spending hours looking for tutorials and pushing up your search rankings.
In Feb 2013, Dr Dobbs editor Andrew Binstock wrote: "Groovy is a language primed to be a major player. There is the conundrum. The endless variety of features requires considerable documentation, which is simply not available, especially for the advanced features that give Groovy much of its benefit. And so, if you jump in today, you'll find the language is easy to learn, but hard to master."

  • And let’s not forget the all-important Java intercompatibility factor. Due to its similarity to Java, it’s relatively painless for Java devs to master; the main differences are that it’s dynamically typed, which removes a lot of boilerplate, and that it adds closures to the language.
Groovy is only simple to pick up if you just use the terser closure and collection syntax when manipulating Java classes for tests. Most users aren't using anything more complex, except Grails users who also use the metaobject protocol.
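To illustrate what that terser closure and collection syntax looks like in practice, here's a minimal sketch (the data and names are my own, purely for illustration) of the style most users stick to when manipulating Java collections in tests:

```groovy
// Groovy's collection methods take closures, replacing Java's
// loop-and-accumulator boilerplate with one-liners.
def words = ['groovy', 'grails', 'gradle']

// collect maps each element through the closure; 'it' is the implicit parameter
def lengths = words.collect { it.length() }
assert lengths == [6, 6, 6]

// findAll filters by a predicate closure
def gWords = words.findAll { it.startsWith('g') }
assert gWords.size() == 3

// inject folds the list down to a single value
def total = words.inject(0) { sum, w -> sum + w.length() }
assert total == 18
```

Everything above runs against ordinary `java.util.ArrayList` and `java.lang.String` objects, which is why this subset is so painless for Java devs: it's Java's own classes, minus the loops.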

  • There’s also a huge ecosystem which has grown up around it, and, as the newly emerged static website generator Grain shows, there are plenty of additions in development.
Grails and Gradle are the only users with any traction. The rest of the promoted ecosystem is just an "echo system". Grain, like Griffon and Gaelyk, is just hype.

  • A healthy industry interest in Groovy may well have also contributed to this spike in ranking. Notably last year, the language was featured in Pivotal’s recently released Spring Framework 4.0,
The Spring Framework is separate from Groovy, which sits together with Grails and Spring Scala at the very bottom of the list of Spring Projects. Within the Spring Framework 4.0 Reference, the Spring Expression Language takes up all of chapter 7, whereas Groovy takes up only the small section 28.3.3.

  • as well as the Gradle build automation system, currently being utilised by Google for Android app builds.
Google uses Groovy in Gradle to build Android, but Groovy can't actually run on Android. Developers in other JVM languages already use Gradle to build Android apps. According to Laforge in 2013: "Groovy is not able to run properly on Google's Android mobile platform ... (It takes) 20 seconds to start up a simple Hello World". Laforge was seeking Google Summer of Code students to try to make Groovy run on Android, but Google didn't accept that project, or any projects related to Groovy, perhaps because no students were interested, or maybe because of Codehaus Groovy's history of mishandling GSoC projects. The last Groovy project ever accepted was one in 2011 rewriting Groovy's antiquated parser in Antlr 3.0. The project failed and Google only paid out half the project money. But there are plenty of Scala and Clojure projects on that GSoC list.

But even Groovy's future in Gradle is in doubt. Gradle still ships with Groovy 1.x, and they're looking at bundling alternative build languages such as Scala in Gradle 2.0.

  • Finally, with its last major releases, Groovy resolved some major legacy performance issues, which had long turned many people off the Java alternative. As Andrew Binstock notes, with these developments, the language was finally poised to explode. The community is relatively small, but it’s certainly growing. If the Groovy team can continue the good work they’ve started, 2014 may be yet another bumper year for adoption - though we’re not convinced Team Scala will be jumping ship just yet.
Groovy's not ready to explode. Absent other choices, Java developers will replace Groovy with the speedy JDK8-bundled "Nashorn" Javascript when upgrading from Java 7 to 8. Codehaus Groovy is on a downward spiral, its trajectory tied to JDK7 and before. But I've been designing and building the new, real, Groovy Language to replace the legacy one when it dies so there'll be a real choice available. Then Groovy will explode!

Although I've tried to focus on building the new rather than criticising the old, lies such as these Tiobe and Maven exploits regarding Groovy make it difficult to keep silent. Some lies have been outright childish, such as when on 29 August 2013, Groovy's 10th birthday, someone who'd been active for over 3 years updating release info on Groovy's Wikipedia page added Laforge's name alongside Strachan's as a co-designer of Groovy, giving him 3 titles (designer, project manager, and spec lead), while ignoring the rest of the developer team. We don't know who this is: Laforge claimed during a previous undo-redo spat that he thinks he's never made any modifications to Groovy's Wikipedia page, as far as he can recall.

I really had to undo that addition. Because the JSR spec was marked dormant in April 2012 after being inactive for 8 years, I also removed Laforge's spec-lead title, then added the 3 technical people who are listed as Despots in Groovy's Codehaus repository...

wiki now and then.jpg

Surely Groovy's 10th birthday should have been celebrated by listing the developers who do the actual coding and testing, instead of by giving the manager more titles. But perhaps this is a frameup?

Last edited Sep 25 at 7:40 PM by gavingrover, version 26