Kanji meets Programming: a Revolution in Progress

Over the past 5000 years two important ideas in Writing have spread out from the Middle East. One idea spread eastwards, eventually to Japan; the other westwards, to Silicon Valley in California. These two ideas began to meet about 50 years ago, across the Pacific Ocean, and within another 50 years, their merger will have launched the greatest revolution in Writing since its invention.

This short article on the Japanese writing system at OzIdeas says: "Kana are read phonetically and kanji are read visually, with a dissociation between the processes involved, according to Morton & Sasanuma and popular Japanese belief. Nomura found that meaning was extracted faster from kanji than kana words, and thought that kana pronunciation was data-driven and that kanji pronunciation was conceptually-driven. Morton & Sasanuma (1982) also claimed that evidence supports the intuitive belief that kanji can give direct access to the meaning of words, but that kana always require translation into a phonological code when they are being read, and there is no development of automatic visual recognition of the kana symbols."

The roughly 2000 Kanji in common use are employed in Japanese to write content words (i.e. noun, adjective, and verb stems), while the 48 kana are used to indicate grammar. Thus the content words are processed visually by native Japanese readers, and the grammar words and inflections are processed phonetically. The converse is generally true of programming languages: the names of entities are processed phonetically, while the grammatical structure is processed visually.

Java is a fairly typical programming language in this respect. The names of entities such as classes, variables, and methods can use any Unicode alphanumeric characters, including ideographic characters such as Kanji. This means their names can be pronounced in some natural language, which in practice is usually English. The default names in the standard libraries are in English, as are the language keywords. Programmers reading Java code will process the English names the same way they process them in their natural language: phonetically. The grammatical structure of Java, and of many other programming languages, uses symbol and punctuation tokens, and programmers reading the code process it visually, even using indentation when it's not required by the grammar.
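Java's identifier rules make this concrete: any Unicode letter may appear in a name, so a Kanji-named variable compiles just like an ASCII one. A minimal sketch (the Japanese names below are illustrative, not taken from any real library):

```java
// Any Unicode letter is legal in a Java identifier, so CJK names
// compile exactly like ASCII ones. Illustrative names:
// 合計 "total", 数値 "numbers", 値 "value".
public class UnicodeNames {
    public static void main(String[] args) {
        int 合計 = 0;
        int[] 数値 = {1, 2, 3, 4};
        for (int 値 : 数値) {
            合計 += 値;
        }
        System.out.println(合計);  // prints 10
    }
}
```

Note the source file must be saved and compiled as UTF-8 (e.g. javac -encoding UTF-8) for this to work reliably.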

So presently we have the writing system of Japan providing visual processing of content words and phonetic processing of grammar, and the writing system of Silicon Valley providing phonetic processing of content words and visual processing of grammar. To see how these two ideas could combine and revolutionize Writing, we first need to consider their historical roots...

History of Writing

The University of Chicago's Oriental Institute claims "(Although) writing emerged in many different cultures and in numerous locations throughout the ancient world, the Sumerians of ancient Mesopotamia are credited with inventing the earliest form of writing, which appeared around 3500BC." Well before this time there were many pictures on cave walls, and the earliest symbols in Writing came from these pictures. Numerical symbols were invented to represent more abstract accounting concepts, such as who owes what to whom. But writing that can represent the full range of meanings that speech can began with the Sumerians.

This required abstract ideas to be represented, which were less drawable, so the signs of similar-sounding words were used. According to Wikipedia, the inventory of glyphs gradually reduced, and the writing became increasingly phonological, a given sign having a single sound but many meanings depending on context. Determinatives, ideograms that aren't pronounced and are used only to mark semantic categories of words, such as animals, vessels, and trees, were then introduced to avoid ambiguities of meaning. When papyrus replaced clay tablets, the symbols were simplified to accommodate the new medium, often losing their resemblance to the original picture. Egyptian hieroglyphic writing also began to use determinatives. Perhaps the Egyptians heard through the trade-route grapevine about the Sumerian system of determinatives, then used the idea with their own hieroglyphs.

Modern Chinese characters are descended from picture-based ancient ones, such as 日 for sun, 月 for moon, 田 for field, 水 for water, 山 for mountain, 女 for woman, and 子 for child. Some characters have meanings composed of other meaning-based characters, such as 女 (woman) and 子 (child) combining into 好, meaning good. Perhaps the ancient Chinese also heard about the Sumerian idea of determinatives, but because Chinese ideograms maintain a one-to-one correspondence with syllables in their spoken language, they introduced determinatives by modifying their existing pictographic characters with the additional semantic information. Over 80% of modern Chinese characters consist of both a semantic determinative and a primarily phonetic component, e.g. 土 sounds like "tu", and 口 means mouth, so 吐 also sounds like "tu", and means to spit (with the mouth). The phonetic part of many such characters often also provides secondary semantics, e.g. the phonetic 土 (in 吐) means ground, where the spit ends up.

Eventually in Egypt, a set of 24 hieroglyphs called _uniliterals_ evolved, each denoting one consonant sound in ancient Egyptian speech, though they were probably only used for transliterating foreign names. This idea was copied by the Phoenicians by 1200BC, and their symbols spread around the Middle East into various other languages' writing systems, having a major social effect. The Phoenician alphabet is the basis of almost all alphabets used in the world today, Chinese characters being the major exception among writing systems. These Phoenician symbols for consonants were copied for ancient Hebrew and Arabic, but when the Greeks copied them, they adapted the symbols of unused consonants for vowel sounds, becoming the first writing system to represent both consonants and vowels.

Meanwhile in China, different characters evolved in each of the different Chinese states up to the Warring States period. Around 221BC, the ruler of Qin conquered the other states, unifying China for the first time, and his state's version of Chinese characters became standard over the whole country. These characters were then simplified somewhat during the following 400 years of the Han dynasty, resulting in the general look of Chinese characters in use today in Hong Kong, Taiwan, and Chinatowns. These characters were then taken to Japan, Korea, and Vietnam, bringing literacy to the spoken languages of those countries. Whereas characters in both ancient and modern China have a rough phonetic basis, they only have a semantic basis when used to write the Japanese, Korean, and Vietnamese languages, none of which are related to the dialects of the Chinese language. We therefore now call them CJK characters; they're also called Hanzi by the Chinese and Kanji by the Japanese. To this day, each character in China only ever represents one syllable, but when used as Kanji in Japan, a character can represent many syllables. In Japan, a small subset of 48 characters called Kana were eventually selected for representing syllabic sounds, and were used alongside the thousands of Kanji.

Over time in Europe, cursive versions of letters evolved for the Greek alphabet, and its descendant Latin and Cyrillic alphabets, so people could write them easily on paper. They used either the block or the cursive letters, but not both, in one document. The _Carolingian minuscule_ became the standard cursive script for the Latin alphabet in Europe from 800AD. Soon after, it became common to mix block (uppercase) and cursive (lowercase) letters in the same document. The most common system was to capitalize the first letter of each sentence and of each noun.

Punctuation was popularized in Europe around the same time as cursive letters. Punctuation is chiefly used to indicate stress, pause, and intonation when reading aloud. Underlining is a common way of indicating stress. In English, the comma, semicolon, colon, and period (,;:.) indicated pauses of varying degrees, though nowadays only the comma and period are used much in writing. The question mark (?) replaces the period to indicate a question, of either rising or falling tone; the exclamation mark (!) indicates a sharp falling tone. The idea of separating words with a special mark also began with the Phoenicians. Irish monks began using spaces around 600-700AD, and the practice quickly spread throughout Europe. Nowadays, the CJK languages are the only major languages not using some form of word separation. Until recently, the Chinese didn't recognize the concept of a word in their language, only of the (syllabic) character.

The bracketing function of spoken English is usually performed by saying something at a higher or lower pitch, between two pauses. At first, only the pauses were shown in writing, perhaps by pairs of commas. Hyphens might replace spaces between words to show which ones are grouped together. Eventually, explicit bracketing symbols were introduced at the beginning and end of the bracketed text. Sometimes the same symbol was used to show both the beginning and the end, such as pairs of dashes to indicate appositives, and pairs of quotes, either single or double, to indicate speech. Sometimes different paired symbols were used, such as parentheses ( and ). In the 1700's, Spanish introduced inverted ¿ and ¡ at the beginning of clauses, in addition to the right-way-up ones at the end, to bracket questions and exclamations. Paragraphs are another bracketing technique, being indicated by indentation.

Around 1050, movable-type printing was invented in China. Instead of carving an entire page on one block as in block printing, each character was on a separate tiny block. These were fastened together into a plate to reflect a page of a book, and after printing, the plate was broken up and the characters reused. But because thousands of characters needed to be stored and manipulated, making movable-type printing difficult, it never replaced block printing in China in a major way. European alphabets, by contrast, need fewer than a hundred letters and symbols, which is much easier to manage. So when movable-type printing reached Europe around 1450, the printing revolution began.

With printing a new type of language matured, one that couldn't be spoken very well, only written: the language of mathematics. Mathematics, unlike natural languages, needs to be precisely represented. Natural languages are very expressive, but can also be quite vague. Numbers were represented by many symbols in ancient Egypt and Sumeria, and had reduced to a mere 10 by the Renaissance. But from then on, mathematics started requiring many more symbols than merely two cases of 26 letters, 10 digits, and some operators. Many symbols were imported from other alphabets, different fonts introduced for Latin letters, and many more symbols invented to accommodate the requirements of writing mathematics. Mathematical symbols are now almost standardized throughout the world. Math can describe concepts extremely tersely, since the greater the range of tokens a language has, the terser it can be written. Many other symbol systems, such as those for chemistry and music, also require precise representation. Existing writing systems changed to utilize the extra expressiveness that came with movable-type printing. Underlining in handwriting was supplemented with bolding and italics. Parentheses were supplemented with brackets [] and curlies {}.

Korea and Vietnam eventually replaced Chinese characters with scripts more suitable for their own spoken languages. In Japan after World War II, the government reduced the number of officially used Kanji to about 2000, and simplified the pictorial representation of many of them. Subsequent proposals to totally replace Kanji with the 48 phonetic Kana have met with strong resistance, though, perhaps because the Japanese language only has 48 syllabic sounds, and when written with Kana only, it's difficult to work out the meaning from the context.

In the 1950's, mainland China's government also simplified many hundreds of commonly-used characters, taking it further than the Japanese. These Chinese simplified characters, also used in Singapore, can be read at the same font size as alphabetic letters and digits. Written Chinese using simplified characters takes up half the page space of written English, and can be condensed even further by using headline writing style. Nowadays in mainland China, both complex and simplified characters are sometimes used in the same document, the complex ones for headings and more formal parts of the document. Perhaps one day complex characters will sometimes mix with simplified ones in the same sentence, turning Chinese into a two-case writing system.

Future of Writing

Western civilisation spread across the Atlantic 500 years ago, bringing with it the Latin alphabet. The dominant industrial power then spread across the North American continent, and the information technology industry eventually made its home in northern California. 50 years ago, programming languages arose. The first ones, such as Cobol and Fortran, used many English words in their syntax, partly due to parser constraints. But eventually, programming languages became more sophisticated, utilizing many more symbols and punctuation marks for syntax. Some of these, such as C, Perl, and their syntactic descendants, became popular. Programmers still wanted code to be readable, so many rejected regexes, C++ operator overloading, and the J language as being too terse. All these languages rely on ASCII, so the same 100 or so tokens are used; their tersity comes from maximizing the use of grammar, the different ways those tokens can be combined.

Some languages, such as APL, tried to use many more symbols, but didn't become popular, probably because the extra symbols couldn't be entered easily with a keyboard. So programming is limited to the 100 or so tokens on the keyboard. Many people can type those 100 characters faster than they can write them, but can write thousands more they can't type. Computer programs generally copied Latin-based natural language writing systems, rather than mathematics, in using letters, numbers, bracketing, punctuation, and symbols in similar ways.

Around 1990 Unicode was born, unifying the character sets of the world. Initially, there was only room for about 65,000 tokens in Unicode, so to save space, the CJK characters of China, Japan, and Korea were unified into one character set based on their historical roots. This subset of Unicode is called the Unihan characters, which many people in Japan initially complained about. Unicode is also bidirectional, catering for Arabic and Hebrew. Top-down scripts such as Mongolian and traditional Chinese can be simulated with left-to-right or right-to-left directioning by using a special sideways font. However, Unicode didn't become very popular until its UTF-8 encoding, invented in the early 1990s, allowed backwards compatibility with ASCII. Unicode's capacity was also increased to about one million code points, allowing less commonly used scripts such as Sumerian cuneiform to be encoded. The latest version is only about 11% full. Unicode is now used almost everywhere, including in Java, Windows, and Linux. The unification of CJK characters into Unihan remains in Unicode even though it's no longer needed.
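The backwards compatibility works because UTF-8 encodes each ASCII character as the same single byte it has in ASCII, while a CJK character takes three bytes. A small Java sketch illustrating this:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Demo {
    public static void main(String[] args) {
        // ASCII characters keep their one-byte ASCII values in UTF-8,
        // so any pure-ASCII file is already valid UTF-8.
        byte[] ascii = "Java".getBytes(StandardCharsets.UTF_8);
        System.out.println(ascii.length);  // prints 4: one byte per letter

        // A CJK character in the Basic Multilingual Plane takes three bytes.
        byte[] kanji = "字".getBytes(StandardCharsets.UTF_8);
        System.out.println(kanji.length);  // prints 3
    }
}
```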

Most programming languages now cater for Unicode. The contents of strings and comments can use any Unicode character, which programmers in non-Latin-alphabet countries (e.g. Greece, Russia, China, Japan) often use. The display tokens in Unicode are divided into various categories and subcategories, mirroring their use in natural language writing systems, allowing existing programming languages to extend their syntax easily to cater for Unicode. Some Unicode symbols have an obvious meaning, such as math symbols, but most don't have a meaning that transfers easily to the programming context, and aren't widely used. An exception is Oracle-Sun's Fortress, which provides keystroke sequences for entering mathematical symbols in programs, though it leaves it vague whether the Unicode tokens or the ASCII keys used to enter them are the true tokens in the program text.
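Java exposes these Unicode categories directly through java.lang.Character, which is essentially how a compiler decides whether a token may be an identifier. A brief sketch of the relevant checks:

```java
public class IdentifierCheck {
    public static void main(String[] args) {
        // Letters from any script can start a Java identifier...
        System.out.println(Character.isJavaIdentifierStart('字'));  // true
        System.out.println(Character.isJavaIdentifierStart('π'));   // true
        // ...but math symbols and punctuation generally cannot.
        System.out.println(Character.isJavaIdentifierStart('∑'));   // false
        // The underlying Unicode category: 字 is an "other letter" (Lo).
        System.out.println(Character.getType('字') == Character.OTHER_LETTER);  // true
    }
}
```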

User-defined names can use all the alphabetic letters and CJK characters, and because there already exist agreed meanings for combinations of these, derived from their respective natural languages, we can begin using them immediately to increase tersity while keeping readability. But even when there's an IME (input method editor) for entering them with keyboards, and an available font to represent every token, programmers generally still only use ASCII for user-defined names. Perhaps the main reason is that the core of most programming languages still only uses ASCII characters.

Programmers from cultures not using the Latin alphabet won't be motivated to use their own alphabets in user-defined names when they don't with pre-supplied names, such as grammar keywords and those in supplied libraries. Often, most of the names in a program are from libraries standard to the language. To trigger the widespread use of Unicode characters from non-ASCII alphabets in programs, the pre-supplied names must also be in those alphabets. This could easily be done. The grammar of a language and its vocabulary are two different concepts. A programming language grammar could conceivably have many vocabulary options. Almost all programming languages only have English. Other vocabularies could be based on other natural languages. A Spanish vocabulary plugged into a certain programming language would have Spanish names for the entities.

Computer software and webpages nowadays are internationalized, and most programming languages enable internationalized software. But the languages themselves and their libraries are not internationalized. An internationalized programming language would enable a fully functional program to be written totally in a natural language of one's choice. Not only could all user-defined names be specified in any alphabet, but all keywords and names in the standard libraries would also be available in many natural languages. Ideally, when a software supplier ships a library, they'll specify the public names in many languages in the definition. Soon enough, most programming languages will be internationalized. But supplying non-English names is likely to have a slow uptake, so languages must make it easy to translate a library gradually from one natural language into another.

Most natural languages, however, wouldn't actually be used with internationalized programming: there's no real reason to. Many people going to college in non-English-speaking countries can read English, if not speak or write it, and so can use a programming language's libraries easily. Typing the library names into programs involves skill in spelling, not writing, and so such people can program in English easily enough. And with auto-completors in IDE's, programmers really only need to know the first few letters of a name. Writing foreigner-readable programs in English will be more important than using the native tongue of a culture, unless there's a strong nationalist movement promoting the language, such as in France and China.

Another reason a certain natural language might be used in programming is if it makes the program shorter. The natural languages of Northeast Asia, traditional Chinese, simplified Chinese, Japanese, and Korean, have many more tokens than alphabetic languages, and so enable programs to be written much more tersely. The Chinese written language has tens of thousands of characters. In fact, about 80% of Unicode characters are CJK or Korean-only characters. However, the characters used for writing Japanese kanji and traditional Chinese (used in Hong Kong, Taiwan, and Chinatowns) must be read at a much larger font size than English, which would cancel out the benefits of using them. This is possibly why Ruby, invented in Japan, only uses English alphabetic library names. Korean can be read at the same font size as English, but, unlike Chinese and Japanese characters, it's really an alphabetic script. There are only 24 letters in the Korean alphabet, and they're arranged in a square to form each syllable, instead of one after the other as in English. Thus Korean is potentially terser than alphabetic languages, but not as terse as simplified Chinese, the script used in mainland China. Not all simplified characters can be read at the same font size as alphabetic characters, but the thousands that can enable far greater tersity than 26-letter alphabets. A non-proportional font would enable many horizontally-dense characters, both simplified and traditional, to be read at a normal font size also, though that would be a radical departure from the historical square shape of characters.

Chinese characters are each composed of smaller components, recursively in a treelike manner. When breaking them down into their atomic components, the choice of what's atomic is arbitrary because all components can be broken down further, all the way to individual strokes, whereas in Korean it's clear which component is atomic. Chinese characters are often divided into about 500 components. A certain group of Chinese components can often be arranged in more than one manner to form characters with different meanings, e.g. 粞 and 粟 both use 米 as the semantic determinative and 西 as the phonetic component in different sequencing and shaping, unlike Korean where certain components can be arranged into a square in one way only. The arrangement is as much a part of a Chinese character as the components themselves, and provides another dimension of lexical variation that increases the potential tersity of written Chinese in computer programs.

A terse programming language and a tersely-written natural language used together mean greater semantic density, more meaning in each screenful or pageful, hence it's easier to see and understand what's happening in the program. Chinese characters in writing often reduce it to half the size of Latin-based writing, but when used in programming, the tersity could be much greater. The syntax of present-day programming languages is designed to accommodate their ASCII vocabulary. With a Unicode vocabulary, the language grammar could be designed differently to make use of the greater range of tokens, e.g. for 'public class' in Java we could just write '公类' without a space between (just like in Chinese writing), instead of '公 类', making it terser. If only 3000 of the simplest-written 70,000-plus CJK characters in Unicode are used, there are millions of unique two-character combinations. Imagine the reduction in code sizes if these were uniquely mapped to every entity name in the entire Java class libraries.

A large demographic and economic base also helps a certain natural language be successful in internationalized programming. Mainland China has over 1.3 billion people, is consistently one of the fastest growing economies in the world, and has long-term strategic planning. Some leaders in mainland China are asking not how they can reform their language's graphology to promote literacy, but how they can gain advantage from their citizens' knowledge of a difficult-to-learn writing system that few Westerners know. One day using the tersity of Chinese characters in programming languages will be of more value to mainland Chinese programmers than writing foreigner-readable programs in English, and when they decide to switch, it'll be over in a year.

Kanji and Programming

If a programming language enabled multi-vocabulary programming, not only could Chinese programmers mix Chinese characters with the Latin alphabet in their programming, but so could Western programmers. Deep down in their hearts, hackers want to read and write terse code, and will try out new languages at home if they can't at work. They'll begin learning and typing Chinese characters if it reduces clutter on the screen, there are generally available Chinese translations of the names, and they can enter the characters easily. IME's and IDE auto-completors work in similar ways: with an IME, someone types the sound or shape of a character, then selects the character from a popup menu of other similar-sounding or similarly-shaped ones. Instead of two different popup systems, each at a different level of the software stack, IME's for programming could be combined into IDE auto-completors. Entering terser Chinese names for class library entities would then be as easy as entering the English names, limited only by how quickly a Western programmer could learn to recognize the characters in the popup box. They would gradually learn more Chinese characters, simply to program more tersely. Just as Perl, Python, and Ruby are used because of the tersity of their grammar, so also Chinese programming will eventually become popular because of the tersity of its vocabulary.

An internationalized programming language and IDE plugins must allow programmers to begin using another natural language in their programs gradually, only as fast as they learn the new vocabulary, so some names are in one language and some in another. This is much easier if the two natural languages use different alphabets, as do English and Chinese. An IDE plugin can transform the names in a program between two such natural languages easily enough. Non-Chinese programmers won't actually have to learn any Chinese speaking, listening, grammar, or character writing. They can just learn to read characters and type them, at their own pace. Typing Chinese is very different to writing it, requiring recognizing eligible characters in a popup menu. They can learn the sound of a character (the pinyin without the syllabic tone), or instead learn the shape. Because the characters are limited to names in class libraries, they won't need to know grammar or write sentences.

However, it would be easier for Western programmers to learn to recognize and recall the shape of a character than to recall the sound and recognize the shape. Native Chinese speakers can already recall the sound before learning to type, so they learn the pinyin input method more easily. But Westerners who only want to use the visual token in programming, and not speak Chinese or Japanese, would instead learn a visual input method more easily. Instead of learning 剩 as "sheng4: remainder", they would learn it as "禾北刂: remainder". This is much like how the Japanese imported Chinese characters into their own language some 1500 years ago.

Having begun using simplified Chinese characters in programs, programmers will naturally progress to using the rest of the Unihan tokens (Japanese, Korean, and traditional Chinese), and eventually all the left-to-right characters in the Unicode BMP (basic multilingual plane). They'll develop libraries of shorthands, typing π instead of Math.PI. There’s a deep urge within hackers to write programs with mathlike tersity, to marvel at the power portrayed by a few lines of code. One day programmers will choose from all the 110,000 Unicode tokens to use in their programs, not just the 65,000 in the BMP. As the Unicode Consortium adds more tokens, perhaps at an exponential rate, programmers may have a choice of one million tokens in a few short decades. All those tokens, from Sumerian cuneiform to Klingon letters, may eventually have unique meanings for use in programming languages, becoming integrated into the semantic system of the Unihan tokens.
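Such shorthands already compile in Java today, since Greek letters and CJK characters count as identifier letters. A hypothetical sketch (the names π and 半径 are illustrative, not part of any library):

```java
public class Shorthands {
    // π is a Greek letter, hence a valid Java identifier,
    // so this shorthand for Math.PI compiles as-is.
    static final double π = Math.PI;

    public static void main(String[] args) {
        double 半径 = 2.0;  // "radius" in Chinese/Japanese
        // Area of a circle with radius 2, ≈ 12.566
        System.out.println(π * 半径 * 半径);
    }
}
```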

Just as the Japanese people now read Kanji visually, directly accessing meanings from the symbols without first translating them into the sounds of their language, so also programmers will eventually think in this way. Presently, the writing system of Japan provides visual processing of content words and phonetic processing of grammar, and programming languages generally provide phonetic processing of content words and visual processing of grammar. Programmers using CJK tokens in their code could visually process both the content words and the grammar, giving programmers the same power of code comprehension that speed-readers of natural language writing have. Programmers will be able to visually understand a program by quickly flicking through the pages or screens of code, just as speed-readers can read books in a bookshop without buying them just by quickly flicking through the pages.

For programmers to read CJK characters in this way, it may even be essential that they learn the characters visually only, assigning them to English meanings in their own minds. A Western programmer who learns to speak and read Chinese may be able to read Unihan-based programs, but be hindered in their ability to read the characters visually in programs. The IT industry in the West will import the CJK characters, mainly the simplified Chinese ones, but not the actual Chinese, Japanese, or Korean language. As a greater percentage of people continues to learn programming for their jobs, businesses, and daily life, the Unihan-based writing system of programming languages will increasingly influence that of natural languages.

Within 50 years, this will have launched the greatest revolution in Writing since its invention. The Groovy Language aims to be a part of this revolution by enabling the upcoming Strach parser to assign CJK tokens to both Groovy statement keywords and entity names in the JDK standard libraries. And the upcoming Groovy Language IME will enable programmers to enter CJK tokens using a visual method, typing in the components. Just as Sumerian cuneiform revolutionized Writing for all natural languages in the world over 5000 years ago, the Groovy Language will revolutionize Writing for all programming languages.

Last edited Aug 24, 2011 at 2:21 PM by gavingrover, version 4


gavingrover Apr 10, 2012 at 9:08 AM 
The research in the article's 2nd paragraph suggests reading mixed kanji and phonetic kana script is NOT a "painfully slow exercise".

Western programmers using Unihan chars in their programs could do so incrementally, with IDE's handling conversion between different programmers' requirements, so the benefits could be reaped immediately, as mentioned in the article. Also mentioned is the time scale I'm referring to, 50 yrs perhaps. Though some things could happen significantly quicker, especially in computing and N.E. Asia.

this_is_my_handle Mar 31, 2012 at 2:50 AM 
Listen, this is an interesting idea, but there are some big flaws in the thinking in this piece. This statement:

"Just as the Japanese people now read Kanji visually, directly accessing meanings from the symbols without first translating them into the sounds of their language, so also programmers will eventually think in this way."

...is incorrect. During the course of day-to-day life, Japanese for the most part ignore the meanings of the kanji and parse them as sounds first and foremost. If they don't know the reading, then they may use the kanji's individual meanings combined with context to figure out the larger sentence's meaning. However, if every time they read anything they had to process the meanings of each kanji, reading would be a painfully slow exercise.

I'm not saying there isn't some alternative method Western programmers could use to add Chinese characters to their vocabulary of programming syntax, but I'm skeptical that the benefits would outweigh the very significant time and effort it would take to add a large amount of these characters to one's lexicon.

I think it's more likely that native Chinese- or Japanese-reading programmers could integrate some of these strategies into their programming style, but presently the amount of programming resources available in English (both in terms of documentation and programming languages with English-based syntax) are simply overwhelming.

That said, I'm interested in seeing what the future brings, in terms of some of the changes described in the article...it will be interesting when editors and programming languages widely support UTF character sets as a matter of course, with little or no configuration struggles (as is, sadly, still the case now).