(For a Chinese translation of this blog entry, see 中譯本：「文章造天下，功業還蒼生。」)
(Alternate title: "Chinese Twitter users live in a density 2x to 8x their English counterparts; here's why.")
I promised Adina Levin a treatise on the information density of Chinese characters on Twitter "after all SocialCalc (nee wikiCalc) performance bugs are fixed".
As I fixed them yesterday, let's try coding some English...
I'll begin by saying that Ken's in(ter)vention of UTF-8 (as narrated by Rob Pike) accurately reflects the relative information density between ASCII and CJK characters.
After all, UCS was extended from 16 bits to 21 bits precisely because so damn many Chinese characters need to be encoded, even after the controversial decimation from the Han unification effort.
Hypothetically, if Twitter had set its limit to 140 UTF-8 bytes, then our experience when tweeting Chinese would be on par with tweeting English, because each Chinese character would then take 3 bytes — and since I occasionally venture beyond the BMP, sometimes 4 bytes.
By a historical accident, though, Twitter counts in characters. (Not graphemes, as I've clinically proved.) So that gives a Chinese tweet 420 effective bytes to work with, which is already sufficient for a short blog post.
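To make the byte arithmetic concrete, here is a minimal Python sketch. It assumes Twitter's "characters" behave like Unicode code points, which is what Python's len() counts on a str. The helper name utf8_cost is made up for this illustration, and U+20000 is just one example of a character beyond the BMP.

```python
def utf8_cost(text):
    """Return (code points, UTF-8 bytes) for text. Hypothetical helper."""
    return len(text), len(text.encode("utf-8"))

# ASCII letters take 1 byte each in UTF-8.
print(utf8_cost("net"))          # (3, 3)

# Chinese characters inside the BMP take 3 bytes each.
print(utf8_cost("網路"))          # (2, 6)

# A character beyond the BMP (U+20000, CJK Extension B) takes 4 bytes.
print(utf8_cost("\U00020000"))   # (1, 4)

# 140 BMP Chinese characters: still 140 by character count, but 420 bytes.
print(utf8_cost("網" * 140))      # (140, 420)
```

Under a 140-byte limit, the last example would have had to stop at 46 characters, which is roughly the parity with English described above.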
...but wait, there's more!
You see, there are two modes of Chinese: Vernacular (白話) as well as Literary (文言).
Typically, the vernacular mode takes two characters to encode an English word, e.g. "網路" for "network".
In Literary mode, by contrast, one character encodes a concept, e.g. "網" for "net" can stand for "network", "fishnet", "to connect", or "to capture", depending on surrounding context.
So writing in Literary mode, which conveniently also elides spaces and uses only minimal punctuation, gives me 140 concepts to work with, which is about 200 English words, or nearly 1 KB of English text.
With that brief intro, let's use my three most recent tweets for a concrete demonstration.
The first one is entirely in vernacular mode, the second one is in mixed mode, while the third (most recent) one is written almost completely in Literary mode.
We shall see how the information density increases with the shift of modes, as compared with their English translations.
The first one is a quote from Chen Ying-Zhen, b. 1936, one of the greatest Taiwanese authors and activists of his generation, who recently uttered:
RT 陳映真：「文學退化，影像、聲音成為『當下世界』的符碼，」他說：「托爾斯泰生在今日，大部頭的作品也會喪失大量讀者。」現在的創作流行輕薄短小，又以自我為中心，他因此形容這一代青年創作者「是脫光了衣服站在鏡子前面，凝視鏡中自己的身體與慾望… 他們讀的不多，不能成就崇高的文學」。(139 characters)
Google Translate renders the above as:
RT Chen Ying-chen: "Literature degradation, images, sounds become codes of present world," he said: "Tolstoy was born in today, voluminous works of the loss of a large number of readers will be."
Now the creation of popular thin and light short, Youyi self-centered, so he described this generation of young artists "is stripped of clothes and stand in front of the mirror, staring in the mirror his own body and desires ... they read much, does not develop high literature."
The translation is subtly wrong in multiple regards (not to mention having a sexist default). It's especially wrong in the last sentence, where Chen actually said "they don't read much"!
But it's at least comprehensible, and will serve as a good compression-ratio example: the Chinese/English information density ratio here is 3.4x.
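The post doesn't spell out how the ratio is measured. A plausible reading, and an assumption here, is the character count of the English rendering (spaces and punctuation included) divided by the character count of the Chinese original. A minimal Python sketch under that assumption, with density_ratio as a made-up name:

```python
def density_ratio(chinese, english):
    """English characters per Chinese character (assumed metric, hypothetical helper)."""
    return len(english) / len(chinese)

# The vernacular word pair mentioned earlier: two characters for a seven-letter word.
print(density_ratio("網路", "network"))   # 3.5

# The same division on raw counts: a 127-character tweet against a
# 509-character machine translation (both figures are quoted later in this post).
print(round(509 / 127, 2))                # 4.01
```

A different yardstick, such as UTF-8 bytes or word counts, would give different numbers, so treat the ratios as rough indicators rather than exact measurements.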
Now the second one is from me, a reply to Chen:
Re 陳映真：但背了整本維基，行遍無數國度，亦不成就崇高的文學。歐巴馬「以父之名」成不了作家，只好化藝術為行動、為現實。上一輩在高壓下，被迫濃粹經驗為符碼，而我們讀遍之後，融合歷史視域，還原此在而為 Hacktivism，竊自以為然。 (116 characters)
The Google Translation is much weaker this time, bordering on incomprehensible, with a ratio of 3.8x:
Chen Ying-chen Re: But the back of the whole wiki, line countless times a country, nor the noble achievements of literature. Obama, "Name of the Father" can not become a writer had to arts-based action, into reality. The last generation under high pressure, was forced to experience concentrated Intrade codes, and we read times, the integration of history, depending on the domain, restore this in the for Hacktivism, stolen from that it does.
My own translation would be something like this, with a 5.7x ratio:
Re Chen Ying-Zhen: But having an entire Wikipedia-backed memory, and having traveled to countless countries, still won't make high literature out of me.
Consider Obama who, having failed to launch a writer's career with "Dreams from my Father", is forced to project his art into activism and into reality-shaping.
My earlier generation, under tremendous Fascist pressure, is forced to distill their subjective experience to highly compressed literary code.
When my generation finished deciphering those codes to achieve a fusion of horizons, we decompressed them into the here-and-now as Hacktivism. This kind of adaption is IMHO natural and quite justified.
This leads us to the third tweet, a follow-on elaboration composed entirely in Literary mode:
舉實例言，《金盾工程》（即「功夫網」、「資訊長城」），無異吾儕之柏林牆。我十年前譯寫自由網，而至其後 Tor、無界等武裝，無非保持對話、以獨促統之意。與 Beijing.pm 嘗言：「功網未散，何以為族？」。凡此種種，生自文事，衍為武功，皆此代之共業。 (127 characters)
The Google translation, at 509 characters, is entirely incomprehensible ("promoting unification of Italy"??), so I won't even bother quoting it here. My translation:
As a concrete example, for our contemporary people, the "Golden Shield Project" (a.k.a., the "Gong-Fu Web", "Great Firewall of China") is no different from the Berlin Wall.
Ten years ago I translated and coded for the Freenet project; along with its follow-ups such as Tor and Wu-Jie, they're nothing but cyberspace armaments designed with the sole purpose of keeping the conversation flowing.
This way we maintain our independent identity, in the hope of accelerating a fusion of horizons with the Chinese government.
I've told Beijing.pm: "Without the dissipation of the Great Firewall, how can we make one people out of us?"
All these circumstances, born out of the literary world, have spawned into cyberspace as hacktivism, and affected us in creating a software-shaped reality. This is a factor in the shared karmic setting of this generation.
Here the ratio is 6.67x. Bear in mind that this is not an edge case of maximum compression; Chinese Twitter users constantly live in a density anywhere from 2x to 8x that of their English counterparts.