鳳

文章造天下,功業還蒼生。

(This is a Chinese translation of my previous blog post: Our paroqial fermament, one tide on another, written in mixed cn-tw vocabularies.)

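(This is the Chinese translation of my previous blog post. The original title quoted Joyce and really could not be translated, so I replaced it with a couplet by Zhou Dewei (周德偉), one I saw on the wall every week back when we ran 藝立協.)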

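Adina Levin asked me to talk about how the information density of Chinese Twitter differs from that of English. I replied, "Wait until the performance flaws in our SocialCalc (formerly wikiCalc) are all fixed."

Yesterday I finished the debugging, so now I have no choice but to do my best at programming in English.

The first thing I want to say is that when Ken invented the UTF-8 encoding (see Rob Pike's first-hand account), it already correctly reflected the difference in information between English letters and CJK characters.

Think about it: wasn't the Universal Character Set forced to expand from 16 bits to 21 bits precisely because of Chinese characters, "numerous as the stars, a stampede of ten thousand codes"? Even after Han unification was forced through and countless duplicate characters were wiped out, two bytes still turned out to be insufficient.

Suppose Twitter's limit were 140 UTF-8 bytes. Then writing a Chinese tweet would probably feel much like writing an English one, because each Chinese character takes 3 bytes. Sometimes I use rare archaic characters beyond the Basic Multilingual Plane, and those take 4 bytes.

But by a historical accident, Twitter counts 140 characters. (Not graphemes; I have verified this personally.) A Chinese tweet therefore has 420 bytes to work with, which is equivalent to a short blog post.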

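...but wait, that's not all!

You see, Chinese has two written styles: Vernacular (白話) and Literary (文言).

The vernacular usually takes two characters to represent one English word. For example, "網絡" means "network".

In the Literary style, each character represents a cluster of concepts. For example, "網", the English "net", can mean "network" (網路), "fishnet" (漁網), "to connect" (連網), or "to capture" (網羅), depending entirely on the surrounding context.

When writing Literary Chinese, there are no spaces between characters and much of the punctuation is omitted, so there are 140 concepts to work with. Expressing them in English would take about 200 words, which is about 1 KB of information.

That is the theory. With the introduction done, let's look at my three most recent tweets and carefully check it.

The first is entirely in plain vernacular, the second is half Literary and half vernacular, and the third (the most recent) is almost entirely Literary.

Moving from vernacular to Literary, the information density should increase, and this can be measured by the length of the English translation.

The first is a quote of something recently said by Chen Ying-Zhen (b. 1936, one of the most influential writers and activists in Taiwan in his time):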

RT 陳映真:「文學退化,影像、聲音成為『當下世界』的符碼,」他說:「托爾斯泰生在今日,大部頭的作品也會喪失大量讀者。」現在的創作流行輕薄短小,又以自我為中心,他因此形容這一代青年創作者「是脫光了衣服站在鏡子前面,凝視鏡中自己的身體與慾望… 他們讀的不多,不能成就崇高的文學」。(139 characters)

Google translates it as:

RT Chen Ying-chen: "Literature degradation, images, sounds become codes of present world," he said: "Tolstoy was born in today, voluminous works of the loss of a large number of readers will be."

Now the creation of popular thin and light short, Youyi self-centered, so he described this generation of young artists "is stripped of clothes and stand in front of the mirror, staring in the mirror his own body and desires ... they read much, does not develop high literature."

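(475 characters)

This translation is broadly correct, but small errors abound (not to mention the sexist treatment of the young creators' bodies). The last sentence is mistranslated entirely: "they don't read much" was rendered as "they read much"!

But at least this translation gets the meaning across, so let's do the division: the ratio of English to Chinese characters is 3.4x.

Next, the second one is my reply to Chen: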

Re 陳映真:但背了整本維基,行遍無數國度,亦不成就崇高的文學。歐巴馬「以父之名」成不了作家,只好化藝術為行動、為現實。上一輩在高壓下,被迫濃粹經驗為符碼,而我們讀遍之後,融合歷史視域,還原此在而為 Hacktivism,竊自以為然。 (116 characters)

Google's translation falls apart this time and is nearly impossible to understand. The ratio is 3.8x:

Chen Ying-chen Re: But the back of the whole wiki, line countless times a country, nor the noble achievements of literature. Obama, "Name of the Father" can not become a writer had to arts-based action, into reality. The last generation under high pressure, was forced to experience concentrated Intrade codes, and we read times, the integration of history, depending on the domain, restore this in the for Hacktivism, stolen from that it does.

(444 characters)

My own translation follows, with an English-to-Chinese ratio of 5.7x:

Re Chen Ying-Zhen: But having an entire Wikipedia-backed memory, and having traveled to countless countries, still won't make high literature out of me.

Consider Obama who, having failed to launch a writer's career with "Dreams from my Father", is forced to project his art into activism and into reality-shaping.

My earlier generation, under tremendous Fascist pressure, is forced to distill their subjective experience to highly compressed literary code.

When my generation finished deciphering those codes to achieve a fusion of horizons, we decompressed them into the here-and-now as Hacktivism. This kind of adaption is IMHO natural and quite justified.

(662 characters)

Next comes the third one, which elaborates further on the view above and is written mostly in Literary Chinese:

舉實例言,《金盾工程》(即「功夫網」、「資訊長城」),無異吾儕之柏林牆。我十年前譯寫自由網,而至其後 Tor、無界等武裝,無非保持對話、以獨促統之意。與 Beijing.pm 嘗言:「功網未散,何以為族?」。凡此種種,生自文事,衍為武功,皆此代之共業。 (127 characters)

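Google produces 509 characters. Long it certainly is, but the meaning is entirely lost ("promoting the unification of Italy"??), so I won't quote it here. My translation follows: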

As a concrete example, for our contemporary people, the "Golden Shield Project" (a.k.a., the "Gong-Fu Web", "Great Firewall of China") is no different from the Berlin Wall.

Ten years ago I translated and coded for the Freenet project; along with its follow-ups such as Tor and Wu-Jie, they're nothing but cyberspace armaments designed with the sole purpose of keeping the conversation flowing.

This way we maintain our independent identity, in the hope of accelerating a fusion of horizons with the Chinese government.

I've told Beijing.pm: "Without the dissipation of the Great Firewall, how can we make one people out of us?"

All these circumstances, born out of the literary world, has derived into the cyberspace hacktivism, and affected us in creating a software-shaped reality. This is a factor in the shared karmic setting of this generation.

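(848 characters)

As expected, the density ratio reaches 6.67x. Bear in mind that the examples above were not deliberately contrived: Chinese Twitter writers really do live at 2 to 8 times the information density of English writers.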

October 12, 2009 at 04:22 AM in Craft, Lingua, Meta, People

Our paroqial fermament, one tide on another.

(For a Chinese translation of this blog entry, see 中譯本:「文章造天下,功業還蒼生。」)

(Alternate title: "Chinese Twitter users live in a density 2x to 8x their English counterparts; here's why.")

I promised Adina Levin a treatise on the information density of Chinese characters on Twitter "after all SocialCalc (nee wikiCalc) performance bugs are fixed".

As I fixed them yesterday, let's try coding some English...

I'll begin by saying that Ken's in(ter)vention of UTF-8 (as narrated by Rob Pike) accurately reflects the relative information density between ASCII and CJK characters.

After all, UCS was extended from 16 bits to 21 bits precisely because so damn many Chinese characters need to be encoded, even after the controversial decimation from the Han unification effort.
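
The two bit widths can be checked directly. This is a minimal Python sketch, assuming the present-day Unicode code space whose highest code point is U+10FFFF; the constant and function names are illustrative only.

```python
# Highest code point of the Basic Multilingual Plane (the original 16-bit space).
BMP_MAX = 0xFFFF
# Highest code point of the Unicode code space as defined today (an assumption
# about the current standard, not something stated in this post).
UNICODE_MAX = 0x10FFFF


def bits_needed(code_point: int) -> int:
    """Number of bits needed to store code_point as an unsigned integer."""
    return code_point.bit_length()


print(bits_needed(BMP_MAX))      # 16
print(bits_needed(UNICODE_MAX))  # 21
```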

Hypothetically, if Twitter had set its limit to 140 UTF-8 bytes, then our experience when tweeting Chinese would be on par with tweeting English, because each Chinese character would then take 3 bytes — and since I occasionally venture beyond the BMP, sometimes 4 bytes.
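
The per-character byte counts are easy to verify. Here is a small Python sketch; the helper name `utf8_len` and the three sample characters are my own choices for illustration, with U+20000 standing in for "a character beyond the BMP".

```python
def utf8_len(text: str) -> int:
    """Length of text in bytes once it is encoded as UTF-8."""
    return len(text.encode("utf-8"))


print(utf8_len("n"))           # 1: an ASCII letter
print(utf8_len("網"))          # 3: a CJK character inside the BMP (U+7DB2)
print(utf8_len("\U00020000"))  # 4: a CJK Extension B character, beyond the BMP
```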

By a historical accident, though, Twitter counts in characters. (Not graphemes, as I've clinically proved.) So that gives a Chinese tweet 420 effective bytes to work with, which is already sufficient for a short blog post.
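
The 420-byte figure is just 140 characters times 3 bytes each. The sketch below works it out in Python, and also shows why characters (code points) and graphemes are different units. It says nothing about how Twitter itself counts; `TWEET_LIMIT` and `utf8_budget` are hypothetical names, and Python's `len` is used only because it counts code points.

```python
import unicodedata

TWEET_LIMIT = 140  # characters per tweet, as discussed above


def utf8_budget(sample_char: str, limit: int = TWEET_LIMIT) -> int:
    """UTF-8 bytes used by a tweet made of `limit` copies of sample_char."""
    return len((sample_char * limit).encode("utf-8"))


print(utf8_budget("n"))   # 140
print(utf8_budget("網"))  # 420

# One grapheme can span several code points: "e" followed by U+0301
# (COMBINING ACUTE ACCENT) displays as a single accented letter, but it is
# two code points until NFC normalization composes it into one.
decomposed = "e\u0301"
print(len(decomposed))                                # 2
print(len(unicodedata.normalize("NFC", decomposed)))  # 1
```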

...but wait, there's more!

You see, there are two modes of Chinese: Vernacular (白話) as well as Literary (文言).

Typically, the vernacular mode takes two characters to encode an English word, e.g. "網路" for "network".

In Literary mode, by contrast, one character encodes a concept, e.g. "網" for "net" can stand for "network", "fishnet", "to connect", or "to capture", depending on surrounding context.

So writing in Literary mode, which conveniently also elides spaces and uses only minimal punctuation, gives me 140 concepts to work with, which is about 200 English words, or nearly 1 KB of English text.

With that brief intro, let's use my three most recent tweets for a concrete demonstration.

The first one is entirely in vernacular mode, the second one is in mixed mode, while the third (most recent) one is written almost completely in Literary mode.

We shall see how the information density increases with the shift of modes, as compared with their English translations.

The first one is a quote from Chen Ying-Zhen, b. 1936, one of the greatest Taiwanese authors and activists of his generation, who recently uttered:

RT 陳映真:「文學退化,影像、聲音成為『當下世界』的符碼,」他說:「托爾斯泰生在今日,大部頭的作品也會喪失大量讀者。」現在的創作流行輕薄短小,又以自我為中心,他因此形容這一代青年創作者「是脫光了衣服站在鏡子前面,凝視鏡中自己的身體與慾望… 他們讀的不多,不能成就崇高的文學」。(139 characters)

Google Translate renders the above as:

RT Chen Ying-chen: "Literature degradation, images, sounds become codes of present world," he said: "Tolstoy was born in today, voluminous works of the loss of a large number of readers will be."

Now the creation of popular thin and light short, Youyi self-centered, so he described this generation of young artists "is stripped of clothes and stand in front of the mirror, staring in the mirror his own body and desires ... they read much, does not develop high literature."

(475 characters)

The translation is subtly wrong in multiple regards (not to mention having a sexist default). It's especially wrong in the last sentence, where Chen actually said "they don't read much"!

But it's at least comprehensible, and will serve as a good compression-ratio example: the Chinese/English information density ratio here is 3.4x.

Now the second one is from me, a reply to Chen:

Re 陳映真:但背了整本維基,行遍無數國度,亦不成就崇高的文學。歐巴馬「以父之名」成不了作家,只好化藝術為行動、為現實。上一輩在高壓下,被迫濃粹經驗為符碼,而我們讀遍之後,融合歷史視域,還原此在而為 Hacktivism,竊自以為然。 (116 characters)

The Google Translation is much weaker this time, bordering on incomprehensible, with a ratio of 3.8x:

Chen Ying-chen Re: But the back of the whole wiki, line countless times a country, nor the noble achievements of literature. Obama, "Name of the Father" can not become a writer had to arts-based action, into reality. The last generation under high pressure, was forced to experience concentrated Intrade codes, and we read times, the integration of history, depending on the domain, restore this in the for Hacktivism, stolen from that it does.

(444 characters)

My own translation would be something like this, with a 5.7x ratio:

Re Chen Ying-Zhen: But having an entire Wikipedia-backed memory, and having traveled to countless countries, still won't make high literature out of me.

Consider Obama who, having failed to launch a writer's career with "Dreams from my Father", is forced to project his art into activism and into reality-shaping.

My earlier generation, under tremendous Fascist pressure, is forced to distill their subjective experience to highly compressed literary code.

When my generation finished deciphering those codes to achieve a fusion of horizons, we decompressed them into the here-and-now as Hacktivism. This kind of adaption is IMHO natural and quite justified.

(662 characters)

This leads us to the third tweet, a follow-on elaboration composed almost entirely in Literary mode:

舉實例言,《金盾工程》(即「功夫網」、「資訊長城」),無異吾儕之柏林牆。我十年前譯寫自由網,而至其後 Tor、無界等武裝,無非保持對話、以獨促統之意。與 Beijing.pm 嘗言:「功網未散,何以為族?」。凡此種種,生自文事,衍為武功,皆此代之共業。 (127 characters)

The Google translation at 509 characters is entirely incomprehensible ("promoting unification of Italy"??), so I won't even bother quoting it here. My translation:

As a concrete example, for our contemporary people, the "Golden Shield Project" (a.k.a., the "Gong-Fu Web", "Great Firewall of China") is no different from the Berlin Wall.

Ten years ago I translated and coded for the Freenet project; along with its follow-ups such as Tor and Wu-Jie, they're nothing but cyberspace armaments designed with the sole purpose of keeping the conversation flowing.

This way we maintain our independent identity, in the hope of accelerating a fusion of horizons with the Chinese government.

I've told Beijing.pm: "Without the dissipation of the Great Firewall, how can we make one people out of us?"

All these circumstances, born out of the literary world, has spawned into cyberspace as hacktivism, and affected us in creating a software-shaped reality. This is a factor in the shared karmic setting of this generation.

(848 characters)

Here the ratio is 6.67x. Bear in mind that this is not an edge case of maximum compression; Chinese Twitter users constantly live in a density anywhere from 2x to 8x that of their English counterparts.
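
The ratios quoted in this post can be reproduced from the character counts given with each tweet and translation. A minimal Python sketch follows; the labels and the name `density_ratio` are illustrative, and the counts are copied from the post rather than recomputed from the text. The last ratio comes out as 6.677..., so the 6.67 figure appears to be truncated rather than rounded.

```python
# (label, Chinese character count, English character count), as stated above
samples = [
    ("tweet 1, Google translation", 139, 475),
    ("tweet 2, Google translation", 116, 444),
    ("tweet 2, author's translation", 116, 662),
    ("tweet 3, author's translation", 127, 848),
]


def density_ratio(chinese_chars: int, english_chars: int) -> float:
    """English characters needed per Chinese character."""
    return english_chars / chinese_chars


for label, zh, en in samples:
    print(f"{label}: {density_ratio(zh, en):.2f}x")
# tweet 1, Google translation: 3.42x
# tweet 2, Google translation: 3.83x
# tweet 2, author's translation: 5.71x
# tweet 3, author's translation: 6.68x
```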

October 11, 2009 at 07:30 PM in Craft, Lingua, Meta, People

鳳たんです!

Last month when I visited Tokyo, I was very flattered (and pleasantly surprised) when fellow Japanese hackers -- Miyagawa and Takesako in particular -- referred to me as "otori-tan", literally the "phoenix girl", where phoenix is my Chinese chosen name.

The fascinating thing is that this sounds almost exactly the same as the Japanese romanization of "Audrey Tang", my English chosen name. Miyagawa wondered if I knew about this strange coincidence -- I really did not. :-)

Wikipedia has some more information about the -tan suffix, as seen in e.g. the OS-tan phenomenon. After some ego-googling, it seems that this usage started from lolipop's blog, then subsequently propagated to Rocco's and other places.  Lovely!

April 20, 2009 at 10:30 AM in Lingua

License


  • Public Domain Dedication
    This work is dedicated to the Public Domain.
