Categories: Unicode | Chinese language | Japanese language | Korean language

Han unification

Han unification is the process used by the authors of Unicode and the Universal Character Set to map the multiple character sets of the CJK languages into a single set of unified characters. Chinese characters are common to Chinese (where they are called hanzi), Japanese (where they are called kanji), and Korean (where they are called hanja). Modern Chinese, Japanese, and Korean typefaces may represent a given Han character with somewhat different glyphs. In the formulation of Unicode, however, these differences were folded together, so that each such character has a single code point. This unification is referred to as "Han unification", and the resulting character repertoire is sometimes called Unihan.

Contents

1 Standard

2 Details

3 Controversy

4 Check your browser

5 See also

6 External links

Standard

Rules for Han unification are given in the East Asian Scripts chapter of each version of the Unicode Standard (Chapter 11 in Unicode 4.0). The Ideographic Rapporteur Group (IRG), made up of experts from the Chinese-speaking countries, North and South Korea, Japan, Vietnam, and other countries, is responsible for the process.

Details

The article "The secret life of Unicode", published on IBM developerWorks, explains the issue in a way that also illustrates some of the confusion:

The problem stems from the fact that Unicode encodes characters rather than "glyphs," which are the visual representations of the characters. There are four basic traditions for East Asian character shapes: traditional Chinese, simplified Chinese, Japanese, and Korean. While the Han root character may be the same for CJK languages, the glyphs in common use for the same characters may not be, and new characters were invented in each country.

For example, the traditional Chinese glyph for "grass" uses four strokes for the "grass" radical, whereas the simplified Chinese, Japanese, and Korean glyphs use three. But there is only one Unicode point for the grass character (草, U+8349) regardless of writing system. Another example is the ideograph for "one" (壹, 壱, or 一), which is different in Chinese, Japanese, and Korean. Many people think that the three versions should be encoded differently.

In fact, the three ideographs for "one" are encoded separately in Unicode. They are not national variants. The first and second are used on financial instruments to prevent forgery, while the third is the common form in all three countries.
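The point can be checked directly. The following Python sketch, which uses only the built-in ord function and is offered as an illustration rather than as part of the quoted article, prints the code point of each character mentioned above. The three forms of "one" have three distinct code points, while "grass" has a single one whatever the writing tradition. The helper name code_point is invented for this example.

```python
# Characters discussed above: the single unified "grass" character and
# the three separately encoded ideographs for "one".
GRASS = "草"
ONE_FORMS = ["壹", "壱", "一"]


def code_point(ch: str) -> str:
    """Return the code point of a single character in U+XXXX notation."""
    return f"U+{ord(ch):04X}"


if __name__ == "__main__":
    print(GRASS, code_point(GRASS))
    for ch in ONE_FORMS:
        print(ch, code_point(ch))
```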

A slight difference in rendering might be considered a serious problem if it changes the meaning or reflects the wrong cultural tradition. Besides the simple nuisance of Japanese text looking like Chinese, names might be displayed with a different glyph: the same character in the sense of encoding, but a different character in the view of its users. This rendering problem is often used to criticize Westerners for being unaware of subtle distinctions, even though unification is being carried out by East Asian experts. The display error occurs only when plain text is rendered in a single font, and not when language-specific text and names are rendered in language-appropriate fonts.

The process of Han Unification was controversial, with most of the opposition coming from Japan. Opponents of Han unification state that it steamrolls over thousands of years of cultural tradition, misses many of the subtleties that are one of the most important features of these languages, and renders serious literature and academic research in these languages impossible. Proponents of Han unification point out that the unification process is in the hands of specialists from China, Korea, and Japan, and that the objections to unification of specific characters are made without regard to their histories. Characters which some Japanese today consider completely distinct were historically the same, and were taught as the same in Japanese schools until the 1950s. As for historical research, Unicode now encodes far more characters than any other standard, and far more than were listed in any dictionary, with many more being processed for inclusion as fast as the scholars can agree on their identities.

Some characters used only in names are not included in Unicode. This is not a form of cultural imperialism, as is sometimes feared. These characters are generally not included in their national character sets either.

Controversy

Much of the controversy surrounding Han unification stems from confusion between characters and glyphs, as defined in Unicode, and the related but distinct idea of graphemes. Unicode defines abstract characters, as opposed to glyphs, which are particular visual representations of a character in a font, and graphemes, which are the basic units of writing in a particular language. One character may be represented by many distinct glyphs: a "g" or an "a", for example, may each be drawn with one loop or two. In Dutch, "ij" is a single letter (ĳ), and thus a grapheme; at the start of a word such as "IJsselmeer", both parts are capitalized together. The same applies to "ch" in some Spanish-speaking countries and to "lj" in Croatian. Graphemes present in national character code standards have been added to Unicode, as required by Unicode's Source Separation rule, even where they can be composed of characters already available.
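The relationship between such a grapheme and the characters that can compose it is visible in Unicode's normalization data. The sketch below uses Python's standard unicodedata module to look at the single code points for the Dutch "ĳ" (U+0133) and the Croatian "lj" (U+01C9), and shows that compatibility normalization (NFKD) maps each of them to a two-letter sequence. The helper name describe is invented for this illustration.

```python
import unicodedata

# Single code points for digraphs: Dutch ij and Croatian lj.
DIGRAPHS = ["\u0133", "\u01c9"]


def describe(ch: str) -> tuple:
    """Return (code point, Unicode name, NFKD decomposition) for one character."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.normalize("NFKD", ch),
    )


if __name__ == "__main__":
    for ch in DIGRAPHS:
        print(describe(ch))
```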

Unicode publishes charts with pictures for each character, but these are illustrations only and do not mandate the character's shape. References like [1] below seem to assume that what the Unicode standard pictures is how each character must be displayed, and protest when it doesn't match the local appearance of the character. The way things are supposed to work is that a Japanese user will have a font with Japanese-style characters, a Chinese user will have a font with Chinese-style characters, etc., and everyone will see the "right" characters for them. Problems are introduced when several languages must be represented in the same text document, and users expect different fonts for the different languages. This can be worked around outside the Unicode standard with higher-level markup defining the language used for each string of characters, although this is cumbersome and may not always work correctly; see the demonstration below.
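As a minimal sketch of the higher-level markup mentioned above, the following Python snippet wraps the same character in HTML span elements that differ only in their lang attribute. The language tags used here (zh-Hans, zh-Hant, ja, ko) and the helper name tag_language are assumptions chosen for illustration, not necessarily the values used by any particular page. Whether a browser then picks a language-appropriate font depends on the browser and the fonts installed.

```python
from html import escape


def tag_language(text: str, lang: str) -> str:
    """Wrap text in an HTML span carrying a lang attribute."""
    return f'<span lang="{escape(lang, quote=True)}">{escape(text)}</span>'


SAMPLE = "直"  # one unified code point, drawn differently by tradition
LANG_TAGS = ["zh-Hans", "zh-Hant", "ja", "ko"]  # assumed tags

if __name__ == "__main__":
    for lang in LANG_TAGS:
        print(tag_language(SAMPLE, lang))
```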

Most of the opposition to Han unification appears to come from Japan, owing to a greater sensitivity to the distinctions between Chinese and Japanese styles of characters. There has been very little opposition from Chinese speakers. Although the Taiwanese Big5 character set does not include Simplified characters, the PRC has character set standards both with and without them. Unicode is seen as neutral with regard to the politically charged issue of Simplified versus Traditional characters, because it encodes Simplified and Traditional Chinese forms separately (e.g. the ideograph for "discard" is 丟 U+4E1F for Traditional Chinese, Big5 #A5E1, and 丢 U+4E22 for Simplified Chinese, GB #2210). Traditional and Simplified characters must be encoded separately under the Unicode unification rules because they are distinguished in pre-existing PRC character sets, not merely because they have different shapes. The mapping between Traditional and Simplified characters is also not one-to-one, which in itself prevents unification.
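The lack of a one-to-one mapping can be illustrated with a very small table. In the Python sketch below, the first pair is the "discard" example from the text. The other two entries (發 and 髮, both written 发 in Simplified Chinese) are a commonly cited case added here for illustration and do not come from the text above. The names TRAD_TO_SIMP and invert are likewise invented for the example.

```python
from collections import defaultdict

# Traditional -> Simplified pairs (a tiny illustrative sample, not a full table).
TRAD_TO_SIMP = {
    "\u4e1f": "\u4e22",  # 丟 -> 丢, "discard" (from the text above)
    "發": "发",           # "emit" (added example)
    "髮": "发",           # "hair" (added example)
}


def invert(mapping: dict) -> dict:
    """Group Traditional characters by the Simplified character they map to."""
    inverse = defaultdict(list)
    for trad, simp in mapping.items():
        inverse[simp].append(trad)
    return dict(inverse)


if __name__ == "__main__":
    for simp, trads in invert(TRAD_TO_SIMP).items():
        print(simp, "<-", " ".join(trads))
```

Any Simplified character that collects more than one Traditional source cannot be converted back without knowing the intended word.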

Specialist character sets have been developed to address these perceived deficiencies, or are regarded by some as not suffering from them.

However, none of these alternative standards has been as widely adopted as Unicode, which is now the base character set for many new standards and protocols and is built into the architecture of operating systems (Windows, Mac OS X, and many versions of Unix), programming languages (Perl, Python, Java, Common Lisp, APL), libraries (IBM International Components for Unicode (ICU), along with the Pango, Graphite, and Scribe rendering engines), font formats (TrueType and OpenType), and so on.

Check your browser

The following table contains the same characters in all five rows, but each row is marked (via an HTML attribute) as being in a different language: Chinese (three varieties: unmarked "Chinese", simplified characters, and traditional characters), Japanese, or Korean. Ideally, your browser should select the fonts and glyphs best suited to each language. See how well it works for you.

Chinese (generic)      与 今 令 免 入 全 具 刃 化 區 外 天 才 次 海 漢 町 画 直 真 空 紀 草 角 道 餓 骨
Chinese (Simplified)   与 今 令 免 入 全 具 刃 化 區 外 天 才 次 海 漢 町 画 直 真 空 紀 草 角 道 餓 骨
Chinese (Traditional)  与 今 令 免 入 全 具 刃 化 區 外 天 才 次 海 漢 町 画 直 真 空 紀 草 角 道 餓 骨
Japanese               与 今 令 免 入 全 具 刃 化 區 外 天 才 次 海 漢 町 画 直 真 空 紀 草 角 道 餓 骨
Korean                 与 今 令 免 入 全 具 刃 化 區 外 天 才 次 海 漢 町 画 直 真 空 紀 草 角 道 餓 骨

The following table contains graphemes that are each represented in Unicode by several separately encoded characters, together with their code points:

Code    Chinese (generic)  Chinese (Simplified)  Chinese (Traditional)  Japanese  Korean
U+9AD8  高                 高                    高                     高        高
U+9AD9  髙                 髙                    髙                     髙        髙
U+7D05  紅                 紅                    紅                     紅        紅
U+7EA2  红                 红                    红                     红        红
U+4E1F  丟                 丟                    丟                     丟        丟
U+4E22  丢                 丢                    丢                     丢        丢
U+4E57  乗                 乗                    乗                     乗        乗
U+4E58  乘                 乘                    乘                     乘        乘
U+4FA3  侣                 侣                    侣                     侣        侣
U+4FB6  侶                 侶                    侶                     侶        侶
U+514C  兌                 兌                    兌                     兌        兌
U+5151  兑                 兑                    兑                     兑        兑
U+5167  內                 內                    內                     內        內
U+5185  内                 内                    内                     内        内
U+7522  產                 產                    產                     產        產
U+7523  産                 産                    産                     産        産
U+7A05  稅                 稅                    稅                     稅        稅
U+7A0E  税                 税                    税                     税        税
U+2FD4  ⿔                 ⿔                    ⿔                     ⿔        ⿔
U+4E80  亀                 亀                    亀                     亀        亀
U+9F9C  龜                 龜                    龜                     龜        龜
U+9F9F  龟                 龟                    龟                     龟        龟
U+F907  龜                 龜                    龜                     龜        龜
U+F908  龜                 龜                    龜                     龜        龜
U+5225  別                 別                    別                     別        別
U+522B  别                 别                    别                     别        别
U+4E21  両                 両                    両                     両        両
U+4E24  两                 两                    两                     两        两
U+5169  兩                 兩                    兩                     兩        兩
U+F978  兩                 兩                    兩                     兩        兩
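Several entries in the table above appear to repeat the same character under different code points. U+F907, U+F908, and U+F978 lie in the CJK Compatibility Ideographs block, and Unicode normalization replaces each of them with a unified ideograph. U+2FD4 is a Kangxi radical, which only compatibility normalization maps to an ideograph. The Python sketch below, using the standard unicodedata module, shows this. Whether normalization is actually why the glyphs look identical in a given copy of this page is an assumption, and the helper name nfc_code_point is invented for the example.

```python
import unicodedata

# Code points taken from the table above, written as escapes so that
# no editor or pipeline can silently normalize them.
COMPATIBILITY_FORMS = ["\uf907", "\uf908", "\uf978"]
KANGXI_RADICAL_TURTLE = "\u2fd4"


def nfc_code_point(ch: str) -> str:
    """Return the code point(s) of ch after NFC normalization."""
    normalized = unicodedata.normalize("NFC", ch)
    return " ".join(f"U+{ord(c):04X}" for c in normalized)


if __name__ == "__main__":
    for ch in COMPATIBILITY_FORMS:
        print(f"U+{ord(ch):04X} -> {nfc_code_point(ch)}")
    # The radical is unchanged by NFC; NFKC maps it to the ideograph.
    nfkc = unicodedata.normalize("NFKC", KANGXI_RADICAL_TURTLE)
    print(f"U+2FD4 -> NFC {nfc_code_point(KANGXI_RADICAL_TURTLE)}, NFKC U+{ord(nfkc):04X}")
```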

External links

Unicode standard
Han Unification in Unicode by Otfried Cheong
Why Unicode Won't Work on the Internet: Linguistic, Political, and Technical Limitations
Why Unicode Will Work On The Internet
Unihan Database
Per-character summary of differences in characters
The secret life of Unicode
GB18030 Support Package for Windows 2000/XP, including Chinese, Tibetan, Yi, Mongolian and Thai font by Microsoft
Proposal to encode additional grass radicals in the UCS - A humorous proposal to encode all possible variants of the grass radical, made as an April Fool's Day joke

Last updated: 05-07-2005 16:40:36

Last updated: 05-13-2005 07:56:04
