Search

The Online Encyclopedia and Dictionary

 
     
 

Encyclopedia

Dictionary

Quotes

   
 

Mapping of Unicode characters

Unicode reserves 1,114,112 (= 220 + 216) code points, and currently assigns characters to more than 96,000 of those code points. The first 256 codes precisely match those of ISO 8859-1, the most popular 8-bit character encoding in the "Western world"; as a result, the first 128 characters are also identical to ASCII.

The Unicode code space for characters is divided into 17 "planes" and each plane has 65,536 (= 216) code points. There is much controversy among CJK specialists, particularly Japanese ones, about the desirability and technical merit of the "Han unification" process used to map multiple Chinese and Japanese character sets into a single set of unified characters. (See Chinese character encoding)

The cap of ~220 code points exists in order to maintain compatibility with the UTF-16 encoding, which can only address that range (see below). There is only ten percent current utilization of the Unicode code space. Furthermore, ranges of characters have been tentatively blocked out for every known unencoded script (see [1]), and while Unicode may need another plane for ideographic characters, there are ten planes that could only be needed if previously unknown scripts with tens of thousands of characters are discovered. This ~20 bit limit is unlikely to be reached in the near future.

Contents

Basic Multilingual Plane

The first plane (plane 0), the Basic Multilingual Plane (BMP), is where most characters have been assigned so far. The BMP contains characters for almost all modern languages, and a large number of special characters. Most of the allocated code points in the BMP are used to encode Chinese, Japanese, and Korean (CJK) characters.


As of Unicode 4.1, The BMP includes the following scripts:

  • Basic Latin (0000–007F)
  • Latin-1 Supplement (0080–00FF)
  • Latin Extended-A (0100–017F)
  • Latin Extended-B (0180–024F)
  • IPA Extensions (0250–02AF)
  • Spacing Modifier Letters (02B0–02FF)
  • Combining Diacritical Marks (0300–036F)
  • Greek and Coptic (0370–03FF)
  • Cyrillic (0400–04FF)
  • Cyrillic Supplement (0500–052F)
  • Armenian (0530–058F)
  • Hebrew (0590–05FF)
  • Arabic (0600–06FF)
  • Syriac (0700–074F)
  • Arabic Supplement (0750–077F)
  • Thaana (0780–07BF)
  • Devanagari (0900–097F)
  • Bengali (0980–09FF)
  • Gurmukhi (0A00–0A7F)
  • Gujarati (0A80–0AFF)
  • Oriya (0B00–0B7F)
  • Tamil (0B80–0BFF)
  • Telugu (0C00–0C7F)
  • Kannada (0C80–0CFF)
  • Malayalam (0D00–0D7F)
  • Sinhala (0D80–0DFF)
  • Thai (0E00–0E7F)
  • Lao (0E80–0EFF)
  • Tibetan (0F00–0FFF)
  • Burmese (1000–109F)
  • Georgian (10A0–10FF)
  • Hangul Jamo (1100–11FF)
  • Ethiopic (1200–137F)
  • Ethiopic Supplement (1380–139F)
  • Cherokee (13A0–13FF)
  • Unified Canadian Aboriginal Syllabics (1400–167F)
  • Ogham (1680–169F)
  • Runic (16A0–16FF)
  • Tagalog (1700–171F)
  • Hanunoo (1720–173F)
  • Buhid (1740–175F)
  • Tagbanwa (1760–177F)
  • Khmer (1780–17FF)
  • Mongolian (1800–18AF)
  • Limbu (1900–194F)
  • Tai Le (1950–197F)
  • New Tai Lue (1980–19DF)
  • Khmer Symbols (19E0–19FF)
  • Buginese (1A00–1A1F)
  • Phonetic Extensions (1D00–1D7F)
  • Phonetic Extensions Supplement (1D80–1DBF)
  • Combining Diacritical Marks Supplement (1DC0–1DFF)
  • Latin Extended Additional (1E00–1EFF)
  • Greek Extended (1F00–1FFF)
  • General Punctuation (2000–206F)
  • Superscripts and Subscripts (2070–209F)
  • Currency Symbols (20A0–20CF)
  • Combining Diacritical Marks for Symbols (20D0–20FF)
  • Letterlike Symbols (2100–214F)
  • Number Forms (2150–218F)
  • Arrows (2190–21FF)
  • Mathematical Operators (2200–22FF)
  • Miscellaneous Technical (2300–23FF)
  • Control Pictures (2400–243F)
  • Optical Character Recognition (2440–245F)
  • Enclosed Alphanumerics (2460–24FF)
  • Box Drawing (2500–257F)
  • Block Elements (2580–259F)
  • Geometric Shapes (25A0–25FF)
  • Miscellaneous Symbols (2600–26FF)
  • Dingbats (2700–27BF)
  • Miscellaneous Mathematical Symbols-A (27C0–27EF)
  • Supplemental Arrows-A (27F0–27FF)
  • Braille Patterns (2800–28FF)
  • Supplemental Arrows-B (2900–297F)
  • Miscellaneous Mathematical Symbols-B (2980–29FF)
  • Supplemental Mathematical Operators (2A00–2AFF)
  • Miscellaneous Symbols and Arrows (2B00–2BFF)
  • Glagolitic (2C00–2C5F)
  • Coptic (2C80–2CFF)
  • Georgian Supplement (2D00–2D2F)
  • Tifinagh (2D30–2D7F)
  • Ethiopic Extended (2D80–2DDF)
  • Supplemental Punctuation (2E00–2E7F)
  • CJK Radicals Supplement (2E80–2EFF)
  • Kangxi Radicals (2F00–2FDF)
  • Ideographic Description Characters (2FF0–2FFF)
  • CJK Symbols and Punctuation (3000–303F)
  • Hiragana (3040–309F)
  • Katakana (30A0–30FF)
  • Bopomofo (3100–312F)
  • Hangul Compatibility Jamo (3130–318F)
  • Kanbun (3190–319F)
  • Bopomofo Extended (31A0–31BF)
  • CJK Strokes (31C0–31EF)
  • Katakana Phonetic Extensions (31F0–31FF)
  • Enclosed CJK Letters and Months (3200–32FF)
  • CJK Compatibility (3300–33FF)
  • CJK Unified Ideographs Extension A (3400–4DBF)
  • Yijing Hexagram Symbols (4DC0–4DFF)
  • CJK Unified Ideographs (4E00–9FFF)
  • Yi Syllables (A000–A48F)
  • Yi Radicals (A490–A4CF)
  • Modifier Tone Letters (A700–A71F)
  • Syloti Nagri (A800–A82F)
  • Hangul Syllables (AC00–D7AF)
  • High Surrogates (D800–DB7F)
  • High Private Use Surrogates (DB80–DBFF)
  • Low Surrogates (DC00–DFFF)
  • Private Use Area (E000–F8FF)
  • CJK Compatibility Ideographs (F900–FAFF)
  • Alphabetic Presentation Forms (FB00–FB4F)
  • Arabic Presentation Forms-A (FB50–FDFF)
  • Variation Selectors (FE00–FE0F)
  • Vertical Forms (FE10–FE1F)
  • Combining Half Marks (FE20–FE2F)
  • CJK Compatibility Forms (FE30–FE4F)
  • Small Form Variants (FE50–FE6F)
  • Arabic Presentation Forms-B (FE70–FEFF)
  • Halfwidth and Fullwidth Forms (FF00–FFEF)
  • Specials (FFF0–FFFF)

Several scripts are expected to be included in the next revision of Unicode:

Several other scripts are proposed for inclusion in the BMP, including:

Supplementary Multilingual Plane

Plane 1, the Supplementary Multilingual Plane, (SMP) is mostly used for historic scripts such as Linear B, but is also used for musical and mathematical symbols.

As of Unicode 4.1, Plane One includes the following scripts:

  • Linear B Syllabary (10000–1007F)
  • Linear B Ideograms (10080–100FF)
  • Aegean Numbers (10100–1013F)
  • Ancient Greek Numbers (10140–1018F)
  • Old Italic (10300–1032F)
  • Gothic (10330–1034F)
  • Ugaritic (10380–1039F)
  • Old Persian (103A0–103DF)
  • Deseret (10400–1044F)
  • Shavian (10450–1047F)
  • Osmanya (10480–104AF)
  • Cypriot Syllabary (10800–1083F)
  • Kharoshthi (10A00–10A5F)
  • Byzantine Musical Symbols (1D000–1D0FF)
  • Musical Symbols (1D100–1D1FF)
  • Ancient Greek Musical Notation (1D200–1D24F)
  • Tai Xuan Jing Symbols (1D300–1D35F)
  • Mathematical Alphanumeric Symbols (1D400–1D7FF)

Several scripts are expected to be included in the next revision of Unicode:

Many other scripts are proposed for inclusion in Plane One, including:

Private Use Area

A Private Use Area is one of several ranges which are reserved for private use. For this range, the Unicode standard does not specify any characters.

The Basic Multilingual Plane includes a Private Use Area in the range U+E000–U+F8FF (57344–63743), and Plane Fifteen (U+F0000–U+FFFFF) and Plane Sixteen (U+100000–10FFFF) are completely reserved for private use as well.

The use of the Private Use Area was a concept inherited from certain Asian encoding systems. These systems used private use areas to encode Japanese Gaiji (rare personal name characters) in application specific ways. Similarily the ConScript Unicode Registry aims to coordinate the mapping of scripts not yet encoded in or rejected by Unicode in the PUAs.

Other planes

Plane 2, the Supplementary Ideographic Plane (SIP), is used for about 40,000 rare Chinese characters that are mostly historic, although there are some modern ones. Plane 14, the Supplementary Special-purpose Plane (SSP), currently contains some non-recommended language tag characters and some variation selection characters.

Last updated: 05-13-2005 07:56:04