Unicode

Article

May 25, 2022

Unicode is an encoding system that assigns a unique number to each character used for writing text, independently of the language, the computing platform and the program used. It is compiled, maintained and promoted by the Unicode Consortium, an international consortium of companies with an interest in the interoperable processing of text in different languages.

History

Origin and development

Unicode was created to overcome the limitations of traditional character encodings. For example, although the characters defined in ISO 8859-1 are widely used in different countries, incompatibilities frequently arise between them. Many traditional encodings share a common problem: they let a computer handle a bilingual environment (usually Latin letters plus the local script), but they cannot support a truly multilingual environment in which several scripts are mixed in the same text.

Unicode encodes characters that have different written forms, such as "ɑ / a", "強 / 强" and "戶 / 户 / 戸". However, the unification of variant forms of Han characters has been controversial; for details, see CJK Unified Ideographs.

For text processing, Unicode assigns a unique code (an integer) to each character rather than to each glyph. In other words, Unicode handles characters abstractly (as numbers) and leaves the visual rendering (size, appearance, shape, style, etc.) to other software, such as a web browser or word processor.

At present, almost all computer systems support the basic Latin alphabet, and each supports several other encodings as well. For compatibility with them, the first 256 code points of Unicode are reserved for the characters defined by ISO 8859-1, so that existing Western European text can be converted without special handling; in addition, many identical characters are deliberately encoded more than once at different code points, so that text in older, more complicated encodings can be converted to Unicode and back without losing any information. For example, the Fullwidth Forms block contains full-width copies of the basic Latin letters; in Chinese, Japanese and Korean fonts these characters are rendered at full width rather than the usual half width, which matters for vertical and monospaced text.

A Unicode character is usually written as "U+" followed by hexadecimal digits. Inside the Basic Multilingual Plane (BMP, also known as "plane zero" or plane 0), four hexadecimal digits are used (2 bytes, 16 bits in total, as in U+4AE0; the BMP holds over 60,000 characters); characters outside plane zero require five or six digits. Older versions of the Unicode standard used similar notation with minor differences: in Unicode 3.0, "U-" was followed by eight digits, while "U+" had to be followed by exactly four digits.
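The points above can be made concrete with a short Python sketch (the language choice and the helper name u_plus are illustrative only; the article itself shows no code). It prints code points in the usual "U+" notation and demonstrates lossless round-tripping of Western European text through ISO 8859-1:

    # A minimal sketch of code points, "U+" notation and lossless
    # round-tripping between Unicode and a legacy encoding (ISO 8859-1).

    def u_plus(ch: str) -> str:
        """Format a character's code point in U+ notation:
        at least four hex digits, five or six outside the BMP."""
        return f"U+{ord(ch):04X}"

    print(u_plus("A"))    # U+0041  - basic Latin
    print(u_plus("é"))    # U+00E9  - the first 256 code points mirror ISO 8859-1
    print(u_plus("戸"))   # U+6238  - a CJK ideograph inside the BMP (plane 0)
    print(u_plus("Ａ"))   # U+FF21  - fullwidth Latin capital A, a compatibility character
    print(u_plus("😀"))   # U+1F600 - outside plane 0, needs five hex digits

    # Western European text encoded as ISO 8859-1 converts to Unicode
    # and back without losing information.
    text = "Ça coûte très cher"
    assert text.encode("latin-1").decode("latin-1") == text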

Code structure

Unicode was originally conceived as a 16-bit encoding (four hexadecimal digits), giving the ability to encode 65,536 (2^16) characters. This was believed to be sufficient to represent the characters used in all the written languages of the world. Now, however, the Unicode standard, which is kept aligned with the ISO/IEC 10646 standard, provides an encoding of up to 21 bits and supports a repertoire of 1,114,112 numeric codes, enough to represent approximately one million characters. This appears to be sufficient to cover all foreseeable requirements.
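A small Python check makes these sizes concrete (a sketch only; it assumes a standard CPython build, whose code-point range matches the standard's):

    import sys
    import unicodedata

    # The 21-bit code space aligned with ISO/IEC 10646:
    # 17 planes of 65,536 positions each.
    PLANES = 17
    PLANE_SIZE = 2 ** 16          # 65,536 positions: the original 16-bit design
    CODE_SPACE = PLANES * PLANE_SIZE

    print(PLANE_SIZE)             # 65536
    print(CODE_SPACE)             # 1114112 code points, U+0000 through U+10FFFF
    print(hex(sys.maxunicode))    # 0x10ffff, Python's highest code point agrees

    # Classifying a few characters by plane; the last one lies outside the BMP.
    for ch in ("A", "€", "𝔘"):
        plane = ord(ch) // PLANE_SIZE
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: plane {plane}")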