A Bit of History
In the beginning there was just Extended Binary Coded Decimal Interchange Code (EBCDIC) and American Standard Code for Information Interchange (ASCII). Introduced by IBM in the early 1960s, EBCDIC is an 8-bit code descended from the 6-bit BCD codes used with punched cards. Today, only old-school IBM engineers really know what EBCDIC was.
First published in 1963, ASCII was federally mandated by the U.S. government in 1968. Because it was an American standard that used just 7 bits, ASCII only included the basic English alphabet. Initially there was no support for characters such as æ, ø, å, č, €, or anything else that was foreign to America. As computers spread to non-English-speaking countries, other nations implemented their own national variants of ASCII, standardized as ISO/IEC 646. Norway got æ and å, but included the ¥ symbol instead of Ø.
Eventually 8 bits gained acceptance, and that meant another 128 characters became available! Different nations could then decide what to use the “upper” 128 characters for. Norway chose Ø while the Czechs chose Č and Ř. However, text could now only be interpreted correctly if you knew something that was not part of the text: the Code Page (CP) number. The technical term for this is “out-of-band data”, in other words, things you just have to know. This is still with us today.
Code Page Grief
The standard IBM PC code page, called CP 437, was a seemingly random collection of box drawings, currency symbols, and assorted accented characters that did not completely cover anything. For example, in CP 437, character 0xB5 (B5 hexadecimal) is ╡, whereas in CP 852 it is Á, and in CP 869 it is Κ (Greek kappa).
There is no way of putting the code page number into the text, so given character 0xB5 you have no idea what to display. Essentially, what gets displayed depends on the code page your machine assumes: on a Czech PC you will get the Á, but in Greece you will get the Κ. Some languages cannot be expressed with just ≈200 letters; therefore double-byte character sets (DBCS) were invented.
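The ambiguity of that single byte is easy to reproduce. The following is a minimal sketch using the code page codecs that ship with Python; the helper name decode_with is made up for this illustration and is not part of any SuperOffice code.

```python
# Interpret the same byte, 0xB5, under three different IBM PC code pages.
RAW = bytes([0xB5])


def decode_with(code_page: str, raw: bytes = RAW) -> str:
    """Decode raw bytes using the named code page (a Python codec name)."""
    return raw.decode(code_page)


if __name__ == "__main__":
    for code_page in ("cp437", "cp852", "cp869"):
        char = decode_with(code_page)
        # Print the Unicode code point next to the glyph it maps to.
        print(f"{code_page}: U+{ord(char):04X} {char}")
```

The byte never changes; only the reader's assumption about the code page does.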
DBCS to the Rescue
ISO 2022 defines how to use 2 x 7 bits to encode a wider range of characters. An “escape sequence” of one or more bytes effectively says “what follows should be interpreted in a particular way”. This is really a way of taking out-of-band data and sneaking it into the text.
Many variants of escape sequences exist that signify different interpretations. Japanese, Chinese, and Korean each have their own variants. There are even different versions of Japanese; for example, the mobile phone operator KDDI defines hundreds of “smiley” sequences.
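To make the escape-sequence idea concrete, here is a small sketch that uses Python's built-in iso2022_jp codec, one of the Japanese ISO 2022 variants. The function name encode_iso2022_jp is an assumption for this illustration only.

```python
ESC = b"\x1b"  # the escape byte that introduces each escape sequence


def encode_iso2022_jp(text: str) -> bytes:
    """Encode text with Python's ISO-2022-JP codec."""
    return text.encode("iso2022_jp")


if __name__ == "__main__":
    encoded = encode_iso2022_jp("Aあ")
    # 'A' is plain ASCII. ESC $ B switches to the double-byte JIS X 0208 set,
    # the two bytes that follow encode 'あ', and ESC ( B switches back to ASCII.
    print(encoded)
    # Every byte stays within 7 bits, as the 2 x 7 bit scheme requires.
    print(all(byte < 0x80 for byte in encoded))
```

Note that the two bytes encoding the Japanese character are themselves ordinary ASCII values; only the preceding escape sequence tells the reader how to interpret them.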
Shift JIS is fully backwards compatible with the legacy JIS X 0201 single-byte encoding, meaning it supports half-width katakana and that any valid JIS X 0201 string is also a valid Shift JIS string.
However, Shift JIS only guarantees that the first byte of a double-byte character will be in the upper ASCII range; the value of the second byte can be either high or low. This makes reliable Shift JIS detection difficult.
On the other hand, the competing 8-bit format EUC-JP, which does not support single-byte halfwidth katakana, allows for a much cleaner and more direct conversion to and from JIS X 0208 code points, as all upper-ASCII bytes are part of a double-byte character and all lower-ASCII bytes are part of a single-byte character.
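The difference between the two formats can be seen by encoding the same characters both ways. This is a minimal sketch using Python's shift_jis and euc_jp codecs; the helper names to_hex and has_low_byte are invented for the example.

```python
def to_hex(text: str, encoding: str) -> str:
    """Return the encoded bytes of text as space-separated hex pairs."""
    return " ".join(f"{byte:02x}" for byte in text.encode(encoding))


def has_low_byte(text: str, encoding: str) -> bool:
    """True if any encoded byte falls in the 7-bit (lower ASCII) range."""
    return any(byte < 0x80 for byte in text.encode(encoding))


if __name__ == "__main__":
    # Full-width katakana 'ソ': in Shift JIS its second byte is 0x5c,
    # the same value as the ASCII backslash.
    print("shift_jis:", to_hex("ソ", "shift_jis"))
    print("euc_jp:   ", to_hex("ソ", "euc_jp"))
    # Half-width katakana 'ｱ': one byte in Shift JIS, two bytes in EUC-JP.
    print("shift_jis:", to_hex("ｱ", "shift_jis"))
    print("euc_jp:   ", to_hex("ｱ", "euc_jp"))
```

A low second byte is exactly what makes Shift JIS text hard to scan byte by byte: a naive search for an ASCII character can match the tail of a double-byte character.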
Is this all relevant?
Unfortunately, yes. This is all relevant because you will find all of this in a customer database.
The various mixes and multiple interpretations become interesting in two cases:
1. When a single database is used by clients that have different code page settings, and therefore different interpretations of the data.
2. When you want to convert a system to Unicode.
Unicode is a standard way of identifying printable and non-printable characters. Each one is assigned a Code Point, written U+nnnn, where nnnn is a hexadecimal number of four to six digits. The code space runs from U+0000 to U+10FFFF and can potentially contain 1 114 112 characters. There are 2684 characters reserved for designation within a particular block, 98893 graphical characters, and 435 special-purpose characters for control, formatting, and glyph/character variation selection. Additionally, there are 872582 unassigned characters waiting to become something more.
The whole Unicode universe is divided into 17 character planes. The Basic Multilingual Plane (BMP), plane 0, contains the characters for nearly all writing systems in current use. The list of scripts included in the BMP is too long to show here, but the point is that there should be enough. Why this is important becomes clearer in the Encodings section. The Supplementary Multilingual Plane (SMP), plane 1, contains mostly historical scripts such as Gothic, but also includes musical and mathematical symbols. SuperOffice only supports the BMP (plane 0).
Although Unicode identifies each character, each character also needs a way to be represented for storage and transmission. Identification and representation are not the same concept. Encodings define how Unicode characters are actually represented in the computer. As you may know, there are many encodings. The important ones are currently UTF-8 and UTF-16. The interpretation of a Unicode character, such as what glyph should be displayed, depends on the context. The context is present in the text, not as out-of-band data.
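The plane arithmetic is simple enough to sketch. Each plane holds 65 536 code points, so the plane number is just the code point divided by 0x10000. The names plane_of and is_bmp below are illustrative and not taken from any SuperOffice API.

```python
PLANE_SIZE = 0x10000  # 65 536 code points per plane
PLANE_COUNT = 17      # planes 0 through 16


def plane_of(char: str) -> int:
    """Return the Unicode plane number of a single character."""
    return ord(char) // PLANE_SIZE


def is_bmp(char: str) -> bool:
    """True if the character lives in the Basic Multilingual Plane."""
    return plane_of(char) == 0


if __name__ == "__main__":
    # 17 planes of 65 536 code points give the full code space.
    print(PLANE_SIZE * PLANE_COUNT)
    # 'Ø' and 'Č' are BMP characters; GOTHIC LETTER AHSA (U+10330) is in plane 1.
    for char in ("Ø", "Č", "\U00010330"):
        print(f"U+{ord(char):04X} plane {plane_of(char)} BMP={is_bmp(char)}")
```

A check like is_bmp is one way an application could reject input outside plane 0 before it reaches storage.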
UTF-16 is the encoding used internally by .NET and by Unicode C++ programs on Windows. It uses 2 bytes, a total of 16 bits, per character in the Basic Multilingual Plane. Non-BMP characters are represented by a surrogate pair that consists of two 16-bit words.
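A surrogate pair is computed by subtracting 0x10000 from the code point and splitting the remaining 20 bits into two halves. The sketch below follows that standard UTF-16 rule and cross-checks it against Python's own encoder; the function names are chosen for this illustration.

```python
def surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a non-BMP code point into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only non-BMP code points need a surrogate pair")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def utf16_units(text: str) -> list[int]:
    """Return the 16-bit code units Python produces for text in UTF-16."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


if __name__ == "__main__":
    print([hex(unit) for unit in utf16_units("Ø")])            # one unit
    print([hex(unit) for unit in utf16_units("\U00010330")])   # two units
    print([hex(unit) for unit in surrogate_pair(0x10330)])
```

Code that counts 16-bit units as characters will see two "characters" for every non-BMP character, which is a common source of truncation bugs.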
SuperOffice does not support surrogate pairs. SuperOffice experience shows that database support for surrogate pairs is sketchy and unreliable at best. A further drawback of UTF-16 is that it is wasteful for Latin-character text, because almost 50% of the bytes are 0.
UTF-8 represents BMP characters with 1 to 3 bytes and non-BMP characters with 4 bytes. Simple Latin text uses 1 byte per character, and 2 bytes for accented characters. UTF-8 is the same as ASCII for non-accented Latin characters, and is the preferred format for multilingual text on the Internet today (Web, Mail, etc.).
SuperOffice only uses UTF- for files and mail input/output. It can, however, be used as a storage format by some of the databases we support.
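The byte counts are easy to verify with Python's built-in codecs. This is a small sketch; utf8_length and zero_byte_ratio are names invented for the example.

```python
def utf8_length(char: str) -> int:
    """Number of bytes UTF-8 needs for one character."""
    return len(char.encode("utf-8"))


def zero_byte_ratio(text: str) -> float:
    """Fraction of zero bytes when text is stored as UTF-16."""
    data = text.encode("utf-16-le")
    return data.count(0) / len(data)


if __name__ == "__main__":
    # Plain Latin, accented Latin, a BMP symbol, and a non-BMP Gothic letter.
    for char in ("A", "å", "€", "\U00010330"):
        print(f"U+{ord(char):04X}: {utf8_length(char)} byte(s) in UTF-8")
    # Plain ASCII text is byte-for-byte identical in UTF-8.
    print("Hello".encode("utf-8") == "Hello".encode("ascii"))
    # The same text in UTF-16 is half zero bytes.
    print(zero_byte_ratio("Hello"))
```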
What about SuperOffice?
The Windows World, to SuperOffice 6.1
SuperOffice C++ code, including the ODBC driver, is written using ANSI encoding. Most SuperOffice 6.1 Modules and Applications are written in VB6 code, which is ANSI too.
Data stored in the database is ANSI encoded. Any client will interpret it according to its current code page settings, and what you actually see on-screen depends on this. Eastern European characters were displayed correctly for the following client languages: Czech (CZ), Polish (PL), and a special Czech and Polish English-language DLL.
The COM Application Programming Interface (API), however, is Unicode. Conversions between ANSI and Unicode are done automatically according to the current code page of the client computer.
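What happens when two clients with different code page settings share one ANSI database can be simulated in a few lines. This is only an illustration of the effect, not SuperOffice code; it assumes the Czech client uses Windows code page 1250 and the Western European client uses 1252, and the function names are made up.

```python
def store_as_ansi(text: str, writer_code_page: str) -> bytes:
    """Simulate a client converting Unicode to ANSI bytes before storing."""
    return text.encode(writer_code_page)


def read_as_ansi(raw: bytes, reader_code_page: str) -> str:
    """Simulate a client converting stored ANSI bytes back to Unicode."""
    return raw.decode(reader_code_page)


if __name__ == "__main__":
    stored = store_as_ansi("ČŘ", "cp1250")   # written on a Czech client
    print(read_as_ansi(stored, "cp1250"))    # read on a Czech client
    print(read_as_ansi(stored, "cp1252"))    # read on a Western European client
```

The stored bytes are identical in both reads; only the reader's code page differs, which is the out-of-band data problem described earlier.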
The Windows World, from SuperOffice 6.2
All SuperOffice C++ code is Unicode – UTF-16 internally. The ODBC driver is also Unicode, but the data will get converted according to the current code page settings on the computer. SuperOffice is not responsible for the conversion.
All SuperOffice modules and add-on applications are now VB.NET, which is Unicode. The COM API is unchanged and still Unicode.
When opting for the Unicode database, no conversions will ever occur. Any character you enter will be shown exactly the same on all clients, anywhere in the world.
Czech and Polish databases must convert to Unicode.
The Web World
SuperOffice web and NetServer, both written using .NET technologies, have always supported Unicode. With an ANSI database, the database driver is responsible for doing the conversion.
When the database is Unicode, just as in the SuperOffice Windows world, what you enter anywhere is what you see everywhere.
To see Unicode characters as they were meant to be displayed, the client machines must of course have the proper fonts to display Unicode text, such as East Asian support for Chinese text. If they do not, the proper fonts will have to be installed on each machine.
Unicode support for user names and passwords is very patchy across different databases. SuperOffice has chosen to restrict names and passwords to those that can survive conversion from Unicode to ANSI and back again. This is a reasonably safe rule, but some customers might not agree.
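A round-trip rule of this kind can be sketched as follows. This is an assumption about how such a check could look, not the actual SuperOffice implementation; the function name and the default code page (Windows 1252) are hypothetical.

```python
def survives_ansi_round_trip(text: str, code_page: str = "cp1252") -> bool:
    """True if text is unchanged after Unicode -> ANSI -> Unicode conversion."""
    try:
        return text.encode(code_page).decode(code_page) == text
    except UnicodeError:
        # At least one character has no representation in this code page.
        return False


if __name__ == "__main__":
    for name in ("admin", "Bjørn", "Jiří", "日本"):
        print(name, survives_ansi_round_trip(name))
    # The same name can pass on one code page and fail on another.
    print("Jiří", survives_ansi_round_trip("Jiří", "cp1250"))
```

This sketch encodes strictly and treats any unmappable character as a failure. A real conversion layer may instead substitute a similar-looking character, in which case the equality comparison is what catches the change.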
Encrypted license information is stored differently in ANSI than in Unicode. Because each character takes twice as many bytes in Unicode as in ANSI, the stored bytes differ even if the characters used are only between A and Z. As a result, the encryption methods will give different answers. SuperOffice now generates two keycodes per license, one for an ANSI database and another for Unicode. SoAdmin is helpful in this respect, as it will tell you which license encoding it expects.
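The effect can be demonstrated with any byte-oriented algorithm. The sketch below uses SHA-256 purely as a stand-in, since the actual SuperOffice license encryption method is not described here; the license text, the function names, and the choice of Windows code page 1252 for ANSI are all assumptions for illustration.

```python
import hashlib


def stored_bytes(text: str, encoding: str) -> bytes:
    """Bytes that would be stored for text in the given encoding."""
    return text.encode(encoding)


def digest(text: str, encoding: str) -> str:
    """SHA-256 of the stored bytes (a stand-in for the real algorithm)."""
    return hashlib.sha256(stored_bytes(text, encoding)).hexdigest()


if __name__ == "__main__":
    license_text = "ABC"  # hypothetical license data, letters A to Z only
    print(stored_bytes(license_text, "cp1252"))     # 3 bytes in ANSI
    print(stored_bytes(license_text, "utf-16-le"))  # 6 bytes in UTF-16
    # Same characters, different bytes, therefore different results.
    print(digest(license_text, "cp1252") == digest(license_text, "utf-16-le"))
```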