Character set encoding and rendering — ASCII/Unicode and code page
Character encoding: why?
If you use anything other than the most basic English text, people may not be able to read the content you create unless you say what character encoding you used.
For example, you may intend the text to look like this:
[image: the text as intended]
but it may actually display like this:
[image: the same text displayed as garbled characters]
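The effect can be reproduced in a few lines of Python. This is only a sketch: the sample word and the helper name `misread` are made up for illustration, and Windows-1252 stands in for whatever encoding a reader's software might wrongly assume.

```python
def misread(text: str, written_as: str, read_as: str) -> str:
    """Encode text with one encoding, then decode the bytes with another."""
    return text.encode(written_as).decode(read_as, errors="replace")


intended = "café"  # hypothetical sample text

print(intended)                              # what the author wrote
print(misread(intended, "utf-8", "cp1252"))  # cafÃ© (wrong encoding assumed)
print(misread(intended, "utf-8", "utf-8"))   # café (correct encoding declared)
```

The bytes are identical in both cases; only the assumption about their encoding changes what is displayed.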
What is character encoding?
Words and sentences in text are created from characters. Examples of characters include the Latin letter á, the Chinese ideograph 請, and the Devanagari character ह. For a computer to refer to characters in an unambiguous way, each character is associated with a number, called a code point. The characters needed for a specific purpose (typically to represent a language) are grouped into a character set.
Character encoding (or character set encoding) is the scheme that maps each character in a character set to a number, and that number to a sequence of bytes, so that text can be stored and transmitted digitally.
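As a small illustration, Python's built-in `ord` returns the Unicode code point of a character, and `str.encode` shows the bytes that one particular encoding (UTF-8 here, chosen as an example) uses for it. The helper name `describe` is hypothetical.

```python
def describe(ch: str) -> str:
    """Return the code point of ch and its byte sequence in UTF-8."""
    code_point = ord(ch)          # the number assigned to the character
    utf8 = ch.encode("utf-8")     # how UTF-8 stores that number as bytes
    return f"U+{code_point:04X} -> {utf8.hex(' ')}"


# The three example characters from the paragraph above.
for ch in "á請ह":
    print(ch, describe(ch))
```

The code point is the same whichever encoding is used; only the byte sequence differs between encodings.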
Font files and glyph mapping with code points:
A font is a collection of glyph definitions, i.e. definitions of the shapes used to display characters. Once your browser or app has worked out what characters it is dealing with, it will then look in the font for glyphs it can use to display or print those characters. (Of course, if the encoding information was wrong, it will be looking up glyphs for the wrong characters.)
ASCII
Most of us use ASCII by default without knowing exactly what it means or how it relates to displaying encoded characters. The American Standard Code for Information Interchange (ASCII) is a character encoding system based on the English alphabet (the digits 0–9, the letters a–z and A–Z, and some basic punctuation symbols).
Originally, ASCII was a 7-bit code representing 128 characters (0–127). The first 32 (0–31), together with 127 (DEL), are control characters for devices such as printers and telegraphic devices; they are also called non-printable characters. ASCII was later extended to 8-bit codes supporting 256 characters. The additional characters (128–255), mainly language-specific letters, various symbols, and box-drawing characters, are referred to as extended ASCII.
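The 7-bit ranges can be checked with a short Python sketch. The helper `fits_in_ascii` is a made-up name for illustration; it simply tries to encode text with Python's strict `ascii` codec.

```python
# Code points 0-31 and 127 (DEL) are control characters; 32-126 are printable.
control = [c for c in range(128) if c < 32 or c == 127]
printable = [c for c in range(128) if 32 <= c <= 126]
print(len(control), len(printable))  # 33 95


def fits_in_ascii(text: str) -> bool:
    """Return True if every character has a code point below 128."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


print(fits_in_ascii("Hello, world!"))  # True
print(fits_in_ascii("á"))              # False: needs an 8-bit or Unicode encoding
```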
Code page
A ‘code page’ is a mapping of values to the characters of a character set (for encoding a particular language). It started with IBM assigning unique numbers to characters in the EBCDIC encoding scheme for mainframe systems; later, each system vendor used its own scheme for character encoding.
We can also view a ‘code page’ as the graphical glyph set used for rendering an encoded character. These code pages were originally embedded directly in the text-mode hardware of the graphics adapters used with the IBM PC and its clones.
Code page 437 is the character set of the original IBM PC (personal computer). It includes ASCII codes 32–126, extended codes for accented letters (diacritics), some Greek letters, icons, and drawing symbols. Most code pages (for different languages) are supersets of ASCII (discussed in the previous post). Also, some systems that carry 7-bit ASCII in 8-bit bytes use the top bit as a parity bit in network data transmission.
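To see how the same byte value means different things under different code pages, here is a minimal Python sketch using the standard `cp437`, `cp1252`, and `latin-1` codecs. The helper name `decode_with` and the chosen byte values are illustrative assumptions.

```python
def decode_with(byte_values: list[int], code_page: str) -> str:
    """Interpret raw byte values using the given code page."""
    return bytes(byte_values).decode(code_page)


# One byte from the extended range (128-255), three interpretations.
for code_page in ("cp437", "cp1252", "latin-1"):
    print(code_page, decode_with([0xE9], code_page))

# Some of the box-drawing symbols in code page 437.
print(decode_with([0xC9, 0xCD, 0xCD, 0xBB], "cp437"))  # ╔══╗

# The printable ASCII range decodes identically, since cp437 is a superset.
ascii_range = list(range(32, 127))
print(decode_with(ascii_range, "cp437") == decode_with(ascii_range, "ascii"))  # True
```

Byte 0xE9 is the Greek letter Θ in code page 437 but é in Windows-1252 and Latin-1, which is why text written under one code page can look wrong under another.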
A natural question is: is there a standard way of encoding that covers all the code pages?
Unicode is the answer for this. Unicode is an effort to include all characters from all code pages into a single character enumeration that can be used with a number of encoding schemes. There are standard translations for converting code pages to Unicode.
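A rough sketch of such a translation in Python: decoding bytes with a code page's codec yields Unicode text, which can then be written out with any Unicode encoding scheme. The helper name `to_utf8` is hypothetical; the codecs are the ones shipped with Python.

```python
def to_utf8(data: bytes, code_page: str) -> bytes:
    """Translate code-page bytes to Unicode text, then encode as UTF-8."""
    return data.decode(code_page).encode("utf-8")


# The letter é is byte 0x82 in code page 437 but 0xE9 in Windows-1252.
# After translation, both become the same Unicode character (U+00E9).
print(to_utf8(b"\x82", "cp437").hex(" "))   # c3 a9
print(to_utf8(b"\xe9", "cp1252").hex(" "))  # c3 a9

# One code point, several Unicode encoding schemes.
for scheme in ("utf-8", "utf-16-le", "utf-32-le"):
    print(scheme, "é".encode(scheme).hex(" "))
```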
Unicode and Non-Unicode Encoding of Indic Languages
To view text in any language's encoding scheme properly in a browser, Notepad, or any other text-processing application, the following are required:
- Support from the operating system or application, which must understand the encoding scheme
- Font files for the OS or application of your choice
Before Unicode was standardized, many non-English language communities created and used their own encoding schemes. With the introduction of Unicode, all this changed. Almost all modern operating systems and browsers support Unicode, and many of them ship with font packs for all the major international languages. Unicode for Tamil and other major Indian languages is supported in almost all available operating systems and browsers.
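As an illustration, the sketch below lists the Unicode code points behind a Tamil word using Python's standard `unicodedata` module. The sample word and the helper name `code_point_names` are assumptions made for this example. Whether the word displays correctly in your terminal still depends on the two requirements listed above.

```python
import unicodedata


def code_point_names(text: str) -> list[tuple[str, str]]:
    """Return (code point, official Unicode name) for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# Sample word: "Tamil" written in Tamil script, given as escapes so the
# source file itself stays ASCII-only.
word = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"

print(word)
for code_point, name in code_point_names(word):
    print(code_point, name)
```

Every character falls in the Tamil block (U+0B80 to U+0BFF), so any Unicode-aware system with a Tamil font can render it without a language-specific encoding scheme.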