Character display problems?
Multi-lingual web pages and Unicode
Are you getting ???? or Yνδύρνθι or other mojibake instead of the correct text for some languages? It's probably because your computer system can't display Unicode correctly. The good news is that most Unicode display problems can be fixed.
How do I fix Unicode display problems on my computer?
To display text in many different alphabets on one web page (e.g. Languages A-Z), we use Unicode, even though Unicode can create display problems for some computer systems. This web page offers solutions for those problems.
It may be that to "see" everything correctly on our Unicode pages, you only need to download and install Internet Explorer (unless you already have it) and at most, one font. Basically, you need:
- a Unicode compatible operating system (see Assistance: Introduction);
- a Unicode enabled browser (Assistance: Step 1); and
- Unicode-compatible font(s) (Assistance: Step 2);
and then you need to:
- configure your browser (Assistance: Step 3 and Assistance: Step 4).
See also Display Problems? on the Unicode site, and Help: Multilingual support (Indic) on Wikipedia.
Definitions:
What is Unicode? | Encoding | Code | Language Script | Font
- What is Unicode? It is one of several systems (called encodings) that have been developed to manage the display of characters on-screen, but it is the first system that can assign a unique number (code) to every character in each of the world's major languages. (Other systems don't allow for enough characters and they also conflict with one another. That is, two encodings might use the same number for two different characters, or use different numbers for the same character.) Not all computer systems in current use are fully Unicode compatible.
- Encoding: a system of assigning numbers to characters (i.e. letters, punctuation, and mathematical notations) so a computer knows which character to display. Hundreds of different systems (encodings) have been developed and used. Unicode is one of them. Here are examples of how encodings are specified in the head of an html page:
- charset=iso-8859-1 (for Western No.1),
- charset=BIG5 (for Traditional Chinese), and
- charset=utf-8 (for Unicode).
- Code: the number assigned to the character. Problems happen when different encodings use the same code for two different characters, or use different codes for the same character. Synonyms for "code" that are also in use: code position, code number, code value, code element, code set value.
- Language Script: the group of characters used to express a language in writing. Also called the "character set" or "character repertoire" or "alphabet" of a language.
- Font: the font determines the way a character will actually look on the screen (or on a printed page). For instance, this "A" in a sans-serif font looks different than this "A" in a serif font, but it is still the same character. (The "A" and the "A" are known as different "glyphs" of the same character. A font is basically a collection of glyphs. Also note that "A" and "a" are two different characters.)
Most fonts don't come close to containing all possible characters in the world. Instead, they contain ranges (also called "blocks") of characters (e.g. in Unicode, the codes (i.e. numbers) for Arabic characters are found in the range 0600 to 06FF). Unicode currently defines over 100 ranges. For example, the newest Unicode-compatible versions of:
- Arial (with 2792 characters and 3381 glyphs) and
- Times New Roman (with 2790 characters and 3380 glyphs),
contain only 39 ranges, while the:
- Akaash font (409 characters; 642 glyphs), specifically for Bengali,
is also Unicode-compatible, yet contains only 4 ranges: Basic Latin; Latin-1 Supplement; Latin Extended-A; and Bengali.
- NOTE: "language script" and "range" are sometimes synonymous, but some languages require characters from more than one range and even non-contiguous ranges (e.g. Vietnamese, and especially CJK (Chinese-Japanese-Korean). CJK ideographs now encompass at least three ranges in two separate "planes" of Unicode).
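To make the definitions of "encoding" and "code" more concrete, here is a minimal Python sketch. It is only an illustration: the byte value 0xE9, the three legacy encodings and the sample Greek word were chosen for this example and do not come from our pages. It shows one number standing for different characters in different encodings, one character getting different numbers, and how reading text with the wrong encoding produces mojibake.

```python
def read_byte_as(byte_value, encodings):
    """Decode a single byte under several legacy encodings."""
    raw = bytes([byte_value])
    return {name: raw.decode(name) for name in encodings}


# Same code, different characters: the byte 0xE9 in three encodings.
conflict = read_byte_as(0xE9, ["iso-8859-1", "iso-8859-7", "koi8-r"])
for name, char in conflict.items():
    print(f"byte 0xE9 in {name}: {char!r} (Unicode U+{ord(char):04X})")

# Same character, different codes: Greek small alpha in two encodings.
alpha = "\u03b1"
codes = {name: alpha.encode(name) for name in ["iso-8859-7", "utf-8"]}
print(codes)

# Mojibake: Greek text stored as ISO-8859-7 but read as ISO-8859-1.
greek_word = "Ελληνικά"
mojibake = greek_word.encode("iso-8859-7").decode("iso-8859-1")
print(mojibake)

# Reversing the mistake recovers the original text.
repaired = mojibake.encode("iso-8859-1").decode("iso-8859-7")
print(repaired)
```

Unicode avoids the first problem because each character has exactly one number (U+03B1 for alpha), whichever font or browser is used.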
- For more information, see also:

Assistance: Introduction
Because Unicode is a relatively recent development that is not yet in widespread usage, and because surfers use a wide range and combination of operating systems (Windows 95/98/ME/NT/2000/XP/Vista, Mac OS 9/OS X, Linux, etc.), and browsers (Netscape, Internet Explorer, Opera, Mozilla, FireFox, Safari, etc.), not all computer systems are currently fully Unicode compatible.
- Some Unicode support has been included in Mac OS since Mac OS 8.5, but prior to Mac OS X (10) applications made only limited use of it.
- Windows NT/2000/XP/Vista are based on Unicode, and some Unicode support has been included in Microsoft Windows since Windows 95.
If you have display problems with some of the links and/or text on our pages, you can try the steps set out below. My intention is to bring together, in one place, the useful information I found when I was trying to figure out how to fix my own display problems, and to make that information as easy to understand as possible. Do keep in mind, though, that you don't have to understand everything here in order to get the hoped-for results from carrying out the steps. (It may turn out that to "see" everything on our pages correctly, you actually only need to download and install IE (unless you already have it) and at most, one font.) The suggestions I offer come from my experience using the following browsers:
- Netscape 4.79 and 7, Mozilla 1.2.1 and 1.3b, and Internet Explorer (IE) 5.5, with a Windows 98 operating system, and
- FireFox 2, Netscape 7 and 8, Mozilla 1.5 and 1.7, Opera 7.1, 7.2 and 7.5, Safari 3 for Windows, and Internet Explorer (IE) 6 and 7 with a Windows XP operating system,
although I think they could be useful for those with Windows 2000, NT 4 and Vista, and maybe even Windows 95. (Because I only do Windows, the best I can offer those with other operating systems is to send you off-site to Mac OS 9 (Browsers and Fonts), Mac OS X (Browsers and Fonts) and Unix/Linux (Browsers and Fonts) and to Help: Multilingual support (Indic) on Wikipedia, although some of what I say below may be applicable.)
Assistance: Step-by-step
Step 0: You need a Unicode compatible operating system (see Introduction above for information)
Step 1: Selecting a browser
Step 2: Obtaining Unicode compatible fonts
Step 3: Configuring your browser by selecting fonts
Step 4: Configuring your browser by selecting encodings
NOTE: Most encodings are still used somewhere on the web, and these steps can be applied to all encodings, not just Unicode. However, if you are interested in viewing pages in a different encoding, such as Big5 (for Traditional Chinese) for example, in Step 2 you would need to make sure you had Big5-compatible fonts, rather than Unicode-compatible fonts.
Step 1: Selecting a browser
NOTE: Upgrade your browser to the latest version: e.g. IE, FireFox, Netscape, Opera.
The easiest way to reduce the number of potential display problems when surfing the Hot Peach Pages and EarthWords is to install Internet Explorer (IE) 5.5 or higher if you don't already have it. (Although I much prefer to surf with Netscape or FireFox, Internet Explorer 5.5 and higher are much more Unicode friendly, plus, if you've ever compared them, you know there are even some non-Unicode pages that Netscape can't access but IE can.) The bottom line is that, after I went through everything in Steps 1 to 4 on Windows 98 and XP:
- IE 5.5, 6 and 7 display our Unicoded pages perfectly.
Caveat: on my current computer configuration (Win XP and IE 7), IE displays empty rectangles in the pop-up text of the HTML title attribute for Amharic, Chinese, Japanese, Korean, Khmer, Lao and Tigrigna (even though the text for the link itself displays fine), whereas Moz-based browsers and Opera display the title text correctly. To see if you have the same issue, go to Quick definition of DV in 22 languages using IE, and hover over the language links at the top of the page to make the title boxes pop up. Let me know by email if you know how to fix it, or if you don't even have the issue in IE.
- FireFox 2, Netscape 8 and Opera* 8 & 9 seem to display everything correctly except:
- Netscape 7*, Mozilla* 1.2, 1.3, 1.5, 1.7 and Opera* 7.2 & 7.5 all display Arabic and Hebrew correctly right-to-left, but don't produce conjuncts or re-ordering for Indic scripts.
- Netscape 4.79* and Opera 7.1* display Arabic and Hebrew incorrectly left-to-right and don't produce conjuncts or re-ordering for Indic scripts.
*NOTE: After I installed Netscape 8, other browsers such as Netscape 7, Mozilla 1.7 and Opera 9 (and everything earlier) suddenly seemed to display everything correctly except for Khmer. This may mean that you don't actually have to use Netscape 8 to clear up browser display problems; you just need to install it. (It may have been cause and effect, or it may have been completely unrelated.)
- Safari:
- Safari 3 for Windows doesn't seem to support conjuncts or re-ordering for Indic scripts on my Windows XP, and it also doesn't display Arabic text in connected form. Both of these problems may be because I have the Arial and Times New Roman fonts installed by Microsoft Office, as explained below for Safari 2 for Mac OS X.
- Safari 2 for Mac OS X supports conjuncts and re-ordering for Indic scripts (see Wikipedia). There seems to be a problem with displaying connected Arabic text if you have the Arial and Times New Roman fonts installed by Microsoft Office (this may also affect conjuncts and re-ordering for Indic scripts). Regarding the problem with Arabic scripts, see Apple Discussions.
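For readers wondering what "right-to-left" and "conjuncts" mean at the level of the characters themselves, the following Python sketch may help. It uses only the standard unicodedata module; the sample words are my own choices for illustration. It shows that Hebrew and Arabic letters carry a right-to-left direction property, and that a Devanagari conjunct is stored as several separate codes that the browser and font must combine into one shape.

```python
import unicodedata


def describe(text):
    """List each character's code, Unicode name and direction class."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.bidirectional(ch))
        for ch in text
    ]


samples = {
    "Hebrew (shalom)": "\u05e9\u05dc\u05d5\u05dd",
    "Arabic (salam)": "\u0633\u0644\u0627\u0645",
    # KA + VIRAMA + SSA: three codes that should be drawn as one conjunct.
    "Devanagari conjunct": "\u0915\u094d\u0937",
}

for label, text in samples.items():
    print(label)
    for code, name, direction in describe(text):
        # "R" and "AL" mark right-to-left letters; "L" marks left-to-right.
        print(f"  {code}  {name}  [{direction}]")
```

A browser that ignores the direction classes shows Hebrew and Arabic backwards, and one that cannot combine the three Devanagari codes shows them as separate pieces instead of a single conjunct.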
For more information about these and other browsers, go to:
- Alan Wood's:
and
- Wikipedia's:
Step 2: Obtaining Unicode compatible fonts
Make sure you have a Unicode-compatible font for either all the Unicode ranges, or for each of the language scripts you want to be able to display.
NOTE: To see what fonts you already have in your system, look in your Control Panel under Fonts. This will also give you the location of your FONTS folder for when you want to install a new font.
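If you are curious whether a font's ranges are enough for a given piece of text, the check can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four ranges are the ones listed for Akaash in the Definitions section, with their standard Unicode boundaries, and the function names are made up for this example. It tests ranges only; a real font may still lack some characters inside a range it claims.

```python
# Ranges listed for the Akaash font, as (first, last) code values.
AKAASH_RANGES = {
    "Basic Latin": (0x0000, 0x007F),
    "Latin-1 Supplement": (0x0080, 0x00FF),
    "Latin Extended-A": (0x0100, 0x017F),
    "Bengali": (0x0980, 0x09FF),
}


def range_of(char, ranges):
    """Return the name of the range containing char, or None."""
    code = ord(char)
    for name, (first, last) in ranges.items():
        if first <= code <= last:
            return name
    return None


def uncovered(text, ranges):
    """Return the distinct characters of text that fall outside every range."""
    missing = []
    for char in text:
        if range_of(char, ranges) is None and char not in missing:
            missing.append(char)
    return missing


bengali = "\u09ac\u09be\u0982\u09b2\u09be"      # "Bangla" in Bengali script
hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"  # "Hindi" in Devanagari script

print(uncovered(bengali + " text", AKAASH_RANGES))
print([f"U+{ord(ch):04X}" for ch in uncovered(hindi, AKAASH_RANGES)])
```

The Bengali sample falls entirely inside the listed ranges, while every character of the Devanagari sample falls outside them, which is why a second font is needed for Hindi.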
- Easiest: If you have:
- Arial Unicode MS* (with almost 39,000 characters and over 50,000 glyphs in 65 ranges; supplied with Microsoft Office 2000 and later, FrontPage 2000 and later, Office XP and later, and Publisher 2002 and later; see also Description of the Arial Unicode MS) OR
- Code2000* (over 50,000 characters and 60,000 glyphs in 105 ranges; free download, $5 honour-system registration),
you should be OK for most languages on our pages. In other words, to "see" everything on our pages, you only need to download and install IE (unless you already have it) and at most, one font. Easy.
*Note: Code2000 is not recommended for Chinese Simplified or Traditional, or for Japanese, and Arial Unicode MS is not recommended for Lao, but anyone who can read them probably already has appropriate fonts on their computer.
- Extra work: Because fonts designed for just one particular language script often present that script better than fonts that contain several scripts, you may want to download further specific Unicode-compatible fonts for certain languages. On our EarthWords pages, for instance, we code a preference for the following fonts:
but we leave the rest up to the user's choices in Step 3, for which you need at least either:
In other words, to "see" everything on our pages almost exactly the way we intended, you only need to download and install IE (unless you already have it) and at most, four or five fonts. No big deal.
- Maximum effort: Because sites other than ours will prompt for fonts other than those mentioned above, you may want to download a whole whack of fonts. I suggest starting at Alan Wood's Unicode Resources*.
*NOTE: even though this page of Alan's is entitled "Unicode Fonts for Windows computers", it also has links for Mac and Unix.
*ALSO: Raghindi (listed on Alan's page under Devanagari Fonts) has been known to cause a conflict with other fonts on Windows 9x, including Code2000. It seems that many fonts produced for Windows 2000 and up lack the ASCII characters required for backwards compatibility on earlier versions of Windows. Installing such fonts on Win 9x is not recommended, as they have a tendency to "take over" the system. Raghindi is the only one I know about, but apparently there are others.
NOTE: Many times IE has automatically prompted me to download software to properly display a particular web page, whereas Netscape never has.
Step 3: Configuring your browser by selecting fonts
This step reveals a significant difference between Mozilla-based browsers (FireFox, Netscape and Mozilla) on the one hand, and IE (& Opera) on the other:
- with Mozilla browsers, you are only able to choose which font it should use for each of 18 encodings;
- with IE and Opera, you can choose which font it should use for each of 47 different language scripts.
This means that, on a web page that has a meta tag setting for the Unicode encoding (i.e. <meta http-equiv="Content-Type" content="text/html; charset=utf-8">) in the head, and no other specified font tags within the body:
- with Mozilla browsers, all language scripts will be displayed in the same font (i.e. the font you choose for Unicode),
- whereas IE and Opera are able to display each language script in a different font.
Because, as I mentioned earlier, fonts designed for specific language scripts often display that specific script better than a general font for all scripts does, the latter approach is definitely preferable.
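For those who want to check which encoding a page declares, here is a rough Python sketch that reads the charset from a meta tag like the one quoted above. It uses the standard html.parser module; the class and function names are invented for this example, and it only handles the http-equiv form of the tag, so treat it as a starting point rather than a complete tool.

```python
from html.parser import HTMLParser


class CharsetFinder(HTMLParser):
    """Remember the charset from the first Content-Type meta tag seen."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() != "content-type":
            return
        # The content value looks like: text/html; charset=utf-8
        for part in (attributes.get("content") or "").split(";"):
            key, _, value = part.strip().partition("=")
            if key.lower() == "charset" and value:
                self.charset = value.strip().lower()


def declared_charset(html_text):
    """Return the declared charset in lower case, or None if absent."""
    finder = CharsetFinder()
    finder.feed(html_text)
    return finder.charset


page = (
    '<html><head><meta http-equiv="Content-Type" '
    'content="text/html; charset=utf-8"></head><body></body></html>'
)
print(declared_charset(page))
```

When the function returns None, the page has not declared an encoding in this way and the browser has to fall back on its own settings, which is where Step 4 comes in.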
Alan Wood offers directions for configuring various browsers at Unicode and Multilingual Web Browsers. To help with the decisions about which fonts to choose where, the following chart sets out font options for Netscape encodings and for IE language scripts that should work.
Chart adapted from Yale University Library Workstation Support Group

Netscape (4.x and up) Font Options
| Encoding | Variable width font | Fixed width font |
| Western (ISO-8859-1) | (any number of options) | (any number of options) |
| Central European (ISO-8859-2) (Windows-1250) | Bitstream Cyberbit, Times New Roman | Courier New |
| Japanese (Auto-Detect) (Shift-JIS) (EUC-JP) | Arial Unicode MS, MS Gothic | Arial Unicode MS, MS Gothic |
| Traditional Chinese (Big5) (EUC-TW) | Arial Unicode MS, MingLiU | Arial Unicode MS, MingLiU |
| Simplified Chinese (GB2312) | Arial Unicode MS, MS Song | Arial Unicode MS, MS Hei |
| Korean (Auto-Detect) | Arial Unicode MS, Code2000, GulimChe | Arial Unicode MS, Code2000, GulimChe |
| Cyrillic (KOI8-R) (ISO-8859-5) (Windows-1251) (CP866) | Arial Unicode MS, Code2000, Times New Roman | Courier New |
| Baltic (ISO-8859-4) (Windows-1257) | Arial Unicode MS, Code2000, Times New Roman | Courier New |
| Greek (ISO-8859-7) (Windows-1253) | Arial Unicode MS, Code2000, Times New Roman | Courier New |
| Turkish (ISO-8859-9) | Arial Unicode MS, Bitstream Cyberbit, Code2000, Times New Roman | Courier New |
| Unicode (UTF-8) (UTF-7) | Arial Unicode MS, Code2000 | Arial Unicode MS, Code2000 |
| UserDefined | Arial Unicode MS, Code2000 | Courier New, Courier New Baltic |

IE (5.5/6) Font Options
| Language script | Web page font | Plain text font |
| Arabic | Arabic Transparent, Arial Unicode MS, Bitstream Cyberbit, Tahoma, Traditional Arabic & ... | |
| Armenian | Arial Unicode MS, Code2000 | |
| Bengali | Akaash, Arial Unicode MS, Code2000 | |
| Braille | Code2000 | |
| Burmese | | |
| CanSyllabic | Aboriginal Serif, Aboriginal Serif Unicode, Ballymun RO, Code2000 | |
| Cherokee | Aboriginal Serif, Code2000 | |
| Chinese Simplified | Arial Unicode MS, Bitstream Cyberbit, MS Hei, MS Song, SimSun-18030 | MS Hei, MS Song |
| Chinese Traditional | Arial Unicode MS, Bitstream Cyberbit, MingLiU | MingLiU |
| Cyrillic | Times New Roman & ... | Courier New, Andale Mono, Lucida Console |
| Devanagari | Alpha-demo, Arial Unicode MS, Code2000, shiDeva | |
| Ethiopic | Code2000, Ethiopia Jiret, GF Zemen Unicode, TITUS Cyberbit Basic | Ethiopia Jiret |
| Georgian | Arial Unicode MS, Code2000, TITUS Cyberbit Basic | |
| Greek | Times New Roman & ... | Courier New, Andale Mono, Lucida Console |
| Gujarati | Arial Unicode MS, Code2000, Shruti | |
| Gurmukhi | Arial Unicode MS, Code2000, Raavi | |
| Hebrew | David, Miriam & ... | Miriam Fixed, Fixed Miriam Transparent, Rod |
| Japanese | Arial Unicode MS, Bitstream Cyberbit, MS Gothic, MS Mincho | MS Gothic, MS Mincho |
| Kannada | Arial Unicode MS, Code2000, Tunga | |
| Khmer | Code2000, Khmer OS | |
| Korean | Arial Unicode MS, Batang, Bitstream Cyberbit, Code2000, GulimChe | GulimChe |
| Lao | Saysettha Unicode, Saysettha OT, VangVieng Unicode, XiengThong Unicode, Alice5 Unicode, Alice3 Unicode, Alice4 Unicode, Alice0 Unicode, Alice1 Unicode, Alice2 Unicode | |
| Latin based | (any number of options) | Courier New .... |
| Malayalam | Arial Unicode MS, Code2000, Kartika | |
| Mongolian | Code2000 (?) | |
| Ogham | Code2000, TITUS Cyberbit Basic | |
| Oriya | Arial Unicode MS, Code2000 | |
| Runic | Aboriginal Serif Unicode, Code2000, TITUS Cyberbit Basic | |
| Sinhala | Dinamina, Potha | |
| Syriac | Code2000, Estrangelo Edessa, TITUS Cyberbit Basic | |
| Tamil | Arial Unicode MS, Code2000, Latha, TabAvarangal2 | |
| Telugu | Arial Unicode MS, Code2000, Gautami | |
| Thaana | Code2000, Mv Boli, TITUS Cyberbit Basic | |
| Thai | Cordia New, Angsana New, Arial Unicode MS, Bitstream Cyberbit, Code2000, IrisUPC, Microsoft Sans Serif, Saysettha OT, Tahoma | Courier Mono Thai |
| Tibetan | Arial Unicode MS, NSimSun-18030, SimSun-18030 | |
| UserDefined | Arial Unicode MS & all | Courier New ALA ... |
| Yi | Code2000, NSimSun-18030, SimSun-18030 | |

Step 4: Configuring your browser by selecting encodings (called "character set" in Netscape 4, "character encoding" in Netscape 7 & 8, and FireFox 2)
Directions on how to select encodings for various versions of different browsers can be found on the same pages where the directions for Step 3 are located (i.e. go to Alan Wood's Unicode and Multilingual Web Browsers, click on a browser, then scroll down to the end of the instructions for selecting Fonts to the instructions re: Encodings).
Keep in mind that pre-Netscape 7 doesn't have Auto-Select (called "Auto-Detect" in Netscape 7 and 8). Instead, for pages that don't specify an encoding (and many do not even if they aren't Latin-based), pre-Netscape 7 will always choose the default encoding, so if the page is gibberish, the first thing you want to check is what encoding Netscape is using. It may need changing. (However, it seems that sometimes, though the default setting has a check mark beside it, the page is not gibberish because Netscape has performed some internal acrobatics and is actually using the right encoding.)
IE's "Auto-Select" on the other hand most often chooses the appropriate encoding automatically, though you can change it if you want to.
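The difference between "always use the default" and "use the right encoding" can be sketched in Python. This is a deliberately simplified model, not a description of how Netscape or IE actually work inside: the function name, the choice of Western (ISO-8859-1) as the default and the sample text are all assumptions made for the example.

```python
def decode_page(raw, declared=None, default="iso-8859-1"):
    """Decode page bytes with the declared encoding, else the default.

    This mimics a browser with no auto-detection: when the page does not
    say which encoding it uses, the default is applied blindly.
    """
    return raw.decode(declared or default, errors="replace")


original = "Ελληνικά"
raw_page = original.encode("utf-8")  # a UTF-8 page with no declaration

# No declared encoding: the Western default turns the text into gibberish.
gibberish = decode_page(raw_page)
print(gibberish)

# Changing the encoding by hand (as in the browser's menu) fixes it.
fixed = decode_page(raw_page, declared="utf-8")
print(fixed)
```

Nothing is wrong with the page itself in the first case; the bytes are simply being read with the wrong table, which is why changing the encoding in the browser's menu is the first thing to try.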
Language Support
adapted on July 4, 2001 from a page at
http://www.palevich.com/ja/faq_japanese.htm
(page no longer seems to exist)
These questions and answers refer specifically to Japanese, but the
same process works for all international characters, such as Chinese,
Korean, Vietnamese, Cyrillic, Arabic, Hebrew, Greek, Extended Latin, and African and
Latin American Symbols.
Q: I don't care about Unicode and all that information, I just want to know how to see Japanese text on the web. I'm using Netscape Navigator 4.x on a non-Japanese version
of Windows 95/98/NT/2000, and I see funny characters when I visit a
web site like http://www.yahoo.co.jp/.
A: Either upgrade to Netscape 7 or higher or install Internet Explorer (IE) 5.5 or higher.
(Don't worry, you can still use Navigator as your browser. Once you
have the necessary fonts installed for IE, they should also work for
Netscape, although I still had display problems sometimes with Hebrew, Chinese, Japanese, and Korean in Netscape 4. To solve this
problem, I used IE when visiting a site with one of those four languages. When I upgraded to Netscape 7, the problem went away.)
To install Internet Explorer, go to this URL http://www.microsoft.com/windows/ie/default.htm
and follow directions. Don't worry, even if you choose the complete,
non-custom version, you will still be able to make Netscape your default
browser.
Once you have installed IE, start IE, and go to the View menu and choose
the menu item "Encoding:Auto Select". Then use IE to visit
the web site http://www.yahoo.co.jp/
Once you're there, you'll either see the site in Japanese, or you will
get a dialog box that says something like "I need to install Japanese
Language Support to view this page". If you get the dialog box,
click "Yes", or "OK", to install the Japanese
Language Support (it's actually a font plus some stuff for the operating system). After that's downloaded and installed, you should
be able to surf the web in Japanese using IE.
You may need to repeat this "installation of the Language Support"
process for each language you want to access.
Now, you can either continue to use IE, or you can go back to using
Netscape 4 and just use IE for Hebrew, Chinese, Japanese and Korean
(or you can get with it and upgrade your Netscape).
Q: How come some Japanese pages look fine, but others are garbled?
A: Japanese web pages are encoded in one of several different
text encodings. Some Japanese web pages contain special HTML tags that
tell your browser which encoding the page is using. Unfortunately, most
Japanese web pages don't contain these tags. As a result, your browser
has to guess which of the possible encodings is being used. Most browsers
have a "View:Encoding" menu (Netscape's is "View:Character Set" before Netscape 7 and "View:Character Coding" in Netscape 7) with an option named "Auto-Select"
that tells the browser to try to guess which encoding is being used.
You should normally select that option. (Netscape 7 calls it "Auto-Detect"; earlier versions of Netscape have neither "Auto-Detect" nor "Auto-Select".)
Even with that "Auto-Select" selected, you may occasionally
find a web page that displays garbled Japanese text. In that case,
use the View:Encoding menu to manually select each of the Japanese
encoding options in turn. First try "Japanese (Auto Select)",
then the others. One of them should work for that page. If you know
that the author of the page used a PC or a Mac to create the page, the
encoding is probably Shift-JIS. If the author used Unix, the encoding
is probably EUC. You might also have to do some "font selecting" for your browser as set out in Step 3 above.
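Trying each Japanese encoding in turn, as described above, can be sketched in Python. This is just an illustration: the function name and the sample word are made up, and the three codec names are the ones Python's standard library uses for Shift-JIS, EUC-JP and ISO-2022-JP.

```python
JAPANESE_ENCODINGS = ["shift_jis", "euc_jp", "iso2022_jp"]


def try_encodings(raw, encodings=JAPANESE_ENCODINGS):
    """Decode raw bytes with each encoding; None marks a failed attempt."""
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


sample = "\u65e5\u672c\u8a9e"  # the word for "Japanese language"

for source in JAPANESE_ENCODINGS:
    raw = sample.encode(source)
    print(f"page saved as {source}:")
    for name, text in try_encodings(raw).items():
        shown = "(cannot decode)" if text is None else text
        print(f"  read as {name}: {shown}")
```

As with the browser menu, a wrong choice either fails outright or produces garbled text, and only the encoding the author actually used gives back the original word.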
Netscape users: Because Netscape 4.x doesn't have an "Auto-Select", you often have to manually select. Go to the "View" menu,
choose "Character Set" and keep selecting until it works. Or why not just take the plunge and upgrade to a more recent version of Netscape?
Acknowledgements
Thank you especially to James Kass, Jukka "Yucca" Korpela and Alan Wood. Were it not for their work and excellent material freely available on the web (and James Kass' generous help and suggestions), I would understand very little about encoding systems, or about Unicode and how to use it, and the above would not exist.