This file is part of the TADS 2 Author’s Manual.
Copyright © 1998 - 2002 by Michael J. Roberts. All rights reserved.
Edited by NK Guy, tela design.

Appendix K

Character Set Mapping and Translation

To provide character set portability, the TADS Compiler, Run-Time, and Debugger (including the HTML TADS versions) now provide an option that lets you specify a character set translation to use for your game. The character set translation allows your game to use a standardized character set, such as ISO Latin 1, but still run on any system by providing a mapping from your game's internal character set to the native system character set for each player's system.

This new feature is intended to be simple to use; in most cases, players will not need to know anything about this new feature, and game authors will only need to know how to use the new compiler option that specifies the character set used by the game's source code. Most TADS game authors and players should find everything they need to know in the Quick Reference below.

In addition to the quick reference guide, this document also provides considerable background information on multi-lingual character sets and how TADS implements character set portability. TADS provides an open and extensible mechanism for character set translation; this background material is intended only for game authors who need to go beyond the standard uses of the feature described below.

Contents:

Quick Reference to TADS Character Sets
Background Information
Using Mapping Tables
Creating a Mapping Table File

Quick Reference to TADS Character Sets

This section provides answers to the most common questions about using the TADS portable character set system.

I just downloaded a TADS game and want to play it. Do I have to know what character set it uses?

No. If the game uses the new character set system, the run-time will automatically read the character set information from the game and will load the correct character set mapping table. Note that you will need to make sure your TADS run-time is up-to-date -- version 2.2.4 or higher is required for games that use character set translation.

I'm trying to play a game with the TADS Run-time, and it's reporting this error: character table file "xyz.tcp" not found for internal character set "ISO-Latin-n (ISO-8859-m)".

This means that the game uses character set translation, but you don't have the necessary character set translation file for your system installed. The run-time needs to be able to find the file listed in the error message in order to translate the character set.

You have a few options. The easiest is to tell the TADS run-time to ignore the character set mapping entirely; to do this, run the game again with the option "-ctab-" on the command line:

    tr -ctab- foo.gam

This will make TADS run without using character set translation. The game will work correctly, except that any extended characters (such as accented letters) that it attempts to display may be displayed incorrectly on your screen. Depending on how heavily the game makes use of accented letters and other extended characters, this may or may not be acceptable to you.

Second, you can try to find the mapping file named in the error message. You might try asking in the rec.arts.int-fiction or rec.games.int-fiction newsgroups, or checking the IF Archive to see if the file is available.

Third, if you're willing to take the time, you can create the character mapping table yourself. This requires that you know the layout of your computer's native character set, as well as the layout of the internal character set (named in the error message) that the game uses. If you're interested, you can refer to the section on creating a mapping table file below for full details.

I'm writing a game, and I want to use extended characters in the game. What should I do?

You should use the new character set translation feature.

First, you should choose an internal character set for your game. Although you can choose any character set you want, we recommend that you choose one of the ISO 8859 family of character sets. If you're writing in a Western European language, you should use ISO Latin-1, since this character set has the necessary characters for most Western European languages. If you're writing in a Central or Eastern European language, you should choose ISO Latin-2. If you're writing in another language, you might have to choose another character set appropriate to that language; however, you may have more trouble finding a mapping file if you don't use ISO Latin-1 or Latin-2.

Second, you should prepare your source code using your computer system's native character set. Simply enter accented and other extended characters with your text editor or word processor directly into your source code.

Third, you must identify the correct mapping file for your system that maps from the native character set that you used to prepare your source file to the internal character set that you chose. How this file is named depends on your system.

DOS users: you will need to know the DOS "code page" that your text editor used to save your game's source file. This is a 3- or 4-digit number that identifies the DOS character set you're using. US users normally use DOS code page 437, users in Western Europe usually use code page 850, and users in Eastern Europe normally use code page 852. You can determine what code page you're using by typing "chcp" at the DOS prompt. Once you know your code page, append "La" and the ISO Latin number to the code page. So, to convert from code page 850 to ISO Latin-1, the mapping file is "850La1.tcp".
Windows users: Windows uses character sets that map very closely to the ISO 8859 character sets, so the mapping files on Windows simply correspond to the ISO 8859 character sets. If you're using the Windows Western European localization, use Win_La1.tcp, which maps from Windows code page 1252 to ISO Latin-1. If you're using the Windows Eastern and Central European localization, use Win_La2.tcp, which maps from code page 1250 to ISO Latin-2.
Other systems: consult your system-specific TADS documentation. (We'll add to this list in the future as more systems adopt conventions for naming the mapping file.)

Finally, when you compile your game, specify the mapping file, like this:

    tc -ctab xyz.tcp mygame.t

Do I need to include character set mapping files with my game?

No. Character set mapping files are part of the system, not part of your game. As long as you use one of the standard character sets (such as a member of the ISO 8859 family), mapping files should be readily available for most TADS run-time platforms.

Note, however, that if you're distributing a stand-alone executable version of your game, players will not need to download TADS to play your game, so you will have to include the .TCP files for your system with your game in this case.

Can I use one of the ISO 8859 character sets that is based on a non-Latin script (such as Greek or Cyrillic)?

Yes. Although this document uses ISO Latin-1 and ISO Latin-2 in most of its examples, you can use any 8-bit character set as the internal character set, including non-Latin scripts, as long as the script is written left-to-right. The ISO 8859 family of character sets includes several character sets based on non-Latin scripts; any of these can be used as an internal character set.

Note that TADS does not currently support right-to-left or bidirectional scripts, because the rendering code is only designed for left-to-right scripts. TADS also does not support double-byte or multi-byte character sets for use as the internal character set.

The mapping file I need is not available for my platform. Is there an easy way to create one?

Possibly. Although TADS is not capable of using Unicode directly, TADS can use Unicode as an intermediary in translating character sets. Unicode mappings are available for all of the ISO 8859 character sets, and for many vendor-specific character sets; the Unicode web site provides a large catalog of these mappings in a format that can be used directly by the TADS character set translation tool. Refer to Using Unicode Mapping Files for details.

Background Information

The following sections are intended to provide information on the theory of operation of the TADS character set translation mechanism. Most game players and authors will not need to know about character sets at the level of detail provided below, but this information is included for completeness.

An Overview of Character Set Portability

TADS game authors have often requested more support for non-English games. One important capability that non-English games require is access to characters from the "extended" character set, beyond the normal US ASCII range, so that games can display accented characters and other special symbols that are not part of the standard ASCII character set.

TADS has long allowed games to use the full "8-bit" character set that most modern systems use to encode accented and other special characters, so it's been possible for a long time to write a game that uses, for example, the extended characters on a Macintosh.

Unfortunately, a problem that has long afflicted not only TADS users but practically everyone using a computer is that the extended character sets that most computers use are incompatible with each other. The extended characters on a Macintosh, for example, are defined differently from the special characters on a PC.

What makes this incompatibility especially irritating is that many of the extended characters that are defined in one system's extended character sets tend to show up in most of the other systems' character sets -- it's just that they're in different locations in the character set. For example, although both Macintoshes and PC's are capable of showing an "e" with an umlaut, they put the "e" with the umlaut in different places in their character sets. This means that a text file prepared on one system will be garbled on the other system, because the file simply encodes the character numbers that are appropriate for the original system, and these character numbers refer to different symbols on other systems.
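The mismatch is easy to see in a short Python sketch. Python's standard codecs `cp437` (DOS), `mac_roman` (Macintosh), and `latin-1` (ISO Latin-1) are used here purely as stand-ins for the character sets discussed above; they are an assumption of this illustration and are not part of TADS.

```python
# Illustration only: Python's built-in codecs stand in for the native
# character sets of a DOS PC and a Macintosh. TADS does not use them.
E_UMLAUT = "\u00eb"  # "e" with an umlaut

dos_code = E_UMLAUT.encode("cp437")[0]      # code on a DOS PC (code page 437)
mac_code = E_UMLAUT.encode("mac_roman")[0]  # code on a Macintosh
iso_code = E_UMLAUT.encode("latin-1")[0]    # code in ISO Latin-1

# Same letter, three different character numbers.
print(dos_code, mac_code, iso_code)

# The byte written on the DOS PC names a different symbol on the Macintosh.
garbled = bytes([dos_code]).decode("mac_roman")
print(repr(garbled))
```

The same letter has three different codes, and the DOS code read back on the Macintosh comes out as a different accented letter.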

TADS has traditionally had this same problem, because TADS previously simply encoded in the .GAM file the same character codes that you used in your source file. When a player moved the .GAM file to a different type of computer, TADS continued to display the same extended character codes that were correct on your computer, but which were usually wrong on a different type of computer.

Character Set Translation

To address the portability problem with extended character sets, TADS uses a method known as "translation" or "mapping." For each character in one computer's native character set, TADS translates the character to the corresponding code point in the other computer's native character set. This translation happens dynamically as a player runs your game, so the same .GAM file will work on any platform -- you don't have to translate the .GAM file itself to different machines.

The principle is simple: each game uses a well-defined character set, called the "internal" character set, and the TADS run-time translates from the well-defined game character set to the actual character set used by the player's computer. The internal character set provides the portability, because each platform that has a TADS run-time will be able to provide its own conversion table from the internal character set to the native character set.

Data	Source of the Data	Character Set
Source Code	Your text editor or word processor	Your computer's native character set
.GAM File	TADS Compiler: the compiler translates your game source code into the binary .GAM file, converting characters to the internal character set of your choosing	One of the standard "internal" TADS character sets of your choosing
Text displayed on the player's screen	TADS Run-time: the run-time executes your game's compiled code, and translates the characters stored in the .GAM file to the player's computer's native character set	Player's computer's native character set
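The two translation steps summarized above can be sketched in Python. This is only an illustration of the principle: Python codecs stand in for TADS mapping tables, and the choice of DOS code page 850 for the author and a Macintosh for the player is an assumption made for the example. The function names are hypothetical.

```python
# Illustration only: Python codecs stand in for TADS mapping tables.
AUTHOR_NATIVE = "cp850"      # assumed: the author writes on DOS, code page 850
INTERNAL = "latin-1"         # internal character set chosen for the game
PLAYER_NATIVE = "mac_roman"  # assumed: the player runs the game on a Macintosh

def compile_text(source_bytes: bytes) -> bytes:
    """Compiler step: native source bytes -> internal character set."""
    return source_bytes.decode(AUTHOR_NATIVE).encode(INTERNAL)

def display_text(gam_bytes: bytes) -> bytes:
    """Run-time step: internal bytes -> the player's native character set."""
    return gam_bytes.decode(INTERNAL).encode(PLAYER_NATIVE)

source = "caf\u00e9".encode(AUTHOR_NATIVE)  # bytes as saved by the author's editor
gam = compile_text(source)                  # bytes as stored in the .GAM file
shown = display_text(gam)                   # bytes sent to the player's screen

# The accented letter has a different code at each stage.
print(source.hex(), gam.hex(), shown.hex())
```

Neither step needs to know anything about the other machine; each side only translates to or from the internal character set.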

Choosing an Internal Character Set

To use the new character set translation mechanism, you must first decide on the "internal" character set for your game. This is the character set that your .GAM file will use. Although you can choose any single-byte character set for your internal character set, we strongly recommend that you use one of the standardized international character sets, such as one of the ISO 8859-X standards; this will make it easier for your players to find a suitable mapping for their systems.

Obviously, if every game author made up his or her own character set, players would have a huge problem. For every type of computer that a player might want to use to run your game, someone will have to produce a character set mapping that translates from your internal character set to the player's computer's character set.

So, you clearly want to choose a standard character set. We recommend that you choose one of the following character sets, according to your needs:

ISO Latin-1 (ISO 8859-1) - This character set encodes most of the characters needed for Western European languages.
ISO Latin-2 (ISO 8859-2) - This character set encodes most of the characters needed for Central and Eastern European languages.

There are several other encodings in the ISO 8859 series; if the ones listed above are not suitable for your language, you should try to find an ISO 8859 encoding that has the characters you need.

If game authors use the same character sets whenever possible, it will greatly simplify the creation of character mapping files for the many platforms where TADS runs, and will make it easier for people to play your game.

Character Set Identifiers and Long Names

Each internal character set has a unique and universal identifier, and it also has a full name that is used for display purposes. The identifier is "universal" in that it must have the same value on every platform; the reason is that each platform uses this information to determine how to map the internal character set to the native character set. If the Macintosh and Windows translation files for, say, ISO 8859-1 did not have a common identifier, there would be no way to figure out how to choose the translation when moving the game from one machine to the other.

To ensure that each character set identifier is unique and universal, internal character sets must be registered and included on a master list. By consulting the master list, anyone on any platform can determine the character set associated with a particular identifier, so there will never be any confusion about a game's character set. Currently, Mike Roberts (mjr_@hotmail.com) maintains the master list; please contact him if you need a copy of the master list or would like to add a new character set to the list.

At the time of this writing, the master list of registered internal character sets includes the following:

The entire ISO 8859 series of 8-bit character sets. Each ISO 8859 series member is identified as follows:
- ID = "LaX", where X is the number in the Latin-X suffix. For example, ISO Latin-1's identifier is "La1".
- LDESC = "ISO full-name (ISO 8859-X)". For example, ISO Latin 1's full display name is "ISO Latin-1 (ISO 8859-1)".
Code Page 1251, which is the Windows Cyrillic character set. This is the recommended internal (i.e., .GAM) character set for authors working in Cyrillic. This character set is identified as follows:
- ID = "1251"
- LDESC = "Code Page 1251 - Cyrillic"

When you create a character set definition file, you must specify the ID and LDESC values. The compiler stores these values in the .GAM file when you compile a game using this character set translation file. Later, when a player loads the game with the run-time, the run-time reads the ID and LDESC values, and attempts to load a mapping file matching the character set. If the run-time is unsuccessful, it will display an error message that includes the LDESC value; this provides the player with information that may be helpful in locating an appropriate mapping file.

The TADS Run-time uses a system-dependent convention to identify the mapping file. The convention depends on how native character sets are represented on the particular platform.

On DOS, the run-time determines the current DOS code page that is in effect (DOS code pages are identified by 3- or 4-digit numbers), then concatenates the code page number and the internal character set ID, and adds the ".TCP" extension. So, the mapping file to translate between ISO Latin-2 and DOS code page 852 is called "852La2.TCP".

On Windows, HTML TADS chooses a file named WIN_xxx.TCP, where xxx is the internal character set ID. For example, for ISO Latin-2, the file is called "WIN_La2.TCP". On Windows, TADS chooses the native character set based on the code page stored in the .TCP file, so TADS doesn't need to include the code page number in the .TCP filename.

Refer to the system-specific documentation for your version of TADS for information on how the run-time chooses a mapping file.
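The DOS and Windows naming conventions described above amount to simple string concatenation. The following Python sketch restates them; the function names are hypothetical and the code is not taken from the TADS run-time.

```python
def dos_mapping_file(code_page: int, charset_id: str) -> str:
    """DOS convention: code page number + internal character set ID + '.TCP'."""
    return f"{code_page}{charset_id}.TCP"

def windows_mapping_file(charset_id: str) -> str:
    """Windows (HTML TADS) convention: 'WIN_' + internal character set ID + '.TCP'."""
    return f"WIN_{charset_id}.TCP"

print(dos_mapping_file(852, "La2"))   # 852La2.TCP
print(windows_mapping_file("La2"))    # WIN_La2.TCP
```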

The Extra System Information String

The mapping table source file can contain one more thing, which is the extra system information string. This is some extra information entirely for the use of the system-specific version of the TADS run-time. The meaning of this value depends on the operating system.

The value is specified like this:

   EXTRA_SYSTEM_INFO = 1250

The meaning of the value ("1250" in this case) depends on which system you are using.

For DOS, the extra system information is not used.

For Windows, HTML TADS interprets the extra system information as a Windows code page number. When HTML TADS loads the mapping file, it reads the extra system information string, and then finds the Windows character set matching the code page number.

Why go to all this trouble? Why not just use my computer's character set?

You probably shouldn't use your computer's native character set as the internal character set (unless your computer uses one of the ISO 8859 character sets) for the same reasons that you shouldn't make up your own character set: it is very desirable to minimize the total number of different internal character sets that games use, because this reduces the number of translation files that need to be created for each different system where TADS runs.

If everyone simply used their own computer's native encoding, we'd end up needing one translation table on every type of computer for every other type of computer where TADS runs. Clearly, if we can limit ourselves to a few common standard character sets, the total number of translation files each computer needs will be much smaller.
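A little arithmetic makes the difference concrete. The Python sketch below uses made-up figures (ten native character sets, two standard internal character sets) purely for illustration.

```python
def tables_if_native_everywhere(n_native_sets: int) -> int:
    """Each computer needs a table for every other computer's character set."""
    return n_native_sets - 1

def tables_if_standardized(n_internal_sets: int) -> int:
    """Each computer needs one table per standard internal character set."""
    return n_internal_sets

# Hypothetical figures, chosen only to show the scale of the difference.
n_native, n_internal = 10, 2

per_computer_native = tables_if_native_everywhere(n_native)
per_computer_standard = tables_if_standardized(n_internal)

print(per_computer_native, "tables per computer without standard sets")
print(per_computer_standard, "tables per computer with standard sets")
print(n_native * per_computer_native, "versus", n_native * per_computer_standard, "in total")
```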

Note that Windows does not use ISO 8859 character sets. Many people have been led to believe that the Windows character sets are the same as the ISO 8859 character sets, but this is not the case. Windows actually uses supersets of these character sets; not all of the characters in the Windows code pages exist in the ISO 8859 character sets.

Why not use Unicode?

If you're familiar with the new Unicode character set standard, you may be thinking that this problem would be solved much more easily with Unicode. Well, yes and no.

TADS cannot use Unicode directly. However, the TADS character set translation tool can use Unicode character mappings as an intermediary to create a mapping between two non-Unicode character sets, which can greatly simplify the process of constructing a character set translation. Refer to Using Unicode Mapping Files for details.

Unicode is a 16-bit character set that is designed to incorporate every character in every character set in use anywhere in the world, and assign each individual character a unique code point within this single 16-bit space. This clearly is a wonderful improvement over the 8-bit world, since a particular character code always refers to the same character, regardless of the computer's language or country settings.

So, Unicode would seem to solve the character translation problem once and for all. Unfortunately, it's not the right solution for TADS right now for a couple of reasons.

First, Unicode is not a panacea for the character set problem, at least in the short run, for the simple reason that Unicode is not yet supported as the native character set of most operating systems. For any computer that doesn't use Unicode as its native character set, a translation table of some sort would be required -- using exactly the same type of mechanism that TADS now uses. So, while Unicode would eliminate the need for the game author to choose an internal character set, it would still require that the player have appropriate mapping files, which would still vary by system and native character set. In other words, Unicode wouldn't really simplify matters for TADS or for most users until most or all operating systems adopt Unicode as their native character set.

Second, TADS is designed to use an 8-bit character set internally, which would not be adequate for Unicode. Changing TADS to use 16-bit characters would be possible, but it would require much more work than the new character set mapping feature.

TADS may change to use Unicode internally in the future. Once TADS and the major operating systems all adopt Unicode as their native character sets, the character set problem should disappear, but for the time being, TADS will remain in the confusing 8-bit world.

Using Mapping Tables

TADS uses mapping tables to translate characters between the native character set of your computer and the internal character set used by your game. TADS uses mappings when it compiles a game, and when it runs a game.

Compiling with a Mapping Table

Once you've chosen your internal character set, you must provide a translation from the character set that your source code uses to the internal character set you have chosen. If you prepare your source code using the same character set that you decided to use as the internal character set, there's nothing you need to do here. If your source code uses something other than your internal character set, though, you must provide a mapping. To use the mapping, provide the -ctab (character table) option to the compiler:

    tc -ctab mymap.tcp mygame.t

The character map file is a separate file that defines the mapping between your source code character set and the internal character set. As the TADS Compiler reads your source code, it will translate everything into your internal character set for storage in the .GAM file.

Running with a Mapping Table

Finally, whenever a player wants to play your game, the run-time must choose a suitable mapping file for their system. Character set selection will, in most cases, be completely transparent to the player. The TADS Compiler stores the internal character set identifier in your .GAM file; when the player runs the game, the TADS run-time reads the character set identifier from the .GAM file, and attempts to find a suitable mapping file. The run-time will automatically choose a mapping file that is suitable for the player's computer, which ensures that your game will appear correctly (or as close to correctly as is possible on the player's machine, if it doesn't support all of the same characters your computer does).

Note that you as the game author don't need to know anything about what kind of system the player will use to play your game, and the player doesn't need to know anything about the type of system that you used to write your game.

Players can override the automatic mapping selection that the run-time attempts to make by using the -ctab option, and can also turn off mapping entirely by using the -ctab- option. This will probably only be desirable when an appropriate mapping file is not available for a particular game's internal character set on a particular platform, which may occur early on, before every platform has mapping files available.

The -ctab option can be used to override the default mapping as follows:

    tr -ctab playmap.tcp mygame.gam

Eventually, when mapping files become widely available for all of the TADS run-time platforms, the character mapping process should be so automatic and transparent that players should never be aware that it is even happening.

Creating a Mapping Table File

This feature is brand new with this release, and as of yet TADS itself does not provide many pre-defined mapping files. Currently, TADS includes the following mapping files:

437La1 - DOS code page 437 to ISO Latin-1
win_La1 - Windows code page 1252 to ISO Latin-1.

It will be much easier for players and authors to take advantage of this feature when more standard mapping files become available, which should happen in future releases. However, if you want to start using this feature immediately, you can create your own mapping files with relatively little work.

To define a mapping file, you use a new tool called MKCHRTAB (it's called MKCHRTAB32 for Windows 95/NT users). This tool reads a special source file that you can create to define a character set mapping, and generates a mapping file that you can use with the -ctab option to the TADS Compiler, Run-Time, and Debugger. Once you've created a mapping file, you can use the same mapping file for all of the TADS tools.

You run MKCHRTAB using a command like this:

    mkchrtab source-file mapping-file

The source-file is a text file that you create, as described below, to define the character set mapping. The mapping-file is a binary file that MKCHRTAB creates; you use this file with the -ctab option for the TADS Compiler, Run-Time, and Debugger.

Mapping Table Source File Format

The source file contains a listing of character mappings. Each line of the source file contains one character translation. Blank lines are ignored, as are lines starting with a pound sign ('#').

In addition to character mappings, the source file must contain two special definitions: the internal character set identifier, and the internal character set full display name. You specify these thus:

ID = character-set-id-code
LDESC = character-set-full-display-name

The ID and LDESC are stored in the character set mapping file. When you compile a game using this character set definition, the ID and LDESC will be stored in the compiled .GAM file, so that the Run-time can determine the appropriate character set to load when a player runs the game.

Note that the native character set is not identified in the file. This is because it is not necessary to store any information about the local character set with a game. The entire point of the character set translation system is to make sure that games do not use native character sets but use portable standard character sets instead. When you compile a game, it is completely translated to the internal character set, so after compilation, the original native character set that was used to prepare the game's source code is completely irrelevant.

Using Unicode Mapping Files

There's a very quick and simple method for generating a mapping file that lets you avoid all of the tedious data entry: use the existing Unicode translation tables from the Unicode web site.

Although TADS cannot use Unicode directly as an internal character set, the character set translation tool can use Unicode as an intermediary to construct a character set translation. This greatly simplifies the task of creating a mapping file.

One of the useful properties of Unicode is that the Unicode character set is a superset of most of the common character sets in use on computer systems throughout the world. This means that almost any character that you can find in almost any character set on any computer has a unique code point in Unicode. As a result, Unicode can serve as a "Rosetta stone" to provide a translation between nearly any pair of character sets.
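The "Rosetta stone" idea can be sketched in a few lines of Python. Two tables that each map a character set to Unicode are joined on the Unicode character to give a direct mapping between the two sets. Python's built-in `cp437` and `latin-1` codecs stand in for the Unicode mapping tables here; that substitution, and the function names, are assumptions of this sketch and do not describe how MKCHRTAB is implemented.

```python
def to_unicode_table(codec: str) -> dict:
    """Map each code 128..255 to its Unicode character, where one exists."""
    table = {}
    for code in range(128, 256):
        try:
            table[code] = bytes([code]).decode(codec)
        except UnicodeDecodeError:
            pass  # this code is unassigned in the character set
    return table

def compose(native_codec: str, internal_codec: str) -> dict:
    """Derive native -> internal codes by matching Unicode characters."""
    native = to_unicode_table(native_codec)
    internal = {ch: code for code, ch in to_unicode_table(internal_codec).items()}
    return {code: internal[ch] for code, ch in native.items() if ch in internal}

mapping = compose("cp437", "latin-1")
print(mapping[0x89])  # DOS code for "e" with an umlaut -> ISO Latin-1 code 235

# Codes with no counterpart are the ones that need a default character.
unmapped = [code for code in range(128, 256) if code not in mapping]
print(len(unmapped), "native codes have no ISO Latin-1 equivalent")
```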

The Unicode Consortium, the organization that develops and publishes the Unicode standard, has created a large collection of character set mappings that translate other character sets into Unicode. These mappings are available in electronic form, and use a standard text format. The TADS character set mapping tool can use these mapping files directly.

The Unicode mapping files can be found on the Unicode web site at ftp://ftp.unicode.org/Public/MAPPINGS. The Unicode Consortium does not permit third-party redistribution of these files, so they are not included with TADS, but you can download them directly from the Unicode web site.

To use Unicode mappings to construct a character set translation, you first must obtain the correct pair of Unicode mapping files -- one for your native character set, and one for your internal character set.

Next, start your mapping file as normal, specifying the ID and LDESC settings.

Then, rather than specifying character mappings directly, simply specify a reference to your Unicode mapping files, like this:

    unicode native="native-mapping" internal="internal-mapping"

Finally, specify the default character mappings (see below) using the NATIVE_DEFAULT and INTERNAL_DEFAULT directives. These will provide mappings for any characters that can't be mapped from one character set to the other.

You will now have a complete character set mapping source file. Simply run this through MKCHRTAB as usual to create your .TCP file. Note that the character set translator will even automatically use the Unicode characters to translate named HTML entities to the native character set.

Note that you may wish to supplement the Unicode mappings with explicit mappings of your own, to provide approximations of characters that can't be mapped directly between the character sets. For example, DOS code page 437 lacks several of the accented letters in ISO Latin-1, so it may be desirable to supplement the Unicode mapping by specifying the unaccented equivalents for the missing accented characters.

Specifying Default Characters

It is likely that there will not be a perfect correspondence between your native and internal character sets -- each character set is likely to have a few characters that have no equivalents in the other character set. In these cases, you will need to provide a "default" character in each character set that should be used for those characters that cannot be mapped from the other character set.

The simplest way of specifying the default character is to use this pair of directives in your mapping file:

    native_default = character
    internal_default = character

The NATIVE_DEFAULT directive specifies the character value (as a decimal, hex, or octal number, or as a single character enclosed in single quotes) that should be used for any character in the internal character set that has no explicit mapping to the native character set. Similarly, the INTERNAL_DEFAULT directive specifies the internal character that should be used for any native character that doesn't have an explicit mapping.

Note that these directives apply only to character codes in the range 128 to 255. Unmapped characters in the range of codes from 1 to 127 are assumed to be in the ASCII subset that is the same in most character sets, so the defaults are not applied to this range.

Of course, you don't have to use the default character directives to specify default characters; you can instead specify each default mapping explicitly.
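
The effect of a default directive can be pictured with a short Python sketch. This is only an illustration of the rule described above, not the actual MKCHRTAB implementation; the function name apply_defaults and its parameters are hypothetical.

```python
def apply_defaults(explicit, default_code=None):
    """Build a full 256-entry translation table (illustrative only).

    explicit     -- dict of source code -> target code from the mapping file
    default_code -- value of the default directive, or None if it is absent
    """
    table = []
    for code in range(256):
        if code in explicit:
            # An explicit mapping always wins.
            table.append(explicit[code])
        elif code < 128 or default_code is None:
            # The ASCII range is never defaulted; without a directive,
            # unmapped codes map to the same code.
            table.append(code)
        else:
            # Unmapped code in 128..255: use the default character.
            table.append(default_code)
    return table


# Hypothetical internal-to-native table with '?' as the native default.
internal_to_native = apply_defaults({201: 152}, default_code=ord('?'))
```

Here code 201 follows its explicit mapping, code 65 is left alone because it is in the ASCII range, and any other code from 128 to 255 becomes '?'.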

Specifying Character Mappings

Each character mapping defines the translation from your computer's native character value to the internal character value and back again. These mappings take several forms.

First, you can define the "reversible" mapping:

   152 <-> 201

The first value is the character code in your computer's native character set; the second is the corresponding code in the standardized internal character set. When the compiler reads your game, it will convert code 152 to code 201 for storage in the .GAM file. When the run-time plays a game compiled using this character set, it will translate code 201 to code 152 for display on your computer.

Second, you can define a one-way forward mapping:

   160 -> 255

This mapping specifies that character 160 in the native character set is translated to code 255 in the internal character set, but doesn't specify how code 255 is displayed. This type of mapping is useful when multiple characters in your native character set are to be translated to the same character in the internal character set; this is especially important when a character in your native character set doesn't have an equivalent in the internal character set, so you must map the character to a special value that you reserve as an invalid character.

Third, you can define a one-way reverse mapping:

   128 <- 220

This specifies that character 220 in the internal character set is translated to character 128 in the native character set. This type of mapping is useful for invalid characters, much as the one-way forward mapping is, but in this case it's useful when the invalid characters appear in the internal character set. You use this mapping when you need to map several characters in the internal character set to a single character in the native character set.

Finally, you can define a complete three-way mapping:

   161 -> 255 -> 128

This specifies that character 161 in the native character set is to be mapped to character 255 in the internal character set, but that character 255 in the internal character set is mapped back to character 128 in the native character set. As with the one-way mapping, this is useful for invalid characters, because you will probably want to map any invalid character in the internal character set back to an appropriate "empty" character in the native character set, so that it is displayed as a missing or empty character rather than something incorrect.
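
As a rough illustration of how the four mapping forms differ, the following Python sketch records each form in a pair of lookup tables. It assumes whitespace-separated tokens and decimal values only, and the names parse_mapping, to_internal, and to_native are hypothetical; the real mapping compiler may work quite differently.

```python
def parse_mapping(line, to_internal, to_native):
    """Record one mapping line in the two tables (illustrative only)."""
    parts = line.split()
    if len(parts) == 3 and parts[1] == "<->":
        # Reversible: native <-> internal
        native, internal = int(parts[0]), int(parts[2])
        to_internal[native] = internal
        to_native[internal] = native
    elif len(parts) == 3 and parts[1] == "->":
        # One-way forward: native -> internal
        to_internal[int(parts[0])] = int(parts[2])
    elif len(parts) == 3 and parts[1] == "<-":
        # One-way reverse: native <- internal
        to_native[int(parts[2])] = int(parts[0])
    elif len(parts) == 5 and parts[1] == "->" and parts[3] == "->":
        # Three-way: native -> internal -> native
        to_internal[int(parts[0])] = int(parts[2])
        to_native[int(parts[2])] = int(parts[4])
    else:
        raise ValueError("unrecognized mapping: " + line)


to_internal, to_native = {}, {}
for sample in ["152 <-> 201", "160 -> 255", "128 <- 220", "161 -> 255 -> 128"]:
    parse_mapping(sample, to_internal, to_native)
```

After processing the four sample lines from this section, native codes 160 and 161 both translate to internal code 255, and internal codes 220 and 255 both translate back to native code 128.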

Note that you can use hex or octal notation for any of these values. To specify a hex number, start the number with "0x":

  0xf3 <-> 0x85

To specify an octal number, start the number with a zero:

  0172 <-> 0157

You can also specify a character value directly:

  'c' <-> 0x94

Note that you should only use character values for ASCII characters, because if you use any extended characters, your character mapping file itself will not be portable and could be quite confusing.
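
The notations above can be summarized in a small Python sketch that converts a single token to its numeric code. The function name parse_char_value is hypothetical, and the sketch only mirrors the rules stated in this section.

```python
def parse_char_value(token):
    """Convert one character-value token to an integer code (illustrative)."""
    if len(token) == 3 and token[0] == "'" and token[2] == "'":
        # Single character in single quotes, e.g. 'c'
        return ord(token[1])
    if token.lower().startswith("0x"):
        # Hex, e.g. 0xf3
        return int(token, 16)
    if len(token) > 1 and token.startswith("0"):
        # Octal, e.g. 0172
        return int(token, 8)
    # Anything else is read as decimal.
    return int(token, 10)
```

For instance, parse_char_value("0xf3") yields 243, parse_char_value("0172") yields 122, and parse_char_value("'c'") yields 99.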

Any characters that you don't include in your mapping file will be mapped to the same character code. This is convenient when your native character set has a substantial correspondence to the internal character set that you want to use -- you can simply leave out any of the characters that are the same in both sets. It's also convenient because almost every character set is likely to include US ASCII as the first 127 characters, so these character values will probably never need to be mentioned in a mapping file.

HTML Named Entity Mappings

The character mapping file also allows you to specify the mappings that the standard TADS run-time uses for the named HTML character entities (the "&" sequences). The reason that these character mappings are specified separately from the code point translations is that the HTML entities do not necessarily map anywhere into your internal character set; HTML specifies the entity values using Unicode character values, which do not fit into the 8-bit character set that TADS uses.

You map the HTML named character entities directly to the native character set -- HTML entities completely bypass the internal character set. An HTML entity mapping uses a different format from the mappings between internal and native characters:

   &entity-name = native-char [native-char ...]

The "entity-name" is the HTML character name; for example, "Auml" is the HTML character name for a capital letter "A" with an umlaut. The "native-char" values are one or more native character codes, specified in the same manner as in the other mappings: as decimal, hex, or octal numbers, or as single characters enclosed in single quote marks.

Note that you can map an HTML character entity to one or more native characters. This allows you to provide approximations for characters that can't be mapped directly; for example, you could use the string "(c)" for a copyright symbol. Here are some HTML entity mappings from the DOS code page 437 mapping file:

   &copy = '(' 'c' ')'
   &trade = '(' 'T' 'M' ')'
   &Auml = 0216
   &auml = 0204
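
To show what these entity mappings amount to at display time, here is a minimal Python sketch that replaces named entities in a byte string with their native character sequences. It is not the TADS run-time's code; the names ENTITY_MAP and expand_entities are hypothetical, and the table holds only the four sample mappings above.

```python
import re

# Native (code page 437) byte sequences for the four sample entities.
ENTITY_MAP = {
    "copy": b"(c)",
    "trade": b"(TM)",
    "Auml": bytes([0o216]),
    "auml": bytes([0o204]),
}

ENTITY_PATTERN = re.compile(rb"&([A-Za-z][A-Za-z0-9]*);")


def expand_entities(text):
    """Replace known named entities in a byte string (illustrative only)."""
    def replace(match):
        name = match.group(1).decode("ascii")
        # Unknown entities are left untouched in this sketch.
        return ENTITY_MAP.get(name, match.group(0))
    return ENTITY_PATTERN.sub(replace, text)


native_text = expand_entities(b"&copy; 2002, M&auml;rz")
```

Note that the replacement goes straight from the entity name to native codes, with no step through the internal character set.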

Note that the HTML entity mappings that you specify are not used by HTML TADS, but only by the standard TADS run-time. HTML TADS is capable of displaying multiple character sets at the same time, so it does not need to translate characters into a single active character set the way that the standard TADS run-time must. Instead, HTML TADS displays each character using an appropriate character set, switching between character sets as needed.

If you use Unicode mapping files, the translation tool will automatically provide mappings for all of the HTML entities that correspond to the Unicode characters in the native mapping file. Since HTML entities are specified internally using their Unicode values, the Unicode native mapping file is sufficient to specify these mappings. However, you may still wish to supplement the Unicode mappings with additional mappings of your own, since you may want to provide approximations for characters that are not supported in the Unicode mapping.

System Issues: Code Pages and Fonts

This new TADS feature is entirely portable; TADS doesn't know anything about the actual character set encoding that your computer uses. So, it's up to you to choose the appropriate mapping based on the code page or font encoding that your system was using when you created your game's source file, or the one in use when you run a game.

Some systems allow you to change the character set encoding dynamically. For example, DOS lets you set the "code page," which is what DOS calls a character set encoding, using the CHCP command.

There may be cases where you need to change your system's character set as well as choose the TADS character set mapping.

For example, if you have a DOS system that's configured with US language settings, DOS uses the US code page (437). This code page has many of the ISO Latin-1 characters, but uses a different encoding. If you want to play a game that uses ISO Latin-1 as the internal character set, you'll need to use a translation table that maps between ISO Latin-1 and DOS code page 437.
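
To get a feel for how much of code page 437 corresponds directly to ISO Latin-1, the following Python sketch lists the reversible mappings between the two, using Python's built-in "cp437" and "latin-1" codecs as a stand-in for Unicode mapping files. This is only an exploratory aid under that assumption, not a replacement for the supplied mapping files; the function name reversible_mappings is hypothetical.

```python
def reversible_mappings(native_codec, internal_codec):
    """List 'native <-> internal' lines for codes 128..255 (illustrative)."""
    lines = []
    for native in range(128, 256):
        try:
            # Go through Unicode: native byte -> character -> internal byte.
            character = bytes([native]).decode(native_codec)
            internal = character.encode(internal_codec)[0]
        except UnicodeError:
            # No equivalent in the other character set; skip it.
            continue
        lines.append(f"{native} <-> {internal}")
    return lines


cp437_to_latin1 = reversible_mappings("cp437", "latin-1")
```

Characters that are skipped here, such as the code page 437 line-drawing characters, are the ones that would need default characters or hand-written approximations.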

Now, suppose you want to play a game written with internal character set ISO Latin-2. DOS code page 437 lacks most of the characters in ISO Latin-2, so there really isn't a way to create a translation between ISO Latin-2 and DOS code page 437. However, DOS has another code page, 852, that contains the Latin-2 characters. So, to play this game, you'd first use the DOS CHCP command to switch to code page 852, and then use a translation table that maps between ISO Latin-2 and DOS code page 852. (Switching code pages on DOS may require some additional system configuration; please refer to the chapter on localization in your DOS manual for details.)

On Windows, HTML TADS is capable of selecting a code page dynamically at run-time, so HTML TADS uses a different method for choosing the character table file. Instead of trying to find a mapping file that matches the current character set, HTML TADS simply looks for a file for the internal character set, and then chooses a native code page based on the file's settings. In order for this to work properly, the mapping file itself specifies the Windows code page that HTML TADS should use by specifying the code page number in the EXTRA_SYSTEM_INFO string that's stored in the mapping file.

[As this feature evolves and authors start creating games using standardized internal character sets, we should start gaining experience about how to handle character set selection on different platforms; we'll expand this documentation in the future to cover additional platforms.]
