Note: this is an archived copy of this message, which we're
keeping here to avoid broken links in case the tads3 list archive
moves its URL. If you want to see the original copy from the tads3
list archive, and view related messages from the thread it was posted
to, you can find the original here.
Re: [Tads3] the language-specific interface
- From: "Michel Nizette" <mnizette@xxxxxxxxx>
- Subject: Re: [Tads3] the language-specific interface
- Date: Sat, 4 Jun 2005 12:38:33 +0200
- To: <tads3@xxxxxxxxxxx>
Steve Breslin wrote:

> But certainly Michel or Mike will be much better answering specifics
> than I; Michel has done a lot of work on translation [...]

Well, if thinking about the problem, requesting features, and then producing nothing but vaporware is what you call "work", then yes, I have done a lot of that. :-)

But seriously, yes, I'd be happy to answer specifics to the best of my ability. I'm afraid there is no real substitute for studying the English module source code in detail, but things should be easier if you divide the job into smaller tasks. Don't try to get a comprehensive picture of its workings all at once by reading en_us.t from top to bottom; that would be overwhelming. Instead, try to identify some relatively independent functional blocks, and ask yourself specific questions.

For example, the parser can be divided into the following functional elements, which are fairly well isolated from each other:

1. The tokenizer, which takes an input string and divides it into a sequence of tokens (words, numbers, punctuation, ...). I believe that a tokenizer for Esperanto should be quite similar to the one for English, if not simpler. For example, the English tokenizer needs special code to detect that in the sentence "look at Mary's hat", there is a token boundary between "Mary" and "'s", even though there is no whitespace between these two words. In contrast, I believe that in Esperanto, words are always separated from each other by whitespace, so the English tokenizer's apostrophe-s rule has no equivalent in Esperanto. Can you find the code for the English tokenizer in en_us.t and, for each rule it defines, decide whether it would be needed by the Esperanto tokenizer, or whether it is irrelevant to Esperanto and can be discarded? Once you have done that, you essentially know how to make a functional Esperanto tokenizer. That's one (small) part of the job done.
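To make the apostrophe-s point concrete, here is a rough sketch of a rule-based tokenizer in Python (not TADS 3 -- the real tokenizer is defined by rule lists in en_us.t, and the rule names and patterns below are illustrative assumptions only):

```python
import re

# Ordered token rules, roughly in the spirit of a rule-based tokenizer.
# The apostrophe-s rule splits "Mary's" into "Mary" + "'s"; an Esperanto
# tokenizer could simply drop that rule, since Esperanto words are
# always whitespace-separated.
TOKEN_RULES = [
    ("apostrophe-s", re.compile(r"'s")),
    ("word",         re.compile(r"[A-Za-z]+")),
    ("number",       re.compile(r"[0-9]+")),
    ("punctuation",  re.compile(r"[.,;!?]")),
]

def tokenize(text):
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():        # whitespace separates tokens
            pos += 1
            continue
        for name, pattern in TOKEN_RULES:
            m = pattern.match(text, pos)
            if m:
                tokens.append(m.group())
                pos = m.end()
                break
        else:
            pos += 1                   # skip anything unrecognized
    return tokens
```

With the apostrophe-s rule in place, tokenize("look at Mary's hat") produces ['look', 'at', 'Mary', "'s", 'hat']; remove that one rule and you have the whitespace-only behavior an Esperanto tokenizer would want.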
2. The string comparator, which is used by the game dictionary and is responsible for deciding whether a token produced by the tokenizer matches a given word in the dictionary. In English, that's rather easy: essentially, a token and a dictionary word match if they are identical (in a case-insensitive way). Esperanto has accented characters, so this may be a little more complicated. Since the Esperanto accents aren't easily accessible on everybody's keyboard, you may decide to ignore accents entirely in the comparison between the token and the dictionary word, but you may also decide that a properly accented word is a better match than an improperly accented one, and use this information later when ranking the various possible command interpretations.

3. Grammar rules: these are used by the command pattern matcher (invoked with parseTokens) to take a list of tokens as input and produce a list of all the possible structural interpretations of the command. For example, given the token list ['take', 'paintbrush', 'and', 'paint', 'pot'], the command pattern matcher would identify two possible structures: (1) a single "take" command with a direct object list containing two noun phrases, or (2) a "take" command with a single direct object followed by a "paint" command with a single direct object (assuming the game defines a "paint" action).

Grammar is a big part of the parser, but it can itself be divided into several main blocks:

a. Compound command grammar. This is the set of rules that defines how several atomic commands (or "predicates") can be combined to form a complex command. The example above uses the particular rule that says that successive predicates can be separated by the word "and".

b. Predicate grammar. The English module takes the straightforward approach of hard-coding each verb directly into the predicate grammar rules, via the VerbRule statements.
For example, you have a VerbRule that says that the word "take" followed by a list of noun phrases is a valid predicate, another that says that the word "give" followed by a list of noun phrases followed by "to" followed by a single noun phrase is a valid predicate, and so on. This approach works for English because there are relatively few possible orderings of the words in the sentence, so each VerbRule remains relatively short. In Esperanto, though, I believe that the word ordering is much more flexible; for example, I think you can equivalently say:

give / flower / to Mary
give / to Mary / flower
flower / give / to Mary
flower / to Mary / give
to Mary / give / flower
to Mary / flower / give

so that hard-coding each possible word ordering again and again for every single VerbRule could become tedious. Instead, you'd probably want to add one level of abstraction and define all the allowed word orderings once and for all, like this:

verb / direct object / indirect object
verb / indirect object / direct object

... and so on for each of the six possible orderings, and then have separate definitions that say: "give" is a verb, "show" is a verb, "throw" is a verb, and so on. Fortunately, the general parser framework allows this approach as well. There are a few complications when you start thinking about the details, but you get the basic idea. This is probably one area where language modules that use this approach are going to diverge significantly from the English module, though -- and no matter how comprehensive the TADS 3 manual is going to be, it can't possibly cover the needs of every language in detail. So I think it's unavoidable that each translator will have to figure out a number of things completely on their own at some point, but hopefully these dark areas will remain limited.
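The "define the orderings once and for all" idea can be illustrated with a toy matcher in Python (purely illustrative -- the real mechanism would be built from TADS 3 grammar productions, and the vocabulary table below is an invented example): one table of orderings shared by every verb, plus separate verb definitions.

```python
from itertools import permutations

# Hypothetical vocabulary tables; in a real language module these would
# live in the dictionary and the grammar definitions instead.
VERBS = {"donu": "Give", "montru": "Show", "jxetu": "Throw"}
PREPOSITION = "al"   # marks the indirect object, like English "to"

# All six allowed orderings of the three phrase roles, defined once,
# instead of being repeated inside every VerbRule.
ORDERINGS = list(permutations(["verb", "dobj", "iobj"]))

def classify(phrase):
    """Map a phrase to its grammatical role."""
    if phrase in VERBS:
        return "verb"
    if phrase.startswith(PREPOSITION + " "):
        return "iobj"
    return "dobj"

def match_predicate(phrases):
    """Return the (action, direct object, indirect object) structure
    if the phrase sequence fits one of the allowed orderings, else None."""
    roles = tuple(classify(p) for p in phrases)
    if roles not in ORDERINGS:
        return None
    by_role = dict(zip(roles, phrases))
    return (VERBS[by_role["verb"]],
            by_role["dobj"],
            by_role["iobj"].split(" ", 1)[1])
```

Both match_predicate(["donu", "floron", "al Maria"]) and match_predicate(["al Maria", "floron", "donu"]) resolve to the same ("Give", "floron", "Maria") structure, which is exactly the flexibility the six-orderings table is meant to buy.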
c. Grammar for noun phrases (and similar things like topic phrases, literals, compass directions, ...). This set of rules also forms a big part of the parser, but one you can treat relatively independently of the others, I think.

4. Predicate-to-action resolution. This is the part that maps a given predicate structure to the corresponding action object. Again, in English, this step is pretty straightforward, since each VerbRule contains the verb hard-coded in the grammar, and makes the predicate production match object derive directly from the associated Action subclass. So, when the predicate production match object is asked to resolve itself into an Action object, all it has to do is return itself. If you take the more indirect approach of defining the predicate grammar in the abstract, without reference to a particular verb, then the predicate production object must analyze its own structure (i.e., find out which verb and preposition it contains) and then use this information to construct a valid Action object programmatically. The comments in the library source code offer suggestions about how to do that, but there is no real working code to take inspiration from yet.

5. Noun phrase resolution and disambiguation. Once you have a valid Action object, you still need to determine which game objects this action applies to, so the Action object must ask the noun phrase productions to resolve themselves into game objects.

6. Command ranking. Once you have all the possible Actions corresponding to the command input string, and enough information about the game objects they apply to, you can rank the different Actions using various comparison criteria (for example, prefer a logical Action over an illogical one; prefer a command interpretation where all words are properly accented; and so on), and then pick out the best interpretation.

The six steps listed above can probably be treated as relatively independent problems, which should simplify their analysis somewhat.
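Steps 2 and 6 interact in the way suggested under the string comparator: the comparator can record whether a match was exact or merely accent-insensitive, and the ranking step can prefer interpretations whose words were properly accented. Here is a small Python sketch of that idea (the scoring scheme and the rank() criteria are invented for illustration, not the library's actual ranking machinery):

```python
import unicodedata

def strip_accents(word):
    """Remove diacritics, e.g. 'ĉambro' -> 'cambro'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed
                   if unicodedata.category(ch) != "Mn")

def compare(token, dict_word):
    """Return a match quality: 2 = exact match (accents and all),
    1 = matches only once accents are ignored, 0 = no match."""
    if token.lower() == dict_word.lower():
        return 2
    if strip_accents(token).lower() == strip_accents(dict_word).lower():
        return 1
    return 0

def rank(interpretations):
    """Pick the best interpretation: logical beats illogical, and a
    higher accent score breaks ties -- an invented stand-in for the
    real ranking pass."""
    return max(interpretations,
               key=lambda i: (i["logical"], i["accent_score"]))
```

So a player who types the unaccented "cambro" still matches the dictionary word "ĉambro" (score 1), but a properly accented command outranks it (score 2) when both readings are otherwise equally logical.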
A similar analysis and decomposition of the other parts of the English module (such as the part that synthesizes text) would be the best way to approach them, I think.

I realize that I've just been babbling on about general things that you have probably figured out by yourself, so I'm not sure I've been really helpful here; but if you have more specific questions, I can try to help.

--Michel.