Note: this is an archived copy of this message, which we're keeping here to avoid broken links in case the tads3 list archive moves its URL. If you want to see the original copy from the tads3 list archive, and view related messages from the thread it was posted to, you can find the original here.


Re: [Tads3] the language-specific interface


  • From: "Michel Nizette" <mnizette@xxxxxxxxx>
  • Subject: Re: [Tads3] the language-specific interface
  • Date: Sat, 4 Jun 2005 12:38:33 +0200
  • To: <tads3@xxxxxxxxxxx>

Steve Breslin wrote:

> But certainly Michel or Mike will be much better answering specifics than
> I; Michel has done a lot of work on translation [...]

Well, if thinking about the problem, requesting features, and then producing
nothing but vaporware is what you call "work", then yes, I have done a lot
of that.  :-)

But seriously, yes, I'd be happy to answer specifics to the best of my
ability.  I'm afraid there is no real substitute for studying the English
module source code in detail, but things should probably be easier if you
try to divide up the job into smaller tasks.  Don't try to get a
comprehensive picture of its workings all at once by reading en_us.t from
top to bottom; that would be overwhelming.  Instead, try to identify some
relatively independent functional blocks, and ask yourself specific
questions.

For example, the parser can be divided into the following functional
elements, which are rather well isolated from each other:

1.  The tokenizer, which takes an input string and divides it into a
sequence of tokens (words, numbers, punctuation...).

I believe that a tokenizer for Esperanto should be quite similar to the one
for English, if not simpler.  For example, the English tokenizer needs
special code to detect that in the sentence "look at Mary's hat", there is a
token boundary between "Mary" and "'s", even though there is no whitespace
between these two words.  In contrast, I believe that, in Esperanto, words
are always separated from each other with whitespace, so the English
tokenizer's apostrophe-s rule has no equivalent in Esperanto.  Can you find
the code for the English tokenizer in en_us.t, and for each rule it defines,
decide which one would be needed by the Esperanto tokenizer, and which one
would be irrelevant to Esperanto and can be discarded?  Once you have done
that, then essentially you know how to make a functional Esperanto
tokenizer.  That's one (small) part of the job done.
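To make the contrast concrete, here is a small Python sketch of the two tokenizers (this is only a conceptual illustration, not the actual rules from en_us.t, and the regular expression is an assumption about how the apostrophe-s split might work):

```python
import re

def tokenize_english(s):
    # Split on whitespace, then peel a trailing "'s" off into its own
    # token, so "Mary's" becomes the two tokens "mary" and "'s".
    tokens = []
    for word in s.lower().split():
        m = re.fullmatch(r"(\w+)('s)", word)
        if m:
            tokens.extend([m.group(1), m.group(2)])
        else:
            tokens.append(word)
    return tokens

def tokenize_esperanto(s):
    # If Esperanto words are always whitespace-separated, the whole
    # apostrophe-s rule disappears and plain splitting suffices.
    return s.lower().split()
```

So "look at Mary's hat" tokenizes to five tokens in the English version, while the Esperanto version never needs the extra rule.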

2.  The string comparator, which is used by the game dictionary, and is
responsible for deciding whether a token produced by the tokenizer matches a
given word in the dictionary.

In English, that's rather easy: essentially, a token and a dictionary word
match if they are identical (in a case-insensitive way).  Esperanto has
accented characters, so this may be a little bit more complicated.  Since
the Esperanto accents aren't easily accessible on everybody's keyboard, you
may decide to simply ignore accents in the comparison between the token and
the dictionary word, but you may also decide that a properly accented word
is a better match than an improperly accented one, and use this information
later in ranking the various possible command interpretations.
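A sketch of that graded comparison in Python (again just an illustration of the idea, not the library's StringComparator interface; the numeric quality levels are my own invention):

```python
import unicodedata

def strip_accents(word):
    # Decompose accented letters and drop the combining marks, so
    # Esperanto 'ŝ' compares equal to plain 's'.
    decomposed = unicodedata.normalize('NFD', word)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

def compare(token, dict_word):
    # Match quality: 2 = exact match (accents correct), 1 = match only
    # when accents are ignored, 0 = no match.  The quality could later
    # feed into ranking the possible command interpretations.
    token, dict_word = token.lower(), dict_word.lower()
    if token == dict_word:
        return 2
    if strip_accents(token) == strip_accents(dict_word):
        return 1
    return 0
```

The point is that the comparator doesn't just answer yes or no; it can report how good the match was, and the parser can use that later.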

3.  Grammar rules: these are used by the command pattern matcher (invoked
with parseTokens) to take a list of tokens as input and produce a list of
all the possible structural interpretations of the command.

For example, given the token list ['take', 'paintbrush', 'and', 'paint',
'pot'], the command pattern matcher would identify two possible structures:
1/ a single "take" command with a direct object list containing two noun
phrases, or 2/ a "take" command with a single direct object followed by a
"paint" command with a single direct object (assuming the game defines a
"paint" action).
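A toy enumerator in Python can show where the ambiguity comes from (this is only a sketch of the idea, nowhere near the library's real pattern matcher; the "verb + object words" predicate shape is a simplifying assumption):

```python
def parse(tokens, verbs):
    # A predicate is a verb followed by one or more object words, and
    # successive predicates are separated by "and".  An "and" left inside
    # the object words instead joins noun phrases -- which is exactly
    # what makes "take paintbrush and paint pot" ambiguous.
    def predicates(i):
        if i >= len(tokens) or tokens[i] not in verbs:
            return
        for j in range(i + 2, len(tokens) + 1):
            cmd = (tokens[i], tokens[i + 1:j])
            if j == len(tokens):
                yield [cmd]
            elif tokens[j] == 'and':
                for rest in predicates(j + 1):
                    yield [cmd] + rest

    return list(predicates(0))
```

Run on the example token list with 'take' and 'paint' as known verbs, it finds exactly the two structures described above.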

Grammar is a big part of the parser, but it can itself be divided into
several main blocks:

a.  Compound command grammar.

This is the set of rules that defines how several atomic commands (or
"predicates") can be combined to form a complex command.  The example above
uses the particular rule that says that successive predicates can be
separated by the word "and".

b.  Predicate grammar.

The English module takes the straightforward approach of hard-coding each
verb directly into the predicate grammar rules, via the VerbRule statements.
For example, you have a VerbRule that says that the word "take" followed by
a list of noun phrases is a valid predicate, another that says that the word
"give" followed by a list of noun phrases followed by "to" followed by a
single noun phrase is a valid predicate, and so on.  This approach works for
English because there are relatively few different possible orderings of the
words in the sentence, so that each VerbRule remains relatively short.

In Esperanto, though, I believe that the word ordering is much more
flexible: for example, I think you can equivalently say:

give / flower / to Mary
give / to Mary / flower
flower / give / to Mary
flower / to Mary / give
to Mary / give / flower
to Mary / flower / give

so that hard-coding each possible word ordering again and again for every
single VerbRule could become tedious.  Instead, you'd probably want to add
one level of abstraction and define all the allowed word orderings once and
for all, like this:

verb / direct object / indirect object
verb / indirect object / direct object
...

and so on for each of the six possible orderings, and then have separate
definitions that say: "give" is a verb, "show" is a verb, "throw" is a verb,
and so on.  Fortunately, the general parser framework allows this approach
as well.  There are a few complications when you start thinking about the
details, but you get the basic idea.
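The "define the orderings once" idea might look something like this in Python (a sketch only, with a hypothetical Esperanto verb 'donu' mapped to a made-up action name; the real library would do this through grammar productions, not a table):

```python
from itertools import permutations

# One abstract rule per phrase role; the six allowed word orderings are
# just the permutations of (verb, direct object, indirect object),
# written down once instead of repeated inside every verb's rule.
ROLES = ('verb', 'dobj', 'iobj')
ORDERINGS = list(permutations(ROLES))

VERBS = {'donu': 'Give'}   # hypothetical verb-to-action table

def match(phrases):
    # `phrases` is a list of (role, words) pairs from earlier grammar
    # stages; the predicate matches if its roles occur in any allowed
    # ordering and its verb is in the table.
    roles = tuple(role for role, _ in phrases)
    if roles not in ORDERINGS:
        return None
    slots = dict(phrases)
    return (VERBS.get(slots['verb']), slots['dobj'], slots['iobj'])
```

Adding a new verb then means adding one table entry, not six new rules.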

This is probably one area where language modules that use this approach are
going to diverge significantly from the English module, though -- and no
matter how comprehensive the TADS 3 manual is going to be, it can't possibly
cover the needs of every language in detail, so I think it's unavoidable
that each translator will have to figure out a number of things completely
on their own at some point, but hopefully these dark areas will remain
limited.

c.  Grammar for noun phrases (and similar things like topic phrases,
literals, compass directions, ...)

This set of rules also forms a big part of the parser, but one you can
treat relatively independently of the others, I think.

4.  Predicate-to-action resolution.  This is the part that maps a given
predicate structure to the corresponding action object.

Again, in English, this step is pretty straightforward, since each VerbRule
contains the verb hard-coded in the grammar, and makes the predicate
production match object derive directly from the associated Action subclass.
So, when the predicate production match object is asked to resolve itself
into an Action object, all it has to do is to return itself.

If you take the more indirect approach of defining the predicate grammar in
the abstract, without reference to a particular verb, then the predicate
production object must analyze its own structure (i.e., find out which verb
and preposition it contains) and then use this information to construct a
valid Action object programmatically.  The comments in the Library source
code offer suggestions about how to do that, but there is no real working
code to take inspiration from yet.
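The indirect approach amounts to something like this Python sketch (the Action classes and the table are hypothetical stand-ins for the library's Action subclasses, not its real API):

```python
class Action:
    def execute(self):
        raise NotImplementedError

class TakeAction(Action):
    def execute(self):
        return 'taken'

class GiveAction(Action):
    def execute(self):
        return 'given'

# With abstract predicate rules, the match object can't simply *be* the
# Action; it has to look at which verb it matched and build the Action
# from a table.
ACTION_TABLE = {'take': TakeAction, 'give': GiveAction}

def resolve_action(verb_token):
    cls = ACTION_TABLE.get(verb_token)
    return cls() if cls else None
```

In the English module the equivalent of resolve_action is trivial (return self); here it becomes a genuine lookup-and-construct step.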

5.  Noun phrase resolution and disambiguation.  Once you have a valid Action
object, you still need to determine which game objects this action applies
to, so the Action object must ask the noun phrase productions to resolve
themselves into game objects.

6.  Command ranking.  If you have all the possible Actions corresponding to
the command input string and enough information about the game objects they
apply to, then you can rank the different Actions using various comparison
criteria (for example, prefer a logical Action over an illogical one; prefer
a command interpretation where all words are properly accented, and so
on...), and then pick out the best interpretation.
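The ranking step itself is little more than a sort over the candidates; a Python sketch, where each candidate is a hypothetical record of the criteria mentioned above:

```python
def rank(candidates):
    # Each candidate records whether its action is logical and how many
    # of its words matched with correct accents.  Sort so the best
    # interpretation comes first: logical beats illogical, and among
    # equally logical readings, more properly accented words win.
    return sorted(candidates,
                  key=lambda c: (c['logical'], c['accented_words']),
                  reverse=True)
```

The interesting design work is in choosing and ordering the criteria, not in the sort itself.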

The six steps listed above can probably be treated as relatively independent
problems, which should simplify their analysis somewhat.  A similar analysis
and decomposition of the other parts of the English module (such as the part
that synthesizes text) would be the best way to approach them, I think.

I realize that I've just been babbling about general things that you have
probably figured out by yourself, so I'm not sure I've been really
helpful here; but if you have more specific questions, I can try to help.

--Michel.