GrammarProd

The GrammarProd intrinsic class is a specialized pattern-matching class that's designed to be used for implementing parsers. GrammarProd stands for "grammar production", which we'll define in a moment.

GrammarProd provides a "parser", but it's not what we think of as "the parser" in an IF context. GrammarProd's parser is essentially a robotic sentence diagrammer, a la elementary school grammar lessons. Its function is to take some concrete input text from the user and match it up to an abstract grammar. Elementary school sentence diagramming is exactly the same thing: you take a sentence, and you identify how it divides into phrases and how the phrases combine to form the overall structure of the sentence. A GrammarProd parser is a useful tool for building an IF parser, but it's only a small part of the overall parsing process; the rest of the parser must take the sentence diagrams that GrammarProd produces and imbue them with meaning.

When we talk about a "grammar", we're using the word in the computerese sense of a formal description of a language's structural syntax. In concrete terms, a grammar in TADS is a collection of grammar statements. Each grammar statement defines one element of a grammar, and how that element is composed of finer elements defined in other grammar statements. A collection of inter-related grammar statements amounts to a grammar.

GrammarProd objects are run-time objects representing the parts of the grammar. Each GrammarProd object corresponds to a set of grammar statements with the same object name.

For the most part, you won't create or manipulate GrammarProd objects directly. You'll usually create them implicitly using the compiler's grammar statement. At run-time, the most common way to use a GrammarProd object is to carry out a sentence diagramming operation, which you do by calling the parseTokens() method on the root GrammarProd object of your grammar.

Starting in TADS 3.1, it's possible to create and modify the elements of a grammar dynamically at run-time. We'll see more on this later.

You should #include <gramprod.h> in your source files that use GrammarProd objects and methods.

"Production" defined

We use the term "production" a lot in this chapter (not surprising, since it's part of the name of the class we're talking about!). This is a technical term in computer parsing. A production is an element of a grammar that's composed of smaller parts. For example, if we were creating a grammar for English, we might include a "sentence" production that consists of a subject, a predicate, and one or more noun phrases. Each of those parts would in turn be a production, consisting of smaller word groups. We'd eventually get down to the level of individual words (which are usually called "terminal" elements of the grammar, since they don't decompose into any finer parts).

Productions are so named because computer parsers are frequently designed to build grammatical structures by starting at individual words and working up through sequentially larger structures. Each time a set of words is recognized as a functional group, the parser produces the larger structure from the constituent parts, thus the larger structure is called a production.

One of the most important things about productions is that a production almost always has more than one way of being built. This is really the entire point of defining productions, because it allows us to recognize the different forms a particular syntactic element can take. For example, in English, there are many different kinds of noun phrases: a simple noun, an adjective followed by a noun, a pronoun, a possessive pronoun followed by a noun, a possessive pronoun followed by an adjective followed by a noun, a possessive prounoun followed by an adjective followed by another adjective followed by a noun; we could go on all day. Despite all of these different syntactic forms a noun phrase can take, though, we can label them all with the generic term "noun phrase". We can then plug the general notion of a "noun phrase" into structures, such as verb phrases or whole sentences. The point is that the larger structures don't have to worry about all the different noun phrase formats - we don't have to define "verb with possessive noun phrase", "verb with adjective noun phrase", "verb with prepositional noun phrase", etc - we just define "verb with noun phrase", and we magically have all of those variations by virtue of defining "noun phrase" as the collection of all of them.

The grammar statement

A grammar rule is defined using the grammar keyword. A grammar statement is mostly like an ordinary object or class definition, but with a few added elements:

grammar prodName [ ( tag ) ]  : rules : superclass [ , superclass ... ] 
  propsAndMethods
;

The optional tag is a symbol or number token enclosed in parentheses. This isn't required - but if it's present, it provides a way to distinguish the rule from other rules associated with the same prodName at run-time, and to refer to the rule in modify and replace statements. This tag is also included in the string in the first element returned by the grammarInfo() method, to distinguish a particular matched rule from other rules for the same production.

If the tag is present, the combination of the production name, tag, and parentheses forms the full name of the object. This combination must be globally unique, just as any other object name must be.

The list of superclasses and the list of properties and methods (propsAndMethods) are defined in exactly as for ordinary objects. The reason there's a superclass list and a property/method list is that the grammar statement actually does define an ordinary class, in addition to defining a GrammarProd object. The ordinary class that's defined has no name, but is otherwise like any other class.

The prodName specifies the name of the GrammarProd object. A given production name can occur in any number of grammar statements; a grammar statement does not uniquely define a production object, but simply adds one or more alternative syntax rules to the production. (That's why the tag is so useful - it lets us identify the individual grammar statements making up a single GrammarProd object.)

The rules section is a set of one or more syntax rules to be associated with the production. Each alternative list looks like this:

itemList [ | itemList ... ]

The vertical bar "|" (the same symbol used for the bitwise-OR operator) separates multiple item lists. Using a bar is equivalent to writing a separate grammar statement for each item list. The "|" syntax is usually a lot more concise than writing each rule as a separate statement, since all of the properties and methods you define for this grammar statement apply to each rule in the item list.

Each item list specifies a syntax rule for the production. An item list looks like this:

  [ qualifiers ]  [ item [ item ... ]  ]  [ * ]

Each item in the list can be one of the following:

A literal string, enclosed in single quotes. This simply matches the the text of an input token. If a default Dictionary object is in effect at compile-time when the rule is defined (via a dictionary statement), the compiler automatically enters the text into the Dictionary, associating it with the production object and the miscVocab property. Literals are matched using the same comparison rule as the Dictionary object used at the time of parsing; this means that any case folding, truncation, accent elisions, and other special matching rules that the Dictionary uses are applied in the same manner to grammar literals.
A literal string, enclosed in double quotes. This is identical to a string in single quotes. (Double quotes don't have their usual special meaning here, because there's only one kind of string in this context. This is intended mostly as a convenience for addAlt(), where the whole rule text is given as a single-quoted string: using double quotes for the quoted tokens within the string avoids the hassle of backslash-escaping the quote marks.)
A token type name - that is, an enum token symbol. A token type item tells the parser to match any input token of the given type. For example, the standard library tokenizer defines tokInt as the integer token type, so you can match any integer in the input with a tokInt item.
A dictionary property - a property previously declared with the dictionary property statement. This matches an input token that appears in the dictionary under the given property.
A list of dictionary properties enclosed in angle brackets (< >), and separated by spaces. This matches an input token that appears in the Dictionary under any of the listed properties.
A production object (a prodName symbol from another grammar statement, or even the name of the current statement's production object). This matches if any of the alternatives defined for the sub-production match. Note that you must not use the "tagged" version of the name here: you must use only the prodName part. The reason is that a reference to a sub-production inherently incorporates all of the different rules associated with the sub-production.
A group of alternative sub-lists enclosed in parentheses, with the alternative sub-lists separated by vertical bars (|).

Each item type can optionally be followed by an arrow symbol, -> (a hyphen followed by a greater-than sign), then a property name. If this sequence is present, it indicates that, when the parser successfully matches the item, it will store the matching value in the given property of the object created to represent the production match. For a token type or dictionary property item, the value stored in the property is simply the token value of the input token that matches the item. For a sub-production item, the value stored in the property is the object created to represent the sub-production match.

If an asterisk (*) is present, it must be the last element of the item list. This symbol indicates that any input tokens that remain in the input after the tokens that match the syntax rule up to this point should simply be ignored. In a sense, the * is a "wildcard" symbol that matches everything remaining in the input token list; however, you shouldn't think of it this way, because that's not really how it works. The * doesn't actually match anything; instead, it simply indicates that any remaining tokens should be ignored. If an alternative of the root production does not end with a * symbol, either directly in the rule of the alternative in the root production or indirectly in the last subproduction, the parser will match the alternative only if the root alternative matches the entire input token list. If the root alternative does end (directly or indirectly) with the * symbol, however, the parser will match the alternative even if extra input tokens remain after matching the alternative's items. The * symbol, if present, must always be the last item in an alternative's list.

The optional qualifiers, if present, specify additional information about the alternative. Only one qualifier is currently valid:

[ badness integer ]

This qualifier assigns the alternative a "badness" rating, which can be used to create catch-all syntax patterns that you don't want to use except as a last resort. The integer value gives the degree of badness; this value is meaningful only relative to other "badness" values assigned to other productions. When GrammarProd is considering more than one rule with badness for a possible match, it picks the one with the lowest badness first.

Assigning a badness rating tells the parser that the alternative should be ignored until all other alternatives are exhausted. This is especially useful for handling syntax errors in the user input, because it allows you to create alternatives that match anything in particular parts of the input, which helps pinpoint where the problem is, which in turn lets you give the user better feedback about the problem.

Using modify and replace with grammar rules

A grammar rule object can be replaced or modified by another grammar rule, just as a normal object can be replaced or modified, using the replace and modify keywords. For example, you might want to modify or replace a grammar rule when you're using a library, since the library might define general-purpose rules that don't exactly fit your needs.

To use modify with a grammar rule, use this syntax:

modify grammar prodName ( tag ) : rules :
  propsAndMethods
;

This is almost the same as the normal grammar syntax, but note that no class list follows the colon after the rule list. No superclasses are specified with modify, because a modified object always has the same superclass or superclasses as the original object being modified. Note also that the tag is required, because this provides the unique name for the match object that distinguishes it from other match objects defined for the same production name.

Note that the rules list is optional: if you leave it out (so you just put two colons in a row after the name), then the compiler retains the original rule list for the object being modified. This lets you override just the properties or methods of the grammar rule object, without changing any of the grammar it matches.

To use replace with a grammar rule, use this syntax:

replace grammar prodName ( tag ) : rules : superclass [ , superclass ... ] 
  propsAndMethods
;

This is exactly the same as a normal grammar definition, except that the replace keyword precedes the definition.

If you use replace or modify with a grammar rule, the original grammar rule is completely replaced by the new grammar rule. In this respect, replace and modify behave exactly the same way. The differences between the two are their treatment of the original object's property list (in the case of replace, the original list is completely lost; in the case of modify, the original properties are inherited by the modified object) and of the original's superclass (replace specifies a brand new superclass, and modify uses the original object's superclass).

If you want to delete an existing grammar rule entirely, you can use the replace syntax, and specify an unmatchable rule list. A rule list is unmatchable if it contains a token string that the tokenizer will never produce; for example, in most cases, tokenizers do not return spaces as tokens, so you can use the string ' ' as an unmatchable alternative:

replace grammar nounPhrase(1): ' ': object;

grammar defines two objects

The grammar statement is unusual in that it defines not one, but two separate objects.

First, a grammar statement defines - or adds to - a GrammarProd object. This object's name is given by the prodName in the grammar statement.

A grammar statement can "add to" a GrammarProd object. It's legal to write multiple grammar statements that define the same prodName, as long as they have distinct tag names. The compiler automatically gathers together all of these different definitions and rolls them into a single GrammarProd object in the final program.

Second, the statement defines a separate class called the match object class. This class doesn't have a symbol name that you can refer to in your program, so you can't, for example, define a subclass of it. However, for the purposes of modify and replace, its name is prodName(tag), and at run-time the same name is used in string form to identify the rule.

At run-time, you can call methods on the GrammarProd object. For example, the parseTokens() method is used to match an input string to the GrammarProd's rules. The parseTokens() method in turn returns a description of the match, if it can find one, that uses instances of the match object class to represent the precise grammar rules that the input was found to match.

Changing the grammar at run-time

Starting in TADS 3.1, you can add new grammar rules and change existing rules at run-time.

You can create an entirely new production object with the new operator:

local prod = new GrammarProd();

This creates an unnamed production with no rules. You can add rules to the new production with the addAlt() method. For example:

prod.addAlt('noun->n1', new NounPhraseProd(), cmdDict, symtab);

Once you have a new production set up, you can use it as the root of the grammar for parsing purposes, simply by calling parseTokens() on the production:

prod.parseTokens(toks, cmdDict);

In many cases, after you create a new production, you'll want to refer to it as a sub-production within other rules. Newly created rules are always defined symbolically, so the question is, how do we refer to an unnamed new object symbolically? The trick is that you can manually add the new object to your copy of the symbol table. A symbol table is just a LookupTable, after all, so you're free to invent new symbol names.

symtab['myNewProd'] = prod;
nounPhrase.addAlt('myNewProd->p1', new NounPhraseProd(), cmdDict, symtab);

addAlt() isn't limited to working with newly created productions. As you can see above, you can also use it to add to the grammar of productions defined statically in the program's source code, with grammar statements. What's more, the deleteAlt() and clearAlts() methods let you remove rules from an existing production. You can combine these methods to rewrite any part of the grammar on the fly.

Match objects for new rules

When you define rules statically with grammar, remember that the compiler automatically creates a "match object class" for you. This object is unnamed, but it's not completely invisible: whenever parseTokens() matches input to that grammar rule, it creates an instance of the match object, and returns that instance in the tree of objects representing the parsing match.

When you use addAlt() to create a new rule dynamically, there's no grammar statement involved, and the system doesn't automatically create a match object for you. But a match object is still needed - without a match object, there'd be no way to represent a match to the new rule in parseTokens() results. So where does the match object come from? It's up to you to create one, and pass it to addAlt() as a parameter.

You can in principle use any object for the match object in addAlt(), but there are two special guidelines you should keep in mind:

First, you should create a new object for each addAlt() call. addAlt() automatically adds some information to the match object to describe the alternatives it's associated with. It's important to keep this information separate for each addAlt() call. In particular, addAlt() adds a grammarAltProps property, containing a list of the "->" properties used in the rules being added.
Second, each dynamic match object should inherit from DynamicProd, which is a class defined in the library file gramprod.t. (You should include this file in your project.) DynamicProd is a simple mix-in class, so you can inherit from other classes as well. DynamicProd is important because it defines a grammarInfo() method for dynamic match objects parallel to the one that the compiler automatically builds for statically defined match objects.

GrammarProd methods

addAlt(alt, matchObj, dict?, symtab?)

Add an alternative or set of alternatives to the production.

alt is a string containing the alternative(s) to add. This uses the same syntax as "rules" list in a grammar statement. You can define multiple alternatives as usual using "|" symbols within the string.

matchObj is the match object class for the new rule(s). In a static grammar statement, the compiler creates the match object class automatically, as an unnamed object with the superclasses listed in the grammar statement. When you add rules with addAlt(), you must explicitly supply the match object. See above for guidelines on how to create this object.

The method adds the property grammarAltProps to matchObj. This property is set to a list of all "->" properties used within the alternatives defined for the match object. The method doesn't add grammarTag or grammarInfo properties, but if matchObj inherits from DynamicProd as recommended, it will inherit a grammarInfo method that works the same as in a statically defined match object. You can explicitly add a grammarTag property if you wish (it can be useful for debugging, for example), but it's not required.

dict is an optional Dictionary object, giving the dictionary associated with this grammar. If this is provided, and the new alternatives contain literal tokens, addAlt() automatically adds those literals to the dictionary. This keeps the dictionary in sync with the vocabulary used in dynamically added rules.

symtab is an optional lookup table containing the compiler's global symbols. If you use any symbol names in the alternative (such as property names or other GrammarProd object names as sub-productions), the symbol table is required for resolving those symbols. In most cases, you should simply use the same symbol table that t3GetGlobalSymbols() returns during preinit, since that reflects the global symbols defined in the program's source code.

clearAlts(dict?)

Delete all existing alternatives (token rules) in the production. This is equivalent to calling deleteAlt() for each alternative.

dict is an optional Dictionary object to be updated for the deletion. If this is provided and not nil, the dictionary will be updated to remove literal tokens associated with the production that are being deleted by this method. This keeps the dictionary in sync with the changes to the grammar.

deleteAlt(id, dict?)

Delete one or more alternatives (token rules) from the production. id specifies which alternative(s) to delete:

By tag: if id is a string, the method deletes each alternative whose match object's grammarTag property equals id. The compiler automatically sets the grammarTag property for each match object defined in a grammar statement to the statement's tag, so this makes it easy to delete all of the rules defined in a single grammar statement.
By match object class: if id is an object, the method deletes each alternative whose match object is either equal to id or is a subclass of id.
By index: if id is an integer, it gives the index of the alternative to delete. This corresponds to an index in the list returned by getGrammarInfo(); the first alternative's index is 1. This deletes one alternative.

dict is an optional Dictionary object to be updated for the deletion. If this is provided and not nil, the dictionary will be updated to remove any literal tokens associated with the production that are being deleted by this method. This keeps the dictionary in sync with the changes to the grammar.

getGrammarInfo()

The GrammarProd class provides access to the internal definition of a grammar production object via the getGrammarInfo() method. This gives you complete information on the grammar statements in the program. You could in principle use this information to write a replacement for parseTokens(), since the method provides access to all of the information available to the intrinsic class.

You'll notice as you read through the descriptions below of the data structures that the data structures map directly to the parts of the grammar statement. This is no accident; getGrammarInfo() in essence just returns a run-time representation of the same information that you define in source code using the grammar statement.

This method returns a list, containing zero or more objects of class GrammarAltInfo (this name isn't actually anything special to the intrinsic class; it's rather an "imported" class that the basic run-time library defines in the source file gramprod.t). Each GrammarAltInfo object in the list defines one alternative in the production's alternative list.

(Recall that an alternative consists of a qualifier (such as a [badness] value) and a series of tokens, and that multiple alternatives can be associated with one production, either by separating them with | symbols or by writing multiple grammar statements associated with the same named production. Each GrammarAltInfo object represents one complete group of tokens-a run of tokens between | symbols. Note that the compiler automatically "flattens" any parenthesized token groups to construct equivalent completely un-parenthesized token lists. For example, if you write a rule that says 'a' ('b' | 'c'), the compiler automatically converts this to the equivalent 'a' 'b' | 'a' 'c', leaving no parenthesized groups. This is why you won't find any representation of parenthesized groups in what getGrammarInfo() returns-there simply isn't any such thing after the compiler has finished processing the source code. This greatly simplifies the information representation at run-time.)

Here's what a GrammarAltInfo object looks like:

class GrammarAltInfo: object
  gramBadness = 0
  gramMatchObj = nil
  gramTokens = []
;

The gramBadness property gives the "badness" value for the alternative; this is the value specified in the [badness] qualifier in the grammar statement that defined the alternative. If no [badness] qualifier is present, this value will be zero.

The gramMatchObj property gives the "match object" for the alternative. This is the class that parseTokens() will instantiate to represent the match when the input token list is found to match the alternative - that is, this is the object that's defined directly by the grammar statement itself. (Recall that each grammar statement actually defines two objects: the GrammarProd object and the match class. The GrammarProd object is only indirectly defined, in that multiple grammar statements can add alternatives to the same GrammarProd object, hence any one grammar statement only partially defines the associated GrammarProd object. The match object is a TadsObject class that's uniquely defined by the grammar statement.)

The gramTokens property gives a list of the token slots making up the alternative. This is a list of GrammarAltTokInfo objects (as with GrammarAltInfo, this class name isn't anything special to the intrinsic class; it's simply imported from the library, which defines the class in gramprod.t). The list is in the same order as the tokens appear in the grammar statement, which is the order in which they're matched to an input token list.

Each GrammarAltTokInfo object describes one token slot. These correspond to the "items" in the "item lists" making up an alternative in a grammar statement. Here's what a GrammarAltTokInfo object looks like:

class GrammarAltTokInfo: object
  gramTargetProp = nil
  gramTokenType = nil
  gramTokenInfo = nil
;

The gramTargetProp property gives the property ID of the "target" property for the token slot. This is simply the property that appears following a -> symbol in a grammar statement. When GrammarProd.parseTokens() finds a match to a production, it constructs a match object to represent the match and then sets the property indicated here in the match object to the actual matched value for the token slot.

The gramTokenType property gives the type of value this token slot matches, and gramTokenInfo gives extra information that depends on the type. This type will be one of the following (these values are defined in the system header file gramprod.h):

GramTokTypeProd: the token slot is a reference to a sub-production (that is, another GrammarProd object), and the slot matches the corresponding part of the input token list if and only if the sub-production matches the input tokens. For this type of token slot, gramTokenInfo contains a reference to the sub-production (a GrammarProd object).
GramTokTypeSpeech: the token slot matches a specific "part of speech." This means that the token slot matches a single input token, and matches if and only if the input token has the same part of speech as the token slot. A part of speech is simply a "dictionary property" value, and a token has a given part of speech if it appears in the Dictionary passed to parseTokens() under the same dictionary property value. The gramTokenInfo in this case will be the property ID of the dictionary property (e.g., &noun).
GramTokTypeNSpeech: the token slot matches any of several parts of speech. This is just like GramTokTypeSpeech, but the slot will match any one of a list of property values. The gramTokenInfo value in this case will be a list of property IDs.
GramTokTypeLiteral: the token slot matches a literal string. In this case, an input token will match the token slot only if the input token compares equal to the literal, using the Dictionary's string comparator. In this case, the gramTokenInfo value is a string giving the literal text to be matched.
GramTokTypeTokEnum: the token slot matches a token type. This means that an input token will match the token slot only if the input token has the given token type. A token type is simply an "enum" value defined with the "enum token" syntax; each token in an input list has a token type, assigned by the tokenizer and stored in the standard token list format. In this case, the gramTokenInfo value will be the "enum" value giving the token type that the slot matches.
GramTokTypeStar: the token slot matches all remaining input tokens. This is used to create an alternative that allows a match with more tokens left after the end of the alternative's explicitly listed tokens. There's no extra information for this token slot type, so gramTokenInfo is simply nil in this case.

You might be wondering what the difference is between GrammarProd.getGrammarInfo() and the match object's grammarInfo() method. The difference is that GrammarProd.getGrammarInfo() gives you information on the grammar itself, whereas the match object's grammarInfo() gives you information on the matched syntax - that is, it tells you how a particular input token list matched a particular grammar using parseTokens().

So, the match object's grammarInfo() gives you information on the syntax of a particular input token list. In contrast, GrammarProd.getGrammarInfo() tells you about the grammar itself, independent of any input token list; it returns a direct representation of the information defined in grammar statements that appeared in the source code.

parseTokens(tokenList, dict)

This method matches input, in the form of a list of tokens, to the GrammarProd object's rule list.

tokenList is a list of input tokens to match to the rule. (This can be a list, a vector, or a list-like object.) Each entry in the list is a token specifier, which is a sublist consisting of at least two elements: the first element is the value of the token, and the second element is the type of the token. Each token's sublist can include more elements if desired; the parser will preserve the additional elements, but doesn't currently use any beyond the first two. This format is compatible with the format produced by the system library's Tokenizer class, so you can simply feed a token list produced by a Tokenizer object directly into the parseTokens() method.

dict is a Dictionary object, or nil if no dictionary is to be used. If this argument is not nil, it must be an object of intrinsic class Dictionary. If any of the syntax rules that the parser encounters include dictionary property items, the parser will look up the corresponding token in the given dictionary to determine the properties under which the word is defined. A dictionary is not required if no dictionary properties are used in the syntax rules to be examined. The dictionary also specifies how literals in grammar rules are matched to tokens: literals are matched using the Dictionary's "comparator" object, so that literals are matched using the same rules that the Dictionary uses to match tokens against dictionary words.

You always call this method on the "root" production of the grammar you wish to match. There's nothing special about a root production object - you can use any production here. For example, if you want to parse just a noun phrase, you can call parseTokens() on your noun phrase production object, even if the noun phrase production is used as a sub-production in other rules.

This method returns a list of matches. If the return list is empty, it indicates that there were no matches at all. Otherwise, each entry in the list is the top object of a "match tree."

A match tree is a tree of objects that the parser dynamically creates to represent the syntax structure of the match. Each object is an instance of one of the match object classes defined in a grammar statement. Each of these objects has the properties that appear after arrow symbols (->) in the grammar item list set to the actual values from the input token list.

Finding the original tokens

As parseTokens() builds the match tree, it sets properties of each match object to indicate the indices of the first and last tokens involved in the match; these bounds are inclusive. The properties the parser sets are called firstTokenIndex and lastTokenIndex.

In addition, parseTokens() automatically sets the tokenList property of each match tree object to a reference to the original token list passed into parseTokens(). So, for a given match tree object match, the tokens matching the production can be obtained as follows:

toks = match.tokenList.sublist(
   match.firstTokenIndex, 
   match.lastTokenIndex - match.firstTokenIndex + 1);

In addition to the token list, parseTokens() stores a list of "match results" in each object in the match tree, in the property tokenMatchList. This list gives the result of the matchValues() method in the Dictionary's comparator object for each token that matched a literal in a grammar rule. Each element of this list gives the match result for the corresponding element of the token list, so tokenMatchList[3] gives the matchValues() result for the third token. If a token matches a dictionary property rather than a literal, its tokenMatchList entry will be nil, since matchValues() is not used to match such tokens.

The tokenMatchList information can be used to find out how well a particular token matched a grammar literal. For example, this can be used to determine if the token matched with truncation, or with accent substitution using an equivalence mapping (see the StringComparator class for more details on these types of matches).

(Note: to be precise, the parser uses the properties exported under the global names "GrammarProd.firstTokenIndex" and "GrammarProd.lastTokenIndex" for the indices, "GrammarProd.tokenList" for the token list, and "GrammarProd.tokenMatchList" for the token match result list. Since the GrammarProd header file, <gramprod.h>, defines these exports, most users can ignore this detail.)

grammarTag

Each match object class defines the property grammarTag as a string value containing the "tag" for the grammar rule that defined it. This property definition is automatically supplied by the compiler for each match object defined by a grammar statement.

Note that when you add a grammar rule dynamically with the addAlt() method, this property isn't automatically added to the match object class you supply. You can manually define the property on the match object if you wish.

grammarInfo

The compiler automatically generates a method called grammarInfo() for each match object class. This method provides information that allows the program to traverse a match tree without knowing anything about the structure of the tree, which is useful for debugging as well as for searching a tree for particular types of matches.

Note that when you add a grammar rule dynamically with the addAlt() method, the grammarInfo() method isn't automatically added to the match object class you supply. You must manually define this method if you want it to be available for a dynamically added rule.

The grammarInfo() method returns a list. The first element in the list is the name of the match, which is simply the name of the production, plus the tag if one was specified. Each subsequent element is the value of one of the properties used after an arrow (->) in the production's rule or rules; the properties appear in the list in the same order they are specified in the rule. If the rule contains multiple alternatives, each property appears only once in the list.

For example, suppose we define a rule like this:

grammar nounPhrase(1): adjective->adj_ noun->noun_ : object;

Now, suppose this rule matched the input "magic book." The grammarInfo() method for the match object would look like this:

['nounPhrase(1)', 'magic', 'book']

The first element is the production name with its tag. The second element is the value of the adj_ property, which in this case is the literal token matched; likewise, the third element is the value of the noun_ property, which is another literal token.

The grammarInfo() method can be used to write generic routines that traverse arbitrary match trees. For example, we could write a simple routine to display, for debugging purposes, the contents of a match tree:

showGrammar(match, indent)
{
  local info;

  /* indent by the desired amount (two spaces per level) */
  for (local i = 0 ; i < indent ; ++i)
    "\ \ ";

  /* if it's not a sub-production, treat it specially */
  if (match == nil)
  {
    /* this tree element isn't used - skip it */
    return;

  }
  else if (dataType(match) == TypeSString)
  {
    /* it's a literal token match - show it */
    "'<<match>>'\n";
    return;
  }

  /* get the grammar info for the object */
  info = match.grammarInfo();

  /* show the production rule name, and the original text it matched */
  "<<info[1]>> [<<showGrammarText(match)>>]\n";

  /* show each sub-match */
  for (local i = 2 ; i <= info.length ; ++i)
    showGrammar(info[i], indent + 1);
}

/* show the text of a match tree item */
showGrammarText(match)
{
  for (local i = match.firstTokenIndex ; i <= match.lastTokenIndex ; ++i)
  {
    /* show a space before each entry except the first */
    if (i != match.firstTokenIndex)
      " ";

    /* show this token's text */
    "<<match.tokenList[i]>>";
  }
}

Sample Grammar Rules

Let's define some grammar rules for a very simple noun phrase parser. For the purposes of this example, we'll keep things very simple: our noun phrases will consist of a noun, or an adjective and a noun, or an adjective and an adjective and a noun, or so on.

The simplest way we could define these rules is by enumerating all of the different phrasings we could use. So, we could write something like this:

grammar nounPhrase: noun->noun_: object;
grammar nounPhrase: adjective->adj_ noun->noun_: object;
grammar nounPhrase: adjective->adj1_ adjective->adj2_ noun->noun_:
  object;
grammar nounPhrase: adjective->adj1_ adjective->adj2_
  adjective->adj3_ noun->noun_: object;

And so on, up to some fixed limit to the number of adjectives we'll allow.

Now, this will work, but it's not very flexible. If we stop at six adjectives, users might complain that they can't use eight. We could add phrasings for seven and eight adjectives, but then users might complain about ten. Our work would never end.

Fortunately, there is a more general approach. The recursion-minded reader might have by now observed that we could in principle express a rule for an unlimited number of adjectives by saying that a noun phrase is a noun, or an adjective followed by a noun phrase. This neatly subsumes any number of adjectives: with one adjective, we have a noun phrase consisting of an adjective followed by a noun phrase consisting of a noun; with two adjectives, we have an adjective followed by a noun phrase which consists of an adjective followed by a noun phrase which consists of a noun; and so on.

Not only can we express our noun phrase syntax this way in principle, we can express it this way in fact. This is precisely the kind of thing at which our production scheme excels. Here's the simpler and much more flexible noun phrase grammar:

grammar nounPhrase: noun->noun_: object;
grammar nounPhrase: adjective->adj_ nounPhrase->np_: object;

That's it - this completely solves our problem, no matter how many adjectives our verbose user types. Suppose the user types "little red wagon": we first match the second alternative (with adjective "little"), leaving "red wagon" still to be parsed; we then match the second alternative again (with adjective "red"), leaving just "wagon" remaining; and finally we match the first alternative (with noun "wagon"). Here's how we actually call the parser to parse this:

match = nounPhrase.parseTokens(tokList, gDict);

(Where do we get the tokList value to pass into the method? We can use the standard run-time library class Tokenizer, or a subclass, and call its tokenize() method to produce a token list from a simple string we want to parse.)

So, what good is all of this? Yes, we've built a set of rules that define a syntax structure, and we know how to call the parser to scan the rules. However, how do we actually use this for anything? What do we know, apart from whether or not an input token list matches the syntax? It turns out we know a great deal.

When the parser matches our syntax lists, it creates "match trees" that give us the details of exactly how the input tokens are structured, according to our grammar. The match tree consists of objects that the parser creates dynamically; these objects are instances of our grammar objects-not of the production objects, but of the unnamed "processor" objects that go with the grammar statement.

Let us, for the moment, assign some arbitrary labels to our processor objects. These aren't really class names - we're just using these to keep track of what's going on:

grammar nounPhrase: noun->noun_: object; // pseudo-class "A"
grammar nounPhrase: adjective->adj_ nounPhrase->np_: 
  object; // pseudo-class "B"

When we call the parseTokens() method, the parser builds up match trees consisting of instances of "A" and "B". For our "little red wagon" example, here's what the match tree looks like, assigning arbitrary names to the objects (which wouldn't really happen, since they're dynamically created):

obj1: B  adj_ = 'little'  np_ = obj2 ;
obj2: B  adj_ = 'red'     np_ = obj3 ;
obj3: A  noun_ = 'wagon' ;

The "root" of the match tree is obj1. This example might make it more apparent why we call this a "tree": obj1 is the root, with an adjective on one branch and obj2 on the other; obj2 in turn branches to an adjective and obj3; and obj3 has a noun.

The next step in making these match trees useful is to give them some methods.

The fact that a nounPhrase can be used in syntax anywhere a noun phrase is required means that all of the different alternative forms of noun phrases should provide a common interface. Clearly, they don't right now: each alternative has its own set of properties. However, these properties are not meant as a public interface; they're intended only for the use of the specific alternative processor object. What we must now do is define a public interface, which provides a common set of functionality for every nounPhrase alternative, and then specify an alternative-specific implementation of the public interface; the implementation effectively provides the bridge from the alternative-specific properties to the common information that any noun phrase must provide. For now, though, let's define a very simple public interface that simply provides a debugging display of the object. To do this, we'll define a method, debugPrint(), that any nounPhrase processor object must provide. Here's how we could implement this:

grammar nounPhrase: noun->noun_: object
  debugPrint() { "noun phrase: noun = <<noun_>>\n"; }
;
grammar nounPhrase: adjective->adj_ nounPhrase->np_: object
  debugPrint()
  {
    "noun phrase: adjective = <<adj_>>, nounPhrase = {\n";
    np_.debugPrint();
    "}\n";

  }
;

Note how the public interface method is implemented in every noun phrase processor object, but the actual implementation varies according to the underlying information in the processor object. We actually take advantage of this in the recursive call to np_.debugPrint() in the second alternative: we know for a fact that np_ refers to a noun phrase processor object, because our syntax says so, so we can call its debugPrint() method without having to know what kind of nounPhrase it is. This is important in our "little red wagon" example, because we'll actually have both kinds of nounPhrase alternatives present.

Here's some code that will parse a noun phrase and then show the debugging display for the result:

local str;
local toks;
local match;

/* ask for a string of input */
">";
str = inputLine();

/* tokenize the string with the standard tokenizer */
toks = Tokenizer.tokenize(str);

/* parse the tokens */
match = nounPhrase.parseTokens(toks, gDict);

/* display the match tree */
for (local i = 1, local cnt = match.length() ; i <= cnt ; ++i)
{
  "Match #<<i>>:\n";
  match[i].debugPrint();
}

For our "little red wagon" example, we'll see something like this:

Match #1:
noun phrase: adjective = little, nounPhrase = {
noun phrase: adjective = red, nounPhrase = {
noun phrase: noun = wagon
}
}

Exported class construction

The getGrammarInfo() method depends upon the program to export the GramAltInfo and GramAltTokInfo classes. In most cases, you won't have to write these classes yourself, because you can simply use the implementations provided in the basic run-time library, in the source file gramprod.t. For reference purposes, we'll describe what GrammarProd requires of these classes.

First, the classes must be exported under the names 'GrammarProd.GrammarAltInfo' and 'GrammarProd.GrammarAltTokInfo', respectively. Second, GrammarProd requires the classes to define constructors that accept particular argument lists, because getGrammarInfo() uses the constructors to pass the grammar description information to the classes. The parameter lists for the constructors are as follows:

GrammarAltInfo.construct(score, badness, matchObj, tokList)
GrammarAltTokInfo.construct(prop, typ, info)

The arguments are:

score: an integer value; this isn't currently used and can be discarded
badness: an integer giving the [badness] attribute for the alternative
matchObj: an object reference giving the match object class for the alternative
tokList: a list of GrammarAltTokInfo objects
prop: a property ID giving the target property of the token slot
typ: an integer giving the token slot type (one of the GramTokTypeXxx values)
info: extra information that depends on the token slot type; this is the same information described earlier for the GrammarAltTokInfo.gramTokenInfo property

To ensure compatibility with any future extensions to the GrammarProd class, the implementations in the run-time library use variable argument lists in their constructors-they use the argument lists shown above, but add a "..." symbol at the end so that they'll accept arbitrary additional arguments. This ensures that code compiled with the current version of the library will continue to run under any future VMs that include updated GrammarProd classes that pass additional arguments to the GrammarAltInfo and/or GrammarAltTokInfo constructors.

In the run-time library implementation of the classes, the constructors simply store the argument values in the corresponding instance properties, so that program code can refer to it as needed.