Tokenizing

"Tokenizing" is the process of scanning a string of characters, such as a line of text that the user types at a command prompt, and converting the character string into a list of words and punctuation marks.  Each item in this list is called a "token."  During parsing, we wish to deal with tokens, not directly with the original character string; it's much easier and faster to work with tokens.  To parse a string, we must find word boundaries, skip whitespace, and find matching delimiters (such as quotes and parentheses); we do all of this work in advance, when we tokenize the string, so that we don't have to do it repeatedly while analyzing the syntax of the command.


TADS 3 has no built-in tokenizer.  Instead, the standard library provides a class called "Tokenizer" that does this job.  An author can create a custom tokenizer, if desired, but in most cases this shouldn't be necessary, because the standard Tokenizer class allows for fairly extensive customization with a declarative set of "rules."

Calling the Tokenizer

To use the Tokenizer class, include the header file "tok.h" in your source code and link "tok.t" into your program.  To use the default rules defined in the class, simply use the class directly; to tokenize a string, make a call like this:


  local str, tokList;
 
  str = inputLine();
  tokList = Tokenizer.tokenize(str);


The tokenize() method scans the string and converts it into a list of tokens.  The return value is a list consisting of two sublists: the first is a list of token strings, and the second is a list of the corresponding token types.  The two sublists always have the same length, since the element at any given index in the first sublist corresponds to the type at the same index in the second.


A "token type" is simply an enum token value.  The default Tokenizer rules produce tokens of type tokPunct (punctuation marks), tokWord (words), tokString (strings), and tokInt (integer numbers).


The following code displays the text of each token in a string:


  for (local i = 1, local cnt = tokList[1].length() ; i <= cnt ; ++i)
    "[<<i>>] = <<tokList[1][i]>>\n";

Customizing the Tokenizer

You can customize the rules the Tokenizer class uses.  To do this, subclass Tokenizer and override the rules_ property.  This property's value must be a list of lists.  Each sublist consists of four elements: a regular expression, specifying a pattern to match; a token type, which is the enum token value to assign to tokens matching the regular expression; an integer giving flags for the rule; and a conversion rule, specifying how the token text stored in the result list is obtained.
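
For instance, an illustrative rule of this shape (our own, not one of the defaults) that matches a run of letters and stores the matching text unchanged as a tokWord token would look like this:

  ['[a-zA-Z]+', tokWord, 0, nil]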


There is currently only one flag defined (in tok.h): TOKFLAG_SKIP.  If this flag is included in a rule's flags, it indicates that the rule does not add anything to the result list.  Instead, when the tokenizer matches text to this rule, it simply discards the matching text.  This is useful for skipping characters that are meaningful only for separating other tokens, such as whitespace characters.


The conversion rule can be nil, a string, or a property pointer.  If the conversion rule is nil, then the token text stored in the result list will simply be the exact text of the input string that matches the regular expression.  If the rule is a string, it specifies a replacement string, using the same rules as reReplace(), that is applied to the matching text; the result of the replacement is stored in the result list.  If the conversion rule is a property pointer, it specifies a property (of the Tokenizer object) to be evaluated to yield the value to be stored in the result list; this property is passed the matching text of the input string as its argument, and must return the string value to be stored in the result list.
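
To illustrate the property pointer case, here's a hypothetical subclass (the pattern and the property name are our own inventions) that stores word tokens in lower case:

  LowerCaseTokenizer: Tokenizer
    rules_ =
    [
      /* skip whitespace */
      ['[ \t]+', nil, TOKFLAG_SKIP, nil],

      /* words - store the text in lower case via our conversion property */
      ['[a-zA-Z]+', tokWord, 0, &lowerCaseWord]
    ]

    /* conversion rule: receives the matching text, returns the value to store */
    lowerCaseWord(txt) { return txt.toLower(); }
  ;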


The sublists in the rules list are specified in order of priority.  The tokenizer starts with the first rule; if its regular expression matches, the tokenizer uses the match and ignores all of the remaining rules.  If the first rule's regular expression does not match, the tokenizer tries the second rule, and so on until it runs out of rules.
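
Order matters whenever two patterns could match the same text.  In this sketch (tokHex is our own hypothetical type, declared with "enum token tokHex"), the hexadecimal rule must precede the decimal rule, since the decimal rule would otherwise claim the leading '0' of input like '0x1F':

  /* hexadecimal numbers - must come before the decimal rule */
  ['0x[0-9a-fA-F]+', tokHex, 0, nil],

  /* decimal numbers */
  ['[0-9]+', tokInt, 0, nil]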


Each time the tokenizer finds a matching rule, it adds the result of applying the conversion rule to the result list, along with the token type specified by the rule.  The tokenizer then removes the matching text from the input string.  If that leaves the input string empty, the tokenizer returns the result list to the caller.  If the input string is not yet empty, the tokenizer starts over, searching from the first rule to find a match to the remainder of the string.  The tokenizer repeats this process until the input string is empty.
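
For example, given the calculator rules shown in the example below and the input '12+3', the tokenizer would match '12' with the integer rule, leaving '+3'; match '+' with the operator rule, leaving '3'; and finally match '3' with the integer rule, emptying the input and completing the scan with three tokens.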


If the tokenizer exhausts its list of rules, it throws a TokErrorNoMatch exception.  This exception object has a property, remainingStr_, which gives the text of the remainder of the string at the point at which the tokenizer could find no matching rule.
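
A caller can catch this exception to report the problem to the user; a minimal sketch:

  try
  {
    tokList = Tokenizer.tokenize(str);
  }
  catch (TokErrorNoMatch err)
  {
    "I don't understand this: <<err.remainingStr_>>\n";
  }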

Customization Example

Suppose we wished to build a simple four-function calculator, which reads arithmetic expressions typed by the user and displays the results.  For this calculator, we'd need to recognize two types of tokens: operators and numbers.  There's already a tokInt type defined by the Tokenizer class, but we'd have to define our own token type for operators:


enum token tokOp;


The default tokenizer rules won't work for the calculator because they don't accept all of the punctuation marks we'd need to use for operators (and besides, the default rules classify the punctuation marks they do recognize as type tokPunct, when we want tokOp tokens).


We'll need three token rules: one that skips whitespace; one that matches integer numbers, yielding tokInt tokens; and one that matches the operator and grouping characters ( ) + - * /, yielding tokOp tokens.

Here's how our subclass would look to implement these rules:


CalcTokenizer: Tokenizer
  rules_ =
  [
    /* skip whitespace */
    ['[ \t]+', nil, TOKFLAG_SKIP, nil],
 
    /* integer numbers */
    ['[0-9]+', tokInt, 0, nil],
 
    /* operators (the '-' goes last in the class so it's literal, not a range) */
    ['[()+*/-]', tokOp, 0, nil]
  ]
;


To tokenize using our customized rules, we'd simply call our subclass's tokenize() method rather than the default tokenizer's:


  tokList = CalcTokenizer.tokenize(str);
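
Putting it all together, a minimal read-and-tokenize loop for the calculator might look like this sketch (parsing and evaluating the token list is left as an exercise; this assumes "tok.h" is included and tok.t is linked, as described above):

  main(args)
  {
    for (;;)
    {
      /* prompt for an expression; stop on empty input */
      "\n> ";
      local str = inputLine();
      if (str == nil || str == '')
        break;

      try
      {
        /* tokenize with our calculator rules */
        local tokList = CalcTokenizer.tokenize(str);

        /* ...parse and evaluate tokList here... */
      }
      catch (TokErrorNoMatch err)
      {
        "Invalid input starting at: <<err.remainingStr_>>\n";
      }
    }
  }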