TADS 3 provides several string functions that use "regular expressions." A regular expression is similar to a "wildcard" search string, but regular expressions are much more powerful than simple wildcards.
A regular expression is specified with a pattern string. The simplest kind of regular expression is simply a string of literal text. For example, this is a valid regular expression:
abc
This simply matches the string "abc", because the pattern consists entirely of "ordinary characters," and each ordinary character of the regular expression is matched literally to a character of the string to be searched.
An "ordinary character" is any character that doesn't have some other meaning in the regular expression language. All of the alphabetic characters (including accented characters), all of the digits, and space characters of all kinds are ordinary characters. The following punctuation marks have special meanings:
% < > + . * ? [ ^ $ | ( )
Everything else is an ordinary character.
You can use most of these special characters as though they were ordinary characters by putting a percent sign ("%") in front of them. So, to search for the letters "abc" enclosed in parentheses, we could write this:
%(abc%)
However, there is one pair of exceptions: the sequences "%<" and "%>" have special meanings of their own, so you can't use "%<" to match a less-than sign, and you can't use "%>" to match a greater-than sign. To match these characters, you must use a range expression:
[<]abc[>]
This matches the letters "abc" enclosed in angle brackets.
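To see the quoting rules in action, here is a minimal sketch using Python's re module as a stand-in. This is an assumption made purely for illustration: Python is not TADS, and it quotes special characters with a backslash where the patterns above use a percent sign. The range trick for angle brackets carries over unchanged.

```python
import re

# Python analogue of the TADS pattern %(abc%): backslash replaces percent.
paren_pattern = re.compile(r"\(abc\)")
# The range form from the text works as written in Python too.
angle_pattern = re.compile(r"[<]abc[>]")

paren_hit = paren_pattern.fullmatch("(abc)") is not None    # quoted parens match literally
paren_miss = paren_pattern.fullmatch("abc") is not None     # the parens are required
angle_hit = angle_pattern.fullmatch("<abc>") is not None    # angle brackets via ranges

print(paren_hit, paren_miss, angle_hit)
```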
The meanings of all of these special characters are explained in the sections that follow.
Even the simple string above uses one of the construction principles that lets you build complex search patterns. The string above consists of three ordinary characters that are concatenated together to form a longer string. When you concatenate a regular expression element to a regular expression, you get a new regular expression that matches what the first one matches, plus what the new element matches. This is pretty obvious for simple cases like the one above, because if we add a new element – say the letter "d" – we get a new regular expression which matches a longer literal string:
abcd
Another construction principle that lets you combine expressions is alternation. With alternation, you specify that the pattern matches one regular expression or another regular expression. You specify alternation with the character "|" (the vertical bar).
We know that the expression "abc" matches the literal string "abc", and the expression "def" matches the literal string "def". So, we could combine these with alternation to make a new regular expression that matches either "abc" or "def":
abc|def
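Concatenation and alternation behave the same way in most regular expression dialects, so we can try this pattern in Python's re module (an assumed stand-in for the TADS matcher, used here only as a sketch):

```python
import re

either = re.compile(r"abc|def")

# fullmatch requires the whole string to match one of the alternatives.
results = {
    text: either.fullmatch(text) is not None
    for text in ["abc", "def", "abd", "abcdef"]
}

print(results)
```

Only "abc" and "def" are accepted; "abcdef" is rejected because neither alternative covers the entire string.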
If you've ever used an operating system like DOS or Unix, you're probably familiar with "wildcard" characters for file directory listings. A wildcard is a character that matches any other character.
Regular expressions have a wildcard character, too, but it's not what you might expect if you're thinking about filename wildcards from DOS or Unix. The regular expression wildcard character is the period ("."). This simply matches any single character. So, if we wanted to match the word "the" followed by a space followed by any three characters, we'd write this:
the ...
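As a quick check of the wildcard, the sketch below runs the same pattern through Python's re module. Python is assumed here only as a convenient test bench; its period also matches any single character (apart from a newline, by default).

```python
import re

the_three = re.compile(r"the ...")

# Each period must consume exactly one character.
cat_ok = the_three.fullmatch("the cat") is not None     # three characters follow
digits_ok = the_three.fullmatch("the 123") is not None  # any characters will do
too_short = the_three.fullmatch("the ox") is not None   # only two characters follow

print(cat_ok, digits_ok, too_short)
```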
Regular expressions don't stop at simple wildcards, though: they let you get much more specific. First, you can use "ranges," which let you match one of a selected group of specific characters. For example, if you want to match any single character that is a vowel, you could write a range like this:
[aeiouAEIOU]
Note that, by default, regular expressions are case-sensitive, which is why we wrote the vowels in both upper- and lower-case. You can, however, control the case-sensitivity of a search, so you don't always have to write your expressions this way.
You can use a range expression in an expression wherever an ordinary character can go. So, to write a pattern that matches "button", followed by a space, followed by a digit from 0 to 9, you could write this:
button [0123456789]
Ranges can also specify that you want to exclude characters. An "exclusive" range works just the opposite of a regular range: it matches anything that's not listed in the range. You specify an exclusive range by putting a caret ("^") as the first character inside the brackets of the range. So, to match any single character that isn't a vowel, you'd write this:
[^aeiouAEIOU]
Note that exclusive ranges match anything that's not in the range, so the range above will match anything that isn't a vowel, including digits, spaces, and punctuation characters.
You can also use a range to specify contiguous portions of the Unicode character set simply by giving the endpoints of the portion. Do this by listing the ends of the range, separated by a hyphen ("-"). For example, to match any letter in the Roman alphabet, not including any accented characters, you'd write this:
[a-zA-Z]
This matches any character whose Unicode character code value is between "a" and "z" inclusive, or between "A" and "Z" inclusive. (The Unicode character set includes the ASCII character set as a subset, assigning the same character code values as ASCII does to the ASCII characters, so if you're familiar with Unix-style regular expression ranges, you will find Unicode ranges end up working exactly the same way.)
You can use exclusion with subset ranges as well:
[^a-zA-Z]
This matches any single character that is not in the Roman alphabet.
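The range forms shown so far are written identically in Python's re module, so a short sketch there (Python being an assumption for illustration, not part of TADS) can show what each one picks out of a sample string:

```python
import re

# Inclusive range: every vowel, in either case.
vowels = re.findall(r"[aeiouAEIOU]", "Open the Airlock")

# Exclusive range with endpoints: everything that is not a Roman letter.
non_letters = re.findall(r"[^a-zA-Z]", "button 7!")

# A range used where an ordinary character could go.
button_ok = re.fullmatch(r"button [0123456789]", "button 7") is not None

print(vowels, non_letters, button_ok)
```

Note that the exclusive range picks up the space, the digit, and the exclamation mark alike.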
If you want to include the character "^" in a range expression, you can do so, as long as it's not the first character – if the "^" appears as the first character, it's taken to indicate an exclusive range. So, to specify a match for either an ampersand or a caret, you'd have to write the range expression like this:
[&^]
Similarly, note that, if you want to include a hyphen character in a range expression, it must be the first character in the range list. If a hyphen appears anywhere else, it's taken as a subset specifier. So, to write a range that matches a pound sign or a hyphen, you'd have to write this:
[-#]
In addition, if you want to include a right square bracket in a range expression, it must be the first character in an inclusive range, or the first character after "^" in an exclusive range.
Combining all of the rules above, if we wanted to write an inclusive search for all of the special range characters – hyphen, caret, and right square bracket – we'd have to write this:
[]-^]
And to write a search that excludes all of these characters:
[^]-^]
The two examples above are the exact orders needed for these special situations. If you want to write these ranges and include additional characters, just add them after the "^". If you don't want to include all of the special characters, take out the ones you don't want from the example above, leaving the remaining ones in the same order.
Note that, other than the three special range characters ("^", "-", and "]"), all of the characters that are special elsewhere in a pattern lose their special meaning within a range. So, the following range expression matches a period, a star, or a percent sign:
[.*%]
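This particular range reads the same way in Python's re module, so we can sketch it there (again, Python is only an assumed test bench; the ordering rules for "^", "-", and "]" described above are TADS rules and are not exercised here):

```python
import re

# Inside the brackets, the period, star, and percent sign are plain characters.
specials = re.findall(r"[.*%]", "50% of 3.5 * 2")

print(specials)
```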
Ranges are useful for matching a specific group of characters, but it's harder to write a good range expression for more complex character sets, such as any alphabetic character or any digit. Unicode has so many different groups of alphabetic characters, since it includes support for so many different languages, that it would take a lot of work to list all of the different alphabetic ranges. Fortunately, TADS regular expressions provide a short-hand notation for certain important character sets, called "character classes."
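Each character class is written as a name enclosed in angle brackets ("<" and ">"). Each class matches a single character. The classes are:
<Alpha> matches any single alphabetic character
<Upper> matches any single upper-case alphabetic character
<Lower> matches any single lower-case alphabetic character
<Digit> matches any single digit character
<AlphaNum> matches any single alphabetic or digit character
<Space> matches any single space character
<Punct> matches any single punctuation mark character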
Note that the class names are not case-sensitive (regardless of whether or not the search itself is), so <Alpha>, <alpha>, and <ALPHA> are all equivalent.
You can use a character class in place of an ordinary character. So, to search for a five-letter word starting with an upper-case letter followed by four lower-case letters, we could write this:
<Upper><lower><lower><lower><lower>
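Python's re module has no named classes of this kind, so the sketch below imitates the idea with Python's Unicode-aware string tests instead. Everything in it is hypothetical: the CLASS_TESTS table and the matches_classes helper are invented for illustration, and the Python tests only approximate the TADS classes (str.isdigit, for example, accepts some characters beyond the ordinary decimal digits).

```python
# Hypothetical lookup table: class name (lower-cased) -> Python character test.
CLASS_TESTS = {
    "alpha": str.isalpha,
    "upper": str.isupper,
    "lower": str.islower,
    "digit": str.isdigit,
    "alphanum": str.isalnum,
    "space": str.isspace,
}


def matches_classes(class_names, text):
    """Return True if text has exactly one character per class, each passing its test."""
    if len(class_names) != len(text):
        return False
    # Lower-casing the name mimics the case-insensitive class names.
    return all(CLASS_TESTS[name.lower()](ch) for name, ch in zip(class_names, text))


five_letter_word = ["Upper", "lower", "lower", "lower", "lower"]

print(matches_classes(five_letter_word, "Hello"))
print(matches_classes(five_letter_word, "hello"))
print(matches_classes(five_letter_word, "\u00c9mile"))  # accented capital E
```

The last call shows why classes are preferable to a range like [A-Z]: the accented capital counts as an upper-case letter.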
If you've used filename patterns on DOS or Unix, you're probably wondering by now how you match a variable-length string, the way the "*" character does for filename matches on these systems. Regular expressions let you do this, but in a different and more powerful way than filename patterns do.
There are three ways of specifying variable-length regular expression matches. The first is the "optionality" operator, which specifies that the immediately preceding expression character is optional – specifically, that the preceding character can be present zero or one times in the match string. The optionality operator is the question mark, "?", and immediately follows the character to be made optional. So, to search for either "you" or "your", we could write this:
your?
The second variable-length operator is the one-or-more "closure." This operator is the plus sign, "+", and specifies that the immediately preceding character is to be repeated once or more – any number of times, as long as it appears at least once. So, to match a string of any number of copies of the letter "A", we'd write this:
A+
This matches "A", "AA", "AAA", and so on without limit.
The third variable-length operator is almost the same: it's the zero-or-more closure. This operator is the asterisk, "*". This specifies that the preceding character is to match any number of times, and furthermore that it need not be present at all. For example:
abcd*
This matches "abc", or "abcd", or "abcdd", or "abcddd", and so on.
You can apply the closure operators to more complex expressions than a single ordinary character. For example, to search for one or more digits, you could write this:
<digit>+
To search for any word of any length written with an upper-case initial letter and lower-case letters following, you'd write this:
<upper><lower>*
To search for any number of repetitions of an arithmetic operator character, we could write this amusing sequence of punctuation marks:
[-+*/]*
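The three variable-length operators are spelled the same way in Python's re module, so the examples above can be sketched there. Python is an assumption for illustration only, and the whole helper is a hypothetical convenience wrapper.

```python
import re


def whole(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None


# "?" makes the preceding character optional.
optional = [whole(r"your?", t) for t in ["you", "your", "yourr"]]

# "+" requires at least one copy.
one_or_more = [whole(r"A+", t) for t in ["", "A", "AAA"]]

# "*" allows any number of copies, including none.
zero_or_more = [whole(r"abcd*", t) for t in ["abc", "abcddd", "ab"]]

# A closure applied to a range of operator characters.
operators = whole(r"[-+*/]*", "+-*/")

print(optional, one_or_more, zero_or_more, operators)
```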
Each construction rule has a default grouping. For example, the alternation operator ("|") considers everything to the left of the "|" to be one complete regular expression, and everything to the right to be another complete expression: the pattern "abc|def" thus matches "abc" or "def". Sometimes, however, you will want to change the default grouping, to extend or limit the extent to which an operator applies. You can do this by putting a portion of the expression in parentheses ("(" and ")").
For example, suppose we wanted to construct an expression that matches either "the red ball" or "the blue ball". We might first attempt something like this:
the red|blue ball
However, this wouldn't work the way we want: the "|" operator applies to everything to its left and right, so what this expression actually matches is "the red" or "blue ball". This is where parentheses come in handy: we can enclose in parentheses the part of the expression to which we want to apply the "|" operator:
the (red|blue) ball
You can also use parentheses to achieve the opposite effect with the closure operators. Using parentheses, you can make the closure operators apply to more than just the single character preceding the closure. For example, to match any number of repetitions of the word "the" followed by a space, you could write this:
(the )+
You can use parentheses within parentheses for more complex grouping. For example, to search for the word "the" followed by any number of repetitions of "ball", and then repeating the whole thing any number of times, we'd write this:
(the (ball )+)+
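The effect of parentheses on grouping is easy to confirm by experiment. The sketch below uses Python's re module, which is assumed here as a stand-in because it treats "|", "+", and parentheses the same way for these patterns; the whole helper is hypothetical.

```python
import re


def whole(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None


# Without parentheses, "|" splits the whole pattern in two.
ungrouped_red = whole(r"the red|blue ball", "the red")
ungrouped_blue = whole(r"the red|blue ball", "the blue ball")

# With parentheses, only "red" and "blue" are alternatives.
grouped_blue = whole(r"the (red|blue) ball", "the blue ball")
grouped_red_only = whole(r"the (red|blue) ball", "the red")

# Nested groups, each with its own closure.
nested = whole(r"(the (ball )+)+", "the ball ball the ball ")

print(ungrouped_red, ungrouped_blue, grouped_blue, grouped_red_only, nested)
```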
Parenthesized groups have another use besides controlling operator grouping. Each time you use parentheses, the regular expression matcher automatically assigns a "group number" to the expression contained within the parentheses. The group numbers start at 1, and increase each time the parser encounters an open parenthesis. (Nesting doesn't matter for numbering – the order of appearance of the open parentheses establishes the group numbering.)
The regular expression functions let you look at the exact text that matched a particular group after a search. For example, suppose you defined a search like this:
say "(.*)" to (<alphanum>*)
This expression has two groups. Group number 1 is the part within the quote marks. Group number 2 is the part after "to". Now, suppose we match this string:
say "hello there" to Mark
If we ask the regular expression matcher for group number 1, it will give us the string "hello there" (no quotes – the group is inside the quotes, so the quotes won't be part of the group string). Similarly, group number 2 is the string "Mark".
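Group numbering works the same way in Python's re module, which we can use as an assumed stand-in to sketch the example. The Python spelling \w is used in place of <alphanum>; it is only an approximation, since \w also accepts the underscore.

```python
import re

match = re.fullmatch(r'say "(.*)" to (\w*)', 'say "hello there" to Mark')

first_group = match.group(1)    # the text inside the quote marks
second_group = match.group(2)   # the text after "to"

# Numbering follows the order of the open parentheses, nested or not.
nested = re.fullmatch(r"(the (red|blue)) ball", "the red ball")
nested_groups = nested.groups()

print(first_group, second_group, nested_groups)
```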
Groups can also be used within an expression. If you write the sequence "%1" in an expression, it specifies a match to the same thing that group number 1 already matched in the same string. Similarly, "%2" matches the same text as group number 2, and so on, up to "%9" for group 9. This allows you to look for repeated sequences that are separated from one another. For example:
(<alphanum>*) is %1
This will match any string of the form "word is word", where the two words are the same. So, it will match "red is red" and "blue is blue", but it won't match "blue is red".
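Here is a sketch of the same idea in Python's re module, assumed only for illustration. Python writes the group reference as \1 rather than %1, and \w stands in for <alphanum> (an approximation, since \w also accepts the underscore).

```python
import re

# \1 must match exactly the text that group 1 matched.
same_word = re.compile(r"(\w*) is \1")

red_red = same_word.fullmatch("red is red") is not None
blue_blue = same_word.fullmatch("blue is blue") is not None
blue_red = same_word.fullmatch("blue is red") is not None

print(red_red, blue_blue, blue_red)
```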
The regular expression matcher provides a number of special match types.
The "^" character specifies a match to the very beginning of the search string. If specified, this has to be the first character in the pattern (or the first character within a parenthesized group at a top-level alternation). The "^" character doesn't match any characters – it simply matches if the search position is the very start of the string.
The "$" character specifies a match to the very end of the string. This must be the last character in the pattern or within a parenthesized group at a top-level alternation.
The sequence "%<" matches the start of a word, which is defined as a position where the preceding character is not a word character, and the following character is. A word character is any alphanumeric character. "%<" doesn't actually match any characters – it just requires that the current position is the start of a word.
The sequence "%>" matches the end of a word.
The sequence "%w" matches any word character, which is defined as an alphanumeric character. This is equivalent to "<AlphaNum>", but is shorter to type in.
The sequence "%W" matches any non-word character.
The sequence "%b" matches any word boundary, which is either the beginning or ending of a word.
The sequence "%B" matches anywhere that is not a word boundary.
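The special match types can be sketched with Python's re module, taken here as an assumed stand-in. Python spells "%b" and "%B" as \b and \B, and has no direct equivalent of "%<" or "%>"; the lookahead and lookbehind forms below are one way to imitate them, not a TADS feature.

```python
import re

sample = "go to the store"

# ^ and $ match positions, not characters.
starts_with_the = re.search(r"^the", "the end") is not None
not_at_start = re.search(r"^the", "at the end") is not None
ends_with_end = re.search(r"end$", "the end") is not None

# Word boundaries and non-boundaries.
whole_word = re.search(r"\bcat\b", "the cat sat") is not None
inside_word = re.search(r"\bcat\b", "concatenate") is not None
embedded = re.search(r"\Bcat\B", "concatenate") is not None

# Imitating %< (word start) and %> (word end) with zero-width lookarounds.
word_starts = [m.start() for m in re.finditer(r"\b(?=\w)", sample)]
word_ends = [m.start() for m in re.finditer(r"(?<=\w)\b", sample)]

print(word_starts, word_ends)
```

The two lists hold positions only, which mirrors the point made above: these sequences match a place in the string without consuming any characters.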
By default, searches are sensitive to case, which means that an upper-case letter in the search pattern will match only the identical upper-case letter in the string being searched. You can, however, make a search insensitive to case. To do this, add the <NoCase> flag to the search pattern. There's also a <Case> flag to make the case sensitivity explicit, but this is the default, so you won't usually need to specify it.
The <Case> and <NoCase> flags don't match anything themselves; they're just flag sequences that control the overall search mode. You can put these anywhere in the search, but normally you'd just want to put them at the start of the search string to avoid confusion. Note that these flags are global, which means that the entire search is case-sensitive or case-insensitive; you can't make part of your search string sensitive to case and another part insensitive. If the flags appear more than once, only the last one that appears is obeyed.
For example, to search for a match to "abc", ignoring case, we'd write this:
<NoCase>abc
When you use <NoCase>, the case of the letters in your pattern is mostly irrelevant, since the pattern matcher will match "A" or "a" to a pattern character "a", and will likewise match "A" or "a" to a pattern character "A". However, there are some cases involving non-English languages where the case of the pattern characters might be significant. In particular, when the matcher encounters an alphabetic character in case-insensitive mode, it will first convert the string character to the case of the corresponding pattern character, and it will then perform the comparison. In some languages, a few characters have ambiguous translations from upper-case to lower-case or vice versa. In these languages, you can control how the matcher performs the translation by using the correct unambiguous case in the pattern string. For most languages whose writing systems are based on the Roman alphabet, there is no ambiguity, so you won't have to worry about this.
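For comparison, Python's re module offers a similar whole-pattern switch, and a sketch there shows the effect. Python is an assumption for illustration: its inline flag is written (?i) at the start of the pattern rather than <NoCase>, and its case-folding rules are its own, so the remarks above about ambiguous translations do not necessarily carry over.

```python
import re

# Default: case-sensitive.
case_sensitive = re.fullmatch(r"abc", "ABC") is not None

# Inline flag at the start of the pattern, similar in spirit to <NoCase>abc.
inline_flag = re.fullmatch(r"(?i)abc", "ABC") is not None

# The same switch passed as a separate argument.
flag_argument = re.fullmatch(r"abc", "aBc", re.IGNORECASE) is not None

print(case_sensitive, inline_flag, flag_argument)
```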
There are times when a particular expression can match a string in several different ways. For example, consider this pattern:
say (.*) to (.*)
For many strings, there will be only one way to match this. In some cases, though, we could type a string that could be interpreted different ways. For example:
say time to go to Bob
This could match in several different ways. We could end up with group 1 as "time to go" and group 2 as "Bob". We could also have group 1 as "time" and group 2 as "go to Bob". We could also have group 1 as "time" and group 2 as "go", or even an empty group 2 – ".*" can match zero characters, after all.
Normally, the matcher will give us the longest match that begins earliest in the search string. The matcher will furthermore give the earliest groups in the string the longest matches. So, of all of the choices above, the matcher will normally pick the one where group 1 is longest and group 2 is longest given that group 1 is already longest – thus, group 1 is "time to go" and group 2 is "Bob".
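Python's re module happens to resolve this particular ambiguity the same way, so it can serve as an assumed stand-in for a quick sketch. Its matching strategy is not identical to the one described here in general, so treat the agreement as specific to this example.

```python
import re

match = re.fullmatch(r"say (.*) to (.*)", "say time to go to Bob")

# The first group takes as much as it can while still letting the rest match.
groups = match.groups()

print(groups)
```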
You can, however, control this behavior.
Two flags control whether the matcher picks the longest or shortest match for a string. If you put the <Max> flag somewhere in your expression (it's a global flag, so it doesn't matter where it goes), the parser will always choose the longest string it can for each subexpression, giving precedence to the earliest expression. This is the default behavior. If you use the <Min> flag, though, the matcher will use the shortest match that it possibly can for the overall match. Thus, consider this new expression:
<Min>say (.*) to (.*)
Now if we match this to "say time to go to Bob", we'll get "time" for group 1, and an empty group 2.
Note that the matcher still always tries to give the earliest groups the longest matches, but this is only after figuring out which is the shortest overall match. Consider this example:
tell (.*) to (.*)
If we type in something like "tell Bob to eat my shorts", there's no ambiguity. But if we try a string like "tell Bob to go to the store", the matcher assigns "Bob to go" to group 1 and "the store" to group 2, which isn't what we want. How do we solve this?
Unfortunately, <Min> doesn't help us much with a situation like this, because the second group is free to match nothing at all. So, if we try this:
<Min>tell (.*) to (.*)
and we try "tell Bob to go to the store", we'll have "Bob" for group 1, as we want, but now we'll have an empty group 2 – the shortest match to the string is simply "tell Bob to ", since the second group can match nothing. We could change the expression like so:
<Min>tell (.*) to (.*)$
This forces the expression to match to the end of the string. But this still doesn't do what we want, because now the first group will be "Bob to go" and the second will be "the store" – so we're back where we started. The reason that <Min> doesn't help us here is that <Min> affects only the length of the complete match, and doesn't affect the matcher's preference for putting the longer string in the earlier group in case of ambiguity.
Unfortunately, there's no good way to solve this problem with a single regular expression. The easiest solution is to use two separate regular expressions. For the first, we eliminate the second anything-goes wildcard sequence, and end the expression at the "to":
tell (.*) to<space>
Now, this reduces the ambiguity of the expression, but it still doesn't do what we want – when we match "tell Bob to go to the store", we again find that group 1 is "Bob to go", since the parser by default matches the longest sequence it can. However, we finally have a situation where the <Min> flag solves our problem:
<Min>tell (.*) to<space>
This gives us what we want – group 1 is simply "Bob", since the shortest possible string that matches the complete pattern is now "tell Bob to ". We can finish by using the match length for the overall expression to learn what's left in the rest of the string, which gives us what we formerly tried to get from the second group.
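The two-step approach can be sketched in Python's re module. This is an analogy under stated assumptions: Python has no global <Min> flag, so the sketch uses its per-operator lazy closure "*?" to get the shortest first group, and the split_command helper is a hypothetical name invented for this example.

```python
import re


def split_command(command):
    """Split 'tell X to Y' into (X, Y) using a shortest match for X."""
    # "*?" is Python's lazy closure; it stops at the first " to ".
    head = re.match(r"tell (.*?) to\s", command)
    if head is None:
        return None
    # The match length tells us where the rest of the string begins.
    return head.group(1), command[head.end():]


print(split_command("tell Bob to go to the store"))
print(split_command("tell Bob to eat my shorts"))
print(split_command("hello"))
```

The remainder of the string is taken by position rather than by a second group, which is the same trick as using the overall match length.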
You can also specify whether the matcher finds the matching string that begins first or ends first. By default, the matcher finds a string that begins earliest in the search string. However, there are times when you might want to find the string that ends earliest. To do this, use the <FirstEnd> flag, which you can also write as simply <FE>. The default flag, <FirstBegin> or <FB>, finds the string that begins earliest.
Pattern         Meaning
|               Alternation operator
( )             Grouping operator
+               Repeat preceding expression one or more times
*               Repeat preceding expression zero or more times
?               Repeat preceding expression zero or one times
.               Match any single character
^               Match only at beginning of string
$               Match only at end of string
[ ]             Character range
[^ ]            Exclusive character range
<Alpha>         Any single alphabetic character
<Upper>         Any single upper-case alphabetic character
<Lower>         Any single lower-case alphabetic character
<Digit>         Any single digit character
<AlphaNum>      Any single alphabetic or digit character
<Space>         Any single space character
<Punct>         Any single punctuation mark character
%1              Match the same text that the first parenthesized group matched
%2              Match the same text as the second parenthesized group
%9              Match the same text as the ninth parenthesized group
%<              Match only at the beginning of a word
%>              Match only at the end of a word
%w              Match any single word character
%W              Match any single non-word character
%b              Match at any word boundary
%B              Match only at a non-word boundary
<Case>          Make the match case-sensitive (default)
<NoCase>        Make the match insensitive to case
<FirstBegin>    Find the match that begins earliest in the search text (default)
<FB>            Same as <FirstBegin>
<FirstEnd>      Find the match that ends earliest in the search text
<FE>            Same as <FirstEnd>
<Max>           Find the longest match (default)
<Min>           Find the shortest match
%               Quote the following special character (except "<" and ">")
This expression matches a North American telephone number, with optional area code:
(%([0-9][0-9][0-9]%)<space>*)?[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]
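A sketch of the telephone pattern in Python's re module follows, with the area-code group closed and made optional by a trailing "?". Python is assumed only as a test bench: it writes the quoted parentheses as \( and \), and \s stands in for <space>. The is_phone helper is a hypothetical name.

```python
import re

# Optional "(ddd)" area code plus any spaces, then "ddd-dddd".
PHONE = re.compile(r"(\([0-9][0-9][0-9]\)\s*)?[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]")


def is_phone(text):
    """Return True if the whole text is a phone number in the expected form."""
    return PHONE.fullmatch(text) is not None


for candidate in ["(415) 555-1212", "555-1212", "(415)555-1212", "415 555-1212"]:
    print(candidate, is_phone(candidate))
```

The last candidate fails because an area code, when present, must be enclosed in parentheses.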
The next expression matches a C-style floating point number. These numbers start with an optional sign character, then have either a string of digits, a decimal point, and a string of zero or more digits; or a decimal point followed by one or more digits. After this is an optional exponent, written with the letter "E" (capital or small) followed by an optional sign followed by one or more digits.
[-+]?([0-9]+%.?|[0-9]*%.[0-9]+)([eE][-+]?[0-9]+)?
Note the way we constructed the alternation that gives us the mantissa (the part before the exponent). We use the alternation to give us one of two expressions:
[0-9]+%.?
[0-9]*%.[0-9]+
The first expression matches a string of one or more digits, followed by an optional decimal point. This matches numbers that have no decimal point at all, as well as numbers that end in a decimal point. The second expression matches zero or more digits, a decimal point, and then one or more digits. One might wonder why we didn't write the expression more simply like this:
[0-9]*%.?[0-9]*
In other words, as zero or more digits, an optional decimal, and zero or more digits. The reason we didn't write the expression this way is that everything in this expression is optional – this one would match an empty string. It would also match a period, without any digits on either side. Obviously, we don't want to consider either an empty string or simply a period as a valid floating point number, so this simpler form of the expression is a little too general. The alternation solves these problems, because it allows for starting with a decimal, ending with a decimal, or containing an embedded decimal, but there must always be one or more digits on one side or the other of the decimal.
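The difference between the two forms can be checked with a short sketch in Python's re module, assumed here as a stand-in; the only change to the patterns is that the decimal point is quoted with a backslash instead of a percent sign. The names FLOAT, LOOSE_MANTISSA, and is_float are hypothetical.

```python
import re

# The full floating point pattern, with the alternation for the mantissa.
FLOAT = re.compile(r"[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)([eE][-+]?[0-9]+)?")

# The simpler mantissa in which every element is optional.
LOOSE_MANTISSA = re.compile(r"[0-9]*\.?[0-9]*")


def is_float(text):
    """Return True if the whole text is a floating point number."""
    return FLOAT.fullmatch(text) is not None


accepted = [t for t in ["1", "1.", ".5", "1.5", "-2.5e10", "6.02E+23"] if is_float(t)]
rejected = [t for t in ["", ".", "e5", "1e"] if not is_float(t)]

# The loose form wrongly accepts an empty string and a lone period.
loose_empty = LOOSE_MANTISSA.fullmatch("") is not None
loose_period = LOOSE_MANTISSA.fullmatch(".") is not None

print(accepted, rejected, loose_empty, loose_period)
```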