String
"String" is a native TADS 3 datatype, but it's also an intrinsic class. Any string value is an instance of the String class, so you can call methods defined by the class on any string value.
Value semantics
Strings have "value semantics". This means that a given string value's text is immutable: once you've created a string, the text within that string never changes. All of the methods and operators that might appear to change the value of a string actually create a new string with the modified value, leaving the original value intact. For example, consider this code:
local x = 'foo';
local y = x;
x += 'bar';
Superficially, it appears that the last line changes the string in x. In fact, the original string is not changed - if we display the value of y, we'll see that y still contains 'foo'. When the last line above is executed, it creates a new string to hold the concatenated value, and assigns the result to x.
Value semantics make it very easy to work with strings, because you don't have to worry about whether a function might modify a string you pass to it: this can never happen, because a given string's text is constant.
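As a small sketch of the point above, the following program passes a string to a function that appears to modify it. The function name shout is hypothetical and used only for illustration; the program assumes the standard <tads.h> header and a main(args) entry point.

```tads3
#include <tads.h>

/* Appends '!' to its argument; this rebinds only the local variable s. */
shout(s)
{
    s += '!';
    return s;
}

main(args)
{
    local a = 'hi';
    local b = shout(a);

    /* a is unchanged; b refers to the newly created string */
    "a = <<a>>, b = <<b>>\n";   // displays: a = hi, b = hi!
}
```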
Operators
The addition operator + concatenates two strings, yielding a string that consists of the contents of the right-hand string appended to the end of the contents of the left-hand string.
local x = 'foo';
local y = 'bar';
local z = x + y; // z = 'foobar'
You can use the += operator to replace the left-hand variable with the resulting string.
local x = 'foo';
local y = 'bar';
x += y; // x = 'foobar'
String vs StringBuffer
There's a related class called StringBuffer that's designed especially for complex string construction tasks. StringBuffer objects can be edited in place, meaning that you can change the text contained in a StringBuffer object, rather than creating a new object for every modification. Refer to the StringBuffer documentation for more information.
String methods
digestMD5()
MD5 was originally designed for cryptographic applications, but it has some known weaknesses and is no longer considered secure. Even so, it's still considered a good checksum, and it's widely used for message integrity checking. It's also part of several Internet standards (e.g., HTTP digest authentication). In an Interactive Fiction context, Babel uses MD5 to generate IFIDs for older games. If you're looking for a secure hash, consider SHA-2 (see sha256()) instead of MD5.
endsWith(str)
find(str, index?)
If index is given, it gives the starting index in self for the search; a value of 1 indicates that the search starts at the first character. If the index value is omitted, the default value is 1. The starting index value can be used to search for another occurrence of the same substring following a previous search, for example. A negative value for index is an index from the end of the string: -1 is the last character, -2 the second to last, etc.
Examples:
'abcdef'.find('cd') yields 3
'abcdef'.find('g') yields nil
'abcdef'.find('c', 3) yields 3
'abcdef'.find('c', 4) yields nil
'abcabcabc'.find('c', 4) yields 6
'abcabcabc'.find('c', 7) yields 9
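The repeated-search use of index mentioned above might look like the following sketch. The helper name findAll is hypothetical. The loop stops explicitly at the end of the string rather than relying on how find() treats a starting index past the last character, which this section doesn't specify.

```tads3
#include <tads.h>

/* Collect the starting index of every occurrence of sub within str. */
findAll(str, sub)
{
    local result = [];
    local idx = str.find(sub);

    while (idx != nil)
    {
        result += idx;

        /* resume one character past the start of this match */
        idx = (idx < str.length() ? str.find(sub, idx + 1) : nil);
    }
    return result;
}

main(args)
{
    /* per the examples above, 'c' occurs at indices 3, 6 and 9 */
    local lst = findAll('abcabcabc', 'c');
    "<<lst.length()>> matches\n";   // displays: 3 matches
}
```

Advancing by one character allows overlapping matches to be reported; to skip past each whole match instead, advance by sub.length().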
findReplace(origStr, newStr, flags?, index?)
origStr is the search string. This is the string to find within the subject string. newStr is the string to replace it with on each occurrence, or a function to invoke to determine the replacement text (more on this in a bit). The search string and replacement can also be specified as lists (see below).
flags is optional. If it's missing, the default is ReplaceAll, to replace all occurrences of the search string within the subject string. If provided, flags is a bitwise combination (with the | operator) of the following flag values:
- ReplaceOnce: replace only the first occurrence of the search string.
- ReplaceAll: replace all occurrences of the search string. This is the default if ReplaceOnce isn't specified, and it supersedes ReplaceOnce if both are specified.
- ReplaceIgnoreCase: ignore case (that is, capitalization) when searching for origStr. By default, the search is case-sensitive, so capitals can only match capitals and minuscules can only match minuscules.
- ReplaceFollowCase: each time a match is replaced, change lower-case letters in the replacement text to follow the capitalization pattern of the matched text. There are three possibilities: if all of the letters in the matched text are capitals, all letters in the replacement text are capitalized; if all of the letters in the match are lower-case, the replacement text isn't changed; if the match has a mix of capitals and lower-case letters, the first lower-case letter in the replacement text is capitalized, and the rest are left unchanged.
- ReplaceSerial: use the serial replacement mode. See below for details.
Note that you should never use 0 as the flags value. For compatibility with older versions, 0 has a special meaning equivalent to ReplaceOnce. If you have no other flags to specify, always use either ReplaceOnce or ReplaceAll, or simply omit the flags argument entirely.
If index is specified, it gives the starting index in self for the search. Any matches that start before this starting point will not be replaced. If index is 1, the search starts at the first character; this is the default if index is omitted. A negative value is an index from the end of the string: -1 is the last character, -2 the second to last, and so on. Note that a negative index doesn't change the direction of the search; it still runs left-to-right.
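The following sketch shows how the flags and index arguments described above combine. The expected results in the comments are worked out from the rules stated in this section; the flag constants are assumed to be available through <tads.h>.

```tads3
#include <tads.h>

main(args)
{
    local s = 'Hello hello HELLO';

    /* case-sensitive, first match only: 'Hello bye HELLO' */
    local r1 = s.findReplace('hello', 'bye', ReplaceOnce);

    /* every match, ignoring case: 'bye bye bye' */
    local r2 = s.findReplace('hello', 'bye',
                             ReplaceAll | ReplaceIgnoreCase);

    /* as above, but follow each match's capitalization: 'Bye bye BYE' */
    local r3 = s.findReplace('hello', 'bye',
                             ReplaceAll | ReplaceIgnoreCase
                             | ReplaceFollowCase);

    /* start at index 3, so the 'X' at index 2 is kept: 'aXb-c-' */
    local r4 = 'aXbXcX'.findReplace('X', '-', ReplaceAll, 3);

    "<<r1>>\n<<r2>>\n<<r3>>\n<<r4>>\n";
}
```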
Search lists: Instead of searching for just a single search string, you can search for several strings at once, by using a list as the origStr argument. This will search for each of the items in the list, and replace each one with the newStr replacement string.
If you supply a list for the search term, you can optionally also supply a list for the newStr replacement value. If you do, each match to an element of origStr is replaced with the corresponding element of the newStr list - that is, the item at the same list index. If there are more origStr elements than newStr elements, matches to the excess origStr elements are replaced with empty strings. Excess newStr elements are simply ignored.
If newStr isn't a list, it's used as the replacement for all of the search strings. Note that this is different from passing a one-element list for newStr, because in that case it would only specify a replacement for origStr[1], and the remaining origStr elements would all be replaced with empty strings.
Here's an example that replaces each of the special HTML characters with their markup codes, using a single findReplace() call:
str = str.findReplace(['&', '<', '>'], ['&amp;', '&lt;', '&gt;']);
When you use a list of search terms, there are two modes for iterating through the list. The default is "parallel" mode. In this mode, findReplace() starts by searching for all of the search terms at once. It then replaces the single leftmost match with its corresponding replacement text. (If two of the search strings match at the same position, the one at the lower origStr index takes precedence.) If ReplaceOnce was specified, we're done. Otherwise, findReplace() next repeats the search in the remainder of the string, after (to the right of) that first replacement, again searching for all of the terms, and again replacing the single leftmost match among them. This repeats until there are no more matches.
The other mode is "serial" mode, which you select by including ReplaceSerial in the flags. In this mode, findReplace() starts by searching only for the first origStr element. It replaces every match for the first term, or just the first match if ReplaceOnce is specified. If a match was found and ReplaceOnce was specified, we're done. Otherwise, we start over with the updated string - containing all replacements for the first term - and search this new string for the second search term. We once again replace all occurrences of this term. We repeat this process for each additional term in the origStr list.
The key difference between the parallel and serial modes is that serial mode rescans each replaced result string for each term. This means that replacement text from the first search term is subject to further replacement by the second search term, and that's subject to yet more replacement by the third term, and so on. In contrast, parallel mode never rescans replacement text, so once a replacement is made, it won't be further modified.
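A minimal sketch of the difference, using a pair of search terms chosen so that the first replacement text matches the second search term. The results in the comments follow from the descriptions of the two modes above.

```tads3
#include <tads.h>

main(args)
{
    /*
     *   Parallel (default): 'a' becomes 'b', then the search resumes
     *   after that replacement, so only the original 'b' becomes 'c'.
     */
    local par = 'ab'.findReplace(['a', 'b'], ['b', 'c']);

    /*
     *   Serial: first every 'a' becomes 'b', giving 'bb'; then the
     *   whole updated string is rescanned and every 'b' becomes 'c'.
     */
    local ser = 'ab'.findReplace(['a', 'b'], ['b', 'c'],
                                 ReplaceAll | ReplaceSerial);

    "parallel: <<par>>\n";   // displays: parallel: bc
    "serial: <<ser>>\n";     // displays: serial: cc
}
```

Parallel mode is usually the safer choice for escaping tasks such as the HTML example above, since the replacement text is never itself rewritten.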
Replacement function: The replacement value newStr is normally given as a literal string to substitute for each occurrence of the search string. For more flexibility, though, you can instead provide a function, which findReplace() calls for each match to determine the replacement text for that match. This allows you to vary the replacement text according to the exact text of the match (which can vary if you're using the ReplaceIgnoreCase flag or you're using a list of search strings), the position of the match within the subject string, or whatever other conditions you choose.
The callback function can be a regular or anonymous function. It's called like this, once for each match found in the subject string:
newStrFunc(match, index, orig);
match is the text actually matched in the subject string. index is the character index within the string of the start of the match (the first character is at index 1, as usual). orig is the entire original subject string.
You can omit one or more of the parameters when you define the callback function, because findReplace will only supply as many arguments as the function actually wants. The arguments are always in the same order, though - the names don't matter, just the order. This means that if you provide a callback that only takes one argument, it gets the match string value; with two arguments, they'll be assigned the match string and match index, respectively.
The function must return a string value giving the replacement text (it can alternatively return nil, which is treated as an empty string).
If you use a list of search strings and a list of corresponding replacements, each element of the replacement list can be a separate function. The replacement list can also be a mix of strings and functions.
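Here is a sketch of the callback form, using anonymous functions. The first callback takes only the match text; the second also takes the match index. No result is shown for the second call, because it depends on the index values passed in.

```tads3
#include <tads.h>

main(args)
{
    local s = 'tads Tads TADS';

    /* one-parameter callback: wrap each match, keeping its own case */
    local r1 = s.findReplace('tads', {m: '[' + m + ']'},
                             ReplaceAll | ReplaceIgnoreCase);

    /* two-parameter callback: also receives the index of the match */
    local r2 = s.findReplace('tads', {m, idx: m + '@' + toString(idx)},
                             ReplaceAll | ReplaceIgnoreCase);

    "<<r1>>\n";   // displays: [tads] [Tads] [TADS]
    "<<r2>>\n";
}
```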
htmlify(flags?)
- HtmlifyTranslateSpaces: converts each space after the first space in a run of multiple spaces to the sequence &nbsp; (the HTML non-breaking space). This ensures that, when the string is rendered in HTML mode, the display shows the same number of spaces that appeared in the original string. Note that the method never converts the first space in a run of whitespace to the &nbsp; sequence, because the first space in a run of whitespace is significant in HTML and thus requires no special handling.
- HtmlifyTranslateTabs: converts each tab character (\t) in the string to the sequence <tab>.
- HtmlifyTranslateNewlines: converts each newline character (\n) in the string to the sequence <br>.
- HtmlifyTranslateWhitespace: this is simply a combination of HtmlifyTranslateSpaces, HtmlifyTranslateTabs, and HtmlifyTranslateNewlines.
This method is useful if you obtain a string from an external source, such as from the user (via the inputLine() function, for example) or from a text file, and you then want to display the string in HTML mode. Without conversions, any markup-significant characters in the string might not be displayed properly, since the HTML parser would attempt to interpret the characters as HTML formatting codes. You can use this method to ensure that a string obtained externally is displayed verbatim in HTML mode.
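As a sketch of typical use, the example below escapes a string containing markup-significant characters. The base translation of &, <, and > into &amp;, &lt;, and &gt; is assumed here from the method's purpose as described above; the newline translation follows the flag description.

```tads3
#include <tads.h>

main(args)
{
    /* text as it might arrive from the user or a file */
    local raw = 'if (a < b && c > d)';

    /* assumed result: 'if (a &lt; b &amp;&amp; c &gt; d)' */
    local safe = raw.htmlify();

    /* newlines become <br> tags: 'line 1<br>line 2' */
    local multi = 'line 1\nline 2'.htmlify(HtmlifyTranslateNewlines);

    /* rendered in HTML mode, this shows the original text verbatim */
    "<<safe>>\n";
    "<<multi>>\n";
}
```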
length()
mapToByteArray(charset?)
If a charset value is included and isn't nil, it must be either a CharacterSet object, or a string giving the name of a character set. The method maps the string to bytes using the given character set, and creates a new ByteArray object with the mapped bytes as the contents. The usual default/missing character defined by the mapping is substituted for any unmappable characters.
If the character set represented by charset is unknown (i.e., there's no mapping available for the character set in the run-time TADS installation), an UnknownCharSetException is thrown. You can determine whether the character set is known using the isMappingKnown() method on the CharacterSet object.
If charset is omitted or nil, the method creates a ByteArray with one byte per character of the string, using the Unicode character code of each character as the byte value. Since a byte can only hold values from 0 to 255, a numeric overflow error will be thrown if the string contains any characters outside of this range.
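The two modes can be compared with a short sketch. It assumes that the 'utf-8' character set name is available in the run-time installation and that the ByteArray class is declared via <bytearr.h>.

```tads3
#include <tads.h>
#include <bytearr.h>

main(args)
{
    /* four characters; the last is U+00E9, which is within 0-255 */
    local s = 'caf\u00E9';

    /* mapped through a character set: U+00E9 takes two bytes in UTF-8 */
    local utf8 = s.mapToByteArray('utf-8');

    /* no character set: one byte per character, using the code points */
    local raw = s.mapToByteArray();

    "utf-8: <<utf8.length()>> bytes\n";   // displays: utf-8: 5 bytes
    "raw: <<raw.length()>> bytes\n";      // displays: raw: 4 bytes
}
```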
packBytes(format, ...)
local s = String.packBytes('s*', 1, 2, 3);
format is a format string describing the packed formats for the values. The remaining arguments are the values to be packed.
The return value is a new String object containing the packed bytes. The bytes are represented as characters, so each character in the new string will have a Unicode value from 0 to 255.
There are a couple of uses for packing bytes into strings. One is when you want to create a packed byte list that will eventually find its way into a file or other external object, but you need to create a temporary version in memory first. Packing the bytes into a string can be a convenient way to accomplish this. Another potential use is for generating text for a structured text format, such as for spreadsheet input. The byte packer makes it easy to generate formats with fixed-width text fields.
Note that the string returned might not be particularly human-readable, since many format codes generate binary byte values that will look like random gibberish if displayed.
See Byte Packing for more information.
sha256()
Secure hashes are useful when you want to store or transmit information in such a way that another party can prove it knows the original information, without actually revealing the information. For example, passwords are often stored in a hashed format, because this prevents a third party who steals the password file from being able to recover the original password values, while still allowing password entries to be verified, by computing the matching hash value on an entered password.
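The password check described above can be sketched as follows. The function name checkPassword and the sample passwords are hypothetical; a real system would typically add further safeguards that are outside the scope of this method.

```tads3
#include <tads.h>

/* Returns true if the entered password hashes to the stored value. */
checkPassword(entered, storedHash)
{
    return entered.sha256() == storedHash;
}

main(args)
{
    /* computed when the password is set; only the hash is kept */
    local stored = 'open sesame'.sha256();

    "<<checkPassword('open sesame', stored) ? 'match' : 'no match'>>\n";
    "<<checkPassword('guess', stored) ? 'match' : 'no match'>>\n";
}
```

The first line displayed is "match" and the second is "no match", since equal inputs always produce equal hash values.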
specialsToHtml(stateObject?)
The main purpose of this function is to make it easier to port games between the traditional TADS user interface and the Web UI. The TADS formatting characters, such as '\n' and '\b', are specific to TADS - they can't be used directly in a standard Web browser. The Web UI uses a standard browser as the user interface, though, so to write a game for the Web UI, you must either avoid the TADS formatting codes and use only standard HTML, or translate strings containing the TADS codes into standard HTML. The former option would require a lot of work for an existing game; it's inconvenient even for new work, since most TADS authors are in the habit of using the TADS codes, and the TADS codes are more concise than the HTML equivalents. This function makes it easy to implement the translation approach, allowing you to continue to use TADS formatting codes even with the Web UI.
Note that game authors won't generally have to call this function directly, because the Web UI library will do this automatically in most cases. You should only need this function if you're writing a library extension or creating a custom Web UI window type.
stateObject is an optional object, of class SpecialsToHtml. This keeps track of the state of the output stream from one call to the next. Many of the TADS formatting codes are context-sensitive, so when you're writing a series of strings to the display, it's important to keep track of the global context across strings. You should use a separate object for each window or output stream.
If you omit the state object or pass nil, the function will treat each string as the start of a new stream, with no context from past calls to the function. To reset the stream context (for example, after clearing the window), call the state object's resetState() method.
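A rough sketch of how a state object might be threaded through successive calls is shown below. It assumes the SpecialsToHtml class is available to the program (for example, through the system library) and can be created with new; the results aren't shown, since they depend on the translation rules for each code.

```tads3
#include <tads.h>

main(args)
{
    /* one state object per window or output stream */
    local st = new SpecialsToHtml();

    /* the second call sees the line context left by the first */
    local html = 'First line\n'.specialsToHtml(st)
        + 'Second line\b'.specialsToHtml(st);

    /* e.g., after clearing the window: discard the saved context */
    st.resetState();

    /* no state object: the string is treated as a fresh stream */
    local standalone = 'Some text\n'.specialsToHtml();

    "<<html.length()>> and <<standalone.length()>> characters\n";
}
```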
The function performs the following translations:
- \n is translated to <BR> when it occurs within a line of text, or nothing if it occurs at the start of a new line (ensuring that it ends a line, but doesn't produce any blank lines).
- \b is translated to <BR> if it occurs at the start of a new line, or <BR><BR> if it occurs within a line of text (ensuring that it always produces one blank line).
- \^ is removed, but causes the next non-markup character to be converted to upper-case.
- \v is removed, but causes the next non-markup character to be converted to lower-case.
- \ (quoted space) is converted to &nbsp; if it's followed by another quoted space, or to a regular space if followed by anything else. This reproduces the standard TADS quoted space behavior: a quoted space doesn't combine with adjacent quoted spaces, but does allow a line break. Any regular space characters adjacent to the quoted space in the source string are removed.
- \t is converted to a series of &nbsp; characters followed by one regular space, sufficient to pad the current line to the next multiple of four characters in length. This is a very rough approximation of the way \t works in the console UI: the algorithm merely counts characters, and doesn't take into account font metrics.
(Because this doesn't take into account font metrics, it's mostly only useful with monospaced fonts. But if you're using the Web UI, you have access to the much greater capabilities of full HTML layout, so you shouldn't have much use for tabs anyway.)
- <Q>...</Q> tag sequences are converted to &ldquo;...&rdquo; and &lsquo;...&rsquo; sequences, alternating at each nesting level, with double quotes at the outermost level.
- <BR HEIGHT=N> tags are converted to a series of N <BR> tags when used at the start of a line, or N+1 <BR> tags when used within a line of text.
- <P>, <DIV>, <CENTER>, <TABLE>, <TD>, <TH>, and <CAPTION> tags (both open and close tags) are left exactly as given, but are recognized as line breaks for the purposes of translating \n, \b, and <BR HEIGHT=N>.
The stateObject value is for internal use within the function, and you shouldn't have to access its properties directly. For reference, though, they are:
- flags_ is an integer containing a number of bit fields:
- 0x0001 - on (non-zero) if the stream is in the midst of a line of text, off (zero) at the start of a new line
- 0x0002 - on if the capitalization flag ('\^') is pending
- 0x0004 - on if the lower-case flag ('\v') is pending
- 0x0008 - on if an HTML tag is in progress
- 0x0010 - on if within a double-quoted attribute value in an HTML tag
- 0x0020 - on if within a single-quoted attribute value in an HTML tag
- 0x0040 - on if the last character was an ordinary space
- 0x0080 - on if the last character was a quoted space ('\ ')
- 0x0100 - on if the <Q> tag quote nesting level is odd, off if the level is even (at even levels, double quotes are used; at odd levels, single quotes are used)
- 0x0200 - on if an HTML entity (& sequence) is in progress (only used with specialsToText())
- 0x3000 - the current tab stop column (shift this value right 12 bits to get the integer value; that is, compute (obj.flags_ & 0x3000) >> 12). This is the number of characters in the line since the last multiple of 4 columns.
- tag_ is a string containing the text of the tag in progress. When a string ends in mid-tag, this contains the fragment of the tag up to the end of the string, so that the next call can resume parsing the tag where the last call left off.
specialsToText(stateObject?)
stateObject has the same meaning as in specialsToHtml().
This function performs the following conversions:
- \n is translated to \n when it occurs within a line of text, or nothing if it occurs at the start of a new line (ensuring that it ends a line, but doesn't produce any blank lines).
- \b is translated to \n if it occurs at the start of a new line, or \n\n if it occurs within a line of text (ensuring that it always produces one blank line).
- \^ is removed, but causes the next non-markup character to be converted to upper-case.
- \v is removed, but causes the next non-markup character to be converted to lower-case.
- \ (quoted space) is converted to a regular space.
- <Q>...</Q> tag sequences are converted to "..." and '...' sequences, alternating at each nesting level, with double quotes at the outermost level.
- <BR HEIGHT=N> tags are converted to a series of N \n characters when used at the start of a line, or N+1 \n characters when used within a line of text.
- <P> is converted to \n if it appears at the start of a line, or \n\n within a line.
- <DIV>, <CENTER>, <TABLE>, <TD>, <TH>, and <CAPTION> tags (both open and close tags) are converted to \n.
- <Tag> for any tag not mentioned above is simply stripped out.
- &nbsp; is converted to a space.
- &gt; is converted to >.
- &lt; is converted to <.
- &amp; is converted to &.
- &quot;, &ldquo;, and &rdquo; are converted to " (a plain double-quote).
- &lsquo; and &rsquo; are converted to ' (a plain single-quote).
- &#dddd; (where the ds are digits) is converted to the Unicode character with value dddd.
splice(index, deleteLength, insertString?)
This function's effect can be achieved by concatenating together substrings of the original string, but splice() is more concise and somewhat clearer. It's also a little more efficient, since it bypasses the need to create the two intermediate substrings.
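The comparison above can be sketched as follows. This assumes, from the parameter names, that splice() deletes deleteLength characters starting at index and inserts insertString at that point; the substr() version is worked out from the substr() rules given in this document.

```tads3
#include <tads.h>

main(args)
{
    local s = 'abcdef';

    /* assumed behavior: delete 2 characters at index 3, insert 'XY' */
    local a = s.splice(3, 2, 'XY');

    /* the same edit built from two intermediate substrings */
    local b = s.substr(1, 2) + 'XY' + s.substr(5);

    "<<a>>\n";
    "<<b>>\n";   // displays: abXYef
}
```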
split(delim?, limit?)
delim is the delimiter, which can be one of the following:
- A string. split searches for exact matches to this substring within the subject string, and splits the string at each instance.
- A RexPattern object. The method searches for matches to the regular expression, and splits the string at each match found.
- An integer, which must be at least 1. The method splits the string into substrings of exactly this length (except that the last substring in the list might be shorter, since it'll have whatever's left over at the end).
If delim is omitted or nil, the default is 1 (i.e., split the string into one-character substrings).
limit is an optional integer giving the maximum number of elements to return in the result list. When the method reaches this limit, it stops searching and returns the remainder of the string after the last split as the final element of the list. If limit is 1, no splits are possible, so the result is simply a single-element list containing the entire original string. If limit is omitted or nil, the method splits the string at every instance of the delimiter without a limit.
The delimiter string or pattern isn't included in the result list.
Examples:
'one,two,three'.split(',') returns ['one', 'two', 'three']
'one,two, three, four'.split(new RexPattern(',<space>*'))
returns ['one', 'two', 'three', 'four']
'one,two,three'.split(',', 2) returns ['one', 'two,three']
'abcdefghi'.split(2) returns ['ab', 'cd', 'ef', 'gh', 'i']
startsWith(str)
substr(start, length?)
If start is negative, it indicates an offset from the end of the string: -1 indicates that the substring is to start at the last character, -2 at the second-to-last, and so on.
If length is negative, it indicates the number of characters to discard from the end of the string. With a length of -1, the result is the whole rest of the string starting at start minus the last character; with -2, it's the rest after start minus the last 2 characters; and so on.
Examples:
'abcdef'.substr(3) yields 'cdef'
'abcdef'.substr(3, 2) yields 'cd'
'abcdefghi'.substr(-3) yields 'ghi'
'abcdefghi'.substr(-3, 2) yields 'gh'
'abcdefghi'.substr(1, -1) yields 'abcdefgh'
'abcdefghi'.substr(2, -2) yields 'bcdefg'
'abcdefghi'.substr(4, -2) yields 'defg'
'abcdefghi'.substr(4, -4) yields 'de'
'abcdefghi'.substr(-4, -2) yields 'fg'
toLower()
toUnicode(idx?)
If the idx argument is provided, it specifies the character index within the string of the single character to convert (the first character is at index 1), and the method returns an integer containing the Unicode code point for the character at that index. A negative value for idx is an index from the end of the string: -1 is the last character, -2 the second to last, and so on.
If idx is omitted, the function returns a list of character codes. Each element in the list is an integer giving the Unicode code point value for the corresponding character in the source string. The list has one element per character in the source string.
This function can be used to decompose a string into its individual characters, which is sometimes an easier or more efficient way to manipulate a string. You can convert a list of Unicode code point values back into a string using the makeString() function in the tads-gen function set.
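As a small sketch of the round trip described above, the example below converts a string to its code points, adjusts each value, and rebuilds a string with makeString(). The one-step shift is just an arbitrary transformation chosen for illustration.

```tads3
#include <tads.h>

main(args)
{
    /* one code point per character: [97, 98, 99] */
    local codes = 'abc'.toUnicode();

    /* a single character, indexed from the end: 99 (the 'c') */
    local last = 'abc'.toUnicode(-1);

    /* shift each code point up by one and rebuild the string */
    local shifted = makeString(codes.mapAll({c: c + 1}));

    "<<last>>\n";      // displays: 99
    "<<shifted>>\n";   // displays: bcd
}
```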
toUpper()
unpackBytes(format)
The string must only contain characters with Unicode values from 0 to 255, since the unpacker treats each character as a byte value. If the unpacker encounters any characters outside this range, it'll throw an error.
format is the format string describing the byte formats of the values to unpack. The return value is a list containing the unpacked values.
The string to be unpacked can be one that you previously created with String.packBytes(), but it doesn't have to be. As long as the source string's characters are all in the 0-255 range, the unpacker will be able to interpret each character as a byte. For example, you can use this method to parse plain text strings that use fixed-width fields:
local lst = '123456'.unpackBytes('a3 a3');
This returns the list ['123', '456'].
See Byte Packing for more information.
urlDecode()
- %xx sequences are converted to the corresponding characters
- + is converted to a space character
- everything else is left as-is
In addition, the method converts any multi-byte %xx sequences that form valid UTF-8 characters into the corresponding characters. For example, the sequence %C3%A1 represents the character 'á', so '%C3%A1'.urlDecode() returns 'á'. Any %xx sequence that doesn't form a valid UTF-8 character is converted to '?'.
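One plausible use is decoding the parameters of a URL query string, sketched below in combination with split(). The helper name parseQuery and the sample query string are hypothetical. Each pair is split at the first '=' only, so that a value containing '=' stays intact.

```tads3
#include <tads.h>

/* Parse 'name=value&name=value' into a list of [name, value] pairs. */
parseQuery(qs)
{
    local result = [];

    foreach (local pair in qs.split('&'))
    {
        /* limit of 2: split at the first '=' only */
        local kv = pair.split('=', 2);
        local val = (kv.length() > 1 ? kv[2].urlDecode() : '');

        /* adding a one-element list appends the pair as one element */
        result += [[kv[1].urlDecode(), val]];
    }
    return result;
}

main(args)
{
    /* displays "q = hello world" and then "lang = en,fr" */
    foreach (local p in parseQuery('q=hello+world&lang=en%2Cfr'))
        "<<p[1]>> = <<p[2]>>\n";
}
```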
urlEncode()
- Plain ASCII letters, digits, '-', and '_' characters are left unchanged
- Space characters are converted to '+'
- Other characters are converted to their "%xx" representations
The "%xx" representation encodes a special character using its hexadecimal ASCII or Unicode byte value. For example, a comma "," is encoded as "%2C". Characters outside of the plain ASCII range (\u0000 to \u007F) are encoded using the multi-byte UTF-8 representation. For example, 'á' is encoded as '%C3%A1'.
Note that this method is appropriate for encoding components of URL strings, not entire URL strings. Applying this method to an entire URL would encode all of the scheme and path characters (such as the ":" and "//" in the "http://" prefix), which would make the string unusable as a URL. This method is intended only for encoding the building blocks of URL strings, such as the value portion of a "?name=value" query parameter.
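For instance, a query parameter value might be encoded and appended to a fixed URL prefix as in the sketch below. The host name and parameter name are placeholders; the result follows from the encoding rules listed above.

```tads3
#include <tads.h>

main(args)
{
    /* encode only the value, never the whole URL */
    local value = 'tea, coffee'.urlEncode();   // 'tea%2C+coffee'

    /* the scheme, host, and path are left exactly as written */
    local url = 'http://example.com/search?q=' + value;

    "<<url>>\n";   // displays: http://example.com/search?q=tea%2C+coffee
}
```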