Copyright © 1999, 2002 by Michael J. Roberts. Preliminary version 15 March 1999.
From | To | Binary Encoding |
---|---|---|
0x0000 | 0x007f | 0bbbbbbb |
0x0080 | 0x07ff | 110bbbbb 10bbbbbb |
0x0800 | 0xffff | 1110bbbb 10bbbbbb 10bbbbbb |
The bits of the 16-bit value are encoded into the b's in the table above, with the most significant group of bits in the first byte. For example, 0x571 has the 16-bit Unicode binary pattern 0000010101110001, and is encoded in UTF-8 as 11010101 10110001.
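The table can be read as a small algorithm: pick the row by value range, then distribute the bits from most significant to least significant. The following is a minimal sketch of that reading, not the VM's own code; the function name `encode_utf8_16` is hypothetical, and the sketch does not treat surrogate values specially.

```python
def encode_utf8_16(code: int) -> bytes:
    """Encode one 16-bit Unicode value using the three-row table above."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("expected a 16-bit value")
    if code <= 0x007F:
        # 0bbbbbbb: the value fits in seven bits
        return bytes([code])
    if code <= 0x07FF:
        # 110bbbbb 10bbbbbb: high five bits, then low six bits
        return bytes([0xC0 | (code >> 6), 0x80 | (code & 0x3F)])
    # 1110bbbb 10bbbbbb 10bbbbbb: high four bits, middle six, low six
    return bytes([
        0xE0 | (code >> 12),
        0x80 | ((code >> 6) & 0x3F),
        0x80 | (code & 0x3F),
    ])
```

For instance, `encode_utf8_16(0x00E9)` yields the two bytes 0xC3 0xA9, which fall in the second row of the table.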
Note that UTF-8 encodes the most significant bits in the first bytes; this differs from the byte ordering used for other types (such as integers), which are all stored in little-endian format (least-significant byte first). The reason for this disparity is that the UTF-8 ordering makes it possible to compare two UTF-8 strings byte by byte. (This type of magnitude comparison is not always especially useful, since it does not produce correct results for a localized sorting order, but it at least produces a uniform sorting order based on the Unicode code points stored in the string, and hence may be useful in certain cases for building internal indices and tables.)
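The ordering property can be checked directly. The sketch below is only an illustration, with a hypothetical helper name and sample strings chosen for this example; it sorts the same strings once by their UTF-8 bytes and once by their code points, and the two orders agree.

```python
def sort_by_utf8_bytes(strings):
    """Sort strings by comparing their UTF-8 encodings byte by byte."""
    return sorted(strings, key=lambda s: s.encode("utf-8"))


def sort_by_code_points(strings):
    """Sort strings by comparing their sequences of Unicode code points."""
    return sorted(strings, key=lambda s: [ord(ch) for ch in s])


# Sample data (an assumption for illustration): one-, two-, and three-byte
# characters mixed together.
samples = ["z", "\u00e9", "a", "\u20ac", "ab"]
same_order = sort_by_utf8_bytes(samples) == sort_by_code_points(samples)
```

Here `same_order` is true: the byte-wise order is the code point order, which is not necessarily the order a reader in any particular locale would expect.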
16-bit integers are stored as 2-byte arrays. The first byte has the low-order 8 bits, the second byte has the high-order 8 bits.
32-bit integers are stored as 4-byte arrays. The first byte has the low-order 8 bits, the second byte has the next more significant 8 bits, the third byte has the next more significant 8 bits, and the fourth byte has the most significant 8 bits.
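A minimal sketch of these integer layouts follows. The function names are hypothetical, and the use of two's complement for signed values is an assumption of this sketch; the text above specifies only the byte order.

```python
def write_uint2(value: int) -> bytes:
    """Store an unsigned 16-bit value, low-order byte first."""
    return value.to_bytes(2, "little")


def write_uint4(value: int) -> bytes:
    """Store an unsigned 32-bit value, low-order byte first."""
    return value.to_bytes(4, "little")


def write_int4(value: int) -> bytes:
    """Store a signed 32-bit value (two's complement is assumed here)."""
    return value.to_bytes(4, "little", signed=True)


def read_uint2(data: bytes, offset: int = 0) -> int:
    """Read an unsigned 16-bit value from two bytes at the given offset."""
    return int.from_bytes(data[offset:offset + 2], "little")


def read_uint4(data: bytes, offset: int = 0) -> int:
    """Read an unsigned 32-bit value from four bytes at the given offset."""
    return int.from_bytes(data[offset:offset + 4], "little")


def read_int4(data: bytes, offset: int = 0) -> int:
    """Read a signed 32-bit value (two's complement is assumed here)."""
    return int.from_bytes(data[offset:offset + 4], "little", signed=True)
```

With these helpers, the 16-bit value 0x1234 is stored as the bytes 0x34 0x12, and the 32-bit value 0x12345678 as 0x78 0x56 0x34 0x12.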
In order to store these "variant" types, the VM defines a composite type called a data holder. This composite contains the type information along with the value.
To store a data holder portably, we store a 5-byte array. The first byte contains the type ID value. The remaining 4 bytes encode the value using the standard primitive type encodings; the table below shows the correspondence between the primitive types and their encodings.
When an encoding does not take up the full 4 bytes, the value is packed into the earlier bytes, and the later bytes have arbitrary values. For example, a property ID is encoded in a data holder as follows:
Byte Index | Value |
---|---|
0 | 6 (the type code for VM_PROP) |
1 | low-order 8 bits of property ID value |
2 | high-order 8 bits of property ID value |
3 | arbitrary |
4 | arbitrary |
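The byte layout in this table could be produced as in the sketch below. The function name is hypothetical, and writing zeros into the two unused bytes is simply a choice made here, since the format allows any values in those positions.

```python
VM_PROP = 6  # type code for a property ID


def encode_prop_holder(prop_id: int) -> bytes:
    """Build the 5-byte data holder for a 16-bit property ID."""
    if not 0 <= prop_id <= 0xFFFF:
        raise ValueError("property ID must fit in 16 bits")
    return (
        bytes([VM_PROP])               # byte 0: type code
        + prop_id.to_bytes(2, "little")  # bytes 1-2: low byte, then high byte
        + b"\x00\x00"                  # bytes 3-4: arbitrary (zero here)
    )
```

For example, `encode_prop_holder(0x0102)` gives the five bytes 0x06 0x02 0x01 0x00 0x00. A reader must ignore bytes 3 and 4 rather than expect zeros there.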
Type Name | Type ID | Description | Value Encoding |
---|---|---|---|
VM_NIL | 1 | nil (boolean "false" or null pointer) | none |
VM_TRUE | 2 | boolean "true" | none |
VM_STACK | 3 | Reserved for implementation use for storing native machine pointers to stack frames (see note below) | none |
VM_CODEPTR | 4 | Reserved for implementation use for storing native machine pointers to code (see note below) | none |
VM_OBJ | 5 | object reference as a 32-bit unsigned object ID number | UINT4 |
VM_PROP | 6 | property ID as a 16-bit unsigned number | UINT2 |
VM_INT | 7 | integer as a 32-bit signed number | INT4 |
VM_SSTRING | 8 | single-quoted string; 32-bit unsigned constant pool offset | UINT4 |
VM_DSTRING | 9 | double-quoted string; 32-bit unsigned constant pool offset | UINT4 |
VM_LIST | 10 | list constant; 32-bit unsigned constant pool offset | UINT4 |
VM_CODEOFS | 11 | code offset; 32-bit unsigned code pool offset | UINT4 |
VM_FUNCPTR | 12 | function pointer; 32-bit unsigned code pool offset | UINT4 |
VM_EMPTY | 13 | no value (this is useful in some cases to represent an explicitly unused data slot, such as a slot that has never been initialized) | none |
VM_NATIVE_CODE | 14 | Reserved for implementation use for storing native machine pointers to native code (see note below) | none |
VM_ENUM | 15 | enumerated constant; 32-bit unsigned integer | UINT4 |
Note that types 3 (VM_STACK), 4 (VM_CODEPTR), and 14 (VM_NATIVE_CODE) are reserved for implementation use. These will never appear in a portable binary file; we list them here only for completeness. These types are intended to allow implementations to store native datatypes (such as native machine pointers) for which there is no meaningful portable representation. Implementations are free to use these types for any purposes of their own; the names and descriptions in the table are for mnemonic value only and shouldn't be taken to imply a required use for these types.
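Reading a data holder back is the reverse process: look up the type ID, then decode only as many value bytes as that type uses. The following is a hedged sketch based on the type table above; the names `HOLDER_TYPES`, `RESERVED_TYPE_IDS`, and `decode_data_holder` are hypothetical, rejecting reserved types is a policy choice for a reader of portable files, and two's complement is assumed for INT4.

```python
import struct

# Type ID -> (name, struct format for the value, or None when there is no value).
# "<H" is little-endian UINT2, "<I" is UINT4, "<i" is INT4.
HOLDER_TYPES = {
    1: ("VM_NIL", None),
    2: ("VM_TRUE", None),
    5: ("VM_OBJ", "<I"),
    6: ("VM_PROP", "<H"),
    7: ("VM_INT", "<i"),
    8: ("VM_SSTRING", "<I"),
    9: ("VM_DSTRING", "<I"),
    10: ("VM_LIST", "<I"),
    11: ("VM_CODEOFS", "<I"),
    12: ("VM_FUNCPTR", "<I"),
    13: ("VM_EMPTY", None),
    15: ("VM_ENUM", "<I"),
}

# Types that never appear in a portable binary file.
RESERVED_TYPE_IDS = {3, 4, 14}


def decode_data_holder(buf: bytes):
    """Return (type name, value) for a 5-byte portable data holder."""
    if len(buf) != 5:
        raise ValueError("a data holder is exactly 5 bytes")
    type_id = buf[0]
    if type_id in RESERVED_TYPE_IDS:
        raise ValueError("reserved type ID in portable data")
    if type_id not in HOLDER_TYPES:
        raise ValueError("unknown type ID")
    name, fmt = HOLDER_TYPES[type_id]
    if fmt is None:
        return name, None  # no value; the remaining bytes are ignored
    (value,) = struct.unpack_from(fmt, buf, 1)  # trailing bytes are ignored
    return name, value
```

For example, the bytes 0x06 0x02 0x01 0xAA 0xBB decode to `("VM_PROP", 0x0102)`; the last two bytes are never inspected.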
Name | Description |
---|---|
SBYTE | Signed 8-bit byte |
UBYTE | Unsigned 8-bit byte |
UTF8 | Unicode text encoded as UTF-8 |
INT2 | Signed 16-bit (2-byte) integer |
UINT2 | Unsigned 16-bit (2-byte) integer |
INT4 | Signed 32-bit (4-byte) integer |
UINT4 | Unsigned 32-bit (4-byte) integer |
DATA_HOLDER | Data holder for any primitive type |