Definitions
Unicode
References to Unicode in this document refer to the Unicode standard, version 16.0.
References to the Unicode character database refer to version 16.0.0.
NFC normalisation
References to NFC-normalised strings refer to Unicode's Normalization Form C, defined in Unicode Standard Annex #15.
Byte
For the purposes of this document, byte means the same thing as Rust's u8
(corresponding to a natural number in the range 0 to 255 inclusive).
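As a minimal sketch of this definition, the following shows that only natural numbers in the range 0 to 255 convert to a u8. The helper name to_byte is hypothetical and is not used elsewhere in this document.

```rust
use std::convert::TryFrom;

/// Hypothetical helper: converts a natural number to a byte,
/// returning `None` when the number is outside the range 0 to 255 inclusive.
fn to_byte(n: u32) -> Option<u8> {
    u8::try_from(n).ok()
}
```

Under this sketch, to_byte(255) yields a byte and to_byte(256) yields None, matching the bounds u8::MIN and u8::MAX.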
Character
For the purposes of this document, character means the same thing as Rust's char.
That means, in particular:
- there's exactly one character for each Unicode scalar value
- the things that Unicode calls "noncharacters" are characters
- there are no characters corresponding to surrogate code points
- there is a character for each unassigned code point
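The properties listed above can be checked with the standard library's char::from_u32, which returns None exactly when a code point has no corresponding char. This is a sketch only. The helper names character_for and character_count are hypothetical, and the claim that U+0378 is unassigned is an assumption about Unicode 16.0.

```rust
/// Hypothetical helper: returns the character for `code_point`, or `None`
/// if no character exists for it (surrogate code points and values above
/// U+10FFFF).
fn character_for(code_point: u32) -> Option<char> {
    char::from_u32(code_point)
}

/// Hypothetical helper: counts all characters by trying every code point
/// from U+0000 to U+10FFFF.
fn character_count() -> usize {
    (0..=0x10FFFF_u32).filter_map(char::from_u32).count()
}
```

For instance, character_for(0xFFFE) (a Unicode "noncharacter") and character_for(0x0378) (assumed unassigned) both return a character, while character_for(0xD800) (a surrogate code point) returns None. The count is 0x110000 code points minus the 0x800 surrogates.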
Notation for characters
This document identifies characters in the following ways:
Printable ASCII characters other than space are represented by themselves
using highlighting like a.
For example, \ represents the character U+005C (REVERSE SOLIDUS).
ASCII control characters and space are represented as follows:
Scalar value | Notation
U+0000 | NUL
U+000A | LF
U+000D | CR
U+0009 | HT
U+0020 | SP
Other characters are identified by hexadecimal scalar value and name,
for example U+FEFF (BYTE ORDER MARK).
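The notation rules above could be sketched as a function from a character to its written form. The function name describe is hypothetical. Because Unicode character names come from the Unicode character database, which Rust's standard library does not expose, this sketch prints only the hexadecimal scalar value for the "other characters" case and omits the name.

```rust
/// Hypothetical helper: renders a character using this document's notation.
/// Character names are omitted because the standard library has no access
/// to the Unicode character database.
fn describe(c: char) -> String {
    match c {
        // ASCII control characters and space that have a listed abbreviation.
        '\0' => "NUL".to_string(),
        '\n' => "LF".to_string(),
        '\r' => "CR".to_string(),
        '\t' => "HT".to_string(),
        ' ' => "SP".to_string(),
        // Printable ASCII characters other than space represent themselves.
        '!'..='~' => c.to_string(),
        // Everything else is identified by hexadecimal scalar value.
        _ => format!("U+{:04X}", c as u32),
    }
}
```

With this sketch, describe('\\') gives the single character \, and describe('\u{FEFF}') gives U+FEFF.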
Sequence
When this document refers to a sequence of items, it means a finite, but possibly empty, ordered list of those items.
The phrases "character sequence" and "sequence of characters" mean the same thing.