Definitions

Unicode
- NFC normalisation
Byte
Character
- Notation for characters
Sequence

Unicode

References to Unicode in this document refer to the Unicode standard, version 16.0.

References to the Unicode character database refer to version 16.0.0.

NFC normalisation

References to NFC-normalised strings refer to Unicode's Normalization Form C, as defined in Unicode Standard Annex #15.

Byte

For the purposes of this document, byte means the same thing as Rust's u8 (corresponding to a natural number in the range 0 to 255 inclusive).
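
As a small illustration of that range, the sketch below converts a natural number to a byte only when it fits. The helper name `to_byte` is hypothetical and not part of this document's terminology.

```rust
use std::convert::TryFrom;

/// Hypothetical helper: returns the byte for `n`, or `None` if `n`
/// lies outside the range 0 to 255 inclusive.
fn to_byte(n: u32) -> Option<u8> {
    u8::try_from(n).ok()
}

fn main() {
    assert_eq!(to_byte(0), Some(u8::MIN));
    assert_eq!(to_byte(255), Some(u8::MAX));
    // 256 is not a byte.
    assert_eq!(to_byte(256), None);
}
```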

Character

For the purposes of this document, character means the same thing as Rust's char. That means, in particular:

- there's exactly one character for each Unicode scalar value
- the things that Unicode calls "noncharacters" are characters
- there are no characters corresponding to surrogate code points
- there is a character for each unassigned code point
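
The points above can be checked against Rust's standard library. The following is a minimal sketch; `is_character` is a hypothetical helper name, and the sample code points were chosen for illustration only.

```rust
/// Hypothetical helper: true if the code point `cp` corresponds to a
/// character in this document's sense, that is, a Rust `char`.
fn is_character(cp: u32) -> bool {
    char::from_u32(cp).is_some()
}

fn main() {
    // An ordinary scalar value: U+0041 (LATIN CAPITAL LETTER A).
    assert!(is_character(0x0041));
    // U+FFFE is a Unicode "noncharacter", but it is still a character here.
    assert!(is_character(0xFFFE));
    // Surrogate code points (U+D800 to U+DFFF) have no character.
    assert!(!is_character(0xD800));
    assert!(!is_character(0xDFFF));
    // U+50000 lies in a plane with no assigned code points in Unicode 16.0.
    assert!(is_character(0x50000));
    // Values above U+10FFFF are not code points at all.
    assert!(!is_character(0x110000));

    // Every code point except the 0x800 surrogates has exactly one character.
    let count = (0..=0x10FFFFu32).filter(|&cp| is_character(cp)).count();
    assert_eq!(count, 0x110000 - 0x800);
}
```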

Notation for characters

This document identifies characters in the following ways:

Printable ASCII characters other than space are represented by themselves, using highlighting like `a`. For example, `\` represents the character U+005C (REVERSE SOLIDUS).

ASCII control characters and space are represented as follows:


`U+0000`	`NUL`
`U+0009`	`HT`
`U+000A`	`LF`
`U+000D`	`CR`
`U+0020`	`SP`

Other characters are identified by hexadecimal scalar value and name, for example U+FEFF (BYTE ORDER MARK).
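
To make the three cases concrete, here is a rough sketch of a function that renders a character in approximately this notation. The name `notate` is hypothetical. Because Rust's standard library has no access to Unicode character names, the sketch omits the name for the last case and prints only the scalar value.

```rust
/// Hypothetical helper: renders `c` roughly in this document's notation.
/// Character names are omitted because `std` has no Unicode name data.
fn notate(c: char) -> String {
    match c {
        // The ASCII control characters and space listed in the table.
        '\0' => "NUL".to_string(),
        '\t' => "HT".to_string(),
        '\n' => "LF".to_string(),
        '\r' => "CR".to_string(),
        ' ' => "SP".to_string(),
        // Printable ASCII other than space (U+0021 to U+007E)
        // is represented by itself.
        '!'..='~' => c.to_string(),
        // Everything else: hexadecimal scalar value, at least four digits.
        _ => format!("U+{:04X}", c as u32),
    }
}

fn main() {
    assert_eq!(notate('a'), "a");
    assert_eq!(notate('\\'), "\\");
    assert_eq!(notate('\n'), "LF");
    assert_eq!(notate('\u{FEFF}'), "U+FEFF");
}
```

A fuller version would look the name up in the Unicode character database, which needs an external data source.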

Sequence

When this document refers to a sequence of items, it means a finite, but possibly empty, ordered list of those items.
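
One way to picture a sequence in Rust is as a `Vec`. That is an illustrative choice rather than something this document prescribes, and the helper `char_sequence` below is hypothetical.

```rust
/// Hypothetical helper: the sequence of characters making up `s`.
fn char_sequence(s: &str) -> Vec<char> {
    s.chars().collect()
}

fn main() {
    // A sequence may be empty.
    assert!(char_sequence("").is_empty());
    // A sequence is ordered: the same items in another order
    // form a different sequence.
    assert_eq!(char_sequence("ab"), vec!['a', 'b']);
    assert_ne!(char_sequence("ab"), char_sequence("ba"));
}
```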

"character sequence" and "sequence of characters" are different ways of saying the same thing.

Writeup of Rust's lexer