Overview

The following processes might be considered part of Rust's lexer:

  • Decode: interpret UTF-8 input as a sequence of Unicode characters
  • Clean (see the sketch after this list):
    • Byte order mark removal
    • CRLF normalisation
    • Shebang removal
  • Tokenise: interpret the characters as ("fine-grained") tokens
  • Further processing: to fit the needs of later parts of the spec
    • For example, convert fine-grained tokens to compound tokens
    • This processing may differ between the grammar and the two macro implementations
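
To make the first two stages concrete, here is a minimal Rust sketch of the "Decode" and "Clean" steps, assuming the input arrives as raw bytes. The function `decode_and_clean` and its exact behaviour (particularly around shebang handling) are illustrative assumptions, not a description of what rustc actually does.

```rust
use std::str;

/// Hypothetical helper covering the "Decode" and "Clean" steps, assuming the
/// source arrives as raw bytes. The name and exact rules are illustrative.
fn decode_and_clean(bytes: &[u8]) -> Result<String, str::Utf8Error> {
    // Decode: interpret the input as UTF-8.
    let mut text = str::from_utf8(bytes)?.to_string();

    // Clean: remove a leading byte order mark, if present.
    if let Some(rest) = text.strip_prefix('\u{FEFF}') {
        text = rest.to_string();
    }

    // Clean: normalise CRLF line endings to LF.
    text = text.replace("\r\n", "\n");

    // Clean: remove a shebang line. (The real rules are more involved:
    // for example, `#![attribute]` at the start of a file is not a shebang.
    // This sketch ignores such cases.)
    if text.starts_with("#!") {
        match text.find('\n') {
            // Keep the newline so later line numbering is unaffected.
            Some(idx) => text = text[idx..].to_string(),
            None => text.clear(),
        }
    }

    Ok(text)
}
```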

This document attempts to completely describe the "Tokenise" process.