Overview
The following processes can be considered part of Rust's lexer:
- Decode: interpret UTF-8 input as a sequence of Unicode characters
- Clean (see the sketch at the end of this section):
- Byte order mark removal
- CRLF normalisation
- Shebang removal
- Tokenise: interpret the characters as ("fine-grained") tokens
- Further processing: to fit the needs of later parts of the spec
- For example, convert fine-grained tokens to compound tokens
- Possibly different for the grammar and the two macro implementations
This document attempts to completely describe the "Tokenise" process.
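As an illustration of the "Clean" steps listed above, here is a minimal sketch in Rust. It assumes the input has already been decoded from UTF-8 into a `&str`, and it deliberately simplifies the shebang rule (a leading `#!` that begins an inner attribute such as `#![feature(...)]` must not be removed); it is not rustc's implementation, and the function name is only illustrative.

```rust
// A sketch only: the name `clean` and the exact shebang handling are
// simplifications chosen for illustration, not rustc's implementation.
fn clean(input: &str) -> String {
    // Byte order mark removal: strip a single leading U+FEFF.
    let s = input.strip_prefix('\u{FEFF}').unwrap_or(input);

    // CRLF normalisation: treat "\r\n" as a plain "\n".
    let s = s.replace("\r\n", "\n");

    // Shebang removal (simplified): drop a leading `#!...` line, but keep it
    // if it could instead be the start of an inner attribute (`#![...]`).
    if s.starts_with("#!") && !s.starts_with("#![") {
        match s.find('\n') {
            Some(end) => s[end..].to_string(),
            None => String::new(),
        }
    } else {
        s
    }
}

fn main() {
    let src = "\u{FEFF}#!/usr/bin/env rust\r\nfn main() {}\r\n";
    // The BOM, the shebang text, and the CRLFs are gone; everything else survives.
    assert_eq!(clean(src), "\nfn main() {}\n");
}
```

The function returns an owned `String` because CRLF normalisation may need to produce text that differs from the original input.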