Processing that happens before tokenising

This document's description of tokenising takes a sequence of characters as input.

rustc obtains that sequence of characters from a source file by applying the following steps in order.

Note: this description is taken from the Input format chapter of the Reference.

Source encoding

Each source file is interpreted as a sequence of Unicode characters encoded in UTF-8. It is an error if the file is not valid UTF-8.
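As a minimal sketch of this step (not rustc's actual code), the standard library's UTF-8 validation can stand in for the check. The function name decode_source is hypothetical and used only for illustration.

```rust
/// Decode the raw bytes of a source file into a string.
/// Returns an error if the bytes are not valid UTF-8.
/// `decode_source` is a hypothetical name, not part of rustc.
fn decode_source(bytes: Vec<u8>) -> Result<String, std::string::FromUtf8Error> {
    // `String::from_utf8` validates the whole buffer and fails on the
    // first invalid byte sequence.
    String::from_utf8(bytes)
}
```

A file containing a stray byte such as 0xFF would be rejected here, which corresponds to the error described above.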

Byte order mark removal

If the first character in the sequence is U+FEFF (BYTE ORDER MARK), it is removed.
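This step could be sketched as follows. The name strip_bom is hypothetical. Only a single leading U+FEFF is removed, matching the wording above; any later occurrence is left alone.

```rust
/// Remove one leading U+FEFF (BYTE ORDER MARK), if present.
/// `strip_bom` is a hypothetical name used for illustration.
fn strip_bom(src: &str) -> &str {
    // `strip_prefix` returns `None` when the prefix is absent,
    // in which case the input is returned unchanged.
    src.strip_prefix('\u{FEFF}').unwrap_or(src)
}
```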

CRLF normalisation

Each pair of characters U+000D CR immediately followed by U+000A LF is replaced by a single U+000A LF.

Other occurrences of the character U+000D CR are left in place (they are treated as whitespace).

Note: the sequence CR LF can still reach the tokeniser. That happens if the source file contains the sequence CR CR LF: only the second CR is paired with the LF and replaced, so the first CR ends up immediately before the resulting LF.
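A minimal sketch of the normalisation, assuming a single left-to-right pass over the text (the name normalise_crlf is hypothetical). It also illustrates the CR CR LF case from the note.

```rust
/// Replace every CR LF pair with a single LF, in one left-to-right pass.
/// Lone CR characters are left in place.
/// `normalise_crlf` is a hypothetical name used for illustration.
fn normalise_crlf(src: &str) -> String {
    // `str::replace` substitutes non-overlapping matches and does not
    // rescan its own output, so "\r\r\n" becomes "\r\n", not "\n".
    src.replace("\r\n", "\n")
}
```

Because the output is not rescanned, an input of CR CR LF yields CR LF, which is then passed on to the tokeniser.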

Shebang removal

If the remaining sequence begins with the characters #!, the characters up to and including the first U+000A LF are removed from the sequence.

For example, the first line of the following file would be ignored:

#!/usr/bin/env rustx

fn main() {
    println!("Hello!");
}

As an exception, if the #! characters are followed (ignoring intervening comments or whitespace) by a [ punctuation token, nothing is removed. This prevents an inner attribute at the start of a source file being removed.
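The rule and its exception could be sketched roughly as below. This is a simplified illustration, not rustc's implementation, and all three function names are hypothetical. It makes several assumptions: whitespace is whatever str::trim_start strips, comments are `//` line comments and nested `/* */` block comments (doc comments are not treated specially, which may differ from the real lexer), and a shebang line with no LF at all is removed entirely, a case the text above does not address.

```rust
/// Skip the body of a block comment whose opening `/*` has already been
/// consumed. Handles nesting. Returns `None` if the comment is unterminated.
fn skip_block_comment(s: &str) -> Option<&str> {
    let bytes = s.as_bytes();
    let mut depth = 1usize;
    let mut i = 0;
    while i + 1 < bytes.len() {
        match (bytes[i], bytes[i + 1]) {
            (b'/', b'*') => {
                depth += 1;
                i += 2;
            }
            (b'*', b'/') => {
                depth -= 1;
                i += 2;
                if depth == 0 {
                    // `i` follows an ASCII byte, so it is a char boundary.
                    return Some(&s[i..]);
                }
            }
            _ => i += 1,
        }
    }
    None
}

/// Skip leading whitespace and comments (simplified; see assumptions above).
fn skip_trivia(mut s: &str) -> &str {
    loop {
        let trimmed = s.trim_start();
        if let Some(after) = trimmed.strip_prefix("//") {
            // A line comment runs to the next LF, or to the end of input.
            s = after.find('\n').map_or("", |i| &after[i..]);
        } else if let Some(after) = trimmed.strip_prefix("/*") {
            match skip_block_comment(after) {
                Some(rest) => s = rest,
                None => return trimmed,
            }
        } else {
            return trimmed;
        }
    }
}

/// Remove a leading shebang line, unless `#!` begins an inner attribute.
fn strip_shebang(src: &str) -> &str {
    let Some(rest) = src.strip_prefix("#!") else {
        return src; // no `#!` at the start: nothing to do
    };
    // Exception: `#!` followed by `[` (ignoring whitespace and comments).
    if skip_trivia(rest).starts_with('[') {
        return src;
    }
    // Remove everything up to and including the first LF.
    // Assumption: with no LF present, the whole input is removed.
    match src.find('\n') {
        Some(lf) => &src[lf + 1..],
        None => "",
    }
}
```

With this sketch, the example file above loses its first line, while a file starting with `#![allow(unused)]` is returned unchanged.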

See open question: How to model shebang removal

