Processing that happens before tokenising
This document's description of tokenising takes a sequence of characters as input.
rustc obtains that sequence of characters as follows:
This description is taken from the Input format chapter of the Reference.
Source encoding
Each source file is interpreted as a sequence of Unicode characters encoded in UTF-8. It is an error if the file is not valid UTF-8.
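As a minimal sketch of this step, the standard library's UTF-8 validation can stand in for the decoding. The function name decode_source is hypothetical and is not taken from rustc.

```rust
/// Interprets the raw bytes of a source file as UTF-8 text.
/// `decode_source` is a hypothetical name used only for illustration.
/// Returns an error if the bytes are not valid UTF-8.
fn decode_source(bytes: &[u8]) -> Result<&str, std::str::Utf8Error> {
    std::str::from_utf8(bytes)
}
```

A caller would report the error and stop, since the later steps are only defined for a valid sequence of characters.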
Byte order mark removal
If the first character in the sequence is U+FEFF (BYTE ORDER MARK), it is removed.
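One way to sketch this rule is shown here. The name strip_bom is hypothetical. Only a single leading U+FEFF is removed, matching the wording above.

```rust
/// Removes one leading U+FEFF (BYTE ORDER MARK), if present.
/// `strip_bom` is a hypothetical name used only for illustration.
fn strip_bom(source: &str) -> &str {
    // `strip_prefix` removes at most one occurrence of the prefix.
    source.strip_prefix('\u{FEFF}').unwrap_or(source)
}
```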
CRLF normalization
Each pair of characters U+000D CR immediately followed by U+000A LF is replaced by a single U+000A LF.
Other occurrences of the character U+000D CR are left in place (they are treated as whitespace).
Note: this document's description of tokenisation doesn't assume that the sequence CRLF never appears in its input; that makes it more general than necessary, but should do no harm.
In particular, in places where the Reference says that tokens may not contain "lone CR", this description just says that any CR is rejected.
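The normalization can be sketched as a single left-to-right replacement. The name normalize_crlf is hypothetical, and the single-pass reading of the rule is an assumption.

```rust
/// Replaces each CR immediately followed by LF with a single LF.
/// Any other CR is left in place.
/// `normalize_crlf` is a hypothetical name used only for illustration.
fn normalize_crlf(source: &str) -> String {
    // `str::replace` scans once, left to right, without rescanning its output.
    source.replace("\r\n", "\n")
}
```

Under this single-pass reading, the input CR CR LF becomes CR LF, so the sequence CRLF can still occur in the output. The generality mentioned in the note above therefore costs nothing.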
Shebang removal
If the remaining sequence begins with the characters #!, the characters up to and including the first U+000A LF are removed from the sequence.
For example, the first line of the following file would be ignored:
#!/usr/bin/env rustx
fn main() {
    println!("Hello!");
}
As an exception, if the #! characters are followed (ignoring intervening comments or whitespace) by a [ punctuation token, nothing is removed.
This prevents an inner attribute at the start of a source file being removed.
See open question: How to model shebang removal
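The shebang rule and its exception can be sketched as follows. This is an illustration under stated assumptions, not rustc's implementation. All function names are hypothetical. Whitespace is approximated with char::is_whitespace, every // and /* */ form (including doc comments) is treated as a comment, and an input with no LF after #! is removed entirely.

```rust
/// Skips the body of a (possibly nested) block comment whose opening
/// `/*` has already been consumed. Hypothetical helper.
fn skip_block_comment(mut s: &str) -> &str {
    let mut depth = 1;
    while depth > 0 {
        match (s.find("/*"), s.find("*/")) {
            // A nested opener comes first: go one level deeper.
            (Some(open), Some(close)) if open < close => {
                depth += 1;
                s = &s[open + 2..];
            }
            // A closer comes first: leave one level.
            (_, Some(close)) => {
                depth -= 1;
                s = &s[close + 2..];
            }
            // Unterminated comment: assume it consumes everything.
            (_, None) => return "",
        }
    }
    s
}

/// Skips leading whitespace and comments. Hypothetical helper.
/// Simplification: uses `char::is_whitespace` and does not treat
/// doc comments specially.
fn skip_trivia(mut s: &str) -> &str {
    loop {
        let trimmed = s.trim_start();
        if let Some(rest) = trimmed.strip_prefix("//") {
            // A line comment runs up to, but not including, the next LF.
            s = match rest.find('\n') {
                Some(i) => &rest[i..],
                None => "",
            };
        } else if let Some(rest) = trimmed.strip_prefix("/*") {
            s = skip_block_comment(rest);
        } else {
            return trimmed;
        }
    }
}

/// Removes a leading shebang line unless `#!` begins an inner attribute.
fn strip_shebang(source: &str) -> &str {
    let after = match source.strip_prefix("#!") {
        Some(after) => after,
        None => return source,
    };
    // Exception: `#!` followed by `[` is left alone.
    if skip_trivia(after).starts_with('[') {
        return source;
    }
    match source.find('\n') {
        // Remove everything up to and including the first LF.
        Some(i) => &source[i + 1..],
        // Assumption: with no LF, the whole input is removed.
        None => "",
    }
}
```

With this sketch, the example file above loses its first line, while a file beginning with #![allow(unused)] is returned unchanged.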