Processing that happens before tokenising
This document's description of tokenising takes a sequence of characters as input. rustc obtains that sequence of characters as follows:
This description is taken from the Input format chapter of the Reference.
Source encoding
Each source file is interpreted as a sequence of Unicode characters encoded in UTF-8. It is an error if the file is not valid UTF-8.
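As a rough illustration of this step, the sketch below decodes raw file bytes with the standard library's UTF-8 validation. The function name decode_source is hypothetical, and this is not a description of how rustc itself reads files.

```rust
use std::str::Utf8Error;

/// Hypothetical helper: interpret the raw bytes of a source file as UTF-8.
/// Returns an error if the bytes are not valid UTF-8, mirroring the rule
/// that an invalid file is rejected.
fn decode_source(bytes: &[u8]) -> Result<&str, Utf8Error> {
    std::str::from_utf8(bytes)
}
```

A caller would pass the bytes read from disk and report the error case as a failure to load the file.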
Byte order mark removal
If the first character in the sequence is U+FEFF (BYTE ORDER MARK), it is removed.
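A minimal sketch of this step, assuming the input has already been decoded to a string slice. The name strip_bom is made up for illustration.

```rust
/// Hypothetical helper: drop a single leading U+FEFF, if present.
/// Only the first character is examined; a BOM anywhere else is kept.
fn strip_bom(source: &str) -> &str {
    source.strip_prefix('\u{FEFF}').unwrap_or(source)
}
```

Because strip_prefix removes at most one occurrence, a second U+FEFF immediately after the first stays in the sequence.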
CRLF normalisation
Each pair of characters U+000D CR immediately followed by U+000A LF is replaced by a single U+000A LF.
Other occurrences of the character U+000D CR are left in place (they are treated as whitespace).
Note: it's still possible for the sequence CRLF to be passed on to the tokeniser: that will happen if the source file contained the sequence CRCRLF.
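The normalisation, including the CRCRLF case from the note, can be sketched with a single left-to-right replacement. The name normalise_crlf is hypothetical; the sketch relies on str::replace processing non-overlapping matches from the start of the string.

```rust
/// Hypothetical helper: replace each CR LF pair with a single LF.
/// A lone CR is left untouched. The result is not re-scanned, so the
/// input CR CR LF becomes CR LF rather than LF.
fn normalise_crlf(source: &str) -> String {
    source.replace("\r\n", "\n")
}
```

Since the replacement is a single pass, the CR that precedes a CRLF pair survives and ends up directly before the new LF, which is how a CRLF sequence can still reach the tokeniser.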
Shebang removal
If the remaining sequence begins with the characters #!, the characters up to and including the first U+000A LF are removed from the sequence.
For example, the first line of the following file would be ignored:
#!/usr/bin/env rustx
fn main() {
    println!("Hello!");
}
As an exception, if the #! characters are followed (ignoring intervening comments or whitespace) by a [ punctuation token, nothing is removed.
This prevents an inner attribute at the start of a source file being removed.
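The rule and its exception can be approximated as below. This is a simplified sketch, not rustc's implementation: the names strip_shebang and skip_trivia are hypothetical, whitespace is approximated with trim_start (Unicode White_Space, which is close to but not the same as Rust's whitespace set), every // and /* */ comment is treated as skippable (doc comments are not distinguished), and an input with no LF is assumed to be removed entirely.

```rust
/// Hypothetical helper: skip leading whitespace, line comments, and
/// (possibly nested) block comments, returning what follows them.
fn skip_trivia(mut rest: &str) -> &str {
    loop {
        let trimmed = rest.trim_start();
        if let Some(after) = trimmed.strip_prefix("//") {
            // Line comment: skip to just past the next LF, or to the end.
            rest = match after.find('\n') {
                Some(i) => &after[i + 1..],
                None => "",
            };
        } else if let Some(after) = trimmed.strip_prefix("/*") {
            // Block comment: scan bytes while tracking nesting depth.
            let bytes = after.as_bytes();
            let mut depth = 1usize;
            let mut i = 0;
            while depth > 0 && i < bytes.len() {
                if bytes[i..].starts_with(b"/*") {
                    depth += 1;
                    i += 2;
                } else if bytes[i..].starts_with(b"*/") {
                    depth -= 1;
                    i += 2;
                } else {
                    i += 1;
                }
            }
            // `i` is either just past an ASCII "*/" or at the end of the
            // input, so it always falls on a character boundary.
            rest = &after[i..];
        } else {
            return trimmed;
        }
    }
}

/// Hypothetical helper: remove a leading shebang line unless the `#!`
/// is followed (ignoring whitespace and comments) by `[`.
fn strip_shebang(source: &str) -> &str {
    let Some(after_marker) = source.strip_prefix("#!") else {
        return source;
    };
    if skip_trivia(after_marker).starts_with('[') {
        // Looks like the start of an inner attribute: remove nothing.
        return source;
    }
    match source.find('\n') {
        Some(i) => &source[i + 1..],
        None => "", // assumption: with no LF, the whole input is removed
    }
}
```

Checking only for a leading [ character after the skipped trivia is a shortcut that works here because [ is a single-character punctuation token; a fuller model would run the tokeniser itself on the text after #!.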
See open question: How to model shebang removal