Processing that happens before tokenising
This document's description of tokenising takes a sequence of characters as input.
That sequence of characters is derived from an input sequence of bytes by performing the steps listed below in order.
It is also possible for one of the steps below to determine that the input should be rejected, in which case tokenising does not take place.
Normally the input sequence of bytes is the contents of a single source file.
Decoding
The input sequence of bytes is interpreted as a sequence of characters represented using the UTF-8 Unicode encoding scheme.
If the input sequence of bytes is not a well-formed UTF-8 code unit sequence, the input is rejected.
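As an illustration only, the decoding step can be sketched with the standard library's strict UTF-8 validation. The function name `decode` is hypothetical and not part of this document's model.

```rust
/// Hypothetical helper: interpret the input bytes as UTF-8.
/// Returns an error (meaning "the input is rejected") if the bytes are
/// not a well-formed UTF-8 code unit sequence.
fn decode(input: &[u8]) -> Result<&str, std::str::Utf8Error> {
    // `std::str::from_utf8` validates the bytes without copying them.
    std::str::from_utf8(input)
}
```

`std::str::from_utf8` is strict: it fails on ill-formed input instead of substituting replacement characters, which matches the "rejected" outcome described above.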
Byte order mark removal
If the first character in the sequence is U+FEFF (BYTE ORDER MARK), it is removed.
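A minimal sketch of this step, assuming the input has already been decoded to a `&str`. The name `remove_bom` is hypothetical.

```rust
/// Hypothetical helper: drop a single leading U+FEFF, if present.
fn remove_bom(chars: &str) -> &str {
    // `strip_prefix` removes at most one occurrence, so a second
    // U+FEFF immediately after the first is left in place.
    chars.strip_prefix('\u{FEFF}').unwrap_or(chars)
}
```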
CRLF normalisation
Each pair of characters CR immediately followed by LF is replaced by a single LF character.
Note: It's not possible for two such pairs to overlap, so this operation is unambiguously defined.
Note: Other occurrences of the character CR are left in place. It's still possible for the sequence CRLF to be passed on to the tokeniser: that will happen if the input contained the sequence CRCRLF.
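The behaviour described in the two notes above can be checked with a short sketch. The name `normalise_crlf` is hypothetical; the sketch relies on `str::replace` scanning left to right over non-overlapping matches.

```rust
/// Hypothetical helper: replace each CR LF pair with a single LF.
/// A lone CR (one not immediately followed by LF) is left in place.
fn normalise_crlf(chars: &str) -> String {
    chars.replace("\r\n", "\n")
}
```

With this sketch, the input CR CR LF yields CR LF: only the second CR is part of a pair, so the first CR survives and ends up directly before the remaining LF.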
Shebang removal
Shebang removal is performed if:
- the remaining sequence begins with the characters #!; and
- the result of finding the first non-whitespace token with the characters following the #! as input is not a Punctuation token whose mark is the [ character.
If shebang removal is performed:
- the characters up to and including the first LF character are removed from the sequence; or
- if the sequence did not contain an LF character, all characters are removed from the sequence.
Note: The check for [ prevents an inner attribute at the start of the input being removed. See #70528 and #71487 for history.
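The following is a simplified sketch of shebang removal, not a faithful implementation. The name `remove_shebang` is hypothetical, and "finding the first non-whitespace token" is approximated by skipping characters for which `char::is_whitespace` is true. It does not run the tokeniser, so a comment between the #! and a following [ is not handled.

```rust
/// Hypothetical helper: remove a leading shebang line.
/// ASSUMPTION: the "first non-whitespace token" check is approximated by
/// skipping whitespace characters only; comments are not recognised.
fn remove_shebang(chars: &str) -> &str {
    // Condition 1: the sequence must begin with "#!".
    let Some(rest) = chars.strip_prefix("#!") else {
        return chars;
    };
    // Condition 2 (approximated): the next significant character must
    // not be '[', otherwise this looks like an inner attribute.
    if rest.trim_start().starts_with('[') {
        return chars;
    }
    // Remove everything up to and including the first LF,
    // or everything if there is no LF at all.
    match chars.find('\n') {
        Some(i) => &chars[i + 1..],
        None => "",
    }
}
```

Under these assumptions, `#!/usr/bin/env run` on the first line is dropped, while `#![allow(unused)]` is kept so that it can later be tokenised as an inner attribute.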
Frontmatter removal
Stability: As of Rust 1.90, frontmatter removal is unstable. Under stable rustc 1.90, and under nightly rustc without the frontmatter feature flag, input which would undergo frontmatter removal is rejected.
If the FRONTMATTER nonterminal defined in the frontmatter grammar matches at the start of the remaining sequence, the characters consumed by that match are removed from the sequence.
Otherwise, if the RESERVED nonterminal defined in the frontmatter grammar matches at the start of the remaining sequence, the input is rejected.
The frontmatter grammar is the following Parsing Expression Grammar:
FRONTMATTER = {
WHITESPACE_ONLY_LINE * ~
START_LINE ~
CONTENT_LINE * ~
END_LINE
}
WHITESPACE_ONLY_LINE = {
( !"\n" ~ PATTERN_WHITE_SPACE ) * ~
"\n"
}
START_LINE = {
FENCE¹ ~
HORIZONTAL_WHITESPACE * ~
( INFOSTRING ~ HORIZONTAL_WHITESPACE * ) ? ~
"\n"
}
CONTENT_LINE = {
!FENCE² ~
( !"\n" ~ ANY ) * ~
"\n"
}
END_LINE = {
FENCE² ~
HORIZONTAL_WHITESPACE * ~
( "\n" | EOI )
}
FENCE = { "---" ~ "-" * }
INFOSTRING = {
( XID_START | "_" ) ~
( XID_CONTINUE | "-" | "." ) *
}
HORIZONTAL_WHITESPACE = { " " | "\t" }
RESERVED = {
PATTERN_WHITE_SPACE * ~
FENCE
}
These definitions require an extension to the Parsing Expression Grammar formalism:
each of the expressions marked as FENCE² fails unless the text it matches is the same as the text matched by the (only) successful match using the expression marked as FENCE¹.
See Grammar for raw string literals for a discussion of alternatives to this extension.
Note: If there are any WHITESPACE_ONLY_LINEs, rustc emits a single whitespace token to represent them. But I think that token isn't observable by Rust programs, so it isn't modelled here.
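To make the grammar and the FENCE¹/FENCE² extension more concrete, here is a hand-written, line-based sketch of the frontmatter step. It is not derived mechanically from the grammar. All names (`remove_frontmatter`, `match_frontmatter`, `Rejected` and the helpers) are hypothetical. As a simplifying assumption, INFOSTRING is checked using ASCII letters and digits only, not the full XID_START and XID_CONTINUE properties. The sketch models the unstable behaviour, ignoring the stability rule above.

```rust
#[derive(Debug, PartialEq)]
struct Rejected;

fn is_pattern_white_space(c: char) -> bool {
    matches!(
        c,
        '\t' | '\n' | '\u{000B}' | '\u{000C}' | '\r' | ' ' | '\u{0085}'
            | '\u{200E}' | '\u{200F}' | '\u{2028}' | '\u{2029}'
    )
}

fn is_hws(c: char) -> bool {
    c == ' ' || c == '\t'
}

/// FENCE: length of the leading run of '-', if it has at least three.
fn fence_len(s: &str) -> Option<usize> {
    let n = s.bytes().take_while(|&b| b == b'-').count();
    (n >= 3).then_some(n)
}

/// Rest of START_LINE after the fence: optional infostring, then only
/// horizontal whitespace. ASSUMPTION: ASCII-only approximation of XID_*.
fn is_start_line_tail(tail: &str) -> bool {
    let tail = tail.trim_start_matches(is_hws);
    let mut chars = tail.chars();
    match chars.next() {
        None => true,
        Some(c) if c == '_' || c.is_ascii_alphabetic() => chars
            .as_str()
            .trim_start_matches(|c: char| c.is_ascii_alphanumeric() || "_-.".contains(c))
            .trim_start_matches(is_hws)
            .is_empty(),
        Some(_) => false,
    }
}

/// Returns the text following a FRONTMATTER match, or None if no match.
fn match_frontmatter(input: &str) -> Option<&str> {
    let mut s = input;
    // WHITESPACE_ONLY_LINE *
    while let Some((line, rest)) = s.split_once('\n') {
        if !line.chars().all(is_pattern_white_space) {
            break;
        }
        s = rest;
    }
    // START_LINE: remember the opening fence length (FENCE¹).
    let (start, mut s) = s.split_once('\n')?;
    let n = fence_len(start)?;
    if !is_start_line_tail(&start[n..]) {
        return None;
    }
    loop {
        let (line, rest, terminated) = match s.split_once('\n') {
            Some((line, rest)) => (line, rest, true),
            None => (s, "", false),
        };
        // FENCE²: a dash run of exactly the opening length.
        if fence_len(line) == Some(n) {
            // END_LINE: only horizontal whitespace may follow the fence.
            return line[n..].trim_start_matches(is_hws).is_empty().then_some(rest);
        }
        // CONTENT_LINE must end with LF.
        if !terminated {
            return None;
        }
        s = rest;
    }
}

fn remove_frontmatter(s: &str) -> Result<&str, Rejected> {
    if let Some(rest) = match_frontmatter(s) {
        return Ok(rest);
    }
    // RESERVED = PATTERN_WHITE_SPACE * ~ FENCE
    if fence_len(s.trim_start_matches(is_pattern_white_space)).is_some() {
        return Err(Rejected);
    }
    Ok(s)
}
```

In this sketch the FENCE² constraint becomes a comparison of dash-run lengths, so a frontmatter opened with `----` may contain a `---` line as content. Input that starts with a fence but never closes it fails to match FRONTMATTER, then matches RESERVED, and is rejected.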