Processing that happens before tokenising

This document's description of tokenising takes a sequence of characters as input.

That sequence of characters is derived from an input sequence of bytes by performing the steps listed below in order.

It is also possible for one of the steps below to determine that the input should be rejected, in which case tokenising does not take place.

Normally the input sequence of bytes is the contents of a single source file.

Decoding

The input sequence of bytes is interpreted as a sequence of characters represented using the UTF-8 Unicode encoding scheme.

If the input sequence of bytes is not a well-formed UTF-8 code unit sequence, the input is rejected.
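
As a minimal sketch of this step, the standard library's `std::str::from_utf8` can serve as the well-formedness check. The function name `decode` is hypothetical, and `None` is used here to stand for "the input is rejected".

```rust
/// Hypothetical helper: decodes the input bytes as UTF-8.
/// Returns `None` when the bytes are not well-formed UTF-8,
/// which in this sketch stands for rejecting the input.
fn decode(input: &[u8]) -> Option<&str> {
    // `from_utf8` validates the whole slice and fails on any ill-formed sequence.
    std::str::from_utf8(input).ok()
}
```

For example, `decode(b"fn main() {}")` yields the same text as a `&str`, while `decode(&[0xff, 0xfe])` yields `None`.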

Byte order mark removal

If the first character in the sequence is U+FEFF (BYTE ORDER MARK), it is removed.
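
A sketch of this step, assuming the input has already been decoded to a `&str`. The name `remove_bom` is hypothetical.

```rust
/// Hypothetical helper: drops a single leading U+FEFF, if present.
fn remove_bom(input: &str) -> &str {
    // `strip_prefix` removes at most one occurrence, and only at the very start.
    input.strip_prefix('\u{FEFF}').unwrap_or(input)
}
```

Only the first character is considered: a U+FEFF anywhere else in the sequence, including immediately after a removed one, is left in place.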

CRLF normalisation

Each pair of characters CR immediately followed by LF is replaced by a single LF character.

Note: It's not possible for two such pairs to overlap, so this operation is unambiguously defined.

Note: Other occurrences of the character CR are left in place. It's still possible for the sequence CRLF to be passed on to the tokeniser: that will happen if the input contained the sequence CRCRLF.
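
The behaviour described in the notes above can be illustrated with a one-line sketch. The name `normalise_crlf` is hypothetical; `str::replace` substitutes non-overlapping matches from left to right, which is all this step needs.

```rust
/// Hypothetical helper: replaces each CR LF pair with a single LF.
fn normalise_crlf(input: &str) -> String {
    // A lone CR (one not immediately followed by LF) is not matched and stays in place.
    input.replace("\r\n", "\n")
}
```

With this sketch, `normalise_crlf("a\r\r\nb")` gives `"a\r\nb"`: the second CR pairs with the LF and the first CR survives, so the result still contains CRLF.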

Shebang removal

Shebang removal is performed if:

  • the remaining sequence begins with the characters #!; and
  • the result of finding the first non-whitespace token with the characters following the #! as input is not a Punctuation token whose mark is the [ character.

If shebang removal is performed:

  • the characters up to and including the first LF character are removed from the sequence
  • if the sequence did not contain an LF character, all characters are removed from the sequence.

Note: The check for [ prevents an inner attribute at the start of the input being removed. See #70528 and #71487 for history.
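
The following is a simplified sketch of this step, not the rule as specified. The rule above asks for the first non-whitespace token, which needs a tokeniser; this sketch only skips characters with the Unicode Pattern_White_Space property before looking for `[`, so it may differ on inputs where something else (for example a comment) precedes the `[`. The names `is_pattern_white_space` and `remove_shebang` are hypothetical.

```rust
/// Characters with the Unicode Pattern_White_Space property.
fn is_pattern_white_space(c: char) -> bool {
    matches!(
        c,
        '\t' | '\n' | '\u{000B}' | '\u{000C}' | '\r' | ' '
            | '\u{0085}' | '\u{200E}' | '\u{200F}' | '\u{2028}' | '\u{2029}'
    )
}

/// Hypothetical helper: returns the sequence with any shebang line removed.
fn remove_shebang(input: &str) -> &str {
    // Condition 1: the sequence begins with `#!`.
    let Some(rest) = input.strip_prefix("#!") else {
        return input;
    };
    // Condition 2 (ASSUMPTION: approximated by skipping whitespace characters
    // only, instead of running a tokeniser): what follows must not be `[`.
    if rest.trim_start_matches(is_pattern_white_space).starts_with('[') {
        return input;
    }
    match input.find('\n') {
        // Remove everything up to and including the first LF.
        Some(i) => &input[i + 1..],
        // No LF at all: every character is removed.
        None => "",
    }
}
```

Under these assumptions, `#!/usr/bin/env rust` followed by a newline is stripped, while an input starting with `#![allow(unused)]` is passed on unchanged.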

Frontmatter removal

Stability: As of Rust 1.90 frontmatter removal is unstable. Under stable rustc 1.90, and under nightly rustc without the frontmatter feature flag, input which would undergo frontmatter removal is rejected.

If the FRONTMATTER nonterminal defined in the frontmatter grammar matches at the start of the remaining sequence, the characters consumed by that match are removed from the sequence.

Otherwise, if the RESERVED nonterminal defined in the frontmatter grammar matches at the start of the remaining sequence, the input is rejected.

The frontmatter grammar is the following Parsing Expression Grammar:

FRONTMATTER = {
    WHITESPACE_ONLY_LINE * ~
    START_LINE ~
    CONTENT_LINE * ~
    END_LINE
}

WHITESPACE_ONLY_LINE = {
    ( !"\n" ~ PATTERN_WHITE_SPACE ) * ~
    "\n"
}

START_LINE = {
    FENCE¹ ~
    HORIZONTAL_WHITESPACE * ~
    ( INFOSTRING ~ HORIZONTAL_WHITESPACE * ) ? ~
    "\n"
}

CONTENT_LINE = {
    !FENCE² ~
    ( !"\n" ~ ANY ) * ~
    "\n"
}

END_LINE = {
    FENCE² ~
    HORIZONTAL_WHITESPACE * ~
    ( "\n" | EOI )
}

FENCE = { "---" ~ "-" * }

INFOSTRING = {
    ( XID_START | "_" ) ~
    ( XID_CONTINUE | "-" | "." ) *
}

HORIZONTAL_WHITESPACE = { " " | "\t" }


RESERVED = {
    PATTERN_WHITE_SPACE * ~
    FENCE
}

These definitions require an extension to the Parsing Expression Grammar formalism: each of the expressions marked as FENCE² fails unless the text it matches is the same as the text matched by the (only) successful match using the expression marked as FENCE¹.

See Grammar for raw string literals for a discussion of alternatives to this extension.

Note: If there are any WHITESPACE_ONLY_LINEs, rustc emits a single whitespace token to represent them. But I think that token isn't observable by Rust programs, so it isn't modelled here.
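
To make the grammar and the FENCE¹/FENCE² extension more concrete, here is a hand-written sketch of a matcher. It is an illustration under stated assumptions, not rustc's implementation: all names are hypothetical, the stability rule is ignored, and XID_START and XID_CONTINUE are approximated with `char::is_alphabetic` and `char::is_alphanumeric` because the standard library does not expose those Unicode properties.

```rust
#[derive(Debug, PartialEq)]
enum Frontmatter<'a> {
    Removed(&'a str),   // FRONTMATTER matched; holds what is left
    Rejected,           // RESERVED matched
    Unchanged(&'a str), // neither matched
}

fn is_pattern_white_space(c: char) -> bool {
    matches!(c, '\t' | '\n' | '\u{000B}' | '\u{000C}' | '\r' | ' '
        | '\u{0085}' | '\u{200E}' | '\u{200F}' | '\u{2028}' | '\u{2029}')
}

/// FENCE: three or more `-`, matched greedily. Returns (fence text, rest).
fn fence(s: &str) -> Option<(&str, &str)> {
    let n = s.len() - s.trim_start_matches('-').len();
    (n >= 3).then(|| s.split_at(n))
}

/// HORIZONTAL_WHITESPACE *
fn skip_hws(s: &str) -> &str {
    s.trim_start_matches(|c: char| c == ' ' || c == '\t')
}

/// INFOSTRING ? (ASSUMPTION: alphabetic/alphanumeric stand in for XID_START/XID_CONTINUE.)
fn skip_infostring(s: &str) -> &str {
    let mut chars = s.chars();
    match chars.next() {
        Some(c) if c.is_alphabetic() || c == '_' => {}
        _ => return s,
    }
    chars
        .as_str()
        .trim_start_matches(|c: char| c.is_alphanumeric() || matches!(c, '_' | '-' | '.'))
}

/// FRONTMATTER: returns the sequence after the match, or `None` if it fails.
fn match_frontmatter(input: &str) -> Option<&str> {
    let mut rest = input;
    // WHITESPACE_ONLY_LINE *
    loop {
        let after_ws = rest.trim_start_matches(|c: char| c != '\n' && is_pattern_white_space(c));
        match after_ws.strip_prefix('\n') {
            Some(next) => rest = next,
            None => break,
        }
    }
    // START_LINE; `open` is the text matched by FENCE¹.
    let (open, after_fence) = fence(rest)?;
    rest = skip_hws(skip_infostring(skip_hws(after_fence))).strip_prefix('\n')?;
    // CONTENT_LINE * followed by END_LINE
    loop {
        if let Some((f, after)) = fence(rest) {
            if f == open {
                // FENCE² matched, so this cannot be a CONTENT_LINE; it must be the END_LINE.
                let tail = skip_hws(after);
                return if tail.is_empty() { Some(tail) } else { tail.strip_prefix('\n') };
            }
        }
        // CONTENT_LINE: consume up to and including the next LF, or fail if there is none.
        let (_, next) = rest.split_once('\n')?;
        rest = next;
    }
}

fn remove_frontmatter(input: &str) -> Frontmatter<'_> {
    if let Some(rest) = match_frontmatter(input) {
        Frontmatter::Removed(rest)
    } else if fence(input.trim_start_matches(is_pattern_white_space)).is_some() {
        // RESERVED = PATTERN_WHITE_SPACE * ~ FENCE
        Frontmatter::Rejected
    } else {
        Frontmatter::Unchanged(input)
    }
}
```

Because FENCE is greedy and FENCE² must reproduce the opening fence exactly, a line of three dashes inside a frontmatter block opened with four dashes is treated as content. An opening fence that is never closed makes FRONTMATTER fail and RESERVED match, so the input is rejected.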