Processing that happens before tokenising

Decoding
Byte order mark removal
CRLF normalisation
Shebang removal
Frontmatter removal

This document's description of tokenising takes a sequence of characters as input.

That sequence of characters is derived from an input sequence of bytes by performing the steps listed below in order.

It is also possible for one of the steps below to determine that the input should be rejected, in which case tokenising does not take place.

Normally the input sequence of bytes is the contents of a single source file.

Decoding

The input sequence of bytes is interpreted as a sequence of characters represented using the UTF-8 Unicode encoding scheme.

If the input sequence of bytes is not a well-formed UTF-8 code unit sequence, the input is rejected.
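As an illustration, the decoding step can be sketched with the standard library's UTF-8 validation. The function name `decode` is hypothetical and used only for this sketch; it is not taken from rustc.

```rust
/// Interprets the input bytes as UTF-8.
/// Returns an error (meaning "the input is rejected") if the bytes
/// are not well-formed UTF-8.
/// `decode` is a hypothetical name used only in this sketch.
fn decode(bytes: &[u8]) -> Result<&str, std::str::Utf8Error> {
    // `std::str::from_utf8` succeeds only if the whole slice is well-formed,
    // so overlong forms, encoded surrogates and stray bytes are all errors.
    std::str::from_utf8(bytes)
}
```

An `Err` result here corresponds to rejecting the input, so none of the later steps run.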

Byte order mark removal

If the first character in the sequence is U+FEFF (BYTE ORDER MARK), it is removed.
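A minimal sketch of this step, assuming the decoded characters are held in a `&str`. The name `remove_bom` is hypothetical.

```rust
/// Removes a single leading U+FEFF, if present.
/// `remove_bom` is a hypothetical name used only in this sketch.
fn remove_bom(chars: &str) -> &str {
    // Only the first character is examined; a U+FEFF anywhere else
    // (including immediately after a removed one) is left in place.
    chars.strip_prefix('\u{FEFF}').unwrap_or(chars)
}
```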

CRLF normalisation

Each pair of characters CR immediately followed by LF is replaced by a single LF character.

Note: It's not possible for two such pairs to overlap, so this operation is unambiguously defined.

Note: Other occurrences of the character CR are left in place. It's still possible for the sequence CRLF to be passed on to the tokeniser: that will happen if the input contained the sequence CRCRLF.
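The behaviour described in the two notes above can be checked with a short sketch. The name `normalise_crlf` is hypothetical; the sketch relies on `str::replace`, which replaces non-overlapping matches from left to right.

```rust
/// Replaces each CR immediately followed by LF with a single LF.
/// `normalise_crlf` is a hypothetical name used only in this sketch.
fn normalise_crlf(chars: &str) -> String {
    // A lone CR does not match the two-character pattern and is kept.
    chars.replace("\r\n", "\n")
}
```

With this sketch, the input CRCRLF becomes CRLF: the second CR pairs with the LF and is dropped, while the first CR is left in place.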

Shebang removal

Shebang removal is performed if:

the remaining sequence begins with the characters #!; and
finding the first non-whitespace token, with the characters following the #! as input, does not produce a Punctuation token whose mark is the [ character.

If shebang removal is performed:

the characters up to and including the first LF character are removed from the sequence; or
if the sequence does not contain a LF character, all characters are removed from the sequence.

Note: The check for [ prevents an inner attribute at the start of the input being removed. See #70528 and #71487 for history.
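The following is a rough sketch of the rule above, not a faithful implementation. It approximates "finding the first non-whitespace token" by skipping characters with the Unicode Pattern_White_Space property and looking at the next character, so anything else that a real tokeniser might skip between #! and [ (such as a comment) is not modelled. The names `is_pattern_white_space` and `remove_shebang` are hypothetical.

```rust
/// Characters with the Unicode Pattern_White_Space property.
fn is_pattern_white_space(c: char) -> bool {
    matches!(
        c,
        '\u{0009}'..='\u{000D}'
            | '\u{0020}'
            | '\u{0085}'
            | '\u{200E}'
            | '\u{200F}'
            | '\u{2028}'
            | '\u{2029}'
    )
}

/// Removes a leading shebang line, if the input has one.
/// Assumption: only whitespace is skipped before checking for `[`;
/// this sketch does not tokenise the input.
fn remove_shebang(chars: &str) -> &str {
    // The sequence must begin with `#!` exactly, with nothing before it.
    let Some(rest) = chars.strip_prefix("#!") else {
        return chars;
    };
    // `#!` followed by `[` looks like an inner attribute: keep it.
    if rest.trim_start_matches(is_pattern_white_space).starts_with('[') {
        return chars;
    }
    match chars.find('\n') {
        // Drop everything up to and including the first LF.
        Some(lf) => &chars[lf + 1..],
        // No LF at all: the whole input was the shebang line.
        None => "",
    }
}
```

Under these assumptions, `#!/usr/bin/env rustc` followed by a newline is removed, while `#![allow(unused)]` is passed on unchanged.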

Frontmatter removal

Stability: As of Rust 1.91 frontmatter removal is unstable. Under stable rustc 1.91, and under nightly rustc without the frontmatter feature flag, input which would undergo frontmatter removal is rejected.

If an attempt to match the FRONTMATTER nonterminal against the remaining sequence succeeds, the characters consumed by that match are removed from the sequence.

Otherwise, if an attempt to match the RESERVED nonterminal against the remaining sequence succeeds, the input is rejected.

These nonterminals are defined in the following Parsing Expression Grammar (the frontmatter grammar):

FRONTMATTER = {
    WHITESPACE_ONLY_LINE * ~
    START_LINE ~
    CONTENT_LINE * ~
    END_LINE
}

WHITESPACE_ONLY_LINE = {
    ( !LF ~ PATTERN_WHITE_SPACE ) * ~
    LF
}

START_LINE = {
    FENCE¹ ~
    HORIZONTAL_WHITESPACE * ~
    ( INFOSTRING ~ HORIZONTAL_WHITESPACE * ) ? ~
    LF
}

CONTENT_LINE = {
    !FENCE² ~
    ( !LF ~ ANY ) * ~
    LF
}

END_LINE = {
    FENCE² ~
    HORIZONTAL_WHITESPACE * ~
    ( LF | EOI )
}

FENCE = { "-" {3, 255} }

INFOSTRING = {
    ( XID_START | "_" ) ~
    ( XID_CONTINUE | "-" | "." ) *
}

HORIZONTAL_WHITESPACE = { " " | TAB }


RESERVED = {
    PATTERN_WHITE_SPACE * ~
    FENCE
}

See Special terminals for the definition of PATTERN_WHITE_SPACE.
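To make the RESERVED rule concrete, here is a small sketch of a matcher for it. It assumes PATTERN_WHITE_SPACE is the set of characters with the Unicode Pattern_White_Space property; the names `is_pattern_white_space` and `matches_reserved` are hypothetical.

```rust
/// Characters with the Unicode Pattern_White_Space property
/// (assumed here to be what PATTERN_WHITE_SPACE matches).
fn is_pattern_white_space(c: char) -> bool {
    matches!(
        c,
        '\u{0009}'..='\u{000D}'
            | '\u{0020}'
            | '\u{0085}'
            | '\u{200E}'
            | '\u{200F}'
            | '\u{2028}'
            | '\u{2029}'
    )
}

/// Reports whether RESERVED matches at the start of `chars`:
/// any amount of whitespace followed by a FENCE.
fn matches_reserved(chars: &str) -> bool {
    // FENCE needs at least three dashes; how many more follow
    // makes no difference to whether RESERVED matches.
    chars
        .trim_start_matches(is_pattern_white_space)
        .starts_with("---")
}
```

This check only matters when the attempt to match FRONTMATTER has already failed: in that case, input for which `matches_reserved` returns true is rejected.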

These definitions require an extension to the Parsing Expression Grammar formalism: each of the parsing expressions marked as FENCE² fails unless the characters it consumes are the same as the characters consumed by the (only) match of the expression marked as FENCE¹.

See Grammar for raw string literals for a discussion of alternatives to this extension.
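Because a fence consists only of "-" characters, the FENCE² condition amounts to comparing dash counts. The sketch below illustrates that reading; `fence_len` and `matches_closing_fence` are hypothetical names, and lines are assumed to be passed in as `&str` slices starting at the first character of the line.

```rust
/// Number of characters FENCE consumes at the start of `line`, if it matches.
/// FENCE is greedy: it takes as many dashes as it can, from 3 up to 255.
fn fence_len(line: &str) -> Option<usize> {
    let dashes = line.chars().take_while(|&c| c == '-').count();
    if dashes >= 3 {
        Some(dashes.min(255))
    } else {
        None
    }
}

/// Models FENCE²: succeeds only if FENCE matches at the start of `line`
/// and consumes exactly as many dashes as the opening fence did.
/// `opening_len` is the number of dashes consumed by FENCE¹.
fn matches_closing_fence(line: &str, opening_len: usize) -> bool {
    fence_len(line) == Some(opening_len)
}
```

For example, after an opening fence of three dashes, a line beginning with four dashes does not match FENCE², so it is treated as a CONTENT_LINE rather than the start of an END_LINE.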

Note: If there are any WHITESPACE_ONLY_LINEs, rustc emits a single whitespace token to represent them. But I think that token isn't observable by Rust programs, so it isn't modelled here.

Writeup of Rust's lexer
