Introduction

This document contains a description of rustc's lexer, which aims to be both correct and verifiable.

It's accompanied by a reimplementation of the lexer in Rust based on that description and a framework for comparing its output to rustc's.

One component of the description is a Parsing Expression Grammar; the reimplementation uses the Pest library to generate the corresponding parser.

Scope

Rust language version

This document describes Rust version 1.90, together with the unstable frontmatter feature (as it behaves in the nightly version of rustc identified below).

That means it describes raw lifetimes/labels and the additional reservations in the 2024 edition, but not:

  • rfc3349 (Mixed UTF-8 literals)
  • pr131656 (allowing more numeric suffixes beginning with e)

Other statements in this document are intended to be true as of September 2025.

The reimplementation is intended to be compiled against (and compared against)
rustc 1.92.0-nightly (caccb4d03 2025-09-24)

Editions

This document describes the editions supported by Rust 1.90:

  • 2015
  • 2018
  • 2021
  • 2024

There are no differences in lexing behaviour between the 2015 and 2018 editions.

In the reimplementation, "2015" is used to refer to the common behaviour of Rust 2015 and Rust 2018.

Accepted input

This description aims to accept input exactly when rustc's lexer would.

In particular, the description of tokenisation aims to model what's accepted as input to a function-like macro (a procedural macro or a by-example macro using the tt fragment specifier).

It's not attempting to accurately model rustc's "reasons" for rejecting input, or to provide enough information to reproduce error messages similar to rustc's.

It's not attempting to describe rustc's "recovery" behaviour (where input which will be reported as an error provides tokens to later stages of the compiler anyway).

Size limits

This description doesn't attempt to characterise rustc's limits on the size of the input as a whole.

As far as I know, rustc has no limits on the size of individual tokens beyond its limits on the input as a whole. But I haven't tried to test this.

Output form

This document only goes as far as describing how to produce a "least common denominator" stream of tokens.

Further writing will be needed to describe how to convert that stream to forms that fit the (differing) needs of the grammar and the macro systems.

In particular, this representation may be unsuitable for direct use by a description of the grammar because:

  • there's no distinction between identifiers and keywords;
  • there's a single "kind" of token for all punctuation;
  • sequences of punctuation such as :: aren't glued together to make a single token.

(The reimplementation includes code to make compound punctuation tokens so they can be compared with rustc's, and to organise them into delimited trees, but those processes aren't described here.)

Licence

This document and the accompanying lexer implementation are released under the terms of both the MIT license and the Apache License (Version 2.0).

Authorship and source access

© Matthew Woodcraft 2024,2025

The source code for this document and the accompanying lexer implementation is available at https://github.com/mattheww/lexeywan

Overview

The following processes might be considered to be part of Rust's lexer:

  • Decode: interpret UTF-8 input as a sequence of Unicode characters
  • Clean:
    • Byte order mark removal
    • CRLF normalisation
    • Shebang removal
    • Frontmatter removal
  • Tokenise: interpret the characters as ("fine-grained") tokens
  • Lower doc-comments: convert doc-comments into attributes
  • Build trees: organise tokens into delimited groups
  • Combine: convert fine-grained tokens to compound tokens (for declarative macros)
  • Prepare proc-macro input: convert fine-grained tokens to the form used for proc-macros
  • Remove whitespace: remove whitespace tokens

This document attempts to completely describe the "Decode", "Clean", "Tokenise", and "Lower doc-comments" processes.

Definitions

Byte

For the purposes of this document, byte means the same thing as Rust's u8 (corresponding to a natural number in the range 0 to 255 inclusive).

Character

For the purposes of this document, character means the same thing as Rust's char. That means, in particular:

  • there's exactly one character for each Unicode scalar value
  • the things that Unicode calls "noncharacters" are characters
  • there are no characters corresponding to surrogate code points

Sequence

When this document refers to a sequence of items, it means a finite, but possibly empty, ordered list of those items.

"character sequence" and "sequence of characters" are different ways of saying the same thing.

NFC normalisation

References to NFC-normalised strings are talking about Unicode's Normalization Form C, defined in Unicode Standard Annex #15.
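For illustration, here is a minimal Rust sketch of NFC normalisation, assuming the unicode-normalization crate (the dependency the reimplementation uses for this, as noted under identcheck below):

use unicode_normalization::UnicodeNormalization;

// For example, "e\u{301}" (e followed by U+0301 COMBINING ACUTE ACCENT)
// normalises to the single character é (U+00E9).
fn nfc_normalise(s: &str) -> String {
    s.nfc().collect()
}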

Parsing Expression Grammars

This document relies on two parsing expression grammars (one for tokenising and one for recognising frontmatter).

Parsing Expression Grammars are described informally in §2 of Ford 2004.

Grammar notation

The notation used in this document is the variant used by the Pest Rust library, so that it's easy to keep in sync with the reimplementation.

In particular:

  • the sequencing operator is written explicitly, as ~
  • the ordered choice operator is |
  • ?, *, and + have their usual senses (as expression suffixes)
  • {0, 255} is a repetition suffix, meaning "from 0 to 255 repetitions"
  • the not-predicate (for negative lookahead) is ! (as an expression prefix)
  • a terminal matching an individual character is written like "x"
  • a terminal matching a sequence of characters is written like "abc"
  • a terminal matching a range of characters is written like '0'..'9'
  • "\"" matches a single " character
  • "\\" matches a single \ character
  • "\n" matches a single LF character

The ordered choice operator | has the lowest precedence, so

a ~ b | c ~ d

is equivalent to

( a ~ b ) | ( c ~ d )

The sequencing operator ~ has the next-lowest precedence, so

!"." ~ SOMETHING

is equivalent to

(!".") ~ SOMETHING

"Any character except" is written using the not-predicate and ANY, for example

( !"'" ~ ANY )

matches any single character except '.

See Grammar for raw string literals for a discussion of extensions used to model raw string literals and frontmatter fences.

Special terminals

The following named terminals are available in all grammars in this document.

Grammar
EOI
ANY
PATTERN_WHITE_SPACE
XID_START
XID_CONTINUE

EOI matches only when the sequence remaining to be matched is empty, without consuming any characters

ANY matches any Unicode character.

PATTERN_WHITE_SPACE matches any character which has the Pattern_White_Space Unicode property. These characters are:

U+0009 (horizontal tab, '\t')
U+000A (line feed, '\n')
U+000B (vertical tab)
U+000C (form feed)
U+000D (carriage return, '\r')
U+0020 (space, ' ')
U+0085 (next line)
U+200E (left-to-right mark)
U+200F (right-to-left mark)
U+2028 (line separator)
U+2029 (paragraph separator)

Note: This set doesn't change in updated Unicode versions.

XID_START matches any character which has the XID_Start Unicode property (as of Unicode 16.0.0).

XID_CONTINUE matches any character which has the XID_Continue Unicode property (as of Unicode 16.0.0).

Processing that happens before tokenising

This document's description of tokenising takes a sequence of characters as input.

That sequence of characters is derived from an input sequence of bytes by performing the steps listed below in order.

It is also possible for one of the steps below to determine that the input should be rejected, in which case tokenising does not take place.

Normally the input sequence of bytes is the contents of a single source file.

Decoding

The input sequence of bytes is interpreted as a sequence of characters represented using the UTF-8 Unicode encoding scheme.

If the input sequence of bytes is not a well-formed UTF-8 code unit sequence, the input is rejected.

Byte order mark removal

If the first character in the sequence is U+FEFF (BYTE ORDER MARK), it is removed.

CRLF normalisation

Each pair of characters CR immediately followed by LF is replaced by a single LF character.

Note: It's not possible for two such pairs to overlap, so this operation is unambiguously defined.

Note: Other occurrences of the character CR are left in place. It's still possible for the sequence CRLF to be passed on to the tokeniser: that will happen if the input contained the sequence CRCRLF.
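The following Rust sketch (an illustration, not the reimplementation's actual code) combines the decoding, byte order mark removal, and CRLF normalisation steps described above:

fn decode_and_clean(input: &[u8]) -> Result<String, std::str::Utf8Error> {
    // Decoding: reject input which isn't a well-formed UTF-8 code unit
    // sequence.
    let text = std::str::from_utf8(input)?;
    // Byte order mark removal: drop a leading U+FEFF, if present.
    let text = text.strip_prefix('\u{FEFF}').unwrap_or(text);
    // CRLF normalisation: each CR immediately followed by LF becomes LF.
    // A lone CR is left in place, so CRCRLF becomes CRLF, as noted above.
    Ok(text.replace("\r\n", "\n"))
}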

Shebang removal

Shebang removal is performed if:

  • the remaining sequence begins with the characters #!; and
  • the result of finding the first non-whitespace token with the characters following the #! as input is not a Punctuation token whose mark is the [ character.

If shebang removal is performed:

  • the characters up to and including the first LF character are removed from the sequence
  • if the sequence did not contain a LF character, all characters are removed from the sequence.

Note: The check for [ prevents an inner attribute at the start of the input being removed. See #70528 and #71487 for history.
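A minimal Rust sketch of the removal step itself; the should_remove parameter stands in for the #!-and-not-[ test above (which needs the tokeniser):

fn remove_shebang(text: &str, should_remove: bool) -> &str {
    if !should_remove {
        return text;
    }
    match text.find('\n') {
        // Remove the characters up to and including the first LF.
        Some(i) => &text[i + 1..],
        // The sequence contained no LF: remove all characters.
        None => "",
    }
}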

Frontmatter removal

Stability: As of Rust 1.90 frontmatter removal is unstable. Under stable rustc 1.90, and under nightly rustc without the frontmatter feature flag, input which would undergo frontmatter removal is rejected.

If the FRONTMATTER nonterminal defined in the frontmatter grammar matches at the start of the remaining sequence, the characters consumed by that match are removed from the sequence.

Otherwise, if the RESERVED nonterminal defined in the frontmatter grammar matches at the start of the remaining sequence, the input is rejected.

The frontmatter grammar is the following Parsing Expression Grammar:

FRONTMATTER = {
    WHITESPACE_ONLY_LINE * ~
    START_LINE ~
    CONTENT_LINE * ~
    END_LINE
}

WHITESPACE_ONLY_LINE = {
    ( !"\n" ~ PATTERN_WHITE_SPACE ) * ~
    "\n"
}

START_LINE = {
    FENCE¹ ~
    HORIZONTAL_WHITESPACE * ~
    ( INFOSTRING ~ HORIZONTAL_WHITESPACE * ) ? ~
    "\n"
}

CONTENT_LINE = {
    !FENCE² ~
    ( !"\n" ~ ANY ) * ~
    "\n"
}

END_LINE = {
    FENCE² ~
    HORIZONTAL_WHITESPACE * ~
    ( "\n" | EOI )
}

FENCE = { "---" ~ "-" * }

INFOSTRING = {
    ( XID_START | "_" ) ~
    ( XID_CONTINUE | "-" | "." ) *
}

HORIZONTAL_WHITESPACE = { " " | "\t" }


RESERVED = {
    PATTERN_WHITE_SPACE * ~
    FENCE
}

These definitions require an extension to the Parsing Expression Grammar formalism: each of the expressions marked as FENCE² fails unless the text it matches is the same as the text matched by the (only) successful match using the expression marked as FENCE¹.

See Grammar for raw string literals for a discussion of alternatives to this extension.

Note: If there are any WHITESPACE_ONLY_LINEs, rustc emits a single whitespace token to represent them. But I think that token isn't observable by Rust programs, so it isn't modelled here.

Tokenising

The tokenisation grammar

The tokenisation grammar is a Parsing Expression Grammar which describes how to divide the input into fine-grained tokens.

The tokenisation grammar isn't strictly a Parsing Expression Grammar. See Grammar for raw string literals

The tokenisation grammar defines a tokens nonterminal and a token nonterminal for each Rust edition:

Edition        Tokens nonterminal   Token nonterminal
2015 or 2018   TOKENS_2015          TOKEN_2015
2021           TOKENS_2021          TOKEN_2021
2024           TOKENS_2024          TOKEN_2024

Their definitions are presented in Token nonterminals below.

Each tokens nonterminal allows any number of repetitions of the corresponding token nonterminal.

Each token nonterminal is defined as a choice expression, each of whose subexpressions is a single nonterminal (a token-kind nonterminal).

The token-kind nonterminals are distinguished in the grammar as having names in Title_case.

The rest of the grammar is presented in the following pages in this section. The definitions of some nonterminals are repeated on multiple pages for convenience. The full grammar is also available on a single page.

The token-kind nonterminals are presented in an order consistent with their appearance in the token nonterminals. That means they appear in priority order (highest priority first).

Tokenisation

Tokenisation takes a character sequence (the input), and either produces a sequence of fine-grained tokens or reports that lexical analysis failed.

The analysis depends on the Rust edition which is in effect when the input is processed.

So strictly speaking, the edition is a second parameter to the process described here.

First, the edition's tokens nonterminal is matched against the input. If it does not succeed and consume the complete input, lexical analysis fails.

Strictly speaking we have to justify the assumption that matches will always either fail or succeed, which basically means observing that the grammar has no left recursion.

Otherwise, the sequence of fine-grained tokens is produced by processing each match of a token-kind nonterminal which participated in the tokens nonterminal's match, as described below.

If any match is rejected, lexical analysis fails.

Processing a token-kind nonterminal match

This operation considers a match of a token-kind nonterminal against part of the input, and either produces a fine-grained token or rejects the match.

The following pages describe how to process a match of each token-kind nonterminal, underneath the presentation of that nonterminal's section of the grammar.

Each description specifies which matches are rejected. For matches which are not rejected, a token is produced whose kind is the name of the token-kind nonterminal. The description specifies the token's attributes.

If for any match the description doesn't either say that the match is rejected or specify a well-defined value for each attribute needed for the token's kind, it's a bug in this writeup.

In these descriptions, notation of the form NTNAME denotes the sequence of characters consumed by the nonterminal named NTNAME which participated in the token-kind nonterminal match.

If this notation is used for a nonterminal which might participate more than once in the match, it's a bug in this writeup.

Finding the first non-whitespace token

This section defines a variant of the tokenisation process which is used in the definition of Shebang removal.

The operation of finding the first non-whitespace token in a character sequence (the input) is:

Match the edition's tokens nonterminal against the input, giving a sequence of matches of token-kind nonterminals.

Consider the sequence of tokens obtained by processing each of those matches, stopping as soon as any match is rejected.

The operation's result is the first token in that sequence which does not represent whitespace, or no token if there is no such token.

For this purpose a token represents whitespace if it is any of:

  • a Whitespace token
  • a Line_comment token whose style is non-doc
  • a Block_comment token whose style is non-doc

Fine-grained tokens

Tokenising produces fine-grained tokens.

Each fine-grained token has a kind, which is the name of one of the token-kind nonterminals. Most kinds of fine-grained token also have attributes, as described in the tables below.

Kind                      Attributes
Whitespace                (none)
Line_comment              style, body
Block_comment             style, body
Punctuation               mark
Ident                     represented ident
Raw_ident                 represented ident
Lifetime_or_label         name
Raw_lifetime_or_label     name
Character_literal         represented character, suffix
Byte_literal              represented byte, suffix
String_literal            represented string, suffix
Raw_string_literal        represented string, suffix
Byte_string_literal       represented bytes, suffix
Raw_byte_string_literal   represented bytes, suffix
C_string_literal          represented bytes, suffix
Raw_c_string_literal      represented bytes, suffix
Integer_literal           base, digits, suffix
Float_literal             body, suffix

Note: Some token-kind nonterminals do not appear in this table. These are the reserved forms, whose matches are always rejected. The names of reserved forms begin with Reserved_ or Unterminated_.

These attributes have the following types:

Attribute               Type
base                    binary / octal / decimal / hexadecimal
body                    sequence of characters
digits                  sequence of characters
mark                    single character
name                    sequence of characters
represented byte        single byte
represented bytes       sequence of bytes
represented character   single character
represented ident       sequence of characters
represented string      sequence of characters
style                   non-doc / inner doc / outer doc
suffix                  sequence of characters

Note: At this stage

  • Both _ and keywords are treated as instances of Ident.
  • There are explicit tokens representing whitespace and comments.
  • Single-character tokens are used for all punctuation.
  • A lifetime (or label) is represented as a single token (which includes the leading ').
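As an illustration, the fine-grained tokens and their attributes could be represented by Rust types like the following (a sketch mirroring the tables above, not the reimplementation's actual definitions):

enum Base { Binary, Octal, Decimal, Hexadecimal }
enum CommentStyle { NonDoc, InnerDoc, OuterDoc }

enum FineGrainedToken {
    Whitespace,
    LineComment { style: CommentStyle, body: String },
    BlockComment { style: CommentStyle, body: String },
    Punctuation { mark: char },
    Ident { represented_ident: String },
    RawIdent { represented_ident: String },
    LifetimeOrLabel { name: String },
    RawLifetimeOrLabel { name: String },
    CharacterLiteral { represented_character: char, suffix: String },
    ByteLiteral { represented_byte: u8, suffix: String },
    StringLiteral { represented_string: String, suffix: String },
    RawStringLiteral { represented_string: String, suffix: String },
    ByteStringLiteral { represented_bytes: Vec<u8>, suffix: String },
    RawByteStringLiteral { represented_bytes: Vec<u8>, suffix: String },
    CStringLiteral { represented_bytes: Vec<u8>, suffix: String },
    RawCStringLiteral { represented_bytes: Vec<u8>, suffix: String },
    IntegerLiteral { base: Base, digits: String, suffix: String },
    FloatLiteral { body: String, suffix: String },
}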

Token nonterminals

As explained above, each of the token nonterminals defined below is an ordered choice of token-kind nonterminals.

Grammar
TOKENS_2015 = { TOKEN_2015 * }
TOKENS_2021 = { TOKEN_2021 * }
TOKENS_2024 = { TOKEN_2024 * }

TOKEN_2015 = {
    Whitespace |
    Line_comment |
    Block_comment |
    Unterminated_block_comment |
    Character_literal |
    Byte_literal |
    String_literal |
    Byte_string_literal |
    Raw_string_literal |
    Raw_byte_string_literal |
    Unterminated_literal_2015 |
    Reserved_single_quoted_literal_2015 |
    Float_literal |
    Reserved_float |
    Integer_literal |
    Lifetime_or_label |
    Raw_ident |
    Reserved_prefix_2015 |
    Ident |
    Punctuation
}

TOKEN_2021 = {
    Whitespace |
    Line_comment |
    Block_comment |
    Unterminated_block_comment |
    Character_literal |
    Byte_literal |
    String_literal |
    Byte_string_literal |
    C_string_literal |
    Raw_string_literal |
    Raw_byte_string_literal |
    Raw_c_string_literal |
    Reserved_literal_2021 |
    Reserved_single_quoted_literal_2021 |
    Float_literal |
    Reserved_float |
    Integer_literal |
    Raw_lifetime_or_label |
    Reserved_lifetime_or_label_prefix |
    Lifetime_or_label |
    Raw_ident |
    Reserved_prefix_2021 |
    Ident |
    Punctuation
}

TOKEN_2024 = {
    Whitespace |
    Line_comment |
    Block_comment |
    Unterminated_block_comment |
    Character_literal |
    Byte_literal |
    String_literal |
    Byte_string_literal |
    C_string_literal |
    Raw_string_literal |
    Raw_byte_string_literal |
    Raw_c_string_literal |
    Reserved_literal_2021 |
    Reserved_single_quoted_literal_2021 |
    Reserved_guard |
    Float_literal |
    Reserved_float |
    Integer_literal |
    Raw_lifetime_or_label |
    Reserved_lifetime_or_label_prefix |
    Lifetime_or_label |
    Raw_ident |
    Reserved_prefix_2021 |
    Ident |
    Punctuation
}

Whitespace and comment tokens

Whitespace

Grammar
Whitespace = { PATTERN_WHITE_SPACE + }

See Special terminals for the definition of PATTERN_WHITE_SPACE.

Attributes

(none)

Rejection

No matches are rejected.

Line comment

Grammar
Line_comment = { "//" ~ LINE_COMMENT_CONTENT }
LINE_COMMENT_CONTENT = { ( !"\n" ~ ANY )* }

Attributes

The token's style and body are determined from LINE_COMMENT_CONTENT as follows:

  • if LINE_COMMENT_CONTENT begins with //:

    • style is non-doc
    • body is empty
  • otherwise, if LINE_COMMENT_CONTENT begins with /,

    • style is outer doc
    • body is the characters from LINE_COMMENT_CONTENT after that /
  • otherwise, if LINE_COMMENT_CONTENT begins with !,

    • style is inner doc
    • body is the characters from LINE_COMMENT_CONTENT after that !
  • otherwise

    • style is non-doc
    • body is empty

Note: The body of a non-doc comment is ignored by the rest of the compilation process
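A minimal Rust sketch of this determination (reusing the CommentStyle type sketched under Fine-grained tokens); content is LINE_COMMENT_CONTENT:

fn classify_line_comment(content: &str) -> (CommentStyle, &str) {
    if content.starts_with("//") {
        // e.g. //// comments are not doc-comments
        (CommentStyle::NonDoc, "")
    } else if let Some(body) = content.strip_prefix('/') {
        (CommentStyle::OuterDoc, body)
    } else if let Some(body) = content.strip_prefix('!') {
        (CommentStyle::InnerDoc, body)
    } else {
        (CommentStyle::NonDoc, "")
    }
}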

Rejection

The match is rejected if the token's body would include a CR character.

Block comment

Grammar
Block_comment = { BLOCK_COMMENT }
BLOCK_COMMENT = { "/*" ~ BLOCK_COMMENT_CONTENT ~ "*/" }
BLOCK_COMMENT_CONTENT = { ( BLOCK_COMMENT | !"*/" ~ !"/*" ~ ANY ) * }

Note: See Nested block comments for discussion of the !"/*" subexpression.

Attributes

The comment content is the sequence of characters consumed by the first (and so the outermost) instance of BLOCK_COMMENT_CONTENT which participated in the match.

The token's style and body are determined from the block comment content as follows:

  • if the comment content begins with **:

    • style is non-doc
    • body is empty
  • otherwise, if the comment content begins with * and contains at least one further character,

    • style is outer doc
    • body is the characters from the comment content after that *
  • otherwise, if the comment content begins with !,

    • style is inner doc
    • body is the characters from the comment content after that !
  • otherwise

    • style is non-doc
    • body is empty

Note: It follows that /**/ and /***/ are not doc-comments

Note: The body of a non-doc comment is ignored by the rest of the compilation process
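A corresponding Rust sketch for block comments; content is the comment content defined above:

fn classify_block_comment(content: &str) -> (CommentStyle, &str) {
    if content.starts_with("**") {
        (CommentStyle::NonDoc, "")
    } else if content.starts_with('*') && content.len() > 1 {
        // At least one further character follows the *.
        (CommentStyle::OuterDoc, &content[1..])
    } else if let Some(body) = content.strip_prefix('!') {
        (CommentStyle::InnerDoc, body)
    } else {
        // This branch covers /**/ (empty content) and /***/ (content "*").
        (CommentStyle::NonDoc, "")
    }
}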

Rejection

The match is rejected if the token's body would include a CR character.

Unterminated block comment

Grammar
Unterminated_block_comment = { "/*" }

Rejection

All matches are rejected.

Note: This definition makes sure that an unterminated block comment isn't accepted as punctuation (* followed by /).

String and byte literal tokens

Single-quoted literals

The following nonterminals are common to the definitions below:

Grammar
SQ_REMAINDER = {
    "'" ~ SQ_CONTENT ~ "'" ~
    SUFFIX ?
}
SQ_CONTENT = {
    "\\" ~ ANY ~ ( !"'" ~ ANY ) * |
    !"'" ~ ANY
}

SUFFIX = { IDENT }
IDENT = { IDENT_START ~ XID_CONTINUE * }
IDENT_START = { XID_START | "_" }

Character literal

Grammar
Character_literal = { SQ_REMAINDER }

Definitions

Define a represented character, derived from SQ_CONTENT as follows:

  • If SQ_CONTENT is the single character LF, CR, or TAB, the match is rejected.

  • If SQ_CONTENT is any other single character, the represented character is that character.

  • If SQ_CONTENT is one of the following forms of escape sequence, the represented character is the escape sequence's escaped value:

    • a simple escape
    • a 7-bit escape
    • a unicode escape

  • Otherwise the match is rejected

Attributes

The token's represented character is the represented character.

The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.

Rejection

The match is rejected if:

  • SQ_CONTENT is unacceptable, as described in the definition of the represented character above; or
  • the token's suffix would consist of the single character _.

Byte literal

Grammar
Byte_literal = { "b" ~ SQ_REMAINDER }

Definitions

Define a represented character, derived from SQ_CONTENT as follows:

  • If SQ_CONTENT is the single character LF, CR, or TAB, the match is rejected.

  • If SQ_CONTENT is a single character with Unicode scalar value greater than 127, the match is rejected.

  • If SQ_CONTENT is any other single character, the represented character is that character.

  • If SQ_CONTENT is one of the following forms of escape sequence, the represented character is the escape sequence's escaped value:

    • a simple escape
    • an 8-bit escape

  • Otherwise the match is rejected

Attributes

The token's represented byte is the represented character's Unicode scalar value. (This is well defined because the definition above ensures that value is less than 256.)

The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.

Rejection

The match is rejected if:

  • SQ_CONTENT is unacceptable, as described in the definition of the represented character above; or
  • the token's suffix would consist of the single character _.

(Non-raw) double-quoted literals

The following nonterminals are common to the definitions below:

Grammar
DQ_REMAINDER = {
    "\"" ~ DQ_CONTENT ~ "\"" ~
    SUFFIX ?
}
DQ_CONTENT = {
    (
        "\\" ~ ANY |
        !"\"" ~ ANY
    ) *
}

SUFFIX = { IDENT }
IDENT = { IDENT_START ~ XID_CONTINUE * }
IDENT_START = { XID_START | "_" }

String literal

Grammar
String_literal = { DQ_REMAINDER }

Attributes

The token's represented string is derived from DQ_CONTENT by replacing each escape sequence of any of the following forms with the escape sequence's escaped value:

  • simple escapes
  • 7-bit escapes
  • unicode escapes
  • string continuation escapes

These replacements take place in left-to-right order. For example, a match against the characters "\\x41" is converted to the characters \ x 4 1.

See Wording for string unescaping
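To illustrate the left-to-right order, here is a partial Rust sketch handling only the simple escapes (the 7-bit, unicode, and string continuation forms are elided); None stands for rejection:

fn unescape_simple_only(content: &str) -> Option<String> {
    let mut out = String::new();
    let mut chars = content.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        out.push(match chars.next()? {
            '0' => '\0',
            't' => '\t',
            'n' => '\n',
            'r' => '\r',
            '"' => '"',
            '\'' => '\'',
            '\\' => '\\',
            // Other escape forms are elided from this sketch; a \ which
            // doesn't begin a recognised escape means the match is rejected.
            _ => return None,
        });
    }
    Some(out)
}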

The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.

Rejection

The match is rejected if:

  • a \ character appears in DQ_CONTENT and is not part of one of the above forms of escape; or
  • a CR character appears in DQ_CONTENT and is not part of a string continuation escape; or
  • the token's suffix would consist of the single character _.

Byte-string literal

Grammar
Byte_string_literal = { "b" ~ DQ_REMAINDER }

Definitions

Define a represented string (a sequence of characters) derived from DQ_CONTENT by replacing each escape sequence of any of the following forms with the escape sequence's escaped value:

  • simple escapes
  • 8-bit escapes
  • string continuation escapes

These replacements take place in left-to-right order. For example, a match against the characters b"\\x41" is converted to the characters \ x 4 1.

See Wording for string unescaping

Attributes

The token's represented bytes are the sequence of Unicode scalar values of the characters in the represented string. (This is well defined because of the first rejection case below.)

The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.

Rejection

The match is rejected if:

  • any character whose unicode scalar value is greater than 127 appears in DQ_CONTENT; or
  • a \ character appears in DQ_CONTENT and is not part of one of the above forms of escape; or
  • a CR character appears in DQ_CONTENT and is not part of a string continuation escape; or
  • the token's suffix would consist of the single character _.

C-string literal

Grammar
C_string_literal = { "c" ~ DQ_REMAINDER }

Attributes

DQ_CONTENT is treated as a sequence of items, each of which is either a single Unicode character other than \ or an escape sequence of one of the following forms: a simple escape, an 8-bit escape, a unicode escape, or a string continuation escape.

The token's represented bytes are derived from that sequence of items in the following way:

  • a single Unicode character contributes its UTF-8 encoding;
  • a simple escape or an 8-bit escape contributes a single byte, whose value is the Unicode scalar value of the escape sequence's escaped value;
  • a unicode escape contributes the UTF-8 encoding of the escape sequence's escaped value;
  • a string continuation escape contributes no bytes.

See Wording for string unescaping

The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.

Rejection

The match is rejected if:

  • a \ character appears in DQ_CONTENT and is not part of one of the above forms of escape; or
  • a CR character appears in DQ_CONTENT and is not part of a string continuation escape; or
  • any of the token's represented bytes would have value 0; or
  • the token's suffix would consist of the single character _.

Raw double-quoted literals

The following nonterminals are common to the definitions below:

Grammar
RAW_DQ_REMAINDER = {
    HASHES¹ ~
    "\"" ~ RAW_DQ_CONTENT ~ "\"" ~
    HASHES² ~
    SUFFIX ?
}
RAW_DQ_CONTENT = {
    ( !("\"" ~ HASHES²) ~ ANY ) *
}
HASHES = { "#" {0, 255} }

SUFFIX = { IDENT }
IDENT = { IDENT_START ~ XID_CONTINUE * }
IDENT_START = { XID_START | "_" }

These definitions require an extension to the Parsing Expression Grammar formalism: each of the expressions marked as HASHES² fails unless the text it matches is the same as the text matched by the (only) successful match using the expression marked as HASHES¹ in the same attempt to match the current token-kind nonterminal.

See Grammar for raw string literals for a discussion of alternatives to this extension.

Raw string literal

Grammar
Raw_string_literal = { "r" ~ RAW_DQ_REMAINDER }

Attributes

The token's represented string is RAW_DQ_CONTENT.

The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.

Rejection

The match is rejected if:

  • a CR character appears in RAW_DQ_CONTENT; or
  • the token's suffix would consist of the single character _.

Raw byte-string literal

Grammar
Raw_byte_string_literal = { "br" ~ RAW_DQ_REMAINDER }

Attributes

The token's represented bytes are the Unicode scalar values of the characters in RAW_DQ_CONTENT. (This is well defined because of the first rejection case below.)

The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.

Rejection

The match is rejected if:

  • any character whose unicode scalar value is greater than 127 appears in RAW_DQ_CONTENT; or
  • a CR character appears in RAW_DQ_CONTENT; or
  • the token's suffix would consist of the single character _.

Raw C-string literal

Grammar
Raw_c_string_literal = { "cr" ~ RAW_DQ_REMAINDER }

Attributes

The token's represented bytes are the UTF-8 encoding of RAW_DQ_CONTENT

The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.

Rejection

The match is rejected if:

  • a CR character appears in RAW_DQ_CONTENT; or
  • any of the token's represented bytes would have value 0; or
  • the token's suffix would consist of the single character _.

Reserved forms

Reserved or unterminated literal

Grammar
Unterminated_literal_2015 = { "r\"" | "br\"" | "b'" }
Reserved_literal_2021 = { IDENT ~ ( "\"" | "'" ) }

Rejection

All matches are rejected.

Note: I believe in the Unterminated_literal_2015 definition only the b' form is strictly needed: if that definition matches using one of the other subexpressions then the input will be rejected eventually anyway (given that the corresponding string literal nonterminal didn't match).

Note: Reserved_literal_2021 catches both reserved forms and unterminated b' literals.

Reserved single-quoted literal

Grammar
Reserved_single_quoted_literal_2015 = { "'" ~ IDENT ~ "'" }
Reserved_single_quoted_literal_2021 = { "'" ~ "r#" ? ~ IDENT ~ "'" }

Rejection

All matches are rejected.

Note: This reservation is to catch forms like 'aaa'bbb, so this definition must come before Lifetime_or_label.

Reserved guard (Rust 2024)

Grammar
Reserved_guard = { "##" | "#\"" }

Rejection

All matches are rejected.

Note: This definition is listed here near the double-quoted string literals because these forms were reserved during discussions about introducing string literals formed like #"…"#.

Escape processing

The descriptions of processing string and character literals make use of several forms of escape.

Each form of escape is characterised by:

  • an escape sequence: a sequence of characters, which always begins with \
  • an escaped value: either a single character or an empty sequence of characters

In the definitions of escapes below:

  • An octal digit is any of the characters in the range 0..=7.
  • A hexadecimal digit is any of the characters in the ranges 0..=9, a..=f, or A..=F.

Simple escapes

Each sequence of characters occurring in the first column of the following table is an escape sequence.

In each case, the escaped value is the character given in the corresponding entry in the second column.

Escape sequence   Escaped value
\0                U+0000 NUL
\t                U+0009 HT
\n                U+000A LF
\r                U+000D CR
\"                U+0022 QUOTATION MARK
\'                U+0027 APOSTROPHE
\\                U+005C REVERSE SOLIDUS

Note: The escaped value therefore has a Unicode scalar value which can be represented in a byte.

8-bit escapes

The escape sequence consists of \x followed by two hexadecimal digits.

The escaped value is the character whose Unicode scalar value is the result of interpreting the final two characters in the escape sequence as a hexadecimal integer, as if by u8::from_str_radix with radix 16.

Note: The escaped value therefore has a Unicode scalar value which can be represented in a byte.

7-bit escapes

The escape sequence consists of \x followed by an octal digit then a hexadecimal digit.

The escaped value is the character whose Unicode scalar value is the result of interpreting the final two characters in the escape sequence as a hexadecimal integer, as if by u8::from_str_radix with radix 16.
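A minimal Rust sketch of computing these escaped values; digits holds the two characters following \x:

fn hex_escape_value(digits: &str, seven_bit_only: bool) -> Option<char> {
    // As if by u8::from_str_radix with radix 16.
    let value = u8::from_str_radix(digits, 16).ok()?;
    // For a 7-bit escape the first digit is octal, which is equivalent to
    // requiring a value of at most 0x7F.
    if seven_bit_only && value > 0x7F {
        return None;
    }
    Some(char::from(value))
}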

Unicode escapes

The escape sequence consists of \u{, followed by a hexadecimal digit, followed by a sequence of characters each of which is a hexadecimal digit or _, followed by }, with the following restrictions:

  • there are no more than six hexadecimal digits in the entire escape sequence; and
  • the result of interpreting the hexadecimal digits contained in the escape sequence as a hexadecimal integer, as if by u32::from_str_radix with radix 16, is a Unicode scalar value.

The escaped value is the character with that Unicode scalar value.
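A minimal Rust sketch of these restrictions; inner holds the characters between \u{ and }:

fn unicode_escape_value(inner: &str) -> Option<char> {
    let digits: String = inner.chars().filter(|&c| c != '_').collect();
    // No more than six hexadecimal digits are permitted.
    if digits.is_empty() || digits.len() > 6 {
        return None;
    }
    // As if by u32::from_str_radix with radix 16.
    let value = u32::from_str_radix(&digits, 16).ok()?;
    // from_u32 returns None when the value isn't a Unicode scalar value
    // (for example, for surrogate values such as 0xD800).
    char::from_u32(value)
}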

String continuation escapes

The escape sequence consists of \ followed immediately by LF, and all following whitespace characters before the next non-whitespace character.

For this purpose, the whitespace characters are HT (U+0009), LF (U+000A), CR (U+000D), and SPACE (U+0020).

The escaped value is an empty sequence of characters.

The Reference says this behaviour may change in future; see String continuation escapes.

Numeric literal tokens

The following nonterminals are common to the definitions below:

Grammar
DECIMAL_DIGITS = { ('0'..'9' | "_") * }
HEXADECIMAL_DIGITS = { ('0'..'9' | 'a' .. 'f' | 'A' .. 'F' | "_") * }
LOW_BASE_TOKEN_DIGITS = { DECIMAL_DIGITS }
DECIMAL_PART = { '0'..'9' ~ DECIMAL_DIGITS }

SUFFIX = { IDENT }
IDENT = { IDENT_START ~ XID_CONTINUE * }
IDENT_START = { XID_START | "_" }

Float literal

Grammar
Float_literal = {
    FLOAT_BODY_WITH_EXPONENT ~ SUFFIX ? |
    FLOAT_BODY_WITHOUT_EXPONENT ~ !("e"|"E") ~ SUFFIX ? |
    FLOAT_BODY_WITH_FINAL_DOT ~ !"." ~ !IDENT_START
}

FLOAT_BODY_WITH_EXPONENT = {
    DECIMAL_PART ~ ("." ~ DECIMAL_PART ) ? ~
    ("e"|"E") ~ ("+"|"-") ? ~ EXPONENT_DIGITS
}
EXPONENT_DIGITS = { "_" * ~ '0'..'9' ~ DECIMAL_DIGITS }

FLOAT_BODY_WITHOUT_EXPONENT = {
    DECIMAL_PART ~ "." ~ DECIMAL_PART
}

FLOAT_BODY_WITH_FINAL_DOT = {
    DECIMAL_PART ~ "."
}

Note: The ! "." subexpression makes sure that forms like 1..2 aren't treated as starting with a float. The ! IDENT_START subexpression makes sure that forms like 1.some_method() aren't treated as starting with a float.

Attributes

The token's body is FLOAT_BODY_WITH_EXPONENT, FLOAT_BODY_WITHOUT_EXPONENT, or FLOAT_BODY_WITH_FINAL_DOT, whichever one participated in the match.

The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.

Rejection

No matches are rejected.

Reserved float

Grammar
Reserved_float = {
    RESERVED_FLOAT_EMPTY_EXPONENT | RESERVED_FLOAT_BASED
}
RESERVED_FLOAT_EMPTY_EXPONENT = {
    DECIMAL_PART ~ ("." ~ DECIMAL_PART ) ? ~
    ("e"|"E") ~ ("+"|"-") ?
}
RESERVED_FLOAT_BASED = {
    (
        ("0b" | "0o") ~ LOW_BASE_TOKEN_DIGITS |
        "0x" ~ HEXADECIMAL_DIGITS
    )  ~  (
        ("e"|"E") |
        "." ~ !"." ~ !IDENT_START
    )
}

Rejection

All matches are rejected.

Integer literal

Grammar
Integer_literal = {
    ( INTEGER_BINARY_LITERAL |
      INTEGER_OCTAL_LITERAL |
      INTEGER_HEXADECIMAL_LITERAL |
      INTEGER_DECIMAL_LITERAL ) ~
    SUFFIX_NO_E ?
}

INTEGER_BINARY_LITERAL = { "0b" ~ LOW_BASE_TOKEN_DIGITS }
INTEGER_OCTAL_LITERAL = { "0o" ~ LOW_BASE_TOKEN_DIGITS }
INTEGER_HEXADECIMAL_LITERAL = { "0x" ~ HEXADECIMAL_DIGITS }
INTEGER_DECIMAL_LITERAL = { DECIMAL_PART }

SUFFIX_NO_E = { !("e"|"E") ~ SUFFIX }

Note: See rfc0879 for the reason we accept all decimal digits in binary and octal tokens; the inappropriate digits cause the token to be rejected.

Note: The INTEGER_DECIMAL_LITERAL nonterminal is listed last in the Integer_literal definition in order to resolve ambiguous cases like the following:

  • 0b1e2 (which isn't 0 with suffix b1e2)
  • 0b0123 (which is rejected, not accepted as 0 with suffix b0123)
  • 0xy (which is rejected, not accepted as 0 with suffix xy)
  • 0x· (which is rejected, not accepted as 0 with suffix x·)

Attributes

The token's base is looked up in the following table, depending on which nonterminal participated in the match:

INTEGER_BINARY_LITERAL        binary
INTEGER_OCTAL_LITERAL         octal
INTEGER_HEXADECIMAL_LITERAL   hexadecimal
INTEGER_DECIMAL_LITERAL       decimal

The token's digits are LOW_BASE_TOKEN_DIGITS, HEXADECIMAL_DIGITS, or DECIMAL_PART, whichever one participated in the match.

The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.

Rejection

The match is rejected if:

  • the token's digits would consist entirely of _ characters; or
  • the token's base would be binary and its digits would contain any character other than 0, 1, or _; or
  • the token's base would be octal and its digits would contain any character other than 0, 1, 2, 3, 4, 5, 6, 7, or _.

Note: In particular, a match which would make an Integer_literal with empty digits is rejected.
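A minimal Rust sketch of these digit checks; digits and base are the token's attributes, with the base given as 2, 8, 10, or 16:

fn digits_acceptable(digits: &str, base: u32) -> bool {
    // Reject digits consisting entirely of _ characters (this also covers
    // the empty-digits case mentioned in the note above).
    if !digits.chars().any(|c| c != '_') {
        return false;
    }
    // For binary and octal tokens the grammar accepts any decimal digit,
    // so out-of-range digits (as in 0b123) must be rejected here.
    digits.chars().all(|c| c == '_' || c.is_digit(base))
}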

Ident, lifetime, and label tokens

This writeup uses the term ident to refer to a token that lexically has the form of an identifier, including keywords and _.

Note: the procedural macro system uses the name Ident to refer to what this writeup calls Ident and Raw_ident.

The following nonterminals are common to the definitions below:

Grammar
IDENT = { IDENT_START ~ XID_CONTINUE * }
IDENT_START = { XID_START | "_" }

Note: This is following the specification in Unicode Standard Annex #31 for Unicode version 16.0, with the addition of permitting underscore as the first character.

See Special terminals for the definitions of XID_START and XID_CONTINUE.

Raw lifetime or label (Rust 2021 and 2024)

Grammar
Raw_lifetime_or_label = { "'r#" ~ IDENT }

Attributes

The token's name is IDENT.

Note that the name is not NFC-normalised. See NFC normalisation for lifetime/label.

Rejection

The match is rejected if IDENT is one of the following sequences of characters:

  • _
  • crate
  • self
  • super
  • Self

Reserved lifetime or label prefix (Rust 2021 and 2024)

Grammar
Reserved_lifetime_or_label_prefix = { "'" ~ IDENT ~ "#" }

Rejection

All matches are rejected.

(Non-raw) lifetime or label

Grammar
Lifetime_or_label = { "'" ~ IDENT }

Note: The Reserved_single_quoted_literal definitions make sure that forms like 'aaa'bbb are not accepted.

See Modelling lifetimes and labels for a discussion of why this model doesn't simply treat ' as punctuation.

Attributes

The token's name is IDENT.

Note that the name is not NFC-normalised. See NFC normalisation for lifetime/label.

Rejection

No matches are rejected.

Raw ident

Grammar
Raw_ident = { "r#" ~ IDENT }

Attributes

The token's represented ident is the NFC-normalised form of IDENT.

Rejection

The match is rejected if the token's represented ident would be one of the following sequences of characters:

  • _
  • crate
  • self
  • super
  • Self

Reserved prefix

Grammar
Reserved_prefix_2015 = { "r#" | "br#" }
Reserved_prefix_2021 = { IDENT ~ "#" }

Rejection

All matches are rejected.

Note: This definition must appear here in priority order. Tokens added in future which match these reserved forms wouldn't necessarily be forms of identifier.

(Non-raw) ident

Grammar
Ident = { IDENT }

Note: The Reference adds the following when discussing identifiers: "Zero width non-joiner (ZWNJ U+200C) and zero width joiner (ZWJ U+200D) characters are not allowed in identifiers." Those characters don't have XID_Start or XID_Continue, so that's only informative text, not an additional constraint.

Attributes

The token's represented ident is the NFC-normalised form of IDENT

Rejection

No matches are rejected.

Punctuation tokens

Punctuation

Grammar
Punctuation = {
    ";" |
    "," |
    "." |
    "(" |
    ")" |
    "{" |
    "}" |
    "[" |
    "]" |
    "@" |
    "#" |
    "~" |
    "?" |
    ":" |
    "$" |
    "=" |
    "!" |
    "<" |
    ">" |
    "-" |
    "&" |
    "|" |
    "+" |
    "*" |
    "/" |
    "^" |
    "%"
}

Attributes

The token's mark is the single character consumed by the match.

Rejection

No matches are rejected.

Lowering doc-comments

This phase of processing converts an input sequence of fine-grained tokens to a new sequence of fine-grained tokens.

The new sequence is the same as the input sequence, except that each Line_comment or Block_comment token whose style is inner doc or outer doc is replaced with the following sequence:

  • Punctuation with mark #
  • Whitespace
  • Punctuation with mark ! (omitted if the comment token's style is outer doc)
  • Punctuation with mark [
  • Ident with represented ident doc
  • Punctuation with mark =
  • Whitespace
  • Raw_string_literal with the comment token's body as the represented string and empty suffix
  • Punctuation with mark ]

Note: the whitespace tokens aren't observable by anything currently described in this writeup, but they explain the spacing in the tokens that proc-macros see.
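For example, the outer doc-comment /// Adds one. (whose body is " Adds one.") is replaced by the same sequence of tokens as

# [doc = r" Adds one."]

and the inner doc-comment //! Adds one. by the same sequence as

# ![doc = r" Adds one."]

(The r"…" here stands for the Raw_string_literal token; a body containing a " character would need a form with hashes, which this illustration glosses over.)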

Machine-readable frontmatter grammar

The machine-readable Pest grammar for frontmatter is presented here for convenience.

See Parsing Expression Grammars for an explanation of the notation.

This version of the grammar uses Pest's PUSH, PEEK, and POP for the matched fences.

ANY, EOI, PATTERN_WHITE_SPACE, XID_START, and XID_CONTINUE are built in to Pest and so not defined below.

FRONTMATTER = {
    WHITESPACE_ONLY_LINE * ~
    START_LINE ~
    CONTENT_LINE * ~
    END_LINE
}

WHITESPACE_ONLY_LINE = {
    ( !"\n" ~ PATTERN_WHITE_SPACE ) * ~
    "\n"
}

START_LINE = {
    PUSH(FENCE) ~
    HORIZONTAL_WHITESPACE * ~
    ( INFOSTRING ~ HORIZONTAL_WHITESPACE * ) ? ~
    "\n"
}

CONTENT_LINE = {
    !PEEK ~
    ( !"\n" ~ ANY ) * ~
    "\n"
}

END_LINE = {
    POP ~
    HORIZONTAL_WHITESPACE * ~
    ( "\n" | EOI )
}

FENCE = { "---" ~ "-" * }

INFOSTRING = {
    ( XID_START | "_" ) ~
    ( XID_CONTINUE | "-" | "." ) *
}

HORIZONTAL_WHITESPACE = { " " | "\t" }


RESERVED = {
    PATTERN_WHITE_SPACE * ~
    FENCE
}

The complete tokenisation grammar

The machine-readable Pest grammar for tokenisation is presented here for convenience.

See Parsing Expression Grammars for an explanation of the notation.

This version of the grammar uses Pest's PUSH, PEEK, and POP for the Raw_double_quoted_literal definitions.

ANY, PATTERN_WHITE_SPACE, XID_START, and XID_CONTINUE are built in to Pest and so not defined below.

TOKENS_2015 = { TOKEN_2015 * }
TOKENS_2021 = { TOKEN_2021 * }
TOKENS_2024 = { TOKEN_2024 * }

TOKEN_2015 = {
    Whitespace |
    Line_comment |
    Block_comment |
    Unterminated_block_comment |
    Character_literal |
    Byte_literal |
    String_literal |
    Byte_string_literal |
    Raw_string_literal |
    Raw_byte_string_literal |
    Unterminated_literal_2015 |
    Reserved_single_quoted_literal_2015 |
    Float_literal |
    Reserved_float |
    Integer_literal |
    Lifetime_or_label |
    Raw_ident |
    Reserved_prefix_2015 |
    Ident |
    Punctuation
}

TOKEN_2021 = {
    Whitespace |
    Line_comment |
    Block_comment |
    Unterminated_block_comment |
    Character_literal |
    Byte_literal |
    String_literal |
    Byte_string_literal |
    C_string_literal |
    Raw_string_literal |
    Raw_byte_string_literal |
    Raw_c_string_literal |
    Reserved_literal_2021 |
    Reserved_single_quoted_literal_2021 |
    Float_literal |
    Reserved_float |
    Integer_literal |
    Raw_lifetime_or_label |
    Reserved_lifetime_or_label_prefix |
    Lifetime_or_label |
    Raw_ident |
    Reserved_prefix_2021 |
    Ident |
    Punctuation
}

TOKEN_2024 = {
    Whitespace |
    Line_comment |
    Block_comment |
    Unterminated_block_comment |
    Character_literal |
    Byte_literal |
    String_literal |
    Byte_string_literal |
    C_string_literal |
    Raw_string_literal |
    Raw_byte_string_literal |
    Raw_c_string_literal |
    Reserved_literal_2021 |
    Reserved_single_quoted_literal_2021 |
    Reserved_guard |
    Float_literal |
    Reserved_float |
    Integer_literal |
    Raw_lifetime_or_label |
    Reserved_lifetime_or_label_prefix |
    Lifetime_or_label |
    Raw_ident |
    Reserved_prefix_2021 |
    Ident |
    Punctuation
}


Whitespace = { PATTERN_WHITE_SPACE + }

Line_comment = { "//" ~ LINE_COMMENT_CONTENT }
LINE_COMMENT_CONTENT = { ( !"\n" ~ ANY )* }

Block_comment = { BLOCK_COMMENT }
BLOCK_COMMENT = { "/*" ~ BLOCK_COMMENT_CONTENT ~ "*/" }
BLOCK_COMMENT_CONTENT = { ( BLOCK_COMMENT | !"*/" ~ !"/*" ~ ANY ) * }

Unterminated_block_comment = { "/*" }


SQ_REMAINDER = {
    "'" ~ SQ_CONTENT ~ "'" ~
    SUFFIX ?
}
SQ_CONTENT = {
    "\\" ~ ANY ~ ( !"'" ~ ANY ) * |
    !"'" ~ ANY
}

Character_literal = { SQ_REMAINDER }

Byte_literal = { "b" ~ SQ_REMAINDER }


DQ_REMAINDER = {
    "\"" ~ DQ_CONTENT ~ "\"" ~
    SUFFIX ?
}
DQ_CONTENT = {
    (
        "\\" ~ ANY |
        !"\"" ~ ANY
    ) *
}

String_literal = { DQ_REMAINDER }

Byte_string_literal = { "b" ~ DQ_REMAINDER }

C_string_literal = { "c" ~ DQ_REMAINDER }



RAW_DQ_REMAINDER = {
    PUSH(HASHES) ~
    "\"" ~ RAW_DQ_CONTENT ~ "\"" ~
    POP ~
    SUFFIX ?
}
RAW_DQ_CONTENT = {
    ( !("\"" ~ PEEK) ~ ANY ) *
}
HASHES = { "#" {0, 255} }

Raw_string_literal = { "r" ~ RAW_DQ_REMAINDER }

Raw_byte_string_literal = { "br" ~ RAW_DQ_REMAINDER }

Raw_c_string_literal = { "cr" ~ RAW_DQ_REMAINDER }


Unterminated_literal_2015 = { "r\"" | "br\"" | "b'" }
Reserved_literal_2021 = { IDENT ~ ( "\"" | "'" ) }

Reserved_single_quoted_literal_2015 = { "'" ~ IDENT ~ "'" }
Reserved_single_quoted_literal_2021 = { "'" ~ "r#" ? ~ IDENT ~ "'" }


Reserved_guard = { "##" | "#\"" }


DECIMAL_DIGITS = { ('0'..'9' | "_") * }
HEXADECIMAL_DIGITS = { ('0'..'9' | 'a' .. 'f' | 'A' .. 'F' | "_") * }
LOW_BASE_TOKEN_DIGITS = { DECIMAL_DIGITS }
DECIMAL_PART = { '0'..'9' ~ DECIMAL_DIGITS }


Float_literal = {
    FLOAT_BODY_WITH_EXPONENT ~ SUFFIX ? |
    FLOAT_BODY_WITHOUT_EXPONENT ~ !("e"|"E") ~ SUFFIX ? |
    FLOAT_BODY_WITH_FINAL_DOT ~ !"." ~ !IDENT_START
}

FLOAT_BODY_WITH_EXPONENT = {
    DECIMAL_PART ~ ("." ~ DECIMAL_PART ) ? ~
    ("e"|"E") ~ ("+"|"-") ? ~ EXPONENT_DIGITS
}
EXPONENT_DIGITS = { "_" * ~ '0'..'9' ~ DECIMAL_DIGITS }

FLOAT_BODY_WITHOUT_EXPONENT = {
    DECIMAL_PART ~ "." ~ DECIMAL_PART
}

FLOAT_BODY_WITH_FINAL_DOT = {
    DECIMAL_PART ~ "."
}

Reserved_float = {
    RESERVED_FLOAT_EMPTY_EXPONENT | RESERVED_FLOAT_BASED
}
RESERVED_FLOAT_EMPTY_EXPONENT = {
    DECIMAL_PART ~ ("." ~ DECIMAL_PART ) ? ~
    ("e"|"E") ~ ("+"|"-") ?
}
RESERVED_FLOAT_BASED = {
    (
        ("0b" | "0o") ~ LOW_BASE_TOKEN_DIGITS |
        "0x" ~ HEXADECIMAL_DIGITS
    )  ~  (
        ("e"|"E") |
        "." ~ !"." ~ !IDENT_START
    )
}


Integer_literal = {
    ( INTEGER_BINARY_LITERAL |
      INTEGER_OCTAL_LITERAL |
      INTEGER_HEXADECIMAL_LITERAL |
      INTEGER_DECIMAL_LITERAL ) ~
    SUFFIX_NO_E ?
}

INTEGER_BINARY_LITERAL = { "0b" ~ LOW_BASE_TOKEN_DIGITS }
INTEGER_OCTAL_LITERAL = { "0o" ~ LOW_BASE_TOKEN_DIGITS }
INTEGER_HEXADECIMAL_LITERAL = { "0x" ~ HEXADECIMAL_DIGITS }
INTEGER_DECIMAL_LITERAL = { DECIMAL_PART }

SUFFIX_NO_E = { !("e"|"E") ~ SUFFIX }


Raw_lifetime_or_label = { "'r#" ~ IDENT }

Reserved_lifetime_or_label_prefix = { "'" ~ IDENT ~ "#" }

Lifetime_or_label = { "'" ~ IDENT }

Raw_ident = { "r#" ~ IDENT }

Reserved_prefix_2015 = { "r#" | "br#" }
Reserved_prefix_2021 = { IDENT ~ "#" }

Ident = { IDENT }


Punctuation = {
    ";" |
    "," |
    "." |
    "(" |
    ")" |
    "{" |
    "}" |
    "[" |
    "]" |
    "@" |
    "#" |
    "~" |
    "?" |
    ":" |
    "$" |
    "=" |
    "!" |
    "<" |
    ">" |
    "-" |
    "&" |
    "|" |
    "+" |
    "*" |
    "/" |
    "^" |
    "%"
}


SUFFIX = { IDENT }
IDENT = { IDENT_START ~ XID_CONTINUE * }
IDENT_START = { XID_START | "_" }

Command-line interface for the reimplementation

The repository containing this writeup also contains a binary crate which contains the reimplementation and a command line program for comparing the reimplementation against rustc (linking against the rustc implementation via rustc_private).

Run it in the usual way, from a working copy:

cargo run -- <subcommand> [options]

Note the repository includes a rust-toolchain.toml file which will cause cargo run to install the required nightly version of rustc.

Summary usage

Usage: lexeywan [<subcommand>] [...options]

Subcommands:
 *test          [suite-opts]
  compare       [suite-opts] [comparison-opts] [dialect-opts]
  decl-compare  [suite-opts] [comparison-opts] [--edition=2015|2021|*2024]
  inspect       [suite-opts] [dialect-opts]
  coarse        [suite-opts] [dialect-opts]
  identcheck
  proptest      [--count] [--strategy=<name>] [--print-failures|--print-all]
                [dialect-opts]

suite-opts (specify at most one):
  --short: run the SHORTLIST rather than the LONGLIST
  --xfail: run the tests which are expected to fail

dialect-opts:
  --edition=2015|2021|*2024
  --cleaning=none|*shebang|shebang-and-frontmatter
  --lower-doc-comments

comparison-opts:
  --failures-only: don't report cases where the lexers agree
  --details=always|*failures|never

* -- default

Subcommands

The following subcommands are available:

test

This is the main way to check the whole system for disagreements.

test is the default subcommand.

For each known edition, it runs the following for the requested testcases:

  • comparison of rustc_parse's lexer and the reimplementation, like for compare with options
    • --cleaning=none
    • --cleaning=shebang
    • --cleaning=shebang-and-frontmatter
    • --cleaning=shebang --lower-doc-comments
  • comparison via declarative macros, like for decl-compare

For each comparison it reports whether the implementations agreed for all testcases, without further detail.

compare

This is the main way to run the testsuite and see results for individual testcases.

It analyses the requested testcases with the rustc_parse lexer and the reimplementation, and compares the output.

The analysis uses a single dialect.

For each testcase, the comparison agrees if either:

  • both implementations accept the input and produce the same forest of regularised tokens; or
  • both implementations reject the input.

Otherwise the comparison disagrees.

See regular_tokens.rs for what regularisation involves.

The comparison may also mark a testcase as a model error. This happens if rustc panics or the reimplementation reports an internal error.

Example output
‼ R:✓ L:✗ «//comment»
✔ R:✓ L:✓ «'label»

Here, the first line says that rustc (R) accepted the input //comment but the reimplementation (L) rejected it. The initial ‼ indicates the disagreement.

The second line says that both rustc and the reimplementation accepted the input 'label. The initial ✔ indicates the agreement.

Output control

By default compare prints a line (of the sort shown in the example above) for each testcase. Pass --failures-only to only print lines for the cases where the implementations disagree.

The compare subcommand can also report further detail for a testcase:

  • if the input is accepted, the forest of regularised tokens
  • if the input is rejected, the rustc error message or the reimplementation's reason for rejection

Further detail is controlled as follows:

--details=always               Report detail for all testcases
--details=failures (default)   Report detail for disagreeing testcases
--details=never                Report detail for no testcases

inspect

This shows more detail than compare, but doesn't report on agreement.

Analyses the requested testcases using the rustc_parse lexer and the reimplementation, and prints each lexer's analysis.

Uses a specified dialect.

Unlike compare, this shows the tokens before regularisation.

For the reimplementation, it shows details about what the grammar matched, and fine-grained tokens.

If rustc rejects the input (and the rejection wasn't a fatal error), it reports the tokens rustc would have passed on to the parser.

If the reimplementation rejects the input, it reports what has been tokenised so far. If the rejection comes from processing, it describes the rejected match and reports any matches and fine-grained tokens from before the rejection.

coarse

This shows the reimplementation's coarse-grained tokens.

Analyses the requested testcases using the reimplementation only, including combination of fine-grained tokens into coarse-grained tokens, and prints a representation of the analysis.

Uses a specified dialect.

decl-compare

This is a second way to test the observable behaviour of Rust's lexer, which doesn't depend so much on rustc_parse's internals.

It analyses the requested testcases via declarative macros, and compares the result to what the reimplementation expects.

The analysis works by defining a macro using the tt fragment specifier which applies stringify! to each parameter, embedding the testcase in an invocation of that macro, running rustc's macro expander and parser, and inspecting the results.

See the comments in decl_via_rustc.rs for details.

The reimplementation includes a model of what stringify! does.

It uses the selected edition. Doc-comments are always lowered. Of the steps described in Processing that happens before tokenising, only CRLF normalisation is performed.

The --details and --failures-only options work in the same way as for compare; "details" shows the stringified form of each token.

identcheck

This checks that the rustc_parse lexer and the reimplementation agree which characters are permitted in identifiers.

For each Unicode character C this constructs a string containing C aC, and checks the reimplementation and rustc_parse agree on its analysis.

It reports the number of agreements and disagreements.

It will notice if the Unicode version changes in one of the implementations (rustc's Unicode version comes from its unicode-xid dependency, and the reimplementation's comes from its pest dependency).

It won't notice if the Unicode version used for NFC normalisation is out of sync (for both the reimplementation and rustc, this comes from the unicode-normalization dependency).

identcheck always uses the latest available edition.

proptest

This performs randomised testing.

It uses proptest to generate random strings, analyses them with the rustc_parse lexer and the reimplementation, and compares the output. The analysis and comparison is the same as for compare above, for a specified dialect.

If this process finds a string which results in disagreement (or a model error), proptest simplifies it as much as it can while still provoking the problem, and then testing stops.

The --count argument specifies how many strings to generate (the default is 5000).

The --strategy argument specifies how to generate the strings. See SIMPLE_STRATEGIES in strategies.rs for the list of available strategies. The mix strategy is the default.

Output control

By default proptest prints a single reduced disagreement, if it finds any.

If --print-all is passed it prints each string it generates.

If --print-failures is passed it prints each disagreeing testcase it generates, so you can see the simplification process.

Dialects

The compare, inspect, coarse, and proptest subcommands accept the following options to control the lexers' behaviour:

  • --edition=2015|2021|2024
  • --cleaning=none|shebang|shebang-and-frontmatter
  • --lower-doc-comments

The decl-compare subcommand accepts only --edition.

The options apply both to rustc and the reimplementation.

--edition controls which Rust edition's lexing semantics are used. It defaults to the most recent known edition. There's no 2018 option because there were no lexing changes between the 2015 and 2018 editions.

--cleaning controls which of the steps described in Processing that happens before tokenising are performed. It defaults to shebang. Byte order mark removal and CRLF normalisation are always performed. (The reimplementation doesn't model the "Decoding" step, because the hard-coded testcases are provided as Rust string literals and so are already UTF-8.)

If --lower-doc-comments is passed, doc-comments are converted to attributes as described in Lowering doc-comments.

Choosing the testcases to run

By default, subcommands which need a list of testcases use the list hard-coded as LONGLIST in testcases.rs.

Pass --short to use the list hard-coded as SHORTLIST instead.

Pass --xfail to use the list hard-coded as XFAIL instead. This list includes testcases which are expected to fail or disagree with at least one subcommand and set of options.

Exit status

Each subcommand which compares the reimplementation to rustc reports exit status 0 if all comparisons agreed, or exit status 3 if any comparison disagreed or any model errors were observed.

For all subcommands, exit status 101 indicates an unhandled error.

Rationale for this model

Separating lexing from parsing

This model assumes that we want to define a process of tokenisation, turning a sequence of characters into a sequence of tokens, and that there would be a separate grammar using those tokens as its terminals.

The alternative is to use a single "scannerless" grammar, which combines lexical analysis and parsing into a single process.

Rust's parser is not easy to model, and I think it's likely that the best formalism for describing Rust's parser wouldn't also be a good formalism for describing its lexer.

Rejecting matches

The separate 'processing' step which can reject matches is primarily a way to make the grammar simpler.

The main advantage is dealing with character, byte, and string literals, where we have to reject invalid escape sequences at lexing time.

In this model, the lexer finds the extent of the token using simple grammar definitions, and then checks whether the escape sequences are valid in a separate "processing" operation. So the grammar "knows" that a backslash character indicates an escape sequence, but doesn't model escapes in any further detail.

In contrast, the Reference gives grammar productions which try to describe the available escape sequences in each kind of string-like literal, but this isn't enough to characterise which forms are accepted (for example "\u{d800}" is rejected at lexing time, because there is no character with scalar value D800).
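That particular check falls out directly from Rust's char type; a minimal sketch (not the reimplementation's actual code):

// Given the hex digits between the braces of \u{...}, return the
// escaped value, or None if the escape must be rejected.
fn check_unicode_escape(hex_digits: &str) -> Option<char> {
    let scalar = u32::from_str_radix(hex_digits, 16).ok()?;
    // 0xD800..=0xDFFF are surrogates: no char has such a scalar value,
    // so char::from_u32 returns None and the token is rejected.
    char::from_u32(scalar)
}

// check_unicode_escape("d800") == None       →  "\u{d800}" is rejected
// check_unicode_escape("2764") == Some('❤')  →  "\u{2764}" is accepted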

Given that we have this separate operation, we can use it to simplify other parts of the grammar too, including:

  • distinguishing doc-comments
  • rejecting CR in comments
  • rejecting the reserved keywords in raw identifiers, eg r#crate
  • rejecting no-digit forms like 0x_
  • rejecting the variants of numeric literals reserved in rfc0879, eg 0b123
  • rejecting literals with a single _ as a suffix

This means we can avoid adding many "reserved form" definitions. For example, if we didn't accept _ as a suffix in the main string-literal grammar definitions, we'd have to have another Reserved_something definition to prevent the _ being accepted as a separate token.

Given the choice to use locally greedy matching (see below), I think an operation which rejects tokens after parsing them is necessary to deal with a case like 0b123, to avoid analysing it as 0b1 followed by 23.
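A hedged sketch of such a processing check, assuming the grammar has already matched the token's full extent:

// After the grammar has greedily matched 0b followed by decimal
// digits, processing checks that every digit is valid in base 2.
fn check_binary_digits(digits: &str) -> Result<(), ()> {
    if digits.chars().all(|c| matches!(c, '0' | '1' | '_')) {
        Ok(())
    } else {
        Err(()) // "123" is rejected outright, not re-lexed as 0b1 then 23
    }
}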

Using a Parsing Expression Grammar

I think a PEG is a good formalism for modelling Rust's lexer (though probably not for the rest of the grammar) for several reasons.

Resolving ambiguities

The lexical part of Rust's grammar is necessarily full of ambiguities.

For example:

  • ab could be a single identifier, or a followed by b
  • 1.2 could be a floating-point literal, or 1 followed by . followed by 2
  • r"x" could be a raw string literal, or r followed by "x"

The Reference doesn't explicitly state what formalism it's using, or what rule it uses to disambiguate such cases.

There are two common approaches: to choose the longest available match (as Lex traditionally does), or to explicitly list rules in priority order and specify locally "greedy" repetition-matching (as PEGs do).

With the "longest available" approach, additional rules are needed if multiple rules can match with equal length.

The 2024 version of this model characterised the cases where rustc doesn't choose the longest available match, and where (given its choice of rules) there are multiple longest-matching rules.

For example, the Reference's lexer rules for input such as 0x3 allow two interpretations, matching the same extent:

  • as a hexadecimal integer literal: 0x3 with no suffix
  • as a decimal integer literal: 0 with a suffix of x3

We want to choose the former interpretation. (We could say that these are both the same kind of token and re-analyse it later to decide which part was the suffix, but we'd still have to understand the distinction inside the lexer in order to reject forms like 0b0123.)

Examples where rustc chooses a token which is shorter than the longest available match are rare. In the model used by the Reference, 0x· is one: rustc treats this as a "reserved number" (0x), rather than 0 with suffix x·. (Note that · has the XID_Continue property but not XID_Start.)

I think that in 2025 it's clear that a priority-based system is a better fit for Rust's lexer. In particular, if pr131656 is accepted (allowing forms like 1em), the new ambiguities will be resolved naturally because the floating-point literal definitions have priority over the integer literal definitions.

Generative grammars don't inherently have prioritisation, but parsing expression grammars do.
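As an illustration, here's a hedged Rust sketch of PEG-style ordered choice (the rule functions are simplified stand-ins, not this writeup's definitions):

// Each "rule" returns how many characters it consumed, or None.
fn hex_literal(s: &str) -> Option<usize> {
    let digits = s.strip_prefix("0x")?;
    Some(2 + digits.chars().take_while(char::is_ascii_hexdigit).count())
}

fn dec_literal(s: &str) -> Option<usize> {
    let n = s.chars().take_while(char::is_ascii_digit).count();
    (n > 0).then_some(n)
}

// Trying the higher-priority rule first makes "0x3" a single
// hexadecimal literal; the decimal rule (which alone would match just
// "0", leaving "x3" as a suffix) is only tried if the hex rule fails.
fn number(s: &str) -> Option<usize> {
    hex_literal(s).or_else(|| dec_literal(s))
}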

Ambiguities that must be resolved as errors

There are a number of forms which are errors at lexing time, even though in principle they could be analysed as multiple tokens.

Many cases can be handled in processing, as described above.

Other cases can be handled naturally using a PEG, by writing medium-priority rules to match them, for example:

  • the rfc3101 "reserved prefixes" (in Rust 2021 and newer): k#abc, f"...", or f'...'
  • unterminated block comments such as /* abc
  • forms that look like floating-point literals with a base indicator, such as 0b1.0

In this model, these additional rules cause the input to be rejected at processing time.

Lookahead

There are two cases where the Reference currently describes the lexer's behaviour using lookahead:

  • for (possibly raw) lifetime-or-label, to prevent 'ab'c' being analysed as 'ab followed by 'c
  • for floating-point literals, to make sure that 1.a is analysed as 1 . a rather than 1. a

These are easily modelled using PEG predicates (though this writeup prefers a reserved form for the former).
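For example, the floating-point case amounts to a negative lookahead after the candidate token; a hedged Rust sketch, approximating XID_Start with is_alphabetic:

// After matching "1.", peek at what follows: if the next character is
// '.', '_', or (approximately) XID_Start, the '.' is not consumed, so
// 1.a lexes as 1 . a while 1.5 and 1. are floats.
fn dot_belongs_to_float(after_dot: &str) -> bool {
    match after_dot.chars().next() {
        Some(c) if c == '.' || c == '_' || c.is_alphabetic() => false,
        _ => true,
    }
}

// dot_belongs_to_float("a") == false   →  1.a  is  1 . a
// dot_belongs_to_float("5") == true    →  1.5  is a float
// dot_belongs_to_float("")  == true    →  1.   is a float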

Handling raw strings

The biggest weakness of using the PEG formalism is that it can't naturally describe the rule for matching the number of # characters in raw string literals.

See Grammar for raw string literals for discussion.

Adopting language changes

Rustc's lexer is made up of hand-written imperative code, largely using match statements. It often peeks ahead for a few characters, and it tries to avoid backtracking.

This is a close fit for the behaviour modelled by PEGs, so there's good reason to suppose that it will be easy to update this model for future versions of Rust.

Modelling lifetimes and labels

Like the Reference, this model has a separate kind of token for lifetime-or-label.

It would be nice to be able to treat them as two fine-grained tokens (' followed by an identifier), like they are treated in procedural macro input, but I think it's impractical.

The main difficulty is dealing with cases like 'r"abc". Rust accepts that as a lifetime-or-label 'r followed by a string literal "abc". A model which treats ' as a complete token would analyse this as ' followed by a raw string literal r"abc". This problem can occur with any prefix (including a reserved prefix).

Producing tokens with attributes

This model makes the lexing process responsible for a certain amount of 'interpretation' of the tokens, rather than simply describing how the input source is sliced up and assigning a 'kind' to each resulting token.

The main motivation for this is to deal with string-like literals: it means we don't need to separate the description of the result of "unescaping" strings from the description of which strings contain well-formed escapes.

In particular, describing unescaping at lexing time makes it easy to describe the rule about rejecting NULs in C-strings, even if they were written using an escape.

For numeric literals, the way the suffix is identified isn't always simple (see Resolving ambiguities above); I think it's best to make the lexer responsible for doing it, so that the description of numeric literal expressions doesn't have to.

For identifiers, many parts of the spec will need a notion of equivalence (both for handling raw identifiers and for dealing with NFC normalisation), and some restrictions depend on the normalised form (see ASCII identifiers). I think it's best for the lexer to handle this by defining the represented ident.

Grammar for raw string literals

I believe the PEG formalism can't naturally describe Rust's rule for matching the number of # characters in raw string literals.

(The same limitations apply to matching the number of - characters in frontmatter fences.)

I can think of the following ways to handle this:

Ad-hoc extension

This writeup uses an ad-hoc extension to the formalism, along similar lines to the stack extension described below (but without needing a full stack).

Pest's stack extension

Pest provides a stack extension which is a good fit for this requirement, and is used in the reimplementation.

It looks like this:

RAW_DQ_REMAINDER = {
    PUSH(HASHES) ~
    "\"" ~ RAW_DQ_CONTENT ~ "\"" ~
    POP ~
    SUFFIX ?
}
RAW_DQ_CONTENT = {
    ( !("\"" ~ PEEK) ~ ANY ) *
}
HASHES = { "#" {0, 255} }

The notion of matching an expression is extended to include a context stack (a stack of strings): each match attempt takes the stack as an additional input and produces an updated stack as an additional output.

The stack is initially empty.

There are three additional forms of expression: PUSH(e) (where e is an arbitrary expression), PEEK, and POP.

PUSH(e) behaves in the same way as the expression e; if it succeeds, it additionally pushes the text consumed by e onto the stack.

PEEK behaves in the same way as a literal string expression, where the string is the top entry of the stack. If the stack is empty, PEEK fails.

POP behaves in the same way as PEEK. If it succeeds, it then pops the top entry from the stack.

All other expressions leave the stack unmodified.
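Outside the formalism, the same discipline is easy to state imperatively. A hedged Rust sketch for raw strings (suffixes and the 255-hash limit omitted):

// `input` starts just after the `r` prefix, e.g. `##"abc"##`.
// Returns the number of bytes in the raw string, or None.
// (Scanning bytes is safe here: `"` and `#` never appear inside a
// UTF-8 multi-byte sequence.)
fn match_raw_dq(input: &str) -> Option<usize> {
    let bytes = input.as_bytes();
    let hashes = bytes.iter().take_while(|&&b| b == b'#').count(); // PUSH
    if bytes.get(hashes) != Some(&b'"') {
        return None;
    }
    let mut i = hashes + 1;
    while i < bytes.len() {
        // PEEK: the body ends at a `"` followed by `hashes` `#`s.
        if bytes[i] == b'"'
            && bytes.len() - i - 1 >= hashes
            && bytes[i + 1..i + 1 + hashes].iter().all(|&b| b == b'#')
        {
            return Some(i + 1 + hashes); // POP
        }
        i += 1;
    }
    None // unterminated
}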

Scheme of definitions

Because raw string literals have a limit of 255 # characters, it is in principle possible to model them using a PEG with 256 (pairs of) definitions.

So writing this out as a "scheme" of definitions is conceivable:

RDQ_0 = {
    "\"" ~ RDQ_0_CONTENT ~ "\"" ~
}
RDQ_0_CONTENT = {
    ( !("\"") ~ ANY ) *
}

RDQ_1 = {
    "#"{1} ~
    "\"" ~ RDQ_1_CONTENT ~ "\"" ~
    "#"{1} ~
}
RDQ_1_CONTENT = {
    ( !("\"" ~ "#"{1}) ~ ANY ) *
}

RDQ_2 = {
    "#"{2} ~
    "\"" ~ RDQ_2_CONTENT ~ "\"" ~
    "#"{2} ~
}
RDQ_2_CONTENT = {
    ( !("\"" ~ "#"{2}) ~ ANY ) *
}

…

RDQ_255 = {
    "#"{255} ~
    "\"" ~ RDQ_255_CONTENT ~ "\"" ~
    "#"{255} ~
}
RDQ_255_CONTENT = {
    ( !("\"" ~ "#"{255}) ~ ANY ) *
}

Open questions

Terminology

Some of the terms used in this document are taken from pre-existing documentation or rustc's error output, but many of them are new (and so can freely be changed).

Here's a partial list:

Term                         Source
processing                   New
fine-grained token           New
compound token               New
literal content              Reference
simple escape                Reference
escape sequence              Reference
escaped value                Reference
string continuation escape   Reference (as STRING_CONTINUE)
string representation        Reference
represented byte             New
represented character        Reference
represented bytes            Reference
represented string           Reference
represented ident            New
style (of a comment)         rustc internal
body (of a comment)          Reference

Raw string literals

How should raw string literals be documented? See Grammar for raw string literals for some options.

Token kinds and attributes

What kinds and attributes should fine-grained tokens have?

Hash count

Should there be an attribute recording the number of hashes in a raw string or byte-string literal? Rustc has something of the sort.

ASCII identifiers

Should there be an attribute indicating whether an identifier is all ASCII? The Reference lists several places where identifiers have this restriction, and it seems natural for the lexer to be responsible for making this check.

The list in the Reference is:

  • extern crate declarations
  • External crate names referenced in a path
  • Module names loaded from the filesystem without a path attribute
  • no_mangle attributed items
  • Item names in external blocks

I believe this restriction is applied after NFC-normalisation, so it's best thought of as a restriction on the represented ident.
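A hedged sketch of that check, using the unicode-normalization crate (which, as noted above, both rustc and the reimplementation use for NFC):

use unicode_normalization::UnicodeNormalization;

// The restriction applies to the represented ident: normalise to NFC
// first, then require every character to be ASCII.
fn is_ascii_ident(source_ident: &str) -> bool {
    source_ident.nfc().all(|c| c.is_ascii())
}

// U+212A KELVIN SIGN normalises to ASCII `K`, so an ident spelled
// with it passes this check even though its source form isn't ASCII:
// is_ascii_ident("\u{212A}elvin") == true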

Represented bytes for C strings

At present this document says that the sequence of "represented bytes" for C string literals doesn't include the added NUL.

That's following the way the Reference currently uses the term "represented bytes", but rustc includes the NUL in its equivalent piece of data.

Should this writeup change to match rustc?

Wording for string unescaping

The description of processing for String literals, Byte-string literals, and C-string literals was originally drafted for the Reference. Should there be a more formal definition of unescaping processes than the current "left-to-right order" and "contributes" wording?

I believe that any literal content which will be accepted can be written uniquely as a sequence of (escape-sequence or non-\-character), but I'm not sure that's obvious enough that it can be stated without justification.
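A hedged sketch of that left-to-right reading (with almost all escape forms elided, and parse_escape as a hypothetical helper):

// Read the literal content left to right: `\` always begins an escape
// sequence and anything else contributes itself, so the decomposition
// into (escape-sequence | non-\ character) is forced at each step.
fn unescape(content: &str) -> Result<String, ()> {
    let mut out = String::new();
    let mut chars = content.chars();
    while let Some(c) = chars.next() {
        if c == '\\' {
            out.push(parse_escape(&mut chars)?);
        } else {
            out.push(c);
        }
    }
    Ok(out)
}

// Hypothetical helper: consumes one escape sequence and returns its
// escaped value (only three simple escapes shown here).
fn parse_escape(chars: &mut std::str::Chars<'_>) -> Result<char, ()> {
    match chars.next() {
        Some('n') => Ok('\n'),
        Some('t') => Ok('\t'),
        Some('\\') => Ok('\\'),
        _ => Err(()),
    }
}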

This is a place where the reimplementation isn't closely parallel to the writeup.

Rustc oddities

NFC normalisation for lifetime/label

Identifiers are normalised to NFC, which means that Kelvin (spelled with U+212A KELVIN SIGN) and Kelvin (spelled with ASCII K) are treated as representing the same identifier. See rfc2457.

But this doesn't happen for lifetimes or labels, so 'Kelvin and 'Kelvin (spelled the same two ways) are different as lifetimes or labels.

For example, a program that spells an identifier both ways compiles without warning in Rust 1.90, while the equivalent program spelling a lifetime both ways doesn't.

In this writeup, the represented ident attribute of Ident and Raw_ident fine-grained tokens is in NFC, and the name attribute of Lifetime_or_label and Raw_lifetime_or_label tokens isn't.
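A hedged sketch of the resulting distinction, again using the unicode-normalization crate:

use unicode_normalization::UnicodeNormalization;

fn main() {
    let ascii = "Kelvin";         // spelled with ASCII `K`
    let kelvin = "\u{212A}elvin"; // spelled with U+212A KELVIN SIGN
    // As identifiers, both have represented ident "Kelvin":
    assert!(ascii.nfc().eq(kelvin.nfc()));
    // But lifetime/label names are compared without normalisation:
    assert_ne!(ascii, kelvin);
}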

I think this behaviour is a promising candidate for provoking the "Wait...that's what we currently do? We should fix that." reaction to being given a spec to review.

Filed as rustc #126759.

Nested block comments

The Reference says "Nested block comments are supported".

Rustc implements this by counting occurrences of /* and */, matching greedily. That means it rejects forms like /* xyz /*/.

This writeup includes a !"/*" subexpression in the BLOCK_COMMENT_CONTENT definition to match rustc's behaviour.
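A hedged sketch of that counting behaviour (not rustc's actual code):

// Count /* and */ greedily, left to right; returns the comment's
// length, or None if it's unterminated. For "/* xyz /*/" the second
// "/*" is consumed as an opener before the final "/" is seen, so the
// comment never closes and the input is rejected.
fn block_comment_len(s: &str) -> Option<usize> {
    let b = s.as_bytes();
    if b.len() < 2 || &b[..2] != b"/*" {
        return None;
    }
    let mut depth = 1;
    let mut i = 2;
    while i + 1 < b.len() {
        if b[i] == b'/' && b[i + 1] == b'*' {
            depth += 1; // opener matched greedily
            i += 2;
        } else if b[i] == b'*' && b[i + 1] == b'/' {
            depth -= 1;
            i += 2;
            if depth == 0 {
                return Some(i);
            }
        } else {
            i += 1;
        }
    }
    None // unterminated
}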

The grammar production in the Reference seems to be written to assume that these forms should be accepted (but I think it's garbled anyway: it accepts /* /* */).

I haven't seen any discussion of whether this rustc behaviour is considered desirable.

String continuation escapes

rustc has a warning that the behaviour of String continuation escapes (when multiple newlines are skipped) may change in future.

The Reference has a note about this, and points to #1042 for more information.

#136600 asks whether this is intentional.