Introduction

This document contains a description of rustc's lexer, which aims to be both correct and verifiable.

It's accompanied by a reimplementation of the lexer in Rust based on that description and a framework for comparing its output to rustc's.

One component of the description is a Parsing Expression Grammar; the reimplementation uses the Pest library to generate the corresponding parser.

Scope

Rust language version

This document describes Rust version 1.90, together with the unstable frontmatter feature (as it behaves in the nightly version of rustc identified below).

That means it describes raw lifetimes/labels and the additional reservations in the 2024 edition, but not:

  • rfc3349 (Mixed UTF-8 literals)
  • pr131656 (allowing more numeric suffixes beginning with e)

Other statements in this document are intended to be true as of September 2025.

The reimplementation is intended to be compiled against (and compared against)
rustc 1.92.0-nightly (caccb4d03 2025-09-24)

Editions

This document describes the editions supported by Rust 1.90:

  • 2015
  • 2018
  • 2021
  • 2024

There are no differences in lexing behaviour between the 2015 and 2018 editions.

In the reimplementation, "2015" is used to refer to the common behaviour of Rust 2015 and Rust 2018.

Accepted input

This description aims to accept input exactly when rustc's lexer would.

In particular, the description of tokenisation aims to model what's accepted as input to a function-like macro (a procedural macro or a by-example macro using the tt fragment specifier).

It's not attempting to accurately model rustc's "reasons" for rejecting input, or to provide enough information to reproduce error messages similar to rustc's.

It's not attempting to describe rustc's "recovery" behaviour (where input which will be reported as an error provides tokens to later stages of the compiler anyway).

Size limits

This description doesn't attempt to characterise rustc's limits on the size of the input as a whole.

As far as I know, rustc has no limits on the size of individual tokens beyond its limits on the input as a whole. But I haven't tried to test this.

Output form

This document only goes as far as describing how to produce a "least common denominator" stream of tokens.

Further writing will be needed to describe how to convert that stream to forms that fit the (differing) needs of the grammar and the macro systems.

In particular, this representation may be unsuitable for direct use by a description of the grammar because:

  • there's no distinction between identifiers and keywords;
  • there's a single "kind" of token for all punctuation;
  • sequences of punctuation such as :: aren't glued together to make a single token.

(The reimplementation includes code to make compound punctuation tokens so they can be compared with rustc's, and to organise them into delimited trees, but those processes aren't described here.)

Licence

This document and the accompanying lexer implementation are released under the terms of both the MIT license and the Apache License (Version 2.0).

Authorship and source access

© Matthew Woodcraft 2024,2025

The source code for this document and the accompanying lexer implementation is available at https://github.com/mattheww/lexeywan

Overview

The following processes might be considered to be part of Rust's lexer:

  • Decode: interpret UTF-8 input as a sequence of Unicode characters
  • Clean:
    • Byte order mark removal
    • CRLF normalisation
    • Shebang removal
    • Frontmatter removal
  • Tokenise: interpret the characters as ("fine-grained") tokens
  • Lower doc-comments: convert doc-comments into attributes
  • Build trees: organise tokens into delimited groups
  • Combine: convert fine-grained tokens to compound tokens (for declarative macros)
  • Prepare proc-macro input: convert fine-grained tokens to the form used for proc-macros
  • Remove whitespace: remove whitespace tokens

This document attempts to completely describe the "Decode", "Clean", "Tokenise", and "Lower doc-comments" processes.

Definitions

Byte

For the purposes of this document, byte means the same thing as Rust's u8 (corresponding to a natural number in the range 0 to 255 inclusive).

Character

For the purposes of this document, character means the same thing as Rust's char. That means, in particular:

  • there's exactly one character for each Unicode scalar value
  • the things that Unicode calls "noncharacters" are characters
  • there are no characters corresponding to surrogate code points

Sequence

When this document refers to a sequence of items, it means a finite, but possibly empty, ordered list of those items.

"character sequence" and "sequence of characters" are different ways of saying the same thing.

NFC normalisation

References to NFC-normalised strings are talking about Unicode's Normalization Form C, defined in Unicode Standard Annex #15.
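For illustration, here is a minimal Rust sketch of NFC normalisation, assuming the unicode-normalization crate (the dependency the reimplementation uses for this, as noted under identcheck below):

use unicode_normalization::UnicodeNormalization;

// For example, "e\u{301}" (e followed by U+0301 COMBINING ACUTE ACCENT)
// normalises to the single character é (U+00E9).
fn nfc_normalise(s: &str) -> String {
    s.nfc().collect()
}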

Parsing Expression Grammars

This document relies on two parsing expression grammars (one for tokenising and one for recognising frontmatter).

Parsing Expression Grammars are described informally in §2 of Ford 2004.

Grammar notation

The notation used in this document is the variant used by the Pest Rust library, so that it's easy to keep in sync with the reimplementation.

In particular:

  • the sequencing operator is written explicitly, as ~
  • the ordered choice operator is |
  • ?, *, and + have their usual senses (as expression suffixes)
  • {0, 255} is a repetition suffix, meaning "from 0 to 255 repetitions"
  • the not-predicate (for negative lookahead) is ! (as an expression prefix)
  • a terminal matching an individual character is written like "x"
  • a terminal matching a sequence of characters is written like "abc"
  • a terminal matching a range of characters is written like '0'..'9'
  • "\"" matches a single " character
  • "\\" matches a single \ character
  • "\n" matches a single LF character

The ordered choice operator | has the lowest precedence, so

a ~ b | c ~ d

is equivalent to

( a ~ b ) | ( c ~ d )

The sequencing operator ~ has the next-lowest precedence, so

!"." ~ SOMETHING

is equivalent to

(!".") ~ SOMETHING

"Any character except" is written using the not-predicate and ANY, for example

( !"'" ~ ANY )

matches any single character except '.

See Grammar for raw string literals for a discussion of extensions used to model raw string literals and frontmatter fences.

Special terminals

The following named terminals are available in all grammars in this document.

Grammar
EOI
ANY
PATTERN_WHITE_SPACE
XID_START
XID_CONTINUE

EOI matches only when the sequence remaining to be matched is empty, without consuming any characters

ANY matches any Unicode character.

PATTERN_WHITE_SPACE matches any character which has the Pattern_White_Space Unicode property. These characters are:

U+0009 (horizontal tab, '\t')
U+000A (line feed, '\n')
U+000B (vertical tab)
U+000C (form feed)
U+000D (carriage return, '\r')
U+0020 (space, ' ')
U+0085 (next line)
U+200E (left-to-right mark)
U+200F (right-to-left mark)
U+2028 (line separator)
U+2029 (paragraph separator)

Note: This set doesn't change in updated Unicode versions.

XID_START matches any character which has the XID_Start Unicode property (as of Unicode 16.0.0).

XID_CONTINUE matches any character which has the XID_Continue Unicode property (as of Unicode 16.0.0).

Processing that happens before tokenising

This document's description of tokenising takes a sequence of characters as input.

That sequence of characters is derived from an input sequence of bytes by performing the steps listed below in order.

It is also possible for one of the steps below to determine that the input should be rejected, in which case tokenising does not take place.

Normally the input sequence of bytes is the contents of a single source file.

Decoding

The input sequence of bytes is interpreted as a sequence of characters represented using the UTF-8 Unicode encoding scheme.

If the input sequence of bytes is not a well-formed UTF-8 code unit sequence, the input is rejected.

Byte order mark removal

If the first character in the sequence is U+FEFF (BYTE ORDER MARK), it is removed.

CRLF normalisation

Each pair of characters CR immediately followed by LF is replaced by a single LF character.

Note: It's not possible for two such pairs to overlap, so this operation is unambiguously defined.

Note: Other occurrences of the character CR are left in place. It's still possible for the sequence CRLF to be passed on to the tokeniser: that will happen if the input contained the sequence CRCRLF.
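The following Rust sketch (an illustration, not the reimplementation's actual code) combines the decoding, byte order mark removal, and CRLF normalisation steps described above:

fn decode_and_clean(input: &[u8]) -> Result<String, std::str::Utf8Error> {
    // Decoding: reject input which isn't a well-formed UTF-8 code unit
    // sequence.
    let text = std::str::from_utf8(input)?;
    // Byte order mark removal: drop a leading U+FEFF, if present.
    let text = text.strip_prefix('\u{FEFF}').unwrap_or(text);
    // CRLF normalisation: each CR immediately followed by LF becomes LF.
    // A lone CR is left in place, so CRCRLF becomes CRLF, as noted above.
    Ok(text.replace("\r\n", "\n"))
}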

Shebang removal

Shebang removal is performed if:

  • the remaining sequence begins with the characters #!; and
  • the result of finding the first non-whitespace token with the characters following the #! as input is not a Punctuation token whose mark is the [ character.

If shebang removal is performed:

  • the characters up to and including the first LF character are removed from the sequence
  • if the sequence did not contain a LF character, all characters are removed from the sequence.

Note: The check for [ prevents an inner attribute at the start of the input being removed. See #70528 and #71487 for history.
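A minimal Rust sketch of the removal step itself; the should_remove parameter stands in for the #!-and-not-[ test above (which needs the tokeniser):

fn remove_shebang(text: &str, should_remove: bool) -> &str {
    if !should_remove {
        return text;
    }
    match text.find('\n') {
        // Remove the characters up to and including the first LF.
        Some(i) => &text[i + 1..],
        // The sequence contained no LF: remove all characters.
        None => "",
    }
}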

Frontmatter removal

Stability: As of Rust 1.90 frontmatter removal is unstable. Under stable rustc 1.90, and under nightly rustc without the frontmatter feature flag, input which would undergo frontmatter removal is rejected.

If the FRONTMATTER nonterminal defined in the frontmatter grammar matches at the start of the remaining sequence, the characters consumed by that match are removed from the sequence.

Otherwise, if the RESERVED nonterminal defined in the frontmatter grammar matches at the start of the remaining sequence, the input is rejected.

The frontmatter grammar is the following Parsing Expression Grammar:

FRONTMATTER = {
    WHITESPACE_ONLY_LINE * ~
    START_LINE ~
    CONTENT_LINE * ~
    END_LINE
}

WHITESPACE_ONLY_LINE = {
    ( !"\n" ~ PATTERN_WHITE_SPACE ) * ~
    "\n"
}

START_LINE = {
    FENCE¹ ~
    HORIZONTAL_WHITESPACE * ~
    ( INFOSTRING ~ HORIZONTAL_WHITESPACE * ) ? ~
    "\n"
}

CONTENT_LINE = {
    !FENCE² ~
    ( !"\n" ~ ANY ) * ~
    "\n"
}

END_LINE = {
    FENCE² ~
    HORIZONTAL_WHITESPACE * ~
    ( "\n" | EOI )
}

FENCE = { "---" ~ "-" * }

INFOSTRING = {
    ( XID_START | "_" ) ~
    ( XID_CONTINUE | "-" | "." ) *
}

HORIZONTAL_WHITESPACE = { " " | "\t" }


RESERVED = {
    PATTERN_WHITE_SPACE * ~
    FENCE
}

These definitions require an extension to the Parsing Expression Grammar formalism: each of the expressions marked as FENCE² fails unless the text it matches is the same as the text matched by the (only) successful match using the expression marked as FENCE¹.

See Grammar for raw string literals for a discussion of alternatives to this extension.

Note: If there are any WHITESPACE_ONLY_LINEs, rustc emits a single whitespace token to represent them. But I think that token isn't observable by Rust programs, so it isn't modelled here.

Tokenising

The tokenisation grammar

The tokenisation grammar is a Parsing Expression Grammar which describes how to divide the input into fine-grained tokens.

The tokenisation grammar isn't strictly a Parsing Expression Grammar. See Grammar for raw string literals

The tokenisation grammar defines a tokens nonterminal and a token nonterminal for each Rust edition:

Edition        Tokens nonterminal   Token nonterminal
2015 or 2018   TOKENS_2015          TOKEN_2015
2021           TOKENS_2021          TOKEN_2021
2024           TOKENS_2024          TOKEN_2024

Their definitions are presented in Token nonterminals below.

Each tokens nonterminal allows any number of repetitions of the corresponding token nonterminal.

Each token nonterminal is defined as a choice expression, each of whose subexpressions is a single nonterminal (a token-kind nonterminal).

The token-kind nonterminals are distinguished in the grammar as having names in Title_case.

The rest of the grammar is presented in the following pages in this section. The definitions of some nonterminals are repeated on multiple pages for convenience. The full grammar is also available on a single page.

The token-kind nonterminals are presented in an order consistent with their appearance in the token nonterminals. That means they appear in priority order (highest priority first).

Tokenisation

Tokenisation takes a character sequence (the input), and either produces a sequence of fine-grained tokens or reports that lexical analysis failed.

The analysis depends on the Rust edition which is in effect when the input is processed.

So strictly speaking, the edition is a second parameter to the process described here.

First, the edition's tokens nonterminal is matched against the input. If it does not succeed and consume the complete input, lexical analysis fails.

Strictly speaking we have to justify the assumption that matches will always either fail or succeed, which basically means observing that the grammar has no left recursion.

Otherwise, the sequence of fine-grained tokens is produced by processing each match of a token-kind nonterminal which participated in the tokens nonterminal's match, as described below.

If any match is rejected, lexical analysis fails.

Processing a token-kind nonterminal match

This operation considers a match of a token-kind nonterminal against part of the input, and either produces a fine-grained token or rejects the match.

The following pages describe how to process a match of each token-kind nonterminal, underneath the presentation of that nonterminal's section of the grammar.

Each description specifies which matches are rejected. For matches which are not rejected, a token is produced whose kind is the name of the token-kind nonterminal. The description specifies the token's attributes.

If for any match the description doesn't either say that the match is rejected or specify a well-defined value for each attribute needed for the token's kind, it's a bug in this writeup.

In these descriptions, notation of the form NTNAME denotes the sequence of characters consumed by the nonterminal named NTNAME which participated in the token-kind nonterminal match.

If this notation is used for a nonterminal which might participate more than once in the match, it's a bug in this writeup.

Finding the first non-whitespace token

This section defines a variant of the tokenisation process which is used in the definition of Shebang removal.

The operation of finding the first non-whitespace token in a character sequence (the input) is:

Match the edition's tokens nonterminal against the input, giving a sequence of matches of token-kind nonterminals.

Consider the sequence of tokens obtained by processing each of those matches, stopping as soon as any match is rejected.

The operation's result is the first token in that sequence which does not represent whitespace, or no token if there is no such token.

For this purpose a token represents whitespace if it is any of:

  • a Whitespace token
  • a Line_comment token whose style is non-doc
  • a Block_comment token whose style is non-doc

Fine-grained tokens

Tokenising produces fine-grained tokens.

Each fine-grained token has a kind, which is the name of one of the token-kind nonterminals. Most kinds of fine-grained token also have attributes, as described in the tables below.

Kind                      Attributes
Whitespace                (none)
Line_comment              style, body
Block_comment             style, body
Punctuation               mark
Ident                     represented ident
Raw_ident                 represented ident
Lifetime_or_label         name
Raw_lifetime_or_label     name
Character_literal         represented character, suffix
Byte_literal              represented byte, suffix
String_literal            represented string, suffix
Raw_string_literal        represented string, suffix
Byte_string_literal       represented bytes, suffix
Raw_byte_string_literal   represented bytes, suffix
C_string_literal          represented bytes, suffix
Raw_c_string_literal      represented bytes, suffix
Integer_literal           base, digits, suffix
Float_literal             body, suffix

Note: Some token-kind nonterminals do not appear in this table. These are the reserved forms, whose matches are always rejected. The names of reserved forms begin with Reserved_ or Unterminated_.

These attributes have the following types:

Attribute               Type
base                    binary / octal / decimal / hexadecimal
body                    sequence of characters
digits                  sequence of characters
mark                    single character
name                    sequence of characters
represented byte        single byte
represented bytes       sequence of bytes
represented character   single character
represented ident       sequence of characters
represented string      sequence of characters
style                   non-doc / inner doc / outer doc
suffix                  sequence of characters

Note: At this stage

  • Both _ and keywords are treated as instances of Ident.
  • There are explicit tokens representing whitespace and comments.
  • Single-character tokens are used for all punctuation.
  • A lifetime (or label) is represented as a single token (which includes the leading ').
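As an illustration, the fine-grained tokens and their attributes could be represented by Rust types like the following (a sketch mirroring the tables above, not the reimplementation's actual definitions):

enum Base { Binary, Octal, Decimal, Hexadecimal }
enum CommentStyle { NonDoc, InnerDoc, OuterDoc }

enum FineGrainedToken {
    Whitespace,
    LineComment { style: CommentStyle, body: String },
    BlockComment { style: CommentStyle, body: String },
    Punctuation { mark: char },
    Ident { represented_ident: String },
    RawIdent { represented_ident: String },
    LifetimeOrLabel { name: String },
    RawLifetimeOrLabel { name: String },
    CharacterLiteral { represented_character: char, suffix: String },
    ByteLiteral { represented_byte: u8, suffix: String },
    StringLiteral { represented_string: String, suffix: String },
    RawStringLiteral { represented_string: String, suffix: String },
    ByteStringLiteral { represented_bytes: Vec<u8>, suffix: String },
    RawByteStringLiteral { represented_bytes: Vec<u8>, suffix: String },
    CStringLiteral { represented_bytes: Vec<u8>, suffix: String },
    RawCStringLiteral { represented_bytes: Vec<u8>, suffix: String },
    IntegerLiteral { base: Base, digits: String, suffix: String },
    FloatLiteral { body: String, suffix: String },
}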

Token nonterminals

As explained above, each of the token nonterminals defined below is an ordered choice of token-kind nonterminals.

Grammar
TOKENS_2015 = { TOKEN_2015 * }
TOKENS_2021 = { TOKEN_2021 * }
TOKENS_2024 = { TOKEN_2024 * }

TOKEN_2015 = {
    Whitespace |
    Line_comment |
    Block_comment |
    Unterminated_block_comment |
    Character_literal |
    Byte_literal |
    String_literal |
    Byte_string_literal |
    Raw_string_literal |
    Raw_byte_string_literal |
    Unterminated_literal_2015 |
    Reserved_single_quoted_literal_2015 |
    Float_literal |
    Reserved_float |
    Integer_literal |
    Lifetime_or_label |
    Raw_ident |
    Reserved_prefix_2015 |
    Ident |
    Punctuation
}

TOKEN_2021 = {
    Whitespace |
    Line_comment |
    Block_comment |
    Unterminated_block_comment |
    Character_literal |
    Byte_literal |
    String_literal |
    Byte_string_literal |
    C_string_literal |
    Raw_string_literal |
    Raw_byte_string_literal |
    Raw_c_string_literal |
    Reserved_literal_2021 |
    Reserved_single_quoted_literal_2021 |
    Float_literal |
    Reserved_float |
    Integer_literal |
    Raw_lifetime_or_label |
    Reserved_lifetime_or_label_prefix |
    Lifetime_or_label |
    Raw_ident |
    Reserved_prefix_2021 |
    Ident |
    Punctuation
}

TOKEN_2024 = {
    Whitespace |
    Line_comment |
    Block_comment |
    Unterminated_block_comment |
    Character_literal |
    Byte_literal |
    String_literal |
    Byte_string_literal |
    C_string_literal |
    Raw_string_literal |
    Raw_byte_string_literal |
    Raw_c_string_literal |
    Reserved_literal_2021 |
    Reserved_single_quoted_literal_2021 |
    Reserved_guard |
    Float_literal |
    Reserved_float |
    Integer_literal |
    Raw_lifetime_or_label |
    Reserved_lifetime_or_label_prefix |
    Lifetime_or_label |
    Raw_ident |
    Reserved_prefix_2021 |
    Ident |
    Punctuation
}

Whitespace and comment tokens

Whitespace

Grammar
Whitespace = { PATTERN_WHITE_SPACE + }

See Special terminals for the definition of PATTERN_WHITE_SPACE.

Attributes

(none)

Rejection

No matches are rejected.

Line comment

Grammar
Line_comment = { "//" ~ LINE_COMMENT_CONTENT }
LINE_COMMENT_CONTENT = { ( !"\n" ~ ANY )* }

Attributes

The token's style and body are determined from LINE_COMMENT_CONTENT as follows:

  • if LINE_COMMENT_CONTENT begins with //:

    • style is non-doc
    • body is empty
  • otherwise, if LINE_COMMENT_CONTENT begins with /,

    • style is outer doc
    • body is the characters from LINE_COMMENT_CONTENT after that /
  • otherwise, if LINE_COMMENT_CONTENT begins with !,

    • style is inner doc
    • body is the characters from LINE_COMMENT_CONTENT after that !
  • otherwise

    • style is non-doc
    • body is empty

Note: The body of a non-doc comment is ignored by the rest of the compilation process
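A minimal Rust sketch of this determination (reusing the CommentStyle type sketched under Fine-grained tokens); content is LINE_COMMENT_CONTENT:

fn classify_line_comment(content: &str) -> (CommentStyle, &str) {
    if content.starts_with("//") {
        // e.g. //// comments are not doc-comments
        (CommentStyle::NonDoc, "")
    } else if let Some(body) = content.strip_prefix('/') {
        (CommentStyle::OuterDoc, body)
    } else if let Some(body) = content.strip_prefix('!') {
        (CommentStyle::InnerDoc, body)
    } else {
        (CommentStyle::NonDoc, "")
    }
}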

Rejection

The match is rejected if the token's body would include a CR character.

Block comment

Grammar
Block_comment = { BLOCK_COMMENT }
BLOCK_COMMENT = { "/*" ~ BLOCK_COMMENT_CONTENT ~ "*/" }
BLOCK_COMMENT_CONTENT = { ( BLOCK_COMMENT | !"*/" ~ !"/*" ~ ANY ) * }

Note: See Nested block comments for discussion of the !"/*" subexpression.

Attributes

The comment content is the sequence of characters consumed by the first (and so the outermost) instance of BLOCK_COMMENT_CONTENT which participated in the match.

The token's style and body are determined from the block comment content as follows:

  • if the comment content begins with **:

    • style is non-doc
    • body is empty
  • otherwise, if the comment content begins with * and contains at least one further character,

    • style is outer doc
    • body is the characters from the comment content after that *
  • otherwise, if the comment content begins with !,

    • style is inner doc
    • body is the characters from the comment content after that !
  • otherwise

    • style is non-doc
    • body is empty

Note: It follows that /**/ and /***/ are not doc-comments

Note: The body of a non-doc comment is ignored by the rest of the compilation process
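A corresponding Rust sketch for block comments; content is the comment content defined above:

fn classify_block_comment(content: &str) -> (CommentStyle, &str) {
    if content.starts_with("**") {
        (CommentStyle::NonDoc, "")
    } else if content.starts_with('*') && content.len() > 1 {
        // At least one further character follows the *.
        (CommentStyle::OuterDoc, &content[1..])
    } else if let Some(body) = content.strip_prefix('!') {
        (CommentStyle::InnerDoc, body)
    } else {
        // This branch covers /**/ (empty content) and /***/ (content "*").
        (CommentStyle::NonDoc, "")
    }
}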

Rejection

The match is rejected if the token's body would include a CR character.

Unterminated block comment

Grammar
Unterminated_block_comment = { "/*" }

Rejection

All matches are rejected.

Note: This definition makes sure that an unterminated block comment isn't accepted as punctuation (* followed by /).

String and byte literal tokens

Single-quoted literals

The following nonterminals are common to the definitions below:

Grammar
SQ_REMAINDER = {
    "'" ~ SQ_CONTENT ~ "'" ~
    SUFFIX ?
}
SQ_CONTENT = {
    "\\" ~ ANY ~ ( !"'" ~ ANY ) * |
    !"'" ~ ANY
}

SUFFIX = { IDENT }
IDENT = { IDENT_START ~ XID_CONTINUE * }
IDENT_START = { XID_START | "_" }

Character literal

Grammar
Character_literal = { SQ_REMAINDER }

Definitions

Define a represented character, derived from SQ_CONTENT as follows:

  • If SQ_CONTENT is the single character LF, CR, or TAB, the match is rejected.

  • If SQ_CONTENT is any other single character, the represented character is that character.

  • If SQ_CONTENT is one of the following forms of escape sequence, the represented character is the escape sequence's escaped value:

    • a simple escape
    • a 7-bit escape
    • a unicode escape

  • Otherwise the match is rejected

Attributes

The token's represented character is the represented character.

The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.

Rejection

The match is rejected if:

  • SQ_CONTENT is unacceptable, as described in the definition of the represented character above; or
  • the token's suffix would consist of the single character _.

Byte literal

Grammar
Byte_literal = { "b" ~ SQ_REMAINDER }

Definitions

Define a represented character, derived from SQ_CONTENT as follows:

  • If SQ_CONTENT is the single character LF, CR, or TAB, the match is rejected.

  • If SQ_CONTENT is a single character with Unicode scalar value greater than 127, the match is rejected.

  • If SQ_CONTENT is any other single character, the represented character is that character.

  • If SQ_CONTENT is one of the following forms of escape sequence, the represented character is the escape sequence's escaped value:

    • a simple escape
    • an 8-bit escape

  • Otherwise the match is rejected

Attributes

The token's represented byte is the represented character's Unicode scalar value. (This is well defined because the definition above ensures that value is less than 256.)

The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.

Rejection

The match is rejected if:

  • SQ_CONTENT is unacceptable, as described in the definition of the represented character above; or
  • the token's suffix would consist of the single character _.

(Non-raw) double-quoted literals

The following nonterminals are common to the definitions below:

Grammar
DQ_REMAINDER = {
    "\"" ~ DQ_CONTENT ~ "\"" ~
    SUFFIX ?
}
DQ_CONTENT = {
    (
        "\\" ~ ANY |
        !"\"" ~ ANY
    ) *
}

SUFFIX = { IDENT }
IDENT = { IDENT_START ~ XID_CONTINUE * }
IDENT_START = { XID_START | "_" }

String literal

Grammar
String_literal = { DQ_REMAINDER }

Attributes

The token's represented string is derived from DQ_CONTENT by replacing each escape sequence of any of the following forms with the escape sequence's escaped value:

  • simple escapes
  • 7-bit escapes
  • unicode escapes
  • string continuation escapes

These replacements take place in left-to-right order. For example, a match against the characters "\\x41" is converted to the characters \ x 4 1.

See Wording for string unescaping
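To illustrate the left-to-right order, here is a partial Rust sketch handling only the simple escapes (the 7-bit, unicode, and string continuation forms are elided); None stands for rejection:

fn unescape_simple_only(content: &str) -> Option<String> {
    let mut out = String::new();
    let mut chars = content.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        out.push(match chars.next()? {
            '0' => '\0',
            't' => '\t',
            'n' => '\n',
            'r' => '\r',
            '"' => '"',
            '\'' => '\'',
            '\\' => '\\',
            // Other escape forms are elided from this sketch; a \ which
            // doesn't begin a recognised escape means the match is rejected.
            _ => return None,
        });
    }
    Some(out)
}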

The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.

Rejection

The match is rejected if:

  • a \ character appears in DQ_CONTENT and is not part of one of the above forms of escape; or
  • a CR character appears in DQ_CONTENT and is not part of a string continuation escape; or
  • the token's suffix would consist of the single character _.

Byte-string literal

Grammar
Byte_string_literal = { "b" ~ DQ_REMAINDER }

Definitions

Define a represented string (a sequence of characters) derived from DQ_CONTENT by replacing each escape sequence of any of the following forms with the escape sequence's escaped value:

  • simple escapes
  • 8-bit escapes
  • string continuation escapes

These replacements take place in left-to-right order. For example, a match against the characters b"\\x41" is converted to the characters \ x 4 1.

See Wording for string unescaping

Attributes

The token's represented bytes are the sequence of Unicode scalar values of the characters in the represented string. (This is well defined because of the first rejection case below.)

The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.

Rejection

The match is rejected if:

  • any character whose unicode scalar value is greater than 127 appears in DQ_CONTENT; or
  • a \ character appears in DQ_CONTENT and is not part of one of the above forms of escape; or
  • a CR character appears in DQ_CONTENT and is not part of a string continuation escape; or
  • the token's suffix would consist of the single character _.

C-string literal

Grammar
C_string_literal = { "c" ~ DQ_REMAINDER }

Attributes

DQ_CONTENT is treated as a sequence of items, each of which is either a single Unicode character other than \ or an escape sequence of one of the following forms: a simple escape, an 8-bit escape, a unicode escape, or a string continuation escape.

The token's represented bytes are derived from that sequence of items in the following way:

  • a single Unicode character contributes its UTF-8 encoding;
  • a simple escape or an 8-bit escape contributes a single byte, whose value is the Unicode scalar value of the escape sequence's escaped value;
  • a unicode escape contributes the UTF-8 encoding of the escape sequence's escaped value;
  • a string continuation escape contributes no bytes.

See Wording for string unescaping

The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.

Rejection

The match is rejected if:

  • a \ character appears in DQ_CONTENT and is not part of one of the above forms of escape; or
  • a CR character appears in DQ_CONTENT and is not part of a string continuation escape; or
  • any of the token's represented bytes would have value 0; or
  • the token's suffix would consist of the single character _.

Raw double-quoted literals

The following nonterminals are common to the definitions below:

Grammar
RAW_DQ_REMAINDER = {
    HASHES¹ ~
    "\"" ~ RAW_DQ_CONTENT ~ "\"" ~
    HASHES² ~
    SUFFIX ?
}
RAW_DQ_CONTENT = {
    ( !("\"" ~ HASHES²) ~ ANY ) *
}
HASHES = { "#" {0, 255} }

SUFFIX = { IDENT }
IDENT = { IDENT_START ~ XID_CONTINUE * }
IDENT_START = { XID_START | "_" }

These definitions require an extension to the Parsing Expression Grammar formalism: each of the expressions marked as HASHES² fails unless the text it matches is the same as the text matched by the (only) successful match using the expression marked as HASHES¹ in the same attempt to match the current token-kind nonterminal.

See Grammar for raw string literals for a discussion of alternatives to this extension.

Raw string literal

Grammar
Raw_string_literal = { "r" ~ RAW_DQ_REMAINDER }

Attributes

The token's represented string is RAW_DQ_CONTENT.

The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.

Rejection

The match is rejected if:

  • a CR character appears in RAW_DQ_CONTENT; or
  • the token's suffix would consist of the single character _.

Raw byte-string literal

Grammar
Raw_byte_string_literal = { "br" ~ RAW_DQ_REMAINDER }

Attributes

The token's represented bytes are the Unicode scalar values of the characters in RAW_DQ_CONTENT. (This is well defined because of the first rejection case below.)

The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.

Rejection

The match is rejected if:

  • any character whose unicode scalar value is greater than 127 appears in RAW_DQ_CONTENT; or
  • a CR character appears in RAW_DQ_CONTENT; or
  • the token's suffix would consist of the single character _.

Raw C-string literal

Grammar
Raw_c_string_literal = { "cr" ~ RAW_DQ_REMAINDER }

Attributes

The token's represented bytes are the UTF-8 encoding of RAW_DQ_CONTENT

The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.

Rejection

The match is rejected if:

  • a CR character appears in RAW_DQ_CONTENT; or
  • any of the token's represented bytes would have value 0; or
  • the token's suffix would consist of the single character _.

Reserved forms

Reserved or unterminated literal

Grammar
Unterminated_literal_2015 = { "r\"" | "br\"" | "b'" }
Reserved_literal_2021 = { IDENT ~ ( "\"" | "'" ) }

Rejection

All matches are rejected.

Note: I believe in the Unterminated_literal_2015 definition only the b' form is strictly needed: if that definition matches using one of the other subexpressions then the input will be rejected eventually anyway (given that the corresponding string literal nonterminal didn't match).

Note: Reserved_literal_2021 catches both reserved forms and unterminated b' literals.

Reserved single-quoted literal

Grammar
Reserved_single_quoted_literal_2015 = { "'" ~ IDENT ~ "'" }
Reserved_single_quoted_literal_2021 = { "'" ~ "r#" ? ~ IDENT ~ "'" }

Rejection

All matches are rejected.

Note: This reservation is to catch forms like 'aaa'bbb, so this definition must come before Lifetime_or_label.

Reserved guard (Rust 2024)

Grammar
Reserved_guard = { "##" | "#\"" }

Rejection

All matches are rejected.

Note: This definition is listed here near the double-quoted string literals because these forms were reserved during discussions about introducing string literals formed like #"…"#.

Escape processing

The descriptions of processing string and character literals make use of several forms of escape.

Each form of escape is characterised by:

  • an escape sequence: a sequence of characters, which always begins with \
  • an escaped value: either a single character or an empty sequence of characters

In the definitions of escapes below:

  • An octal digit is any of the characters in the range 0..=7.
  • A hexadecimal digit is any of the characters in the ranges 0..=9, a..=f, or A..=F.

Simple escapes

Each sequence of characters occurring in the first column of the following table is an escape sequence.

In each case, the escaped value is the character given in the corresponding entry in the second column.

Escape sequence   Escaped value
\0                U+0000 NUL
\t                U+0009 HT
\n                U+000A LF
\r                U+000D CR
\"                U+0022 QUOTATION MARK
\'                U+0027 APOSTROPHE
\\                U+005C REVERSE SOLIDUS

Note: The escaped value therefore has a Unicode scalar value which can be represented in a byte.

8-bit escapes

The escape sequence consists of \x followed by two hexadecimal digits.

The escaped value is the character whose Unicode scalar value is the result of interpreting the final two characters in the escape sequence as a hexadecimal integer, as if by u8::from_str_radix with radix 16.

Note: The escaped value therefore has a Unicode scalar value which can be represented in a byte.

7-bit escapes

The escape sequence consists of \x followed by an octal digit then a hexadecimal digit.

The escaped value is the character whose Unicode scalar value is the result of interpreting the final two characters in the escape sequence as a hexadecimal integer, as if by u8::from_str_radix with radix 16.
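A minimal Rust sketch of computing these escaped values; digits holds the two characters following \x:

fn hex_escape_value(digits: &str, seven_bit_only: bool) -> Option<char> {
    // As if by u8::from_str_radix with radix 16.
    let value = u8::from_str_radix(digits, 16).ok()?;
    // For a 7-bit escape the first digit is octal, which is equivalent to
    // requiring a value of at most 0x7F.
    if seven_bit_only && value > 0x7F {
        return None;
    }
    Some(char::from(value))
}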

Unicode escapes

The escape sequence consists of \u{, followed by a hexadecimal digit, followed by a sequence of characters each of which is a hexadecimal digit or _, followed by }, with the following restrictions:

  • there are no more than six hexadecimal digits in the entire escape sequence; and
  • the result of interpreting the hexadecimal digits contained in the escape sequence as a hexadecimal integer, as if by u32::from_str_radix with radix 16, is a Unicode scalar value.

The escaped value is the character with that Unicode scalar value.
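A minimal Rust sketch of these restrictions; inner holds the characters between \u{ and }:

fn unicode_escape_value(inner: &str) -> Option<char> {
    let digits: String = inner.chars().filter(|&c| c != '_').collect();
    // No more than six hexadecimal digits are permitted.
    if digits.is_empty() || digits.len() > 6 {
        return None;
    }
    // As if by u32::from_str_radix with radix 16.
    let value = u32::from_str_radix(&digits, 16).ok()?;
    // from_u32 returns None when the value isn't a Unicode scalar value
    // (for example, for surrogate values such as 0xD800).
    char::from_u32(value)
}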

String continuation escapes

The escape sequence consists of \ followed immediately by LF, and all following whitespace characters before the next non-whitespace character.

For this purpose, the whitespace characters are HT (U+0009), LF (U+000A), CR (U+000D), and SPACE (U+0020).

The escaped value is an empty sequence of characters.

The Reference says this behaviour may change in future; see String continuation escapes.

Numeric literal tokens

The following nonterminals are common to the definitions below:

Grammar
DECIMAL_DIGITS = { ('0'..'9' | "_") * }
HEXADECIMAL_DIGITS = { ('0'..'9' | 'a' .. 'f' | 'A' .. 'F' | "_") * }
LOW_BASE_TOKEN_DIGITS = { DECIMAL_DIGITS }
DECIMAL_PART = { '0'..'9' ~ DECIMAL_DIGITS }

SUFFIX = { IDENT }
IDENT = { IDENT_START ~ XID_CONTINUE * }
IDENT_START = { XID_START | "_" }

Float literal

Grammar
Float_literal = {
    FLOAT_BODY_WITH_EXPONENT ~ SUFFIX ? |
    FLOAT_BODY_WITHOUT_EXPONENT ~ !("e"|"E") ~ SUFFIX ? |
    FLOAT_BODY_WITH_FINAL_DOT ~ !"." ~ !IDENT_START
}

FLOAT_BODY_WITH_EXPONENT = {
    DECIMAL_PART ~ ("." ~ DECIMAL_PART ) ? ~
    ("e"|"E") ~ ("+"|"-") ? ~ EXPONENT_DIGITS
}
EXPONENT_DIGITS = { "_" * ~ '0'..'9' ~ DECIMAL_DIGITS }

FLOAT_BODY_WITHOUT_EXPONENT = {
    DECIMAL_PART ~ "." ~ DECIMAL_PART
}

FLOAT_BODY_WITH_FINAL_DOT = {
    DECIMAL_PART ~ "."
}

Note: The ! "." subexpression makes sure that forms like 1..2 aren't treated as starting with a float. The ! IDENT_START subexpression makes sure that forms like 1.some_method() aren't treated as starting with a float.

Attributes

The token's body is FLOAT_BODY_WITH_EXPONENT, FLOAT_BODY_WITHOUT_EXPONENT, or FLOAT_BODY_WITH_FINAL_DOT, whichever one participated in the match.

The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.

Rejection

No matches are rejected.

Reserved float

Grammar
Reserved_float = {
    RESERVED_FLOAT_EMPTY_EXPONENT | RESERVED_FLOAT_BASED
}
RESERVED_FLOAT_EMPTY_EXPONENT = {
    DECIMAL_PART ~ ("." ~ DECIMAL_PART ) ? ~
    ("e"|"E") ~ ("+"|"-") ?
}
RESERVED_FLOAT_BASED = {
    (
        ("0b" | "0o") ~ LOW_BASE_TOKEN_DIGITS |
        "0x" ~ HEXADECIMAL_DIGITS
    )  ~  (
        ("e"|"E") |
        "." ~ !"." ~ !IDENT_START
    )
}

Rejection

All matches are rejected.

Integer literal

Grammar
Integer_literal = {
    ( INTEGER_BINARY_LITERAL |
      INTEGER_OCTAL_LITERAL |
      INTEGER_HEXADECIMAL_LITERAL |
      INTEGER_DECIMAL_LITERAL ) ~
    SUFFIX_NO_E ?
}

INTEGER_BINARY_LITERAL = { "0b" ~ LOW_BASE_TOKEN_DIGITS }
INTEGER_OCTAL_LITERAL = { "0o" ~ LOW_BASE_TOKEN_DIGITS }
INTEGER_HEXADECIMAL_LITERAL = { "0x" ~ HEXADECIMAL_DIGITS }
INTEGER_DECIMAL_LITERAL = { DECIMAL_PART }

SUFFIX_NO_E = { !("e"|"E") ~ SUFFIX }

Note: See rfc0879 for the reason we accept all decimal digits in binary and octal tokens; the inappropriate digits cause the token to be rejected.

Note: The INTEGER_DECIMAL_LITERAL nonterminal is listed last in the Integer_literal definition in order to resolve ambiguous cases like the following:

  • 0b1e2 (which isn't 0 with suffix b1e2)
  • 0b0123 (which is rejected, not accepted as 0 with suffix b0123)
  • 0xy (which is rejected, not accepted as 0 with suffix xy)
  • 0x· (which is rejected, not accepted as 0 with suffix x·)

Attributes

The token's base is looked up in the following table, depending on which nonterminal participated in the match:

INTEGER_BINARY_LITERAL        binary
INTEGER_OCTAL_LITERAL         octal
INTEGER_HEXADECIMAL_LITERAL   hexadecimal
INTEGER_DECIMAL_LITERAL       decimal

The token's digits are LOW_BASE_TOKEN_DIGITS, HEXADECIMAL_DIGITS, or DECIMAL_PART, whichever one participated in the match.

The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.

Rejection

The match is rejected if:

  • the token's digits would consist entirely of _ characters; or
  • the token's base would be binary and its digits would contain any character other than 0, 1, or _; or
  • the token's base would be octal and its digits would contain any character other than 0, 1, 2, 3, 4, 5, 6, 7, or _.

Note: In particular, a match which would make an Integer_literal with empty digits is rejected.
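A minimal Rust sketch of these digit checks; digits and base are the token's attributes, with the base given as 2, 8, 10, or 16:

fn digits_acceptable(digits: &str, base: u32) -> bool {
    // Reject digits consisting entirely of _ characters (this also covers
    // the empty-digits case mentioned in the note above).
    if !digits.chars().any(|c| c != '_') {
        return false;
    }
    // For binary and octal tokens the grammar accepts any decimal digit,
    // so out-of-range digits (as in 0b123) must be rejected here.
    digits.chars().all(|c| c == '_' || c.is_digit(base))
}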

Ident, lifetime, and label tokens

This writeup uses the term ident to refer to a token that lexically has the form of an identifier, including keywords and _.

Note: the procedural macro system uses the name Ident to refer to what this writeup calls Ident and Raw_ident.

The following nonterminals are common to the definitions below:

Grammar
IDENT = { IDENT_START ~ XID_CONTINUE * }
IDENT_START = { XID_START | "_" }

Note: This is following the specification in Unicode Standard Annex #31 for Unicode version 16.0, with the addition of permitting underscore as the first character.

See Special terminals for the definitions of XID_START and XID_CONTINUE.

Raw lifetime or label (Rust 2021 and 2024)

Grammar
Raw_lifetime_or_label = { "'r#" ~ IDENT }

Attributes

The token's name is IDENT.

Note that the name is not NFC-normalised. See NFC normalisation for lifetime/label.

Rejection

The match is rejected if IDENT is one of the following sequences of characters:

  • _
  • crate
  • self
  • super
  • Self

Reserved lifetime or label prefix (Rust 2021 and 2024)

Grammar
Reserved_lifetime_or_label_prefix = { "'" ~ IDENT ~ "#" }

Rejection

All matches are rejected.

(Non-raw) lifetime or label

Grammar
Lifetime_or_label = { "'" ~ IDENT }

Note: The Reserved_single_quoted_literal definitions make sure that forms like 'aaa'bbb are not accepted.

See Modelling lifetimes and labels for a discussion of why this model doesn't simply treat ' as punctuation.

Attributes

The token's name is IDENT.

Note that the name is not NFC-normalised. See NFC normalisation for lifetime/label.

Rejection

No matches are rejected.

Raw ident

Grammar
Raw_ident = { "r#" ~ IDENT }

Attributes

The token's represented ident is the NFC-normalised form of IDENT.

Rejection

The match is rejected if the token's represented ident would be one of the following sequences of characters:

  • _
  • crate
  • self
  • super
  • Self

Reserved prefix

Grammar
Reserved_prefix_2015 = { "r#" | "br#" }
Reserved_prefix_2021 = { IDENT ~ "#" }

Rejection

All matches are rejected.

Note: This definition must appear here in priority order. Tokens added in future which match these reserved forms wouldn't necessarily be forms of identifier.

(Non-raw) ident

Grammar
Ident = { IDENT }

Note: The Reference adds the following when discussing identifiers: "Zero width non-joiner (ZWNJ U+200C) and zero width joiner (ZWJ U+200D) characters are not allowed in identifiers." Those characters don't have XID_Start or XID_Continue, so that's only informative text, not an additional constraint.

Attributes

The token's represented ident is the NFC-normalised form of IDENT

Rejection

No matches are rejected.

Punctuation tokens

Punctuation

Grammar
Punctuation = {
    ";" |
    "," |
    "." |
    "(" |
    ")" |
    "{" |
    "}" |
    "[" |
    "]" |
    "@" |
    "#" |
    "~" |
    "?" |
    ":" |
    "$" |
    "=" |
    "!" |
    "<" |
    ">" |
    "-" |
    "&" |
    "|" |
    "+" |
    "*" |
    "/" |
    "^" |
    "%"
}

Attributes

The token's mark is the single character consumed by the match.

Rejection

No matches are rejected.

Lowering doc-comments

This phase of processing converts an input sequence of fine-grained tokens to a new sequence of fine-grained tokens.

The new sequence is the same as the input sequence, except that each Line_comment or Block_comment token whose style is inner doc or outer doc is replaced with the following sequence:

  • Punctuation with mark #
  • Whitespace
  • Punctuation with mark ! (omitted if the comment token's style is outer doc)
  • Punctuation with mark [
  • Ident with represented ident doc
  • Punctuation with mark =
  • Whitespace
  • Raw_string_literal with the comment token's body as the represented string and empty suffix
  • Punctuation with mark ]

Note: the whitespace tokens aren't observable by anything currently described in this writeup, but they explain the spacing in the tokens that proc-macros see.
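For example, the outer doc-comment /// Adds one. (whose body is " Adds one.") is replaced by the same sequence of tokens as

# [doc = r" Adds one."]

and the inner doc-comment //! Adds one. by the same sequence as

# ![doc = r" Adds one."]

(The r"…" here stands for the Raw_string_literal token; a body containing a " character would need a form with hashes, which this illustration glosses over.)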

Machine-readable frontmatter grammar

The machine-readable Pest grammar for frontmatter is presented here for convenience.

See Parsing Expression Grammars for an explanation of the notation.

This version of the grammar uses Pest's PUSH, PEEK, and POP for the matched fences.

ANY, EOI, PATTERN_WHITE_SPACE, XID_START, and XID_CONTINUE are built in to Pest and so not defined below.

FRONTMATTER = {
    WHITESPACE_ONLY_LINE * ~
    START_LINE ~
    CONTENT_LINE * ~
    END_LINE
}

WHITESPACE_ONLY_LINE = {
    ( !"\n" ~ PATTERN_WHITE_SPACE ) * ~
    "\n"
}

START_LINE = {
    PUSH(FENCE) ~
    HORIZONTAL_WHITESPACE * ~
    ( INFOSTRING ~ HORIZONTAL_WHITESPACE * ) ? ~
    "\n"
}

CONTENT_LINE = {
    !PEEK ~
    ( !"\n" ~ ANY ) * ~
    "\n"
}

END_LINE = {
    POP ~
    HORIZONTAL_WHITESPACE * ~
    ( "\n" | EOI )
}

FENCE = { "---" ~ "-" * }

INFOSTRING = {
    ( XID_START | "_" ) ~
    ( XID_CONTINUE | "-" | "." ) *
}

HORIZONTAL_WHITESPACE = { " " | "\t" }


RESERVED = {
    PATTERN_WHITE_SPACE * ~
    FENCE
}

The complete tokenisation grammar

The machine-readable Pest grammar for tokenisation is presented here for convenience.

See Parsing Expression Grammars for an explanation of the notation.

This version of the grammar uses Pest's PUSH, PEEK, and POP for the Raw_double_quoted_literal definitions.

ANY, PATTERN_WHITE_SPACE, XID_START, and XID_CONTINUE are built in to Pest and so not defined below.

TOKENS_2015 = { TOKEN_2015 * }
TOKENS_2021 = { TOKEN_2021 * }
TOKENS_2024 = { TOKEN_2024 * }

TOKEN_2015 = {
    Whitespace |
    Line_comment |
    Block_comment |
    Unterminated_block_comment |
    Character_literal |
    Byte_literal |
    String_literal |
    Byte_string_literal |
    Raw_string_literal |
    Raw_byte_string_literal |
    Unterminated_literal_2015 |
    Reserved_single_quoted_literal_2015 |
    Float_literal |
    Reserved_float |
    Integer_literal |
    Lifetime_or_label |
    Raw_ident |
    Reserved_prefix_2015 |
    Ident |
    Punctuation
}

TOKEN_2021 = {
    Whitespace |
    Line_comment |
    Block_comment |
    Unterminated_block_comment |
    Character_literal |
    Byte_literal |
    String_literal |
    Byte_string_literal |
    C_string_literal |
    Raw_string_literal |
    Raw_byte_string_literal |
    Raw_c_string_literal |
    Reserved_literal_2021 |
    Reserved_single_quoted_literal_2021 |
    Float_literal |
    Reserved_float |
    Integer_literal |
    Raw_lifetime_or_label |
    Reserved_lifetime_or_label_prefix |
    Lifetime_or_label |
    Raw_ident |
    Reserved_prefix_2021 |
    Ident |
    Punctuation
}

TOKEN_2024 = {
    Whitespace |
    Line_comment |
    Block_comment |
    Unterminated_block_comment |
    Character_literal |
    Byte_literal |
    String_literal |
    Byte_string_literal |
    C_string_literal |
    Raw_string_literal |
    Raw_byte_string_literal |
    Raw_c_string_literal |
    Reserved_literal_2021 |
    Reserved_single_quoted_literal_2021 |
    Reserved_guard |
    Float_literal |
    Reserved_float |
    Integer_literal |
    Raw_lifetime_or_label |
    Reserved_lifetime_or_label_prefix |
    Lifetime_or_label |
    Raw_ident |
    Reserved_prefix_2021 |
    Ident |
    Punctuation
}


Whitespace = { PATTERN_WHITE_SPACE + }

Line_comment = { "//" ~ LINE_COMMENT_CONTENT }
LINE_COMMENT_CONTENT = { ( !"\n" ~ ANY )* }

Block_comment = { BLOCK_COMMENT }
BLOCK_COMMENT = { "/*" ~ BLOCK_COMMENT_CONTENT ~ "*/" }
BLOCK_COMMENT_CONTENT = { ( BLOCK_COMMENT | !"*/" ~ !"/*" ~ ANY ) * }

Unterminated_block_comment = { "/*" }


SQ_REMAINDER = {
    "'" ~ SQ_CONTENT ~ "'" ~
    SUFFIX ?
}
SQ_CONTENT = {
    "\\" ~ ANY ~ ( !"'" ~ ANY ) * |
    !"'" ~ ANY
}

Character_literal = { SQ_REMAINDER }

Byte_literal = { "b" ~ SQ_REMAINDER }


DQ_REMAINDER = {
    "\"" ~ DQ_CONTENT ~ "\"" ~
    SUFFIX ?
}
DQ_CONTENT = {
    (
        "\\" ~ ANY |
        !"\"" ~ ANY
    ) *
}

String_literal = { DQ_REMAINDER }

Byte_string_literal = { "b" ~ DQ_REMAINDER }

C_string_literal = { "c" ~ DQ_REMAINDER }



RAW_DQ_REMAINDER = {
    PUSH(HASHES) ~
    "\"" ~ RAW_DQ_CONTENT ~ "\"" ~
    POP ~
    SUFFIX ?
}
RAW_DQ_CONTENT = {
    ( !("\"" ~ PEEK) ~ ANY ) *
}
HASHES = { "#" {0, 255} }

Raw_string_literal = { "r" ~ RAW_DQ_REMAINDER }

Raw_byte_string_literal = { "br" ~ RAW_DQ_REMAINDER }

Raw_c_string_literal = { "cr" ~ RAW_DQ_REMAINDER }


Unterminated_literal_2015 = { "r\"" | "br\"" | "b'" }
Reserved_literal_2021 = { IDENT ~ ( "\"" | "'" ) }

Reserved_single_quoted_literal_2015 = { "'" ~ IDENT ~ "'" }
Reserved_single_quoted_literal_2021 = { "'" ~ "r#" ? ~ IDENT ~ "'" }


Reserved_guard = { "##" | "#\"" }


DECIMAL_DIGITS = { ('0'..'9' | "_") * }
HEXADECIMAL_DIGITS = { ('0'..'9' | 'a' .. 'f' | 'A' .. 'F' | "_") * }
LOW_BASE_TOKEN_DIGITS = { DECIMAL_DIGITS }
DECIMAL_PART = { '0'..'9' ~ DECIMAL_DIGITS }


Float_literal = {
    FLOAT_BODY_WITH_EXPONENT ~ SUFFIX ? |
    FLOAT_BODY_WITHOUT_EXPONENT ~ !("e"|"E") ~ SUFFIX ? |
    FLOAT_BODY_WITH_FINAL_DOT ~ !"." ~ !IDENT_START
}

FLOAT_BODY_WITH_EXPONENT = {
    DECIMAL_PART ~ ("." ~ DECIMAL_PART ) ? ~
    ("e"|"E") ~ ("+"|"-") ? ~ EXPONENT_DIGITS
}
EXPONENT_DIGITS = { "_" * ~ '0'..'9' ~ DECIMAL_DIGITS }

FLOAT_BODY_WITHOUT_EXPONENT = {
    DECIMAL_PART ~ "." ~ DECIMAL_PART
}

FLOAT_BODY_WITH_FINAL_DOT = {
    DECIMAL_PART ~ "."
}

Reserved_float = {
    RESERVED_FLOAT_EMPTY_EXPONENT | RESERVED_FLOAT_BASED
}
RESERVED_FLOAT_EMPTY_EXPONENT = {
    DECIMAL_PART ~ ("." ~ DECIMAL_PART ) ? ~
    ("e"|"E") ~ ("+"|"-") ?
}
RESERVED_FLOAT_BASED = {
    (
        ("0b" | "0o") ~ LOW_BASE_TOKEN_DIGITS |
        "0x" ~ HEXADECIMAL_DIGITS
    )  ~  (
        ("e"|"E") |
        "." ~ !"." ~ !IDENT_START
    )
}


Integer_literal = {
    ( INTEGER_BINARY_LITERAL |
      INTEGER_OCTAL_LITERAL |
      INTEGER_HEXADECIMAL_LITERAL |
      INTEGER_DECIMAL_LITERAL ) ~
    SUFFIX_NO_E ?
}

INTEGER_BINARY_LITERAL = { "0b" ~ LOW_BASE_TOKEN_DIGITS }
INTEGER_OCTAL_LITERAL = { "0o" ~ LOW_BASE_TOKEN_DIGITS }
INTEGER_HEXADECIMAL_LITERAL = { "0x" ~ HEXADECIMAL_DIGITS }
INTEGER_DECIMAL_LITERAL = { DECIMAL_PART }

SUFFIX_NO_E = { !("e"|"E") ~ SUFFIX }


Raw_lifetime_or_label = { "'r#" ~ IDENT }

Reserved_lifetime_or_label_prefix = { "'" ~ IDENT ~ "#" }

Lifetime_or_label = { "'" ~ IDENT }

Raw_ident = { "r#" ~ IDENT }

Reserved_prefix_2015 = { "r#" | "br#" }
Reserved_prefix_2021 = { IDENT ~ "#" }

Ident = { IDENT }


Punctuation = {
    ";" |
    "," |
    "." |
    "(" |
    ")" |
    "{" |
    "}" |
    "[" |
    "]" |
    "@" |
    "#" |
    "~" |
    "?" |
    ":" |
    "$" |
    "=" |
    "!" |
    "<" |
    ">" |
    "-" |
    "&" |
    "|" |
    "+" |
    "*" |
    "/" |
    "^" |
    "%"
}


SUFFIX = { IDENT }
IDENT = { IDENT_START ~ XID_CONTINUE * }
IDENT_START = { XID_START | "_" }

Command-line interface for the reimplementation

The repository containing this writeup also contains a binary crate which contains the reimplementation and a command line program for comparing the reimplementation against rustc (linking against the rustc implementation via rustc_private).

Run it in the usual way, from a working copy:

cargo run -- <subcommand> [options]

Note the repository includes a rust-toolchain.toml file which will cause cargo run to install the required nightly version of rustc.

Summary usage

Usage: lexeywan [<subcommand>] [...options]

Subcommands:
 *test          [suite-opts]
  compare       [suite-opts] [comparison-opts] [dialect-opts]
  decl-compare  [suite-opts] [comparison-opts] [--edition=2015|2021|*2024]
  inspect       [suite-opts] [dialect-opts]
  coarse        [suite-opts] [dialect-opts]
  identcheck
  proptest      [--count] [--strategy=<name>] [--print-failures|--print-all]
                [dialect-opts]

suite-opts (specify at most one):
  --short: run the SHORTLIST rather than the LONGLIST
  --xfail: run the tests which are expected to fail

dialect-opts:
  --edition=2015|2021|*2024
  --cleaning=none|*shebang|shebang-and-frontmatter
  --lower-doc-comments

comparison-opts:
  --failures-only: don't report cases where the lexers agree
  --details=always|*failures|never

* -- default

Subcommands

The following subcommands are available:

test

This is the main way to check the whole system for disagreements.

test is the default subcommand.

For each known edition, it runs the following for the requested testcases:

  • comparison of rustc_parse's lexer and the reimplementation, like for compare with options
    • --cleaning=none
    • --cleaning=shebang
    • --cleaning=shebang-and-frontmatter
    • --cleaning=shebang --lower-doc-comments
  • comparison via declarative macros, like for decl-compare

For each comparison it reports whether the implementations agreed for all testcases, without further detail.

compare

This is the main way to run the testsuite and see results for individual testcases.

It analyses the requested testcases with the rustc_parse lexer and the reimplementation, and compares the output.

The analysis uses a single dialect.

For each testcase, the comparison agrees if either:

  • both implementations accept the input and produce the same forest of regularised tokens; or
  • both implementations reject the input.

Otherwise the comparison disagrees.

See regular_tokens.rs for what regularisation involves.

The comparison may also mark a testcase as a model error. This happens if rustc panics or the reimplementation reports an internal error.

Example output
‼ R:✓ L:✗ «//comment»
✔ R:✓ L:✓ «'label»

Here, the first line says that rustc (R) accepted the input //comment but the reimplementation (L) rejected it. The initial ‼ indicates the disagreement.

The second line says that both rustc and the reimplementation accepted the input 'label. The initial ✔ indicates the agreement.

Output control

By default compare prints a line (of the sort shown in the example above) for each testcase. Pass --failures-only to only print lines for the cases where the implementations disagree.

The compare subcommand can also report further detail for a testcase:

  • if the input is accepted, the forest of regularised tokens
  • if the input is rejected, the rustc error message or the reimplementation's reason for rejection

Further detail is controlled as follows:

--details=always               Report detail for all testcases
--details=failures (default)   Report detail for disagreeing testcases
--details=never                Report detail for no testcases

inspect

This shows more detail than compare, but doesn't report on agreement.

Analyses the requested testcases using the rustc_parse lexer and the reimplementation, and prints each lexer's analysis.

Uses a specified dialect.

Unlike compare, this shows the tokens before regularisation.

For the reimplementation, it shows details about what the grammar matched, and fine-grained tokens.

If rustc rejects the input (and the rejection wasn't a fatal error), it reports the tokens rustc would have passed on to the parser.

If the reimplementation rejects the input, it reports what has been tokenised so far. If the rejection comes from processing, it describes the rejected match and reports any matches and fine-grained tokens from before the rejection.

coarse

This shows the reimplementation's coarse-grained tokens.

Analyses the requested testcases using the reimplementation only, including combination of fine-grained tokens into coarse-grained tokens, and prints a representation of the analysis.

Uses a specified dialect.

decl-compare

This is a second way to test the observable behaviour of Rust's lexer, which doesn't depend so much on rustc_parse's internals.

It analyses the requested testcases via declarative macros, and compares the result to what the reimplementation expects.

The analysis works by defining a macro using the tt fragment specifier which applies stringify! to each parameter, embedding the testcase in an invocation of that macro, running rustc's macro expander and parser, and inspecting the results.

See the comments in decl_via_rustc.rs for details.

The reimplementation includes a model of what stringify! does.

It uses the selected edition. Doc-comments are always lowered. Of the steps described in Processing that happens before tokenising, only CRLF normalisation is performed.

The --details and --failures-only options work in the same way as for compare; "details" shows the stringified form of each token.

identcheck

This checks that the rustc_parse lexer and the reimplementation agree which characters are permitted in identifiers.

For each Unicode character C this constructs a string containing C aC, and checks the reimplementation and rustc_parse agree on its analysis.

It reports the number of agreements and disagreements.

It will notice if the Unicode version changes in one of the implementations (rustc's Unicode version comes from its unicode-xid dependency, and the reimplementation's comes from its pest dependency).

It won't notice if the Unicode version used for NFC normalisation is out of sync (for both the reimplementation and rustc, this comes from the unicode-normalization dependency).

identcheck always uses the latest available edition.

proptest

This performs randomised testing.

It uses proptest to generate random strings, analyses them with the rustc_parse lexer and the reimplementation, and compares the output. The analysis and comparison is the same as for compare above, for a specified dialect.

If this process finds a string which results in disagreement (or a model error), proptest simplifies it as much as it can while still provoking the problem, and then testing stops.

The --count argument specifies how many strings to generate (the default is 5000).

The --strategy argument specifies how to generate the strings. See SIMPLE_STRATEGIES in strategies.rs for the list of available strategies. The mix strategy is the default.

Output control

By default proptest prints a single reduced disagreement, if it finds any.

If --print-all is passed it prints each string it generates.

If --print-failures is passed it prints each disagreeing testcase it generates, so you can see the simplification process.

Dialects

The compare, inspect, coarse, and proptest subcommands accept the following options to control the lexers' behaviour:

  • --edition=2015|2021|2024
  • --cleaning=none|shebang|shebang-and-frontmatter
  • --lower-doc-comments

The decl-compare subcommand accepts only --edition.

The options apply both to rustc and the reimplementation.

--edition controls which Rust edition's lexing semantics are used. It defaults to the most recent known edition. There's no 2018 option because there were no lexing changes between the 2015 and 2018 editions.

--cleaning controls which of the steps described in Processing that happens before tokenising are performed. It defaults to shebang. Byte order mark removal and CRLF normalisation are always performed. (The reimplementation doesn't model the "Decoding" step, because the hard-coded testcases are provided as Rust string literals and so are already UTF-8.)

If --lower-doc-comments is passed, doc-comments are converted to attributes as described in Lowering doc-comments.

Choosing the testcases to run

By default, subcommands which need a list of testcases use the list hard-coded as LONGLIST in testcases.rs.

Pass --short to use the list hard-coded as SHORTLIST instead.

Pass --xfail to use the list hard-coded as XFAIL instead. This list includes testcases which are expected to fail or disagree with at least one subcommand and set of options.

Exit status

Each subcommand which compares the reimplementation to rustc reports exit status 0 if all comparisons agreed, or exit status 3 if any comparison disagreed or any model errors were observed.

For all subcommands, exit status 101 indicates an unhandled error.

Rationale for this model

Separating lexing from parsing

This model assumes that we want to define a process of tokenisation, turning a sequence of characters into a sequence of tokens, and that there would be a separate grammar using those tokens as its terminals.

The alternative is to use a single "scannerless" grammar, which combines lexical analysis and parsing into a single process.

Rust's parser is not easy to model, and I think it's likely that the best formalism for describing Rust's parser wouldn't also be a good formalism for describing its lexer.

Rejecting matches

The separate 'processing' step which can reject matches is primarily a way to make the grammar simpler.

The main advantage is dealing with character, byte, and string literals, where we have to reject invalid escape sequences at lexing time.

In this model, the lexer finds the extent of the token using simple grammar definitions, and then checks whether the escape sequences are valid in a separate "processing" operation. So the grammar "knows" that a backslash character indicates an escape sequence, but doesn't model escapes in any further detail.

In contrast, the Reference gives grammar productions which try to describe the available escape sequences in each kind of string-like literal, but this isn't enough to characterise which forms are accepted (for example "\u{d800}" is rejected at lexing time, because there is no character with scalar value D800).
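That particular check falls out directly from Rust's char type; a minimal sketch (not the reimplementation's actual code):

// Given the hex digits between the braces of \u{...}, return the
// escaped value, or None if the escape must be rejected.
fn check_unicode_escape(hex_digits: &str) -> Option<char> {
    let scalar = u32::from_str_radix(hex_digits, 16).ok()?;
    // 0xD800..=0xDFFF are surrogates: no char has such a scalar value,
    // so char::from_u32 returns None and the token is rejected.
    char::from_u32(scalar)
}

// check_unicode_escape("d800") == None       →  "\u{d800}" is rejected
// check_unicode_escape("2764") == Some('❤')  →  "\u{2764}" is accepted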

Given that we have this separate operation, we can use it to simplify other parts of the grammar too, including:

  • distinguishing doc-comments
  • rejecting CR in comments
  • rejecting the reserved keywords in raw identifiers, eg r#crate
  • rejecting no-digit forms like 0x_
  • rejecting the variants of numeric literals reserved in rfc0879, eg 0b123
  • rejecting literals with a single _ as a suffix

This means we can avoid adding many "reserved form" definitions. For example, if we didn't accept _ as a suffix in the main string-literal grammar definitions, we'd have to have another Reserved_something definition to prevent the _ being accepted as a separate token.

Given the choice to use locally greedy matching (see below), I think an operation which rejects tokens after parsing them is necessary to deal with a case like 0b123, to avoid analysing it as 0b1 followed by 23.
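A hedged sketch of such a processing check, assuming the grammar has already matched the token's full extent:

// After the grammar has greedily matched 0b followed by decimal
// digits, processing checks that every digit is valid in base 2.
fn check_binary_digits(digits: &str) -> Result<(), ()> {
    if digits.chars().all(|c| matches!(c, '0' | '1' | '_')) {
        Ok(())
    } else {
        Err(()) // "123" is rejected outright, not re-lexed as 0b1 then 23
    }
}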

Using a Parsing Expression Grammar

I think a PEG is a good formalism for modelling Rust's lexer (though probably not for the rest of the grammar) for several reasons.

Resolving ambiguities

The lexical part of Rust's grammar is necessarily full of ambiguities.

For example:

  • ab could be a single identifier, or a followed by b
  • 1.2 could be a floating-point literal, or 1 followed by . followed by 2
  • r"x" could be a raw string literal, or r followed by "x"

The Reference doesn't explicitly state what formalism it's using, or what rule it uses to disambiguate such cases.

There are two common approaches: to choose the longest available match (as Lex traditionally does), or to explicitly list rules in priority order and specify locally "greedy" repetition-matching (as PEGs do).

With the "longest available" approach, additional rules are needed if multiple rules can match with equal length.

The 2024 version of this model characterised the cases where rustc doesn't choose the longest available match, and where (given its choice of rules) there are multiple longest-matching rules.

For example, the Reference's lexer rules for input such as 0x3 allow two interpretations, matching the same extent:

  • as a hexadecimal integer literal: 0x3 with no suffix
  • as a decimal integer literal: 0 with a suffix of x3

We want to choose the former interpretation. (We could say that these are both the same kind of token and re-analyse it later to decide which part was the suffix, but we'd still have to understand the distinction inside the lexer in order to reject forms like 0b0123.)

Examples where rustc chooses a token which is shorter than the longest available match are rare. In the model used by the Reference, 0x· is one: rustc treats this as a "reserved number" (0x), rather than 0 with suffix x·. (Note that · has the XID_Continue property but not XID_Start.)

I think that in 2025 it's clear that a priority-based system is a better fit for Rust's lexer. In particular, if pr131656 is accepted (allowing forms like 1em), the new ambiguities will be resolved naturally because the floating-point literal definitions have priority over the integer literal definitions.

Generative grammars don't inherently have prioritisation, but parsing expression grammars do.
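As an illustration, here's a hedged Rust sketch of PEG-style ordered choice (the rule functions are simplified stand-ins, not this writeup's definitions):

// Each "rule" returns how many characters it consumed, or None.
fn hex_literal(s: &str) -> Option<usize> {
    let digits = s.strip_prefix("0x")?;
    Some(2 + digits.chars().take_while(char::is_ascii_hexdigit).count())
}

fn dec_literal(s: &str) -> Option<usize> {
    let n = s.chars().take_while(char::is_ascii_digit).count();
    (n > 0).then_some(n)
}

// Trying the higher-priority rule first makes "0x3" a single
// hexadecimal literal; the decimal rule (which alone would match just
// "0", leaving "x3" as a suffix) is only tried if the hex rule fails.
fn number(s: &str) -> Option<usize> {
    hex_literal(s).or_else(|| dec_literal(s))
}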

Ambiguities that must be resolved as errors

There are a number of forms which are errors at lexing time, even though in principle they could be analysed as multiple tokens.

Many cases can be handled in processing, as described above.

Other cases can be handled naturally using a PEG, by writing medium-priority rules to match them, for example:

  • the rfc3101 "reserved prefixes" (in Rust 2021 and newer): k#abc, f"...", or f'...'
  • unterminated block comments such as /* abc
  • forms that look like floating-point literals with a base indicator, such as 0b1.0

In this model, these additional rules cause the input to be rejected at processing time.

Lookahead

There are two cases where the Reference currently describes the lexer's behaviour using lookahead:

  • for (possibly raw) lifetime-or-label, to prevent 'ab'c' being analysed as 'ab followed by 'c
  • for floating-point literals, to make sure that 1.a is analysed as 1 . a rather than 1. a

These are easily modelled using PEG predicates (though this writeup prefers a reserved form for the former).
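For example, the floating-point case amounts to a negative lookahead after the candidate token; a hedged Rust sketch, approximating XID_Start with is_alphabetic:

// After matching "1.", peek at what follows: if the next character is
// '.', '_', or (approximately) XID_Start, the '.' is not consumed, so
// 1.a lexes as 1 . a while 1.5 and 1. are floats.
fn dot_belongs_to_float(after_dot: &str) -> bool {
    match after_dot.chars().next() {
        Some(c) if c == '.' || c == '_' || c.is_alphabetic() => false,
        _ => true,
    }
}

// dot_belongs_to_float("a") == false   →  1.a  is  1 . a
// dot_belongs_to_float("5") == true    →  1.5  is a float
// dot_belongs_to_float("")  == true    →  1.   is a float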

Handling raw strings

The biggest weakness of using the PEG formalism is that it can't naturally describe the rule for matching the number of # characters in raw string literals.

See Grammar for raw string literals for discussion.

Adopting language changes

Rustc's lexer is made up of hand-written imperative code, largely using match statements. It often peeks ahead for a few characters, and it tries to avoid backtracking.

This is a close fit for the behaviour modelled by PEGs, so there's good reason to suppose that it will be easy to update this model for future versions of Rust.

Modelling lifetimes and labels

Like the Reference, this model has a separate kind of token for lifetime-or-label.

It would be nice to be able to treat them as two fine-grained tokens (' followed by an identifier), like they are treated in procedural macro input, but I think it's impractical.

The main difficulty is dealing with cases like 'r"abc". Rust accepts that as a lifetime-or-label 'r followed by a string literal "abc". A model which treats ' as a complete token would analyse this as ' followed by a raw string literal r"abc". This problem can occur with any prefix (including a reserved prefix).

Producing tokens with attributes

This model makes the lexing process responsible for a certain amount of 'interpretation' of the tokens, rather than simply describing how the input source is sliced up and assigning a 'kind' to each resulting token.

The main motivation for this is to deal with string-like literals: it means we don't need to separate the description of the result of "unescaping" strings from the description of which strings contain well-formed escapes.

In particular, describing unescaping at lexing time makes it easy to describe the rule about rejecting NULs in C-strings, even if they were written using an escape.

For numeric literals, the way the suffix is identified isn't always simple (see Resolving ambiguities above); I think it's best to make the lexer responsible for doing it, so that the description of numeric literal expressions doesn't have to.

For identifiers, many parts of the spec will need a notion of equivalence (both for handling raw identifiers and for dealing with NFC normalisation), and some restrictions depend on the normalised form (see ASCII identifiers). I think it's best for the lexer to handle this by defining the represented ident.

Grammar for raw string literals

I believe the PEG formalism can't naturally describe Rust's rule for matching the number of # characters in raw string literals.

(The same limitations apply to matching the number of - characters in frontmatter fences.)

I can think of the following ways to handle this:

Ad-hoc extension

This writeup uses an ad-hoc extension to the formalism, along similar lines to the stack extension described below (but without needing a full stack).

Pest's stack extension

Pest provides a stack extension which is a good fit for this requirement, and is used in the reimplementation.

It looks like this:

RAW_DQ_REMAINDER = {
    PUSH(HASHES) ~
    "\"" ~ RAW_DQ_CONTENT ~ "\"" ~
    POP ~
    SUFFIX ?
}
RAW_DQ_CONTENT = {
    ( !("\"" ~ PEEK) ~ ANY ) *
}
HASHES = { "#" {0, 255} }

The notion of matching an expression is extended to include a context stack (a stack of strings): each match attempt takes the stack as an additional input and produces an updated stack as an additional output.

The stack is initially empty.

There are three additional forms of expression: PUSH(e) (where e is an arbitrary expression), PEEK, and POP.

PUSH(e) behaves in the same way as the expression e; if it succeeds, it additionally pushes the text consumed by e onto the stack.

PEEK behaves in the same way as a literal string expression, where the string is the top entry of the stack. If the stack is empty, PEEK fails.

POP behaves in the same way as PEEK. If it succeeds, it then pops the top entry from the stack.

All other expressions leave the stack unmodified.
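Outside the formalism, the same discipline is easy to state imperatively. A hedged Rust sketch for raw strings (suffixes and the 255-hash limit omitted):

// `input` starts just after the `r` prefix, e.g. `##"abc"##`.
// Returns the number of bytes in the raw string, or None.
// (Scanning bytes is safe here: `"` and `#` never appear inside a
// UTF-8 multi-byte sequence.)
fn match_raw_dq(input: &str) -> Option<usize> {
    let bytes = input.as_bytes();
    let hashes = bytes.iter().take_while(|&&b| b == b'#').count(); // PUSH
    if bytes.get(hashes) != Some(&b'"') {
        return None;
    }
    let mut i = hashes + 1;
    while i < bytes.len() {
        // PEEK: the body ends at a `"` followed by `hashes` `#`s.
        if bytes[i] == b'"'
            && bytes.len() - i - 1 >= hashes
            && bytes[i + 1..i + 1 + hashes].iter().all(|&b| b == b'#')
        {
            return Some(i + 1 + hashes); // POP
        }
        i += 1;
    }
    None // unterminated
}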

Scheme of definitions

Because raw string literals have a limit of 255 # characters, it is in principle possible to model them using a PEG with 256 (pairs of) definitions.

So writing this out as a "scheme" of definitions is conceivable:

RDQ_0 = {
    "\"" ~ RDQ_0_CONTENT ~ "\"" ~
}
RDQ_0_CONTENT = {
    ( !("\"") ~ ANY ) *
}

RDQ_1 = {
    "#"{1} ~
    "\"" ~ RDQ_1_CONTENT ~ "\"" ~
    "#"{1} ~
}
RDQ_1_CONTENT = {
    ( !("\"" ~ "#"{1}) ~ ANY ) *
}

RDQ_2 = {
    "#"{2} ~
    "\"" ~ RDQ_2_CONTENT ~ "\"" ~
    "#"{2} ~
}
RDQ_2_CONTENT = {
    ( !("\"" ~ "#"{2}) ~ ANY ) *
}

…

RDQ_255 = {
    "#"{255} ~
    "\"" ~ RDQ_255_CONTENT ~ "\"" ~
    "#"{255} ~
}
RDQ_255_CONTENT = {
    ( !("\"" ~ "#"{255}) ~ ANY ) *
}

Open questions

Terminology

Some of the terms used in this document are taken from pre-existing documentation or rustc's error output, but many of them are new (and so can freely be changed).

Here's a partial list:

Term                         Source
processing                   New
fine-grained token           New
compound token               New
literal content              Reference
simple escape                Reference
escape sequence              Reference
escaped value                Reference
string continuation escape   Reference (as STRING_CONTINUE)
string representation        Reference
represented byte             New
represented character        Reference
represented bytes            Reference
represented string           Reference
represented ident            New
style (of a comment)         rustc internal
body (of a comment)          Reference

Raw string literals

How should raw string literals be documented? See Grammar for raw string literals for some options.

Token kinds and attributes

What kinds and attributes should fine-grained tokens have?

Hash count

Should there be an attribute recording the number of hashes in a raw string or byte-string literal? Rustc has something of the sort.

ASCII identifiers

Should there be an attribute indicating whether an identifier is all ASCII? The Reference lists several places where identifiers have this restriction, and it seems natural for the lexer to be responsible for making this check.

The list in the Reference is:

  • extern crate declarations
  • External crate names referenced in a path
  • Module names loaded from the filesystem without a path attribute
  • no_mangle attributed items
  • Item names in external blocks

I believe this restriction is applied after NFC-normalisation, so it's best thought of as a restriction on the represented ident.
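A hedged sketch of that check, using the unicode-normalization crate (which, as noted above, both rustc and the reimplementation use for NFC):

use unicode_normalization::UnicodeNormalization;

// The restriction applies to the represented ident: normalise to NFC
// first, then require every character to be ASCII.
fn is_ascii_ident(source_ident: &str) -> bool {
    source_ident.nfc().all(|c| c.is_ascii())
}

// U+212A KELVIN SIGN normalises to ASCII `K`, so an ident spelled
// with it passes this check even though its source form isn't ASCII:
// is_ascii_ident("\u{212A}elvin") == true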

Represented bytes for C strings

At present this document says that the sequence of "represented bytes" for C string literals doesn't include the added NUL.

That's following the way the Reference currently uses the term "represented bytes", but rustc includes the NUL in its equivalent piece of data.

Should this writeup change to match rustc?

Wording for string unescaping

The description of processing for String literals, Byte-string literals, and C-string literals was originally drafted for the Reference. Should there be a more formal definition of unescaping processes than the current "left-to-right order" and "contributes" wording?

I believe that any literal content which will be accepted can be written uniquely as a sequence of (escape-sequence or non-\-character), but I'm not sure that's obvious enough that it can be stated without justification.
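A hedged sketch of that left-to-right reading (with almost all escape forms elided, and parse_escape as a hypothetical helper):

// Read the literal content left to right: `\` always begins an escape
// sequence and anything else contributes itself, so the decomposition
// into (escape-sequence | non-\ character) is forced at each step.
fn unescape(content: &str) -> Result<String, ()> {
    let mut out = String::new();
    let mut chars = content.chars();
    while let Some(c) = chars.next() {
        if c == '\\' {
            out.push(parse_escape(&mut chars)?);
        } else {
            out.push(c);
        }
    }
    Ok(out)
}

// Hypothetical helper: consumes one escape sequence and returns its
// escaped value (only three simple escapes shown here).
fn parse_escape(chars: &mut std::str::Chars<'_>) -> Result<char, ()> {
    match chars.next() {
        Some('n') => Ok('\n'),
        Some('t') => Ok('\t'),
        Some('\\') => Ok('\\'),
        _ => Err(()),
    }
}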

This is a place where the reimplementation isn't closely parallel to the writeup.

Rustc oddities

NFC normalisation for lifetime/label

Identifiers are normalised to NFC, which means that Kelvin (spelled with U+212A KELVIN SIGN) and Kelvin (spelled with ASCII K) are treated as representing the same identifier. See rfc2457.

But this doesn't happen for lifetimes or labels, so 'Kelvin and 'Kelvin (spelled the same two ways) are different as lifetimes or labels.

For example, a program that spells an identifier both ways compiles without warning in Rust 1.90, while the equivalent program spelling a lifetime both ways doesn't.

In this writeup, the represented ident attribute of Ident and Raw_ident fine-grained tokens is in NFC, and the name attribute of Lifetime_or_label and Raw_lifetime_or_label tokens isn't.
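A hedged sketch of the resulting distinction, again using the unicode-normalization crate:

use unicode_normalization::UnicodeNormalization;

fn main() {
    let ascii = "Kelvin";         // spelled with ASCII `K`
    let kelvin = "\u{212A}elvin"; // spelled with U+212A KELVIN SIGN
    // As identifiers, both have represented ident "Kelvin":
    assert!(ascii.nfc().eq(kelvin.nfc()));
    // But lifetime/label names are compared without normalisation:
    assert_ne!(ascii, kelvin);
}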

I think this behaviour is a promising candidate for provoking the "Wait...that's what we currently do? We should fix that." reaction to being given a spec to review.

Filed as rustc #126759.

Nested block comments

The Reference says "Nested block comments are supported".

Rustc implements this by counting occurrences of /* and */, matching greedily. That means it rejects forms like /* xyz /*/.

This writeup includes a !"/*" subexpression in the BLOCK_COMMENT_CONTENT definition to match rustc's behaviour.
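A hedged sketch of that counting behaviour (not rustc's actual code):

// Count /* and */ greedily, left to right; returns the comment's
// length, or None if it's unterminated. For "/* xyz /*/" the second
// "/*" is consumed as an opener before the final "/" is seen, so the
// comment never closes and the input is rejected.
fn block_comment_len(s: &str) -> Option<usize> {
    let b = s.as_bytes();
    if b.len() < 2 || &b[..2] != b"/*" {
        return None;
    }
    let mut depth = 1;
    let mut i = 2;
    while i + 1 < b.len() {
        if b[i] == b'/' && b[i + 1] == b'*' {
            depth += 1; // opener matched greedily
            i += 2;
        } else if b[i] == b'*' && b[i + 1] == b'/' {
            depth -= 1;
            i += 2;
            if depth == 0 {
                return Some(i);
            }
        } else {
            i += 1;
        }
    }
    None // unterminated
}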

The grammar production in the Reference seems to be written to assume that these forms should be accepted (but I think it's garbled anyway: it accepts /* /* */).

I haven't seen any discussion of whether this rustc behaviour is considered desirable.

String continuation escapes

rustc has a warning that the behaviour of String continuation escapes (when multiple newlines are skipped) may change in future.

The Reference has a note about this, and points to #1042 for more information.

#136600 asks whether this is intentional.