Introduction

This document describes rustc's lexer. The description aims to be both correct and verifiable.

It's accompanied by a reimplementation of the lexer in Rust based on that description (called the "comparable implementation" below), and a framework for comparing its output to rustc's.

One component of the description is a Parsing Expression Grammar; the comparable implementation uses the Pest library to generate the corresponding parser.

Scope

Rust language version

This document describes Rust version 1.86.

That means it describes raw lifetimes/labels and the additional reservations in the 2024 edition, but not rfc3349 (Mixed UTF-8 literals).

Other statements in this document are intended to be true as of April 2025.

The comparable implementation is intended to be compiled against (and compared against)
rustc 1.87.0-nightly (f8a913b13 2025-02-23)

This branch also documents the behaviour of pr131656 ("lexer: Treat more floats with empty exponent as valid tokens") as of 2025-04-27; the comparable implementation should be compared against that PR.

Editions

This document describes the editions supported by Rust 1.86:

2015
2018
2021
2024

There are no differences in lexing behaviour between the 2015 and 2018 editions.

In the comparable implementation, "2015" is used to refer to the common behaviour of Rust 2015 and Rust 2018.

Accepted input

This description aims to accept input exactly if rustc's lexer would.

Specifically, it aims to model what's accepted as input to a function-like macro (a procedural macro or a by-example macro using the tt fragment specifier).

It's not attempting to accurately model rustc's "reasons" for rejecting input, or to provide enough information to reproduce error messages similar to rustc's.

It's not attempting to describe rustc's "recovery" behaviour (where input which will be reported as an error provides tokens to later stages of the compiler anyway).

Size limits

This description doesn't attempt to characterise rustc's limits on the size of the input as a whole.

As far as I know, rustc has no limits on the size of individual tokens beyond its limits on the input as a whole. But I haven't tried to test this.

Output form

This document only goes as far as describing how to produce a "least common denominator" stream of tokens.

Further writing will be needed to describe how to convert that stream to forms that fit the (differing) needs of the grammar and the macro systems.

In particular, this representation may be unsuitable for direct use by a description of the grammar because:

there's no distinction between identifiers and keywords;
there's a single "kind" of token for all punctuation;
sequences of punctuation such as :: aren't glued together to make a single token.

(The comparable implementation includes code to make compound punctuation tokens so they can be compared with rustc's, but that process isn't described here.)

Licence

This document and the accompanying lexer implementation are released under the terms of both the MIT license and the Apache License (Version 2.0).

Authorship and source access

The source code for this document and the accompanying lexer implementation is available at https://github.com/mattheww/lexeywan

Overview

The following processes might be considered to be part of Rust's lexer:

Decode: interpret UTF-8 input as a sequence of Unicode characters
Clean:
- Byte order mark removal
- CRLF normalisation
- Shebang removal
Tokenise: interpret the characters as ("fine-grained") tokens
Further processing: to fit the needs of later parts of the spec
- For example, convert fine-grained tokens to compound tokens
- possibly different for the grammar and the two macro implementations

This document attempts to completely describe the "Tokenise" process.

Definitions

Byte

For the purposes of this document, byte means the same thing as Rust's u8 (corresponding to a natural number in the range 0 to 255 inclusive).

Character

For the purposes of this document, character means the same thing as Rust's char. That means, in particular:

there's exactly one character for each Unicode scalar value
the things that Unicode calls "noncharacters" are characters
there are no characters corresponding to surrogate code points

Sequence

When this document refers to a sequence of items, it means a finite, but possibly empty, ordered list of those items.

"character sequence" and "sequence of characters" are different ways of saying the same thing.

NFC normalisation

References to NFC-normalised strings are talking about Unicode's Normalization Form C, defined in Unicode Standard Annex #15.

Processing that happens before tokenising

This document's description of tokenising takes a sequence of characters as input.

rustc obtains that sequence of characters as described below. This description is taken from the Input format chapter of the Reference.

Source encoding

Each source file is interpreted as a sequence of Unicode characters encoded in UTF-8. It is an error if the file is not valid UTF-8.

Byte order mark removal

If the first character in the sequence is U+FEFF (BYTE ORDER MARK), it is removed.

CRLF normalisation

Each pair of characters U+000D CR immediately followed by U+000A LF is replaced by a single U+000A LF.

Other occurrences of the character U+000D CR are left in place (they are treated as whitespace).

Note: It's still possible for the sequence CRLF to be passed on to the tokeniser: that will happen if the source file contained the sequence CRCRLF.
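
The two steps described so far can be sketched in Rust as follows. This is a minimal illustration, not code from rustc or the comparable implementation; the function name is hypothetical, and shebang removal is left out.

```rust
/// Applies byte order mark removal and CRLF normalisation to decoded input.
/// (Hypothetical helper for illustration; shebang removal is not modelled.)
fn remove_bom_and_normalise_crlf(input: &str) -> String {
    // Remove a single leading U+FEFF, if present.
    let without_bom = input.strip_prefix('\u{FEFF}').unwrap_or(input);
    // Replace each CR LF pair with a single LF. The scan is left to right
    // and non-overlapping, so lone CR characters are left in place.
    without_bom.replace("\r\n", "\n")
}
```

With this sketch, the input CR CR LF becomes CR LF, which matches the note above.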

Shebang removal

If the remaining sequence begins with the characters #!, the characters up to and including the first U+000A LF are removed from the sequence.

For example, the first line of the following file would be ignored:

#!/usr/bin/env rustx

fn main() {
    println!("Hello!");
}

As an exception, if the #! characters are followed (ignoring intervening comments or whitespace) by a [ punctuation token, nothing is removed. This prevents an inner attribute at the start of a source file being removed.

See open question: How to model shebang removal

Tokenising

This phase of processing takes a character sequence (the input), and either:

produces a sequence of fine-grained tokens; or
reports that lexical analysis failed

The analysis depends on the Rust edition which is in effect when the input is processed.

So strictly speaking, the edition is a second parameter to the process described here.

Tokenisation is described using two operations:

Pretokenising extracts pretokens from the character sequence.
Reprocessing converts pretokens to fine-grained tokens.

Either operation can cause lexical analysis to fail.

Note: If lexical analysis succeeds, concatenating the extents of the produced tokens produces an exact copy of the input.

Process

The process is to repeat the following steps until the input is empty:

extract a pretoken from the start of the input
reprocess that pretoken

If no step determines that lexical analysis should fail, the output is the sequence of fine-grained tokens produced by the repetitions of the second step.

Note: Each fine-grained token corresponds to one pretoken, representing exactly the same characters from the input; reprocessing doesn't involve any combination or splitting.

Note: It doesn't make any difference whether we treat this as one pass with interleaved pretoken-extraction and reprocessing, or as two passes. The comparable implementation uses a single interleaved pass, which means when it reports an error it describes the earliest part of the input which caused trouble.
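
The interleaved form of the process might be sketched like this. The function and parameter names are hypothetical; `extract` and `reprocess` stand in for the two operations, which are supplied by the caller, and the zero-length check is an extra safeguard of this sketch rather than part of the description.

```rust
/// Repeatedly extracts a pretoken from the start of the input and
/// reprocesses it, until the input is empty.
///
/// `extract` returns the pretoken together with the length in bytes of its
/// extent, or `None` if nothing matches. `reprocess` returns `None` to
/// reject the pretoken. Either failure makes the whole analysis fail.
fn tokenise<P, T>(
    mut input: &str,
    extract: impl Fn(&str) -> Option<(P, usize)>,
    reprocess: impl Fn(P) -> Option<T>,
) -> Option<Vec<T>> {
    let mut tokens = Vec::new();
    while !input.is_empty() {
        let (pretoken, len) = extract(input)?;
        // Safeguard for this sketch: an empty extent would loop forever.
        if len == 0 {
            return None;
        }
        tokens.push(reprocess(pretoken)?);
        // Remove the extracted pretoken's extent from the start of the input.
        input = &input[len..];
    }
    Some(tokens)
}
```

Because each iteration removes exactly the extent that was matched, the extents of the produced tokens concatenate to the original input whenever the function returns `Some`.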

Pretokens

Each pretoken has an extent, which is a sequence of characters taken from the input.

Each pretoken has a kind, and possibly also some attributes, as described in the tables below.

Kind	Attributes
`Reserved`
`Whitespace`
`LineComment`	`comment content`
`BlockComment`	`comment content`
`Punctuation`	`mark`
`Identifier`	`identifier`
`RawIdentifier`	`identifier`
`LifetimeOrLabel`	`name`
`RawLifetimeOrLabel`	`name`
`SingleQuotedLiteral`	`prefix`, `literal content`, `suffix`
`DoubleQuotedLiteral`	`prefix`, `literal content`, `suffix`
`RawDoubleQuotedLiteral`	`prefix`, `literal content`, `suffix`
`IntegerLiteral`	`base`, `digits`, `suffix`
`FloatLiteral`	`body`, `suffix`

These attributes have the following types:

Attribute	Type
`base`	binary / octal / decimal / hexadecimal
`body`	sequence of characters
`comment content`	sequence of characters
`digits`	sequence of characters
`identifier`	sequence of characters
`literal content`	sequence of characters
`mark`	single character
`name`	sequence of characters
`prefix`	sequence of characters
`suffix`	either a sequence of characters or none
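
One way to model the two tables above in Rust is sketched below. The type and field names (`Pretoken`, `PretokenData`, `Base`) belong to this sketch only and are not taken from the comparable implementation. A "sequence of characters" is modelled as `String`, and an attribute that may be none as `Option`.

```rust
/// The `base` attribute of an integer literal pretoken.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Base {
    Binary,
    Octal,
    Decimal,
    Hexadecimal,
}

/// A pretoken's kind together with the attributes that kind carries.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq, Eq)]
enum PretokenData {
    Reserved,
    Whitespace,
    LineComment { comment_content: String },
    BlockComment { comment_content: String },
    Punctuation { mark: char },
    Identifier { identifier: String },
    RawIdentifier { identifier: String },
    LifetimeOrLabel { name: String },
    RawLifetimeOrLabel { name: String },
    SingleQuotedLiteral { prefix: String, literal_content: String, suffix: Option<String> },
    DoubleQuotedLiteral { prefix: String, literal_content: String, suffix: Option<String> },
    RawDoubleQuotedLiteral { prefix: String, literal_content: String, suffix: Option<String> },
    IntegerLiteral { base: Base, digits: String, suffix: Option<String> },
    FloatLiteral { body: String, suffix: Option<String> },
}

/// A pretoken: its extent plus its kind and attributes.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq, Eq)]
struct Pretoken {
    extent: String,
    data: PretokenData,
}
```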

Pretokenising

Pretokenisation is described by a Parsing Expression Grammar which describes how to match a single pretoken (not a sequence of pretokens).

The grammar isn't strictly a PEG. See Grammar for raw string literals.

The grammar defines an edition nonterminal for each Rust edition:

Edition	Edition nonterminal
2015	`PRETOKEN_2015`
2021	`PRETOKEN_2021`
2024	`PRETOKEN_2024`

Each edition nonterminal is defined as a choice expression, each of whose subexpressions is a single nonterminal (a pretoken nonterminal).

Grammar

PRETOKEN_2015 = {
    Whitespace |
    Line_comment |
    Block_comment |
    Unterminated_block_comment |
    Single_quoted_literal |
    Double_quoted_literal_2015 |
    Raw_double_quoted_literal_2015 |
    Unterminated_literal_2015 |
    Float_literal_1 |
    Reserved_float_empty_exponent |
    Reserved_float_e_suffix_restriction |
    Float_literal_2 |
    Reserved_float_based |
    Reserved_integer_e_suffix_restriction |
    Integer_literal |
    Lifetime_or_label |
    Raw_identifier |
    Reserved_prefix_2015 |
    Identifier |
    Punctuation
}

PRETOKEN_2021 = {
    Whitespace |
    Line_comment |
    Block_comment |
    Unterminated_block_comment |
    Single_quoted_literal |
    Double_quoted_literal_2021 |
    Raw_double_quoted_literal_2021 |
    Reserved_literal_2021 |
    Float_literal_1 |
    Reserved_float_empty_exponent |
    Reserved_float_e_suffix_restriction |
    Float_literal_2 |
    Reserved_float_based |
    Reserved_integer_e_suffix_restriction |
    Integer_literal |
    Raw_lifetime_or_label_2021 |
    Reserved_lifetime_or_label_prefix_2021 |
    Lifetime_or_label |
    Raw_identifier |
    Reserved_prefix_2021 |
    Identifier |
    Punctuation
}

PRETOKEN_2024 = {
    Whitespace |
    Line_comment |
    Block_comment |
    Unterminated_block_comment |
    Single_quoted_literal |
    Double_quoted_literal_2021 |
    Raw_double_quoted_literal_2021 |
    Reserved_literal_2021 |
    Reserved_guard_2024 |
    Float_literal_1 |
    Reserved_float_empty_exponent |
    Reserved_float_e_suffix_restriction |
    Float_literal_2 |
    Reserved_float_based |
    Reserved_integer_e_suffix_restriction |
    Integer_literal |
    Raw_lifetime_or_label_2021 |
    Reserved_lifetime_or_label_prefix_2021 |
    Lifetime_or_label |
    Raw_identifier |
    Reserved_prefix_2021 |
    Identifier |
    Punctuation
}

The pretoken nonterminals are distinguished in the grammar as having names in Title_case.

The rest of the grammar is presented in the following pages in this section. It's also available on a single page.

The pretoken nonterminals are presented in an order consistent with their appearance in the edition nonterminals. That means they appear in priority order (highest priority first). There is one exception, for floating-point literals and their related reserved forms (see Float literal).

Extracting pretokens

To extract a pretoken from the input:

Attempt to match the edition's edition nonterminal at the start of the input.
If the match fails, lexical analysis fails.
If the match succeeds, the extracted pretoken has:
- extent: the characters consumed by the nonterminal's expression
- kind and attributes: determined by the pretoken nonterminal used in the match, as described below.
Remove the extracted pretoken's extent from the start of the input.

Strictly speaking we have to justify the assumption that the match will always either fail or succeed, which basically means observing that the grammar has no left recursion.

Determining the pretoken kind and attributes

Each pretoken nonterminal produces a single kind of pretoken.

In most cases a given kind of pretoken is produced only by a single pretoken nonterminal. The exceptions are:

Several pretoken nonterminals produce Reserved pretokens.
There are two pretoken nonterminals producing FloatLiteral pretokens.
In some cases there are variant pretoken nonterminals for different editions.

Each pretoken nonterminal (or group of edition variants) has a subsection on the following pages, which lists the pretoken kind and provides a table of that pretoken kind's attributes.

In most cases an attribute value is "captured" by a named definition from the grammar:

If an attributes table entry says "from NONTERMINAL", the attribute's value is the sequence of characters consumed by that nonterminal, which will appear in one of the pretoken nonterminal's subexpressions (possibly via the definitions of additional nonterminals).
Some attributes table entries list multiple nonterminals, eg "from NONTERMINAL1 or NONTERMINAL2". In these cases the grammar ensures that at most one of those nonterminals may be matched, so that the attribute is unambiguously defined.
If no listed nonterminal is matched (which can happen if they all appear before ? or inside choice expressions), the attribute's value is none. The table says "(may be none)" in these cases.

If for any input the above rules don't result in a unique well-defined attribute value, it's a bug in this specification.

In other cases the attributes table entry defines the attribute value explicitly, depending on the characters consumed by the pretoken nonterminal or on which subexpression of the pretoken nonterminal matched.

Common definitions

Some grammar definitions which are needed on the following pages appear below.

Sets of characters

The following special terminals specify sets of Unicode characters:

Grammar

ANY
PATTERN_WHITE_SPACE
XID_START
XID_CONTINUE

ANY matches any Unicode character.

PATTERN_WHITE_SPACE matches any character which has the Pattern_White_Space Unicode property. These characters are:


U+0009	(horizontal tab, '\t')
U+000A	(line feed, '\n')
U+000B	(vertical tab)
U+000C	(form feed)
U+000D	(carriage return, '\r')
U+0020	(space, ' ')
U+0085	(next line)
U+200E	(left-to-right mark)
U+200F	(right-to-left mark)
U+2028	(line separator)
U+2029	(paragraph separator)

Note: This set doesn't change in updated Unicode versions.
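
Since the set is fixed, it can be written out directly. The following is a small sketch (the function name is hypothetical) that tests membership using the eleven characters listed above.

```rust
/// Returns true if `c` has the Pattern_White_Space Unicode property,
/// using the fixed list of characters given above.
fn is_pattern_white_space(c: char) -> bool {
    matches!(
        c,
        '\u{0009}'..='\u{000D}' // HT, LF, VT, FF, CR
            | '\u{0020}'        // space
            | '\u{0085}'        // next line
            | '\u{200E}'        // left-to-right mark
            | '\u{200F}'        // right-to-left mark
            | '\u{2028}'        // line separator
            | '\u{2029}'        // paragraph separator
    )
}
```

This is not the same set as the one tested by Rust's `char::is_whitespace`, which uses the White_Space property: for example, U+00A0 (no-break space) has White_Space but not Pattern_White_Space, and U+200E is the other way round.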

XID_START matches any character which has the XID_Start Unicode property (as of Unicode 16.0.0).

XID_CONTINUE matches any character which has the XID_Continue Unicode property (as of Unicode 16.0.0).

Identifier-like forms

Grammar

IDENT = { IDENT_START ~ XID_CONTINUE * }
IDENT_START = { XID_START | "_" }
SUFFIX = { IDENT }

Whitespace and comment pretokens

Whitespace

Grammar

Whitespace = { PATTERN_WHITE_SPACE + }

Pretoken kind

Whitespace

Attributes

(none)

Line comment

Grammar

Line_comment = { "//" ~ LINE_COMMENT_CONTENT }
LINE_COMMENT_CONTENT = { ( !"\n" ~ ANY )* }

Pretoken kind

LineComment

Attributes


`comment content`	from `LINE_COMMENT_CONTENT`
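
As an illustration, the Line_comment rule could be hand-written as below. This is a sketch with a hypothetical function name; the comparable implementation generates its parser from the grammar instead.

```rust
/// Attempts to match Line_comment at the start of `input`.
/// Returns the comment content and the length in bytes of the extent.
fn match_line_comment(input: &str) -> Option<(&str, usize)> {
    let rest = input.strip_prefix("//")?;
    // LINE_COMMENT_CONTENT: everything up to, but not including, the next LF
    // (or the end of the input if there is no LF).
    let content_len = rest.find('\n').unwrap_or(rest.len());
    Some((&rest[..content_len], 2 + content_len))
}
```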

Block comment

Grammar

Block_comment = { "/*" ~ BLOCK_COMMENT_CONTENT ~ "*/" }
BLOCK_COMMENT_CONTENT = { ( Block_comment | !"*/" ~ !"/*" ~ ANY ) * }

Pretoken kind

BlockComment

Attributes


`comment content`	from `BLOCK_COMMENT_CONTENT`

Note: See Nested block comments for discussion of the !"/*" subexpression.
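
The recursive Block_comment rule can also be read as a nesting-depth count. The sketch below (hypothetical function name) is intended to accept the same inputs as the grammar: it fails if any comment opened inside the outermost one is left unterminated.

```rust
/// Attempts to match Block_comment at the start of `input`.
/// Returns the comment content and the length in bytes of the extent.
fn match_block_comment(input: &str) -> Option<(&str, usize)> {
    if !input.starts_with("/*") {
        return None;
    }
    let mut pos = 2;
    let mut depth = 1usize;
    while pos < input.len() {
        let rest = &input[pos..];
        if rest.starts_with("/*") {
            // A nested comment opens.
            depth += 1;
            pos += 2;
        } else if rest.starts_with("*/") {
            depth -= 1;
            pos += 2;
            if depth == 0 {
                // The outermost comment has closed.
                return Some((&input[2..pos - 2], pos));
            }
        } else {
            // Any other character is part of the content.
            pos += rest.chars().next()?.len_utf8();
        }
    }
    // End of input reached with a comment still open.
    None
}
```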

Unterminated block comment

Grammar

Unterminated_block_comment = { "/*" }

Pretoken kind

Reserved

Attributes

(none)

Note: This definition makes sure that an unterminated block comment isn't accepted as punctuation (* followed by /).

String and byte literal pretokens

Single-quoted literal

Grammar

Single_quoted_literal = {
    SQ_PREFIX ~
    "'" ~ SQ_CONTENT ~ "'" ~
    SUFFIX ?
}

SQ_PREFIX = { "b" ? }

SQ_CONTENT = {
    "\\" ~ ANY ~ ( !"'" ~ ANY ) * |
    !"'" ~ ANY
}

Pretoken kind

SingleQuotedLiteral

Attributes


`prefix`	from `SQ_PREFIX`
`literal content`	from `SQ_CONTENT`
`suffix`	from `SUFFIX` (may be none)

(Non-raw) double-quoted literal

Grammar

Double_quoted_literal_2015 = { DQ_PREFIX_2015 ~ DQ_REMAINDER }
Double_quoted_literal_2021 = { DQ_PREFIX_2021 ~ DQ_REMAINDER }

DQ_PREFIX_2015 = { "b" ? }
DQ_PREFIX_2021 = { ( "b" | "c" ) ? }

DQ_REMAINDER = {
    "\"" ~ DQ_CONTENT ~ "\"" ~
    SUFFIX ?
}
DQ_CONTENT = {
    (
        "\\" ~ ANY |
        !"\"" ~ ANY
    ) *
}

Pretoken kind

DoubleQuotedLiteral

Attributes


`prefix`	from `DQ_PREFIX_2015` or `DQ_PREFIX_2021`
`literal content`	from `DQ_CONTENT`
`suffix`	from `SUFFIX` (may be none)

Raw double-quoted literal

Grammar

Raw_double_quoted_literal_2015 = { RAW_DQ_PREFIX_2015 ~ RAW_DQ_REMAINDER }
Raw_double_quoted_literal_2021 = { RAW_DQ_PREFIX_2021 ~ RAW_DQ_REMAINDER }

RAW_DQ_PREFIX_2015 = { "r" | "br" }
RAW_DQ_PREFIX_2021 = { "r" | "br" | "cr" }

RAW_DQ_REMAINDER = {
    HASHES¹ ~
    "\"" ~ RAW_DQ_CONTENT ~ "\"" ~
    HASHES² ~
    SUFFIX ?
}
RAW_DQ_CONTENT = {
    ( !("\"" ~ HASHES²) ~ ANY ) *
}
HASHES = { "#" {0, 255} }

These definitions require an extension to the Parsing Expression Grammar formalism: each of the expressions marked as HASHES² fails unless the text it matches is the same as the text matched by the (only) successful match using the expression marked as HASHES¹ in the same attempt to match the current pretoken nonterminal.

See Grammar for raw string literals for a discussion of alternatives to this extension.
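
To illustrate the effect of the HASHES¹ / HASHES² extension, here is a hand-written sketch of RAW_DQ_REMAINDER. The function name is hypothetical and the optional SUFFIX is ignored.

```rust
/// Attempts to match the part of a raw double-quoted literal that follows
/// its prefix, ignoring any suffix.
/// Returns the literal content and the number of bytes consumed.
fn match_raw_dq_remainder(input: &str) -> Option<(&str, usize)> {
    // HASHES¹: count the leading '#' characters. Only 255 are permitted;
    // with more, the grammar would expect '"' and find a further '#'.
    let hashes = input.bytes().take_while(|&b| b == b'#').count();
    if hashes > 255 {
        return None;
    }
    let after_quote = input[hashes..].strip_prefix('"')?;
    // The content ends at the first '"' followed by HASHES²,
    // that is, by the same number of '#' characters as HASHES¹ matched.
    let terminator = format!("\"{}", "#".repeat(hashes));
    let content_len = after_quote.find(&terminator)?;
    let consumed = hashes + 1 + content_len + terminator.len();
    Some((&after_quote[..content_len], consumed))
}
```

For example, given the remainder of `r#"a"b"#`, the first `"` inside the content is not followed by `#`, so it does not end the literal.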

Pretoken kind

RawDoubleQuotedLiteral

Attributes


`prefix`	from `RAW_DQ_PREFIX_2015` or `RAW_DQ_PREFIX_2021`
`literal content`	from `RAW_DQ_CONTENT`
`suffix`	from `SUFFIX` (may be none)

Reserved or unterminated literal

Grammar

Unterminated_literal_2015 = { "r\"" | "br\"" | "b'" }
Reserved_literal_2021 = { IDENT ~ ( "\"" | "'" ) }

Pretoken kind

Reserved

Attributes

(none)

Note: I believe in the Unterminated_literal_2015 definition only the b' form is strictly needed: if that definition matches using one of the other subexpressions then the input will be rejected eventually anyway (given that the corresponding string literal nonterminal didn't match).

Note: Reserved_literal_2021 catches both reserved forms and unterminated b' literals.

Reserved guard (Rust 2024)

Grammar

Reserved_guard_2024 = { "##" | "#\"" }

Pretoken kind

Reserved

Attributes

(none)

Note: This definition is listed here near the double-quoted string literals because these forms were reserved during discussions about introducing string literals formed like #"…"#.

Numeric literal pretokens

The following nonterminals are common to the definitions below:

Grammar

DECIMAL_DIGITS = { ('0'..'9' | "_") * }
HEXADECIMAL_DIGITS = { ('0'..'9' | 'a' .. 'f' | 'A' .. 'F' | "_") * }
LOW_BASE_PRETOKEN_DIGITS = { DECIMAL_DIGITS }
DECIMAL_PART = { '0'..'9' ~ DECIMAL_DIGITS }

RESTRICTED_E_SUFFIX = { ("e"|"E") ~ "_"+ ~ !XID_START ~ XID_CONTINUE }

Float literal

Grammar

Float_literal_1 = {
    FLOAT_BODY_WITH_EXPONENT ~ SUFFIX ?
}
Float_literal_2 = {
    FLOAT_BODY_WITHOUT_EXPONENT ~ SUFFIX ? |
    FLOAT_BODY_WITH_FINAL_DOT ~ !"." ~ !IDENT_START
}

FLOAT_BODY_WITH_EXPONENT = {
    DECIMAL_PART ~ ("." ~ DECIMAL_PART ) ? ~
    ("e"|"E") ~ ("+"|"-") ? ~ EXPONENT_DIGITS
}
EXPONENT_DIGITS = { "_" * ~ '0'..'9' ~ DECIMAL_DIGITS }

FLOAT_BODY_WITHOUT_EXPONENT = {
    DECIMAL_PART ~ "." ~ DECIMAL_PART
}

FLOAT_BODY_WITH_FINAL_DOT = {
    DECIMAL_PART ~ "."
}

Pretoken kind

FloatLiteral

Attributes


`body`	from `FLOAT_BODY_WITH_EXPONENT`,`FLOAT_BODY_WITHOUT_EXPONENT`, or `FLOAT_BODY_WITH_FINAL_DOT`
`suffix`	from `SUFFIX` (may be none)

Note: The ! "." subexpression makes sure that forms like 1..2 aren't treated as starting with a float. The ! IDENT_START subexpression makes sure that forms like 1.some_method() aren't treated as starting with a float.

Note: The Reserved_float_empty_exponent pretoken nonterminal is placed between Float_literal_1 and Float_literal_2 in priority order (which is why there are two pretoken nonterminals producing FloatLiteral).

Reserved float

Grammar

Reserved_float_empty_exponent = {
    DECIMAL_PART ~ ("." ~ DECIMAL_PART ) ? ~
    ("e"|"E") ~ ("+"|"-")
}
Reserved_float_e_suffix_restriction = {
    DECIMAL_PART ~ "." ~ DECIMAL_PART ~
    RESTRICTED_E_SUFFIX
}
Reserved_float_based = {
    (
        ("0b" | "0o") ~ LOW_BASE_PRETOKEN_DIGITS |
        "0x" ~ HEXADECIMAL_DIGITS
    )  ~  (
        ("e"|"E") ~ ("+"|"-" | EXPONENT_DIGITS) |
        "." ~ !"." ~ !IDENT_START
    )
}

Pretoken kind

Reserved

Attributes

(none)

Note: The Reserved_float_empty_exponent pretoken nonterminal is placed between Float_literal_1 and Float_literal_2 in priority order. This ordering makes sure that forms like 123.4e+ are reserved, rather than being accepted by FLOAT_BODY_WITHOUT_EXPONENT.

See e-suffix-restriction for discussion of Reserved_float_e_suffix_restriction.

Reserved integer

Grammar

Reserved_integer_e_suffix_restriction = {
    ( INTEGER_BINARY_LITERAL |
      INTEGER_OCTAL_LITERAL |
      INTEGER_DECIMAL_LITERAL ) ~
    RESTRICTED_E_SUFFIX
}

Pretoken kind

Reserved

Attributes

(none)

See e-suffix-restriction for discussion.

Integer literal

Grammar

Integer_literal = {
    ( INTEGER_BINARY_LITERAL |
      INTEGER_OCTAL_LITERAL |
      INTEGER_HEXADECIMAL_LITERAL |
      INTEGER_DECIMAL_LITERAL ) ~
    SUFFIX ?
}

INTEGER_BINARY_LITERAL = { "0b" ~ LOW_BASE_PRETOKEN_DIGITS }
INTEGER_OCTAL_LITERAL = { "0o" ~ LOW_BASE_PRETOKEN_DIGITS }
INTEGER_HEXADECIMAL_LITERAL = { "0x" ~ HEXADECIMAL_DIGITS }
INTEGER_DECIMAL_LITERAL = { DECIMAL_PART }

Pretoken kind

IntegerLiteral

Attributes


`base`	See below
`digits`	from `LOW_BASE_PRETOKEN_DIGITS`, `HEXADECIMAL_DIGITS`, or `DECIMAL_PART`
`suffix`	from `SUFFIX` (may be none)

The base attribute is determined from the following table, depending on which nonterminal participated in the match:


`INTEGER_BINARY_LITERAL`	binary
`INTEGER_OCTAL_LITERAL`	octal
`INTEGER_HEXADECIMAL_LITERAL`	hexadecimal
`INTEGER_DECIMAL_LITERAL`	decimal

Note: See rfc0879 for the reason we accept all decimal digits in binary and octal pretokens; the inappropriate digits are rejected in reprocessing.

Note: The INTEGER_DECIMAL_LITERAL nonterminal is listed last in the Integer_literal definition in order to resolve ambiguous cases like the following:

0b1e2 (which isn't 0 with suffix b1e2)

0b0123 (which is rejected, not accepted as 0 with suffix b0123)

0xy (which is rejected, not accepted as 0 with suffix xy)

0x· (which is rejected, not accepted as 0 with suffix x·)

Identifier, lifetime, and label pretokens

Recall that the IDENT nonterminal is defined as follows:

Grammar

IDENT = { IDENT_START ~ XID_CONTINUE * }
IDENT_START = { XID_START | "_" }

Note: This is following the specification in Unicode Standard Annex #31 for Unicode version 16.0, with the addition of permitting underscore as the first character.

Raw lifetime or label (Rust 2021 and 2024)

Grammar

Raw_lifetime_or_label_2021 = { "'r#" ~ IDENT ~ !"'" }

Pretoken kind

RawLifetimeOrLabel

Attributes


`name`	from `IDENT`

Reserved lifetime or label prefix (Rust 2021 and 2024)

Grammar

Reserved_lifetime_or_label_prefix_2021 = { "'" ~ IDENT ~ "#" }

Pretoken kind

Reserved

Attributes

(none)

(Non-raw) lifetime or label

Grammar

Lifetime_or_label = { "'" ~ IDENT ~ !"'" }

Pretoken kind

LifetimeOrLabel

Attributes


`name`	from `IDENT`

Note: The !"'" at the end of the expression makes sure that forms like 'aaa'bbb are not accepted.

Raw identifier

Grammar

Raw_identifier = { "r#" ~ IDENT }

Pretoken kind

RawIdentifier

Attributes


`identifier`	from `IDENT`

Reserved prefix

Grammar

Reserved_prefix_2015 = { "r#" | "br#" }
Reserved_prefix_2021 = { IDENT ~ "#" }

Pretoken kind

Reserved

Attributes

(none)

Note: This definition must appear here in priority order. Tokens added in future which match these reserved forms wouldn't necessarily be forms of identifier.

(Non-raw) identifier

Grammar

Identifier = { IDENT }

Pretoken kind

Identifier

Attributes


`identifier`	from `IDENT`

Note: The Reference adds the following when discussing identifiers: "Zero width non-joiner (ZWNJ U+200C) and zero width joiner (ZWJ U+200D) characters are not allowed in identifiers." Those characters don't have XID_Start or XID_Continue, so that's only informative text, not an additional constraint.

Punctuation pretokens

Punctuation

Grammar

Punctuation = {
    ";" |
    "," |
    "." |
    "(" |
    ")" |
    "{" |
    "}" |
    "[" |
    "]" |
    "@" |
    "#" |
    "~" |
    "?" |
    ":" |
    "$" |
    "=" |
    "!" |
    "<" |
    ">" |
    "-" |
    "&" |
    "|" |
    "+" |
    "*" |
    "/" |
    "^" |
    "%"
}

Pretoken kind

Punctuation

Attributes


`mark`	the single consumed character

Fine-grained tokens

Reprocessing produces fine-grained tokens.

Each fine-grained token has an extent, which is a sequence of characters taken from the input.

Each fine-grained token has a kind, and possibly also some attributes, as described in the tables below.

Kind	Attributes
`Whitespace`
`LineComment`	`style`, `body`
`BlockComment`	`style`, `body`
`Punctuation`	`mark`
`Identifier`	`represented identifier`
`RawIdentifier`	`represented identifier`
`LifetimeOrLabel`	`name`
`RawLifetimeOrLabel`	`name`
`CharacterLiteral`	`represented character`, `suffix`
`ByteLiteral`	`represented byte`, `suffix`
`StringLiteral`	`represented string`, `suffix`
`RawStringLiteral`	`represented string`, `suffix`
`ByteStringLiteral`	`represented bytes`, `suffix`
`RawByteStringLiteral`	`represented bytes`, `suffix`
`CStringLiteral`	`represented bytes`, `suffix`
`RawCStringLiteral`	`represented bytes`, `suffix`
`IntegerLiteral`	`base`, `digits`, `suffix`
`FloatLiteral`	`body`, `suffix`

These attributes have the following types:

Attribute	Type
`base`	binary / octal / decimal / hexadecimal
`body`	sequence of characters
`digits`	sequence of characters
`mark`	single character
`name`	sequence of characters
`represented byte`	single byte
`represented bytes`	sequence of bytes
`represented character`	single character
`represented identifier`	sequence of characters
`represented string`	sequence of characters
`style`	non-doc / inner doc / outer doc
`suffix`	sequence of characters

Notes:

At this stage:

Both _ and keywords are treated as instances of Identifier.
There are explicit tokens representing whitespace and comments.
Single-character tokens are used for all punctuation.
A lifetime (or label) is represented as a single token (which includes the leading ').

Reprocessing

Reprocessing examines a pretoken, and either accepts it (producing a fine-grained token), or rejects it (causing lexical analysis to fail).

Note: Reprocessing behaves in the same way in all Rust editions.

The effect of reprocessing each kind of pretoken is given in List of reprocessing cases.

Escape processing

The descriptions of the effect of reprocessing string and character literals make use of several forms of escape.

Each form of escape is characterised by:

an escape sequence: a sequence of characters, which always begins with \
an escaped value: either a single character or an empty sequence of characters

In the definitions of escapes below:

An octal digit is any of the characters in the range 0..=7.
A hexadecimal digit is any of the characters in the ranges 0..=9, a..=f, or A..=F.

Simple escapes

Each sequence of characters occurring in the first column of the following table is an escape sequence.

In each case, the escaped value is the character given in the corresponding entry in the second column.

Escape sequence	Escaped value
\0	U+0000 `NUL`
\t	U+0009 `HT`
\n	U+000A `LF`
\r	U+000D `CR`
\"	U+0022 QUOTATION MARK
\'	U+0027 APOSTROPHE
\\	U+005C REVERSE SOLIDUS

Note: The escaped value therefore has a Unicode scalar value which can be represented in a byte.

8-bit escapes

The escape sequence consists of \x followed by two hexadecimal digits.

The escaped value is the character whose Unicode scalar value is the result of interpreting the final two characters in the escape sequence as a hexadecimal integer, as if by u8::from_str_radix with radix 16.

Note: The escaped value therefore has a Unicode scalar value which can be represented in a byte.

7-bit escapes

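The escape sequence consists of \x followed by an octal digit then a hexadecimal digit.

The escaped value is the character whose Unicode scalar value is the result of interpreting the final two characters in the escape sequence as a hexadecimal integer, as if by u8::from_str_radix with radix 16.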
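
The two \x forms might be sketched as follows. The function names are hypothetical, and the sketch assumes that a 7-bit escape's value is computed in the same way as an 8-bit escape's.

```rust
/// Returns the escaped value if `escape` is an 8-bit escape such as `\xff`.
fn eight_bit_escape(escape: &str) -> Option<char> {
    let digits = escape.strip_prefix("\\x")?;
    // Check the digits explicitly: from_str_radix on its own would also
    // accept a leading '+', which is not a hexadecimal digit.
    if digits.len() != 2 || !digits.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    u8::from_str_radix(digits, 16).ok().map(char::from)
}

/// Returns the escaped value if `escape` is a 7-bit escape such as `\x7f`.
fn seven_bit_escape(escape: &str) -> Option<char> {
    let value = eight_bit_escape(escape)?;
    // The first digit is an octal digit exactly when the value is <= 0x7F.
    value.is_ascii().then_some(value)
}
```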

Unicode escapes

The escape sequence consists of \u{, followed by a hexadecimal digit, followed by a sequence of characters each of which is a hexadecimal digit or _, followed by }, with the restriction that there are no more than six hexadecimal digits in the entire escape sequence.

The escaped value is the character whose Unicode scalar value is the result of interpreting the hexadecimal digits contained in the escape sequence as a hexadecimal integer, as if by u32::from_str_radix with radix 16.
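
A sketch of this definition follows (the function name is hypothetical). It makes one assumption beyond the text above: where the digits don't denote a Unicode scalar value (a surrogate, or a value above U+10FFFF), no character is returned.

```rust
/// Returns the escaped value if `escape` is a Unicode escape such as `\u{1F600}`.
fn unicode_escape(escape: &str) -> Option<char> {
    let inner = escape.strip_prefix("\\u{")?.strip_suffix('}')?;
    // The first character must be a hexadecimal digit (so not '_').
    if !inner.chars().next()?.is_ascii_hexdigit() {
        return None;
    }
    // Every character must be a hexadecimal digit or '_'.
    if !inner.chars().all(|c| c.is_ascii_hexdigit() || c == '_') {
        return None;
    }
    // Only the hexadecimal digits contribute to the value; at most six are allowed.
    let digits: String = inner.chars().filter(|&c| c != '_').collect();
    if digits.len() > 6 {
        return None;
    }
    let value = u32::from_str_radix(&digits, 16).ok()?;
    // Assumption of this sketch: values which aren't scalar values give None.
    char::from_u32(value)
}
```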

String continuation escapes

The escape sequence consists of \ followed immediately by LF, and all following whitespace characters before the next non-whitespace character.

For this purpose, the whitespace characters are HT (U+0009), LF (U+000A), CR (U+000D), and SPACE (U+0020).

The escaped value is an empty sequence of characters.

The Reference says this behaviour may change in future; see String continuation escapes.
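
The extent of a string continuation escape could be found as in the sketch below (hypothetical function name), which uses the four whitespace characters listed above rather than Pattern_White_Space.

```rust
/// If `input` starts with a string continuation escape, returns the length
/// in bytes of the escape sequence.
fn string_continuation_len(input: &str) -> Option<usize> {
    // The escape sequence begins with a backslash immediately followed by LF.
    let after = input.strip_prefix("\\\n")?;
    // It then extends over all following HT, LF, CR, and SPACE characters.
    let skipped = after
        .find(|c: char| !matches!(c, '\t' | '\n' | '\r' | ' '))
        .unwrap_or(after.len());
    Some(2 + skipped)
}
```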

The list of reprocessing cases

The list below has an entry for each kind of pretoken, describing what kind of fine-grained token it produces, how the fine-grained token's attributes are determined, and the circumstances under which a pretoken is rejected.

When an attribute value is given below as "copied", it has the same value as the pretoken's attribute with the same name.

`Reserved`

A Reserved pretoken is always rejected.

`Whitespace`

Fine-grained token kind produced: Whitespace

A Whitespace pretoken is always accepted.

`LineComment`

Fine-grained token kind produced: LineComment

Attributes

style and body are determined from the pretoken's comment content as follows:

if the comment content begins with //:
- style is non-doc
- body is empty
otherwise, if the comment content begins with /,
- style is outer doc
- body is the characters from the comment content after that /
otherwise, if the comment content begins with !,
- style is inner doc
- body is the characters from the comment content after that !
otherwise
- style is non-doc
- body is empty

The pretoken is rejected if (and only if) the resulting body includes a CR character.

Note: The body of a non-doc comment is ignored by the rest of the compilation process.
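
The rules for LineComment can be sketched as below. The `Style` type and the function name are this sketch's own; the input is the pretoken's comment content (the text after the leading //).

```rust
/// The `style` attribute of a comment token.
#[derive(Debug, PartialEq, Eq)]
enum Style {
    NonDoc,
    InnerDoc,
    OuterDoc,
}

/// Determines style and body from a LineComment pretoken's comment content.
/// Returns `None` if the pretoken is rejected.
fn reprocess_line_comment(content: &str) -> Option<(Style, &str)> {
    let (style, body) = if content.starts_with("//") {
        // Four or more slashes in total: not a doc-comment.
        (Style::NonDoc, "")
    } else if let Some(body) = content.strip_prefix('/') {
        (Style::OuterDoc, body)
    } else if let Some(body) = content.strip_prefix('!') {
        (Style::InnerDoc, body)
    } else {
        (Style::NonDoc, "")
    };
    // Rejected if (and only if) the resulting body includes a CR character.
    if body.contains('\r') {
        None
    } else {
        Some((style, body))
    }
}
```

Because a non-doc comment's body is always empty, a CR inside a non-doc line comment doesn't cause rejection under these rules.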

`BlockComment`

Fine-grained token kind produced: BlockComment

Attributes

style and body are determined from the pretoken's comment content as follows:

if the comment content begins with **:
- style is non-doc
- body is empty
otherwise, if the comment content begins with * and contains at least one further character,
- style is outer doc
- body is the characters from the comment content after that *
otherwise, if the comment content begins with !,
- style is inner doc
- body is the characters from the comment content after that !
otherwise
- style is non-doc
- body is empty

The pretoken is rejected if (and only if) the resulting body includes a CR character.

Note: It follows that /**/ and /***/ are not doc-comments.

Note: The body of a non-doc comment is ignored by the rest of the compilation process.
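
The BlockComment rules differ from the line comment ones mainly in the "at least one further character" condition. A sketch, with hypothetical names, taking the pretoken's comment content (the text between the outermost /* and */) as input:

```rust
/// The `style` attribute of a comment token.
#[derive(Debug, PartialEq, Eq)]
enum Style {
    NonDoc,
    InnerDoc,
    OuterDoc,
}

/// Determines style and body from a BlockComment pretoken's comment content.
/// Returns `None` if the pretoken is rejected.
fn reprocess_block_comment(content: &str) -> Option<(Style, &str)> {
    let (style, body) = if content.starts_with("**") {
        (Style::NonDoc, "")
    } else if let Some(body) = content.strip_prefix('*').filter(|b| !b.is_empty()) {
        // Begins with '*' and contains at least one further character.
        (Style::OuterDoc, body)
    } else if let Some(body) = content.strip_prefix('!') {
        (Style::InnerDoc, body)
    } else {
        (Style::NonDoc, "")
    };
    // Rejected if (and only if) the resulting body includes a CR character.
    if body.contains('\r') {
        None
    } else {
        Some((style, body))
    }
}
```

The comment content of /**/ is empty and that of /***/ is the single character *, so both fall through to the final non-doc case.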

`Punctuation`

Fine-grained token kind produced: Punctuation

A Punctuation pretoken is always accepted.

Attributes

mark: copied

`Identifier`

Fine-grained token kind produced: Identifier

An Identifier pretoken is always accepted.

Attributes

represented identifier: NFC-normalised form of the pretoken's identifier

`RawIdentifier`

Fine-grained token kind produced: RawIdentifier

Attributes

represented identifier: NFC-normalised form of the pretoken's identifier

The pretoken is rejected if (and only if) the represented identifier is one of the following sequences of characters:

_
crate
self
super
Self

`LifetimeOrLabel`

Fine-grained token kind produced: LifetimeOrLabel

A LifetimeOrLabel pretoken is always accepted.

Attributes

name: copied

Note that the name is not NFC-normalised. See NFC normalisation for lifetime/label.

`RawLifetimeOrLabel`

Fine-grained token kind produced: RawLifetimeOrLabel

The pretoken is rejected if (and only if) the name is one of the following sequences of characters:

_
crate
self
super
Self

Attributes

name: copied

Note that the name is not NFC-normalised. See NFC normalisation for lifetime/label.
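As a minimal sketch of the RawLifetimeOrLabel check (the function name is hypothetical), note that the comparison is made against the name exactly as copied from the pretoken, with no normalisation step:

```rust
/// Reprocesses a RawLifetimeOrLabel pretoken, given its name attribute.
/// Returns the token's name, or `None` if the pretoken is rejected.
pub fn reprocess_raw_lifetime_or_label(name: &str) -> Option<String> {
    // The names which are rejected in raw form.
    const REJECTED: [&str; 5] = ["_", "crate", "self", "super", "Self"];
    if REJECTED.contains(&name) {
        None
    } else {
        // The name is copied unchanged (not NFC-normalised).
        Some(name.to_owned())
    }
}
```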

`SingleQuotedLiteral`

The pretokeniser guarantees the pretoken's prefix attribute is one of the following:

empty, in which case it is reprocessed as described under Character literal
the single character b, in which case it is reprocessed as described under Byte literal.

In either case, the pretoken is rejected if its suffix consists of the single character _.

Character literal

Fine-grained token kind produced: CharacterLiteral

Attributes

The represented character is derived from the pretoken's literal content as follows:

If the literal content is one of the following forms of escape sequence, the represented character is the escape sequence's escaped value:
- Simple escapes
- 7-bit escapes
- Unicode escapes
If the literal content begins with a \ character which did not introduce one of the above forms of escape, the pretoken is rejected.
Otherwise, if the single character that makes up the literal content is LF, CR, or TAB, the pretoken is rejected.
Otherwise the represented character is the single character that makes up the literal content.

suffix: the pretoken's suffix, or empty if that is none

Note: The pretokeniser guarantees the pretoken's literal content is either a single character, or a character sequence beginning with \.

Byte literal

Fine-grained token kind produced: ByteLiteral

Attributes

Define a represented character, derived from the pretoken's literal content as follows:

If the literal content is one of the following forms of escape sequence, the represented character is the escape sequence's escaped value:
- Simple escapes
- 8-bit escapes
If the literal content begins with a \ character which did not introduce one of the above forms of escape, the pretoken is rejected.
Otherwise, if the single character that makes up the literal content is LF, CR, or TAB, the pretoken is rejected.
Otherwise, if the single character that makes up the literal content has a Unicode scalar value greater than 127, the pretoken is rejected.
Otherwise the represented character is the single character that makes up the literal content.

The represented byte is the represented character's Unicode scalar value.

suffix: the pretoken's suffix, or empty if that is none

Note: The pretokeniser guarantees the pretoken's literal content is either a single character, or a character sequence beginning with \.
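The byte literal rules can be sketched as follows. This is a minimal sketch with a hypothetical function name. It assumes the simple escapes are `\0`, `\t`, `\n`, `\r`, `\"`, `\'`, and `\\`, and that an 8-bit escape is `\x` followed by exactly two hexadecimal digits (those definitions aren't restated in this section).

```rust
/// Reprocesses the literal content of a byte literal.
/// Returns the represented byte, or `None` if the pretoken is rejected.
pub fn reprocess_byte_literal(content: &str) -> Option<u8> {
    let mut chars = content.chars();
    let first = chars.next()?;
    let rest = chars.as_str();
    if first == '\\' {
        return match rest {
            // Simple escapes (assumed table, see above).
            "0" => Some(0x00),
            "t" => Some(0x09),
            "n" => Some(0x0a),
            "r" => Some(0x0d),
            "\"" => Some(0x22),
            "'" => Some(0x27),
            "\\" => Some(0x5c),
            _ => {
                // 8-bit escape: \x followed by exactly two hex digits.
                let hex = rest.strip_prefix('x')?;
                if hex.len() == 2 && hex.bytes().all(|b| b.is_ascii_hexdigit()) {
                    u8::from_str_radix(hex, 16).ok()
                } else {
                    None
                }
            }
        };
    }
    // Without a leading backslash the pretokeniser guarantees a single
    // character; this sketch rejects anything else defensively.
    if !rest.is_empty() {
        return None;
    }
    match first {
        '\n' | '\r' | '\t' => None,
        c if (c as u32) > 127 => None,
        c => Some(c as u8),
    }
}
```

For example, the literal content `\xff` yields the byte 255, while the single character `é` is rejected because its scalar value is greater than 127.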

`DoubleQuotedLiteral`

The pretokeniser guarantees the pretoken's prefix attribute is one of the following:

empty, in which case it is reprocessed as described under String literal
the single character b, in which case it is reprocessed as described under Byte-string literal
the single character c, in which case it is reprocessed as described under C-string literal

In each case, the pretoken is rejected if its suffix consists of the single character _.

String literal

Fine-grained token kind produced: StringLiteral

Attributes

The represented string is derived from the pretoken's literal content by replacing each escape sequence of any of the following forms occurring in the literal content with the escape sequence's escaped value.
- Simple escapes
- 7-bit escapes
- Unicode escapes
- String continuation escapes

These replacements take place in left-to-right order. For example, the pretoken with extent "\\x41" is converted to the characters \ x 4 1.

If a \ character appears in the literal content but is not part of one of the above forms of escape, the pretoken is rejected.

If a CR character appears in the literal content and is not part of a string continuation escape, the pretoken is rejected.

suffix: the pretoken's suffix, or empty if that is none

See Wording for string unescaping
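To illustrate the left-to-right replacement order, here is a deliberately simplified sketch. It handles only simple escapes and 7-bit escapes (assumed here to be `\x` followed by two hexadecimal digits with value at most 0x7F), and reports unicode escapes and string continuation escapes as "not modelled" rather than processing them. All names are hypothetical.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Unescaped {
    /// The represented string.
    Accepted(String),
    /// The pretoken is rejected.
    Rejected,
    /// The content uses an escape form this sketch doesn't handle.
    NotModelled,
}

/// Simplified unescaping of a string literal's literal content.
pub fn unescape_simplified(content: &str) -> Unescaped {
    let mut out = String::new();
    let mut chars = content.chars();
    // Consume the content strictly left to right.
    while let Some(c) = chars.next() {
        match c {
            // A CR outside a string continuation escape is rejected.
            '\r' => return Unescaped::Rejected,
            '\\' => match chars.next() {
                Some('0') => out.push('\0'),
                Some('t') => out.push('\t'),
                Some('n') => out.push('\n'),
                Some('r') => out.push('\r'),
                Some('"') => out.push('"'),
                Some('\'') => out.push('\''),
                Some('\\') => out.push('\\'),
                Some('x') => {
                    let hi = chars.next().and_then(|d| d.to_digit(16));
                    let lo = chars.next().and_then(|d| d.to_digit(16));
                    match (hi, lo) {
                        (Some(hi), Some(lo)) if hi <= 7 => {
                            out.push(char::from((hi * 16 + lo) as u8))
                        }
                        _ => return Unescaped::Rejected,
                    }
                }
                // Unicode escapes and string continuation escapes.
                Some('u') | Some('\n') => return Unescaped::NotModelled,
                _ => return Unescaped::Rejected,
            },
            other => out.push(other),
        }
    }
    Unescaped::Accepted(out)
}
```

With this sketch the literal content `\\x41` (from the pretoken with extent "\\x41") is consumed as the simple escape `\\` followed by the plain characters `x41`, so the `\x41` inside it is never seen as an escape.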

Byte-string literal

Fine-grained token kind produced: ByteStringLiteral

If any character whose Unicode scalar value is greater than 127 appears in the literal content, the pretoken is rejected.

Attributes

Define a represented string (a sequence of characters) derived from the pretoken's literal content by replacing each escape sequence of any of the following forms occurring in the literal content with the escape sequence's escaped value.
- Simple escapes
- 8-bit escapes
- String continuation escapes

These replacements take place in left-to-right order. For example, the pretoken with extent b"\\x41" is converted to the characters \ x 4 1.

If a \ character appears in the literal content but is not part of one of the above forms of escape, the pretoken is rejected.

If a CR character appears in the literal content and is not part of a string continuation escape, the pretoken is rejected.

The represented bytes are the sequence of Unicode scalar values of the characters in the represented string.

suffix: the pretoken's suffix, or empty if that is none

See Wording for string unescaping

C-string literal

Fine-grained token kind produced: CStringLiteral

Attributes

The pretoken's literal content is treated as a sequence of items, each of which is either a single Unicode character other than \ or an escape.

The sequence of items is converted to the represented bytes as follows:

Each single Unicode character contributes its UTF-8 representation.
Each simple escape contributes a single byte containing the Unicode scalar value of its escaped value.
Each 8-bit escape contributes a single byte containing the Unicode scalar value of its escaped value.
Each unicode escape contributes the UTF-8 representation of its escaped value.
Each string continuation escape contributes no bytes.

If a \ character appears in the literal content but is not part of one of the above forms of escape, the pretoken is rejected.

If a CR character appears in the literal content and is not part of a string continuation escape, the pretoken is rejected.

If any of the resulting represented bytes have value 0, the pretoken is rejected.

suffix: the pretoken's suffix, or empty if that is none

See Wording for string unescaping

`RawDoubleQuotedLiteral`

The pretokeniser guarantees the pretoken's prefix attribute is one of the following:

the single character r, in which case it is reprocessed as described under Raw string literal
the characters br, in which case it is reprocessed as described under Raw byte-string literal
the characters cr, in which case it is reprocessed as described under Raw C-string literal

In each case, the pretoken is rejected if its suffix consists of the single character _.

Raw string literal

Fine-grained token kind produced: RawStringLiteral

The pretoken is rejected if (and only if) a CR character appears in the literal content.

Attributes

represented string: the pretoken's literal content

suffix: the pretoken's suffix, or empty if that is none

Raw byte-string literal

Fine-grained token kind produced: RawByteStringLiteral

If any character whose Unicode scalar value is greater than 127 appears in the literal content, the pretoken is rejected.

If a CR character appears in the literal content, the pretoken is rejected.

Attributes

represented bytes: the sequence of Unicode scalar values of the characters in the pretoken's literal content

suffix: the pretoken's suffix, or empty if that is none

Raw C-string literal

Fine-grained token kind produced: RawCStringLiteral

If a CR character appears in the literal content, the pretoken is rejected.

Attributes

represented bytes: the UTF-8 encoding of the pretoken's literal content

suffix: the pretoken's suffix, or empty if that is none

If any of the resulting represented bytes have value 0, the pretoken is rejected.
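The three raw forms share most of their checks, which a single function can show side by side. This is a minimal sketch with hypothetical names. For brevity it omits the token's suffix attribute from the result and only applies the rejection of a `_` suffix.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum RawLiteral {
    /// RawStringLiteral, with its represented string.
    RawString(String),
    /// RawByteStringLiteral, with its represented bytes.
    RawByteString(Vec<u8>),
    /// RawCStringLiteral, with its represented bytes.
    RawCString(Vec<u8>),
}

/// Reprocesses a RawDoubleQuotedLiteral pretoken.
/// Returns `None` if the pretoken is rejected.
pub fn reprocess_raw_double_quoted(
    prefix: &str,
    content: &str,
    suffix: Option<&str>,
) -> Option<RawLiteral> {
    // Checks common to all three forms.
    if suffix == Some("_") || content.contains('\r') {
        return None;
    }
    match prefix {
        "r" => Some(RawLiteral::RawString(content.to_owned())),
        "br" => {
            // Every character must have scalar value 127 or less.
            if content.is_ascii() {
                Some(RawLiteral::RawByteString(content.as_bytes().to_vec()))
            } else {
                None
            }
        }
        "cr" => {
            // A `&str` is already UTF-8, so its bytes are the encoding.
            let bytes = content.as_bytes();
            if bytes.contains(&0) {
                None
            } else {
                Some(RawLiteral::RawCString(bytes.to_vec()))
            }
        }
        // The pretokeniser doesn't produce any other prefix.
        _ => None,
    }
}
```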

`IntegerLiteral`

Fine-grained token kind produced: IntegerLiteral

The pretoken is rejected if (and only if):

its digits attribute consists entirely of _ characters; or
its base attribute is binary, and its digits attribute contains any character other than 0, 1, or _; or
its base attribute is octal, and its digits attribute contains any character other than 0, 1, 2, 3, 4, 5, 6, 7, or _.

Attributes

base: copied

digits: copied

suffix: the pretoken's suffix, or empty if that is none

Note: In particular, an IntegerLiteral whose digits is empty is rejected.
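The rejection conditions for IntegerLiteral can be sketched as a predicate on the base and digits attributes. The `Base` enum and function name are hypothetical, and the sketch assumes that the possible bases are binary, octal, decimal, and hexadecimal.

```rust
/// The `base` attribute of a numeric pretoken (hypothetical name).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Base {
    Binary,
    Octal,
    Decimal,
    Hexadecimal,
}

/// Returns true if an IntegerLiteral pretoken with the given
/// base and digits attributes is accepted.
pub fn integer_literal_accepted(base: Base, digits: &str) -> bool {
    // "Consists entirely of _ characters" includes the empty case.
    if digits.chars().all(|c| c == '_') {
        return false;
    }
    match base {
        Base::Binary => digits.chars().all(|c| matches!(c, '0' | '1' | '_')),
        Base::Octal => digits.chars().all(|c| matches!(c, '0'..='7' | '_')),
        // The grammar already restricts the digits for these bases.
        Base::Decimal | Base::Hexadecimal => true,
    }
}
```

So `0b123` (binary, digits `123`) and `0x_` (hexadecimal, digits `_`) are both rejected, while `0b1_0` is accepted.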

`FloatLiteral`

Fine-grained token kind produced: FloatLiteral

Attributes

body: copied

suffix: the pretoken's suffix, or empty if that is none

Parsing Expression Grammar notation

Parsing Expression Grammars are described informally in §2 of Ford 2004.

The notation used in this document is the variant used by the Pest Rust library, so that it's easy to keep in sync with the comparable implementation.

In particular:

the sequencing operator is written explicitly, as ~
the ordered choice operator is |
?, *, and + have their usual senses (as expression suffixes)
{0, 255} is a repetition suffix, meaning "from 0 to 255 repetitions"
the not-predicate (for negative lookahead) is ! (as an expression prefix)
a terminal matching an individual character is written like "x"
a terminal matching a sequence of characters is written like "abc"
a terminal matching a range of characters is written like '0'..'9'
"\"" matches a single " character
"\\" matches a single \ character
"\n" matches a single LF character

The ordered choice operator | has the lowest precedence, so

a ~ b | c ~ d

is equivalent to

( a ~ b ) | ( c ~ d )

The sequencing operator ~ has the next-lowest precedence, so

!"." ~ SOMETHING

is equivalent to

(!".") ~ SOMETHING

"Any character except" is written using the not-predicate and ANY, for example

( !"'" ~ ANY )

matches any single character except '.

See Grammar for raw string literals for a discussion of extensions used to model raw string literals.
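For readers less familiar with PEGs, the operators above can be modelled as small Rust combinators. This is only an illustrative sketch (it is not how Pest or the comparable implementation works internally): each expression is a function which, on success, returns the number of bytes it consumed from the start of the input. All names are hypothetical.

```rust
/// ANY: matches any single character.
pub fn any(input: &str) -> Option<usize> {
    input.chars().next().map(char::len_utf8)
}

/// A terminal such as "abc".
pub fn lit(s: &'static str) -> impl Fn(&str) -> Option<usize> {
    move |input: &str| if input.starts_with(s) { Some(s.len()) } else { None }
}

/// Sequencing: a ~ b
pub fn seq(
    a: impl Fn(&str) -> Option<usize>,
    b: impl Fn(&str) -> Option<usize>,
) -> impl Fn(&str) -> Option<usize> {
    move |input: &str| {
        let n = a(input)?;
        let m = b(&input[n..])?;
        Some(n + m)
    }
}

/// Ordered choice: a | b (b is tried only if a fails).
pub fn choice(
    a: impl Fn(&str) -> Option<usize>,
    b: impl Fn(&str) -> Option<usize>,
) -> impl Fn(&str) -> Option<usize> {
    move |input: &str| a(input).or_else(|| b(input))
}

/// Not-predicate: !a (succeeds without consuming input if a fails).
pub fn not(a: impl Fn(&str) -> Option<usize>) -> impl Fn(&str) -> Option<usize> {
    move |input: &str| match a(input) {
        Some(_) => None,
        None => Some(0),
    }
}
```

With these definitions, `seq(not(lit("'")), any)` corresponds to `( !"'" ~ ANY )`, and `choice(seq(a, b), seq(c, d))` corresponds to `a ~ b | c ~ d`. Because the choice is ordered, `choice(lit("0"), lit("0x"))` applied to `0x3` consumes only `0`: the first alternative which matches wins, even when a later one would match more.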

The complete pretokenisation grammar

The machine-readable Pest grammar for pretokenisation is presented here for convenience.

See Parsing Expression Grammar notation for an explanation of the notation.

This version of the grammar uses Pest's PUSH, PEEK, and POP for the Raw_double_quoted_literal definitions.

ANY, PATTERN_WHITE_SPACE, XID_START, and XID_CONTINUE are built in to Pest and so not defined below.

PRETOKEN_2015 = {
    Whitespace |
    Line_comment |
    Block_comment |
    Unterminated_block_comment |
    Single_quoted_literal |
    Double_quoted_literal_2015 |
    Raw_double_quoted_literal_2015 |
    Unterminated_literal_2015 |
    Float_literal_1 |
    Reserved_float_empty_exponent |
    Reserved_float_e_suffix_restriction |
    Float_literal_2 |
    Reserved_float_based |
    Reserved_integer_e_suffix_restriction |
    Integer_literal |
    Lifetime_or_label |
    Raw_identifier |
    Reserved_prefix_2015 |
    Identifier |
    Punctuation
}

PRETOKEN_2021 = {
    Whitespace |
    Line_comment |
    Block_comment |
    Unterminated_block_comment |
    Single_quoted_literal |
    Double_quoted_literal_2021 |
    Raw_double_quoted_literal_2021 |
    Reserved_literal_2021 |
    Float_literal_1 |
    Reserved_float_empty_exponent |
    Reserved_float_e_suffix_restriction |
    Float_literal_2 |
    Reserved_float_based |
    Reserved_integer_e_suffix_restriction |
    Integer_literal |
    Raw_lifetime_or_label_2021 |
    Reserved_lifetime_or_label_prefix_2021 |
    Lifetime_or_label |
    Raw_identifier |
    Reserved_prefix_2021 |
    Identifier |
    Punctuation
}

PRETOKEN_2024 = {
    Whitespace |
    Line_comment |
    Block_comment |
    Unterminated_block_comment |
    Single_quoted_literal |
    Double_quoted_literal_2021 |
    Raw_double_quoted_literal_2021 |
    Reserved_literal_2021 |
    Reserved_guard_2024 |
    Float_literal_1 |
    Reserved_float_empty_exponent |
    Reserved_float_e_suffix_restriction |
    Float_literal_2 |
    Reserved_float_based |
    Reserved_integer_e_suffix_restriction |
    Integer_literal |
    Raw_lifetime_or_label_2021 |
    Reserved_lifetime_or_label_prefix_2021 |
    Lifetime_or_label |
    Raw_identifier |
    Reserved_prefix_2021 |
    Identifier |
    Punctuation
}


IDENT = { IDENT_START ~ XID_CONTINUE * }
IDENT_START = { XID_START | "_" }
SUFFIX = { IDENT }


Whitespace = { PATTERN_WHITE_SPACE + }

Line_comment = { "//" ~ LINE_COMMENT_CONTENT }
LINE_COMMENT_CONTENT = { ( !"\n" ~ ANY )* }

Block_comment = { "/*" ~ BLOCK_COMMENT_CONTENT ~ "*/" }
BLOCK_COMMENT_CONTENT = { ( Block_comment | !"*/" ~ !"/*" ~ ANY ) * }

Unterminated_block_comment = { "/*" }


Single_quoted_literal = {
    SQ_PREFIX ~
    "'" ~ SQ_CONTENT ~ "'" ~
    SUFFIX ?
}

SQ_PREFIX = { "b" ? }

SQ_CONTENT = {
    "\\" ~ ANY ~ ( !"'" ~ ANY ) * |
    !"'" ~ ANY
}


Double_quoted_literal_2015 = { DQ_PREFIX_2015 ~ DQ_REMAINDER }
Double_quoted_literal_2021 = { DQ_PREFIX_2021 ~ DQ_REMAINDER }

DQ_PREFIX_2015 = { "b" ? }
DQ_PREFIX_2021 = { ( "b" | "c" ) ? }

DQ_REMAINDER = {
    "\"" ~ DQ_CONTENT ~ "\"" ~
    SUFFIX ?
}
DQ_CONTENT = {
    (
        "\\" ~ ANY |
        !"\"" ~ ANY
    ) *
}


Raw_double_quoted_literal_2015 = { RAW_DQ_PREFIX_2015 ~ RAW_DQ_REMAINDER }
Raw_double_quoted_literal_2021 = { RAW_DQ_PREFIX_2021 ~ RAW_DQ_REMAINDER }

RAW_DQ_PREFIX_2015 = { "r" | "br" }
RAW_DQ_PREFIX_2021 = { "r" | "br" | "cr" }

RAW_DQ_REMAINDER = {
    PUSH(HASHES) ~
    "\"" ~ RAW_DQ_CONTENT ~ "\"" ~
    POP ~
    SUFFIX ?
}
RAW_DQ_CONTENT = {
    ( !("\"" ~ PEEK) ~ ANY ) *
}
HASHES = { "#" {0, 255} }

Unterminated_literal_2015 = { "r\"" | "br\"" | "b'" }
Reserved_literal_2021 = { IDENT ~ ( "\"" | "'" ) }

Reserved_guard_2024 = { "##" | "#\"" }

DECIMAL_DIGITS = { ('0'..'9' | "_") * }
HEXADECIMAL_DIGITS = { ('0'..'9' | 'a' .. 'f' | 'A' .. 'F' | "_") * }
LOW_BASE_PRETOKEN_DIGITS = { DECIMAL_DIGITS }
DECIMAL_PART = { '0'..'9' ~ DECIMAL_DIGITS }

RESTRICTED_E_SUFFIX = { ("e"|"E") ~ "_"+ ~ !XID_START ~ XID_CONTINUE }


Float_literal_1 = {
    FLOAT_BODY_WITH_EXPONENT ~ SUFFIX ?
}
Float_literal_2 = {
    FLOAT_BODY_WITHOUT_EXPONENT ~ SUFFIX ? |
    FLOAT_BODY_WITH_FINAL_DOT ~ !"." ~ !IDENT_START
}

FLOAT_BODY_WITH_EXPONENT = {
    DECIMAL_PART ~ ("." ~ DECIMAL_PART ) ? ~
    ("e"|"E") ~ ("+"|"-") ? ~ EXPONENT_DIGITS
}
EXPONENT_DIGITS = { "_" * ~ '0'..'9' ~ DECIMAL_DIGITS }

FLOAT_BODY_WITHOUT_EXPONENT = {
    DECIMAL_PART ~ "." ~ DECIMAL_PART
}

FLOAT_BODY_WITH_FINAL_DOT = {
    DECIMAL_PART ~ "."
}

Reserved_float_empty_exponent = {
    DECIMAL_PART ~ ("." ~ DECIMAL_PART ) ? ~
    ("e"|"E") ~ ("+"|"-")
}
Reserved_float_e_suffix_restriction = {
    DECIMAL_PART ~ "." ~ DECIMAL_PART ~
    RESTRICTED_E_SUFFIX
}
Reserved_float_based = {
    (
        ("0b" | "0o") ~ LOW_BASE_PRETOKEN_DIGITS |
        "0x" ~ HEXADECIMAL_DIGITS
    )  ~  (
        ("e"|"E") ~ ("+"|"-" | EXPONENT_DIGITS) |
        "." ~ !"." ~ !IDENT_START
    )
}

Reserved_integer_e_suffix_restriction = {
    ( INTEGER_BINARY_LITERAL |
      INTEGER_OCTAL_LITERAL |
      INTEGER_DECIMAL_LITERAL ) ~
    RESTRICTED_E_SUFFIX
}

Integer_literal = {
    ( INTEGER_BINARY_LITERAL |
      INTEGER_OCTAL_LITERAL |
      INTEGER_HEXADECIMAL_LITERAL |
      INTEGER_DECIMAL_LITERAL ) ~
    SUFFIX ?
}

INTEGER_BINARY_LITERAL = { "0b" ~ LOW_BASE_PRETOKEN_DIGITS }
INTEGER_OCTAL_LITERAL = { "0o" ~ LOW_BASE_PRETOKEN_DIGITS }
INTEGER_HEXADECIMAL_LITERAL = { "0x" ~ HEXADECIMAL_DIGITS }
INTEGER_DECIMAL_LITERAL = { DECIMAL_PART }


Raw_lifetime_or_label_2021 = { "'r#" ~ IDENT ~ !"'" }

Reserved_lifetime_or_label_prefix_2021 = { "'" ~ IDENT ~ "#" }

Lifetime_or_label = { "'" ~ IDENT ~ !"'" }

Raw_identifier = { "r#" ~ IDENT }

Reserved_prefix_2015 = { "r#" | "br#" }
Reserved_prefix_2021 = { IDENT ~ "#" }

Identifier = { IDENT }


Punctuation = {
    ";" |
    "," |
    "." |
    "(" |
    ")" |
    "{" |
    "}" |
    "[" |
    "]" |
    "@" |
    "#" |
    "~" |
    "?" |
    ":" |
    "$" |
    "=" |
    "!" |
    "<" |
    ">" |
    "-" |
    "&" |
    "|" |
    "+" |
    "*" |
    "/" |
    "^" |
    "%"
}

Rationale for this model

Pretokenising and reprocessing
Using a Parsing Expression Grammar
Producing tokens with attributes

Pretokenising and reprocessing

The split into pretokenisation and reprocessing is primarily a way to make the grammar simpler.

The main advantage is dealing with character, byte, and string literals, where we have to reject invalid escape sequences at lexing time.

In this model, the lexer finds the extent of the token using simple grammar definitions, and then checks whether the escape sequences are valid in a separate "reprocessing" operation. So the grammar "knows" that a backslash character indicates an escape sequence, but doesn't model escapes in any further detail.

In contrast the Reference gives grammar productions which try to describe the available escape sequences in each kind of string-like literal, but this isn't enough to characterise which forms are accepted (for example "\u{d800}" is rejected at lexing time, because there is no character with scalar value D800).

Given that we have this separate operation, we can use it to simplify other parts of the grammar too, including:

distinguishing doc-comments
rejecting CR in comments
rejecting the reserved keywords in raw identifiers, eg r#crate
rejecting no-digit forms like 0x_
rejecting the variants of numeric literals reserved in rfc0879, eg 0b123
rejecting literals with a single _ as a suffix

This means we can avoid adding many "reserved form" definitions. For example, if we didn't accept _ as a suffix in the main string-literal grammar definitions, we'd have to have another Reserved_something definition to prevent the _ being accepted as a separate token.

Given the choice to use locally greedy matching (see below), I think an operation which rejects pretokens after parsing them is necessary to deal with a case like 0b123, to avoid analysing it as 0b1 followed by 23.

Using a Parsing Expression Grammar

I think a PEG is a good formalism for modelling Rust's lexer (though probably not for the rest of the grammar) for several reasons.

Resolving ambiguities

The lexical part of Rust's grammar is necessarily full of ambiguities.

For example:

ab could be a single identifier, or a followed by b
1.2 could be a floating-point literal, or 1 followed by . followed by 2
r"x" could be a raw string literal, or r followed by "x"

The Reference doesn't explicitly state what formalism it's using, or what rule it uses to disambiguate such cases.

There are two common approaches: to choose the longest available match (as Lex traditionally does), or to explicitly list rules in priority order and specify locally "greedy" repetition-matching (as PEGs do).

With the "longest available" approach, additional rules are needed if multiple rules can match with equal length.

The 2024 version of this model characterised the cases where rustc doesn't choose the longest available match, and where (given its choice of rules) there are multiple longest-matching rules.

For example, the Reference's lexer rules for input such as 0x3 allow two interpretations, matching the same extent:

as a hexadecimal integer literal: 0x3 with no suffix
as a decimal integer literal: 0 with a suffix of x3

We want to choose the former interpretation. (We could say that these are both the same kind of token and re-analyse it later to decide which part was the suffix, but we'd still have to understand the distinction inside the lexer in order to reject forms like 0b0123.)

Examples where rustc chooses a token which is shorter than the longest available match are rare. In the model used by the Reference, 0x· is one: rustc treats this as a "reserved number" (0x), rather than 0 with suffix x·. (Note that · has the XID_Continue property but not XID_Start.)

Similarly the changes from pr131656 (allowing forms like 1em) introduce ambiguities which are resolved naturally because the floating-point literal definitions have priority over the integer literal definitions.

Again there are examples where the longest available match isn't chosen. For example 1e2· could be interpreted as an integer decimal literal with suffix e2·, but instead we find the float literal 1e2 and then reject the remainder of the input.

I think that in 2025 it's clear that a priority-based system is a better fit for Rust's lexer.

Generative grammars don't inherently have prioritisation, but parsing expression grammars do.

Ambiguities that must be resolved as errors

There are a number of forms which are errors at lexing time, even though in principle they could be analysed as multiple tokens.

Many cases can be handled in reprocessing, as described above.

Other cases can be handled naturally using a PEG, by writing medium-priority rules to match them, for example:

the rfc3101 "reserved prefixes" (in Rust 2021 and newer): k#abc, f"...", or f'...'
unterminated block comments such as /* abc
forms that look like floating-point literals with a base indicator, such as 0b1.0

In this model, these additional rules produce Reserved pretokens, which are rejected at reprocessing time.

Lookahead

There are two cases where the Reference currently describes the lexer's behaviour using lookahead:

for (possibly raw) lifetime-or-label, to prevent 'ab'c' being analysed as 'ab followed by 'c
for floating-point literals, to make sure that 1.a is analysed as 1 . a rather than 1. a

These are easily modelled using PEG predicates.

Handling raw strings

The biggest weakness of using the PEG formalism is that it can't naturally describe the rule for matching the number of # characters in raw string literals.

See Grammar for raw string literals for discussion.

Adopting language changes

Rustc's lexer is made up of hand-written imperative code, largely using match expressions. It often peeks ahead for a few characters, and it tries to avoid backtracking.

This is a close fit for the behaviour modelled by PEGs, so there's good reason to suppose that it will be easy to update this model for future versions of Rust.

Producing tokens with attributes

This model makes the lexing process responsible for a certain amount of 'interpretation' of the tokens, rather than simply describing how the input source is sliced up and assigning a 'kind' to each resulting token.

The main motivation for this is to deal with string-like literals: it means we don't need to separate the description of the result of "unescaping" strings from the description of which strings contain well-formed escapes.

In particular, describing unescaping at lexing time makes it easy to describe the rule about rejecting NULs in C-strings, even if they were written using an escape.

For numeric literals, the way the suffix is identified isn't always simple (see Resolving ambiguities above); I think it's best to make the lexer responsible for doing it, so that the description of numeric literal expressions doesn't have to.

For identifiers, many parts of the spec will need a notion of equivalence (both for handling raw identifiers and for dealing with NFC normalisation), and some restrictions depend on the normalised form (see ASCII identifiers). I think it's best for the lexer to handle this by defining the represented identifier.

This document treats the lexer's "output" as a stream of tokens which have concrete attributes, but of course it would be equivalent (and I think more usual for a spec) to treat each attribute as an independent defined term, and write things like "the represented character of a character literal token is…".

Grammar for raw string literals

I believe the PEG formalism can't naturally describe Rust's rule for matching the number of # characters in raw string literals.

I can think of the following ways to handle this:

Ad-hoc extension

This writeup uses an ad-hoc extension to the formalism, along similar lines to the stack extension described below (but without needing a full stack).

Pest's stack extension

Pest provides a stack extension which is a good fit for this requirement, and is used in the comparable implementation.

It looks like this:

RAW_DQ_REMAINDER = {
    PUSH(HASHES) ~
    "\"" ~ RAW_DQ_CONTENT ~ "\"" ~
    POP ~
    SUFFIX ?
}
RAW_DQ_CONTENT = {
    ( !("\"" ~ PEEK) ~ ANY ) *
}
HASHES = { "#" {0, 255} }

The notion of matching an expression is extended to include a context stack (a stack of strings): each match attempt takes the stack as an additional input and produces an updated stack as an additional output.

The stack is initially empty.

There are three additional forms of expression: PUSH(e), PEEK, and POP, where e is an arbitrary expression.

PUSH(e) behaves in the same way as the expression e; if it succeeds, it additionally pushes the text consumed by e onto the stack.

PEEK behaves in the same way as a literal string expression, where the string is the top entry of the stack. If the stack is empty, PEEK fails.

POP behaves in the same way as PEEK. If it succeeds, it then pops the top entry from the stack.

All other expressions leave the stack unmodified.
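Because RAW_DQ_REMAINDER only ever has one entry on the stack, its behaviour can be sketched by hand without a general stack. The following is a minimal sketch (the function name is hypothetical, and the optional SUFFIX is left out) which takes the input following the r, br, or cr prefix.

```rust
/// Attempts to match
///   PUSH(HASHES) ~ "\"" ~ RAW_DQ_CONTENT ~ "\"" ~ POP
/// at the start of `input`.
/// On success returns the literal content and the number of bytes consumed.
pub fn match_raw_dq_remainder(input: &str) -> Option<(&str, usize)> {
    // PUSH(HASHES): greedily match up to 255 `#` characters and
    // remember them (this plays the role of the stack entry).
    let hash_count = input
        .bytes()
        .take(255)
        .take_while(|&b| b == b'#')
        .count();
    let hashes = &input[..hash_count];

    // The opening double-quote.
    let after_open = input[hash_count..].strip_prefix('"')?;

    // RAW_DQ_CONTENT consumes characters until `"` followed by the
    // remembered hashes matches (that is, `"` ~ PEEK). If that never
    // happens the content runs to end of input and the match fails.
    let closer = format!("\"{hashes}");
    let content_len = after_open.find(closer.as_str())?;
    let content = &after_open[..content_len];

    // The closing double-quote and POP consume `closer`.
    let consumed = hash_count + 1 + content_len + closer.len();
    Some((content, consumed))
}
```

For example, applied to `#"a"b"#rest` this returns the content `a"b` and consumes 7 bytes. Applied to input beginning with 256 `#` characters it fails, because HASHES matches only 255 of them and the next character isn't `"`.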

Scheme of definitions

Because raw string literals have a limit of 255 # characters, it is in principle possible to model them using a PEG with 256 (pairs of) definitions.

So writing this out as a "scheme" of definitions might be feasible:

RDQ_0 = {
    "\"" ~ RDQ_0_CONTENT ~ "\""
}
RDQ_0_CONTENT = {
    ( !("\"") ~ ANY ) *
}

RDQ_1 = {
    "#"{1} ~
    "\"" ~ RDQ_1_CONTENT ~ "\"" ~
    "#"{1}
}
RDQ_1_CONTENT = {
    ( !("\"" ~ "#"{1}) ~ ANY ) *
}

RDQ_2 = {
    "#"{2} ~
    "\"" ~ RDQ_2_CONTENT ~ "\"" ~
    "#"{2}
}
RDQ_2_CONTENT = {
    ( !("\"" ~ "#"{2}) ~ ANY ) *
}

…

RDQ_255 = {
    "#"{255} ~
    "\"" ~ RDQ_255_CONTENT ~ "\"" ~
    "#"{255}
}
RDQ_255_CONTENT = {
    ( !("\"" ~ "#"{255}) ~ ANY ) *
}

Open questions

Terminology
Presenting reprocessing as a separate pass
Raw string literals
Token kinds and attributes
How to indicate captured text
Wording for string unescaping
How to model shebang removal
String continuation escapes

Terminology

Some of the terms used in this document are taken from pre-existing documentation or rustc's error output, but many of them are new (and so can freely be changed).

Here's a partial list:

Term	Source
pretoken	New
reprocessing	New
fine-grained token	New
compound token	New
literal content	Reference (recent)
simple escape	Reference (recent)
escape sequence	Reference
escaped value	Reference (recent)
string continuation escape	Reference (as `STRING_CONTINUE`)
string representation	Reference (recent)
represented byte	New
represented character	Reference (recent)
represented bytes	Reference (recent)
represented string	Reference (recent)
represented identifier	New
style (of a comment)	rustc internal
body (of a comment)	Reference

Terms listed as "Reference (recent)" are ones I introduced in PRs merged in January 2024.

Presenting reprocessing as a separate pass

This writeup presents pretokenisation and reprocessing in separate sections, with separate but similar definitions for a "Pretoken" and a "Fine-grained token".

That's largely because I wanted the option to have further processing between those two stages which might split or join tokens, as some earlier models have done.

But in this version of the model that flexibility isn't used: one pretoken always corresponds to one fine-grained token (unless the input is rejected).

So it might be possible to drop the distinction between those two types altogether.

In any case I don't think it's necessary to describe reprocessing as a second pass: the conditions for rejecting each type of pretoken, and the definitions of the things which are currently attributes of fine-grained tokens, could be described in the same place as the description of how the pretoken is produced.

Raw string literals

How should raw string literals be documented? See Grammar for raw string literals for some options.

Token kinds and attributes

What kinds and attributes should fine-grained tokens have?

Distinguishing raw and non-raw forms

The current table distinguishes raw from non-raw forms as different top-level "kinds".

I think this distinction will be needed in some cases, but perhaps it would be better represented using an attribute on unified kinds (like rustc_ast::StrStyle and rustc_ast::token::IdentIsRaw).

As an example of where it might be wanted: the proc-macro Display output for raw identifiers includes the r# prefix, but I think simply using the source extent isn't correct because the Display output is NFC-normalised.

Hash count

Should there be an attribute recording the number of hashes in a raw string or byte-string literal? Rustc has something of the sort.

ASCII identifiers

Should there be an attribute indicating whether an identifier is all ASCII? The Reference lists several places where identifiers have this restriction, and it seems natural for the lexer to be responsible for making this check.

The list in the Reference is:

extern crate declarations
External crate names referenced in a path
Module names loaded from the filesystem without a path attribute
no_mangle attributed items
Item names in external blocks

I believe this restriction is applied after NFC-normalisation, so it's best thought of as a restriction on the represented identifier.

Represented bytes for C strings

At present this document says that the sequence of "represented bytes" for C string literals doesn't include the added NUL.

That's following the way the Reference currently uses the term "represented bytes", but rustc includes the NUL in its equivalent piece of data.

Should this writeup change to match rustc?

How to indicate captured text

Some of the nonterminals in the grammar exist only to identify text to be "captured", for example LINE_COMMENT_CONTENT here:

Line_comment = { "//" ~ LINE_COMMENT_CONTENT }
LINE_COMMENT_CONTENT = { ( !"\n" ~ ANY )* }

Would it be better to extend the notation to allow annotating part of an expression without separating out a nonterminal? Pest's "Tags" extension would allow doing this, but it's not a standard feature of PEGs.

Wording for string unescaping

The description of reprocessing for String literals and C-string literals was originally drafted for the Reference. Should there be a more formal definition of unescaping processes than the current "left-to-right order" and "contributes" wording?

I believe that any literal content which will be accepted can be written uniquely as a sequence of (escape-sequence or non-\-character), but I'm not sure that's obvious enough that it can be stated without justification.

This is a place where the comparable implementation isn't closely parallel to the writeup.

How to model shebang removal

This part of the Reference text isn't trying to be rigorous:

As an exception, if the #! characters are followed (ignoring intervening comments or whitespace) by a [ token, nothing is removed. This prevents an inner attribute at the start of a source file being removed.

rustc implements the "ignoring intervening comments or whitespace" part by running its lexer for long enough to see whether the [ is there or not, then discarding the result (see #70528 and #71487 for history).

So should the spec define this in terms of its model of the lexer?

String continuation escapes

rustc has a warning that the behaviour of String continuation escapes (when multiple newlines are skipped) may change in future.

The Reference has a note about this, and points to #1042 for more information.

#136600 asks whether this is intentional.

Rustc oddities

NFC normalisation for lifetime/label

Identifiers are normalised to NFC, which means that Kelvin and Kelvin are treated as representing the same identifier. See rfc2457.

But this doesn't happen for lifetimes or labels, so 'Kelvin and 'Kelvin are different as lifetimes or labels.

For example, this compiles without warning in Rust 1.86, while this doesn't.

In this writeup, the represented identifier attribute of Identifier and RawIdentifier fine-grained tokens is in NFC, and the name attribute of LifetimeOrLabel and RawLifetimeOrLabel tokens isn't.

I think this behaviour is a promising candidate for provoking the "Wait...that's what we currently do? We should fix that." reaction to being given a spec to review.

Filed as rustc #126759.

Nested block comments

The Reference says "Nested block comments are supported".

Rustc implements this by counting occurrences of /* and */, matching greedily. That means it rejects forms like /* xyz /*/.

This writeup includes a !"/*" subexpression in the BLOCK_COMMENT_CONTENT definition to match rustc's behaviour.

The grammar production in the Reference seems to be written to assume that these forms should be accepted (but I think it's garbled anyway: it accepts /* /* */).

I haven't seen any discussion of whether this rustc behaviour is considered desirable.
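The counting behaviour described above can be sketched as follows. This is a minimal sketch of the approach rather than a copy of rustc's code, and the function name is hypothetical.

```rust
/// If `input` begins with a terminated block comment, returns the
/// comment's length in bytes. Returns `None` if the input doesn't
/// begin with `/*` or if the comment is unterminated.
pub fn block_comment_len(input: &str) -> Option<usize> {
    if !input.starts_with("/*") {
        return None;
    }
    // `/` and `*` are ASCII, so scanning bytes is safe for UTF-8 input.
    let bytes = input.as_bytes();
    let mut depth = 1usize;
    let mut i = 2;
    while i < bytes.len() {
        if bytes[i..].starts_with(b"/*") {
            // An opening delimiter is consumed greedily, so the `*`
            // can't also be used as part of a following `*/`.
            depth += 1;
            i += 2;
        } else if bytes[i..].starts_with(b"*/") {
            depth -= 1;
            i += 2;
            if depth == 0 {
                return Some(i);
            }
        } else {
            i += 1;
        }
    }
    None
}
```

Applied to `/* xyz /*/`, the `/*` near the end increases the depth to 2 and the final `/` is left over, so the comment is unterminated.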

Restriction on e-suffixes

With the implementation of pr131656 as of 2025-04-27, support for numeric literal suffixes beginning with e or E is incomplete, and rejects some (very obscure) cases.

A numeric literal token is rejected if:

it doesn't have an exponent; and
it has a suffix of the following form:
- begins with e or E
- immediately followed by one or more _ characters
- immediately followed by a character which has the XID_Continue property but not XID_Start.

For example, 123e_· is rejected.

The Reserved_float_e_suffix_restriction and Reserved_integer_e_suffix_restriction nonterminals describe this restriction in the grammar.
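The form of suffix described above can be sketched as a predicate. The Rust standard library doesn't expose the XID_Start and XID_Continue properties, so this sketch takes them as parameters; the two `stub_` functions are ASCII-only stand-ins (plus U+00B7 `·`) for illustration and are not the real Unicode properties. All names are hypothetical. The predicate checks only the shape of the suffix; the condition that the literal has no exponent is separate.

```rust
/// Returns true if `suffix` has the restricted form:
/// `e` or `E`, then one or more `_`, then a character which is
/// XID_Continue but not XID_Start.
pub fn is_restricted_e_suffix(
    suffix: &str,
    is_xid_start: impl Fn(char) -> bool,
    is_xid_continue: impl Fn(char) -> bool,
) -> bool {
    let rest = match suffix.strip_prefix(|c: char| c == 'e' || c == 'E') {
        Some(rest) => rest,
        None => return false,
    };
    // Skip all the underscores; at least one is required.
    let after_underscores = rest.trim_start_matches('_');
    if after_underscores.len() == rest.len() {
        return false;
    }
    match after_underscores.chars().next() {
        Some(c) => is_xid_continue(c) && !is_xid_start(c),
        None => false,
    }
}

/// Stand-in for XID_Start (illustration only).
pub fn stub_xid_start(c: char) -> bool {
    c.is_ascii_alphabetic()
}

/// Stand-in for XID_Continue (illustration only).
pub fn stub_xid_continue(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_' || c == '\u{b7}'
}
```

With the stand-in predicates, the suffix `e_·` from `123e_·` has the restricted form, while `e_x` and `e·` don't.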
