Introduction

This document contains a description of rustc's lexer, which aims to be both correct and verifiable.

It's accompanied by a reimplementation of the lexer in Rust based on that description (called the "comparable implementation" below), and a framework for comparing its output to rustc's.

Scope

Rust language version

This document describes Rust version 1.85.

That means it describes raw lifetimes/labels and the additional reservations in the 2024 edition, but not

Other statements in this document are intended to be true as of February 2025.

The comparable implementation is intended to be compiled against (and compared against)
rustc 1.87.0-nightly (f8a913b13 2025-02-23)

This branch also documents the behaviour of pr131656
lexer: Treat more floats with empty exponent as valid tokens
as of 2025-03-02; the comparable implementation should be compared against that PR.

Editions

This document describes the editions supported by Rust 1.85:

  • 2015
  • 2018
  • 2021
  • 2024

There are no differences in lexing behaviour between the 2015 and 2018 editions.

In the comparable implementation, "2015" is used to refer to the common behaviour of Rust 2015 and Rust 2018.

Accepted input

This description aims to accept input exactly when rustc's lexer would.

Specifically, it aims to model what's accepted as input to a function-like macro (a procedural macro or a by-example macro using the tt fragment specifier).

It's not attempting to accurately model rustc's "reasons" for rejecting input, or to provide enough information to reproduce error messages similar to rustc's.

It's not attempting to describe rustc's "recovery" behaviour (where input which will be reported as an error provides tokens to later stages of the compiler anyway).

Size limits

This description doesn't attempt to characterise rustc's limits on the size of the input as a whole.

As far as I know, rustc has no limits on the size of individual tokens beyond its limits on the input as a whole. But I haven't tried to test this.

Output form

This document only goes as far as describing how to produce a "least common denominator" stream of tokens.

Further writing will be needed to describe how to convert that stream to forms that fit the (differing) needs of the grammar and the macro systems.

In particular, this representation may be unsuitable for direct use by a description of the grammar because:

  • there's no distinction between identifiers and keywords;
  • there's a single "kind" of token for all punctuation;
  • sequences of punctuation such as :: aren't glued together to make a single token.

(The comparable implementation includes code to make compound punctuation tokens so they can be compared with rustc's, but that process isn't described here.)

Licence

This document and the accompanying lexer implementation are released under the terms of both the MIT license and the Apache License (Version 2.0).

Authorship and source access

© Matthew Woodcraft 2024, 2025

The source code for this document and the accompanying lexer implementation is available at https://github.com/mattheww/lexeywan

Overview

The following processes might be considered to be part of Rust's lexer:

  • Decode: interpret UTF-8 input as a sequence of Unicode characters
  • Clean:
    • Byte order mark removal
    • CRLF normalisation
    • Shebang removal
  • Tokenise: interpret the characters as ("fine-grained") tokens
  • Further processing: to fit the needs of later parts of the spec
    • For example, convert fine-grained tokens to compound tokens
    • possibly different for the grammar and the two macro implementations

This document attempts to completely describe the "Tokenise" process.

Definitions

Byte

For the purposes of this document, byte means the same thing as Rust's u8 (corresponding to a natural number in the range 0 to 255 inclusive).

Character

For the purposes of this document, character means the same thing as Rust's char. That means, in particular:

  • there's exactly one character for each Unicode scalar value
  • the things that Unicode calls "noncharacters" are characters
  • there are no characters corresponding to surrogate code points

Sequence

When this document refers to a sequence of items, it means a finite, but possibly empty, ordered list of those items.

"character sequence" and "sequence of characters" are different ways of saying the same thing.

Prefix of a sequence

When this document talks about a prefix of a sequence, it means "prefix" in the way that abc is a prefix of abcde. The prefix may be empty, or the entire sequence.

NFC normalisation

References to NFC-normalised strings are talking about Unicode's Normalization Form C, defined in Unicode Standard Annex #15.

Processing that happens before tokenising

This document's description of tokenising takes a sequence of characters as input.

rustc obtains that sequence of characters as described in the following subsections.

(This description is taken from the Input format chapter of the Reference.)

Source encoding

Each source file is interpreted as a sequence of Unicode characters encoded in UTF-8. It is an error if the file is not valid UTF-8.

Byte order mark removal

If the first character in the sequence is U+FEFF (BYTE ORDER MARK), it is removed.

CRLF normalization

Each pair of characters U+000D CR immediately followed by U+000A LF is replaced by a single U+000A LF.

Other occurrences of the character U+000D CR are left in place (they are treated as whitespace).

Note: this document's description of tokenisation doesn't assume that the sequence CRLF never appears in its input; that makes it more general than necessary, but should do no harm.

In particular, in places where the Reference says that tokens may not contain "lone CR", this description just says that any CR is rejected.
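
As an illustration, here's a minimal sketch of the byte order mark removal and CRLF normalisation steps, operating on already-decoded input (the function name is this document's invention, not rustc's):

fn remove_bom_and_normalise_crlf(source: &str) -> String {
    // Byte order mark removal: remove a leading U+FEFF if present.
    let source = source.strip_prefix('\u{FEFF}').unwrap_or(source);
    // CRLF normalisation: replace each CR LF pair with a single LF.
    // Other occurrences of CR are left in place.
    source.replace("\r\n", "\n")
}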

Shebang removal

If the remaining sequence begins with the characters #!, the characters up to and including the first U+000A LF are removed from the sequence.

For example, the first line of the following file would be ignored:

#!/usr/bin/env rustx

fn main() {
    println!("Hello!");
}

As an exception, if the #! characters are followed (ignoring intervening comments or whitespace) by a [ punctuation token, nothing is removed. This prevents an inner attribute at the start of a source file being removed.

See open question: How to model shebang removal

Tokenising

This phase of processing takes a character sequence (the input), and either:

  • succeeds, producing a sequence of fine-grained tokens (the output); or
  • fails, in which case the input is rejected.

The analysis depends on the Rust edition which is in effect when the input is processed.

So strictly speaking, the edition is a second parameter to the process described here.

Tokenisation is described using two operations:

  • pretokenising: extracting a pretoken from the start of the input
  • reprocessing: converting that pretoken to a fine-grained token

Either operation can cause lexical analysis to fail.

Note: If lexical analysis succeeds, concatenating the extents of the produced tokens produces an exact copy of the input.

Process

The process is to repeat the following steps until the input is empty:

  1. extract a pretoken from the start of the input
  2. reprocess that pretoken

If no step determines that lexical analysis should fail, the output is the sequence of fine-grained tokens produced by the repetitions of the second step.

Note: Each fine-grained token corresponds to one pretoken, representing exactly the same characters from the input; reprocessing doesn't involve any combination or splitting.

Note: it doesn't make any difference whether we treat this as one pass with interleaved pretoken-extraction and reprocessing, or as two passes. The comparable implementation uses a single interleaved pass, which means when it reports an error it describes the earliest part of the input which caused trouble.
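
As a sketch, the whole process can be written as the following loop, where extract_pretoken and reprocess are hypothetical stand-ins for the two operations described in the rest of this chapter:

struct Pretoken;         // stands in for the pretokens described below
struct FineGrainedToken; // stands in for the fine-grained tokens
struct LexError;         // indicates that lexical analysis fails

// Hypothetical: implements "Pretokenising" below, returning the pretoken
// and the remaining input after the pretoken's extent is removed.
fn extract_pretoken(_input: &str) -> Result<(Pretoken, &str), LexError> {
    unimplemented!()
}

// Hypothetical: implements "Reprocessing" below.
fn reprocess(_pretoken: Pretoken) -> Result<FineGrainedToken, LexError> {
    unimplemented!()
}

fn tokenise(mut input: &str) -> Result<Vec<FineGrainedToken>, LexError> {
    let mut output = Vec::new();
    while !input.is_empty() {
        let (pretoken, rest) = extract_pretoken(input)?; // step 1
        output.push(reprocess(pretoken)?);               // step 2
        input = rest;
    }
    Ok(output)
}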

Pretokenising

Pretokenisation works from an ordered list of rules.

See pretokenisation rules for the list (which depends on the Rust edition which is in effect).

To extract a pretoken from the input:

  • Apply each rule to the input.

  • If no rules succeed, lexical analysis fails.

  • Otherwise, the extracted pretoken's extent, kind, and attributes are determined (as described for each rule below) by the successful rule which appears earliest in the list.

  • Remove the extracted pretoken's extent from the start of the input.

Note: If lexical analysis succeeds, concatenating the extents of the pretokens extracted during the analysis produces an exact copy of the input.

See open question Rule priority
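
A sketch of that selection, assuming a hypothetical Rule type whose apply method behaves as described under "Applying rules" below:

struct Rule;     // one pretokenisation rule; the slice below is in priority order
struct Pretoken; // the extracted pretoken (extent, kind, and attributes)

impl Rule {
    // Hypothetical: implements "Applying rules" below (None means the rule fails).
    fn apply(&self, _input: &str) -> Option<Pretoken> {
        unimplemented!()
    }
}

// Returns the pretoken from the earliest successful rule, or None if no
// rule succeeds (in which case lexical analysis fails).
fn choose_rule(rules: &[Rule], input: &str) -> Option<Pretoken> {
    rules.iter().find_map(|rule| rule.apply(input))
}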

Rules

Each rule has a pattern (see patterns) and a set of forbidden followers (a set of characters).

A rule may also have a constraint (see constrained pattern matches).

The result of applying a rule to a character sequence is either:

  • the rule fails; or
  • the rule succeeds, and reports
    • an extent, which is a prefix of the character sequence
    • a pretoken kind
    • values for the attributes appropriate to that kind of pretoken

Note: a given rule always reports the same pretoken kind, but some pretoken kinds are reported by multiple rules.

Applying rules

To apply a rule to a character sequence:

  • Attempt to match the rule's pattern against each prefix of the sequence.
  • If no prefix matches the pattern, the rule fails.
  • Otherwise the extent is the longest prefix which matches the pattern.
  • If the extent is not the entire character sequence, and the character in the sequence which immediately follows the extent is in the rule's set of forbidden followers, the rule fails.
  • Otherwise the rule succeeds.

The description of each rule below says how the pretoken kind and attribute values are determined when the rule succeeds.
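
For rules without a constraint, this can be sketched using the regex crate (anchoring the pattern at the start and relying on Regex::find() for the longest matching prefix, as the note under Patterns below describes); the Rule fields here are this document's invention:

use regex::Regex;

struct Rule {
    pattern: Regex,                       // compiled with a leading \A anchor
    forbidden_followers: &'static [char], // usually empty
}

// Returns the extent if the rule succeeds, or None if it fails.
fn apply<'a>(rule: &Rule, input: &'a str) -> Option<&'a str> {
    // For an anchored pattern this yields the longest matching prefix
    // (the patterns used here make that match unambiguous).
    let end = rule.pattern.find(input)?.end();
    // If the extent isn't the whole input, check the following character.
    if let Some(next) = input[end..].chars().next() {
        if rule.forbidden_followers.contains(&next) {
            return None;
        }
    }
    Some(&input[..end])
}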

Constrained pattern matches

Each rule which has a constraint defines what is required for a sequence of characters to satisfy its constraint.

Or more formally: a constraint is a predicate function defined on sequences of characters.

When a rule which has a constraint is applied to a character sequence, the resulting extent is the shortest maximal match, defined as follows:

  • The candidates are the prefixes of the character sequence which match the rule's pattern and also satisfy the constraint.

  • The successor of a prefix of the character sequence is the prefix which is one character longer (the prefix which is the entire sequence has no successor).

  • The shortest maximal match is the shortest candidate whose successor is not a candidate (or which has no successor).

Note: constraints are used only for block comments and for raw string literals with hashes.

For the block comments rule it would be equivalent to say that the shortest match becomes the extent.

For raw string literals, the "shortest maximal match" behaviour is a way to get the mix of non-greedy and greedy matching we need: the rule as a whole has to be non-greedy so that it doesn't jump to the end of a later literal, but the suffix needs to be matched greedily.
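
The "shortest maximal match" definition can be restated as a search over prefix lengths; this sketch assumes a hypothetical is_candidate predicate reporting whether the prefix of a given length (in characters) matches the rule's pattern and satisfies its constraint:

// Returns the length of the shortest maximal match, or None if the rule fails.
fn shortest_maximal_match(
    sequence_len: usize,
    is_candidate: impl Fn(usize) -> bool,
) -> Option<usize> {
    (0..=sequence_len).find(|&n| {
        // A candidate whose successor is not a candidate, or which is the
        // entire sequence and so has no successor.
        is_candidate(n) && (n == sequence_len || !is_candidate(n + 1))
    })
}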

Pretokens

Each pretoken has an extent, which is a sequence of characters taken from the input.

Each pretoken has a kind, and possibly also some attributes, as described in the tables below.

Kind                       Attributes
Reserved                   (none)
Whitespace                 (none)
LineComment                comment content
BlockComment               comment content
Punctuation                mark
Identifier                 identifier
RawIdentifier              identifier
LifetimeOrLabel            name
RawLifetimeOrLabel         name
SingleQuoteLiteral         prefix, literal content, suffix
DoubleQuoteLiteral         prefix, literal content, suffix
RawDoubleQuoteLiteral      prefix, literal content, suffix
IntegerDecimalLiteral      digits, suffix
IntegerHexadecimalLiteral  digits, suffix
IntegerOctalLiteral        digits, suffix
IntegerBinaryLiteral       digits, suffix
FloatLiteral               has base, body, exponent digits, suffix

These attributes have the following types:

Attribute        Type
body             sequence of characters
digits           sequence of characters
exponent digits  either a sequence of characters, or none
has base         true or false
identifier       sequence of characters
literal content  sequence of characters
comment content  sequence of characters
mark             single character
name             sequence of characters
prefix           sequence of characters
suffix           sequence of characters

Patterns

A pattern has two functions:

  • To answer the question "does this sequence of characters match the pattern?"
  • When the answer is yes, to capture zero or more named groups of characters.

The patterns in this document use the notation from the well-known Rust regex crate.

Specifically, the notation is to be interpreted in verbose mode (ignore_whitespace) and with . allowed to match newlines (dot_matches_new_line).

See open question Pattern notation.

Patterns are always used to match against a fixed-length sequence of characters (as if the pattern was anchored at both ends).

Other than for constrained pattern matches, the comparable implementation anchors to the start but not the end, relying on Regex::find() to find the longest matching prefix.

Named capture groups (e.g. (?<suffix> … )) are used in the patterns to supply character sequences used to determine attribute values.

Sets of characters

In particular, the following notations are used to specify sets of Unicode characters:

\p{Pattern_White_Space}

refers to the set of characters which have the Pattern_White_Space Unicode property, which are:

U+0009 (horizontal tab, '\t')
U+000A (line feed, '\n')
U+000B (vertical tab)
U+000C (form feed)
U+000D (carriage return, '\r')
U+0020 (space, ' ')
U+0085 (next line)
U+200E (left-to-right mark)
U+200F (right-to-left mark)
U+2028 (line separator)
U+2029 (paragraph separator)

Note: This set doesn't change in updated Unicode versions.

\p{XID_Start}

refers to the set of characters which have the XID_Start Unicode property (as of Unicode 16.0.0).

\p{XID_Continue}

refers to the set of characters which have the XID_Continue Unicode property (as of Unicode 16.0.0).

The Reference adds the following when discussing identifiers: "Zero width non-joiner (ZWNJ U+200C) and zero width joiner (ZWJ U+200D) characters are not allowed in identifiers." Those characters don't have XID_Start or XID_Continue, so that's only informative text, not an additional constraint.
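
These sets can be tested programmatically; here's a sketch using the unicode-ident crate (note that the crate tracks a specific Unicode version, which is not necessarily 16.0.0):

fn is_identifier_start(c: char) -> bool {
    // [ \p{XID_Start} _ ], as used in the identifier rules below
    c == '_' || unicode_ident::is_xid_start(c)
}

fn is_identifier_continue(c: char) -> bool {
    // \p{XID_Continue}
    unicode_ident::is_xid_continue(c)
}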

Table of contents

Whitespace
Line comment
Block comment
Unterminated block comment
Reserved hash forms (Rust 2024)
Punctuation
Single-quoted literal
Raw lifetime or label (Rust 2021 and 2024)
Reserved lifetime or label prefix (Rust 2021 and 2024)
Non-raw lifetime or label
Double-quoted non-raw literal (Rust 2015 and 2018)
Double-quoted non-raw literal (Rust 2021 and 2024)
Double-quoted hashless raw literal (Rust 2015 and 2018)
Double-quoted hashless raw literal (Rust 2021 and 2024)
Double-quoted hashed raw literal (Rust 2015 and 2018)
Double-quoted hashed raw literal (Rust 2021 and 2024)
Float literal with signed exponent
Float literal with signless exponent
Float literal without exponent
Float literal with final dot
Integer binary literal
Integer octal literal
Integer hexadecimal literal
Integer decimal literal
Raw identifier
Unterminated literal (Rust 2015 and 2018)
Reserved prefix or unterminated literal (Rust 2021 and 2024)
Non-raw identifier

The list of pretokenisation rules

The list of pretokenisation rules is given below.

Rules whose names indicate one or more editions are included in the list only when one of those editions is in effect.

Unless otherwise stated, a rule has no constraint and has an empty set of forbidden followers.

When an attribute value is given below as "captured characters", the value of that attribute is the sequence of characters captured by the capture group in the pattern whose name is the same as the attribute's name.

Whitespace

Pattern
[ \p{Pattern_White_Space} ] +
Pretoken kind

Whitespace

Attributes

(none)

Line comment

Pattern
/ /
(?<comment_content>
  [^ \n] *
)
Pretoken kind

LineComment

Attributes
comment content: captured characters

Block comment

Pattern
/ \*
(?<comment_content>
  . *
)
\* /
Constraint

The constraint is satisfied if (and only if) the following block of Rust code evaluates to true, when character_sequence represents an iterator over the sequence of characters being tested against the constraint.

{
    let mut depth = 0_isize;
    let mut after_slash = false;
    let mut after_star = false;
    for c in character_sequence {
        match c {
            '*' if after_slash => {
                depth += 1;
                after_slash = false;
            }
            '/' if after_star => {
                depth -= 1;
                after_star = false;
            }
            _ => {
                after_slash = c == '/';
                after_star = c == '*';
            }
        }
    }
    depth == 0
}
Pretoken kind

BlockComment

Attributes
comment content: captured characters

See also Defining the block-comment constraint

Unterminated block comment

Pattern
/ \*
Pretoken kind

Reserved

Attributes

(none)

Reserved hash forms (Rust 2024)

Pattern
\#
( \# | " )
Pretoken kind

Reserved

Attributes

(none)

Punctuation

Pattern
[
  ; , \. \( \) \{ \} \[ \] @ \# ~ \? : \$ = ! < > \- & \| \+ \* / ^ %
]
Pretoken kind

Punctuation

Attributes
mark: the single character matched by the pattern

Note: When this pattern matches, the matched character sequence is necessarily one character long.

Single-quoted literal

Pattern
(?<prefix>
  b ?
)
'
(?<literal_content>
  [^ \\ ' ]
|
  \\ . [^']*
)
'
(?<suffix>
  (?:
    [ \p{XID_Start} _ ]
    \p{XID_Continue} *
  ) ?
)
Pretoken kind

SingleQuoteLiteral

Attributes
prefix: captured characters
literal content: captured characters
suffix: captured characters

Raw lifetime or label (Rust 2021 and 2024)

Pattern
' r \#
(?<name>
  [ \p{XID_Start} _ ]
  \p{XID_Continue} *
)

Forbidden followers:

  • The character '
Pretoken kind

RawLifetimeOrLabel

Attributes
name: captured characters

Reserved lifetime or label prefix (Rust 2021 and 2024)

Pattern
'
[ \p{XID_Start} _ ]
\p{XID_Continue} *
\#
Pretoken kind

Reserved

Attributes

(none)

Non-raw lifetime or label

Pattern
'
(?<name>
  [ \p{XID_Start} _ ]
  \p{XID_Continue} *
)

Forbidden followers:

  • The character '
Pretoken kind

LifetimeOrLabel

Attributes
name: captured characters

Note: the forbidden follower here makes sure that forms like 'aaa'bbb are not accepted.

Double-quoted non-raw literal (Rust 2015 and 2018)

Pattern
(?<prefix>
  b ?
)
"
(?<literal_content>
  (?:
    [^ \\ " ]
  |
    \\ .
  ) *
)
"
(?<suffix>
  (?:
    [ \p{XID_Start} _ ]
    \p{XID_Continue} *
  ) ?
)
Pretoken kind

DoubleQuoteLiteral

Attributes
prefix: captured characters
literal content: captured characters
suffix: captured characters

Double-quoted non-raw literal (Rust 2021 and 2024)

Pattern
(?<prefix>
  [bc] ?
)
"
(?<literal_content>
  (?:
    [^ \\ " ]
  |
    \\ .
  ) *
)
"
(?<suffix>
  (?:
    [ \p{XID_Start} _ ]
    \p{XID_Continue} *
  ) ?
)
Pretoken kind

DoubleQuoteLiteral

Attributes
prefix: captured characters
literal content: captured characters
suffix: captured characters

Note: the difference between the 2015/2018 and 2021/2024 patterns is that the 2021/2024 pattern allows c as a prefix.

Double-quoted hashless raw literal (Rust 2015 and 2018)

Pattern
(?<prefix>
  r | br
)
"
(?<literal_content>
  [^"] *
)
"
(?<suffix>
  (?:
    [ \p{XID_Start} _ ]
    \p{XID_Continue} *
  ) ?
)
Pretoken kind

RawDoubleQuoteLiteral

Attributes
prefix: captured characters
literal content: captured characters
suffix: captured characters

Double-quoted hashless raw literal (Rust 2021 and 2024)

Pattern
(?<prefix>
  r | br | cr
)
"
(?<literal_content>
  [^"] *
)
"
(?<suffix>
  (?:
    [ \p{XID_Start} _ ]
    \p{XID_Continue} *
  ) ?
)
Pretoken kind

RawDoubleQuoteLiteral

Attributes
prefix: captured characters
literal content: captured characters
suffix: captured characters

Note: the difference between the 2015/2018 and 2021/2024 patterns is that the 2021/2024 pattern allows cr as a prefix.

Note: we can't treat the hashless rule as a special case of the hashed one because the "shortest maximal match" rule doesn't work without hashes (consider r"x"").

Double-quoted hashed raw literal (Rust 2015 and 2018)

Pattern
(?<prefix>
  r | br
)
(?<hashes_1>
  \# {1,255}
)
"
(?<literal_content>
  . *
)
"
(?<hashes_2>
  \# {1,255}
)
(?<suffix>
  (?:
    [ \p{XID_Start} _ ]
    \p{XID_Continue} *
  ) ?
)
Constraint

The constraint is satisfied if (and only if) the character sequence captured by the hashes_1 capture group is equal to the character sequence captured by the hashes_2 capture group.

Pretoken kind

RawDoubleQuoteLiteral

Attributes
prefix: captured characters
literal content: captured characters
suffix: captured characters

Double-quoted hashed raw literal (Rust 2021 and 2024)

Pattern
(?<prefix>
  r | br | cr
)
(?<hashes_1>
  \# {1,255}
)
"
(?<literal_content>
  . *
)
"
(?<hashes_2>
  \# {1,255}
)
(?<suffix>
  (?:
    [ \p{XID_Start} _ ]
    \p{XID_Continue} *
  ) ?
)
Constraint

The constraint is satisfied if (and only if) the character sequence captured by the hashes_1 capture group is equal to the character sequence captured by the hashes_2 capture group.

Pretoken kind

RawDoubleQuoteLiteral

Attributes
prefix: captured characters
literal content: captured characters
suffix: captured characters

Note: the difference between the 2015/2018 and 2021/2024 patterns is that the 2021/2024 pattern allows cr as a prefix.

Float literal with signed exponent

Pattern
(?<body>
  (?:
    (?<based>
      (?: 0b | 0o )
      [ 0-9 _ ] *
    )
  |
    [ 0-9 ]
    [ 0-9 _ ] *
  )
  (?:
    \.
    [ 0-9 ]
    [ 0-9 _ ] *
  ) ?
  [eE]
  [+-]
  (?<exponent_digits>
    [ 0-9 _ ] *
  )
)
(?<suffix>
  (?:
    [ \p{XID_Start} ]
    \p{XID_Continue} *
  ) ?
)
Pretoken kind

FloatLiteral

Attributes
has base: true if the based capture group participates in the match, false otherwise
body: captured characters
exponent digits: captured characters
suffix: captured characters

Float literal with signless exponent

Pattern
(?<body>
  (?:
    (?<based>
      (?: 0b | 0o )
      [ 0-9 _ ] *
    )
  |
    [ 0-9 ]
    [ 0-9 _ ] *
  )
  (?:
    \.
    [ 0-9 ]
    [ 0-9 _ ] *
  ) ?
  [eE]
  (?<exponent_digits>
    _ *
    [ 0-9 ]
    [ 0-9 _ ] *
  )
)
(?<suffix>
  (?:
    [ \p{XID_Start} ]
    \p{XID_Continue} *
  ) ?
)
Pretoken kind

FloatLiteral

Attributes
has base: true if the based capture group participates in the match, false otherwise
body: captured characters
exponent digits: captured characters
suffix: captured characters

Float literal without exponent

Pattern
(?<body>
  (?:
    (?<based>
      (?: 0b | 0o )
      [ 0-9 _ ] *
    |
      0x
      [ 0-9 a-f A-F _ ] *
    )
  |
    [ 0-9 ]
    [ 0-9 _ ] *
  )
  \.
  [ 0-9 ]
  [ 0-9 _ ] *
)
(?<suffix>
  (?:
    \p{XID_Start}
    \p{XID_Continue} *
  ) ?
)
Pretoken kind

FloatLiteral

Attributes
has base: true if the based capture group participates in the match, false otherwise
body: captured characters
exponent digits: none
suffix: captured characters

Float literal with final dot

Pattern
(?:
  (?<based>
    (?: 0b | 0o )
    [ 0-9 _ ] *
  |
    0x
    [ 0-9 a-f A-F _ ] *
  )
|
  [ 0-9 ]
  [ 0-9 _ ] *
)
\.

Forbidden followers:

  • The character _
  • The character .
  • The characters with the Unicode property XID_Start
Pretoken kind

FloatLiteral

Attributes
has base: true if the based capture group participates in the match, false otherwise
body: the entire character sequence matched by the pattern
exponent digits: none
suffix: empty character sequence

Integer binary literal

Pattern
0b
(?<digits>
  [ 0-9 _ ] *
)
(?<suffix>
  (?:
    \p{XID_Start}
    \p{XID_Continue} *
  ) ?
)
Pretoken kind

IntegerBinaryLiteral

Attributes
digits: captured characters
suffix: captured characters

Integer octal literal

Pattern
0o
(?<digits>
  [ 0-9 _ ] *
)
(?<suffix>
  (?:
    \p{XID_Start}
    \p{XID_Continue} *
  ) ?
)
Pretoken kind

IntegerOctalLiteral

Attributes
digits: captured characters
suffix: captured characters

Integer hexadecimal literal

Pattern
0x
(?<digits>
  [ 0-9 a-f A-F _ ] *
)
(?<suffix>
  (?:
    [ \p{XID_Start} -- aAbBcCdDeEfF]
    \p{XID_Continue} *
  ) ?
)
Pretoken kind

IntegerHexadecimalLiteral

Attributes
digits: captured characters
suffix: captured characters

Integer decimal literal

Pattern
(?<digits>
  [ 0-9 ]
  [ 0-9 _ ] *
)
(?<suffix>
  (?:
    \p{XID_Start}
    \p{XID_Continue} *
  ) ?
)
Pretoken kind

IntegerDecimalLiteral

Attributes
digits: captured characters
suffix: captured characters

Note: it is important that this rule has lower priority than the other numeric literal rules. See Integer literal base-vs-suffix ambiguity.

Raw identifier

Pattern
r \#
(?<identifier>
  [ \p{XID_Start} _ ]
  \p{XID_Continue} *
)
Pretoken kind

RawIdentifier

Attributes
identifier: captured characters

Unterminated literal (Rust 2015 and 2018)

Pattern
( r \# | b r \# | r " | b r " | b ' )

Note: I believe the double-quoted forms here aren't strictly needed: if this rule is chosen when its pattern matched via one of those forms then the input must be rejected eventually anyway.

Pretoken kind

Reserved

Attributes

(none)

Reserved prefix or unterminated literal (Rust 2021 and 2024)

Pattern
[ \p{XID_Start} _ ]
\p{XID_Continue} *
( \# | " | ' )
Pretoken kind

Reserved

Attributes

(none)

Non-raw identifier

Pattern
(?<identifier>
  [ \p{XID_Start} _ ]
  \p{XID_Continue} *
)
Pretoken kind

Identifier

Attributes
identifier: captured characters

Note: this is following the specification in Unicode Standard Annex #31 for Unicode version 16.0, with the addition of permitting underscore as the first character.

Reprocessing

Reprocessing examines a pretoken, and either accepts it (producing a fine-grained token), or rejects it (causing lexical analysis to fail).

Note: reprocessing behaves in the same way in all Rust editions.

The effect of reprocessing each kind of pretoken is given in List of reprocessing cases.

Fine-grained tokens

Reprocessing produces fine-grained tokens.

Each fine-grained token has an extent, which is a sequence of characters taken from the input.

Each fine-grained token has a kind, and possibly also some attributes, as described in the tables below.

Kind                  Attributes
Whitespace            (none)
LineComment           style, body
BlockComment          style, body
Punctuation           mark
Identifier            represented identifier
RawIdentifier         represented identifier
LifetimeOrLabel       name
RawLifetimeOrLabel    name
CharacterLiteral      represented character, suffix
ByteLiteral           represented byte, suffix
StringLiteral         represented string, suffix
RawStringLiteral      represented string, suffix
ByteStringLiteral     represented bytes, suffix
RawByteStringLiteral  represented bytes, suffix
CStringLiteral        represented bytes, suffix
RawCStringLiteral     represented bytes, suffix
IntegerLiteral        base, digits, suffix
FloatLiteral          body, suffix

These attributes have the following types:

Attribute               Type
base                    binary / octal / decimal / hexadecimal
body                    sequence of characters
digits                  sequence of characters
mark                    single character
name                    sequence of characters
represented byte        single byte
represented bytes       sequence of bytes
represented character   single character
represented identifier  sequence of characters
represented string      sequence of characters
style                   non-doc / inner doc / outer doc
suffix                  sequence of characters

Notes:

At this stage:

  • Both _ and keywords are treated as instances of Identifier.
  • There are explicit tokens representing whitespace and comments.
  • Single-character tokens are used for all punctuation.
  • A lifetime (or label) is represented as a single token (which includes the leading ').

Escape processing

The descriptions of the effect of reprocessing string and character literals make use of several forms of escape.

Each form of escape is characterised by:

  • an escape sequence: a sequence of characters, which always begins with \
  • an escaped value: either a single character or an empty sequence of characters

In the definitions of escapes below:

  • An octal digit is any of the characters in the range 0..=7.
  • A hexadecimal digit is any of the characters in the ranges 0..=9, a..=f, or A..=F.

Simple escapes

Each sequence of characters occurring in the first column of the following table is an escape sequence.

In each case, the escaped value is the character given in the corresponding entry in the second column.

Escape sequence  Escaped value
\0               U+0000 NUL
\t               U+0009 HT
\n               U+000A LF
\r               U+000D CR
\"               U+0022 QUOTATION MARK
\'               U+0027 APOSTROPHE
\\               U+005C REVERSE SOLIDUS

Note: the escaped value therefore has a Unicode scalar value which can be represented in a byte.

8-bit escapes

The escape sequence consists of \x followed by two hexadecimal digits.

The escaped value is the character whose Unicode scalar value is the result of interpreting the final two characters in the escape sequence as a hexadecimal integer, as if by u8::from_str_radix with radix 16.

Note: the escaped value therefore has a Unicode scalar value which can be represented in a byte.

7-bit escapes

The escape sequence consists of \x followed by an octal digit then a hexadecimal digit.

The escaped value is the character whose Unicode scalar value is the result of interpreting the final two characters in the escape sequence as a hexadecimal integer, as if by u8::from_str_radix with radix 16.

Unicode escapes

The escape sequence consists of \u{, followed by a hexadecimal digit, followed by a sequence of characters each of which is a hexadecimal digit or _, followed by }, with the restriction that there are no more than six hexadecimal digits in the entire escape sequence.

The escaped value is the character whose Unicode scalar value is the result of interpreting the hexadecimal digits contained in the escape sequence as a hexadecimal integer, as if by u32::from_str_radix with radix 16.
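
As a sketch, the escaped value can be computed as follows, where digits_and_underscores stands for the characters between the braces (a hypothetical precondition is that the escape sequence has already matched the form described above):

fn unicode_escape_value(digits_and_underscores: &str) -> Option<char> {
    let hex: String = digits_and_underscores
        .chars()
        .filter(|&c| c != '_')
        .collect();
    let scalar = u32::from_str_radix(&hex, 16).ok()?;
    // With at most six hexadecimal digits the value always fits in a u32,
    // but it may not be a Unicode scalar value: surrogates and values
    // above 0x10FFFF have no corresponding character.
    char::from_u32(scalar)
}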

String continuation escapes

The escape sequence consists of \ followed immediately by LF, and all following whitespace characters before the next non-whitespace character.

For this purpose, the whitespace characters are HT (U+0009), LF (U+000A), CR (U+000D), and SPACE (U+0020).

The escaped value is an empty sequence of characters.

The Reference says this behaviour may change in future; see String continuation escapes.

Table of contents

Reserved
Whitespace
LineComment
BlockComment
Punctuation
Identifier
RawIdentifier
LifetimeOrLabel
RawLifetimeOrLabel
SingleQuoteLiteral
DoubleQuoteLiteral
RawDoubleQuoteLiteral
IntegerDecimalLiteral
IntegerHexadecimalLiteral
IntegerOctalLiteral
IntegerBinaryLiteral
FloatLiteral

The list of reprocessing cases

The list below has an entry for each kind of pretoken, describing what kind of fine-grained token it produces, how the fine-grained token's attributes are determined, and the circumstances under which a pretoken is rejected.

When an attribute value is given below as "copied", it has the same value as the pretoken's attribute with the same name.

Reserved

A Reserved pretoken is always rejected.

Whitespace

Fine-grained token kind produced: Whitespace

A Whitespace pretoken is always accepted.

LineComment

Fine-grained token kind produced: LineComment

Attributes

style and body are determined from the pretoken's comment content as follows:

  • if the comment content begins with //:

    • style is non-doc
    • body is empty
  • otherwise, if the comment content begins with /,

    • style is outer doc
    • body is the characters from the comment content after that /
  • otherwise, if the comment content begins with !,

    • style is inner doc
    • body is the characters from the comment content after that !
  • otherwise

    • style is non-doc
    • body is empty

The pretoken is rejected if (and only if) the resulting body includes a CR character.

Note: the body of a non-doc comment is ignored by the rest of the compilation process
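
A sketch of this case analysis (the Style type stands in for the style attribute's values):

enum Style {
    NonDoc,
    OuterDoc,
    InnerDoc,
}

// `content` is the pretoken's comment content, ie the characters after
// the initial //.
fn line_comment_style_and_body(content: &str) -> (Style, &str) {
    if content.starts_with("//") {
        (Style::NonDoc, "")     // four or more slashes: not a doc-comment
    } else if let Some(body) = content.strip_prefix('/') {
        (Style::OuterDoc, body) // the comment began ///
    } else if let Some(body) = content.strip_prefix('!') {
        (Style::InnerDoc, body) // the comment began //!
    } else {
        (Style::NonDoc, "")
    }
}

The CR check described above is then applied to the resulting body.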

BlockComment

Fine-grained token kind produced: BlockComment

Attributes

style and body are determined from the pretoken's comment content as follows:

  • if the comment content begins with **:

    • style is non-doc
    • body is empty
  • otherwise, if the comment content begins with * and contains at least one further character,

    • style is outer doc
    • body is the characters from the comment content after that *
  • otherwise, if the comment content begins with !,

    • style is inner doc
    • body is the characters from the comment content after that !
  • otherwise

    • style is non-doc
    • body is empty

The pretoken is rejected if (and only if) the resulting body includes a CR character.

Note: it follows that /**/ and /***/ are not doc-comments

Note: the body of a non-doc comment is ignored by the rest of the compilation process

Punctuation

Fine-grained token kind produced: Punctuation

A Punctuation pretoken is always accepted.

Attributes

mark: copied

Identifier

Fine-grained token kind produced: Identifier

An Identifier pretoken is always accepted.

Attributes

represented identifier: NFC-normalised form of the pretoken's identifier
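
A sketch of that normalisation, using the unicode-normalization crate:

use unicode_normalization::UnicodeNormalization;

fn represented_identifier(identifier: &str) -> String {
    // NFC-normalise the pretoken's identifier attribute.
    identifier.nfc().collect()
}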

RawIdentifier

Fine-grained token kind produced: RawIdentifier

Attributes

represented identifier: NFC-normalised form of the pretoken's identifier

The pretoken is rejected if (and only if) the represented identifier is one of the following sequences of characters:

  • _
  • crate
  • self
  • super
  • Self

LifetimeOrLabel

Fine-grained token kind produced: LifetimeOrLabel

A LifetimeOrLabel pretoken is always accepted.

Attributes

name: copied

Note that the name is not NFC-normalised. See NFC normalisation for lifetime/label.

RawLifetimeOrLabel

Fine-grained token kind produced: RawLifetimeOrLabel

The pretoken is rejected if (and only if) the name is one of the following sequences of characters:

  • _
  • crate
  • self
  • super
  • Self
Attributes

name: copied

Note that the name is not NFC-normalised. See NFC normalisation for lifetime/label.

SingleQuoteLiteral

The pretokeniser guarantees the pretoken's prefix attribute is one of the following:

  • empty, in which case it is reprocessed as described under Character literal
  • the single character b, in which case it is reprocessed as described under Byte literal.

In either case, the pretoken is rejected if its suffix consists of the single character _.

Character literal

Fine-grained token kind produced: CharacterLiteral

Attributes

The represented character is derived from the pretoken's literal content as follows:

  • If the literal content is one of the following forms of escape sequence, the represented character is the escape sequence's escaped value:
    • simple escape
    • 7-bit escape
    • unicode escape

  • If the literal content begins with a \ character which did not introduce one of the above forms of escape, the pretoken is rejected.

  • Otherwise, if the single character that makes up the literal content is LF, CR, or TAB, the pretoken is rejected.

  • Otherwise the represented character is the single character that makes up the literal content.

suffix: copied

Note: The pretokeniser guarantees the pretoken's literal content is either a single character, or a character sequence beginning with \.

Byte literal

Fine-grained token kind produced: ByteLiteral

Attributes

Define a represented character, derived from the pretoken's literal content as follows:

  • If the literal content is one of the following forms of escape sequence, the represented character is the escape sequence's escaped value:
    • simple escape
    • 8-bit escape

  • If the literal content begins with a \ character which did not introduce one of the above forms of escape, the pretoken is rejected.

  • Otherwise, if the single character that makes up the literal content is LF, CR, or TAB, the pretoken is rejected.

  • Otherwise, if the single character that makes up the literal content has a unicode scalar value greater than 127, the pretoken is rejected.

  • Otherwise the represented character is the single character that makes up the literal content.

The represented byte is the represented character's Unicode scalar value.

suffix: copied

Note: The pretokeniser guarantees the pretoken's literal content is either a single character, or a character sequence beginning with \.

DoubleQuoteLiteral

The pretokeniser guarantees the pretoken's prefix attribute is one of the following:

  • empty, in which case it is reprocessed as described under String literal
  • the single character b, in which case it is reprocessed as described under Byte-string literal
  • the single character c, in which case it is reprocessed as described under C-string literal

In each case, the pretoken is rejected if its suffix consists of the single character _.

String literal

Fine-grained token kind produced: StringLiteral

Attributes

The represented string is derived from the pretoken's literal content by replacing each escape sequence of any of the following forms occurring in the literal content with the escape sequence's escaped value:

  • simple escape
  • 7-bit escape
  • unicode escape
  • string continuation escape

These replacements take place in left-to-right order. For example, the pretoken with extent "\\x41" is converted to the characters \ x 4 1.

If a \ character appears in the literal content but is not part of one of the above forms of escape, the pretoken is rejected.

If a CR character appears in the literal content and is not part of a string continuation escape, the pretoken is rejected.

suffix: copied

See Wording for string unescaping
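
A sketch of the left-to-right replacement, assuming a hypothetical take_escape function which recognises a single escape sequence (of the forms listed above) at the start of its argument:

struct Rejected; // stands in for rejecting the pretoken

// Hypothetical: recognises one escape sequence at the start of `rest`,
// returning its escaped value (zero or one characters) and the length of
// the escape sequence in bytes; None if no listed escape is present.
fn take_escape(_rest: &str) -> Option<(Option<char>, usize)> {
    unimplemented!()
}

fn represented_string(literal_content: &str) -> Result<String, Rejected> {
    let mut out = String::new();
    let mut rest = literal_content;
    while let Some(c) = rest.chars().next() {
        if c == '\\' {
            // A \ must introduce one of the listed forms of escape.
            let (value, len) = take_escape(rest).ok_or(Rejected)?;
            out.extend(value);
            rest = &rest[len..];
        } else if c == '\r' {
            // A CR not inside a string continuation escape is rejected.
            return Err(Rejected);
        } else {
            out.push(c);
            rest = &rest[c.len_utf8()..];
        }
    }
    Ok(out)
}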

Byte-string literal

Fine-grained token kind produced: ByteStringLiteral

If any character whose unicode scalar value is greater than 127 appears in the literal content, the pretoken is rejected.

Attributes

Define a represented string (a sequence of characters) derived from the pretoken's literal content by replacing each escape sequence of any of the following forms occurring in the literal content with the escape sequence's escaped value:

  • simple escape
  • 8-bit escape
  • string continuation escape

These replacements take place in left-to-right order. For example, the pretoken with extent b"\\x41" is converted to the characters \ x 4 1.

If a \ character appears in the literal content but is not part of one of the above forms of escape, the pretoken is rejected.

If a CR character appears in the literal content and is not part of a string continuation escape, the pretoken is rejected.

The represented bytes are the sequence of Unicode scalar values of the characters in the represented string.

suffix: copied

See Wording for string unescaping

C-string literal

Fine-grained token kind produced: CStringLiteral

Attributes

The pretoken's literal content is treated as a sequence of items, each of which is either a single Unicode character other than \ or an escape sequence of one of the following forms:

  • simple escape
  • 8-bit escape
  • unicode escape
  • string continuation escape

The sequence of items is converted to the represented bytes as follows:

  • each single Unicode character contributes its UTF-8 encoding
  • each simple escape contributes a single byte containing the Unicode scalar value of its escaped value
  • each 8-bit escape contributes a single byte containing the Unicode scalar value of its escaped value
  • each unicode escape contributes the UTF-8 encoding of its escaped value
  • each string continuation escape contributes no bytes

If a \ character appears in the literal content but is not part of one of the above forms of escape, the pretoken is rejected.

If a CR character appears in the literal content and is not part of a string continuation escape, the pretoken is rejected.

If any of the resulting represented bytes have value 0, the pretoken is rejected.

suffix: copied

See Wording for string unescaping

RawDoubleQuoteLiteral

The pretokeniser guarantees the pretoken's prefix attribute is one of the following:

  • the single character r, in which case it is reprocessed as described under Raw string literal
  • the characters br, in which case it is reprocessed as described under Raw byte-string literal
  • the characters cr, in which case it is reprocessed as described under Raw C-string literal

In each case, the pretoken is rejected if its suffix consists of the single character _.

Raw string literal

Fine-grained token kind produced: RawStringLiteral

The pretoken is rejected if (and only if) a CR character appears in the literal content.

Attributes

represented string: the pretoken's literal content

suffix: copied

Raw byte-string literal

Fine-grained token kind produced: RawByteStringLiteral

If any character whose unicode scalar value is greater than 127 appears in the literal content, the pretoken is rejected.

If a CR character appears in the literal content, the pretoken is rejected.

Attributes

represented bytes: the sequence of Unicode scalar values of the characters in the pretoken's literal content

suffix: copied

Raw C-string literal

Fine-grained token kind produced: RawCStringLiteral

If a CR character appears in the literal content, the pretoken is rejected.

Attributes

represented bytes: the UTF-8 encoding of the pretoken's literal content

suffix: copied

If any of the resulting represented bytes have value 0, the pretoken is rejected.
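
A sketch of this case (Rejected again stands in for rejecting the pretoken):

struct Rejected;

fn raw_c_string_represented_bytes(content: &str) -> Result<Vec<u8>, Rejected> {
    // A CR in the literal content rejects the pretoken.
    if content.contains('\r') {
        return Err(Rejected);
    }
    // The represented bytes are the UTF-8 encoding of the content.
    let bytes = content.as_bytes().to_vec();
    // A represented byte of value 0 rejects the pretoken.
    if bytes.contains(&0) {
        return Err(Rejected);
    }
    Ok(bytes)
}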

IntegerDecimalLiteral

Fine-grained token kind produced: IntegerLiteral

The pretoken is rejected if (and only if) its digits attribute consists entirely of _ characters.

Attributes

base: decimal

digits: copied

suffix: copied

Note: in particular, an IntegerDecimalLiteral whose digits is empty is rejected.
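
The rejection condition can be sketched as follows; note that all() is vacuously true for an empty sequence, which is what makes empty digits rejected:

fn digits_rejected(digits: &str) -> bool {
    // True when digits consists entirely of _ characters (in particular,
    // when it is empty).
    digits.chars().all(|c| c == '_')
}

The same check applies to the other integer literal kinds below (alongside the per-base digit restrictions for octal and binary).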

IntegerHexadecimalLiteral

Fine-grained token kind produced: IntegerLiteral

The pretoken is rejected if (and only if) its digits attribute consists entirely of _ characters.

Attributes

base: hexadecimal

digits: copied

suffix: copied

Note: in particular, an IntegerHexadecimalLiteral whose digits is empty is rejected.

IntegerOctalLiteral

Fine-grained token kind produced: IntegerLiteral

The pretoken is rejected if (and only if):

  • its digits attribute consists entirely of _ characters; or
  • its digits attribute contains any character other than 0, 1, 2, 3, 4, 5, 6, 7, or _.
Attributes

base: octal

digits: copied

suffix: copied

Note: in particular, an IntegerOctalLiteral whose digits is empty is rejected.

IntegerBinaryLiteral

Fine-grained token kind produced: IntegerLiteral

The pretoken is rejected if (and only if):

  • its digits attribute consists entirely of _ characters; or
  • its digits attribute contains any character other than 0, 1, or _.
Attributes

base: binary

digits: copied

suffix: copied

Note: in particular, an IntegerBinaryLiteral whose digits is empty is rejected.

FloatLiteral

Fine-grained token kind produced: FloatLiteral

The pretoken is rejected if (and only if)

  • its has base attribute is true; or
  • its exponent digits attribute is a character sequence which consists entirely of _ characters.
Attributes

body: copied

suffix: copied

Note: in particular, a FloatLiteral whose exponent digits is empty is rejected.

Rationale for this model

Pretokenising

The main difference between the model described in this document and the way the Reference (as of Rust 1.85) describes lexing is the split into pretokenisation and reprocessing.

There are a number of forms which are errors at lexing time, even though in principle they could be analysed as multiple tokens.

Examples include

  • the rfc3101 "reserved prefixes" (in Rust 2021 and newer): k#abc, f"...", or f'...'.
  • the variants of numeric literals reserved in rfc0879, eg 0x1.2 or 0b123
  • adjacent-lifetime-like forms such as 'ab'c
  • stringlike literals with a single _ as a suffix
  • byte or C strings with unacceptable contents that would be accepted in plain strings, eg b"€", b"\u{00a0}", or c"\x00"

The Reference tries to account for some of these cases by adding rules which match the forms that cause errors, while keeping the forms matched by those rules disjoint from the forms matched by the non-error-causing rules.

The resulting rules for reserved prefixes and numeric literals are quite complicated (and still have mistakes). Rules of this sort haven't been attempted for stringlike literals.

The rules are simpler in a model with a 'pretokenising' step which can match a form such as c"\x00" (preventing it being matched as c followed by "\x00"), leaving it to a later stage to decide whether it's a valid token or a lexing-time error.

This separation also gives us a natural way to lex doc and non-doc comments uniformly, and inspect their contents later to make the distinction, rather than trying to write non-overlapping lexer rules as the Reference does.

Lookahead

The model described in this document uses one-character lookahead (beyond the token which will be matched) in the prelexing step, in two cases:

  • the lifetime-or-label rule, to prevent 'ab'c' being analysed as 'ab followed by 'c' (and similarly for the raw-lifetime-or-label rule)
  • the rule for float literals ending in ., to make sure that 1.a is analysed as 1 . a rather than 1. a

I think some kind of lookahead is unavoidable in these cases.

I think the lookahead could be done after prelexing instead, by adding a pass that could reject pretokens or join them together, but I think that would be less clear. (In particular, the float rule would end up using a list of pretoken kinds that start with an identifier character, which seems worse than just looking for such a character.)

Constraints and imperative code

There are two kinds of token which are hard to deal with using a "regular" lexer: raw-string literals (where the number of # characters need to match), and block comments (where the /* and */ delimiters need to be balanced).

Raw-string literals can in principle fit into a regular language because there's a limit of 255 # symbols, but it seems hard to do anything useful with that.

Nested comments can in principle be described using non-regular rules (as the Reference does).

The model described in this document deals with these cases by allowing rules to define constraints beyond the simple pattern match, effectively intervening in the "find the longest match" part of pattern matching.

The constraint for raw strings is simple, but the one for block comments has ended up using imperative code, which doesn't seem very satisfactory. See Defining the block-comment constraint.

Producing tokens with attributes

This model makes the lexing process responsible for a certain amount of 'interpretation' of the tokens, rather than simply describing how the input source is sliced up and assigning a 'kind' to each resulting token.

The main motivation for this is to deal with stringlike literals: it means we don't need to separate the description of the result of "unescaping" strings from the description of which strings contain well-formed escapes.

In particular, describing unescaping at lexing time makes it easy to describe the rule about rejecting NULs in C-strings, even if they were written using an escape.

For numeric literals, the way the suffix is identified isn't always simple (see Integer literal base-vs-suffix ambiguity); I think it's best to make the lexer responsible for doing it, so that the description of numeric literal expressions doesn't have to.

For identifiers, many parts of the spec will need a notion of equivalence (both for handling raw identifiers and for dealing with NFC normalisation), and some restrictions depend on the normalised form (see ASCII identifiers). I think it's best for the lexer to handle this by defining the represented identifier.

This document treats the lexer's "output" as a stream of tokens which have concrete attributes, but of course it would be equivalent (and I think more usual for a spec) to treat each attribute as an independent defined term, and write things like "the represented character of a character literal token is…".

Open questions

Table of contents

Terminology
Pattern notation
Rule priority
Token kinds and attributes
Defining the block-comment constraint
Wording for string unescaping
How to model shebang removal
String continuation escapes

Terminology

Some of the terms used in this document are taken from pre-existing documentation or rustc's error output, but many of them are new (and so can freely be changed).

Here's a partial list:

Term                        Source
pretoken                    New
reprocessing                New
fine-grained token          New
compound token              New
literal content             Reference (recent)
simple escape               Reference (recent)
escape sequence             Reference
escaped value               Reference (recent)
string continuation escape  Reference (as STRING_CONTINUE)
string representation       Reference (recent)
represented byte            New
represented character       Reference (recent)
represented bytes           Reference (recent)
represented string          Reference (recent)
represented identifier      New
style (of a comment)        rustc internal
body (of a comment)         Reference

Terms listed as "Reference (recent)" are ones I introduced in PRs merged in January 2024, so it's not very likely that they've been picked up more widely.

Pattern notation

This document is relying on the regex crate for its pattern notation.

This is convenient for checking that the writeup is the same as the comparable implementation, but it's presumably not suitable for the spec.

The usual thing for specs seems to be to define their own notation from scratch.

Requirements for patterns

I've tried to keep the patterns used here as simple as possible.

There's no use of non-greedy matches.

I think all the uses of alternation are obviously unambiguous.

In particular, all uses of alternation inside repetition have disjoint sets of accepted first characters.

I believe all uses of repetition in the unconstrained patterns have unambiguous termination. That is, anything permitted to follow the repeatable section would not be permitted to start a new repetition. In these cases, the distinction between greedy and non-greedy matches doesn't matter.

Naming sub-patterns

The patterns used in this document are inconveniently repetitive, particularly for the edition-specific rule variants and for numeric literals.

Of course the usual thing is to have a way to define reusable named sub-patterns. So I think addressing this should be treated as part of choosing a pattern notation.

Rule priority

At present this document gives the pretokenisation rules explicit priorities, used to determine which rule is chosen in positions where more than one rule matches.

I believe that in almost all cases it would be equivalent to say that the rule which matches the longest extent is chosen (in particular, if multiple rules match then one has a longer extent than any of the others).

See Integer literal base-vs-suffix ambiguity and Exponent-vs-suffix ambiguity below for the exceptions.

This document uses the order in which the rules are presented as the priority, which has the downside of forcing an unnatural presentation order (for example, Raw identifier and Non-raw identifier are separated).

Perhaps it would be better to say that longest-extent is the primary way to disambiguate, and add a secondary principle to cover the exceptional cases.

The comparable implementation reports (as "model error") any cases (other than the exceptions described below) where the priority principle doesn't agree with the longest-extent principle, or where there wasn't a unique longest match.

Integer literal base-vs-suffix ambiguity

The Reference's lexer rules for input such as 0x3 allow two interpretations, matching the same extent:

  • as a hexadecimal integer literal: 0x3 with no suffix
  • as a decimal integer literal: 0 with a suffix of x3

If the output of the lexer is simply a token with a kind and an extent, this isn't a problem: both cases have the same kind.

But if we want to make the lexer responsible for identifying which part of the input is the suffix, we need to make sure it gets the right answer (ie, the one with no suffix).

Further, there are cases where we need to reject input which matches the rule for a decimal integer literal 0 with a suffix, for example 0b1e2, 0b0123 (see rfc0879), or 0x·.

(Note that · has the XID_Continue property but not XID_Start.)

In these cases we can't avoid dealing with the base-vs-suffix ambiguity in the lexer.

This model uses a separate rule for integer decimal literals, with lower priority than all other numeric literals, to make sure we get these results.

Note that in the 0x· example the extent matched by the lower priority rule is longer than the extent matched by the chosen rule.

If relying on priorities like this seems undesirable, I think it would be possible to rework the rules to avoid it. It might work to allow the difficult cases to pretokenise as decimal integer literals, and have reprocessing reject decimal literal pretokens which begin with a base indicator.

Exponent-vs-suffix ambiguity

Now that numeric literal suffixes can begin with e or E, many cases of float literals with an exponent could also be interpreted as integer or float literals with a suffix, for example 123e4 or 123.4e5.

This model gives the rules for float literals with an exponent higher priority than any other rules for numeric literals, to make sure we get the desired result.

Note that there are again examples where the extent matched by the lower priority rule is longer than the extent matched by the chosen rule. For example 1e2· could be interpreted as an integer decimal literal with suffix e2·, but instead we find the float literal 1e2 and then reject the remainder of the input.

Token kinds and attributes

What kinds and attributes should fine-grained tokens have?

Distinguishing raw and non-raw forms

The current table distinguishes raw from non-raw forms as different top-level "kinds".

I think this distinction will be needed in some cases, but perhaps it would be better represented using an attribute on unified kinds (like rustc_ast::StrStyle and rustc_ast::token::IdentIsRaw).

As an example of where it might be wanted: the proc-macro Display implementation includes the r# prefix for raw identifiers, but I think simply using the source extent isn't correct because the Display output is NFC-normalised.

Hash count

Should there be an attribute recording the number of hashes in a raw string or byte-string literal? Rustc has something of the sort.

ASCII identifiers

Should there be an attribute indicating whether an identifier is all ASCII? The Reference lists several places where identifiers have this restriction, and it seems natural for the lexer to be responsible for making this check.

The list in the Reference is:

  • extern crate declarations
  • External crate names referenced in a path
  • Module names loaded from the filesystem without a path attribute
  • no_mangle attributed items
  • Item names in external blocks

I believe this restriction is applied after NFC-normalisation, so it's best thought of as a restriction on the represented identifier.

Represented bytes for C strings

At present this document says that the sequence of "represented bytes" for C string literals doesn't include the added NUL.

That's following the way the Reference currently uses the term "represented bytes", but rustc includes the NUL in its equivalent piece of data.

Defining the block-comment constraint

This document currently uses imperative Rust code to define the Block comment constraint (ie, to say that /* and */ must be properly nested inside a candidate comment).

It would be nice to do better; the options might depend on what pattern notation is chosen.

I don't think there's any very elegant way to describe the constraint in English (note that the constraint is asymmetrical; for example /* /*/ /*/ */ is rejected).

Perhaps the natural continuation of this writeup's approach would be to define a mini-tokeniser to use inside the constraint, but that would be a lot of words for a small part of the spec.

Or perhaps this part could borrow some definitions from whatever formalisation the spec ends up using for Rust's grammar, and use the traditional sort of context-free-grammar approach.

Wording for string unescaping

The description of reprocessing for String literals and C-string literals was originally drafted for the Reference. Should there be a more formal definition of unescaping processes than the current "left-to-right order" and "contributes" wording?

I believe that any literal content which will be accepted can be written uniquely as a sequence of (escape-sequence or non-\-character), but I'm not sure that's obvious enough that it can be stated without justification.

This is a place where the comparable implementation isn't closely parallel to the writeup.

How to model shebang removal

This part of the Reference text isn't trying to be rigorous:

As an exception, if the #! characters are followed (ignoring intervening comments or whitespace) by a [ token, nothing is removed. This prevents an inner attribute at the start of a source file being removed.

rustc implements the "ignoring intervening comments or whitespace" part by running its lexer for long enough to see whether the [ is there or not, then discarding the result (see #70528 and #71487 for history).

So should the spec define this in terms of its model of the lexer?

String continuation escapes

rustc has a warning that the behaviour of String continuation escapes (when multiple newlines are skipped) may change in future.

The Reference has a note about this, and points to #1042 for more information. Should the spec say anything?

Rustc oddities

NFC normalisation for lifetime/label

Identifiers are normalised to NFC, which means that Kelvin (written with U+212A KELVIN SIGN as its first character) and Kelvin (written with an ordinary capital K) are treated as representing the same identifier. See rfc2457.

But this doesn't happen for lifetimes or labels, so 'Kelvin and 'Kelvin (written in those two ways) are different as lifetimes or labels.

For example, this compiles without warning in Rust 1.83, while this doesn't.

In this writeup, the represented identifier attribute of Identifier and RawIdentifier fine-grained tokens is in NFC, and the name attribute of LifetimeOrLabel and RawLifetimeOrLabel tokens isn't.

I think this behaviour is a promising candidate for provoking the "Wait...that's what we currently do? We should fix that." reaction to being given a spec to review.

Filed as rustc #126759.

Restriction on e-suffixes

With the implementation of pr131656 as of 2025-03-02, support for numeric literal suffixes beginning with e or E is incomplete, and rejects some (very obscure) cases.

A numeric literal token is rejected if:

  • it doesn't have an exponent; and
  • it has a suffix of the following form:
    • begins with e or E
    • immediately followed by one or more _ characters
    • immediately followed by a character which has the XID_Continue property but not XID_Start.

For example, 123e_· is rejected.