Introduction
This document contains a description of rustc's lexer, which aims to be both correct and verifiable.
It's accompanied by a reimplementation of the lexer in Rust based on that description (called the "comparable implementation" below), and a framework for comparing its output to rustc's.
Scope
Rust language version
This document describes Rust version 1.85.
That means it describes raw lifetimes/labels and the additional reservations in the 2024 edition, but not
- rfc3349 (Mixed UTF-8 literals)
Other statements in this document are intended to be true as of February 2025.
The comparable implementation is intended to be compiled against (and compared against) rustc 1.87.0-nightly (f8a913b13 2025-02-23).
Editions
This document describes the editions supported by Rust 1.85:
- 2015
- 2018
- 2021
- 2024
There are no differences in lexing behaviour between the 2015 and 2018 editions.
In the comparable implementation, "2015" is used to refer to the common behaviour of Rust 2015 and Rust 2018.
Accepted input
This description aims to accept input exactly if rustc's lexer would.
Specifically, it aims to model what's accepted as input to a function-like macro (a procedural macro or a by-example macro using the tt fragment specifier).
It's not attempting to accurately model rustc's "reasons" for rejecting input, or to provide enough information to reproduce error messages similar to rustc's.
It's not attempting to describe rustc's "recovery" behaviour (where input which will be reported as an error provides tokens to later stages of the compiler anyway).
Size limits
This description doesn't attempt to characterise rustc's limits on the size of the input as a whole.
As far as I know, rustc has no limits on the size of individual tokens beyond its limits on the input as a whole, but I haven't tried to test this.
Output form
This document only goes as far as describing how to produce a "least common denominator" stream of tokens.
Further writing will be needed to describe how to convert that stream to forms that fit the (differing) needs of the grammar and the macro systems.
In particular, this representation may be unsuitable for direct use by a description of the grammar because:
- there's no distinction between identifiers and keywords;
- there's a single "kind" of token for all punctuation;
- sequences of punctuation such as :: aren't glued together to make a single token.
(The comparable implementation includes code to make compound punctuation tokens so they can be compared with rustc's, but that process isn't described here.)
Licence
This document and the accompanying lexer implementation are released under the terms of both the MIT license and the Apache License (Version 2.0).
Authorship and source access
© Matthew Woodcraft 2024,2025
The source code for this document and the accompanying lexer implementation is available at https://github.com/mattheww/lexeywan
Overview
The following processes might be considered to be part of Rust's lexer:
- Decode: interpret UTF-8 input as a sequence of Unicode characters
- Clean:
- Byte order mark removal
- CRLF normalisation
- Shebang removal
- Tokenise: interpret the characters as ("fine-grained") tokens
- Further processing: to fit the needs of later parts of the spec
- For example, convert fine-grained tokens to compound tokens
- possibly different for the grammar and the two macro implementations
This document attempts to completely describe the "Tokenise" process.
Definitions
Byte
For the purposes of this document, byte means the same thing as Rust's u8 (corresponding to a natural number in the range 0 to 255 inclusive).
Character
For the purposes of this document, character means the same thing as Rust's char.
That means, in particular:
- there's exactly one character for each Unicode scalar value
- the things that Unicode calls "noncharacters" are characters
- there are no characters corresponding to surrogate code points
Sequence
When this document refers to a sequence of items, it means a finite, but possibly empty, ordered list of those items.
"character sequence" and "sequence of characters" are different ways of saying the same thing.
Prefix of a sequence
When this document talks about a prefix of a sequence, it means "prefix" in the way that abc is a prefix of abcde.
The prefix may be empty, or the entire sequence.
NFC normalisation
References to NFC-normalised strings are talking about Unicode's Normalization Form C, defined in Unicode Standard Annex #15.
Processing that happens before tokenising
This document's description of tokenising takes a sequence of characters as input.
rustc obtains that sequence of characters as follows:
This description is taken from the Input format chapter of the Reference.
Source encoding
Each source file is interpreted as a sequence of Unicode characters encoded in UTF-8. It is an error if the file is not valid UTF-8.
Byte order mark removal
If the first character in the sequence is U+FEFF (BYTE ORDER MARK), it is removed.
CRLF normalization
Each pair of characters U+000D CR immediately followed by U+000A LF is replaced by a single U+000A LF.
Other occurrences of the character U+000D CR are left in place (they are treated as whitespace).
Note: this document's description of tokenisation doesn't assume that the sequence CRLF never appears in its input; that makes it more general than necessary, but should do no harm.
In particular, in places where the Reference says that tokens may not contain "lone CR", this description just says that any CR is rejected.
Shebang removal
If the remaining sequence begins with the characters #!, the characters up to and including the first U+000A LF are removed from the sequence.
For example, the first line of the following file would be ignored:
#!/usr/bin/env rustx
fn main() {
println!("Hello!");
}
As an exception, if the #! characters are followed (ignoring intervening comments or whitespace) by a [ punctuation token, nothing is removed.
This prevents an inner attribute at the start of a source file being removed.
See open question: How to model shebang removal
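As a rough illustration, the clean-up steps above (byte order mark removal, CRLF normalisation, shebang removal) might be sketched as below. This is a simplification, not the comparable implementation: in particular, when checking for a following [ it skips only whitespace, not comments, and the function name `clean` is invented here.

```rust
// Sketch of the processing that happens before tokenising.
// Simplified: the `[` check ignores whitespace but not comments.
fn clean(input: &str) -> String {
    // Byte order mark removal
    let mut s: String = input.strip_prefix('\u{FEFF}').unwrap_or(input).to_string();
    // CRLF normalisation
    s = s.replace("\r\n", "\n");
    // Shebang removal (with the exception for a following `[`)
    if s.starts_with("#!") && !s[2..].trim_start().starts_with('[') {
        match s.find('\n') {
            Some(i) => s = s[i + 1..].to_string(),
            None => s.clear(),
        }
    }
    s
}

fn main() {
    assert_eq!(clean("\u{FEFF}a\r\nb"), "a\nb");
    assert_eq!(clean("#!/usr/bin/env rustx\nfn main() {}\n"), "fn main() {}\n");
    // An inner attribute at the start of the file is preserved.
    assert_eq!(clean("#![allow(unused)]\n"), "#![allow(unused)]\n");
}
```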
Tokenising
This phase of processing takes a character sequence (the input), and either:
- produces a sequence of fine-grained tokens; or
- reports that lexical analysis failed
The analysis depends on the Rust edition which is in effect when the input is processed.
So strictly speaking, the edition is a second parameter to the process described here.
Tokenisation is described using two operations:
- Pretokenising extracts pretokens from the character sequence.
- Reprocessing converts pretokens to fine-grained tokens.
Either operation can cause lexical analysis to fail.
Note: If lexical analysis succeeds, concatenating the extents of the produced tokens produces an exact copy of the input.
Process
The process is to repeat the following steps until the input is empty:
- extract a pretoken from the start of the input
- reprocess that pretoken
If no step determines that lexical analysis should fail, the output is the sequence of fine-grained tokens produced by the repetitions of the second step.
Note: Each fine-grained token corresponds to one pretoken, representing exactly the same characters from the input; reprocessing doesn't involve any combination or splitting.
Note: it doesn't make any difference whether we treat this as one pass with interleaved pretoken-extraction and reprocessing, or as two passes. The comparable implementation uses a single interleaved pass, which means when it reports an error it describes the earliest part of the input which caused trouble.
Pretokenising
Pretokenisation works from an ordered list of rules.
See pretokenisation rules for the list (which depends on the Rust edition which is in effect).
To extract a pretoken from the input:
- Apply each rule to the input.
- If no rules succeed, lexical analysis fails.
- Otherwise, the extracted pretoken's extent, kind, and attributes are determined (as described for each rule below) by the successful rule which appears earliest in the list.
- Remove the extracted pretoken's extent from the start of the input.
Note: If lexical analysis succeeds, concatenating the extents of the pretokens extracted during the analysis produces an exact copy of the input.
See open question Rule priority
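The extraction loop and the earliest-successful-rule-wins behaviour can be sketched with toy rules. The real rules use the regex patterns given later; `longest`, `extract`, and the two stand-in rules here are illustrative only.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind { Whitespace, Word }

// Length in bytes of the longest prefix whose characters all satisfy `pred`.
fn longest(input: &str, pred: fn(char) -> bool) -> Option<usize> {
    let n: usize = input.chars().take_while(|&c| pred(c)).map(char::len_utf8).sum();
    if n == 0 { None } else { Some(n) }
}

// Extract one pretoken: apply each rule; the successful rule earliest
// in the list determines the extent and kind.
fn extract(input: &str) -> Option<(Kind, &str, &str)> {
    let rules: [(Kind, fn(&str) -> Option<usize>); 2] = [
        (Kind::Whitespace, |s| longest(s, char::is_whitespace)),
        (Kind::Word, |s| longest(s, |c| c.is_alphanumeric() || c == '_')),
    ];
    for (kind, rule) in rules {
        if let Some(len) = rule(input) {
            return Some((kind, &input[..len], &input[len..]));
        }
    }
    None
}

fn main() {
    // Repeat until the input is empty; if no rule succeeds, analysis fails.
    let mut input = "fn  main";
    let mut kinds = Vec::new();
    while !input.is_empty() {
        let (kind, _extent, rest) = extract(input).expect("lexical analysis fails");
        kinds.push(kind);
        input = rest;
    }
    assert_eq!(kinds, [Kind::Word, Kind::Whitespace, Kind::Word]);
}
```

Concatenating the extents recovered at each step reproduces the input exactly, as noted above.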
Rules
Each rule has a pattern (see patterns) and a set of forbidden followers (a set of characters).
A rule may also have a constraint (see constrained pattern matches).
The result of applying a rule to a character sequence is either:
- the rule fails; or
- the rule succeeds, and reports
- an extent, which is a prefix of the character sequence
- a pretoken kind
- values for the attributes appropriate to that kind of pretoken
Note: a given rule always reports the same pretoken kind, but some pretoken kinds are reported by multiple rules.
Applying rules
To apply a rule to a character sequence:
- Attempt to match the rule's pattern against each prefix of the sequence.
- If no prefix matches the pattern, the rule fails.
- Otherwise the extent is the longest prefix which matches the pattern.
- But if the rule has a constraint, see constrained pattern matches instead for how the extent is determined.
- If the extent is not the entire character sequence, and the character in the sequence which immediately follows the extent is in the rule's set of forbidden followers, the rule fails.
- Otherwise the rule succeeds.
The description of each rule below says how the pretoken kind and attribute values are determined when the rule succeeds.
Constrained pattern matches
Each rule which has a constraint defines what is required for a sequence of characters to satisfy its constraint.
Or more formally: a constraint is a predicate function defined on sequences of characters.
When a rule which has a constraint is applied to a character sequence, the resulting extent is the shortest maximal match, defined as follows:
- The candidates are the prefixes of the character sequence which match the rule's pattern and also satisfy the constraint.
- The successor of a prefix of the character sequence is the prefix which is one character longer (the prefix which is the entire sequence has no successor).
- The shortest maximal match is the shortest candidate whose successor is not a candidate (or which has no successor).
Note: constraints are used only for block comments and for raw string literals with hashes.
For the block comments rule it would be equivalent to say that the shortest match becomes the extent.
For raw string literals, the "shortest maximal match" behaviour is a way to get the mix of non-greedy and greedy matching we need: the rule as a whole has to be non-greedy so that it doesn't jump to the end of a later literal, but the suffix needs to be matched greedily.
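A sketch of the "shortest maximal match" definition, using a closure in place of "matches the pattern and satisfies the constraint". The candidate predicate used in the demonstration is a crude stand-in for the hashed raw string literal rule, not the real pattern.

```rust
// Return the shortest candidate prefix whose one-character-longer
// successor is not a candidate (or which has no successor).
fn shortest_maximal_match(input: &str, is_candidate: impl Fn(&str) -> bool) -> Option<String> {
    // Every prefix of the input, shortest first.
    let prefixes: Vec<String> = (0..=input.chars().count())
        .map(|n| input.chars().take(n).collect())
        .collect();
    for (i, prefix) in prefixes.iter().enumerate() {
        if is_candidate(prefix) {
            match prefixes.get(i + 1) {
                Some(successor) if is_candidate(successor) => continue,
                _ => return Some(prefix.clone()),
            }
        }
    }
    None
}

fn main() {
    // Toy stand-in for the hashed raw string rule: a prefix that looks
    // like a complete r#"…"# literal.
    let looks_raw = |p: &str| p.len() >= 5 && p.starts_with("r#\"") && p.ends_with("\"#");
    // The match stops at the first literal rather than jumping to the
    // end of the later one.
    let m = shortest_maximal_match(r##"r#"a"# b"#"##, looks_raw);
    assert_eq!(m.as_deref(), Some(r##"r#"a"#"##));
}
```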
Pretokens
Each pretoken has an extent, which is a sequence of characters taken from the input.
Each pretoken has a kind, and possibly also some attributes, as described in the tables below.
Kind | Attributes |
---|---|
Reserved | |
Whitespace | |
LineComment | comment content |
BlockComment | comment content |
Punctuation | mark |
Identifier | identifier |
RawIdentifier | identifier |
LifetimeOrLabel | name |
RawLifetimeOrLabel | name |
SingleQuoteLiteral | prefix, literal content, suffix |
DoubleQuoteLiteral | prefix, literal content, suffix |
RawDoubleQuoteLiteral | prefix, literal content, suffix |
IntegerDecimalLiteral | digits, suffix |
IntegerHexadecimalLiteral | digits, suffix |
IntegerOctalLiteral | digits, suffix |
IntegerBinaryLiteral | digits, suffix |
FloatLiteral | has base, body, exponent digits, suffix |
These attributes have the following types:
Attribute | Type |
---|---|
body | sequence of characters |
digits | sequence of characters |
exponent digits | either a sequence of characters, or none |
has base | true or false |
identifier | sequence of characters |
literal content | sequence of characters |
comment content | sequence of characters |
mark | single character |
name | sequence of characters |
prefix | sequence of characters |
suffix | sequence of characters |
Patterns
A pattern has two functions:
- To answer the question "does this sequence of characters match the pattern?"
- When the answer is yes, to capture zero or more named groups of characters.
The patterns in this document use the notation from the well-known Rust regex crate.
Specifically, the notation is to be interpreted in verbose mode (ignore_whitespace) and with . allowed to match newlines (dot_matches_new_line).
See open question Pattern notation.
Patterns are always used to match against a fixed-length sequence of characters (as if the pattern was anchored at both ends).
Other than for constrained pattern matches, the comparable implementation anchors to the start but not the end, relying on Regex::find() to find the longest matching prefix.
Named capture groups (eg (?<suffix> … )) are used in the patterns to supply character sequences used to determine attribute values.
Sets of characters
In particular, the following notations are used to specify sets of Unicode characters:
\p{Pattern_White_Space} refers to the set of characters which have the Pattern_White_Space Unicode property, which are:
U+0009 | (horizontal tab, '\t') |
U+000A | (line feed, '\n') |
U+000B | (vertical tab) |
U+000C | (form feed) |
U+000D | (carriage return, '\r') |
U+0020 | (space, ' ') |
U+0085 | (next line) |
U+200E | (left-to-right mark) |
U+200F | (right-to-left mark) |
U+2028 | (line separator) |
U+2029 | (paragraph separator) |
Note: This set doesn't change in updated Unicode versions.
\p{XID_Start} refers to the set of characters which have the XID_Start Unicode property (as of Unicode 16.0.0).
\p{XID_Continue} refers to the set of characters which have the XID_Continue Unicode property (as of Unicode 16.0.0).
The Reference adds the following when discussing identifiers: "Zero width non-joiner (ZWNJ U+200C) and zero width joiner (ZWJ U+200D) characters are not allowed in identifiers." Those characters don't have XID_Start or XID_Continue, so that's only informative text, not an additional constraint.
Table of contents
Whitespace
Line comment
Block comment
Unterminated block comment
Reserved hash forms (Rust 2024)
Punctuation
Single-quoted literal
Raw lifetime or label (Rust 2021 and 2024)
Reserved lifetime or label prefix (Rust 2021 and 2024)
Non-raw lifetime or label
Double-quoted non-raw literal (Rust 2015 and 2018)
Double-quoted non-raw literal (Rust 2021 and 2024)
Double-quoted hashless raw literal (Rust 2015 and 2018)
Double-quoted hashless raw literal (Rust 2021 and 2024)
Double-quoted hashed raw literal (Rust 2015 and 2018)
Double-quoted hashed raw literal (Rust 2021 and 2024)
Float literal with exponent
Float literal without exponent
Float literal with final dot
Integer binary literal
Integer octal literal
Integer hexadecimal literal
Integer decimal literal
Raw identifier
Unterminated literal (Rust 2015 and 2018)
Reserved prefix or unterminated literal (Rust 2021 and 2024)
Non-raw identifier
The list of pretokenisation rules
The list of pretokenisation rules is given below.
Rules whose names indicate one or more editions are included in the list only when one of those editions is in effect.
Unless otherwise stated, a rule has no constraint and has an empty set of forbidden followers.
When an attribute value is given below as "captured characters", the value of that attribute is the sequence of characters captured by the capture group in the pattern whose name is the same as the attribute's name.
Whitespace
Pattern
[ \p{Pattern_White_Space} ] +
Pretoken kind
Whitespace
Attributes
(none)
Line comment
Pattern
/ /
(?<comment_content>
[^ \n] *
)
Pretoken kind
LineComment
Attributes
comment content | captured characters |
Block comment
Pattern
/ \*
(?<comment_content>
. *
)
\* /
Constraint
The constraint is satisfied if (and only if) the following block of Rust code evaluates to true, when character_sequence represents an iterator over the sequence of characters being tested against the constraint.

```rust
{
    let mut depth = 0_isize;
    let mut after_slash = false;
    let mut after_star = false;
    for c in character_sequence {
        match c {
            '*' if after_slash => {
                depth += 1;
                after_slash = false;
            }
            '/' if after_star => {
                depth -= 1;
                after_star = false;
            }
            _ => {
                after_slash = c == '/';
                after_star = c == '*';
            }
        }
    }
    depth == 0
}
```
Pretoken kind
BlockComment
Attributes
comment content | captured characters |
Unterminated block comment
Pattern
/ \*
Pretoken kind
Reserved
Attributes
(none)
Reserved hash forms (Rust 2024)
Pattern
\#
( \# | " )
Pretoken kind
Reserved
Attributes
(none)
Punctuation
Pattern
[
; , \. \( \) \{ \} \[ \] @ \# ~ \? : \$ = ! < > \- & \| \+ \* / ^ %
]
Pretoken kind
Punctuation
Attributes
mark | the single character matched by the pattern |
Note: When this pattern matches, the matched character sequence is necessarily one character long.
Single-quoted literal
Pattern
(?<prefix>
b ?
)
'
(?<literal_content>
[^ \\ ' ]
|
\\ . [^']*
)
'
(?<suffix>
(?:
[ \p{XID_Start} _ ]
\p{XID_Continue} *
) ?
)
Pretoken kind
SingleQuoteLiteral
Attributes
prefix | captured characters |
literal content | captured characters |
suffix | captured characters |
Raw lifetime or label (Rust 2021 and 2024)
Pattern
' r \#
(?<name>
[ \p{XID_Start} _ ]
\p{XID_Continue} *
)
Forbidden followers:
- The character '
Pretoken kind
RawLifetimeOrLabel
Attributes
name | captured characters |
Reserved lifetime or label prefix (Rust 2021 and 2024)
Pattern
'
[ \p{XID_Start} _ ]
\p{XID_Continue} *
\#
Pretoken kind
Reserved
Attributes
(none)
Non-raw lifetime or label
Pattern
'
(?<name>
[ \p{XID_Start} _ ]
\p{XID_Continue} *
)
Forbidden followers:
- The character '
Pretoken kind
LifetimeOrLabel
Attributes
name | captured characters |
Note: the forbidden follower here makes sure that forms like 'aaa'bbb are not accepted.
Double-quoted non-raw literal (Rust 2015 and 2018)
Pattern
(?<prefix>
b ?
)
"
(?<literal_content>
(?:
[^ \\ " ]
|
\\ .
) *
)
"
(?<suffix>
(?:
[ \p{XID_Start} _ ]
\p{XID_Continue} *
) ?
)
Pretoken kind
DoubleQuoteLiteral
Attributes
prefix | captured characters |
literal content | captured characters |
suffix | captured characters |
Double-quoted non-raw literal (Rust 2021 and 2024)
Pattern
(?<prefix>
[bc] ?
)
"
(?<literal_content>
(?:
[^ \\ " ]
|
\\ .
) *
)
"
(?<suffix>
(?:
[ \p{XID_Start} _ ]
\p{XID_Continue} *
) ?
)
Pretoken kind
DoubleQuoteLiteral
Attributes
prefix | captured characters |
literal content | captured characters |
suffix | captured characters |
Note: the difference between the 2015/2018 and 2021/2024 patterns is that the 2021/2024 pattern allows c as a prefix.
Double-quoted hashless raw literal (Rust 2015 and 2018)
Pattern
(?<prefix>
r | br
)
"
(?<literal_content>
[^"] *
)
"
(?<suffix>
(?:
[ \p{XID_Start} _ ]
\p{XID_Continue} *
) ?
)
Pretoken kind
RawDoubleQuoteLiteral
Attributes
prefix | captured characters |
literal content | captured characters |
suffix | captured characters |
Double-quoted hashless raw literal (Rust 2021 and 2024)
Pattern
(?<prefix>
r | br | cr
)
"
(?<literal_content>
[^"] *
)
"
(?<suffix>
(?:
[ \p{XID_Start} _ ]
\p{XID_Continue} *
) ?
)
Pretoken kind
RawDoubleQuoteLiteral
Attributes
prefix | captured characters |
literal content | captured characters |
suffix | captured characters |
Note: the difference between the 2015/2018 and 2021/2024 patterns is that the 2021/2024 pattern allows cr as a prefix.
Note: we can't treat the hashless rule as a special case of the hashed one because the "shortest maximal match" rule doesn't work without hashes (consider r"x"").
Double-quoted hashed raw literal (Rust 2015 and 2018)
Pattern
(?<prefix>
r | br
)
(?<hashes_1>
\# {1,255}
)
"
(?<literal_content>
. *
)
"
(?<hashes_2>
\# {1,255}
)
(?<suffix>
(?:
[ \p{XID_Start} _ ]
\p{XID_Continue} *
) ?
)
Constraint
The constraint is satisfied if (and only if) the character sequence captured by the hashes_1 capture group is equal to the character sequence captured by the hashes_2 capture group.
Pretoken kind
RawDoubleQuoteLiteral
Attributes
prefix | captured characters |
literal content | captured characters |
suffix | captured characters |
Double-quoted hashed raw literal (Rust 2021 and 2024)
Pattern
(?<prefix>
r | br | cr
)
(?<hashes_1>
\# {1,255}
)
"
(?<literal_content>
. *
)
"
(?<hashes_2>
\# {1,255}
)
(?<suffix>
(?:
[ \p{XID_Start} _ ]
\p{XID_Continue} *
) ?
)
Constraint
The constraint is satisfied if (and only if) the character sequence captured by the hashes_1 capture group is equal to the character sequence captured by the hashes_2 capture group.
Pretoken kind
RawDoubleQuoteLiteral
Attributes
prefix | captured characters |
literal content | captured characters |
suffix | captured characters |
Note: the difference between the 2015/2018 and 2021/2024 patterns is that the 2021/2024 pattern allows cr as a prefix.
Float literal with exponent
Pattern
(?<body>
(?:
(?<based>
(?: 0b | 0o )
[ 0-9 _ ] *
)
|
[ 0-9 ]
[ 0-9 _ ] *
)
(?:
\.
[ 0-9 ]
[ 0-9 _ ] *
) ?
[eE]
[+-] ?
(?<exponent_digits>
[ 0-9 _ ] *
)
)
(?<suffix>
(?:
[ \p{XID_Start} ]
\p{XID_Continue} *
) ?
)
Pretoken kind
FloatLiteral
Attributes
has base | true if the based capture group participates in the match, false otherwise |
body | captured characters |
exponent digits | captured characters |
suffix | captured characters |
Float literal without exponent
Pattern
(?<body>
(?:
(?<based>
(?: 0b | 0o )
[ 0-9 _ ] *
|
0x
[ 0-9 a-f A-F _ ] *
)
|
[ 0-9 ]
[ 0-9 _ ] *
)
\.
[ 0-9 ]
[ 0-9 _ ] *
)
(?<suffix>
(?:
[ \p{XID_Start} -- eE]
\p{XID_Continue} *
) ?
)
Pretoken kind
FloatLiteral
Attributes
has base | true if the based capture group participates in the match, false otherwise |
body | captured characters |
exponent digits | none |
suffix | captured characters |
Float literal with final dot
Pattern
(?:
(?<based>
(?: 0b | 0o )
[ 0-9 _ ] *
|
0x
[ 0-9 a-f A-F _ ] *
)
|
[ 0-9 ]
[ 0-9 _ ] *
)
\.
Forbidden followers:
- The character _
- The character .
- The characters with the Unicode property
XID_start
Pretoken kind
FloatLiteral
Attributes
has base | true if the based capture group participates in the match, false otherwise |
body | the entire character sequence matched by the pattern |
exponent digits | none |
suffix | empty character sequence |
Integer binary literal
Pattern
0b
(?<digits>
[ 0-9 _ ] *
)
(?<suffix>
(?:
[ \p{XID_Start} -- eE]
\p{XID_Continue} *
) ?
)
Pretoken kind
IntegerBinaryLiteral
Attributes
digits | captured characters |
suffix | captured characters |
Integer octal literal
Pattern
0o
(?<digits>
[ 0-9 _ ] *
)
(?<suffix>
(?:
[ \p{XID_Start} -- eE]
\p{XID_Continue} *
) ?
)
Pretoken kind
IntegerOctalLiteral
Attributes
digits | captured characters |
suffix | captured characters |
Integer hexadecimal literal
Pattern
0x
(?<digits>
[ 0-9 a-f A-F _ ] *
)
(?<suffix>
(?:
[ \p{XID_Start} -- aAbBcCdDeEfF]
\p{XID_Continue} *
) ?
)
Pretoken kind
IntegerHexadecimalLiteral
Attributes
digits | captured characters |
suffix | captured characters |
Integer decimal literal
Pattern
(?<digits>
[ 0-9 ]
[ 0-9 _ ] *
)
(?<suffix>
(?:
[ \p{XID_Start} -- eE]
\p{XID_Continue} *
) ?
)
Pretoken kind
IntegerDecimalLiteral
Attributes
digits | captured characters |
suffix | captured characters |
Note: it is important that this rule has lower priority than the other numeric literal rules. See Integer literal base-vs-suffix ambiguity.
Raw identifier
Pattern
r \#
(?<identifier>
[ \p{XID_Start} _ ]
\p{XID_Continue} *
)
Pretoken kind
RawIdentifier
Attributes
identifier | captured characters |
Unterminated literal (Rust 2015 and 2018)
Pattern
( r \# | b r \# | r " | b r " | b ' )
Note: I believe the double-quoted forms here aren't strictly needed: if this rule is chosen when its pattern matched via one of those forms then the input must be rejected eventually anyway.
Pretoken kind
Reserved
Attributes
(none)
Reserved prefix or unterminated literal (Rust 2021 and 2024)
Pattern
[ \p{XID_Start} _ ]
\p{XID_Continue} *
( \# | " | ' )
Pretoken kind
Reserved
Attributes
(none)
Non-raw identifier
Pattern
(?<identifier>
[ \p{XID_Start} _ ]
\p{XID_Continue} *
)
Pretoken kind
Identifier
Attributes
identifier | captured characters |
Note: this is following the specification in Unicode Standard Annex #31 for Unicode version 16.0, with the addition of permitting underscore as the first character.
Reprocessing
Reprocessing examines a pretoken, and either accepts it (producing a fine-grained token), or rejects it (causing lexical analysis to fail).
Note: reprocessing behaves in the same way in all Rust editions.
The effect of reprocessing each kind of pretoken is given in List of reprocessing cases.
Fine-grained tokens
Reprocessing produces fine-grained tokens.
Each fine-grained token has an extent, which is a sequence of characters taken from the input.
Each fine-grained token has a kind, and possibly also some attributes, as described in the tables below.
Kind | Attributes |
---|---|
Whitespace | |
LineComment | style, body |
BlockComment | style, body |
Punctuation | mark |
Identifier | represented identifier |
RawIdentifier | represented identifier |
LifetimeOrLabel | name |
RawLifetimeOrLabel | name |
CharacterLiteral | represented character, suffix |
ByteLiteral | represented byte, suffix |
StringLiteral | represented string, suffix |
RawStringLiteral | represented string, suffix |
ByteStringLiteral | represented bytes, suffix |
RawByteStringLiteral | represented bytes, suffix |
CStringLiteral | represented bytes, suffix |
RawCStringLiteral | represented bytes, suffix |
IntegerLiteral | base, digits, suffix |
FloatLiteral | body, suffix |
These attributes have the following types:
Attribute | Type |
---|---|
base | binary / octal / decimal / hexadecimal |
body | sequence of characters |
digits | sequence of characters |
mark | single character |
name | sequence of characters |
represented byte | single byte |
represented bytes | sequence of bytes |
represented character | single character |
represented identifier | sequence of characters |
represented string | sequence of characters |
style | non-doc / inner doc / outer doc |
suffix | sequence of characters |
Notes:
At this stage:
- Both _ and keywords are treated as instances of Identifier.
- There are explicit tokens representing whitespace and comments.
- Single-character tokens are used for all punctuation.
- A lifetime (or label) is represented as a single token (which includes the leading ').
Escape processing
The descriptions of the effect of reprocessing string and character literals make use of several forms of escape.
Each form of escape is characterised by:
- an escape sequence: a sequence of characters, which always begins with \
- an escaped value: either a single character or an empty sequence of characters
In the definitions of escapes below:
- An octal digit is any of the characters in the range 0..=7.
- A hexadecimal digit is any of the characters in the ranges 0..=9, a..=f, or A..=F.
Simple escapes
Each sequence of characters occurring in the first column of the following table is an escape sequence.
In each case, the escaped value is the character given in the corresponding entry in the second column.
Escape sequence | Escaped value |
---|---|
\0 | U+0000 NUL |
\t | U+0009 HT |
\n | U+000A LF |
\r | U+000D CR |
\" | U+0022 QUOTATION MARK |
\' | U+0027 APOSTROPHE |
\\ | U+005C REVERSE SOLIDUS |
Note: the escaped value therefore has a Unicode scalar value which can be represented in a byte.
8-bit escapes
The escape sequence consists of \x followed by two hexadecimal digits.
The escaped value is the character whose Unicode scalar value is the result of interpreting the final two characters in the escape sequence as a hexadecimal integer,
as if by u8::from_str_radix with radix 16.
Note: the escaped value therefore has a Unicode scalar value which can be represented in a byte.
7-bit escapes
The escape sequence consists of \x followed by an octal digit then a hexadecimal digit.
The escaped value is the character whose Unicode scalar value is the result of interpreting the final two characters in the escape sequence as a hexadecimal integer,
as if by u8::from_str_radix with radix 16.
Unicode escapes
The escape sequence consists of \u{, followed by a hexadecimal digit, followed by a sequence of characters each of which is a hexadecimal digit or _, followed by }, with the restriction that there are no more than six hexadecimal digits in the entire escape sequence.
The escaped value is the character whose Unicode scalar value is the result of interpreting the hexadecimal digits contained in the escape sequence as a hexadecimal integer, as if by u32::from_str_radix with radix 16.
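The numeric escapes above can be sketched as follows. The function names are invented for illustration, and each function assumes the escape sequence has already been isolated (so there is no error recovery beyond returning None).

```rust
// `digits` is the two characters after `\x` (for 8-bit escapes these
// may be any hexadecimal digits; for 7-bit escapes the first must be
// an octal digit, which this sketch does not enforce).
fn eight_bit_escape(digits: &str) -> Option<char> {
    u8::from_str_radix(digits, 16).ok().map(char::from)
}

// `inner` is the text between `\u{` and `}`: a hexadecimal digit,
// then hexadecimal digits or `_` separators, at most six digits total.
fn unicode_escape(inner: &str) -> Option<char> {
    if !inner.starts_with(|c: char| c.is_ascii_hexdigit()) {
        return None;
    }
    let hex: String = inner.chars().filter(|&c| c != '_').collect();
    if hex.len() > 6 {
        return None;
    }
    // char::from_u32 rejects surrogate code points, matching the fact
    // that there are no characters corresponding to surrogates.
    u32::from_str_radix(&hex, 16).ok().and_then(char::from_u32)
}

fn main() {
    assert_eq!(eight_bit_escape("7f"), Some('\u{7f}'));
    assert_eq!(unicode_escape("1F600"), Some('\u{1F600}'));
    assert_eq!(unicode_escape("D800"), None); // surrogate: no such character
}
```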
String continuation escapes
The escape sequence consists of \ followed immediately by LF, and all following whitespace characters before the next non-whitespace character.
For this purpose, the whitespace characters are HT (U+0009), LF (U+000A), CR (U+000D), and SPACE (U+0020).
The escaped value is an empty sequence of characters.
The Reference says this behaviour may change in future; see String continuation escapes.
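A string continuation escape can be sketched as consuming the whitespace run after the \ + LF pair; `skip_continuation` is an illustrative name, and `rest` stands for the text immediately after the \ and LF.

```rust
// Consume the HT/LF/CR/SPACE run following a `\` + LF pair; the
// escape itself contributes an empty sequence of characters.
fn skip_continuation(rest: &str) -> &str {
    rest.trim_start_matches(|c| matches!(c, '\t' | '\n' | '\r' | ' '))
}

fn main() {
    assert_eq!(skip_continuation("   next"), "next");
    assert_eq!(skip_continuation("\t\n  x"), "x");
    assert_eq!(skip_continuation("x"), "x");
}
```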
Table of contents
Reserved
Whitespace
LineComment
BlockComment
Punctuation
Identifier
RawIdentifier
LifetimeOrLabel
RawLifetimeOrLabel
SingleQuoteLiteral
DoubleQuoteLiteral
RawDoubleQuoteLiteral
IntegerDecimalLiteral
IntegerHexadecimalLiteral
IntegerOctalLiteral
IntegerBinaryLiteral
FloatLiteral
The list of reprocessing cases
The list below has an entry for each kind of pretoken, describing what kind of fine-grained token it produces, how the fine-grained token's attributes are determined, and the circumstances under which a pretoken is rejected.
When an attribute value is given below as "copied", it has the same value as the pretoken's attribute with the same name.
Reserved
A Reserved pretoken is always rejected.
Whitespace
Fine-grained token kind produced:
Whitespace
A Whitespace pretoken is always accepted.
LineComment
Fine-grained token kind produced:
LineComment
Attributes
style and body are determined from the pretoken's comment content as follows:
- if the comment content begins with //:
  - style is non-doc
  - body is empty
- otherwise, if the comment content begins with /:
  - style is outer doc
  - body is the characters from the comment content after that /
- otherwise, if the comment content begins with !:
  - style is inner doc
  - body is the characters from the comment content after that !
- otherwise:
  - style is non-doc
  - body is empty
The pretoken is rejected if (and only if) the resulting body includes a CR character.
Note: the body of a non-doc comment is ignored by the rest of the compilation process
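The line-comment classification above can be sketched directly; `classify_line` is an illustrative name, and `content` is the pretoken's comment content (the characters after the leading //).

```rust
#[derive(Debug, PartialEq)]
enum Style { NonDoc, OuterDoc, InnerDoc }

// Determine style and body from a line comment's content.
fn classify_line(content: &str) -> (Style, &str) {
    if content.starts_with("//") {
        (Style::NonDoc, "")
    } else if let Some(body) = content.strip_prefix('/') {
        (Style::OuterDoc, body)
    } else if let Some(body) = content.strip_prefix('!') {
        (Style::InnerDoc, body)
    } else {
        (Style::NonDoc, "")
    }
}

fn main() {
    assert_eq!(classify_line("/ doc"), (Style::OuterDoc, " doc"));
    assert_eq!(classify_line("! doc"), (Style::InnerDoc, " doc"));
    // `////…` comments are non-doc, with an empty body.
    assert_eq!(classify_line("// not doc"), (Style::NonDoc, ""));
    assert_eq!(classify_line(" plain"), (Style::NonDoc, ""));
}
```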
BlockComment
Fine-grained token kind produced:
BlockComment
Attributes
style and body are determined from the pretoken's comment content as follows:
- if the comment content begins with **:
  - style is non-doc
  - body is empty
- otherwise, if the comment content begins with * and contains at least one further character:
  - style is outer doc
  - body is the characters from the comment content after that *
- otherwise, if the comment content begins with !:
  - style is inner doc
  - body is the characters from the comment content after that !
- otherwise:
  - style is non-doc
  - body is empty
The pretoken is rejected if (and only if) the resulting body includes a CR character.
Note: it follows that /**/ and /***/ are not doc-comments.
Note: the body of a non-doc comment is ignored by the rest of the compilation process
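The block-comment classification can be sketched the same way; `classify_block` is an illustrative name, and `content` is the pretoken's comment content (the characters between /* and */).

```rust
#[derive(Debug, PartialEq)]
enum Style { NonDoc, OuterDoc, InnerDoc }

// Determine style and body from a block comment's content.
fn classify_block(content: &str) -> (Style, String) {
    if content.starts_with("**") {
        (Style::NonDoc, String::new())
    } else if content.starts_with('*') && content.len() > 1 {
        // `*` plus at least one further character: outer doc
        (Style::OuterDoc, content[1..].to_string())
    } else if let Some(body) = content.strip_prefix('!') {
        (Style::InnerDoc, body.to_string())
    } else {
        (Style::NonDoc, String::new())
    }
}

fn main() {
    // `/**/` has empty content, `/***/` has content "*": neither is a doc-comment.
    assert_eq!(classify_block(""), (Style::NonDoc, String::new()));
    assert_eq!(classify_block("*"), (Style::NonDoc, String::new()));
    assert_eq!(classify_block("* doc "), (Style::OuterDoc, " doc ".to_string()));
    assert_eq!(classify_block("! doc "), (Style::InnerDoc, " doc ".to_string()));
}
```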
Punctuation
Fine-grained token kind produced:
Punctuation
A Punctuation pretoken is always accepted.
Attributes
mark: copied
Identifier
Fine-grained token kind produced:
Identifier
An Identifier pretoken is always accepted.
Attributes
represented identifier: NFC-normalised form of the pretoken's identifier
RawIdentifier
Fine-grained token kind produced:
RawIdentifier
Attributes
represented identifier: NFC-normalised form of the pretoken's identifier
The pretoken is rejected if (and only if) the represented identifier is one of the following sequences of characters:
- `_`
- `crate`
- `self`
- `super`
- `Self`
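A minimal sketch of this rejection check follows; the function name is hypothetical, and NFC normalisation itself is out of scope here, so the argument is assumed to be the already-normalised represented identifier (without the `r#` prefix).

```rust
/// Returns true if a raw identifier with this represented (NFC-normalised)
/// identifier is accepted, i.e. it is not one of the reserved sequences.
fn raw_identifier_accepted(represented: &str) -> bool {
    !matches!(represented, "_" | "crate" | "self" | "super" | "Self")
}
```

So `r#async` is accepted (represented identifier `async`), while `r#crate` is rejected.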
LifetimeOrLabel
Fine-grained token kind produced:
LifetimeOrLabel
A LifetimeOrLabel pretoken is always accepted.
Attributes
name: copied
Note that the name is not NFC-normalised. See NFC normalisation for lifetime/label.
RawLifetimeOrLabel
Fine-grained token kind produced:
RawLifetimeOrLabel
The pretoken is rejected if (and only if) the name is one of the following sequences of characters:
- `_`
- `crate`
- `self`
- `super`
- `Self`
Attributes
name: copied
Note that the name is not NFC-normalised. See NFC normalisation for lifetime/label.
SingleQuoteLiteral
The pretokeniser guarantees the pretoken's prefix attribute is one of the following:
- empty, in which case it is reprocessed as described under Character literal
- the single character b, in which case it is reprocessed as described under Byte literal.
In either case, the pretoken is rejected if its suffix consists of the single character _.
Character literal
Fine-grained token kind produced:
CharacterLiteral
Attributes
The represented character is derived from the pretoken's literal content as follows:
- If the literal content is one of the following forms of escape sequence, the represented character is the escape sequence's escaped value:
- If the literal content begins with a `\` character which did not introduce one of the above forms of escape, the pretoken is rejected.
- Otherwise, if the single character that makes up the literal content is LF, CR, or TAB, the pretoken is rejected.
- Otherwise, the represented character is the single character that makes up the literal content.
suffix: copied
Note: the pretokeniser guarantees the pretoken's literal content is either a single character, or a character sequence beginning with `\`.
Byte literal
Fine-grained token kind produced:
ByteLiteral
Attributes
Define a represented character, derived from the pretoken's literal content as follows:
- If the literal content is one of the following forms of escape sequence, the represented character is the escape sequence's escaped value:
- If the literal content begins with a `\` character which did not introduce one of the above forms of escape, the pretoken is rejected.
- Otherwise, if the single character that makes up the literal content is LF, CR, or TAB, the pretoken is rejected.
- Otherwise, if the single character that makes up the literal content has a unicode scalar value greater than 127, the pretoken is rejected.
- Otherwise, the represented character is the single character that makes up the literal content.
The represented byte is the represented character's Unicode scalar value.
suffix: copied
Note: the pretokeniser guarantees the pretoken's literal content is either a single character, or a character sequence beginning with `\`.
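The single-character path of this derivation can be sketched as follows. The function name is hypothetical, and escape-sequence handling is deliberately omitted: the sketch covers only the case where the literal content is a single character.

```rust
/// Validate the single-character case of a byte literal's content and derive
/// the represented byte. Escape sequences are not handled in this sketch.
fn byte_literal_represented_byte(c: char) -> Option<u8> {
    if c == '\n' || c == '\r' || c == '\t' {
        return None; // rejected: bare LF, CR, or TAB
    }
    let v = c as u32;
    if v > 127 {
        return None; // rejected: scalar value greater than 127
    }
    // the represented byte is the character's Unicode scalar value
    Some(v as u8)
}
```

So `b'A'` yields the byte 65, while `b'€'` is rejected because `€` has a scalar value above 127.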
DoubleQuoteLiteral
The pretokeniser guarantees the pretoken's prefix attribute is one of the following:
- empty, in which case it is reprocessed as described under String literal
- the single character b, in which case it is reprocessed as described under Byte-string literal
- the single character c, in which case it is reprocessed as described under C-string literal
In each case, the pretoken is rejected if its suffix consists of the single character _.
String literal
Fine-grained token kind produced:
StringLiteral
Attributes
The represented string is derived from the pretoken's literal content by replacing each escape sequence of any of the following forms occurring in the literal content with the escape sequence's escaped value.
These replacements take place in left-to-right order.
For example, the pretoken with extent `"\\x41"` is converted to the characters `\` `x` `4` `1`.
If a \ character appears in the literal content but is not part of one of the above forms of escape, the pretoken is rejected.
If a CR character appears in the literal content and is not part of a string continuation escape, the pretoken is rejected.
suffix: copied
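The left-to-right replacement process can be sketched as below. This is a deliberately partial illustration (the function name is hypothetical): it handles only a few simple escapes and the string continuation escape, whereas the real rules also cover 7-bit and unicode escapes.

```rust
/// Partial sketch of string-literal unescaping: a few simple escapes plus
/// the string continuation escape (`\` followed by LF, then skipping
/// whitespace). Returns `None` where the pretoken would be rejected.
fn unescape_string(content: &str) -> Option<String> {
    let mut out = String::new();
    let mut chars = content.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '\r' {
            return None; // CR not part of a string continuation escape
        }
        if c != '\\' {
            out.push(c);
            continue;
        }
        match chars.next() {
            Some('n') => out.push('\n'),
            Some('t') => out.push('\t'),
            Some('\\') => out.push('\\'),
            Some('"') => out.push('"'),
            Some('0') => out.push('\0'),
            Some('\n') => {
                // string continuation escape: skip following whitespace
                while matches!(chars.peek(), Some(' ' | '\t' | '\n' | '\r')) {
                    chars.next();
                }
            }
            // `\` not introducing one of the handled forms: rejected
            // (the real rules accept more forms than this sketch does)
            _ => return None,
        }
    }
    Some(out)
}
```

With this sketch the literal content `\\x41` becomes the characters `\` `x` `4` `1`, matching the example above: the `\\` escape is replaced first, and the following `x41` is left alone.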
Byte-string literal
Fine-grained token kind produced:
ByteStringLiteral
If any character whose unicode scalar value is greater than 127 appears in the literal content, the pretoken is rejected.
Attributes
Define a represented string (a sequence of characters) derived from the pretoken's literal content by replacing each escape sequence of any of the following forms occurring in the literal content with the escape sequence's escaped value.
These replacements take place in left-to-right order.
For example, the pretoken with extent `b"\\x41"` is converted to the characters `\` `x` `4` `1`.
If a \ character appears in the literal content but is not part of one of the above forms of escape, the pretoken is rejected.
If a CR character appears in the literal content and is not part of a string continuation escape, the pretoken is rejected.
The represented bytes are the sequence of Unicode scalar values of the characters in the represented string.
suffix: copied
C-string literal
Fine-grained token kind produced:
CStringLiteral
Attributes
The pretoken's literal content is treated as a sequence of items, each of which is either a single Unicode character other than \ or an escape.
The sequence of items is converted to the represented bytes as follows:
- Each single Unicode character contributes its UTF-8 representation.
- Each simple escape contributes a single byte containing the Unicode scalar value of its escaped value.
- Each 8-bit escape contributes a single byte containing the Unicode scalar value of its escaped value.
- Each unicode escape contributes the UTF-8 representation of its escaped value.
- Each string continuation escape contributes no bytes.
If a \ character appears in the literal content but is not part of one of the above forms of escape, the pretoken is rejected.
If a CR character appears in the literal content and is not part of a string continuation escape, the pretoken is rejected.
If any of the resulting represented bytes have value 0, the pretoken is rejected.
suffix: copied
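The item-by-item conversion to represented bytes can be sketched as below. The item type and function name are hypothetical; in particular, how the literal content is split into items is not shown here.

```rust
/// One item of a C-string literal's content (hypothetical representation).
enum CStrItem {
    /// A single Unicode character other than `\`.
    Char(char),
    /// A simple escape's escaped value.
    SimpleEscape(char),
    /// An 8-bit escape's escaped value.
    EightBit(u8),
    /// A unicode escape's escaped value.
    Unicode(char),
    /// A string continuation escape: contributes no bytes.
    Continuation,
}

/// Convert a sequence of items to the represented bytes, rejecting (`None`)
/// any result containing a NUL byte.
fn c_string_bytes(items: &[CStrItem]) -> Option<Vec<u8>> {
    let mut bytes = Vec::new();
    for item in items {
        match item {
            // characters and unicode escapes contribute their UTF-8 form
            CStrItem::Char(c) | CStrItem::Unicode(c) => {
                let mut buf = [0u8; 4];
                bytes.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
            }
            // simple and 8-bit escapes contribute a single byte
            CStrItem::SimpleEscape(c) => bytes.push(*c as u8),
            CStrItem::EightBit(b) => bytes.push(*b),
            CStrItem::Continuation => {}
        }
    }
    if bytes.contains(&0) {
        return None; // rejected: represented bytes include value 0
    }
    Some(bytes)
}
```

For example, the character `é` contributes the two UTF-8 bytes `0xC3 0xA9`, while the simple escape `\0` causes rejection because it would contribute a NUL byte.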
RawDoubleQuoteLiteral
The pretokeniser guarantees the pretoken's prefix attribute is one of the following:
- the single character r, in which case it is reprocessed as described under Raw string literal
- the characters br, in which case it is reprocessed as described under Raw byte-string literal
- the characters cr, in which case it is reprocessed as described under Raw C-string literal
In each case, the pretoken is rejected if its suffix consists of the single character _.
Raw string literal
Fine-grained token kind produced:
RawStringLiteral
The pretoken is rejected if (and only if) a CR character appears in the literal content.
Attributes
represented string: the pretoken's literal content
suffix: copied
Raw byte-string literal
Fine-grained token kind produced:
RawByteStringLiteral
If any character whose unicode scalar value is greater than 127 appears in the literal content, the pretoken is rejected.
If a CR character appears in the literal content, the pretoken is rejected.
Attributes
represented bytes: the sequence of Unicode scalar values of the characters in the pretoken's literal content
suffix: copied
Raw C-string literal
Fine-grained token kind produced:
RawCStringLiteral
If a CR character appears in the literal content, the pretoken is rejected.
Attributes
represented bytes: the UTF-8 encoding of the pretoken's literal content
suffix: copied
If any of the resulting represented bytes have value 0, the pretoken is rejected.
IntegerDecimalLiteral
Fine-grained token kind produced:
IntegerLiteral
The pretoken is rejected if (and only if) its digits attribute consists entirely of _ characters.
Attributes
base: decimal
digits: copied
suffix: copied
Note: in particular, an IntegerDecimalLiteral whose digits is empty is rejected.
IntegerHexadecimalLiteral
Fine-grained token kind produced:
IntegerLiteral
The pretoken is rejected if (and only if) its digits attribute consists entirely of _ characters.
Attributes
base: hexadecimal
digits: copied
suffix: copied
Note: in particular, an IntegerHexadecimalLiteral whose digits is empty is rejected.
IntegerOctalLiteral
Fine-grained token kind produced:
IntegerLiteral
The pretoken is rejected if (and only if):
- its digits attribute consists entirely of _ characters; or
- its digits attribute contains any character other than 0, 1, 2, 3, 4, 5, 6, 7, or _.
Attributes
base: octal
digits: copied
suffix: copied
Note: in particular, an IntegerOctalLiteral whose digits is empty is rejected.
IntegerBinaryLiteral
Fine-grained token kind produced:
IntegerLiteral
The pretoken is rejected if (and only if):
- its digits attribute consists entirely of _ characters; or
- its digits attribute contains any character other than 0, 1, or _.
Attributes
base: binary
digits: copied
suffix: copied
Note: in particular, an IntegerBinaryLiteral whose digits is empty is rejected.
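The rejection rules for the four integer-literal pretoken kinds share a common shape, which can be sketched in one hypothetical helper. For decimal and hexadecimal literals the pretokeniser's patterns already restrict the digit characters, so only the all-underscores (including empty) case rejects there; for octal and binary the digit-range check also applies.

```rust
/// Shared digit check for integer-literal pretokens: reject digits that are
/// entirely `_` (including empty digits), or that contain a character which
/// is not `_` and not a valid digit in the given base.
fn integer_digits_accepted(digits: &str, base: u32) -> bool {
    if digits.chars().all(|c| c == '_') {
        return false; // all `_`, or empty
    }
    digits
        .chars()
        .all(|c| c == '_' || c.to_digit(base).is_some())
}
```

So the digits `123` are accepted for base 8 but rejected for base 2 (the `2` and `3` are out of range), matching the treatment of forms like `0b123` reserved by rfc0879.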
FloatLiteral
Fine-grained token kind produced:
FloatLiteral
The pretoken is rejected if (and only if):
- its "has base" attribute is true; or
- its exponent digits attribute is a character sequence which consists entirely of `_` characters.
Attributes
body: copied
suffix: copied
Note: in particular, a FloatLiteral whose exponent digits is empty is rejected.
Rationale for this model
Pretokenising
The main difference between the model described in this document and the way the Reference (as of Rust 1.85) describes lexing is the split into pretokenisation and reprocessing.
There are a number of forms which are errors at lexing time, even though in principle they could be analysed as multiple tokens.
Examples include:
- the rfc3101 "reserved prefixes" (in Rust 2021 and newer): `k#abc`, `f"..."`, or `f'...'`
- the variants of numeric literals reserved in rfc0879, eg `0x1.2` or `0b123`
- adjacent-lifetime-like forms such as `'ab'c`
- stringlike literals with a single `_` as a suffix
- byte or C strings with unacceptable contents that would be accepted in plain strings, eg `b"€"`, `b"\u{00a0}"`, or `c"\x00"`
The Reference tries to account for some of these cases by adding rules which match the forms that cause errors, while keeping the forms matched by those rules disjoint from the forms matched by the non-error-causing rules.
The resulting rules for reserved prefixes and numeric literals are quite complicated (and still have mistakes). Rules of this sort haven't been attempted for stringlike literals.
The rules are simpler in a model with a 'pretokenising' step which can match a form such as `c"\x00"` (preventing it being matched as `c` followed by `"\x00"`), leaving it to a later stage to decide whether it's a valid token or a lexing-time error.
This separation also gives us a natural way to lex doc and non-doc comments uniformly, and inspect their contents later to make the distinction, rather than trying to write non-overlapping lexer rules as the Reference does.
Lookahead
The model described in this document uses one-character lookahead (beyond the token which will be matched) in the prelexing step, in two cases:
- the lifetime-or-label rule, to prevent `'ab'c` being analysed as `'ab` followed by `'c` (and similarly for the raw-lifetime-or-label rule)
- the rule for float literals ending in `.`, to make sure that `1.a` is analysed as `1` `.` `a` rather than `1.` `a`
I think some kind of lookahead is unavoidable in these cases.
I think the lookahead could be done after prelexing instead, by adding a pass that could reject pretokens or join them together, but I think that would be less clear. (In particular, the float rule would end up using a list of pretoken kinds that start with an identifier character, which seems worse than just looking for such a character.)
Constraints and imperative code
There are two kinds of token which are hard to deal with using a "regular" lexer: raw-string literals (where the number of `#` characters needs to match), and block comments (where the `/*` and `*/` delimiters need to be balanced).
Raw-string literals can in principle fit into a regular language because there's a limit of 255 `#` symbols, but it seems hard to do anything useful with that.
Nested comments can in principle be described using non-regular rules (as the Reference does).
The model described in this document deals with these cases by allowing rules to define constraints beyond the simple pattern match, effectively intervening in the "find the longest match" part of pattern matching.
The constraint for raw strings is simple, but the one for block comments has ended up using imperative code, which doesn't seem very satisfactory. See Defining the block-comment constraint.
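One possible shape for that imperative constraint is sketched below (this is illustrative, not the comparable implementation's actual code): scan the candidate extent, tracking nesting depth, and require that the depth returns to zero exactly at the end.

```rust
/// Sketch of the block-comment constraint: returns true if `text`
/// (including its outer delimiters) is a complete block comment whose
/// `/*` and `*/` sequences are properly nested.
fn is_balanced_block_comment(text: &str) -> bool {
    let bytes = text.as_bytes();
    let mut depth = 0usize;
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i..].starts_with(b"/*") {
            depth += 1;
            i += 2;
        } else if bytes[i..].starts_with(b"*/") {
            if depth == 0 {
                return false; // close with no matching open
            }
            depth -= 1;
            i += 2;
            if depth == 0 {
                // a complete comment must end exactly here
                return i == bytes.len();
            }
        } else {
            i += 1;
        }
    }
    false // ran out of input with an unclosed `/*`
}
```

This reproduces the asymmetry mentioned later in this document: `/* /* */ */` is balanced, but `/* /*/ /*/ */` is not, because each `/*/` opens a new nesting level without closing one.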
Producing tokens with attributes
This model makes the lexing process responsible for a certain amount of 'interpretation' of the tokens, rather than simply describing how the input source is sliced up and assigning a 'kind' to each resulting token.
The main motivation for this is to deal with stringlike literals: it means we don't need to separate the description of the result of "unescaping" strings from the description of which strings contain well-formed escapes.
In particular, describing unescaping at lexing time makes it easy to describe the rule about rejecting NULs in C-strings, even if they were written using an escape.
For numeric literals, the way the suffix is identified isn't always simple (see Integer literal base-vs-suffix ambiguity); I think it's best to make the lexer responsible for doing it, so that the description of numeric literal expressions doesn't have to.
For identifiers, many parts of the spec will need a notion of equivalence (both for handling raw identifiers and for dealing with NFC normalisation), and some restrictions depend on the normalised form (see ASCII identifiers). I think it's best for the lexer to handle this by defining the represented identifier.
This document treats the lexer's "output" as a stream of tokens which have concrete attributes, but of course it would be equivalent (and I think more usual for a spec) to treat each attribute as an independent defined term, and write things like "the represented character of a character literal token is…".
Open questions
Table of contents
- Terminology
- Pattern notation
- Rule priority
- Token kinds and attributes
- Defining the block-comment constraint
- Wording for string unescaping
- How to model shebang removal
- String continuation escapes
Terminology
Some of the terms used in this document are taken from pre-existing documentation or rustc's error output, but many of them are new (and so can freely be changed).
Here's a partial list:
| Term | Source |
|---|---|
| pretoken | New |
| reprocessing | New |
| fine-grained token | New |
| compound token | New |
| literal content | Reference (recent) |
| simple escape | Reference (recent) |
| escape sequence | Reference |
| escaped value | Reference (recent) |
| string continuation escape | Reference (as `STRING_CONTINUE`) |
| string representation | Reference (recent) |
| represented byte | New |
| represented character | Reference (recent) |
| represented bytes | Reference (recent) |
| represented string | Reference (recent) |
| represented identifier | New |
| style (of a comment) | rustc internal |
| body (of a comment) | Reference |
Terms listed as "Reference (recent)" are ones I introduced in PRs merged in January 2024, so it's not very likely that they've been picked up more widely.
Pattern notation
This document is relying on the `regex` crate for its pattern notation.
This is convenient for checking that the writeup is the same as the comparable implementation, but it's presumably not suitable for the spec.
The usual thing for specs seems to be to define their own notation from scratch.
Requirements for patterns
I've tried to keep the patterns used here as simple as possible.
There's no use of non-greedy matches.
I think all the uses of alternation are obviously unambiguous.
In particular, all uses of alternation inside repetition have disjoint sets of accepted first characters.
I believe all uses of repetition in the unconstrained patterns have unambiguous termination. That is, anything permitted to follow the repeatable section would not be permitted to start a new repetition. In these cases, the distinction between greedy and non-greedy matches doesn't matter.
Naming sub-patterns
The patterns used in this document are inconveniently repetitive, particularly for the edition-specific rule variants and for numeric literals.
Of course the usual thing is to have a way to define reusable named sub-patterns. So I think addressing this should be treated as part of choosing a pattern notation.
Rule priority
At present this document gives the pretokenisation rules explicit priorities, used to determine which rule is chosen in positions where more than one rule matches.
I believe that in almost all cases it would be equivalent to say that the rule which matches the longest extent is chosen (in particular, if multiple rules match then one has a longer extent than any of the others).
See Integer literal base-vs-suffix ambiguity below for the exception.
This document uses the order in which the rules are presented as the priority, which has the downside of forcing an unnatural presentation order (for example, Raw identifier and Non-raw identifier are separated).
Perhaps it would be better to say that longest-extent is the primary way to disambiguate, and add a secondary principle to cover the exceptional cases.
The comparable implementation reports (as "model error") any cases (other than the Integer literal base-vs-suffix ambiguity) where the priority principle doesn't agree with the longest-extent principle, or where there wasn't a unique longest match.
Integer literal base-vs-suffix ambiguity
The Reference's lexer rules for input such as `0x3` allow two interpretations, matching the same extent:
- as a hexadecimal integer literal: `0x3` with no suffix
- as a decimal integer literal: `0` with a suffix of `x3`
If the output of the lexer is simply a token with a kind and an extent, this isn't a problem: both cases have the same kind.
But if we want to make the lexer responsible for identifying which part of the input is the suffix, we need to make sure it gets the right answer (ie, the one with no suffix).
Further, there are cases where we need to reject input which matches the rule for a decimal integer literal `0` with a suffix, for example `0b1e2`, `0b0123` (see rfc0879), or `0x·`.
(Note that `·` has the `XID_Continue` property but not `XID_Start`.)
In these cases we can't avoid dealing with the base-vs-suffix ambiguity in the lexer.
This model uses a separate rule for integer decimal literals, with lower priority than all other numeric literals, to make sure we get these results.
Note that in the `0x·` example the extent matched by the lower-priority rule is longer than the extent matched by the chosen rule.
If relying on priorities like this seems undesirable, I think it would be possible to rework the rules to avoid it. It might work to allow the difficult cases to pretokenise as decimal integer literals, and have reprocessing reject decimal literal pretokens which begin with a base indicator.
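The ambiguity can be made concrete with two toy matchers (illustrative simplifications, not the document's actual patterns; in particular, the real suffix pattern is restricted to identifier-like characters, while these sketches take everything after the digits).

```rust
/// Toy matcher for the hexadecimal rule: `0x`, then hex digits and `_`,
/// then everything else as the suffix. Returns (digits, suffix).
fn match_hex(input: &str) -> Option<(&str, &str)> {
    let rest = input.strip_prefix("0x")?;
    let end = rest
        .find(|c: char| !c.is_ascii_hexdigit() && c != '_')
        .unwrap_or(rest.len());
    Some((&rest[..end], &rest[end..]))
}

/// Toy matcher for the decimal rule: decimal digits and `_`, then
/// everything else as the suffix. Returns (digits, suffix).
fn match_decimal(input: &str) -> Option<(&str, &str)> {
    let end = input
        .find(|c: char| !c.is_ascii_digit() && c != '_')
        .unwrap_or(input.len());
    if end == 0 {
        return None;
    }
    Some((&input[..end], &input[end..]))
}
```

Both matchers accept `0x3` over the same extent, but they disagree about the suffix: the hexadecimal rule gives digits `3` with no suffix, while the decimal rule gives digits `0` with suffix `x3`. Giving the decimal rule lower priority selects the first (correct) analysis.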
Token kinds and attributes
What kinds and attributes should fine-grained tokens have?
Distinguishing raw and non-raw forms
The current table distinguishes raw from non-raw forms as different top-level "kinds".
I think this distinction will be needed in some cases, but perhaps it would be better represented using an attribute on unified kinds (like `rustc_ast::StrStyle` and `rustc_ast::token::IdentIsRaw`).
As an example of where it might be wanted: the proc-macro `Display` output for raw identifiers includes the `r#` prefix, but I think simply using the source extent isn't correct because the `Display` output is NFC-normalised.
Hash count
Should there be an attribute recording the number of hashes in a raw string or byte-string literal? Rustc has something of the sort.
ASCII identifiers
Should there be an attribute indicating whether an identifier is all ASCII? The Reference lists several places where identifiers have this restriction, and it seems natural for the lexer to be responsible for making this check.
The list in the Reference is:
- `extern crate` declarations
- External crate names referenced in a path
- Module names loaded from the filesystem without a `path` attribute
- `no_mangle` attributed items
- Item names in external blocks
I believe this restriction is applied after NFC-normalisation, so it's best thought of as a restriction on the represented identifier.
Represented bytes for C strings
At present this document says that the sequence of "represented bytes" for C string literals doesn't include the added NUL.
That's following the way the Reference currently uses the term "represented bytes", but `rustc` includes the NUL in its equivalent piece of data.
Defining the block-comment constraint
This document currently uses imperative Rust code to define the block-comment constraint (ie, to say that `/*` and `*/` must be properly nested inside a candidate comment).
It would be nice to do better; the options might depend on what pattern notation is chosen.
I don't think there's any very elegant way to describe the constraint in English (note that the constraint is asymmetrical; for example `/* /*/ /*/ */` is rejected).
Perhaps the natural continuation of this writeup's approach would be to define a mini-tokeniser to use inside the constraint, but that would be a lot of words for a small part of the spec.
Or perhaps this part could borrow some definitions from whatever formalisation the spec ends up using for Rust's grammar, and use the traditional sort of context-free-grammar approach.
Wording for string unescaping
The description of reprocessing for String literals and C-string literals was originally drafted for the Reference. Should there be a more formal definition of unescaping processes than the current "left-to-right order" and "contributes" wording?
I believe that any literal content which will be accepted can be written uniquely as a sequence of (escape-sequence or non-\-character), but I'm not sure that's obvious enough that it can be stated without justification.
This is a place where the comparable implementation isn't closely parallel to the writeup.
How to model shebang removal
This part of the Reference text isn't trying to be rigorous:

> As an exception, if the `#!` characters are followed (ignoring intervening comments or whitespace) by a `[` token, nothing is removed. This prevents an inner attribute at the start of a source file being removed.

`rustc` implements the "ignoring intervening comments or whitespace" part by running its lexer for long enough to see whether the `[` is there or not, then discarding the result (see #70528 and #71487 for history).
So should the spec define this in terms of its model of the lexer?
String continuation escapes
`rustc` has a warning that the behaviour of string continuation escapes (when multiple newlines are skipped) may change in future.
The Reference has a note about this, and points to #1042 for more information. Should the spec say anything?
Rustc oddities
NFC normalisation for lifetime/label
Identifiers are normalised to NFC, which means that `Kelvin` written with the Kelvin sign (U+212A) and `Kelvin` written with an ordinary capital K are treated as representing the same identifier.
See rfc2457.
But this doesn't happen for lifetimes or labels, so the corresponding two spellings of `'Kelvin` are different as lifetimes or labels.
For example, this compiles without warning in Rust 1.83, while this doesn't.
In this writeup, the represented identifier attribute of `Identifier` and `RawIdentifier` fine-grained tokens is in NFC, and the name attribute of `LifetimeOrLabel` and `RawLifetimeOrLabel` tokens isn't.
I think this behaviour is a promising candidate for provoking the "Wait...that's what we currently do? We should fix that." reaction to being given a spec to review.
Filed as rustc #126759.