Pretokenising

Pretokenisation is specified by a Parsing Expression Grammar, which describes how to match a single pretoken (not a sequence of pretokens).

The grammar isn't strictly a PEG: see Grammar for raw string literals.

The grammar defines an edition nonterminal for each Rust edition:

Edition    Edition nonterminal
2015       PRETOKEN_2015
2021       PRETOKEN_2021
2024       PRETOKEN_2024
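
As a small illustration, the table can be read as a lookup from edition to nonterminal name. The `Edition` enum and the `edition_nonterminal` function are hypothetical names introduced for this sketch; they are not part of the grammar, and the enum covers only the editions listed in the table.

```rust
/// The editions listed in the table (hypothetical type, for illustration only).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Edition {
    E2015,
    E2021,
    E2024,
}

/// Returns the name of the edition nonterminal used for `edition`.
pub fn edition_nonterminal(edition: Edition) -> &'static str {
    match edition {
        Edition::E2015 => "PRETOKEN_2015",
        Edition::E2021 => "PRETOKEN_2021",
        Edition::E2024 => "PRETOKEN_2024",
    }
}
```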

Each edition nonterminal is defined as a choice expression, each of whose subexpressions is a single nonterminal (a pretoken nonterminal).

Grammar
PRETOKEN_2015 = {
    Whitespace |
    Line_comment |
    Block_comment |
    Unterminated_block_comment |
    Single_quoted_literal |
    Double_quoted_literal_2015 |
    Raw_double_quoted_literal_2015 |
    Unterminated_literal_2015 |
    Float_literal_1 |
    Reserved_float_empty_exponent |
    Reserved_float_e_suffix_restriction |
    Float_literal_2 |
    Reserved_float_based |
    Reserved_integer_e_suffix_restriction |
    Integer_literal |
    Lifetime_or_label |
    Raw_identifier |
    Reserved_prefix_2015 |
    Identifier |
    Punctuation
}

PRETOKEN_2021 = {
    Whitespace |
    Line_comment |
    Block_comment |
    Unterminated_block_comment |
    Single_quoted_literal |
    Double_quoted_literal_2021 |
    Raw_double_quoted_literal_2021 |
    Reserved_literal_2021 |
    Float_literal_1 |
    Reserved_float_empty_exponent |
    Reserved_float_e_suffix_restriction |
    Float_literal_2 |
    Reserved_float_based |
    Reserved_integer_e_suffix_restriction |
    Integer_literal |
    Raw_lifetime_or_label_2021 |
    Reserved_lifetime_or_label_prefix_2021 |
    Lifetime_or_label |
    Raw_identifier |
    Reserved_prefix_2021 |
    Identifier |
    Punctuation
}

PRETOKEN_2024 = {
    Whitespace |
    Line_comment |
    Block_comment |
    Unterminated_block_comment |
    Single_quoted_literal |
    Double_quoted_literal_2021 |
    Raw_double_quoted_literal_2021 |
    Reserved_literal_2021 |
    Reserved_guard_2024 |
    Float_literal_1 |
    Reserved_float_empty_exponent |
    Reserved_float_e_suffix_restriction |
    Float_literal_2 |
    Reserved_float_based |
    Reserved_integer_e_suffix_restriction |
    Integer_literal |
    Raw_lifetime_or_label_2021 |
    Reserved_lifetime_or_label_prefix_2021 |
    Lifetime_or_label |
    Raw_identifier |
    Reserved_prefix_2021 |
    Identifier |
    Punctuation
}

The pretoken nonterminals are distinguished in the grammar by having names in Title_case.

The rest of the grammar is presented in the following pages in this section. It's also available on a single page.

The pretoken nonterminals are presented in an order consistent with their appearance in the edition nonterminals. That means they appear in priority order (highest priority first). There is one exception, for floating-point literals and their related reserved forms (see Float literal).
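
To see why the order of alternatives matters, here is a minimal sketch of a PEG-style ordered choice: alternatives are tried in turn and the first one that matches wins. The matchers `ident_len` and `raw_ident_len` are deliberately simplified, ASCII-only stand-ins for `Identifier` and `Raw_identifier`, written for this illustration; they do not reproduce the real definitions, and `Matcher`, `ALTERNATIVES` and `ordered_choice` are likewise hypothetical names.

```rust
/// A matcher returns how many bytes it consumes at the start of the input,
/// or `None` if it does not match there.
type Matcher = fn(&str) -> Option<usize>;

/// Simplified identifier: ASCII letter or `_`, then ASCII letters, digits or `_`.
fn ident_len(input: &str) -> Option<usize> {
    let is_start = |c: char| c.is_ascii_alphabetic() || c == '_';
    let is_continue = |c: char| c.is_ascii_alphanumeric() || c == '_';
    let first = input.chars().next().filter(|&c| is_start(c))?;
    let rest = &input[first.len_utf8()..];
    let tail = rest.find(|c: char| !is_continue(c)).unwrap_or(rest.len());
    Some(first.len_utf8() + tail)
}

/// Simplified raw identifier: `r#` followed by a simplified identifier.
fn raw_ident_len(input: &str) -> Option<usize> {
    let rest = input.strip_prefix("r#")?;
    ident_len(rest).map(|n| n + 2)
}

/// Alternatives in priority order (highest priority first).
const ALTERNATIVES: [(&str, Matcher); 2] = [
    ("Raw_identifier", raw_ident_len),
    ("Identifier", ident_len),
];

/// Ordered choice: returns the name and match length of the first
/// alternative that matches at the start of `input`.
fn ordered_choice(
    alts: &[(&'static str, Matcher)],
    input: &str,
) -> Option<(&'static str, usize)> {
    alts.iter()
        .find_map(|&(name, matcher)| matcher(input).map(|len| (name, len)))
}
```

With this ordering, `ordered_choice(&ALTERNATIVES, "r#match x")` selects `Raw_identifier`. If the two alternatives were swapped, the simplified `Identifier` matcher would win first and consume only the leading `r`.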

Extracting pretokens

To extract a pretoken from the input:

  • Attempt to match the edition's edition nonterminal at the start of the input.
  • If the match fails, lexical analysis fails.
  • If the match succeeds, the extracted pretoken has:
    • extent: the characters consumed by the nonterminal's expression
    • kind and attributes: determined by the pretoken nonterminal used in the match, as described below.
  • Remove the extracted pretoken's extent from the start of the input.
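
The steps above can be sketched as a loop, assuming that extraction is simply repeated until the input is empty. Everything named here is hypothetical: `Pretoken`, `LexError`, `pretokenise` and `toy_matcher` are introduced for illustration, and `toy_matcher` stands in for a real edition nonterminal while recognising only whitespace, ASCII identifiers and single ASCII punctuation characters. The name of the matching pretoken nonterminal is kept as a stand-in for the pretoken's kind and attributes.

```rust
/// One extracted pretoken (simplified).
#[derive(Debug, PartialEq)]
pub struct Pretoken<'a> {
    /// Name of the pretoken nonterminal used in the match.
    pub matched_by: &'static str,
    /// The characters consumed by the match.
    pub extent: &'a str,
}

/// Lexical analysis failed: nothing matched at this byte offset.
#[derive(Debug, PartialEq)]
pub struct LexError {
    pub offset: usize,
}

/// Repeatedly extracts a pretoken from the start of `input`.
pub fn pretokenise<'a>(
    mut input: &'a str,
    match_edition_nonterminal: impl Fn(&str) -> Option<(&'static str, usize)>,
) -> Result<Vec<Pretoken<'a>>, LexError> {
    let total = input.len();
    let mut pretokens = Vec::new();
    while !input.is_empty() {
        // Attempt to match the edition nonterminal at the start of the input.
        let (matched_by, len) = match match_edition_nonterminal(input) {
            Some(found) => found,
            None => return Err(LexError { offset: total - input.len() }),
        };
        // Guard for this sketch only: an empty match would loop forever.
        assert!(len > 0, "matcher consumed no input");
        // The extent is what was consumed; remove it from the input.
        let (extent, rest) = input.split_at(len);
        pretokens.push(Pretoken { matched_by, extent });
        input = rest;
    }
    Ok(pretokens)
}

/// Length of the leading run of characters satisfying `pred`.
fn run_len(input: &str, pred: impl Fn(char) -> bool) -> usize {
    input.find(|c: char| !pred(c)).unwrap_or(input.len())
}

/// Toy stand-in for an edition nonterminal (not the real grammar).
pub fn toy_matcher(input: &str) -> Option<(&'static str, usize)> {
    let first = input.chars().next()?;
    if first.is_whitespace() {
        Some(("Whitespace", run_len(input, char::is_whitespace)))
    } else if first.is_ascii_alphabetic() || first == '_' {
        Some(("Identifier", run_len(input, |c| c.is_ascii_alphanumeric() || c == '_')))
    } else if first.is_ascii_punctuation() {
        Some(("Punctuation", first.len_utf8()))
    } else {
        None
    }
}
```

Calling `pretokenise("let x;", toy_matcher)` yields four pretokens, while an input the toy matcher cannot handle (such as one containing a digit) produces a `LexError` carrying the offset at which matching failed.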

Strictly speaking, we have to justify the assumption that an attempted match always either fails or succeeds (rather than failing to terminate). This essentially amounts to observing that the grammar has no left recursion.

Determining the pretoken kind and attributes

Each pretoken nonterminal produces a single kind of pretoken.

In most cases a given kind of pretoken is produced only by a single pretoken nonterminal. The exceptions are:

  • Several pretoken nonterminals produce Reserved pretokens.
  • There are two pretoken nonterminals producing FloatLiteral pretokens.
  • In some cases there are variant pretoken nonterminals for different editions.

Each pretoken nonterminal (or group of edition variants) has a subsection on the following pages, which lists the pretoken kind and provides a table of that pretoken kind's attributes.

In most cases an attribute value is "captured" by a named definition from the grammar:

  • If an attributes table entry says "from NONTERMINAL", the attribute's value is the sequence of characters consumed by that nonterminal, which will appear in one of the pretoken nonterminal's subexpressions (possibly via the definitions of additional nonterminals).

  • Some attributes table entries list multiple nonterminals, e.g. "from NONTERMINAL1 or NONTERMINAL2". In these cases the grammar ensures that at most one of those nonterminals may be matched, so that the attribute is unambiguously defined.

  • If no listed nonterminal is matched (which can happen if they all appear before ? or inside choice expressions), the attribute's value is none. The table says "(may be none)" in these cases.

If for any input the above rules don't result in a unique well-defined attribute value, it's a bug in this specification.

In other cases the attributes table entry defines the attribute value explicitly, depending on the characters consumed by the pretoken nonterminal or on which subexpression of the pretoken nonterminal matched.
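
The "from NONTERMINAL" rule can be sketched as a lookup over the text captured during a match. This is only an illustration under stated assumptions: `Captures`, `AmbiguousAttribute` and `attribute_from` are hypothetical names, the matcher is assumed to have recorded one entry per named nonterminal that participated in the match, and finding more than one listed nonterminal is reported as an error, corresponding to what the text above calls a bug in the specification.

```rust
/// Text consumed by each named nonterminal during one successful match.
pub struct Captures<'a> {
    entries: Vec<(&'static str, &'a str)>,
}

/// More than one of the listed nonterminals was matched.
#[derive(Debug, PartialEq)]
pub struct AmbiguousAttribute;

impl<'a> Captures<'a> {
    pub fn new(entries: Vec<(&'static str, &'a str)>) -> Self {
        Self { entries }
    }

    /// Resolves an attributes table entry such as "from A or B".
    /// Returns `Ok(None)` when no listed nonterminal was matched,
    /// which is the "(may be none)" case.
    pub fn attribute_from(
        &self,
        nonterminals: &[&str],
    ) -> Result<Option<&'a str>, AmbiguousAttribute> {
        let mut found = self
            .entries
            .iter()
            .filter(|entry| nonterminals.contains(&entry.0))
            .map(|entry| entry.1);
        match (found.next(), found.next()) {
            // Zero or one capture: the attribute is well defined.
            (first, None) => Ok(first),
            // Two or more captures: the value is not unique.
            _ => Err(AmbiguousAttribute),
        }
    }
}
```

For example, with a single recorded capture `("NONTERMINAL1", "abc")`, the call `attribute_from(&["NONTERMINAL1", "NONTERMINAL2"])` returns `Ok(Some("abc"))`; with no captures recorded it returns `Ok(None)`.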