Open questions

Terminology
Pattern notation
Rule priority
Token kinds and attributes
Defining the block-comment constraint
Wording for string unescaping
How to model shebang removal
String continuation escapes

Terminology

Some of the terms used in this document are taken from pre-existing documentation or rustc's error output, but many of them are new (and so can freely be changed).

Here's a partial list:

Term	Source
pretoken	New
reprocessing	New
fine-grained token	New
compound token	New
literal content	Reference (recent)
simple escape	Reference (recent)
escape sequence	Reference
escaped value	Reference (recent)
string continuation escape	Reference (as `STRING_CONTINUE`)
string representation	Reference (recent)
represented byte	New
represented character	Reference (recent)
represented bytes	Reference (recent)
represented string	Reference (recent)
represented identifier	New
style (of a comment)	rustc internal
body (of a comment)	Reference

Terms listed as "Reference (recent)" are ones I introduced in PRs merged in January 2024, so it's not very likely that they've been picked up more widely.

Pattern notation

This document relies on the regex crate for its pattern notation.

This is convenient for checking that the writeup agrees with the comparable implementation, but it's presumably not suitable for the spec.

The usual thing for specs seems to be to define their own notation from scratch.

Requirements for patterns

I've tried to keep the patterns used here as simple as possible.

There's no use of non-greedy matches.

I think all the uses of alternation are obviously unambiguous.

In particular, all uses of alternation inside repetition have disjoint sets of accepted first characters.

I believe all uses of repetition in the unconstrained patterns have unambiguous termination. That is, anything permitted to follow the repeatable section would not be permitted to start a new repetition. In these cases, the distinction between greedy and non-greedy matches doesn't matter.

Naming sub-patterns

The patterns used in this document are inconveniently repetitive, particularly for the edition-specific rule variants and for numeric literals.

Of course the usual thing is to have a way to define reusable named sub-patterns. So I think addressing this should be treated as part of choosing a pattern notation.

Rule priority

At present this document gives the pretokenisation rules explicit priorities, used to determine which rule is chosen in positions where more than one rule matches.

I believe that in almost all cases it would be equivalent to say that the rule which matches the longest extent is chosen (in particular, if multiple rules match then one has a longer extent than any of the others).

See Integer literal base-vs-suffix ambiguity and Exponent-vs-suffix ambiguity below for the exceptions.

This document uses the order in which the rules are presented as the priority, which has the downside of forcing an unnatural presentation order (for example, Raw identifier and Non-raw identifier are separated).

Perhaps it would be better to say that longest-extent is the primary way to disambiguate, and add a secondary principle to cover the exceptional cases.

The comparable implementation reports (as "model error") any cases (other than the exceptions described below) where the priority principle doesn't agree with the longest-extent principle, or where there wasn't a unique longest match.
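
A minimal sketch of the kind of check described above, assuming each rule can be modelled as a function returning the length of its match. The names `Rule`, `Choice`, `choose`, `raw_ident` and `ident` are hypothetical and are not taken from the comparable implementation; the two toy rules only approximate the real Raw identifier and Non-raw identifier rules.

```rust
/// A hypothetical rule: returns the length in bytes of its match at the
/// start of `input`, or `None` if it does not match.
type Rule = fn(&str) -> Option<usize>;

#[derive(Debug, PartialEq)]
enum Choice {
    /// No rule matched.
    NoMatch,
    /// The highest-priority match is also the unique longest match.
    Agreed { rule: usize, len: usize },
    /// The two principles disagree, or the longest match is not unique.
    ModelError,
}

/// Applies every rule (earliest in the slice has highest priority) and
/// compares the priority principle with the longest-extent principle.
fn choose(rules: &[Rule], input: &str) -> Choice {
    let matches: Vec<(usize, usize)> = rules
        .iter()
        .enumerate()
        .filter_map(|(i, rule)| rule(input).map(|len| (i, len)))
        .collect();
    let Some(&(rule, len)) = matches.first() else {
        return Choice::NoMatch;
    };
    let longest = matches.iter().map(|&(_, l)| l).max().unwrap_or(0);
    let unique = matches.iter().filter(|&&(_, l)| l == longest).count() == 1;
    if len == longest && unique {
        Choice::Agreed { rule, len }
    } else {
        Choice::ModelError
    }
}

/// Toy non-raw identifier rule (ASCII only, for illustration).
fn ident(input: &str) -> Option<usize> {
    let n = input
        .chars()
        .take_while(|c| c.is_ascii_alphanumeric() || *c == '_')
        .count();
    (n > 0).then_some(n)
}

/// Toy raw identifier rule: `r#` followed by a toy identifier.
fn raw_ident(input: &str) -> Option<usize> {
    let rest = input.strip_prefix("r#")?;
    ident(rest).map(|n| n + 2)
}
```

With the rules in the order `[raw_ident, ident]`, the input `r#foo` gives agreement on the raw identifier rule. Swapping the order makes the toy `ident` rule win on priority with a shorter extent, which this sketch reports as a model error.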

Integer literal base-vs-suffix ambiguity

The Reference's lexer rules for input such as 0x3 allow two interpretations, matching the same extent:

as a hexadecimal integer literal: 0x3 with no suffix
as a decimal integer literal: 0 with a suffix of x3

If the output of the lexer is simply a token with a kind and an extent, this isn't a problem: both cases have the same kind.

But if we want to make the lexer responsible for identifying which part of the input is the suffix, we need to make sure it gets the right answer (ie, the one with no suffix).

Further, there are cases where we need to reject input which matches the rule for a decimal integer literal 0 with a suffix, for example 0b1e2, 0b0123 (see rfc0879), or 0x·.

(Note that · has the XID_Continue property but not XID_Start.)

In these cases we can't avoid dealing with the base-vs-suffix ambiguity in the lexer.

This model uses a separate rule for integer decimal literals, with lower priority than all other numeric literals, to make sure we get these results.

Note that in the 0x· example the extent matched by the lower priority rule is longer than the extent matched by the chosen rule.

If relying on priorities like this seems undesirable, I think it would be possible to rework the rules to avoid it. It might work to allow the difficult cases to pretokenise as decimal integer literals, and have reprocessing reject decimal literal pretokens which begin with a base indicator.
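
A rough sketch of what that alternative reprocessing check might look like, assuming the valid non-decimal forms (such as 0x3) have already been claimed by their own rules. The function names are hypothetical, and the splitting is simplified: underscores are not handled and the suffix is simply whatever follows the digits.

```rust
/// Splits a decimal-literal pretoken's extent into (digits, suffix).
/// Simplified: no underscores, and no check that the suffix is well formed.
fn split_decimal(extent: &str) -> (&str, &str) {
    let end = extent
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(extent.len());
    extent.split_at(end)
}

/// Hypothetical reprocessing step: rejects a decimal literal pretoken whose
/// extent begins with a base indicator (`0b`, `0o` or `0x`).
fn reprocess_decimal(extent: &str) -> Result<(&str, &str), &'static str> {
    let (digits, suffix) = split_decimal(extent);
    let base_indicator = suffix.starts_with(|c: char| matches!(c, 'b' | 'o' | 'x'));
    if digits == "0" && base_indicator {
        return Err("decimal literal begins with a base indicator");
    }
    Ok((digits, suffix))
}
```

Under these assumptions the difficult inputs 0b1e2, 0b0123 and 0x· are all rejected, while 0u8 is still accepted as the digits 0 with the suffix u8.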

Exponent-vs-suffix ambiguity

Now that numeric literal suffixes can begin with e or E, many cases of float literals with an exponent could also be interpreted as integer or float literals with a suffix, for example 123e4 or 123.4e5.

This model gives the rules for float literals with an exponent higher priority than any other rules for numeric literals, to make sure we get the desired result.

Note that there are again examples where the extent matched by the lower priority rule is longer than the extent matched by the chosen rule. For example 1e2· could be interpreted as an integer decimal literal with suffix e2·, but instead we find the float literal 1e2 and then reject the remainder of the input.
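
The 1e2· example can be made concrete with two toy rules. This is only a sketch: the function names are hypothetical, the float rule ignores fractions and exponent signs, and `is_continue` is a crude stand-in for the XID_Continue property (real code would use Unicode tables).

```rust
/// Crude stand-in for XID_Continue (an assumption made for this sketch).
/// U+00B7 (·) is listed explicitly because it is not alphanumeric.
fn is_continue(c: char) -> bool {
    c.is_alphanumeric() || c == '_' || c == '\u{b7}'
}

/// Length in bytes of the leading run of ASCII decimal digits.
fn digits(s: &str) -> usize {
    s.bytes().take_while(u8::is_ascii_digit).count()
}

/// Toy "float literal with exponent" rule: digits, `e` or `E`, digits.
fn float_with_exponent(s: &str) -> Option<usize> {
    let int = digits(s);
    let rest = &s[int..];
    let rest = rest.strip_prefix('e').or_else(|| rest.strip_prefix('E'))?;
    let exp = digits(rest);
    (int > 0 && exp > 0).then_some(int + 1 + exp)
}

/// Toy "integer decimal literal" rule: digits, then an optional suffix.
fn decimal_with_suffix(s: &str) -> Option<usize> {
    let int = digits(s);
    let suffix: usize = s[int..]
        .chars()
        .take_while(|&c| is_continue(c))
        .map(char::len_utf8)
        .sum();
    (int > 0).then_some(int + suffix)
}
```

For 123e4 both toy rules match the same five bytes. For 1e2· the float rule matches three bytes and the decimal rule matches five (· occupies two bytes in UTF-8), so giving the float rule priority selects the shorter extent.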

Token kinds and attributes

What kinds and attributes should fine-grained tokens have?

Distinguishing raw and non-raw forms

The current table distinguishes raw from non-raw forms as different top-level "kinds".

I think this distinction will be needed in some cases, but perhaps it would be better represented using an attribute on unified kinds (like rustc_ast::StrStyle and rustc_ast::token::IdentIsRaw).

As an example of where it might be wanted: the proc-macro Display implementation for identifiers includes the r# prefix for raw identifiers, but I think simply using the source extent isn't correct, because the Display output is NFC-normalised.

Hash count

Should there be an attribute recording the number of hashes in a raw string or byte-string literal? Rustc has something of the sort.
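
One possible shape for unified kinds with a raw-ness attribute and a hash count, as a sketch only. The types are loosely modelled on the rustc types mentioned above, but every name here (`IdentIsRaw`, `StrStyle`, `FineGrainedToken`, `render_identifier`) is hypothetical and not part of this document's current table.

```rust
/// Whether an identifier was written in raw form (hypothetical attribute).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum IdentIsRaw {
    No,
    Yes,
}

/// How a string literal was written (hypothetical attribute).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StrStyle {
    NonRaw,
    /// Raw, recording the number of `#` characters used on each side.
    Raw { hashes: u8 },
}

/// Two unified kinds carrying attributes instead of separate raw kinds.
#[derive(Debug, Clone, PartialEq, Eq)]
enum FineGrainedToken {
    Identifier { represented_identifier: String, is_raw: IdentIsRaw },
    StringLiteral { represented_string: String, style: StrStyle },
}

/// Renders an identifier with the `r#` prefix restored for raw forms, using
/// the represented identifier rather than the source extent.
fn render_identifier(token: &FineGrainedToken) -> Option<String> {
    match token {
        FineGrainedToken::Identifier { represented_identifier, is_raw } => {
            let prefix = match is_raw {
                IdentIsRaw::Yes => "r#",
                IdentIsRaw::No => "",
            };
            Some(format!("{prefix}{represented_identifier}"))
        }
        _ => None,
    }
}
```

With this shape, a consumer that needs the r# prefix can rebuild it from the attribute, and one that doesn't care can ignore the attribute and look only at the kind.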

ASCII identifiers

Should there be an attribute indicating whether an identifier is all ASCII? The Reference lists several places where identifiers have this restriction, and it seems natural for the lexer to be responsible for making this check.

The list in the Reference is:

extern crate declarations
External crate names referenced in a path
Module names loaded from the filesystem without a path attribute
no_mangle attributed items
Item names in external blocks

I believe this restriction is applied after NFC-normalisation, so it's best thought of as a restriction on the represented identifier.

Represented bytes for C strings

At present this document says that the sequence of "represented bytes" for C string literals doesn't include the added NUL.

That's following the way the Reference currently uses the term "represented bytes", but rustc includes the NUL in its equivalent piece of data.
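
The standard library's `CStr` exposes both views, which may be a convenient way to picture the two candidate definitions. This is an illustration of the library API only, not of how rustc stores the data; the helper name `both_views` is hypothetical.

```rust
use std::ffi::CStr;

/// Returns (bytes without the NUL, bytes with the NUL) for a NUL-terminated
/// byte string, or `None` if the input is not a valid C string.
fn both_views(bytes_with_nul: &[u8]) -> Option<(&[u8], &[u8])> {
    let c = CStr::from_bytes_with_nul(bytes_with_nul).ok()?;
    Some((c.to_bytes(), c.to_bytes_with_nul()))
}
```

For the bytes of `abc` followed by NUL, the first view has three bytes (matching this document's current use of "represented bytes") and the second has four.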

Defining the block-comment constraint

This document currently uses imperative Rust code to define the Block comment constraint (ie, to say that /* and */ must be properly nested inside a candidate comment).

It would be nice to do better; the options might depend on what pattern notation is chosen.

I don't think there's any very elegant way to describe the constraint in English (note that the constraint is asymmetrical; for example /* /*/ /*/ */ is rejected).

Perhaps the natural continuation of this writeup's approach would be to define a mini-tokeniser to use inside the constraint, but that would be a lot of words for a small part of the spec.

Or perhaps this part could borrow some definitions from whatever formalisation the spec ends up using for Rust's grammar, and use the traditional sort of context-free-grammar approach.
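
For comparison with those options, here is one way the constraint could be written imperatively. This is a sketch written for this section, not the code used elsewhere in the document; the function name is hypothetical, and it checks a whole candidate comment including its outer delimiters.

```rust
/// Returns true if `candidate` starts with `/*`, its `/*` and `*/`
/// delimiters are properly nested, and the outermost comment closes exactly
/// at the end of the candidate.
fn is_well_nested_block_comment(candidate: &str) -> bool {
    let bytes = candidate.as_bytes();
    if !bytes.starts_with(b"/*") {
        return false;
    }
    let mut depth = 0usize;
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i..].starts_with(b"/*") {
            // An opening delimiter is consumed whole, so the `*` in `/*/`
            // cannot also begin a closing delimiter.
            depth += 1;
            i += 2;
        } else if bytes[i..].starts_with(b"*/") {
            // depth is at least 1 here: the scan starts with `/*` and
            // returns as soon as depth reaches 0.
            depth -= 1;
            i += 2;
            if depth == 0 {
                return i == bytes.len();
            }
        } else {
            i += 1;
        }
    }
    // Ran out of input with the comment still open.
    false
}
```

The left-to-right scan is what makes the constraint asymmetrical: in /* /*/ /*/ */ each /*/ is read as an opening delimiter followed by a stray /, so the depth never returns to zero.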

Wording for string unescaping

The description of reprocessing for String literals and C-string literals was originally drafted for the Reference. Should there be a more formal definition of unescaping processes than the current "left-to-right order" and "contributes" wording?

I believe that any literal content which will be accepted can be written uniquely as a sequence of (escape-sequence or non-\-character), but I'm not sure that's obvious enough that it can be stated without justification.
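
The decomposition can be sketched as a left-to-right scan in which every \ must begin an escape sequence. The sketch below is deliberately incomplete: it recognises only the simple escapes and 7-bit escapes of the form \xHH, and omits unicode escapes and string continuation escapes. The names `Piece`, `split_pieces` and `escape_len` are hypothetical.

```rust
#[derive(Debug, PartialEq)]
enum Piece<'a> {
    /// A single character other than `\`.
    Char(char),
    /// An escape sequence, including its leading `\`.
    Escape(&'a str),
}

/// Length in bytes of the escape sequence at the start of `s`, which must
/// begin with `\`. Simplified: simple escapes and 7-bit `\xHH` escapes only.
fn escape_len(s: &str) -> Option<usize> {
    let b = s.as_bytes();
    match *b.get(1)? {
        b'n' | b'r' | b't' | b'\\' | b'0' | b'\'' | b'"' => Some(2),
        b'x' if b.len() >= 4
            && matches!(b[2], b'0'..=b'7')
            && b[3].is_ascii_hexdigit() => Some(4),
        _ => None,
    }
}

/// Splits literal content into pieces, scanning left to right. Returns
/// `None` if some `\` does not begin a recognised escape sequence.
fn split_pieces(content: &str) -> Option<Vec<Piece<'_>>> {
    let mut pieces = Vec::new();
    let mut rest = content;
    while let Some(c) = rest.chars().next() {
        let len = if c == '\\' {
            let len = escape_len(rest)?;
            pieces.push(Piece::Escape(&rest[..len]));
            len
        } else {
            pieces.push(Piece::Char(c));
            c.len_utf8()
        };
        rest = &rest[len..];
    }
    Some(pieces)
}
```

In this sketch the split is unique because a \ can only ever start a piece: the content \\n (backslash, backslash, n) can only be read as the escape \\ followed by the character n.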

This is a place where the comparable implementation isn't closely parallel to the writeup.

How to model shebang removal

This part of the Reference text isn't trying to be rigorous:

As an exception, if the #! characters are followed (ignoring intervening comments or whitespace) by a [ token, nothing is removed. This prevents an inner attribute at the start of a source file being removed.

rustc implements the "ignoring intervening comments or whitespace" part by running its lexer for long enough to see whether the [ is there or not, then discarding the result (see #70528 and #71487 for history).

So should the spec define this in terms of its model of the lexer?
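
To make the question concrete, here is a rough sketch of the decision, using a hand-written stand-in for "run the lexer far enough to skip comments and whitespace". It is not rustc's implementation, and the names are hypothetical. It simplifies in several ways: block comments are not nested, doc comments are not distinguished from ordinary comments, and `trim_start` uses Unicode White_Space, which is an assumption rather than the lexer's own notion of whitespace.

```rust
/// Skips whitespace and comments at the start of `s`.
/// Simplified stand-in for running the real lexer (see the caveats above).
fn skip_trivia(mut s: &str) -> &str {
    loop {
        let t = s.trim_start();
        if let Some(rest) = t.strip_prefix("//") {
            // Line comment: continue after the end of the line.
            s = rest.split_once('\n').map_or("", |(_, after)| after);
        } else if let Some(rest) = t.strip_prefix("/*") {
            // Block comment (non-nested): continue after the first `*/`.
            match rest.split_once("*/") {
                Some((_, after)) => s = after,
                None => return t,
            }
        } else {
            return t;
        }
    }
}

/// Returns true if the source starts with `#!` and the shebang line should
/// be removed, that is, if the next thing after `#!` (ignoring whitespace
/// and comments) is not `[`.
fn shebang_is_removed(source: &str) -> bool {
    match source.strip_prefix("#!") {
        Some(tail) => !skip_trivia(tail).starts_with('['),
        None => false,
    }
}
```

Even this simplified version needs its own notion of comments and whitespace, which is the duplication that defining the rule in terms of the spec's lexer model would avoid.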

String continuation escapes

rustc has a warning that the behaviour of String continuation escapes (when multiple newlines are skipped) may change in future.

The Reference has a note about this, and points to #1042 for more information. Should the spec say anything?
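
For concreteness, this is the kind of literal the warning is about. The sketch relies on the current behaviour described in the Reference, where the continuation escape skips all following whitespace including further newlines; the function names are hypothetical.

```rust
/// A string literal whose continuation escape skips more than one newline.
/// rustc currently accepts this form, with a warning.
fn skips_two_newlines() -> &'static str {
    "a\

     b"
}

/// The same value written with a single skipped newline.
fn skips_one_newline() -> &'static str {
    "a\
     b"
}
```

At present both functions return the string ab; the first form is the one whose meaning might change.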