List of reprocessing cases - Writeup of Rust's lexer

Reserved
Whitespace
LineComment
BlockComment
Punctuation
Identifier
RawIdentifier
LifetimeOrLabel
RawLifetimeOrLabel
SingleQuoteLiteral
DoubleQuoteLiteral
RawDoubleQuoteLiteral
IntegerDecimalLiteral
IntegerHexadecimalLiteral
IntegerOctalLiteral
IntegerBinaryLiteral
FloatLiteral

The list of reprocessing cases

The list below has an entry for each kind of pretoken, describing what kind of fine-grained token it produces, how the fine-grained token's attributes are determined, and the circumstances under which a pretoken is rejected.

When an attribute value is given below as "copied", it has the same value as the pretoken's attribute with the same name.

`Reserved`

A Reserved pretoken is always rejected.

`Whitespace`

Fine-grained token kind produced: Whitespace

A Whitespace pretoken is always accepted.

`LineComment`

Fine-grained token kind produced: LineComment

Attributes

style and body are determined from the pretoken's comment content as follows:

if the comment content begins with //:
- style is non-doc
- body is empty
otherwise, if the comment content begins with /,
- style is outer doc
- body is the characters from the comment content after that /
otherwise, if the comment content begins with !,
- style is inner doc
- body is the characters from the comment content after that !
otherwise
- style is non-doc
- body is empty

The pretoken is rejected if (and only if) the resulting body includes a CR character.

Note: the body of a non-doc comment is ignored by the rest of the compilation process
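The rules above can be sketched in Rust as follows. This is a minimal illustration rather than the writeup's own implementation: the names `CommentStyle` and `reprocess_line_comment` are hypothetical, and the sketch assumes the comment content is the text following the leading `//`.

```rust
/// Style of a line comment, using the names from the list above.
#[derive(Debug, PartialEq, Eq)]
enum CommentStyle {
    NonDoc,
    OuterDoc,
    InnerDoc,
}

/// Hypothetical helper. `content` is assumed to be the comment content,
/// i.e. the characters after the leading `//`.
/// Returns `None` when the pretoken would be rejected.
fn reprocess_line_comment(content: &str) -> Option<(CommentStyle, &str)> {
    let (style, body) = if content.starts_with("//") {
        // Four or more slashes in total: an ordinary comment.
        (CommentStyle::NonDoc, "")
    } else if let Some(rest) = content.strip_prefix('/') {
        (CommentStyle::OuterDoc, rest)
    } else if let Some(rest) = content.strip_prefix('!') {
        (CommentStyle::InnerDoc, rest)
    } else {
        (CommentStyle::NonDoc, "")
    };
    // Rejection depends on the resulting body, not on the raw content.
    if body.contains('\r') {
        None
    } else {
        Some((style, body))
    }
}
```

Because a non-doc comment always has an empty body, this sketch only ever rejects doc-comments.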

`BlockComment`

Fine-grained token kind produced: BlockComment

Attributes

style and body are determined from the pretoken's comment content as follows:

if the comment content begins with **:
- style is non-doc
- body is empty
otherwise, if the comment content begins with * and contains at least one further character,
- style is outer doc
- body is the characters from the comment content after that *
otherwise, if the comment content begins with !,
- style is inner doc
- body is the characters from the comment content after that !
otherwise
- style is non-doc
- body is empty

The pretoken is rejected if (and only if) the resulting body includes a CR character.

Note: it follows that /**/ and /***/ are not doc-comments

Note: the body of a non-doc comment is ignored by the rest of the compilation process
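A minimal Rust sketch of the block-comment rules, under the assumption that the comment content is the text between the opening `/*` and the closing `*/`. The names `BlockCommentStyle` and `reprocess_block_comment` are hypothetical and used only for illustration.

```rust
/// Style of a block comment, using the names from the list above.
#[derive(Debug, PartialEq, Eq)]
enum BlockCommentStyle {
    NonDoc,
    OuterDoc,
    InnerDoc,
}

/// Hypothetical helper. `content` is assumed to be the comment content,
/// i.e. the characters between `/*` and `*/`.
/// Returns `None` when the pretoken would be rejected.
fn reprocess_block_comment(content: &str) -> Option<(BlockCommentStyle, &str)> {
    let (style, body) = if content.starts_with("**") {
        (BlockCommentStyle::NonDoc, "")
    } else if content.starts_with('*') && content.len() > 1 {
        // `*` is one byte long, so slicing at 1 is a valid char boundary.
        (BlockCommentStyle::OuterDoc, &content[1..])
    } else if let Some(rest) = content.strip_prefix('!') {
        (BlockCommentStyle::InnerDoc, rest)
    } else {
        (BlockCommentStyle::NonDoc, "")
    };
    if body.contains('\r') {
        None
    } else {
        Some((style, body))
    }
}
```

With this sketch, the content of `/**/` (empty) and of `/***/` (a single `*`) both fall through to the final non-doc case.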

`Punctuation`

Fine-grained token kind produced: Punctuation

A Punctuation pretoken is always accepted.

Attributes

mark: copied

`Identifier`

Fine-grained token kind produced: Identifier

An Identifier pretoken is always accepted.

Attributes

represented identifier: NFC-normalised form of the pretoken's identifier

`RawIdentifier`

Fine-grained token kind produced: RawIdentifier

Attributes

represented identifier: NFC-normalised form of the pretoken's identifier

The pretoken is rejected if (and only if) the represented identifier is one of the following sequences of characters:

_
crate
self
super
Self
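
The rejection check can be sketched as a simple membership test. The names below are hypothetical, and the sketch assumes its argument has already been NFC-normalised (normalisation itself is not shown, since it needs Unicode data outside the standard library).

```rust
/// Names that may not be written as raw identifiers, per the list above.
const FORBIDDEN_RAW_NAMES: [&str; 5] = ["_", "crate", "self", "super", "Self"];

/// Hypothetical helper. `represented_identifier` is assumed to be the
/// already NFC-normalised identifier, without the `r#` prefix.
fn raw_identifier_is_accepted(represented_identifier: &str) -> bool {
    !FORBIDDEN_RAW_NAMES.contains(&represented_identifier)
}
```

The comparison is against the whole identifier, so a name that merely starts with one of the listed sequences is unaffected.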

`LifetimeOrLabel`

Fine-grained token kind produced: LifetimeOrLabel

A LifetimeOrLabel pretoken is always accepted.

Attributes

name: copied

Note that the name is not NFC-normalised. See NFC normalisation for lifetime/label.

`RawLifetimeOrLabel`

Fine-grained token kind produced: RawLifetimeOrLabel

The pretoken is rejected if (and only if) the name is one of the following sequences of characters:

_
crate
self
super
Self

Attributes

name: copied

Note that the name is not NFC-normalised. See NFC normalisation for lifetime/label.

`SingleQuoteLiteral`

The pretokeniser guarantees the pretoken's prefix attribute is one of the following:

empty, in which case it is reprocessed as described under Character literal
the single character b, in which case it is reprocessed as described under Byte literal.

In either case, the pretoken is rejected if its suffix consists of the single character _.

Character literal

Fine-grained token kind produced: CharacterLiteral

Attributes

The represented character is derived from the pretoken's literal content as follows:

If the literal content is one of the following forms of escape sequence, the represented character is the escape sequence's escaped value:
If the literal content begins with a \ character which did not introduce one of the above forms of escape, the pretoken is rejected.
Otherwise, if the single character that makes up the literal content is LF, CR, or TAB, the pretoken is rejected.
Otherwise the represented character is the single character that makes up the literal content.

suffix: copied

Note: The pretokeniser guarantees the pretoken's literal content is either a single character, or a character sequence beginning with \.

Byte literal

Fine-grained token kind produced: ByteLiteral

Attributes

Define a represented character, derived from the pretoken's literal content as follows:

If the literal content is one of the following forms of escape sequence, the represented character is the escape sequence's escaped value:
- Simple escapes
- 8-bit escapes
If the literal content begins with a \ character which did not introduce one of the above forms of escape, the pretoken is rejected.
Otherwise, if the single character that makes up the literal content is LF, CR, or TAB, the pretoken is rejected.
Otherwise, if the single character that makes up the literal content has a Unicode scalar value greater than 127, the pretoken is rejected.
Otherwise the represented character is the single character that makes up the literal content.

The represented byte is the represented character's Unicode scalar value.

suffix: copied

Note: The pretokeniser guarantees the pretoken's literal content is either a single character, or a character sequence beginning with \.
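
The byte-literal rules can be sketched as below. The function name `represented_byte` is hypothetical. The exact forms of simple and 8-bit escapes are not defined in this section, so the sketch assumes Rust's usual set: `\0`, `\t`, `\n`, `\r`, `\"`, `\'`, `\\`, and `\x` followed by exactly two hexadecimal digits.

```rust
/// Hypothetical helper. `content` is the literal content of a byte literal
/// (for b'a', the single character a). Returns the represented byte, or
/// `None` when the pretoken would be rejected.
fn represented_byte(content: &str) -> Option<u8> {
    if let Some(escape) = content.strip_prefix('\\') {
        // Simple escapes (assumed set, see the lead-in).
        let simple = match escape {
            "0" => Some(0x00u8),
            "t" => Some(0x09),
            "n" => Some(0x0A),
            "r" => Some(0x0D),
            "\"" => Some(0x22),
            "'" => Some(0x27),
            "\\" => Some(0x5C),
            _ => None,
        };
        if simple.is_some() {
            return simple;
        }
        // 8-bit escapes: \x and exactly two hexadecimal digits (assumed form).
        let hex = escape.strip_prefix('x')?;
        if hex.len() == 2 && hex.bytes().all(|b| b.is_ascii_hexdigit()) {
            return u8::from_str_radix(hex, 16).ok();
        }
        // A backslash that did not introduce a permitted escape.
        return None;
    }
    let mut chars = content.chars();
    let c = chars.next()?;
    if chars.next().is_some() {
        // Not a single character: outside what the pretokeniser guarantees.
        return None;
    }
    if matches!(c, '\n' | '\r' | '\t') || (c as u32) > 127 {
        return None;
    }
    // The represented byte is the character's Unicode scalar value.
    Some(c as u8)
}
```

Note that a Unicode escape such as `\u{41}` is rejected here, because it is not among the escape forms permitted for byte literals.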

`DoubleQuoteLiteral`

The pretokeniser guarantees the pretoken's prefix attribute is one of the following:

empty, in which case it is reprocessed as described under String literal
the single character b, in which case it is reprocessed as described under Byte-string literal
the single character c, in which case it is reprocessed as described under C-string literal

In each case, the pretoken is rejected if its suffix consists of the single character _.

String literal

Fine-grained token kind produced: StringLiteral

Attributes

The represented string is derived from the pretoken's literal content by replacing each escape sequence of any of the following forms occurring in the literal content with the escape sequence's escaped value.

These replacements take place in left-to-right order. For example, the pretoken with extent "\\x41" is converted to the characters \ x 4 1.

If a \ character appears in the literal content but is not part of one of the above forms of escape, the pretoken is rejected.

If a CR character appears in the literal content and is not part of a string continuation escape, the pretoken is rejected.

suffix: copied

See Wording for string unescaping

Byte-string literal

Fine-grained token kind produced: ByteStringLiteral

If any character whose Unicode scalar value is greater than 127 appears in the literal content, the pretoken is rejected.

Attributes

Define a represented string (a sequence of characters) derived from the pretoken's literal content by replacing each escape sequence of any of the following forms occurring in the literal content with the escape sequence's escaped value.

These replacements take place in left-to-right order. For example, the pretoken with extent b"\\x41" is converted to the characters \ x 4 1.

If a \ character appears in the literal content but is not part of one of the above forms of escape, the pretoken is rejected.

If a CR character appears in the literal content and is not part of a string continuation escape, the pretoken is rejected.

The represented bytes are the sequence of Unicode scalar values of the characters in the represented string.

suffix: copied

See Wording for string unescaping

C-string literal

Fine-grained token kind produced: CStringLiteral

Attributes

The pretoken's literal content is treated as a sequence of items, each of which is either a single Unicode character other than \ or an escape.

The sequence of items is converted to the represented bytes as follows:

Each single Unicode character contributes its UTF-8 representation.
Each simple escape contributes a single byte containing the Unicode scalar value of its escaped value.
Each 8-bit escape contributes a single byte containing the Unicode scalar value of its escaped value.
Each Unicode escape contributes the UTF-8 representation of its escaped value.
Each string continuation escape contributes no bytes.

If a \ character appears in the literal content but is not part of one of the above forms of escape, the pretoken is rejected.

If a CR character appears in the literal content and is not part of a string continuation escape, the pretoken is rejected.

If any of the resulting represented bytes have value 0, the pretoken is rejected.

suffix: copied

See Wording for string unescaping
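
The conversion from items to represented bytes can be sketched as follows. The `Item` enum and `c_string_bytes` are hypothetical names. The sketch assumes an earlier step has already split the literal content into items and rejected any stray `\`, and that every simple escape's value fits in one byte.

```rust
/// One item of a C-string literal's content (hypothetical representation).
enum Item {
    /// A single Unicode character other than `\`.
    Char(char),
    /// Escaped value of a simple escape (assumed to fit in one byte).
    SimpleEscape(u8),
    /// Escaped value of an 8-bit escape.
    EightBitEscape(u8),
    /// Escaped value of a Unicode escape.
    UnicodeEscape(char),
    /// A string continuation escape.
    StringContinuation,
}

/// Returns the represented bytes, or `None` when the pretoken would be rejected.
fn c_string_bytes(items: &[Item]) -> Option<Vec<u8>> {
    let mut bytes = Vec::new();
    let mut buf = [0u8; 4];
    for item in items {
        match item {
            // A CR outside a string continuation escape is rejected.
            Item::Char('\r') => return None,
            // Plain characters and Unicode escapes contribute UTF-8.
            Item::Char(c) | Item::UnicodeEscape(c) => {
                bytes.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
            }
            // Simple and 8-bit escapes contribute exactly one byte.
            Item::SimpleEscape(b) | Item::EightBitEscape(b) => bytes.push(*b),
            Item::StringContinuation => {}
        }
    }
    // No represented byte may be zero.
    if bytes.contains(&0) {
        None
    } else {
        Some(bytes)
    }
}
```

The difference between the two escape families shows up for values above 127: an 8-bit escape with value 0xE9 yields the single byte 0xE9, while a Unicode escape for U+00E9 yields the two bytes 0xC3 0xA9.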

`RawDoubleQuoteLiteral`

The pretokeniser guarantees the pretoken's prefix attribute is one of the following:

the single character r, in which case it is reprocessed as described under Raw string literal
the characters br, in which case it is reprocessed as described under Raw byte-string literal
the characters cr, in which case it is reprocessed as described under Raw C-string literal

In each case, the pretoken is rejected if its suffix consists of the single character _.

If a CR character appears in the literal content, the pretoken is rejected.

If any of the resulting represented bytes have value 0, the pretoken is rejected.

`IntegerDecimalLiteral`

Fine-grained token kind produced: IntegerLiteral

The pretoken is rejected if (and only if) its digits attribute consists entirely of _ characters.

Attributes

base: decimal

digits: copied

suffix: copied

Note: in particular, an IntegerDecimalLiteral whose digits is empty is rejected.

`IntegerHexadecimalLiteral`

Fine-grained token kind produced: IntegerLiteral

The pretoken is rejected if (and only if) its digits attribute consists entirely of _ characters.

Attributes

base: hexadecimal

digits: copied

suffix: copied

Note: in particular, an IntegerHexadecimalLiteral whose digits is empty is rejected.

`IntegerOctalLiteral`

Fine-grained token kind produced: IntegerLiteral

The pretoken is rejected if (and only if):

its digits attribute consists entirely of _ characters; or
its digits attribute contains any character other than 0, 1, 2, 3, 4, 5, 6, 7, or _.

Attributes

base: octal

digits: copied

suffix: copied

Note: in particular, an IntegerOctalLiteral whose digits is empty is rejected.

`IntegerBinaryLiteral`

Fine-grained token kind produced: IntegerLiteral

The pretoken is rejected if (and only if):

its digits attribute consists entirely of _ characters; or
its digits attribute contains any character other than 0, 1, or _.

Attributes

base: binary

digits: copied

suffix: copied

Note: in particular, an IntegerBinaryLiteral whose digits is empty is rejected.
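
The rejection conditions for the four integer pretoken kinds can be sketched in one function. The names `Base` and `integer_digits_accepted` are hypothetical; `digits` stands for the pretoken's digits attribute.

```rust
/// Base of an integer literal, mirroring the base attribute values above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Base {
    Binary,
    Octal,
    Decimal,
    Hexadecimal,
}

/// Hypothetical helper: returns true when a pretoken with the given base
/// and digits attribute would be accepted.
fn integer_digits_accepted(base: Base, digits: &str) -> bool {
    // `all` is true for an empty iterator, so empty digits are rejected too.
    if digits.chars().all(|c| c == '_') {
        return false;
    }
    match base {
        Base::Binary => digits.chars().all(|c| matches!(c, '0' | '1' | '_')),
        Base::Octal => digits.chars().all(|c| matches!(c, '0'..='7' | '_')),
        // No further per-character check is listed for these two kinds.
        Base::Decimal | Base::Hexadecimal => true,
    }
}
```

The per-character checks exist only for octal and binary, which suggests the pretokeniser can produce those pretokens with out-of-range decimal digits and leaves the rejection to this step.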

`FloatLiteral`

Fine-grained token kind produced: FloatLiteral

The pretoken is rejected if (and only if):

its "has base" attribute is true; or
its exponent digits attribute is a character sequence which consists entirely of _ characters.

Attributes

body: copied

suffix: copied

Note: in particular, a FloatLiteral whose exponent digits is empty is rejected.
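
A minimal sketch of the FloatLiteral check. The function name is hypothetical, and the sketch assumes the exponent digits attribute is either absent (modelled as `None`) or a character sequence (modelled as `Some`), which is how the wording above reads.

```rust
/// Hypothetical helper: returns true when a FloatLiteral pretoken with
/// these attribute values would be accepted.
fn float_literal_accepted(has_base: bool, exponent_digits: Option<&str>) -> bool {
    if has_base {
        return false;
    }
    match exponent_digits {
        // An empty sequence also consists "entirely of _ characters",
        // because `all` is true for an empty iterator.
        Some(digits) => !digits.chars().all(|c| c == '_'),
        // No exponent at all: nothing to reject.
        None => true,
    }
}
```

The distinction between an absent exponent and an empty one matters here: only the latter leads to rejection.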