Table of contents
Reserved
Whitespace
LineComment
BlockComment
Punctuation
Identifier
RawIdentifier
LifetimeOrLabel
RawLifetimeOrLabel
SingleQuoteLiteral
DoubleQuoteLiteral
RawDoubleQuoteLiteral
IntegerDecimalLiteral
IntegerHexadecimalLiteral
IntegerOctalLiteral
IntegerBinaryLiteral
FloatLiteral
The list of reprocessing cases
The list below has an entry for each kind of pretoken, describing what kind of fine-grained token it produces, how the fine-grained token's attributes are determined, and the circumstances under which a pretoken is rejected.
When an attribute value is given below as "copied", it has the same value as the pretoken's attribute with the same name.
Reserved
A Reserved pretoken is always rejected.
Whitespace
Fine-grained token kind produced:
Whitespace
A Whitespace pretoken is always accepted.
LineComment
Fine-grained token kind produced:
LineComment
Attributes
style and body are determined from the pretoken's comment content as follows:
- if the comment content begins with //:
  - style is non-doc
  - body is empty
- otherwise, if the comment content begins with /:
  - style is outer doc
  - body is the characters from the comment content after that /
- otherwise, if the comment content begins with !:
  - style is inner doc
  - body is the characters from the comment content after that !
- otherwise:
  - style is non-doc
  - body is empty
The pretoken is rejected if (and only if) the resulting body includes a CR character.
Note: the body of a non-doc comment is ignored by the rest of the compilation process.
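The LineComment rules above can be sketched as a small function. This is a minimal illustration, not the reference implementation: the names `Style` and `reprocess_line_comment` are hypothetical, and the comment content is assumed to be the text following the initial `//` of the comment.

```rust
/// Comment style, named after the three styles in the list above.
#[derive(Debug, PartialEq)]
enum Style {
    NonDoc,
    OuterDoc,
    InnerDoc,
}

/// Hypothetical helper: returns the style and body for a LineComment,
/// or `None` when the pretoken is rejected.
fn reprocess_line_comment(content: &str) -> Option<(Style, &str)> {
    let (style, body) = if content.starts_with("//") {
        // Four or more slashes in total: not a doc comment.
        (Style::NonDoc, "")
    } else if let Some(rest) = content.strip_prefix('/') {
        (Style::OuterDoc, rest)
    } else if let Some(rest) = content.strip_prefix('!') {
        (Style::InnerDoc, rest)
    } else {
        (Style::NonDoc, "")
    };
    // Only the resulting body is checked for CR, so a CR inside a
    // non-doc comment (whose body is empty) does not cause rejection.
    if body.contains('\r') {
        None
    } else {
        Some((style, body))
    }
}
```

Under these assumptions, a content of `/ text` yields an outer doc comment with body ` text`, while a non-doc comment is accepted even if its content contains CR.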
BlockComment
Fine-grained token kind produced:
BlockComment
Attributes
style and body are determined from the pretoken's comment content as follows:
- if the comment content begins with **:
  - style is non-doc
  - body is empty
- otherwise, if the comment content begins with * and contains at least one further character:
  - style is outer doc
  - body is the characters from the comment content after that *
- otherwise, if the comment content begins with !:
  - style is inner doc
  - body is the characters from the comment content after that !
- otherwise:
  - style is non-doc
  - body is empty
The pretoken is rejected if (and only if) the resulting body includes a CR character.
Note: it follows that /**/ and /***/ are not doc-comments.
Note: the body of a non-doc comment is ignored by the rest of the compilation process.
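The BlockComment rules differ from the LineComment rules mainly in the "at least one further character" condition. The sketch below illustrates that edge case; `Style` and `reprocess_block_comment` are hypothetical names, and the comment content is assumed to be the text between the opening `/*` and the closing `*/`.

```rust
/// Comment style, named after the three styles in the list above.
#[derive(Debug, PartialEq)]
enum Style {
    NonDoc,
    OuterDoc,
    InnerDoc,
}

/// Hypothetical helper: returns the style and body for a BlockComment,
/// or `None` when the pretoken is rejected.
fn reprocess_block_comment(content: &str) -> Option<(Style, &str)> {
    let (style, body) = if content.starts_with("**") {
        (Style::NonDoc, "")
    } else if content.starts_with('*') && content.len() > 1 {
        // `*` is one byte long, so slicing at index 1 is a valid boundary.
        (Style::OuterDoc, &content[1..])
    } else if let Some(rest) = content.strip_prefix('!') {
        (Style::InnerDoc, rest)
    } else {
        (Style::NonDoc, "")
    };
    // The CR check applies to the resulting body only.
    if body.contains('\r') {
        None
    } else {
        Some((style, body))
    }
}
```

With an empty content (the comment `/**/`) or a content of just `*` (the comment `/***/`), the function falls through to the final branch and reports a non-doc comment.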
Punctuation
Fine-grained token kind produced:
Punctuation
A Punctuation pretoken is always accepted.
Attributes
mark: copied
Identifier
Fine-grained token kind produced:
Identifier
An Identifier pretoken is always accepted.
Attributes
represented identifier: NFC-normalised form of the pretoken's identifier
RawIdentifier
Fine-grained token kind produced:
RawIdentifier
Attributes
represented identifier: NFC-normalised form of the pretoken's identifier
The pretoken is rejected if (and only if) the represented identifier is one of the following sequences of characters:
- _
- crate
- self
- super
- Self
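As a minimal sketch, the rejection rule is a membership test against the five names listed above. The function name is hypothetical, and the argument is assumed to be already NFC-normalised (normalisation itself is not shown, since the Rust standard library does not provide it).

```rust
/// Hypothetical helper: `represented` is assumed to be the NFC-normalised
/// identifier of a RawIdentifier pretoken (the part after `r#`).
fn raw_identifier_is_rejected(represented: &str) -> bool {
    matches!(represented, "_" | "crate" | "self" | "super" | "Self")
}
```

The comparison is case-sensitive, so `Self` is rejected while `SELF` is not; other keywords, such as `fn`, are accepted.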
LifetimeOrLabel
Fine-grained token kind produced:
LifetimeOrLabel
A LifetimeOrLabel pretoken is always accepted.
Attributes
name: copied
Note that the name is not NFC-normalised. See NFC normalisation for lifetime/label.
RawLifetimeOrLabel
Fine-grained token kind produced:
RawLifetimeOrLabel
The pretoken is rejected if (and only if) the name is one of the following sequences of characters:
- _
- crate
- self
- super
- Self
Attributes
name: copied
Note that the name is not NFC-normalised. See NFC normalisation for lifetime/label.
SingleQuoteLiteral
The pretokeniser guarantees the pretoken's prefix attribute is one of the following:
- empty, in which case it is reprocessed as described under Character literal
- the single character b, in which case it is reprocessed as described under Byte literal.
In either case, the pretoken is rejected if its suffix consists of the single character _.
Character literal
Fine-grained token kind produced:
CharacterLiteral
Attributes
The represented character is derived from the pretoken's literal content as follows:
- If the literal content is one of the following forms of escape sequence, the represented character is the escape sequence's escaped value:
- If the literal content begins with a \ character which did not introduce one of the above forms of escape, the pretoken is rejected.
- Otherwise, if the single character that makes up the literal content is LF, CR, or TAB, the pretoken is rejected.
- Otherwise the represented character is the single character that makes up the literal content.
suffix: copied
Note: The pretokeniser guarantees the pretoken's literal content is either a single character, or a character sequence beginning with \.
Byte literal
Fine-grained token kind produced:
ByteLiteral
Attributes
Define a represented character, derived from the pretoken's literal content as follows:
- If the literal content is one of the following forms of escape sequence, the represented character is the escape sequence's escaped value:
- If the literal content begins with a \ character which did not introduce one of the above forms of escape, the pretoken is rejected.
- Otherwise, if the single character that makes up the literal content is LF, CR, or TAB, the pretoken is rejected.
- Otherwise, if the single character that makes up the literal content has a Unicode scalar value greater than 127, the pretoken is rejected.
- Otherwise the represented character is the single character that makes up the literal content.
The represented byte is the represented character's Unicode scalar value.
suffix: copied
Note: The pretokeniser guarantees the pretoken's literal content is either a single character, or a character sequence beginning with \.
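The non-escape branches of the byte literal rules can be sketched as follows. Escape handling is deliberately left out: any content beginning with `\` is reported as `Escape` and would need separate processing. `ByteOutcome` and `reprocess_byte_content` are hypothetical names used only for this illustration.

```rust
/// Hypothetical result type for this sketch.
#[derive(Debug, PartialEq)]
enum ByteOutcome {
    /// The represented byte.
    Byte(u8),
    /// The pretoken is rejected.
    Rejected,
    /// The content starts with `\`; escape processing is not shown here.
    Escape,
}

fn reprocess_byte_content(content: &str) -> ByteOutcome {
    if content.starts_with('\\') {
        return ByteOutcome::Escape;
    }
    let mut chars = content.chars();
    match (chars.next(), chars.next()) {
        // LF, CR and TAB must be written as escapes.
        (Some('\n' | '\r' | '\t'), None) => ByteOutcome::Rejected,
        // Characters above 127 do not fit the byte literal rule.
        (Some(c), None) if (c as u32) > 127 => ByteOutcome::Rejected,
        // The represented byte is the character's Unicode scalar value.
        (Some(c), None) => ByteOutcome::Byte(c as u8),
        // Empty or multi-character content: excluded by the pretokeniser
        // guarantee, treated as rejected in this sketch.
        _ => ByteOutcome::Rejected,
    }
}
```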
DoubleQuoteLiteral
The pretokeniser guarantees the pretoken's prefix attribute is one of the following:
- empty, in which case it is reprocessed as described under String literal
- the single character b, in which case it is reprocessed as described under Byte-string literal
- the single character c, in which case it is reprocessed as described under C-string literal
In each case, the pretoken is rejected if its suffix consists of the single character _.
String literal
Fine-grained token kind produced:
StringLiteral
Attributes
The represented string is derived from the pretoken's literal content by replacing each escape sequence of any of the following forms occurring in the literal content with the escape sequence's escaped value.
These replacements take place in left-to-right order.
For example, the pretoken with extent "\\x41" is converted to the characters \ x 4 1.
If a \ character appears in the literal content but is not part of one of the above forms of escape, the pretoken is rejected.
If a CR character appears in the literal content and is not part of a string continuation escape, the pretoken is rejected.
suffix: copied
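The left-to-right replacement order is what makes the `"\\x41"` example work: the first two characters are consumed as one escape, so the following `x41` is never seen as the start of another. The sketch below illustrates this with a deliberately small set of escapes (`\\`, `\n`, and `\xHH` with a value up to 0x7f). That set is an assumption made for the example, not the full list of forms this section refers to, and string continuation escapes are not modelled. The name `unescape` is hypothetical.

```rust
/// Hypothetical helper: returns the represented string, or `None` when
/// the pretoken would be rejected. Only `\\`, `\n` and `\xHH` (value at
/// most 0x7f) are recognised in this sketch.
fn unescape(content: &str) -> Option<String> {
    let mut out = String::new();
    let mut chars = content.chars();
    // Scan once, left to right; each escape consumes its characters
    // before the scan continues.
    while let Some(c) = chars.next() {
        if c != '\\' {
            // String continuation escapes are not modelled here, so any
            // CR in the content is rejected.
            if c == '\r' {
                return None;
            }
            out.push(c);
            continue;
        }
        match chars.next()? {
            '\\' => out.push('\\'),
            'n' => out.push('\n'),
            'x' => {
                let hi = chars.next()?.to_digit(16)?;
                let lo = chars.next()?.to_digit(16)?;
                let value = hi * 16 + lo;
                if value > 0x7f {
                    return None;
                }
                out.push(char::from_u32(value)?);
            }
            // A `\` that does not introduce a recognised escape.
            _ => return None,
        }
    }
    Some(out)
}
```

With this sketch, the literal content `\\x41` becomes the four characters `\x41`, whereas the content `\x41` becomes the single character `A`.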
Byte-string literal
Fine-grained token kind produced:
ByteStringLiteral
If any character whose Unicode scalar value is greater than 127 appears in the literal content, the pretoken is rejected.
Attributes
Define a represented string (a sequence of characters) derived from the pretoken's literal content by replacing each escape sequence of any of the following forms occurring in the literal content with the escape sequence's escaped value.
These replacements take place in left-to-right order.
For example, the pretoken with extent b"\\x41" is converted to the characters \ x 4 1.
If a \ character appears in the literal content but is not part of one of the above forms of escape, the pretoken is rejected.
If a CR character appears in the literal content and is not part of a string continuation escape, the pretoken is rejected.
The represented bytes are the sequence of Unicode scalar values of the characters in the represented string.
suffix: copied
C-string literal
Fine-grained token kind produced:
CStringLiteral
Attributes
The pretoken's literal content is treated as a sequence of items, each of which is either a single Unicode character other than \ or an escape.
The sequence of items is converted to the represented bytes as follows:
- Each single Unicode character contributes its UTF-8 representation.
- Each simple escape contributes a single byte containing the Unicode scalar value of its escaped value.
- Each 8-bit escape contributes a single byte containing the Unicode scalar value of its escaped value.
- Each unicode escape contributes the UTF-8 representation of its escaped value.
- Each string continuation escape contributes no bytes.
If a \ character appears in the literal content but is not part of one of the above forms of escape, the pretoken is rejected.
If a CR character appears in the literal content and is not part of a string continuation escape, the pretoken is rejected.
If any of the resulting represented bytes have value 0, the pretoken is rejected.
suffix: copied
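The conversion from items to represented bytes can be sketched as below, assuming the literal content has already been split into items. The `Item` enum and the function name are hypothetical; in particular, storing the 8-bit escape's value as a `u8` is a modelling choice for this example.

```rust
/// Hypothetical representation of one item of a C-string literal's content.
enum Item {
    /// A single character other than `\`.
    Char(char),
    /// The escaped value of a simple escape.
    SimpleEscape(char),
    /// The Unicode scalar value of an 8-bit escape's escaped value.
    EightBitEscape(u8),
    /// The escaped value of a unicode escape.
    UnicodeEscape(char),
    /// A string continuation escape.
    Continuation,
}

/// Returns the represented bytes, or `None` when the pretoken is rejected.
fn represented_bytes(items: &[Item]) -> Option<Vec<u8>> {
    let mut bytes = Vec::new();
    let mut buf = [0u8; 4];
    for item in items {
        match *item {
            // A CR outside a string continuation escape is rejected.
            Item::Char('\r') => return None,
            // Plain characters and unicode escapes contribute UTF-8.
            Item::Char(c) | Item::UnicodeEscape(c) => {
                bytes.extend_from_slice(c.encode_utf8(&mut buf).as_bytes())
            }
            // Simple and 8-bit escapes contribute exactly one byte.
            Item::SimpleEscape(c) => bytes.push(c as u8),
            Item::EightBitEscape(b) => bytes.push(b),
            Item::Continuation => {}
        }
    }
    // A zero byte anywhere in the result causes rejection.
    if bytes.contains(&0) {
        None
    } else {
        Some(bytes)
    }
}
```

The sketch highlights the asymmetry between the two ways of writing a non-ASCII value: a unicode escape for U+00E9 contributes the two bytes 0xC3 0xA9, while an 8-bit escape with value 0xE9 contributes the single byte 0xE9.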
RawDoubleQuoteLiteral
The pretokeniser guarantees the pretoken's prefix attribute is one of the following:
- the single character r, in which case it is reprocessed as described under Raw string literal
- the characters br, in which case it is reprocessed as described under Raw byte-string literal
- the characters cr, in which case it is reprocessed as described under Raw C-string literal
In each case, the pretoken is rejected if its suffix consists of the single character _.
Raw string literal
Fine-grained token kind produced:
RawStringLiteral
The pretoken is rejected if (and only if) a CR character appears in the literal content.
Attributes
represented string: the pretoken's literal content
suffix: copied
Raw byte-string literal
Fine-grained token kind produced:
RawByteStringLiteral
If any character whose Unicode scalar value is greater than 127 appears in the literal content, the pretoken is rejected.
If a CR character appears in the literal content, the pretoken is rejected.
Attributes
represented bytes: the sequence of Unicode scalar values of the characters in the pretoken's literal content
suffix: copied
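A minimal sketch of the raw byte-string case, with a hypothetical function name: because every accepted character has a scalar value of at most 127, each character maps to exactly one byte.

```rust
/// Hypothetical helper: returns the represented bytes of a raw
/// byte-string literal, or `None` when the pretoken is rejected.
fn raw_byte_string_bytes(content: &str) -> Option<Vec<u8>> {
    // Reject CR and any character with a scalar value above 127.
    if content.chars().any(|c| c == '\r' || (c as u32) > 127) {
        return None;
    }
    // Each remaining character contributes its scalar value as one byte.
    Some(content.chars().map(|c| c as u8).collect())
}
```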
Raw C-string literal
Fine-grained token kind produced:
RawCStringLiteral
If a CR character appears in the literal content, the pretoken is rejected.
Attributes
represented bytes: the UTF-8 encoding of the pretoken's literal content
suffix: copied
If any of the resulting represented bytes have value 0, the pretoken is rejected.
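In contrast to the raw byte-string case, a raw C-string accepts non-ASCII characters and encodes them as UTF-8. The following sketch uses a hypothetical function name and relies on the fact that a Rust `&str` is already stored as UTF-8.

```rust
/// Hypothetical helper: returns the represented bytes of a raw C-string
/// literal, or `None` when the pretoken is rejected.
fn raw_c_string_bytes(content: &str) -> Option<Vec<u8>> {
    if content.contains('\r') {
        return None;
    }
    // `as_bytes` exposes the UTF-8 encoding of the content.
    let bytes = content.as_bytes().to_vec();
    // In UTF-8 a zero byte can only come from the character U+0000.
    if bytes.contains(&0) {
        None
    } else {
        Some(bytes)
    }
}
```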
IntegerDecimalLiteral
Fine-grained token kind produced:
IntegerLiteral
The pretoken is rejected if (and only if) its digits attribute consists entirely of _ characters.
Attributes
base: decimal
digits: copied
suffix: copied
Note: in particular, an IntegerDecimalLiteral whose digits is empty is rejected.
IntegerHexadecimalLiteral
Fine-grained token kind produced:
IntegerLiteral
The pretoken is rejected if (and only if) its digits attribute consists entirely of _ characters.
Attributes
base: hexadecimal
digits: copied
suffix: copied
Note: in particular, an IntegerHexadecimalLiteral whose digits is empty is rejected.
IntegerOctalLiteral
Fine-grained token kind produced:
IntegerLiteral
The pretoken is rejected if (and only if):
- its digits attribute consists entirely of _ characters; or
- its digits attribute contains any character other than 0, 1, 2, 3, 4, 5, 6, 7, or _.
Attributes
base: octal
digits: copied
suffix: copied
Note: in particular, an IntegerOctalLiteral whose digits is empty is rejected.
IntegerBinaryLiteral
Fine-grained token kind produced:
IntegerLiteral
The pretoken is rejected if (and only if):
- its digits attribute consists entirely of _ characters; or
- its digits attribute contains any character other than 0, 1, or _.
Attributes
base: binary
digits: copied
suffix: copied
Note: in particular, an IntegerBinaryLiteral whose digits is empty is rejected.
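The rejection rules for the four integer pretoken kinds can be sketched together. The `Base` enum and the function name are hypothetical. The sketch assumes, as the rules above imply, that decimal and hexadecimal digits need no further validation at this stage.

```rust
/// Hypothetical enumeration of the four integer pretoken kinds.
#[derive(Clone, Copy)]
enum Base {
    Binary,
    Octal,
    Decimal,
    Hexadecimal,
}

/// Returns true when an integer pretoken with these digits is rejected.
fn integer_is_rejected(base: Base, digits: &str) -> bool {
    // `all` is true for an empty string, so empty digits are rejected too.
    if digits.chars().all(|c| c == '_') {
        return true;
    }
    match base {
        Base::Binary => digits.chars().any(|c| !matches!(c, '0' | '1' | '_')),
        Base::Octal => digits.chars().any(|c| !matches!(c, '0'..='7' | '_')),
        // No extra digit check is stated for these two bases.
        Base::Decimal | Base::Hexadecimal => false,
    }
}
```

The single "entirely `_`" test covers both the empty case called out in the notes and inputs such as `0x__`.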
FloatLiteral
Fine-grained token kind produced:
FloatLiteral
The pretoken is rejected if (and only if):
- its has base attribute is true; or
- its exponent digits attribute is a character sequence which consists entirely of _ characters.
Attributes
body: copied
suffix: copied
Note: in particular, a FloatLiteral whose exponent digits is empty is rejected.
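A minimal sketch of the FloatLiteral rejection rule follows. It assumes that the exponent digits attribute may be absent (modelled here as `None`) when the literal has no exponent; that modelling, and the function name, are assumptions made for the example.

```rust
/// Hypothetical helper: returns true when a FloatLiteral pretoken is
/// rejected. `exponent_digits` is `None` when there is no exponent.
fn float_is_rejected(has_base: bool, exponent_digits: Option<&str>) -> bool {
    // `all` is true for an empty string, so an empty exponent is rejected.
    has_base || exponent_digits.map_or(false, |d| d.chars().all(|c| c == '_'))
}
```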