String and byte literal tokens
Single-quoted literals
The following nonterminals are common to the definitions below:
Grammar
SQ_REMAINDER = {
    "'" ~ SQ_CONTENT ~ "'" ~
    SUFFIX ?
}
SQ_CONTENT = {
    "\\" ~ ANY ~ ( !"'" ~ ANY ) * |
    !"'" ~ ANY
}
SUFFIX = { IDENT }
IDENT = { IDENT_START ~ XID_CONTINUE * }
IDENT_START = { XID_START | "_" }
Character literal
Grammar
Character_literal = { SQ_REMAINDER }
Definitions
Define a represented character, derived from SQ_CONTENT as follows:
- If SQ_CONTENT is the single character LF, CR, or TAB, the match is rejected.
- If SQ_CONTENT is any other single character, the represented character is that character.
- If SQ_CONTENT is one of the following forms of escape sequence, the represented character is the escape sequence's escaped value:
- Otherwise the match is rejected.
Attributes
The token's represented character is the represented character.
The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.
Rejection
The match is rejected if:
- SQ_CONTENT is unacceptable, as described in the definition of the represented character above; or
- the token's suffix would consist of the single character _.
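The single-character cases and the suffix rule above can be sketched in Rust. This is a minimal illustration, not the author's implementation: the helper names `represented_char_single` and `suffix_is_acceptable` are hypothetical, the input is assumed to have already matched SQ_CONTENT, and escape sequences are deliberately left out.

```rust
/// Hypothetical helper: derives the represented character when SQ_CONTENT
/// is a single character. Escape sequences are not handled in this sketch.
fn represented_char_single(content: char) -> Option<char> {
    match content {
        // LF, CR, and TAB cause the match to be rejected.
        '\n' | '\r' | '\t' => None,
        // Any other single character represents itself.
        other => Some(other),
    }
}

/// Hypothetical helper: a suffix consisting of the single character `_`
/// causes rejection; an empty suffix means SUFFIX did not participate.
fn suffix_is_acceptable(suffix: &str) -> bool {
    suffix != "_"
}

fn main() {
    assert_eq!(represented_char_single('a'), Some('a'));
    assert_eq!(represented_char_single('\t'), None);
    assert!(suffix_is_acceptable(""));
    assert!(suffix_is_acceptable("_x"));
    assert!(!suffix_is_acceptable("_"));
    println!("ok");
}
```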
Byte literal
Grammar
Byte_literal = { "b" ~ SQ_REMAINDER }
Definitions
Define a represented character, derived from SQ_CONTENT as follows:
- If SQ_CONTENT is the single character LF, CR, or TAB, the match is rejected.
- If SQ_CONTENT is a single character with Unicode scalar value greater than 127, the match is rejected.
- If SQ_CONTENT is any other single character, the represented character is that character.
- If SQ_CONTENT is one of the following forms of escape sequence, the represented character is the escape sequence's escaped value:
- Otherwise the match is rejected.
Attributes
The token's represented byte is the represented character's Unicode scalar value. (This is well defined because the definition above ensures that value is less than 256.)
The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.
Rejection
The match is rejected if:
- SQ_CONTENT is unacceptable, as described in the definition of the represented character above; or
- the token's suffix would consist of the single character _.
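As a rough sketch of the single-character cases for byte literals, the function below maps one character of SQ_CONTENT to its represented byte. The name `represented_byte_single` is hypothetical, escape sequences are not covered, and the input is assumed to have already matched SQ_CONTENT.

```rust
/// Hypothetical helper: derives the represented byte when SQ_CONTENT is a
/// single character. Escape sequences are not handled in this sketch.
fn represented_byte_single(content: char) -> Option<u8> {
    match content {
        // LF, CR, and TAB cause the match to be rejected.
        '\n' | '\r' | '\t' => None,
        // Characters above 127 are rejected.
        c if (c as u32) > 127 => None,
        // Otherwise the byte is the character's Unicode scalar value.
        c => Some(c as u8),
    }
}

fn main() {
    assert_eq!(represented_byte_single('a'), Some(97));
    assert_eq!(represented_byte_single('\u{e9}'), None);
    // The compiler's own byte literal gives the same value for this case.
    assert_eq!(b'a', 97u8);
    println!("ok");
}
```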
(Non-raw) double-quoted literals
The following nonterminals are common to the definitions below:
Grammar
DQ_REMAINDER = {
    "\"" ~ DQ_CONTENT ~ "\"" ~
    SUFFIX ?
}
DQ_CONTENT = {
    (
        "\\" ~ ANY |
        !"\"" ~ ANY
    ) *
}
SUFFIX = { IDENT }
IDENT = { IDENT_START ~ XID_CONTINUE * }
IDENT_START = { XID_START | "_" }
String literal
Grammar
String_literal = { DQ_REMAINDER }
Attributes
The token's represented string is derived from DQ_CONTENT by replacing each escape sequence of any of the following forms with the escape sequence's escaped value:
These replacements take place in left-to-right order.
For example, a match against the characters "\\x41" is converted to the characters \ x 4 1.
The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.
Rejection
The match is rejected if:
- a \ character appears in DQ_CONTENT and is not part of one of the above forms of escape; or
- a CR character appears in DQ_CONTENT and is not part of a string continuation escape; or
- the token's suffix would consist of the single character _.
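The left-to-right example can be checked against the compiler's own handling of string literals. The snippet assumes that rustc agrees with the description here for this input; `chars_of` is a hypothetical helper added only for the demonstration.

```rust
/// Hypothetical helper: lists the characters of a string value.
fn chars_of(s: &str) -> Vec<char> {
    s.chars().collect()
}

fn main() {
    // `\\` is replaced first, so the following `x41` stays as plain characters.
    assert_eq!(chars_of("\\x41"), vec!['\\', 'x', '4', '1']);
    // Without the escaped backslash in front, `\x41` is itself an escape.
    assert_eq!(chars_of("\x41"), vec!['A']);
    println!("ok");
}
```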
Byte-string literal
Grammar
Byte_string_literal = { "b" ~ DQ_REMAINDER }
Definitions
Define a represented string (a sequence of characters) derived from DQ_CONTENT by replacing each escape sequence of any of the following forms with the escape sequence's escaped value:
These replacements take place in left-to-right order.
For example, a match against the characters b"\\x41" is converted to the characters \ x 4 1.
Attributes
The token's represented bytes are the sequence of Unicode scalar values of the characters in the represented string. (This is well defined because of the first rejection case below.)
The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.
Rejection
The match is rejected if:
- any character whose Unicode scalar value is greater than 127 appears in DQ_CONTENT; or
- a \ character appears in DQ_CONTENT and is not part of one of the above forms of escape; or
- a CR character appears in DQ_CONTENT and is not part of a string continuation escape; or
- the token's suffix would consist of the single character _.
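The step from represented string to represented bytes can be sketched as follows. The function name `represented_bytes` is hypothetical. It returns `None` when a character's scalar value does not fit in a byte, which the first rejection case is meant to make impossible for accepted tokens.

```rust
/// Hypothetical helper: maps each character of a represented string to the
/// byte holding its Unicode scalar value, or None if any value exceeds 255.
fn represented_bytes(represented: &str) -> Option<Vec<u8>> {
    represented
        .chars()
        .map(|c| if (c as u32) <= 255 { Some(c as u8) } else { None })
        .collect()
}

fn main() {
    // The represented string for b"\\x41" is the four characters \ x 4 1.
    assert_eq!(represented_bytes("\\x41"), Some(vec![0x5c, 0x78, 0x34, 0x31]));
    // The compiler's byte-string literal yields the same bytes.
    assert_eq!(b"\\x41".to_vec(), vec![0x5c, 0x78, 0x34, 0x31]);
    println!("ok");
}
```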
C-string literal
Grammar
C_string_literal = { "c" ~ DQ_REMAINDER }
Attributes
DQ_CONTENT is treated as a sequence of items, each of which is either a single Unicode character other than \ or an escape.
The token's represented bytes are derived from that sequence of items in the following way:
- Each single Unicode character contributes its UTF-8 representation.
- Each simple escape or 8-bit escape contributes a single byte containing the Unicode scalar value of its escaped value.
- Each unicode escape contributes the UTF-8 representation of its escaped value.
- Each string continuation escape contributes no bytes.
The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.
Rejection
The match is rejected if:
- a \ character appears in DQ_CONTENT and is not part of one of the above forms of escape; or
- a CR character appears in DQ_CONTENT and is not part of a string continuation escape; or
- any of the token's represented bytes would have value 0; or
- the token's suffix would consist of the single character _.
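The derivation of a C-string literal's represented bytes can be sketched as below, assuming DQ_CONTENT has already been split into items. The `Item` type and the function `c_string_bytes` are hypothetical names used only for illustration. The sketch also assumes that the escaped value of a simple or 8-bit escape fits in one byte.

```rust
/// Hypothetical item type for an already-parsed DQ_CONTENT.
enum Item {
    /// A single Unicode character other than `\`.
    Char(char),
    /// A simple escape or 8-bit escape, carrying its escaped value's scalar value.
    ByteEscape(u8),
    /// A unicode escape, carrying its escaped value.
    UnicodeEscape(char),
    /// A string continuation escape.
    Continuation,
}

/// Hypothetical helper: computes the represented bytes, or None if any
/// byte would have value 0 (one of the rejection cases).
fn c_string_bytes(items: &[Item]) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    for item in items {
        match item {
            // Characters and unicode escapes contribute their UTF-8 form.
            Item::Char(c) | Item::UnicodeEscape(c) => {
                let mut buf = [0u8; 4];
                out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
            }
            // Simple and 8-bit escapes contribute exactly one byte.
            Item::ByteEscape(b) => out.push(*b),
            // String continuation escapes contribute nothing.
            Item::Continuation => {}
        }
    }
    if out.contains(&0) { None } else { Some(out) }
}

fn main() {
    // U+00E9 as a character gives two UTF-8 bytes; as an 8-bit escape, one byte.
    assert_eq!(c_string_bytes(&[Item::Char('\u{e9}')]), Some(vec![0xc3, 0xa9]));
    assert_eq!(c_string_bytes(&[Item::ByteEscape(0xe9)]), Some(vec![0xe9]));
    println!("ok");
}
```

The difference between the two assertions in `main` is the reason the items have to be distinguished: the same scalar value contributes different bytes depending on how it was written.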
Raw double-quoted literals
The following nonterminals are common to the definitions below:
Grammar
RAW_DQ_REMAINDER = {
    HASHES¹ ~
    "\"" ~ RAW_DQ_CONTENT ~ "\"" ~
    HASHES² ~
    SUFFIX ?
}
RAW_DQ_CONTENT = {
    ( !("\"" ~ HASHES²) ~ ANY ) *
}
HASHES = { "#" {0, 255} }
SUFFIX = { IDENT }
IDENT = { IDENT_START ~ XID_CONTINUE * }
IDENT_START = { XID_START | "_" }
These definitions require an extension to the Parsing Expression Grammar formalism:
each of the expressions marked as HASHES² fails unless the text it matches is the same as the text matched by the (only) successful match using the expression marked as HASHES¹ in the same attempt to match the current token-kind nonterminal.
See Grammar for raw string literals for a discussion of alternatives to this extension.
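One way to picture the extension is a hand-written matcher that remembers how many `#` characters opened the literal and looks for a `"` followed by exactly that many. The sketch below is an assumption-laden illustration, not the formalism itself: `match_raw_remainder` is a hypothetical name, SUFFIX is ignored, and HASHES² is read as matching exactly the text that HASHES¹ matched.

```rust
/// Hypothetical helper: given the text after the `r`, `br`, or `cr` prefix,
/// returns (RAW_DQ_CONTENT, remaining input), or None if there is no match.
fn match_raw_remainder(input: &str) -> Option<(&str, &str)> {
    // HASHES¹: count the leading '#' characters; at most 255 are permitted.
    let hashes = input.bytes().take_while(|&b| b == b'#').count();
    if hashes > 255 {
        return None;
    }
    // The opening double quote must follow immediately.
    let body = input[hashes..].strip_prefix('"')?;
    // The content ends at the first '"' followed by the same run of hashes.
    let closer = format!("\"{}", "#".repeat(hashes));
    let end = body.find(closer.as_str())?;
    Some((&body[..end], &body[end + closer.len()..]))
}

fn main() {
    // A bare '"' does not close a literal opened with one '#'.
    assert_eq!(match_raw_remainder("#\"a\"b\"#rest"), Some(("a\"b", "rest")));
    // Too few closing hashes: no match.
    assert_eq!(match_raw_remainder("##\"a\"#"), None);
    println!("ok");
}
```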
Raw string literal
Grammar
Raw_string_literal = { "r" ~ RAW_DQ_REMAINDER }
Attributes
The token's represented string is RAW_DQ_CONTENT.
The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.
Rejection
The match is rejected if:
- a CR character appears in RAW_DQ_CONTENT; or
- the token's suffix would consist of the single character _.
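Because the represented string is RAW_DQ_CONTENT itself, no escape processing takes place. The constants below (hypothetical names) show this using the compiler's own raw string literals, on the assumption that rustc agrees with the description here for these inputs.

```rust
/// Looks like an escape, but the content is the four characters \ x 4 1.
const RAW_ESCAPE_LOOKALIKE: &str = r"\x41";
/// One pair of hashes lets the content include a double quote.
const RAW_WITH_QUOTE: &str = r#"a"b"#;

fn main() {
    assert_eq!(RAW_ESCAPE_LOOKALIKE, "\\x41");
    assert_eq!(RAW_ESCAPE_LOOKALIKE.chars().count(), 4);
    assert_eq!(RAW_WITH_QUOTE, "a\"b");
    println!("ok");
}
```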
Raw byte-string literal
Grammar
Raw_byte_string_literal = { "br" ~ RAW_DQ_REMAINDER }
Attributes
The token's represented bytes are the Unicode scalar values of the characters in RAW_DQ_CONTENT. (This is well defined because of the first rejection case below.)
The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.
Rejection
The match is rejected if:
- any character whose Unicode scalar value is greater than 127 appears in RAW_DQ_CONTENT; or
- a CR character appears in RAW_DQ_CONTENT; or
- the token's suffix would consist of the single character _.
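A minimal sketch of the attribute and the content-related rejection cases for raw byte-string literals follows. The name `raw_byte_string_bytes` is hypothetical and the suffix rule is not covered.

```rust
/// Hypothetical helper: computes the represented bytes of RAW_DQ_CONTENT,
/// or None if the content contains CR or a character above 127.
fn raw_byte_string_bytes(content: &str) -> Option<Vec<u8>> {
    if content.chars().any(|c| (c as u32) > 127 || c == '\r') {
        return None;
    }
    // Every remaining character has a scalar value that fits in one byte.
    Some(content.chars().map(|c| c as u8).collect())
}

fn main() {
    // A backslash followed by `n` stays as two bytes.
    assert_eq!(raw_byte_string_bytes(r"\n"), Some(vec![0x5c, 0x6e]));
    // The compiler's raw byte-string literal yields the same bytes.
    assert_eq!(br"\n".to_vec(), vec![0x5c, 0x6e]);
    println!("ok");
}
```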
Raw C-string literal
Grammar
Raw_c_string_literal = { "cr" ~ RAW_DQ_REMAINDER }
Attributes
The token's represented bytes are the UTF-8 encoding of RAW_DQ_CONTENT.
The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.
Rejection
The match is rejected if:
- a CR character appears in RAW_DQ_CONTENT; or
- any of the token's represented bytes would have value 0; or
- the token's suffix would consist of the single character _.
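The same kind of sketch for raw C-string literals is shown below. The name `raw_c_string_bytes` is hypothetical and the suffix rule is not covered. It relies on the fact that a byte of value 0 occurs in UTF-8 only as the encoding of U+0000.

```rust
/// Hypothetical helper: computes the represented bytes of RAW_DQ_CONTENT,
/// or None if the content contains CR or would produce a zero byte.
fn raw_c_string_bytes(content: &str) -> Option<Vec<u8>> {
    if content.contains('\r') || content.contains('\0') {
        return None;
    }
    // A Rust &str is already UTF-8, so its bytes are the UTF-8 encoding.
    Some(content.as_bytes().to_vec())
}

fn main() {
    assert_eq!(raw_c_string_bytes("\u{e9}"), Some(vec![0xc3, 0xa9]));
    assert_eq!(raw_c_string_bytes("a\0b"), None);
    println!("ok");
}
```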
Reserved forms
Reserved or unterminated literal
Grammar
Unterminated_literal_2015 = { "r\"" | "br\"" | "b'" }
Reserved_literal_2021 = { IDENT ~ ( "\"" | "'" ) }
Rejection
All matches are rejected.
Note: I believe in the Unterminated_literal_2015 definition only the b' form is strictly needed: if that definition matches using one of the other subexpressions then the input will be rejected eventually anyway (given that the corresponding string literal nonterminal didn't match).
Note: Reserved_literal_2021 catches both reserved forms and unterminated b' literals.
Reserved single-quoted literal
Grammar
Reserved_single_quoted_literal_2015 = { "'" ~ IDENT ~ "'" }
Reserved_single_quoted_literal_2021 = { "'" ~ "r#" ? ~ IDENT ~ "'" }
Rejection
All matches are rejected.
Note: This reservation is to catch forms like 'aaa'bbb, so this definition must come before Lifetime_or_label.
Reserved guard (Rust 2024)
Grammar
Reserved_guard = { "##" | "#\"" }
Rejection
All matches are rejected.
Note: This definition is listed here near the double-quoted string literals because these forms were reserved during discussions about introducing string literals formed like #"…"#.