Quoted literal tokens
Table of contents
- Summary
- Single-quoted literals
- (Non-raw) double-quoted literals
- Raw double-quoted literals
- Reserved forms
Each kind of quoted literal represents a character, byte, sequence of characters, or sequence of bytes.
These representations are obtained by interpreting the literal content (the consumed characters between ' ' or " "), which may contain several forms of escape (character sequences beginning with \).
The descriptions of processing the non-raw literals below are based on the single-escape interpretation and escape interpretation of the literal content, which are defined in Escape processing.
Summary
The following table summarises which forms of character and escape are accepted in each kind of quoted literal.
| Literal | Forbidden | Simple | Unicode | Hexadecimal | String continuation |
|---|---|---|---|---|---|
| '' | CR LF HT | ✓ | ✓ | ✓ (<= 127) | |
| b'' | CR LF HT > 127 | ✓ | | ✓ | |
| "" | CR | ✓ | ✓ | ✓ (<= 127) | ✓ |
| b"" | CR > 127 | ✓ | | ✓ | ✓ |
| c"" | CR | ✓ | ✓ | ✓ | ✓ |
| r"" | CR | | | | |
| br"" | CR > 127 | | | | |
| cr"" | CR | | | | |
| Eg | | \n | \u{2014} | \x1b | |
The "Forbidden" column indicates which characters may not appear directly in the literal content; "> 127" means any character whose Unicode scalar value is greater than 127.
The remaining columns indicate which forms of escape are accepted.
The "(<= 127)" annotation means that hexadecimal escapes whose first hexadecimal digit is greater than 7 aren't accepted.
In raw literals the \ character represents itself; otherwise a \ that doesn't introduce an escape is forbidden.
Single-quoted literals
The following nonterminals are common to the definitions below:
Grammar
SINGLE_QUOTED_FORM = {
"'" ~ SINGLE_QUOTED_CONTENT ~ "'" ~
SUFFIX ?
}
SINGLE_QUOTED_CONTENT = {
BACKSLASH ~ ANY ~ ( !"'" ~ ANY ) * |
!"'" ~ ANY
}
SUFFIX = { IDENT }
IDENT = { IDENT_START ~ XID_CONTINUE * }
IDENT_START = { XID_START | "_" }
Character literal
Grammar
Character_literal = { SINGLE_QUOTED_FORM }
Attributes
The token's represented character is the represented character of SINGLE_QUOTED_CONTENT's single-escape interpretation.
The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.
Rejection
The match is rejected if:
- SINGLE_QUOTED_CONTENT has no single-escape interpretation; or
- SINGLE_QUOTED_CONTENT's single-escape interpretation has no represented character; or
- SINGLE_QUOTED_CONTENT's single-escape interpretation is a non-escape whose represented character is LF, CR, or HT; or
- the token's suffix would consist of the single character _.
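As an illustration of these rules, the following sketch shows a few character literals as rustc accepts them. The constant names are hypothetical, and the rejected forms appear only in comments because they would not compile.

```rust
// Illustrative constants; the names are not part of the specification.
const SIMPLE: char = '\n'; // simple escape
const UNICODE: char = '\u{2014}'; // Unicode escape
const HEX: char = '\x7f'; // hexadecimal escape, first digit <= 7
const TAB: char = '\t'; // HT has to be written as an escape

// Rejected forms, kept as comments:
//   '\x80'   hexadecimal escape whose first digit is greater than 7
//   'a'_     suffix consisting of the single character _
//   a literal LF, CR, or HT between the quotes

fn main() {
    assert_eq!(SIMPLE as u32, 0x0A);
    assert_eq!(UNICODE as u32, 0x2014);
    assert_eq!(HEX as u32, 0x7F);
    assert_eq!(TAB as u32, 0x09);
}
```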
Byte literal
Grammar
Byte_literal = { "b" ~ SINGLE_QUOTED_FORM }
Attributes
The token's represented byte is the represented byte of SINGLE_QUOTED_CONTENT's single-escape interpretation.
The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.
Rejection
The match is rejected if:
- SINGLE_QUOTED_CONTENT has no single-escape interpretation; or
- SINGLE_QUOTED_CONTENT's single-escape interpretation is any of the following:
  - a non-escape whose represented character is LF, CR, or HT
  - a Unicode escape; or
- SINGLE_QUOTED_CONTENT's single-escape interpretation has no represented byte; or
- the token's suffix would consist of the single character _.
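For comparison with character literals, a minimal sketch of byte literals as rustc accepts them. The constant names are hypothetical; rejected forms are listed in comments only.

```rust
// Illustrative constants; the names are not part of the specification.
const HIGH: u8 = b'\xff'; // hexadecimal escapes may use the full range 00 to ff
const NEWLINE: u8 = b'\n'; // simple escape
const LETTER: u8 = b'a'; // non-escape with a scalar value <= 127

// Rejected forms, kept as comments:
//   b'\u{61}'   Unicode escapes are not accepted in byte literals
//   b'é'        non-escape with a scalar value greater than 127

fn main() {
    assert_eq!(HIGH, 255);
    assert_eq!(NEWLINE, 10);
    assert_eq!(LETTER, 97);
}
```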
(Non-raw) double-quoted literals
The following nonterminals are common to the definitions below:
Grammar
DOUBLE_QUOTED_FORM = {
DOUBLEQUOTE ~ DOUBLE_QUOTED_CONTENT ~ DOUBLEQUOTE ~
SUFFIX ?
}
DOUBLE_QUOTED_CONTENT = {
(
BACKSLASH ~ ANY |
!DOUBLEQUOTE ~ ANY
) *
}
SUFFIX = { IDENT }
IDENT = { IDENT_START ~ XID_CONTINUE * }
IDENT_START = { XID_START | "_" }
String literal
Grammar
String_literal = { DOUBLE_QUOTED_FORM }
Attributes
The token's represented string is the sequence made up of the represented character of each component of DOUBLE_QUOTED_CONTENT's escape interpretation.
The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.
Rejection
The match is rejected if:
- DOUBLE_QUOTED_CONTENT has no escape interpretation; or
- DOUBLE_QUOTED_CONTENT's escape interpretation contains any of the following:
  - a component that has no represented character
  - a non-escape whose represented character is CR; or
- the token's suffix would consist of the single character _.
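The represented string can be observed directly in Rust. The sketch below uses hypothetical constant names and also shows a string continuation escape (a backslash immediately before a line ending), which the summary table lists as accepted in string literals.

```rust
// Illustrative constants; the names are not part of the specification.
const DASH_LINE: &str = "\u{2014}\n";
// String continuation: the backslash, the line ending, and the
// whitespace that follows contribute nothing to the represented string.
const CONTINUED: &str = "ab\
    cd";

fn main() {
    // One represented character per component of the escape interpretation.
    let chars: Vec<char> = DASH_LINE.chars().collect();
    assert_eq!(chars, vec!['\u{2014}', '\n']);
    assert_eq!(CONTINUED, "abcd");
}
```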
Byte-string literal
Grammar
Byte_string_literal = { "b" ~ DOUBLE_QUOTED_FORM }
Attributes
The token's represented bytes are the represented byte of each component of DOUBLE_QUOTED_CONTENT's escape interpretation.
The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.
Rejection
The match is rejected if:
- DOUBLE_QUOTED_CONTENT has no escape interpretation; or
- DOUBLE_QUOTED_CONTENT's escape interpretation contains any of the following:
  - a non-escape whose represented character is CR
  - a Unicode escape
  - a component that has no represented byte; or
- the token's suffix would consist of the single character _.
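A short sketch of the represented bytes of a byte-string literal, as rustc produces them. The constant name is hypothetical.

```rust
// Illustrative constant; the name is not part of the specification.
// Components: non-escape `a`, hexadecimal escape, simple escape.
const BYTES: &[u8] = b"a\xff\n";

// Rejected forms, kept as comments:
//   b"\u{61}"   Unicode escapes are not accepted in byte-string literals
//   b"é"        non-escape with a scalar value greater than 127

fn main() {
    assert_eq!(BYTES.to_vec(), vec![0x61u8, 0xFF, 0x0A]);
}
```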
C-string literal
Grammar
C_string_literal = { "c" ~ DOUBLE_QUOTED_FORM }
Attributes
The token's represented bytes are derived from DOUBLE_QUOTED_CONTENT's escape interpretation in the following way:
- Each non-escape, simple escape, or Unicode escape contributes the UTF-8 encoding of its represented character.
- Each hexadecimal escape contributes its represented byte.
The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.
Rejection
The match is rejected if:
- DOUBLE_QUOTED_CONTENT has no escape interpretation; or
- DOUBLE_QUOTED_CONTENT's escape interpretation contains any of the following:
  - a Unicode escape that has no represented character
  - a non-escape whose represented character is CR; or
- any of the token's represented bytes would be 0; or
- the token's suffix would consist of the single character _.
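The derivation of a C-string literal's represented bytes can be sketched as follows. This is a simplified model, not the specification's implementation: the `Component` type and the `c_string_bytes` function are hypothetical, and the sketch assumes the escape interpretation has already been computed and that every component has a represented character or byte.

```rust
/// Hypothetical model of one component of an escape interpretation.
#[derive(Debug, Clone, Copy)]
enum Component {
    NonEscape(char),
    SimpleEscape(char),
    UnicodeEscape(char),
    HexEscape(u8),
}

/// Returns the represented bytes, or None if the match would be rejected
/// (a non-escape CR, or any represented byte equal to 0).
fn c_string_bytes(components: &[Component]) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    for component in components {
        match *component {
            Component::NonEscape('\r') => return None,
            // These three contribute the UTF-8 encoding of their character.
            Component::NonEscape(c)
            | Component::SimpleEscape(c)
            | Component::UnicodeEscape(c) => {
                let mut buf = [0u8; 4];
                out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
            }
            // A hexadecimal escape contributes its byte unchanged.
            Component::HexEscape(b) => out.push(b),
        }
    }
    if out.contains(&0) {
        None
    } else {
        Some(out)
    }
}

fn main() {
    // Models the content of c"a\u{2014}\xff".
    let components = [
        Component::NonEscape('a'),
        Component::UnicodeEscape('\u{2014}'),
        Component::HexEscape(0xFF),
    ];
    assert_eq!(
        c_string_bytes(&components),
        Some(vec![0x61, 0xE2, 0x80, 0x94, 0xFF])
    );
    assert_eq!(c_string_bytes(&[Component::SimpleEscape('\0')]), None);
}
```

Note that a hexadecimal escape above 7f yields a single byte, so the represented bytes need not be valid UTF-8.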
Raw double-quoted literals
The following nonterminals are common to the definitions below:
Grammar
RAW_DOUBLE_QUOTED_FORM = {
HASHES¹ ~
DOUBLEQUOTE ~ RAW_DOUBLE_QUOTED_CONTENT ~ DOUBLEQUOTE ~
HASHES² ~
SUFFIX ?
}
RAW_DOUBLE_QUOTED_CONTENT = {
( !(DOUBLEQUOTE ~ HASHES²) ~ ANY ) *
}
HASHES = { "#" {0, 255} }
SUFFIX = { IDENT }
IDENT = { IDENT_START ~ XID_CONTINUE * }
IDENT_START = { XID_START | "_" }
These definitions require an extension to the Parsing Expression Grammar formalism:
an attempt to match one of the parsing expressions marked as HASHES² fails
unless the characters it consumes are the same as the characters consumed by the (only) match of the expression marked as HASHES¹ under the same match attempt of a token-kind nonterminal.
See Grammar for raw string literals for a discussion of alternatives to this extension.
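One way to picture the extension is as a hand-written matcher that remembers how many # characters were consumed before the opening quote. The sketch below is an assumption-laden illustration, not the specification's method: `match_raw_form` is a hypothetical helper, it ignores SUFFIX, and it reads HASHES² as "match exactly the text that HASHES¹ consumed".

```rust
/// Hypothetical helper: tries to match RAW_DOUBLE_QUOTED_FORM (without SUFFIX)
/// at the start of `input`. Returns (hash count, content, remaining input).
fn match_raw_form(input: &str) -> Option<(usize, &str, &str)> {
    // HASHES¹: a greedy run of '#'. More than 255 means the opening
    // DOUBLEQUOTE cannot follow the (at most 255) hashes consumed.
    let hashes = input.bytes().take_while(|&b| b == b'#').count();
    if hashes > 255 {
        return None;
    }
    let body = input[hashes..].strip_prefix('"')?;
    // Closing delimiter: DOUBLEQUOTE followed by the same hashes as HASHES¹.
    let closer = format!("\"{}", "#".repeat(hashes));
    // The content is everything before the first occurrence of the closer.
    let end = body.find(closer.as_str())?;
    Some((hashes, &body[..end], &body[end + closer.len()..]))
}

fn main() {
    // Models r##"a"#b"##; with the leading `r` already consumed.
    let result = match_raw_form("##\"a\"#b\"##;");
    assert_eq!(result, Some((2, "a\"#b", ";")));
    // One opening hash but no closing `"#`: no match.
    assert_eq!(match_raw_form("#\"abc\""), None);
}
```

In the first call, the `"#` inside the content does not end the literal because two hashes are required after the closing quote.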
Raw string literal
Grammar
Raw_string_literal = { "r" ~ RAW_DOUBLE_QUOTED_FORM }
Attributes
The token's represented string is RAW_DOUBLE_QUOTED_CONTENT.
The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.
Rejection
The match is rejected if:
- a CR character appears in RAW_DOUBLE_QUOTED_CONTENT; or
- the token's suffix would consist of the single character _.
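A few raw string literals as rustc accepts them, showing that the represented string is the content taken verbatim. The constant names are hypothetical.

```rust
// Illustrative constants; the names are not part of the specification.
const BACKSLASH_N: &str = r"\n"; // a backslash and an `n`, not a line feed
const WITH_QUOTE: &str = r#"say "hi""#; // quotes allowed with one hash
const TWO_HASHES: &str = r##"a"#b"##; // `"#` allowed with two hashes

fn main() {
    assert_eq!(BACKSLASH_N.len(), 2);
    assert_eq!(WITH_QUOTE, "say \"hi\"");
    assert_eq!(TWO_HASHES, "a\"#b");
}
```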
Raw byte-string literal
Grammar
Raw_byte_string_literal = { "br" ~ RAW_DOUBLE_QUOTED_FORM }
Attributes
The token's represented bytes are the Unicode scalar values of the characters in RAW_DOUBLE_QUOTED_CONTENT. (This is well defined because of the first rejection case below.)
The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.
Rejection
The match is rejected if:
- any character whose Unicode scalar value is greater than 127 appears in RAW_DOUBLE_QUOTED_CONTENT; or
- a CR character appears in RAW_DOUBLE_QUOTED_CONTENT; or
- the token's suffix would consist of the single character _.
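The same verbatim treatment applies to raw byte-string literals, where each character of the content contributes one byte. A small sketch with hypothetical constant names:

```rust
// Illustrative constants; the names are not part of the specification.
// No escape processing: this is the four bytes `\`, `x`, `4`, `1`.
const RAW_BYTES: &[u8] = br"\x41";
const QUOTED: &[u8] = br#"a"b"#;

fn main() {
    assert_eq!(RAW_BYTES.to_vec(), vec![0x5Cu8, 0x78, 0x34, 0x31]);
    assert_eq!(QUOTED.to_vec(), vec![0x61u8, 0x22, 0x62]);
}
```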
Raw C-string literal
Grammar
Raw_c_string_literal = { "cr" ~ RAW_DOUBLE_QUOTED_FORM }
Attributes
The token's represented bytes are the UTF-8 encoding of RAW_DOUBLE_QUOTED_CONTENT.
The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.
Rejection
The match is rejected if:
- a CR character appears in RAW_DOUBLE_QUOTED_CONTENT; or
- any of the token's represented bytes would be 0; or
- the token's suffix would consist of the single character _.
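The attribute and the first two rejection cases can be modelled in a few lines. This is only a sketch: `raw_c_string_bytes` is a hypothetical function that takes the already-matched content as a `&str` and ignores the suffix rule.

```rust
/// Hypothetical model: returns the represented bytes of a raw C-string
/// literal with the given content, or None if the match would be rejected.
fn raw_c_string_bytes(content: &str) -> Option<Vec<u8>> {
    // The represented bytes are the UTF-8 encoding of the content.
    let bytes = content.as_bytes();
    // Rejected: a CR in the content, or any represented byte equal to 0.
    if content.contains('\r') || bytes.contains(&0) {
        return None;
    }
    Some(bytes.to_vec())
}

fn main() {
    // Models cr"a\n": the backslash represents itself.
    assert_eq!(raw_c_string_bytes("a\\n"), Some(vec![0x61, 0x5C, 0x6E]));
    assert_eq!(raw_c_string_bytes("a\0"), None);
}
```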
Reserved forms
Reserved or unterminated literal
Grammar
Unterminated_literal_2015 = { "r" ~ DOUBLEQUOTE | "br" ~ DOUBLEQUOTE | "b'" }
Reserved_literal_2021 = { IDENT ~ ( DOUBLEQUOTE | "'" ) }
Rejection
All matches are rejected.
Note: I believe in the Unterminated_literal_2015 definition only the b' form is strictly needed: if that definition matches using one of the other subexpressions then the input will be rejected eventually anyway (given that the corresponding string literal nonterminal didn't match).
Note: Reserved_literal_2021 catches both reserved forms and unterminated b' literals.
Reserved single-quoted literal
Grammar
Reserved_single_quoted_literal_2015 = { "'" ~ IDENT ~ "'" }
Reserved_single_quoted_literal_2021 = { "'" ~ "r#" ? ~ IDENT ~ "'" }
Rejection
All matches are rejected.
Note: This reservation is to catch forms like 'aaa'bbb, so this definition must come before Lifetime_or_label.
Reserved guard (Rust 2024)
Grammar
Reserved_guard = { "##" | "#" ~ DOUBLEQUOTE }
Rejection
All matches are rejected.
Note: This definition is listed here near the double-quoted string literals because these forms were reserved during discussions about introducing string literals formed like #"…"#.