Quoted literal tokens

Summary
Single-quoted literals
- Character literal
- Byte literal
(Non-raw) double-quoted literals
Raw double-quoted literals
Reserved forms

Each kind of quoted literal represents a character, byte, sequence of characters, or sequence of bytes.

These representations are obtained by interpreting the literal content (the characters consumed between the opening and closing ' or " delimiters), which may contain several forms of escape (character sequences beginning with \).

The processing of non-raw literals described below is based on the single-escape interpretation and the escape interpretation of the literal content, both of which are defined in Escape processing.

Summary

The following table summarises which forms of character and escape are accepted in each kind of quoted literal.

Literal	Forbidden	Simple	Unicode	Hexadecimal	String continuation
`''`	`CR` `LF` `HT`	✓	✓	✓ (<= 127)
`b''`	`CR` `LF` `HT` > 127	✓		✓
`""`	`CR`	✓	✓	✓ (<= 127)	✓
`b""`	`CR` > 127	✓		✓	✓
`c""`	`CR`	✓	✓	✓	✓
`r""`	`CR`
`br""`	`CR` > 127
`cr""`	`CR`
Eg		`\n`	`\u{2014}`	`\x1b`

The "Forbidden" column indicates which characters may not appear directly in the literal content; "> 127" means any character whose Unicode scalar value is greater than 127.

The remaining columns indicate which forms of escape are accepted.

The "(<= 127)" annotation means that hexadecimal escapes whose first hexadecimal digit is greater than 7 aren't accepted.

In raw literals the \ character represents itself; otherwise a \ that doesn't introduce an escape is forbidden.
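
To make the table concrete, here is a minimal Rust sketch that writes the example escapes from its last row in literal kinds that accept them. The function names are hypothetical and exist only for illustration.

```rust
/// Simple, Unicode, and hexadecimal escapes in character literals.
/// The hexadecimal escape must stay at or below 0x7f here.
pub fn char_escapes() -> [char; 3] {
    ['\n', '\u{2014}', '\x1b']
}

/// Byte literals accept simple and hexadecimal escapes (but not Unicode
/// escapes), and the hexadecimal escape may go above 0x7f.
pub fn byte_escapes() -> [u8; 3] {
    [b'\n', b'\x1b', b'\xff']
}

/// In a raw literal the backslash represents itself, so this string
/// holds the two characters `\` and `n`.
pub fn raw_backslash() -> &'static str {
    r"\n"
}
```

Writing '\xff' as a character literal, or b'\u{2014}' as a byte literal, would be rejected under the table's rules.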

Single-quoted literals

The following nonterminals are common to the definitions below:

Grammar

SINGLE_QUOTED_FORM = {
    "'" ~ SINGLE_QUOTED_CONTENT ~ "'" ~
    SUFFIX ?
}
SINGLE_QUOTED_CONTENT = {
    BACKSLASH ~ ANY ~ ( !"'" ~ ANY ) * |
    !"'" ~ ANY
}

SUFFIX = { IDENT }
IDENT = { IDENT_START ~ XID_CONTINUE * }
IDENT_START = { XID_START | "_" }

Character literal

Grammar

Character_literal = { SINGLE_QUOTED_FORM }

Attributes

The token's represented character is the represented character of SINGLE_QUOTED_CONTENT's single-escape interpretation.

The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.

Rejection

The match is rejected if:

SINGLE_QUOTED_CONTENT has no single-escape interpretation; or
SINGLE_QUOTED_CONTENT's single-escape interpretation has no represented character; or
SINGLE_QUOTED_CONTENT's single-escape interpretation is a non-escape whose represented character is LF, CR, or HT; or
the token's suffix would consist of the single character _.
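
As a small illustration of the rejection rules above, the following sketch writes the characters that cannot appear directly in a character literal by using simple escapes instead. The function name is hypothetical.

```rust
/// LF, CR, and HT may not appear directly between the quotes of a
/// character literal, so they are written as simple escapes here.
/// The grammar also cannot match a bare `'` as content, so the quote
/// character is escaped as well.
pub fn escaped_chars() -> [char; 4] {
    ['\n', '\r', '\t', '\'']
}
```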

Byte literal

Grammar

Byte_literal = { "b" ~ SINGLE_QUOTED_FORM }

Attributes

The token's represented byte is the represented byte of SINGLE_QUOTED_CONTENT's single-escape interpretation.

The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.

Rejection

The match is rejected if:

SINGLE_QUOTED_CONTENT has no single-escape interpretation; or
SINGLE_QUOTED_CONTENT's single-escape interpretation is any of the following:
- a non-escape whose represented character is LF, CR, or HT
- a Unicode escape; or
SINGLE_QUOTED_CONTENT's single-escape interpretation has no represented byte; or
the token's suffix would consist of the single character _.

(Non-raw) double-quoted literals

The following nonterminals are common to the definitions below:

Grammar

DOUBLE_QUOTED_FORM = {
    DOUBLEQUOTE ~ DOUBLE_QUOTED_CONTENT ~ DOUBLEQUOTE ~
    SUFFIX ?
}
DOUBLE_QUOTED_CONTENT = {
    (
        BACKSLASH ~ ANY |
        !DOUBLEQUOTE ~ ANY
    ) *
}

SUFFIX = { IDENT }
IDENT = { IDENT_START ~ XID_CONTINUE * }
IDENT_START = { XID_START | "_" }

String literal

Grammar

String_literal = { DOUBLE_QUOTED_FORM }

Attributes

The token's represented string is the sequence made up of the represented character of each component of DOUBLE_QUOTED_CONTENT's escape interpretation.

The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.

Rejection

The match is rejected if:

DOUBLE_QUOTED_CONTENT has no escape interpretation; or
DOUBLE_QUOTED_CONTENT's escape interpretation contains any of the following:
- a component that has no represented character
- a non-escape whose represented character is CR; or
the token's suffix would consist of the single character _.
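
The sketch below illustrates two points from this subsection: a string continuation escape contributes nothing to the represented string, and a string literal followed directly by an identifier is a single token carrying a suffix. The names `continued`, `count_tts`, and `suffixed_literal_token_count` are hypothetical helpers, not part of the writeup.

```rust
/// A string continuation escape: the backslash at the end of the first
/// line, the line feed, and the leading whitespace on the next line
/// contribute nothing to the represented string.
pub fn continued() -> &'static str {
    "one \
     two"
}

/// Counts the token trees passed to it.
macro_rules! count_tts {
    () => { 0usize };
    ($head:tt $($tail:tt)*) => { 1usize + count_tts!($($tail)*) };
}

/// The literal and its suffix form one token. A suffixed string literal
/// is only usable as macro input like this; using it as an expression
/// is an error reported after tokenisation.
pub fn suffixed_literal_token_count() -> usize {
    count_tts!("text"suffix)
}
```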

Byte-string literal

Grammar

Byte_string_literal = { "b" ~ DOUBLE_QUOTED_FORM }

Attributes

The token's represented bytes are the represented byte of each component of DOUBLE_QUOTED_CONTENT's escape interpretation.

The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.

Rejection

The match is rejected if:

DOUBLE_QUOTED_CONTENT has no escape interpretation; or
DOUBLE_QUOTED_CONTENT's escape interpretation contains any of the following:
- a non-escape whose represented character is CR
- a Unicode escape
- a component that has no represented byte; or
the token's suffix would consist of the single character _.
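
For comparison with string literals, here is a brief sketch of a byte-string literal in which each component contributes exactly one byte. The function name is hypothetical.

```rust
/// A non-escape, a simple escape, and two hexadecimal escapes.
/// Unlike in string literals, hexadecimal escapes may use the full
/// range 0x00 to 0xff.
pub fn byte_string() -> &'static [u8] {
    b"A\n\x00\xff"
}
```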

C-string literal

Grammar

C_string_literal = { "c" ~ DOUBLE_QUOTED_FORM }

Attributes

The token's represented bytes are derived from DOUBLE_QUOTED_CONTENT's escape interpretation in the following way:

Each non-escape, simple escape, or Unicode escape contributes the UTF-8 encoding of its represented character.
Each hexadecimal escape contributes its represented byte.

The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.

Rejection

The match is rejected if:

DOUBLE_QUOTED_CONTENT has no escape interpretation; or
DOUBLE_QUOTED_CONTENT's escape interpretation contains any of the following:
- a Unicode escape that has no represented character
- a non-escape whose represented character is CR; or
any of the token's represented bytes would be 0; or
the token's suffix would consist of the single character _.
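
The derivation of a C-string literal's represented bytes can be sketched as a small model. This is an illustrative assumption, not the writeup's implementation: the `Component` type and `c_string_bytes` function are hypothetical, and the model takes an already-computed escape interpretation as input, so the "no escape interpretation" case and Unicode escapes without a represented character are not modelled.

```rust
/// One component of a (hypothetical) escape interpretation.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Component {
    NonEscape(char),
    SimpleEscape(char),
    UnicodeEscape(char),
    HexEscape(u8),
}

/// Returns the represented bytes, or `None` if the match is rejected.
pub fn c_string_bytes(components: &[Component]) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    for component in components {
        match *component {
            // A CR written directly in the literal content is rejected.
            Component::NonEscape('\r') => return None,
            // These three forms contribute the UTF-8 encoding of
            // their represented character.
            Component::NonEscape(c)
            | Component::SimpleEscape(c)
            | Component::UnicodeEscape(c) => {
                let mut buf = [0u8; 4];
                out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
            }
            // A hexadecimal escape contributes its byte unchanged.
            Component::HexEscape(b) => out.push(b),
        }
    }
    // No represented byte may be 0.
    if out.contains(&0) { None } else { Some(out) }
}
```

In this model a hexadecimal escape above 0x7f yields a single byte, whereas a non-ASCII non-escape character yields its multi-byte UTF-8 encoding.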

Raw double-quoted literals

The following nonterminals are common to the definitions below:

Grammar

RAW_DOUBLE_QUOTED_FORM = {
    HASHES¹ ~
    DOUBLEQUOTE ~ RAW_DOUBLE_QUOTED_CONTENT ~ DOUBLEQUOTE ~
    HASHES² ~
    SUFFIX ?
}
RAW_DOUBLE_QUOTED_CONTENT = {
    ( !(DOUBLEQUOTE ~ HASHES²) ~ ANY ) *
}
HASHES = { "#" {0, 255} }

SUFFIX = { IDENT }
IDENT = { IDENT_START ~ XID_CONTINUE * }
IDENT_START = { XID_START | "_" }

These definitions require an extension to the Parsing Expression Grammar formalism: an attempt to match one of the parsing expressions marked as HASHES² fails unless the characters it consumes are the same as the characters consumed by the (only) match of the expression marked as HASHES¹ under the same match attempt of a tokenisation nonterminal.

See Grammar for raw string literals for a discussion of alternatives to this extension.
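
The effect of the HASHES¹/HASHES² rule can be sketched with a hand-written matcher. This is a minimal illustration under stated assumptions rather than the writeup's method: the function `match_raw_form` is hypothetical, it ignores the optional SUFFIX, and it expects the literal's prefix (such as r) to have been consumed already.

```rust
/// Tries to match RAW_DOUBLE_QUOTED_FORM (without SUFFIX) at the start
/// of `input`. On success returns the raw content and the remaining
/// input.
pub fn match_raw_form(input: &str) -> Option<(&str, &str)> {
    // HASHES¹: greedily consume up to 255 '#' characters.
    let hashes = input
        .bytes()
        .take_while(|&b| b == b'#')
        .take(255)
        .count();
    // The opening double quote must follow immediately.
    let body = input[hashes..].strip_prefix('"')?;
    // The content ends at the first double quote that is followed by
    // the same number of '#' characters as HASHES¹ consumed.
    let closer = format!("\"{}", "#".repeat(hashes));
    let end = body.find(closer.as_str())?;
    Some((&body[..end], &body[end + closer.len()..]))
}
```

For example, given the input #"a"b"#rest, this sketch returns the content a"b and the remainder rest; given ##"x"# it returns None because no closing delimiter with two hashes is present.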

Raw string literal

Grammar

Raw_string_literal = { "r" ~ RAW_DOUBLE_QUOTED_FORM }

Attributes

The token's represented string is RAW_DOUBLE_QUOTED_CONTENT.

The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.

Rejection

The match is rejected if:

a CR character appears in RAW_DOUBLE_QUOTED_CONTENT; or
the token's suffix would consist of the single character _.
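
A short sketch of a raw string literal, showing that the content is taken as written, including double quotes and backslashes. The function name is hypothetical.

```rust
/// The represented string is exactly the characters between the
/// delimiters: the inner double quotes need no escaping, and `\n` is a
/// backslash followed by the letter n.
pub fn raw_string() -> &'static str {
    r#"say "hi" \n"#
}
```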

Raw byte-string literal

Grammar

Raw_byte_string_literal = { "br" ~ RAW_DOUBLE_QUOTED_FORM }

Attributes

The token's represented bytes are the Unicode scalar values of the characters in RAW_DOUBLE_QUOTED_CONTENT. (This is well defined because of the first rejection case below.)

The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.

Rejection

The match is rejected if:

any character whose Unicode scalar value is greater than 127 appears in RAW_DOUBLE_QUOTED_CONTENT; or
a CR character appears in RAW_DOUBLE_QUOTED_CONTENT; or
the token's suffix would consist of the single character _.
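
The attribute and the first two rejection cases can be sketched as a model over the raw content. The function `raw_byte_string_bytes` is a hypothetical helper; suffix handling is left out.

```rust
/// Returns the represented bytes of a raw byte-string literal with the
/// given content, or `None` if the match is rejected.
pub fn raw_byte_string_bytes(content: &str) -> Option<Vec<u8>> {
    content
        .chars()
        .map(|c| match c {
            // A CR character is rejected.
            '\r' => None,
            // So is any character with a scalar value above 127.
            c if (c as u32) > 127 => None,
            // Otherwise the byte is the character's scalar value.
            c => Some(c as u8),
        })
        .collect()
}
```

Because backslashes represent themselves in raw literals, the content \x41 yields four bytes rather than the single byte 0x41.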

Raw C-string literal

Grammar

Raw_c_string_literal = { "cr" ~ RAW_DOUBLE_QUOTED_FORM }

Attributes

The token's represented bytes are the UTF-8 encoding of RAW_DOUBLE_QUOTED_CONTENT.

The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.

Rejection

The match is rejected if:

a CR character appears in RAW_DOUBLE_QUOTED_CONTENT; or
any of the token's represented bytes would be 0; or
the token's suffix would consist of the single character _.
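
A corresponding model for raw C-string literals, again as a sketch with a hypothetical function name and without suffix handling:

```rust
/// Returns the represented bytes of a raw C-string literal with the
/// given content, or `None` if the match is rejected.
pub fn raw_c_string_bytes(content: &str) -> Option<Vec<u8>> {
    // A CR character in the content is rejected.
    if content.contains('\r') {
        return None;
    }
    // The represented bytes are the UTF-8 encoding of the content.
    let bytes = content.as_bytes();
    // No represented byte may be 0.
    if bytes.contains(&0) {
        None
    } else {
        Some(bytes.to_vec())
    }
}
```

Unlike raw byte-string literals, non-ASCII characters are accepted here and contribute their multi-byte UTF-8 encoding.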

Reserved forms

Grammar

Reserved_literal_2015 = { "r" ~ DOUBLEQUOTE | "br" ~ DOUBLEQUOTE | "b'" }
Reserved_literal_2021 = { IDENT ~ ( DOUBLEQUOTE | "'" ) }

Rejection

All matches are rejected.

Note: I believe that in the Reserved_literal_2015 definition only the b' form is strictly needed: if the definition matches via one of the other subexpressions, the input will eventually be rejected anyway (given that the corresponding quoted literal nonterminal didn't match).

Note: Reserved_literal_2021 catches both reserved forms and unterminated b' literals.

Reserved single-quoted literal

Grammar

Reserved_single_quoted_literal_2015 = { "'" ~ IDENT ~ "'" }
Reserved_single_quoted_literal_2021 = { "'" ~ "r#" ? ~ IDENT ~ "'" }

Rejection

All matches are rejected.

Note: This reservation is to catch forms like 'aaa'bbb, so this definition must come before Lifetime_or_label.

Reserved guard (Rust 2024)

Grammar

Reserved_guard = { "##" | "#" ~ DOUBLEQUOTE }

Rejection

All matches are rejected.

Note: This definition is listed here near the double-quoted string literals because these forms were reserved during discussions about introducing string literals formed like #"…"#.

Writeup of Rust's lexer