String and byte literal tokens

Single-quoted literals
- Character literal
- Byte literal
(Non-raw) double-quoted literals
Raw double-quoted literals
Reserved forms

Single-quoted literals

The following nonterminals are common to the definitions below:

Grammar

SQ_REMAINDER = {
    "'" ~ SQ_CONTENT ~ "'" ~
    SUFFIX ?
}
SQ_CONTENT = {
    BACKSLASH ~ ANY ~ ( !"'" ~ ANY ) * |
    !"'" ~ ANY
}

SUFFIX = { IDENT }
IDENT = { IDENT_START ~ XID_CONTINUE * }
IDENT_START = { XID_START | "_" }
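As an illustration of how the two alternatives of SQ_CONTENT behave, here is a minimal sketch of a matcher for the quoted part of SQ_REMAINDER. The function name `match_sq` is hypothetical, and the sketch deliberately ignores the optional SUFFIX; it is not taken from any real lexer.

```rust
/// Tries to match `"'" ~ SQ_CONTENT ~ "'"` at the start of `input`.
/// On success returns (SQ_CONTENT, remaining input after the closing quote).
/// SUFFIX is not handled in this sketch.
fn match_sq(input: &str) -> Option<(&str, &str)> {
    let rest = input.strip_prefix('\'')?;
    let mut chars = rest.chars();
    let first = chars.next()?;
    let content_len = if first == '\\' {
        // First alternative: BACKSLASH ~ ANY ~ ( !"'" ~ ANY )*
        let escaped = chars.next()?; // ANY: may itself be a quote
        let fixed = first.len_utf8() + escaped.len_utf8();
        // Consume everything up to the next quote (or to end of input).
        fixed + rest[fixed..].find('\'').unwrap_or(rest.len() - fixed)
    } else if first != '\'' {
        // Second alternative: !"'" ~ ANY (exactly one character)
        first.len_utf8()
    } else {
        return None;
    };
    let (content, tail) = rest.split_at(content_len);
    // The closing quote must follow immediately.
    let after = tail.strip_prefix('\'')?;
    Some((content, after))
}
```

Under this sketch `'ab'` does not match at all, because the non-escape alternative consumes exactly one character and the closing quote must follow it immediately.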

Character literal

Grammar

Character_literal = { SQ_REMAINDER }

Definitions

Define a represented character, derived from SQ_CONTENT as follows:

If SQ_CONTENT is the single character LF, CR, or HT, the match is rejected.
If SQ_CONTENT is any other single character, the represented character is that character.
If SQ_CONTENT is one of the following forms of escape sequence, the represented character is the escape sequence's escaped value:
Otherwise the match is rejected.

Attributes

The token's represented character is the represented character.

The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.

Rejection

The match is rejected if:

SQ_CONTENT is unacceptable, as described in the definition of the represented character above; or
the token's suffix would consist of the single character _.

Byte literal

Grammar

Byte_literal = { "b" ~ SQ_REMAINDER }

Definitions

Define a represented character, derived from SQ_CONTENT as follows:

If SQ_CONTENT is the single character LF, CR, or HT, the match is rejected.
If SQ_CONTENT is a single character with Unicode scalar value greater than 127, the match is rejected.
If SQ_CONTENT is any other single character, the represented character is that character.
If SQ_CONTENT is one of the following forms of escape sequence, the represented character is the escape sequence's escaped value:
- Simple escapes
- 8-bit escapes
Otherwise the match is rejected.
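The rules above can be sketched as a small function. The escape forms are defined elsewhere in the writeup; this sketch assumes that simple escapes are `\0`, `\t`, `\n`, `\r`, `\"`, `\'` and `\\`, and that an 8-bit escape is `\x` followed by exactly two hexadecimal digits. The name `represented_byte` is hypothetical, and `None` stands for "the match is rejected".

```rust
/// Derives the represented byte from the SQ_CONTENT of a byte literal.
/// Assumes `sq_content` has already matched the SQ_CONTENT grammar.
fn represented_byte(sq_content: &str) -> Option<u8> {
    let mut chars = sq_content.chars();
    let first = chars.next()?;
    let rest = chars.as_str();
    if first != '\\' {
        // Single-character cases.
        if !rest.is_empty() {
            return None;
        }
        return match first {
            '\n' | '\r' | '\t' => None,  // LF, CR, HT are rejected
            c if !c.is_ascii() => None,  // scalar value greater than 127
            c => Some(c as u8),
        };
    }
    // Escape cases (assumed forms, see the lead-in).
    match rest {
        "0" => Some(0),
        "t" => Some(b'\t'),
        "n" => Some(b'\n'),
        "r" => Some(b'\r'),
        "\"" => Some(b'"'),
        "'" => Some(b'\''),
        "\\" => Some(b'\\'),
        _ => {
            // 8-bit escape: \x followed by exactly two hex digits.
            let hex = rest.strip_prefix('x')?;
            if hex.len() == 2 && hex.bytes().all(|b| b.is_ascii_hexdigit()) {
                u8::from_str_radix(hex, 16).ok()
            } else {
                None // "Otherwise the match is rejected"
            }
        }
    }
}
```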

Attributes

The token's represented byte is the represented character's Unicode scalar value. (This is well defined because the definition above ensures that value is less than 256.)

The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.

Rejection

The match is rejected if:

SQ_CONTENT is unacceptable, as described in the definition of the represented character above; or
the token's suffix would consist of the single character _.

(Non-raw) double-quoted literals

The following nonterminals are common to the definitions below:

Grammar

DQ_REMAINDER = {
    DOUBLEQUOTE ~ DQ_CONTENT ~ DOUBLEQUOTE ~
    SUFFIX ?
}
DQ_CONTENT = {
    (
        BACKSLASH ~ ANY |
        !DOUBLEQUOTE ~ ANY
    ) *
}

SUFFIX = { IDENT }
IDENT = { IDENT_START ~ XID_CONTINUE * }
IDENT_START = { XID_START | "_" }
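The key property of DQ_CONTENT is that a backslash always takes the following character with it, so an escaped double quote does not end the literal. A minimal sketch of that scan follows; the name `match_dq` is hypothetical and SUFFIX is again left out.

```rust
/// Tries to match `DOUBLEQUOTE ~ DQ_CONTENT ~ DOUBLEQUOTE` at the start of `input`.
/// On success returns (DQ_CONTENT, remaining input after the closing quote).
/// SUFFIX is not handled in this sketch.
fn match_dq(input: &str) -> Option<(&str, &str)> {
    let rest = input.strip_prefix('"')?;
    let mut chars = rest.char_indices();
    while let Some((i, c)) = chars.next() {
        match c {
            // BACKSLASH ~ ANY: skip the next character, even if it is a quote.
            '\\' => {
                chars.next();
            }
            // First unescaped quote: DQ_CONTENT ends here.
            '"' => return Some((&rest[..i], &rest[i + 1..])),
            // !DOUBLEQUOTE ~ ANY
            _ => {}
        }
    }
    // Ran out of input without finding the closing quote.
    None
}
```

Note that this step only finds the extent of the literal. Whether each backslash sequence is a valid escape is decided later, by the rejection rules of the individual token kinds.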

String literal

Grammar

String_literal = { DQ_REMAINDER }

Attributes

The token's represented string is derived from DQ_CONTENT by replacing each escape sequence of any of the following forms with the escape sequence's escaped value:

These replacements take place in left-to-right order. For example, a match against the characters "\\x41" is converted to the characters \ x 4 1.
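The left-to-right rule can be illustrated with a deliberately partial sketch. It assumes only two escape forms: the simple escapes `\0`, `\t`, `\n`, `\r`, `\"`, `\'`, `\\`, and a hexadecimal escape `\xHH` limited to values up to 0x7F. It treats every other backslash sequence as a rejection (`None`) and does not model the CR rule. The name `unescape_subset` is hypothetical.

```rust
/// Replaces a subset of escape sequences in DQ_CONTENT, scanning left to right.
/// Returns None where this sketch does not recognise an escape.
fn unescape_subset(dq_content: &str) -> Option<String> {
    let mut out = String::new();
    let mut chars = dq_content.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        // A backslash starts an escape; the characters it consumes are
        // never re-examined as the start of another escape.
        let escaped = match chars.next()? {
            '0' => '\0',
            't' => '\t',
            'n' => '\n',
            'r' => '\r',
            '"' => '"',
            '\'' => '\'',
            '\\' => '\\',
            'x' => {
                let hi = chars.next()?.to_digit(16)?;
                let lo = chars.next()?.to_digit(16)?;
                let value = hi * 16 + lo;
                if value > 0x7F {
                    return None; // assumed 7-bit limit for string literals
                }
                char::from_u32(value)?
            }
            _ => return None,
        };
        out.push(escaped);
    }
    Some(out)
}
```

With this sketch the content `\\x41` becomes the four characters `\x41`, because the first two characters are consumed as one escape before `x41` is seen, whereas the content `\x41` becomes `A`.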

See Wording for string unescaping

The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.

Rejection

The match is rejected if:

a \ character appears in DQ_CONTENT and is not part of one of the above forms of escape; or
a CR character appears in DQ_CONTENT and is not part of a string continuation escape; or
the token's suffix would consist of the single character _.

Byte-string literal

Grammar

Byte_string_literal = { "b" ~ DQ_REMAINDER }

Definitions

Define a represented string (a sequence of characters) derived from DQ_CONTENT by replacing each escape sequence of any of the following forms with the escape sequence's escaped value:

These replacements take place in left-to-right order. For example, a match against the characters b"\\x41" is converted to the characters \ x 4 1.

See Wording for string unescaping

Attributes

The token's represented bytes are the sequence of Unicode scalar values of the characters in the represented string. (This is well defined because of the first rejection case below.)

The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.

Rejection

The match is rejected if:

any character whose Unicode scalar value is greater than 127 appears in DQ_CONTENT; or
a \ character appears in DQ_CONTENT and is not part of one of the above forms of escape; or
a CR character appears in DQ_CONTENT and is not part of a string continuation escape; or
the token's suffix would consist of the single character _.

C-string literal

Grammar

C_string_literal = { "c" ~ DQ_REMAINDER }

Attributes

DQ_CONTENT is treated as a sequence of items, each of which is either a single Unicode character other than \ or an escape.

The token's represented bytes are derived from that sequence of items in the following way:

Each single Unicode character contributes its UTF-8 representation.
Each simple escape or 8-bit escape contributes a single byte containing the Unicode scalar value of its escaped value.
Each Unicode escape contributes the UTF-8 representation of its escaped value.
Each string continuation escape contributes no bytes.
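The four rules above might be sketched as follows. The escape syntax is defined elsewhere in the writeup, so the forms used here are assumptions: simple escapes as `\0 \t \n \r \" \' \\`, 8-bit escapes as `\xHH`, Unicode escapes as `\u{...}` with one to six hexadecimal digits (underscore placement is not checked), and a string continuation escape as a backslash followed by LF and any run of space, HT, LF or CR. The name `c_string_bytes` is hypothetical, and the rejection rules (including the one for zero bytes) are not modelled beyond returning `None` for unrecognised escapes.

```rust
/// Derives the represented bytes of a C-string literal from its DQ_CONTENT.
fn c_string_bytes(dq_content: &str) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    let mut buf = [0u8; 4];
    let mut chars = dq_content.chars().peekable();
    while let Some(c) = chars.next() {
        if c != '\\' {
            // A single Unicode character contributes its UTF-8 representation.
            out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
            continue;
        }
        match chars.next()? {
            // Simple escapes contribute a single byte.
            '0' => out.push(0),
            't' => out.push(b'\t'),
            'n' => out.push(b'\n'),
            'r' => out.push(b'\r'),
            '"' => out.push(b'"'),
            '\'' => out.push(b'\''),
            '\\' => out.push(b'\\'),
            // An 8-bit escape contributes the single byte 0xHH.
            'x' => {
                let hi = chars.next()?.to_digit(16)?;
                let lo = chars.next()?.to_digit(16)?;
                out.push((hi * 16 + lo) as u8);
            }
            // A Unicode escape contributes the UTF-8 form of its value.
            'u' => {
                if chars.next()? != '{' {
                    return None;
                }
                let mut digits = String::new();
                loop {
                    match chars.next()? {
                        '}' => break,
                        '_' => {}
                        d if d.is_ascii_hexdigit() => digits.push(d),
                        _ => return None,
                    }
                }
                if digits.is_empty() || digits.len() > 6 {
                    return None;
                }
                let ch = char::from_u32(u32::from_str_radix(&digits, 16).ok()?)?;
                out.extend_from_slice(ch.encode_utf8(&mut buf).as_bytes());
            }
            // A string continuation escape contributes no bytes.
            '\n' => {
                while matches!(chars.peek().copied(), Some(' ' | '\t' | '\n' | '\r')) {
                    chars.next();
                }
            }
            _ => return None,
        }
    }
    Some(out)
}
```

The sketch makes the difference between the two kinds of escape visible: `\xe9` yields the single byte 0xE9, while `\u{e9}` and a literal `é` both yield the two-byte UTF-8 sequence 0xC3 0xA9.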

See Wording for string unescaping

The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.

Rejection

The match is rejected if:

a \ character appears in DQ_CONTENT and is not part of one of the above forms of escape; or
a CR character appears in DQ_CONTENT and is not part of a string continuation escape; or
any of the token's represented bytes would have value 0; or
the token's suffix would consist of the single character _.

Raw double-quoted literals

The following nonterminals are common to the definitions below:

Grammar

RAW_DQ_REMAINDER = {
    HASHES¹ ~
    DOUBLEQUOTE ~ RAW_DQ_CONTENT ~ DOUBLEQUOTE ~
    HASHES² ~
    SUFFIX ?
}
RAW_DQ_CONTENT = {
    ( !(DOUBLEQUOTE ~ HASHES²) ~ ANY ) *
}
HASHES = { "#" * }

SUFFIX = { IDENT }
IDENT = { IDENT_START ~ XID_CONTINUE * }
IDENT_START = { XID_START | "_" }

These definitions require an extension to the Parsing Expression Grammar formalism: an attempt to match one of the parsing expressions marked as HASHES² fails unless the characters it consumes are the same as the characters consumed by the (only) match of the expression marked as HASHES¹ under the same match attempt of a token-kind nonterminal. Only the expressions marked as HASHES¹ are participating matches of HASHES.
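Outside the PEG formalism, the same-number-of-hashes constraint is easy to express directly. The sketch below is one way to do it, under the assumption that the closing delimiter is the first double quote followed by the same number of `#` characters as were seen before the opening quote. The name `match_raw_dq` is hypothetical, and SUFFIX is not handled.

```rust
/// Tries to match `HASHES ~ DOUBLEQUOTE ~ RAW_DQ_CONTENT ~ DOUBLEQUOTE ~ HASHES`
/// at the start of `input` (that is, after the "r", "br" or "cr" prefix).
/// On success returns (RAW_DQ_CONTENT, remaining input after the closing hashes).
fn match_raw_dq(input: &str) -> Option<(&str, &str)> {
    // HASHES¹: count the leading '#' characters (each is one byte in UTF-8).
    let n = input.chars().take_while(|&c| c == '#').count();
    let rest = input[n..].strip_prefix('"')?;
    // The closing delimiter is a quote followed by the same n hashes.
    let closer = format!("\"{}", "#".repeat(n));
    // RAW_DQ_CONTENT runs up to the first place where the closer appears.
    let end = rest.find(closer.as_str())?;
    Some((&rest[..end], &rest[end + closer.len()..]))
}
```

Because the content stops at the first occurrence of the closing delimiter, a quote followed by too few hashes is simply part of the content.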

See Grammar for raw string literals for a discussion of alternatives to this extension.

Raw string literal

Grammar

Raw_string_literal = { "r" ~ RAW_DQ_REMAINDER }

Attributes

The token's represented string is RAW_DQ_CONTENT.

The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.

Rejection

The match is rejected if:

HASHES contains more than 255 characters; or
a CR character appears in RAW_DQ_CONTENT; or
the token's suffix would consist of the single character _.

Raw byte-string literal

Grammar

Raw_byte_string_literal = { "br" ~ RAW_DQ_REMAINDER }

Attributes

The token's represented bytes are the Unicode scalar values of the characters in RAW_DQ_CONTENT. (This is well defined because of the first rejection case below.)

The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.

Rejection

The match is rejected if:

HASHES contains more than 255 characters; or
any character whose Unicode scalar value is greater than 127 appears in RAW_DQ_CONTENT; or
a CR character appears in RAW_DQ_CONTENT; or
the token's suffix would consist of the single character _.
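Putting the attribute and the rejection rules for raw byte-string literals together gives a short check. In this sketch `hash_count`, `content` and `suffix` are assumed to be the already-matched HASHES length, RAW_DQ_CONTENT and SUFFIX text (an empty suffix meaning SUFFIX did not participate); the function name is hypothetical and `None` stands for rejection.

```rust
/// Applies the raw byte-string rejection rules and, if none applies,
/// returns the token's represented bytes.
fn raw_byte_string_bytes(hash_count: usize, content: &str, suffix: &str) -> Option<Vec<u8>> {
    if hash_count > 255 {
        return None; // HASHES contains more than 255 characters
    }
    if content.chars().any(|c| !c.is_ascii()) {
        return None; // a character with scalar value greater than 127
    }
    if content.contains('\r') {
        return None; // a CR character appears in RAW_DQ_CONTENT
    }
    if suffix == "_" {
        return None; // the suffix would be the single character _
    }
    // Every character is ASCII, so each scalar value fits in one byte and
    // coincides with the character's UTF-8 encoding.
    Some(content.bytes().collect())
}
```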

Raw C-string literal

Grammar

Raw_c_string_literal = { "cr" ~ RAW_DQ_REMAINDER }

Attributes

The token's represented bytes are the UTF-8 encoding of RAW_DQ_CONTENT.

The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.

Rejection

The match is rejected if:

HASHES contains more than 255 characters; or
a CR character appears in RAW_DQ_CONTENT; or
any of the token's represented bytes would have value 0; or
the token's suffix would consist of the single character _.

Reserved forms

Grammar

Unterminated_literal_2015 = { "r" ~ DOUBLEQUOTE | "br" ~ DOUBLEQUOTE | "b'" }
Reserved_literal_2021 = { IDENT ~ ( DOUBLEQUOTE | "'" ) }

Rejection

All matches are rejected.

Note: I believe in the Unterminated_literal_2015 definition only the b' form is strictly needed: if that definition matches using one of the other subexpressions then the input will be rejected eventually anyway (given that the corresponding string literal nonterminal didn't match).

Note: Reserved_literal_2021 catches both reserved forms and unterminated b' literals.

Reserved single-quoted literal

Grammar

Reserved_single_quoted_literal_2015 = { "'" ~ IDENT ~ "'" }
Reserved_single_quoted_literal_2021 = { "'" ~ "r#" ? ~ IDENT ~ "'" }

Rejection

All matches are rejected.

Note: This reservation is to catch forms like 'aaa'bbb, so this definition must come before Lifetime_or_label.

Reserved guard (Rust 2024)

Grammar

Reserved_guard = { "##" | "#" ~ DOUBLEQUOTE }

Rejection

All matches are rejected.
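As a small illustration, the Reserved_guard pattern amounts to a two-way prefix test. The function name below is hypothetical; in a full lexer the test would apply only at the start of a token and only in the edition named in the heading.

```rust
/// Returns true if `input` begins with a form matched by Reserved_guard:
/// either "##" or "#" followed by a double quote.
fn starts_with_reserved_guard(input: &str) -> bool {
    input.starts_with("##") || input.starts_with("#\"")
}
```

Under this sketch an input beginning `#"…"#` is caught by the second alternative, while an attribute such as `#[derive(Debug)]` is not affected.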

Note: This definition is listed here near the double-quoted string literals because these forms were reserved during discussions about introducing string literals formed like #"…"#.

Writeup of Rust's lexer