String and byte literal tokens

Single-quoted literals

The following nonterminals are common to the definitions below:

Grammar
SQ_REMAINDER = {
    "'" ~ SQ_CONTENT ~ "'" ~
    SUFFIX ?
}
SQ_CONTENT = {
    "\\" ~ ANY ~ ( !"'" ~ ANY ) * |
    !"'" ~ ANY
}

SUFFIX = { IDENT }
IDENT = { IDENT_START ~ XID_CONTINUE * }
IDENT_START = { XID_START | "_" }
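As an illustration of how the ordered choice in SQ_CONTENT behaves, the following is a minimal sketch of a hand-written matcher. The function name `match_sq_content` is hypothetical, and the sketch covers only SQ_CONTENT itself, not the surrounding quotes or SUFFIX.

```rust
/// Returns the byte length of the SQ_CONTENT match at the start of `input`,
/// or `None` if SQ_CONTENT does not match there.
/// (Hypothetical helper, written only to illustrate the grammar rule.)
fn match_sq_content(input: &str) -> Option<usize> {
    let mut chars = input.char_indices();
    let (_, first) = chars.next()?;
    if first == '\\' {
        // First alternative: "\\" ~ ANY ~ ( !"'" ~ ANY ) *
        if let Some((i, escaped)) = chars.next() {
            let mut end = i + escaped.len_utf8();
            // Greedily consume everything up to (not including) the next "'".
            for (j, c) in chars {
                if c == '\'' {
                    break;
                }
                end = j + c.len_utf8();
            }
            return Some(end);
        }
        // No character follows the backslash: fall through to the
        // second alternative.
    }
    // Second alternative: !"'" ~ ANY
    if first == '\'' {
        None
    } else {
        Some(first.len_utf8())
    }
}
```

For example, on the input `a'` the sketch reports a one-byte match, and on `\u{41}'` it reports a six-byte match; the character immediately after the backslash is consumed even if it is a `'`.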

Character literal

Grammar
Character_literal = { SQ_REMAINDER }
Definitions

Define a represented character, derived from SQ_CONTENT as follows:

  • If SQ_CONTENT is the single character LF, CR, or TAB, the match is rejected.

  • If SQ_CONTENT is any other single character, the represented character is that character.

  • If SQ_CONTENT is one of the following forms of escape sequence, the represented character is the escape sequence's escaped value:

  • Otherwise the match is rejected.

Attributes

The token's represented character is the represented character.

The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.

Rejection

The match is rejected if:

  • SQ_CONTENT is unacceptable, as described in the definition of the represented character above; or
  • the token's suffix would consist of the single character _.
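The single-character cases and the suffix rule above can be sketched as follows. The names `SqOutcome`, `single_char_content`, and `suffix_is_rejected` are hypothetical, and escape sequences are deliberately left out: any content that is not exactly one character is reported as needing escape handling rather than being resolved.

```rust
/// Outcome of examining SQ_CONTENT for a character literal.
/// (Hypothetical type for this sketch.)
#[derive(Debug, PartialEq)]
enum SqOutcome {
    /// The represented character.
    Char(char),
    /// The match is rejected.
    Rejected,
    /// Not a single character: escape handling is outside this sketch.
    NeedsEscapeHandling,
}

fn single_char_content(sq_content: &str) -> SqOutcome {
    let mut chars = sq_content.chars();
    match (chars.next(), chars.next()) {
        // A lone LF, CR, or TAB is rejected.
        (Some('\n' | '\r' | '\t'), None) => SqOutcome::Rejected,
        // Any other single character represents itself.
        (Some(c), None) => SqOutcome::Char(c),
        // Everything else would have to be an escape sequence.
        _ => SqOutcome::NeedsEscapeHandling,
    }
}

/// A suffix consisting of the single character `_` causes rejection.
fn suffix_is_rejected(suffix: &str) -> bool {
    suffix == "_"
}
```

Under these assumptions, `single_char_content("a")` yields `SqOutcome::Char('a')`, a lone tab is rejected, and a suffix such as `_x` is accepted while `_` alone is not.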

Byte literal

Grammar
Byte_literal = { "b" ~ SQ_REMAINDER }
Definitions

Define a represented character, derived from SQ_CONTENT as follows:

  • If SQ_CONTENT is the single character LF, CR, or TAB, the match is rejected.

  • If SQ_CONTENT is a single character with Unicode scalar value greater than 127, the match is rejected.

  • If SQ_CONTENT is any other single character, the represented character is that character.

  • If SQ_CONTENT is one of the following forms of escape sequence, the represented character is the escape sequence's escaped value:

  • Otherwise the match is rejected.

Attributes

The token's represented byte is the represented character's Unicode scalar value. (This is well defined because the definition above ensures that value is less than 256.)

The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.

Rejection

The match is rejected if:

  • SQ_CONTENT is unacceptable, as described in the definition of the represented character above; or
  • the token's suffix would consist of the single character _.
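For byte literals, the single-character cases differ from character literals only in the extra limit on the scalar value. A minimal sketch, using a hypothetical function `single_char_byte` and again leaving escape sequences out of scope:

```rust
/// Returns the represented byte when SQ_CONTENT is a single acceptable
/// character, or a short reason why this sketch does not accept it.
/// (Hypothetical helper; escape sequences are not handled here.)
fn single_char_byte(sq_content: &str) -> Result<u8, &'static str> {
    let mut chars = sq_content.chars();
    match (chars.next(), chars.next()) {
        (Some('\n' | '\r' | '\t'), None) => Err("LF, CR, or TAB"),
        (Some(c), None) if (c as u32) > 127 => Err("scalar value greater than 127"),
        // The scalar value is at most 127 here, so the cast is lossless.
        (Some(c), None) => Ok(c as u8),
        _ => Err("not a single character (escapes are outside this sketch)"),
    }
}
```

With this sketch, `single_char_byte("a")` gives `Ok(97)`, while `é` (scalar value 233) is refused because its scalar value exceeds 127.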

(Non-raw) double-quoted literals

The following nonterminals are common to the definitions below:

Grammar
DQ_REMAINDER = {
    "\"" ~ DQ_CONTENT ~ "\"" ~
    SUFFIX ?
}
DQ_CONTENT = {
    (
        "\\" ~ ANY |
        !"\"" ~ ANY
    ) *
}

SUFFIX = { IDENT }
IDENT = { IDENT_START ~ XID_CONTINUE * }
IDENT_START = { XID_START | "_" }

String literal

Grammar
String_literal = { DQ_REMAINDER }
Attributes

The token's represented string is derived from DQ_CONTENT by replacing each escape sequence of any of the following forms with the escape sequence's escaped value:

These replacements take place in left-to-right order. For example, a match against the characters "\\x41" is converted to the characters \ x 4 1.

See Wording for string unescaping

The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.

Rejection

The match is rejected if:

  • a \ character appears in DQ_CONTENT and is not part of one of the above forms of escape; or
  • a CR character appears in DQ_CONTENT and is not part of a string continuation escape; or
  • the token's suffix would consist of the single character _.
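The left-to-right replacement and the first two rejection cases can be illustrated with a deliberately reduced sketch. It assumes only two escape forms, `\\` and `\xHH` with a value of at most 0x7F, which is a subset chosen for illustration rather than the full list of forms; because string continuation escapes are not implemented, it rejects every CR. The function name `unescape_subset` is hypothetical.

```rust
/// Unescapes DQ_CONTENT, scanning left to right.
/// Returns `None` where this sketch would reject the match.
/// (Hypothetical helper; handles only `\\` and `\xHH` up to 0x7F.)
fn unescape_subset(dq_content: &str) -> Option<String> {
    let mut out = String::new();
    let mut chars = dq_content.chars();
    while let Some(c) = chars.next() {
        match c {
            // No string continuation escapes here, so any CR is rejected.
            '\r' => return None,
            '\\' => match chars.next()? {
                '\\' => out.push('\\'),
                'x' => {
                    let hi = chars.next()?.to_digit(16)?;
                    let lo = chars.next()?.to_digit(16)?;
                    let value = hi * 16 + lo;
                    if value > 0x7F {
                        return None;
                    }
                    out.push(char::from_u32(value)?);
                }
                // A backslash that starts no recognised escape form.
                _ => return None,
            },
            other => out.push(other),
        }
    }
    Some(out)
}
```

Because scanning is left to right, the content `\\x41` becomes the four characters `\ x 4 1`: the leading `\\` is consumed as one escape before `x41` is ever considered. The content `\x41` on its own becomes `A`.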

Byte-string literal

Grammar
Byte_string_literal = { "b" ~ DQ_REMAINDER }
Definitions

Define a represented string (a sequence of characters) derived from DQ_CONTENT by replacing each escape sequence of any of the following forms with the escape sequence's escaped value:

These replacements take place in left-to-right order. For example, a match against the characters b"\\x41" is converted to the characters \ x 4 1.

See Wording for string unescaping

Attributes

The token's represented bytes are the sequence of Unicode scalar values of the characters in the represented string. (This is well defined because of the first rejection case below.)

The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.

Rejection

The match is rejected if:

  • any character whose Unicode scalar value is greater than 127 appears in DQ_CONTENT; or
  • a \ character appears in DQ_CONTENT and is not part of one of the above forms of escape; or
  • a CR character appears in DQ_CONTENT and is not part of a string continuation escape; or
  • the token's suffix would consist of the single character _.
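The step from the represented string to the represented bytes, together with the first rejection case, might be sketched as below. The function `represented_bytes` is hypothetical and takes the already-unescaped represented string as a parameter, since unescaping itself is not part of this sketch.

```rust
/// Converts a byte-string literal's represented string to its represented
/// bytes. Returns `None` if DQ_CONTENT contains a character above 127, or
/// if some represented character does not fit in a byte.
/// (Hypothetical helper; `represented` is assumed to be already unescaped.)
fn represented_bytes(dq_content: &str, represented: &str) -> Option<Vec<u8>> {
    // First rejection case: no character above 127 may appear in DQ_CONTENT.
    if dq_content.chars().any(|c| (c as u32) > 127) {
        return None;
    }
    // Each represented character contributes its scalar value as one byte.
    represented
        .chars()
        .map(|c| if (c as u32) <= 0xFF { Some(c as u32 as u8) } else { None })
        .collect()
}
```

For instance, the content `\xFF` consists only of ASCII characters, so it passes the check; if its represented string is the single character U+00FF, the represented bytes are the single byte 255.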

C-string literal

Grammar
C_string_literal = { "c" ~ DQ_REMAINDER }
Attributes

DQ_CONTENT is treated as a sequence of items, each of which is either a single Unicode character other than \ or an escape.

The token's represented bytes are derived from that sequence of items in the following way:

See Wording for string unescaping

The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.

Rejection

The match is rejected if:

  • a \ character appears in DQ_CONTENT and is not part of one of the above forms of escape; or
  • a CR character appears in DQ_CONTENT and is not part of a string continuation escape; or
  • any of the token's represented bytes would have value 0; or
  • the token's suffix would consist of the single character _.

Raw double-quoted literals

The following nonterminals are common to the definitions below:

Grammar
RAW_DQ_REMAINDER = {
    HASHES¹ ~
    "\"" ~ RAW_DQ_CONTENT ~ "\"" ~
    HASHES² ~
    SUFFIX ?
}
RAW_DQ_CONTENT = {
    ( !("\"" ~ HASHES²) ~ ANY ) *
}
HASHES = { "#" {0, 255} }

SUFFIX = { IDENT }
IDENT = { IDENT_START ~ XID_CONTINUE * }
IDENT_START = { XID_START | "_" }

These definitions require an extension to the Parsing Expression Grammar formalism: each of the expressions marked as HASHES² fails unless the text it matches is the same as the text matched by the (only) successful match using the expression marked as HASHES¹ in the same attempt to match the current token-kind nonterminal.

See Grammar for raw string literals for a discussion of alternatives to this extension.
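To make the back-reference between HASHES¹ and HASHES² concrete, here is a minimal hand-written sketch. It assumes that HASHES² is read as matching exactly the text matched by HASHES¹, so the closing delimiter is a `"` followed by the same number of `#` characters. The function name `match_raw_dq_remainder` is hypothetical, and SUFFIX is not handled: any suffix is left in the returned remainder.

```rust
/// Matches HASHES¹ ~ "\"" ~ RAW_DQ_CONTENT ~ "\"" ~ HASHES² at the start of
/// `input`. Returns (RAW_DQ_CONTENT, remaining input) on success.
/// (Hypothetical helper; SUFFIX is not consumed.)
fn match_raw_dq_remainder(input: &str) -> Option<(&str, &str)> {
    // HASHES¹: greedily take up to 255 '#' characters.
    let hashes = input
        .bytes()
        .take_while(|&b| b == b'#')
        .take(255)
        .count();
    // The opening quote must follow immediately. With more than 255 hashes,
    // the next character is still '#', so the match fails here.
    let after_open = input[hashes..].strip_prefix('"')?;
    // RAW_DQ_CONTENT stops at the first '"' followed by the same hashes.
    let closer = format!("\"{}", "#".repeat(hashes));
    let end = after_open.find(&closer)?;
    Some((&after_open[..end], &after_open[end + closer.len()..]))
}
```

Given the input `##"a"#b"## tail`, the sketch returns the content `a"#b` and the remainder ` tail`: the first `"` is followed by only one `#`, so it does not close the literal.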

Raw string literal

Grammar
Raw_string_literal = { "r" ~ RAW_DQ_REMAINDER }
Attributes

The token's represented string is RAW_DQ_CONTENT.

The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.

Rejection

The match is rejected if:

  • a CR character appears in RAW_DQ_CONTENT; or
  • the token's suffix would consist of the single character _.

Raw byte-string literal

Grammar
Raw_byte_string_literal = { "br" ~ RAW_DQ_REMAINDER }
Attributes

The token's represented bytes are the Unicode scalar values of the characters in RAW_DQ_CONTENT. (This is well defined because of the first rejection case below.)

The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.

Rejection

The match is rejected if:

  • any character whose Unicode scalar value is greater than 127 appears in RAW_DQ_CONTENT; or
  • a CR character appears in RAW_DQ_CONTENT; or
  • the token's suffix would consist of the single character _.

Raw C-string literal

Grammar
Raw_c_string_literal = { "cr" ~ RAW_DQ_REMAINDER }
Attributes

The token's represented bytes are the UTF-8 encoding of RAW_DQ_CONTENT.

The token's suffix is SUFFIX, or empty if SUFFIX did not participate in the match.

Rejection

The match is rejected if:

  • a CR character appears in RAW_DQ_CONTENT; or
  • any of the token's represented bytes would have value 0; or
  • the token's suffix would consist of the single character _.
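The attributes and rejection rules of the three raw literal kinds differ only in a few checks, which the following sketch places side by side. The names `RawKind`, `Represented`, and `represent_raw` are hypothetical; the sketch assumes RAW_DQ_CONTENT and the suffix have already been extracted by the grammar.

```rust
/// What a raw literal token represents. (Hypothetical type for this sketch.)
#[derive(Debug, PartialEq)]
enum Represented {
    Str(String),
    Bytes(Vec<u8>),
}

/// The three raw literal kinds: r"…", br"…", and cr"…".
#[derive(Clone, Copy)]
enum RawKind {
    Str,
    ByteStr,
    CStr,
}

/// Returns the represented value, or `None` if the match is rejected.
fn represent_raw(kind: RawKind, raw_content: &str, suffix: &str) -> Option<Represented> {
    // Rejections shared by all three kinds: any CR, or a suffix of just "_".
    if raw_content.contains('\r') || suffix == "_" {
        return None;
    }
    match kind {
        RawKind::Str => Some(Represented::Str(raw_content.to_string())),
        RawKind::ByteStr => {
            // No character above 127 is allowed; for such content the
            // scalar values coincide with the UTF-8 bytes.
            if !raw_content.is_ascii() {
                return None;
            }
            Some(Represented::Bytes(raw_content.bytes().collect()))
        }
        RawKind::CStr => {
            // UTF-8 encoding of the content; a zero byte is rejected.
            let bytes = raw_content.as_bytes().to_vec();
            if bytes.contains(&0) {
                return None;
            }
            Some(Represented::Bytes(bytes))
        }
    }
}
```

For example, the content `é` is rejected for a raw byte-string literal but accepted for a raw C-string literal, where it represents the two bytes 0xC3 0xA9.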

Reserved forms

Reserved or unterminated literal

Grammar
Unterminated_literal_2015 = { "r\"" | "br\"" | "b'" }
Reserved_literal_2021 = { IDENT ~ ( "\"" | "'" ) }
Rejection

All matches are rejected.

Note: I believe that, in the Unterminated_literal_2015 definition, only the b' form is strictly needed: if that definition matches using one of the other subexpressions, the input will eventually be rejected anyway (given that the corresponding string literal nonterminal didn't match).

Note: Reserved_literal_2021 catches both reserved forms and unterminated b' literals.

Reserved single-quoted literal

Grammar
Reserved_single_quoted_literal_2015 = { "'" ~ IDENT ~ "'" }
Reserved_single_quoted_literal_2021 = { "'" ~ "r#" ? ~ IDENT ~ "'" }
Rejection

All matches are rejected.

Note: This reservation is to catch forms like 'aaa'bbb, so this definition must come before Lifetime_or_label.

Reserved guard (Rust 2024)

Grammar
Reserved_guard = { "##" | "#\"" }
Rejection

All matches are rejected.

Note: This definition is listed here near the double-quoted string literals because these forms were reserved during discussions about introducing string literals formed like #"…"#.
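The Reserved_guard rule amounts to a two-way prefix test, sketched below. The function name `matches_reserved_guard` is hypothetical, and the sketch assumes it is applied to the remaining input at the point where a token is being matched in the 2024 edition.

```rust
/// True if Reserved_guard matches at the start of `input`.
/// All such matches are rejected. (Hypothetical helper.)
fn matches_reserved_guard(input: &str) -> bool {
    input.starts_with("##") || input.starts_with("#\"")
}
```

So input beginning `#"…"#` or `##` is caught by this rule, whereas input beginning `#[` is not.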