Escape processing

The escape-processing grammar
Classifying escapes
Escape interpretations

The escape-processing grammar

The escape-processing grammar is the following Parsing Expression Grammar:

LITERAL_COMPONENTS = {
    LITERAL_COMPONENT *
}

LITERAL_COMPONENT = {
    !BACKSLASH ~ ANY |
    BACKSLASH ~ ESCAPE_BODY
}

ESCAPE_BODY = {
    SIMPLE_ESCAPE_BODY |
    UNICODE_ESCAPE_BODY |
    HEXADECIMAL_ESCAPE_BODY |
    STRING_CONTINUATION_ESCAPE_BODY
}

SIMPLE_ESCAPE_BODY = {
    "0" | "t" | "n" | "r" | DOUBLEQUOTE | "'" | BACKSLASH
}

UNICODE_ESCAPE_BODY = {
    "u" ~ "{" ~ ( HEXADECIMAL_DIGIT ~ "_" * ){1,6} ~ "}"
}

HEXADECIMAL_ESCAPE_BODY = {
    "x" ~ HEXADECIMAL_DIGIT ~ HEXADECIMAL_DIGIT
}

STRING_CONTINUATION_ESCAPE_BODY = {
    LF ~ ( TAB | LF | CR | " " ) *
}

HEXADECIMAL_DIGIT = {
    '0'..'9' | 'a'..'f' | 'A'..'F'
}

Classifying escapes

A match of LITERAL_COMPONENT is:

a non-escape if ESCAPE_BODY did not participate in the match
a simple escape if SIMPLE_ESCAPE_BODY participated in the match
a Unicode escape if UNICODE_ESCAPE_BODY participated in the match
a hexadecimal escape if HEXADECIMAL_ESCAPE_BODY participated in the match
a string continuation escape if STRING_CONTINUATION_ESCAPE_BODY participated in the match.

It follows from the definitions of LITERAL_COMPONENT and ESCAPE_BODY that each match of LITERAL_COMPONENT is exactly one of the above forms.

Non-escapes

The represented character of a non-escape is the single character consumed by the non-escape.

If the represented character of a non-escape has a Unicode scalar value less than 128, the represented byte of the non-escape is that Unicode scalar value. Other non-escapes have no represented byte.

Note: this means a non-escape has a represented byte exactly when the UTF-8 encoding of its represented character is a single byte.
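The rule above can be sketched in Rust as follows. This is a minimal illustration, not code taken from the compiler; the function name non_escape_represented_byte is hypothetical.

```rust
/// Returns the represented byte of a non-escape whose represented
/// character is `c`, or `None` if it has no represented byte.
/// (Hypothetical helper, named here for illustration only.)
fn non_escape_represented_byte(c: char) -> Option<u8> {
    let scalar = c as u32;
    if scalar < 128 {
        // The scalar value fits in seven bits, so the cast is lossless.
        Some(scalar as u8)
    } else {
        None
    }
}
```

Under this sketch, a represented byte exists exactly when c.len_utf8() is 1, which matches the note above.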

Simple escapes

A simple escape is a form like \n or \". Simple escapes are used to represent common control characters and characters that have special meaning in the tokenisation grammar.

The represented character of a simple escape is determined from the character consumed by the match of SIMPLE_ESCAPE_BODY that participated in the escape, according to the table below.

Simple escape body	Represented character
0	U+0000 `NUL`
t	U+0009 `HT`
n	U+000A `LF`
r	U+000D `CR`
"	U+0022 (QUOTATION MARK)
'	U+0027 (APOSTROPHE)
\	U+005C (REVERSE SOLIDUS)

The represented byte of a simple escape is the Unicode scalar value of its represented character.
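The table can be transcribed directly as a match expression. The sketch below assumes the input is the single character consumed by SIMPLE_ESCAPE_BODY; both function names are hypothetical.

```rust
/// Maps the character consumed by SIMPLE_ESCAPE_BODY to the escape's
/// represented character. Returns `None` for any other character.
/// (Hypothetical helper, named here for illustration only.)
fn simple_escape_represented_char(body: char) -> Option<char> {
    match body {
        '0' => Some('\u{0000}'),  // NUL
        't' => Some('\u{0009}'),  // HT
        'n' => Some('\u{000A}'),  // LF
        'r' => Some('\u{000D}'),  // CR
        '"' => Some('\u{0022}'),  // QUOTATION MARK
        '\'' => Some('\u{0027}'), // APOSTROPHE
        '\\' => Some('\u{005C}'), // REVERSE SOLIDUS
        _ => None,
    }
}

/// The represented byte is the scalar value of the represented character.
/// Every character in the table is below 128, so the cast is lossless.
fn simple_escape_represented_byte(body: char) -> Option<u8> {
    simple_escape_represented_char(body).map(|c| c as u8)
}
```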

Unicode escapes

A Unicode escape is a form like \u{211d} or \u{01_F9_80}. A Unicode escape can represent any single character.

The digits of a Unicode escape are the characters consumed by the sequence of participating matches of HEXADECIMAL_DIGIT in the escape.

The numeric value of a Unicode escape is the result of interpreting the escape's digits as a hexadecimal integer, as if by u32::from_str_radix with radix 16.

If a Unicode escape's numeric value is a Unicode scalar value, the represented character of the escape is the character with that Unicode scalar value. Otherwise the Unicode escape has no represented character.
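As an illustration, the numeric value and represented character could be computed as below. The sketch assumes its input is the text between the braces of an escape that has already matched UNICODE_ESCAPE_BODY, so it contains one to six hexadecimal digits and possibly underscores. The function name is hypothetical.

```rust
/// Returns the represented character of a Unicode escape, given the
/// text between its braces (for example "211d" or "01_F9_80").
/// Assumes the text already matched the grammar; this function does
/// not re-validate it. (Hypothetical helper, for illustration only.)
fn unicode_escape_represented_char(between_braces: &str) -> Option<char> {
    // The escape's digits are the hexadecimal digits, without underscores.
    let digits: String = between_braces.chars().filter(|c| *c != '_').collect();
    // At most six hexadecimal digits always fit in a u32.
    let numeric_value = u32::from_str_radix(&digits, 16).ok()?;
    // `char::from_u32` returns `None` for surrogates and for values
    // above 0x10FFFF, which are not Unicode scalar values.
    char::from_u32(numeric_value)
}
```

With this sketch, "D800" (a surrogate) and "110000" (out of range) yield None, which corresponds to an escape with no represented character.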

Hexadecimal escapes

A hexadecimal escape is a form like \xA0 or \x1b. In byte, byte-string, and C-string literals, a hexadecimal escape can represent any single byte. In character and string literals, a hexadecimal escape can represent any single ASCII character.

The digits of a hexadecimal escape are the characters consumed by the sequence of participating matches of HEXADECIMAL_DIGIT in the escape.

The represented byte of a hexadecimal escape is the result of interpreting the escape's digits as a hexadecimal integer, as if by u8::from_str_radix with radix 16.

If the represented byte of a hexadecimal escape is less than 128, the represented character of the escape is the character whose Unicode scalar value is that byte. Other hexadecimal escapes have no represented character.

Note: this means a hexadecimal escape has a represented character exactly when its represented byte is the UTF-8 encoding of a character.
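A small sketch of these two definitions follows. It assumes the input is exactly the two digits of an escape that has already matched HEXADECIMAL_ESCAPE_BODY; the function names are hypothetical.

```rust
/// Returns the represented byte of a hexadecimal escape, given its two
/// digits (for example "A0" or "1b"). Assumes the digits already matched
/// the grammar. (Hypothetical helper, for illustration only.)
fn hex_escape_represented_byte(digits: &str) -> Option<u8> {
    u8::from_str_radix(digits, 16).ok()
}

/// Returns the represented character, which exists only when the
/// represented byte is less than 128.
fn hex_escape_represented_char(digits: &str) -> Option<char> {
    let byte = hex_escape_represented_byte(digits)?;
    if byte < 128 {
        Some(char::from(byte))
    } else {
        None
    }
}
```

For the two examples in the text, \x1b has both a represented byte and a represented character, while \xA0 has only a represented byte.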

String continuation escapes

A string continuation escape is \ followed immediately by LF, optionally followed by some forms of additional whitespace (see STRING_CONTINUATION_ESCAPE_BODY). The escape is effectively removed from the literal content.

The Reference says the whitespace-removal behaviour may change in future; see String continuation escapes.
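To make the extent of the escape concrete, the sketch below measures how much text STRING_CONTINUATION_ESCAPE_BODY would consume, given the text that follows the backslash. The function name is hypothetical, and the sketch follows the grammar on this page rather than any particular compiler implementation.

```rust
/// Given the text immediately after a backslash, returns the number of
/// bytes consumed by STRING_CONTINUATION_ESCAPE_BODY, or `None` if the
/// text does not start with LF.
/// (Hypothetical helper, named here for illustration only.)
fn string_continuation_body_len(after_backslash: &str) -> Option<usize> {
    // The body must begin with a single LF.
    let after_lf = after_backslash.strip_prefix('\n')?;
    // It then greedily consumes TAB, LF, CR and space characters.
    let remainder =
        after_lf.trim_start_matches(|c| matches!(c, '\t' | '\n' | '\r' | ' '));
    // All consumed characters are ASCII, so bytes and characters coincide.
    Some(1 + (after_lf.len() - remainder.len()))
}
```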

Escape interpretations

Single-escape interpretation

If an attempt to match the LITERAL_COMPONENT nonterminal against a character sequence succeeds and consumes the entire sequence, and the match is not a string continuation escape, the single-escape interpretation of that character sequence is the resulting match.

Otherwise the character sequence has no single-escape interpretation.

This means a single-escape interpretation is one of the forms described under Classifying escapes above, other than a string continuation escape.

Escape interpretation

If an attempt to match the LITERAL_COMPONENTS nonterminal against a character sequence succeeds and consumes the entire sequence, the escape interpretation of that character sequence is the sequence of participating matches of LITERAL_COMPONENT in the resulting match, omitting any string continuation escapes.

Otherwise, the character sequence has no escape interpretation.

The individual matches in an escape interpretation are referred to as its components.

This means the escape interpretation is a sequence of components, each of which has one of the forms described under Classifying escapes above, and it doesn't include any string continuation escapes.
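The definitions in this section can be drawn together in a hand-written matcher. The following is a minimal sketch under the assumption that a straightforward left-to-right scan is equivalent to the grammar, which holds here because each ESCAPE_BODY alternative starts with a different character. The Component type and the escape_interpretation function are hypothetical names.

```rust
/// One component of an escape interpretation (hypothetical type).
#[derive(Debug, PartialEq)]
enum Component {
    NonEscape(char),
    Simple(char),        // the character consumed by SIMPLE_ESCAPE_BODY
    Unicode(String),     // the escape's digits, underscores removed
    Hexadecimal(String), // the escape's two digits
}

/// Returns the escape interpretation of `input`, or `None` if
/// LITERAL_COMPONENTS cannot consume the entire sequence.
fn escape_interpretation(input: &str) -> Option<Vec<Component>> {
    let chars: Vec<char> = input.chars().collect();
    let mut i = 0;
    let mut out = Vec::new();
    while i < chars.len() {
        if chars[i] != '\\' {
            out.push(Component::NonEscape(chars[i]));
            i += 1;
            continue;
        }
        i += 1; // consume BACKSLASH
        match chars.get(i).copied()? {
            c @ ('0' | 't' | 'n' | 'r' | '"' | '\'' | '\\') => {
                out.push(Component::Simple(c));
                i += 1;
            }
            'u' => {
                if chars.get(i + 1).copied() != Some('{') {
                    return None;
                }
                i += 2;
                let mut digits = String::new();
                // ( HEXADECIMAL_DIGIT ~ "_" * ){1,6}
                while digits.len() < 6 {
                    match chars.get(i).copied() {
                        Some(c) if c.is_ascii_hexdigit() => {
                            digits.push(c);
                            i += 1;
                        }
                        _ => break,
                    }
                    while chars.get(i).copied() == Some('_') {
                        i += 1;
                    }
                }
                if digits.is_empty() || chars.get(i).copied() != Some('}') {
                    return None;
                }
                i += 1;
                out.push(Component::Unicode(digits));
            }
            'x' => {
                let d1 = chars.get(i + 1).copied().filter(|c| c.is_ascii_hexdigit())?;
                let d2 = chars.get(i + 2).copied().filter(|c| c.is_ascii_hexdigit())?;
                out.push(Component::Hexadecimal(format!("{}{}", d1, d2)));
                i += 3;
            }
            '\n' => {
                // String continuation escape: consumed but omitted.
                i += 1;
                while matches!(chars.get(i).copied(), Some('\t' | '\n' | '\r' | ' ')) {
                    i += 1;
                }
            }
            _ => return None,
        }
    }
    Some(out)
}
```

In this sketch a string continuation escape is consumed without producing a component, so the returned sequence never contains one. An unrecognised escape such as \q, or an incomplete one such as \xA, yields None because the repetition in LITERAL_COMPONENTS would stop before the end of the input.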

Writeup of Rust's lexer