Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Grammar for raw string literals

Table of contents

I believe the PEG formalism can't naturally describe Rust's rule for matching the number of # characters in raw string literals.

(The same limitations apply to matching the number of - characters in frontmatter fences.)

I can think of the following ways to handle this:

Corresponding nonterminal extension

This writeup uses an ad-hoc extension to the formalism, along similar lines to the stack extension described below (but without needing a full stack).

It's described as follows:

an attempt to match one of the parsing expressions marked as HASHES² fails unless the characters it consumes are the same as the characters consumed by the (only) match of the expression marked as HASHES¹ under the same match attempt of a token-kind nonterminal.

This extension isn't formalised in the appendix on PEGs.

It could be formalised in a similar way to the mark/check extension below, with the addition of some notion of a scoping nonterminal which uses an empty context for its sub-attempt.

Pest's stack extension

Pest provides a stack extension which is a good fit for this requirement, and is used in the reimplementation.

It looks like this:

RAW_DOUBLE_QUOTED_FORM = {
    PUSH(HASHES) ~
    "\"" ~ RAW_DOUBLE_QUOTED_CONTENT ~ "\"" ~
    POP ~
    SUFFIX ?
}
RAW_DOUBLE_QUOTED_CONTENT = {
    ( !("\"" ~ PEEK) ~ ANY ) *
}
HASHES = { "#" {0, 255} }

The notion of attempting to match a parsing expression is extended to include a context stack (a stack of character sequences): each match attempt takes the stack as an additional input and produces an updated stack as an additional part of the outcome.

The stack is initially empty.

There are three additional forms of parsing expression: PUSH(e), PEEK, and POP, where e is an arbitrary parsing expression.

PUSH(e) behaves in the same way as the parsing expression e. If it succeeds, it additionally pushes the text consumed by e onto the stack.

An attempt to match PEEK against a character sequence s succeeds if and only if the top entry of the stack is a prefix of s. If the stack is empty, PEEK fails.

POP behaves in the same way as PEEK. Additionally, if it succeeds it then pops the top entry from the stack.

All other parsing expressions leave the stack unmodified.

Mark/check extension

This extension uses the same notation as the corresponding nonterminal extension. It might be described along the following lines:

An attempt to match a parsing expression marked with ² fails unless the characters it consumes are the same as the characters consumed by the previous match of an expression marked as ¹.

A formalisation of this extension in the style used in the appendix on PEGs is sketched below.

Treat ¹ and ² as operators, defining a mark expression and a check expression respectively.

Extend the characterisation of a match attempt to include a context, which is a sequence of matches (this formalises a notion of the matches preceding the attempt).

Alter the description of most kinds of expression to consider a context and use the same context for each sub-attempt, for example:

An attempt A to match a nonterminal against s in context c succeeds if and only if an attempt A′ to match the nonterminal's expression against s in context c succeeds.

Alter the description of sequencing expressions to use an updated context when attempting the right-hand side:

The outcome of an attempt A to match a sequencing expression e₁ ~ e₂ against s in context c is as follows:
  • If an attempt A₁ to match the expression e₁ against s in context c fails, A fails.
  • Otherwise, A succeeds if and only if an attempt A₂ to match e₂ against s′ in context c′ succeeds, where s′ is the sequence of characters obtained by removing the prefix consumed by A₁ from s, and c′ is c followed by the elaboration of A₁.

Include mark expressions in the elaboration:

An attempt A to match a mark expression against s in context c succeeds if and only if an attempt A′ to match e against s in context c succeeds.

If A is successful, it consumes the characters consumed by A′ and its elaboration is A followed by the elaboration of A′.

Describe a check expression as failing unless the characters its subexpression consumes are the same as the characters consumed by the last mark expression in its context:

An attempt A to match a check expression against s in context c succeeds if
  • an attempt A′ to match e against s in context c succeeds; and
  • c includes at least one mark expression; and
  • the characters consumed by A′ are the same as the characters consumed by the last mark expression in c.

Otherwise A fails.

Scheme of definitions

Because raw string literals have a limit of 255 # characters, it is in principle possible to model them using a PEG with 256 (pairs of) definitions.

So writing this out as a "scheme" of definitions might be thinkable:

RDQ_0 = {
    DOUBLEQUOTE ~ RDQ_0_CONTENT ~ DOUBLEQUOTE ~
}
RDQ_0_CONTENT = {
    ( !(DOUBLEQUOTE) ~ ANY ) *
}

RDQ_1 = {
    "#"{1} ~
    DOUBLEQUOTE ~ RDQ_1_CONTENT ~ DOUBLEQUOTE ~
    "#"{1} ~
}
RDQ_1_CONTENT = {
    ( !(DOUBLEQUOTE ~ "#"{1}) ~ ANY ) *
}

RDQ_2 = {
    "#"{2} ~
    DOUBLEQUOTE ~ RDQ_2_CONTENT ~ DOUBLEQUOTE ~
    "#"{2} ~
}
RDQ_2_CONTENT = {
    ( !(DOUBLEQUOTE ~ "#"{2}) ~ ANY ) *
}

…

RDQ_255 = {
    "#"{255} ~
    DOUBLEQUOTE ~ RDQ_255_CONTENT ~ DOUBLEQUOTE ~
    "#"{255} ~
}
RDQ_255_CONTENT = {
    ( !(DOUBLEQUOTE ~ "#"{255}) ~ ANY ) *
}