Grammar for raw string literals
Table of contents
- Corresponding nonterminal extension
- Pest's stack extension
- Mark/check extension
- Scheme of definitions
I believe the PEG formalism can't naturally describe Rust's rule for matching the number of # characters in raw string literals.
(The same limitations apply to matching the number of - characters in frontmatter fences.)
I can think of the following ways to handle this:
Corresponding nonterminal extension
This writeup uses an ad-hoc extension to the formalism, along similar lines to the stack extension described below (but without needing a full stack).
It's described as follows:
an attempt to match one of the parsing expressions marked as HASHES² fails unless the characters it consumes are the same as the characters consumed by the (only) match of the expression marked as HASHES¹ under the same match attempt of a token-kind nonterminal.
This extension isn't formalised in the appendix on PEGs.
It could be formalised in a similar way to the mark/check extension below, with the addition of some notion of a scoping nonterminal which uses an empty context for its sub-attempt.
Pest's stack extension
Pest provides a stack extension which is a good fit for this requirement, and is used in the reimplementation.
It looks like this:
RAW_DOUBLE_QUOTED_FORM = {
PUSH(HASHES) ~
"\"" ~ RAW_DOUBLE_QUOTED_CONTENT ~ "\"" ~
POP ~
SUFFIX ?
}
RAW_DOUBLE_QUOTED_CONTENT = {
( !("\"" ~ PEEK) ~ ANY ) *
}
HASHES = { "#" {0, 255} }
The notion of attempting to match a parsing expression is extended to include a context stack (a stack of character sequences): each match attempt takes the stack as an additional input and produces an updated stack as an additional part of the outcome.
The stack is initially empty.
There are three additional forms of parsing expression:
PUSH(e), PEEK, and POP,
where e is an arbitrary parsing expression.
PUSH(e) behaves in the same way as the parsing expression e.
If it succeeds, it additionally pushes the text consumed by e onto the stack.
An attempt to match PEEK against a character sequence s succeeds if and only if the top entry of the stack is a prefix of s.
If the stack is empty, PEEK fails.
POP behaves in the same way as PEEK.
Additionally, if it succeeds it then pops the top entry from the stack.
All other parsing expressions leave the stack unmodified.
Mark/check extension
This extension uses the same notation as the corresponding nonterminal extension. It might be described along the following lines:
A formalisation of this extension in the style used in the appendix on PEGs is sketched below.
Treat ¹ and ² as operators, defining a mark expression and a check expression respectively.
Extend the characterisation of a match attempt to include a context, which is a sequence of matches (this formalises a notion of the matches preceding the attempt).
Alter the description of most kinds of expression to consider a context and use the same context for each sub-attempt, for example:
Alter the description of sequencing expressions to use an updated context when attempting the right-hand side:
e₁ ~ e₂ against s in context c is as follows:
- If an attempt A₁ to match the expression e₁ against s in context c fails, A fails.
- Otherwise, A succeeds if and only if an attempt A₂ to match e₂ against s′ in context c′ succeeds, where s′ is the sequence of characters obtained by removing the prefix consumed by A₁ from s, and c′ is c followed by the elaboration of A₁.
Include mark expressions in the elaboration:
e¹ against s in context c succeeds
if and only if an attempt A′ to match e against s in context c succeeds.
If A is successful, it consumes the characters consumed by A′ and its elaboration is A followed by the elaboration of A′.
Describe a check expression as failing unless the characters its subexpression consumes are the same as the characters consumed by the last mark expression in its context:
e² against s in context c succeeds if
- an attempt A′ to match e against s in context c succeeds; and
- c includes at least one mark expression; and
- the characters consumed by A′ are the same as the characters consumed by the last mark expression in c.
Otherwise A fails.
Scheme of definitions
Because raw string literals have a limit of 255 # characters,
it is in principle possible to model them using a PEG with 256 (pairs of) definitions.
So writing this out as a "scheme" of definitions might be thinkable:
RDQ_0 = {
DOUBLEQUOTE ~ RDQ_0_CONTENT ~ DOUBLEQUOTE ~
}
RDQ_0_CONTENT = {
( !(DOUBLEQUOTE) ~ ANY ) *
}
RDQ_1 = {
"#"{1} ~
DOUBLEQUOTE ~ RDQ_1_CONTENT ~ DOUBLEQUOTE ~
"#"{1} ~
}
RDQ_1_CONTENT = {
( !(DOUBLEQUOTE ~ "#"{1}) ~ ANY ) *
}
RDQ_2 = {
"#"{2} ~
DOUBLEQUOTE ~ RDQ_2_CONTENT ~ DOUBLEQUOTE ~
"#"{2} ~
}
RDQ_2_CONTENT = {
( !(DOUBLEQUOTE ~ "#"{2}) ~ ANY ) *
}
…
RDQ_255 = {
"#"{255} ~
DOUBLEQUOTE ~ RDQ_255_CONTENT ~ DOUBLEQUOTE ~
"#"{255} ~
}
RDQ_255_CONTENT = {
( !(DOUBLEQUOTE ~ "#"{255}) ~ ANY ) *
}