Grammars used in this writeup
This document relies on two parsing expression grammars: one for tokenising and one for recognising frontmatter.
This page summarises how these grammars work. See the Parsing Expression Grammars appendix for a more formal treatment.
See Frontmatter grammar and Complete tokenisation grammar for the grammars themselves.
There is no standardised notation for parsing expression grammars. This writeup is based on the variant used by the Pest Rust library, so that it's easy to keep in sync with the reimplementation.
See Grammar for raw string literals for a discussion of extensions used to model raw string literals and frontmatter fences. Those extensions are not described on this page.
Table of contents
Grammars
Here is an example of a grammar:
DIGITS = { '0'..'9' + }
NUMBER = { DIGITS ~ "." ~ DIGITS }
VARIABLE = { ( 'a'..'z' | "_" ) + }
VALUE = { NUMBER | VARIABLE }
The name on the left hand side of each definition is a nonterminal.
The right hand side of each definition contains a parsing expression (between { and })
which describes what that nonterminal matches.
Matching
Given a grammar, we can attempt to match a nonterminal against an input sequence of characters.
If the start of the input matches what the nonterminal's parsing expression requires, we say the attempt succeeds and consumes the matched characters. Otherwise we say the attempt fails.
We describe the result of a successful attempt as a match.
The table below summarises the forms a parsing expression can take, and describes what each form matches.
Participating in a match
The process of matching a nonterminal often involves matching further nonterminals against a part of the input. We say that those further nonterminals participated in the match, and their matches are participating matches.
Examples
If
VALUEin the example grammar is matched against abc,VARIABLEparticipates in the match, and the participating match consumes the characters abc.If
NUMBERin the example grammar is matched against 123.456, there are two participating matches of theDIGITSnonterminal, the first consuming 123 and the second consuming 456.
Parsing expressions
The following forms of parsing expression are available:
| Form | Matching |
|---|---|
eg "abc" | Match the exact string provided |
eg 'a'..'f' | Match one character from the provided (inclusive) range |
| A nonterminal | Match the nonterminal's parsing expression |
e₁ ~ e₂ | First match e₁, then match e₂ |
e₁ | e₂ | Match either e₁ or e₂, with e₁ having higher priority |
e ? | Match e if possible |
e * | Match as many repetitions of e as possible (possibly zero) |
e + | Match as many repetitions of e as possible (at least one) |
! e | Fail if e would match at this point |
( e ) | Match e, overriding the usual precedence |
Here, e, e₁, and e₂ can be any parsing expression.
Special terminals
In addition, the following named terminals are available in all grammars in this document:
| Terminal | Matches |
|---|---|
ANY | Any single Unicode character |
DOUBLEQUOTE | A " character |
BACKSLASH | A \ character |
LF | An LF character |
TAB | An HT character |
PATTERN_WHITE_SPACE | A character with the Unicode Pattern_White_Space property |
XID_START | A character with the Unicode XID_Start property |
XID_CONTINUE | A character with the Unicode XID_Continue property |
EOI | The end of input |
EOI only matches when the remaining input is empty.
The Pattern_White_Space Unicode character property is defined in PropList.txt in the Unicode character database.
The XID_Start and XID_Continue Unicode character properties are defined in DerivedCoreProperties.txt in the Unicode character database.
Note: The characters with the
PATTERN_WHITE_SPACEUnicode character property are:
U+0009 CHARACTER TABULATION (HT) U+000A LINE FEED (LF) U+000B LINE TABULATION (VT) U+000C FORM FEED (FF) U+000D CARRIAGE RETURN (CR) U+0020 SPACE U+0085 NEXT LINE (NEL) U+200E LEFT-TO-RIGHT MARK U+200F RIGHT-TO-LEFT MARK U+2028 LINE SEPARATOR U+2029 PARAGRAPH SEPARATOR This set doesn't change in updated Unicode versions.
Prioritised choice
The prioritised choice operator | is the distinctive feature of parsing expression grammars.
An attempt to match ONE | TWO first attempts to match ONE,
and if that succeeds it never considers TWO.
If the match of ONE fails it attempts to match TWO instead.
Example
Matching
"aa" | "aaa"against aaa consumes the characters aa, not aaa.
Repetition and backtracking
The repetition operators * and + always match as many repetitions as possible.
If they are used as part of a larger match attempt which later fails,
the matching process does not backtrack to see if the whole match can succeed if the repetition expression consumes fewer repetitions.
For example, matching "a"* ~ "ab" against aaab fails.
Similarly ? matches as much as it can when it is first attempted,
and there is no backtracking.
Examples
Matching
"ab"? ~ "abc"against abc fails.Matching
"ab" ~ "abc" | "abc"against abc succeeds.
Example
With the following grammar
LETTER = { 'a'..'z' } LETTER_OR_DOT = { 'a'..'z' | "." } ENDS_WITH_LETTER = { LETTER_OR_DOT * ~ LETTER }matching
ENDS_WITH_LETTERagainst abcde fails.With this grammar it succeeds:
LETTER = { 'a'..'z' } LETTER_OR_DOT = { 'a'..'z' | "." } ENDS_WITH_LETTER = { LETTER_OR_DOT ~ ENDS_WITH_LETTER | LETTER }
Precedence
The prioritised choice operator | has the lowest precedence, so for example
ONE ~ TWO | THREE ~ FOUR
is equivalent to
( ONE ~ TWO ) | ( THREE ~ FOUR )
The sequencing operator ~ has the next-lowest precedence, so for example
!"." ~ SOMETHING
is equivalent to
(!".") ~ SOMETHING
Both the | and ~ operators can be used repeatedly without parentheses, so for example
ONE | TWO | THREE
means "match ONE, TWO, or THREE, in descending order of priority", and
ONE ~ TWO ~ THREE
means "first match ONE, then match TWO, then match THREE".
Common idioms
"Any character except" is written using the negative lookup operator and ANY.
Example
( !"'" ~ ANY )matches any single character except '.