Grammars used in this writeup

This document relies on three parsing expression grammars: one for tokenising, one for processing escapes, and one for recognising frontmatter.

This page summarises how these grammars work. See the Parsing Expression Grammars appendix for a more formal treatment.

See Frontmatter grammar, Complete tokenisation grammar, and Escape-processing grammar for the grammars themselves.

There is no standardised notation for parsing expression grammars. This writeup is based on the variant used by the Pest Rust library, so that it's easy to keep in sync with the reimplementation.

See Grammar for raw string literals for a discussion of extensions used to model raw string literals and frontmatter fences. Those extensions are not described on this page.

Grammars
Matching
Participating in a match
Parsing expressions
Common idioms

Grammars

Here is an example of a grammar:

DIGITS = { '0'..'9' + }
NUMBER = { DIGITS ~ "." ~ DIGITS }
VARIABLE = { ( 'a'..'z' | "_" ) + }
VALUE = { NUMBER | VARIABLE }

The name on the left hand side of each definition is a nonterminal.

The right hand side of each definition contains a parsing expression (between { and }) which describes what that nonterminal matches.

Matching

Given a grammar, we can attempt to match a nonterminal against an input sequence of characters.

If the start of the input matches what the nonterminal's parsing expression requires, we say the attempt succeeds and consumes the matched characters. Otherwise we say the attempt fails.

We describe the result of a successful attempt as a match.

The table below summarises the forms a parsing expression can take, and describes what each form matches.

The process of matching a nonterminal often involves matching further nonterminals against a part of the input. We say that those further nonterminals participated in the match, and their matches are participating matches.

Examples

If VALUE in the example grammar is matched against abc, VARIABLE participates in the match, and the participating match consumes the characters abc.

If NUMBER in the example grammar is matched against 123.456, there are two participating matches of the DIGITS nonterminal, the first consuming 123 and the second consuming 456.

Parsing expressions

The following forms of parsing expression are available:

Form	Matching
eg `"abc"`	Match the exact string provided
eg `'a'..'f'`	Match one character from the provided (inclusive) range
A nonterminal	Match the nonterminal's parsing expression
`e₁ ~ e₂`	First match `e₁`, then match `e₂`
`e₁ \| e₂`	Match either `e₁` or `e₂`, with `e₁` having higher priority
`e ?`	Match `e` if possible
`e *`	Match as many repetitions of `e` as possible (possibly zero)
`e +`	Match as many repetitions of `e` as possible (at least one)
`e {m, n}`	Match between `m` and `n` (inclusive) repetitions of `e`
`! e`	Fail if `e` would match at this point
`( e )`	Match `e`, overriding the usual precedence

Here, e, e₁, and e₂ can be any parsing expression, and m and n can be any positive whole number.

Special terminals

In addition, the following named terminals are available in all grammars in this document:

Terminal	Matches
`ANY`	Any single Unicode character
`DOUBLEQUOTE`	A " character
`BACKSLASH`	A \ character
`CR`	A `CR` character
`LF`	An `LF` character
`TAB`	An `HT` character
`PATTERN_WHITE_SPACE`	A character with the Unicode `Pattern_White_Space` property
`XID_START`	A character with the Unicode `XID_Start` property
`XID_CONTINUE`	A character with the Unicode `XID_Continue` property
`EOI`	The end of input

EOI only matches when the remaining input is empty.

The Pattern_White_Space Unicode character property is defined in PropList.txt in the Unicode character database. The XID_Start and XID_Continue Unicode character properties are defined in DerivedCoreProperties.txt in the Unicode character database.

Note: The characters with the PATTERN_WHITE_SPACE Unicode character property are:

U+0009 CHARACTER TABULATION (HT)

U+000A LINE FEED (LF)

U+000B LINE TABULATION (VT)

U+000C FORM FEED (FF)

U+000D CARRIAGE RETURN (CR)

U+0020 SPACE

U+0085 NEXT LINE (NEL)

U+200E LEFT-TO-RIGHT MARK

U+200F RIGHT-TO-LEFT MARK

U+2028 LINE SEPARATOR

U+2029 PARAGRAPH SEPARATOR

This set doesn't change in updated Unicode versions.


U+0009	CHARACTER TABULATION (HT)
U+000A	LINE FEED (LF)
U+000B	LINE TABULATION (VT)
U+000C	FORM FEED (FF)
U+000D	CARRIAGE RETURN (CR)
U+0020	SPACE
U+0085	NEXT LINE (NEL)
U+200E	LEFT-TO-RIGHT MARK
U+200F	RIGHT-TO-LEFT MARK
U+2028	LINE SEPARATOR
U+2029	PARAGRAPH SEPARATOR

Prioritised choice

The prioritised choice operator | is the distinctive feature of parsing expression grammars.

An attempt to match ONE | TWO first attempts to match ONE, and if that succeeds it never considers TWO. If the match of ONE fails it attempts to match TWO instead.

Example

Matching "aa" | "aaa" against aaa consumes the characters aa, not aaa.

Repetition and backtracking

The repetition operators * and + always match as many repetitions as possible. If they are used as part of a larger match attempt which later fails, the matching process does not backtrack to see if the whole match can succeed if the repetition expression consumes fewer repetitions.

For example, matching "a"* ~ "ab" against aaab fails.

Similarly {m, n} and ? match as much as they can when they are first attempted, and there is no backtracking.

Examples

Matching "ab"? ~ "abc" against abc fails.

Matching "ab" ~ "abc" | "abc" against abc succeeds.

Example

With the following grammar
LETTER = { 'a'..'z' }
LETTER_OR_DOT = { 'a'..'z' | "." }
ENDS_WITH_LETTER = { LETTER_OR_DOT * ~ LETTER }
matching ENDS_WITH_LETTER against abcde fails.

With this grammar it succeeds:
LETTER = { 'a'..'z' }
LETTER_OR_DOT = { 'a'..'z' | "." }
ENDS_WITH_LETTER = { LETTER_OR_DOT ~ ENDS_WITH_LETTER | LETTER }

Precedence

The prioritised choice operator | has the lowest precedence, so for example

ONE ~ TWO | THREE ~ FOUR

is equivalent to

( ONE ~ TWO ) | ( THREE ~ FOUR )

The sequencing operator ~ has the next-lowest precedence, so for example

!"." ~ SOMETHING

is equivalent to

(!".") ~ SOMETHING

Both the | and ~ operators can be used repeatedly without parentheses, so for example

ONE | TWO | THREE

means "match ONE, TWO, or THREE, in descending order of priority", and

ONE ~ TWO ~ THREE

means "first match ONE, then match TWO, then match THREE".

Common idioms

"Any character except" is written using the negative lookup operator and ANY.

Example
( !"'" ~ ANY )
matches any single character except '.

Writeup of Rust's lexer