Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Grammars used in this writeup

This document relies on two parsing expression grammars: one for tokenising and one for recognising frontmatter.

This page summarises how these grammars work. See the Parsing Expression Grammars appendix for a more formal treatment.

See Frontmatter grammar and Complete tokenisation grammar for the grammars themselves.

There is no standardised notation for parsing expression grammars. This writeup is based on the variant used by the Pest Rust library, so that it's easy to keep in sync with the reimplementation.

See Grammar for raw string literals for a discussion of extensions used to model raw string literals and frontmatter fences. Those extensions are not described on this page.

Table of contents

Grammars

Here is an example of a grammar:

DIGITS = { '0'..'9' + }
NUMBER = { DIGITS ~ "." ~ DIGITS }
VARIABLE = { ( 'a'..'z' | "_" ) + }
VALUE = { NUMBER | VARIABLE }

The name on the left hand side of each definition is a nonterminal.

The right hand side of each definition contains a parsing expression (between { and }) which describes what that nonterminal matches.

Matching

Given a grammar, we can attempt to match a nonterminal against an input sequence of characters.

If the start of the input matches what the nonterminal's parsing expression requires, we say the attempt succeeds and consumes the matched characters. Otherwise we say the attempt fails.

We describe the result of a successful attempt as a match.

The table below summarises the forms a parsing expression can take, and describes what each form matches.

Participating in a match

The process of matching a nonterminal often involves matching further nonterminals against a part of the input. We say that those further nonterminals participated in the match, and their matches are participating matches.

Examples

If VALUE in the example grammar is matched against abc, VARIABLE participates in the match, and the participating match consumes the characters abc.

If NUMBER in the example grammar is matched against 123.456, there are two participating matches of the DIGITS nonterminal, the first consuming 123 and the second consuming 456.

Parsing expressions

The following forms of parsing expression are available:

FormMatching
eg "abc"Match the exact string provided
eg 'a'..'f'Match one character from the provided (inclusive) range
A nonterminalMatch the nonterminal's parsing expression
e₁ ~ e₂First match e₁, then match e₂
e₁ | e₂Match either e₁ or e₂, with e₁ having higher priority
e ?Match e if possible
e *Match as many repetitions of e as possible (possibly zero)
e +Match as many repetitions of e as possible (at least one)
! eFail if e would match at this point
( e )Match e, overriding the usual precedence

Here, e, e₁, and e₂ can be any parsing expression.

Special terminals

In addition, the following named terminals are available in all grammars in this document:

TerminalMatches
ANYAny single Unicode character
DOUBLEQUOTEA " character
BACKSLASHA \ character
LFAn LF character
TABAn HT character
PATTERN_WHITE_SPACEA character with the Unicode Pattern_White_Space property
XID_STARTA character with the Unicode XID_Start property
XID_CONTINUEA character with the Unicode XID_Continue property
EOIThe end of input

EOI only matches when the remaining input is empty.

The Pattern_White_Space Unicode character property is defined in PropList.txt in the Unicode character database. The XID_Start and XID_Continue Unicode character properties are defined in DerivedCoreProperties.txt in the Unicode character database.

Note: The characters with the PATTERN_WHITE_SPACE Unicode character property are:

U+0009CHARACTER TABULATION (HT)
U+000ALINE FEED (LF)
U+000BLINE TABULATION (VT)
U+000CFORM FEED (FF)
U+000DCARRIAGE RETURN (CR)
U+0020SPACE
U+0085NEXT LINE (NEL)
U+200ELEFT-TO-RIGHT MARK
U+200FRIGHT-TO-LEFT MARK
U+2028LINE SEPARATOR
U+2029PARAGRAPH SEPARATOR

This set doesn't change in updated Unicode versions.

Prioritised choice

The prioritised choice operator | is the distinctive feature of parsing expression grammars.

An attempt to match ONE | TWO first attempts to match ONE, and if that succeeds it never considers TWO. If the match of ONE fails it attempts to match TWO instead.

Example

Matching "aa" | "aaa" against aaa consumes the characters aa, not aaa.

Repetition and backtracking

The repetition operators * and + always match as many repetitions as possible. If they are used as part of a larger match attempt which later fails, the matching process does not backtrack to see if the whole match can succeed if the repetition expression consumes fewer repetitions.

For example, matching "a"* ~ "ab" against aaab fails.

Similarly ? matches as much as it can when it is first attempted, and there is no backtracking.

Examples

Matching "ab"? ~ "abc" against abc fails.

Matching "ab" ~ "abc" | "abc" against abc succeeds.

Example

With the following grammar

LETTER = { 'a'..'z' }
LETTER_OR_DOT = { 'a'..'z' | "." }
ENDS_WITH_LETTER = { LETTER_OR_DOT * ~ LETTER }

matching ENDS_WITH_LETTER against abcde fails.

With this grammar it succeeds:

LETTER = { 'a'..'z' }
LETTER_OR_DOT = { 'a'..'z' | "." }
ENDS_WITH_LETTER = { LETTER_OR_DOT ~ ENDS_WITH_LETTER | LETTER }

Precedence

The prioritised choice operator | has the lowest precedence, so for example

ONE ~ TWO | THREE ~ FOUR

is equivalent to

( ONE ~ TWO ) | ( THREE ~ FOUR )

The sequencing operator ~ has the next-lowest precedence, so for example

!"." ~ SOMETHING

is equivalent to

(!".") ~ SOMETHING

Both the | and ~ operators can be used repeatedly without parentheses, so for example

ONE | TWO | THREE

means "match ONE, TWO, or THREE, in descending order of priority", and

ONE ~ TWO ~ THREE

means "first match ONE, then match TWO, then match THREE".

Common idioms

"Any character except" is written using the negative lookup operator and ANY.

Example

( !"'" ~ ANY )

matches any single character except '.