Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Parsing Expression Grammars

Parsing Expression Grammars were introduced in Ford 2004.

The notation used in this document is based on the variant used by the Pest Rust library.

This page describes a subset of the formalism that is sufficient for the grammars used in this writeup.

See Grammars above for a less formal treatment.

Table of contents

Nonterminal definitions

A Parsing Expression Grammar is made up of a sequence of nonterminal definitions of the form

NAME = { … }

The order of the nonterminal definitions is not significant.

The name on the left hand side of the = in a nonterminal definition is a nonterminal.

The part of the nonterminal definition that appears between { and } is that nonterminal's expression. It is a Parsing expression as defined below.

No nonterminal appears more than once on the left hand side of a definition.

Parsing expressions

Parsing expressions have the following forms, where e, e₁, and e₂ represent arbitrary parsing expressions.

Terminals
eg "abc"Character-sequence terminal
eg 'a'..'f'Character-range terminal
ANYAny character
DOUBLEQUOTE"
BACKSLASH\
LFLine feed
TABTab
PATTERN_WHITE_SPACE
XID_START
XID_CONTINUE
EOIEnd of input
EMPTYEmpty match
Nonterminals
A defined nonterminal
Compound expressions
e₁ ~ e₂Sequencing expression
e₁ | e₂Prioritised choice expression
e ?Option suffix expression
e *Zero-or-more repetitions expression
e +One-or-more repetitions expression
! eNegative lookahead expression
Grouping
( e )

The symbols ~, |, ?, *, +, and ! are called parsing operators.

Each nonterminal which appears in a parsing expression has a definition in the grammar.

The EMPTY terminal doesn't appear in any of grammars in this writeup; it's used below in the descriptions of matching option and repetition expressions.

Grouping, precedence, and association

The definition of matching below assumes that each parsing expression has a known interpretation as a tree of compound expressions, nonterminals, and terminals.

This section describes how to resolve ambiguities in the written form of a parsing expression to produce such a tree.

A subexpression in parentheses ( and ) is treated as a separate unit.

The prioritised choice parsing operator | has the lowest precedence, so for example e₁ ~ e₂ | e₃ ~ e₄ is interpreted as ( e₁ ~ e₂ ) | ( e₃ ~ e₄ ).

The sequencing parsing operator ~ has the next-lowest precedence, so for example !e₁ ~ e₂ is interpreted as (!e₁) ~ e₂, and e₁ ~ e₂? is interpreted as e₁ ~ (e₂?).

The grammars used in this writeup do not rely on a defined precedence between the unary parsing operators.

The binary parsing operators ~ and | are left-associative:

  • e₁ ~ e₂ ~ e₃ is interpreted as (e₁ ~ e₂) ~ e₃
  • e₁ | e₂ | e₃ is interpreted as (e₁ | e₂) | e₃

The associativity doesn't matter in practice: (e₁ ~ e₂) ~ e₃ and e₁ ~ (e₂ ~ e₃) have identical outcomes, and (e₁ | e₂) | e₃ and e₁ | (e₂ | e₃) have identical outcomes.

Matching

A match attempt is characterised by a grammar, a parsing expression e, and a character sequence s. In this document the grammar is always implicit.

A match attempt is identified using the form "a match attempt of e against s" or "an attempt to match e against s". On this page, a match attempt may be referred to simply as an attempt.

The descriptions of terminals, nonterminals, and compound expressions below, taken together, define the outcome of any match attempt.

It is possible to write a grammar under which the definition of outcome below is not well-founded, because of direct or indirect left recursion in the definitions of nonterminals. The grammars used in this writeup do not have this complication, so we may assume all match attempts have a well-defined outcome.

The outcome of a match attempt against s is one of:

  • success, together with
    • a prefix of s which was consumed by the attempt
    • a sequence of matches (the attempt's elaboration)
  • failure.

We say a match attempt succeeds or is successful if its outcome is success, and fails if its outcome is failure.

A successful match attempt can be referred to as a match.

Note that any nonterminal is a parsing expression on its own, so it is meaningful to talk about an attempt to match a nonterminal against a character sequence.

For the purposes of this section, a prefix of a sequence is the first n characters of the sequence, for some n. The prefix may be empty, or the entire sequence.

In the descriptions below, s represents a character sequence.

Terminals

An attempt to match a character-sequence terminal "c₁…cₙ" against s succeeds if and only if the character sequence c₁…cₙ is a prefix of s, and (if it succeeds) consumes c₁…cₙ. Here, c₁…cₙ represents an arbitrary sequence of characters other than " (in practice, they are printable ASCII characters).

An attempt to match a character-range terminal 'c₁'..'c₂' against s succeeds if and only s begins with a character whose Unicode scalar value is between the Unicode scalar value of c₁ and the Unicode scalar value of c₂ (inclusive), and (if it succeeds) consumes that character. Here, c₁ and c₂ represent arbitrary single characters other than ' (in practice, ASCII digits or letters).

An attempt to match ANY against s succeeds if and only if s is not empty, and (if it succeeds) consumes the first character in s.

An attempt to match DOUBLEQUOTE against s succeeds if and only if s begins with the character ", and (if it succeeds) consumes that character.

An attempt to match BACKSLASH against s succeeds if and only if s begins with the character \, and (if it succeeds) consumes that character.

An attempt to match LF against s succeeds if and only if s begins with the character LF, and (if it succeeds) consumes that character.

An attempt to match TAB against s succeeds if and only if s begins with the character HT, and (if it succeeds) consumes that character.

An attempt to match PATTERN_WHITE_SPACE succeeds if and only if s begins with a character which has the Pattern_White_Space Unicode character property, as defined in PropList.txt in the Unicode character database, and (if it succeeds) consumes that character.

An attempt to match XID_START succeeds if and only if s begins with a character which has the XID_Start Unicode character property, as defined in DerivedCoreProperties.txt in the Unicode character database, and (if it succeeds) consumes that character.

An attempt to match XID_CONTINUE succeeds if and only if s begins with a character which has the XID_Continue Unicode character property, as defined in DerivedCoreProperties.txt in the Unicode character database, and (if it succeeds) consumes that character.

An attempt to match EOI against s succeeds if and only if s is empty, and (if it succeeds) consumes an empty character sequence.

An attempt to match EMPTY always succeeds, and (if it succeeds) consumes an empty character sequence.

All matches of terminals have an empty elaboration.

Nonterminals

An attempt A to match a nonterminal against s succeeds if and only if an attempt A′ to match the nonterminal's expression against s succeeds.

If A is successful, it consumes the characters consumed by A′ and its elaboration is A followed by the elaboration of A′.

Compound expressions

In the descriptions below, a statement that an expression e₁ reduces to an expression e₂ means that the outcome of an attempt to match e₁ against s is the outcome of an attempt to match e₂ against s.

Sequencing expressions (~)

The outcome of an attempt A to match a sequencing expression e₁ ~ e₂ against s is as follows:

  • If an attempt A₁ to match the expression e₁ against s fails, A fails.
  • Otherwise, A succeeds if and only if an attempt A₂ to match e₂ against s′ succeeds, where s′ is the sequence of characters obtained by removing the prefix consumed by A₁ from s.

If A succeeds:

  • It consumes the characters consumed by A₁ followed by the characters consumed by A₂.
  • Its elaboration is the elaboration of A₁ followed by the elaboration of A₂.
Prioritised choice expressions (|)

The outcome of an attempt A to match a prioritised choice expression e₁ | e₂ against s is as follows:

  • If an attempt A₁ to match e₁ against s succeeds, the outcome of A is the outcome of A₁.
  • Otherwise, the outcome of A is the outcome of an attempt to match e₂ against s.
Option expressions (?)

The option expression e? reduces to e | EMPTY.

Repetition expressions (* and +)

A zero-or-more repetitions expression e* reduces to ( e ~ e* ) | EMPTY.

A one-or-more repetitions expression e+ reduces to e ~ e*.

Negative lookahead expressions (!)

An attempt to match a negative lookahead expression !e against s succeeds if and only if an attempt to match e against s fails.

If the attempt succeeds, it consumes no characters and has an empty elaboration.

Participating in a match

A match is a participating match of a nonterminal N in a match A if it is a match of N which appears in the elaboration of A.

A nonterminal N participates in a match A if there is a participating match of N in A.

If a nonterminal N participates in a match A, the first participating match of N in A is the first match of N in the elaboration of A.

If 𝑵 is a class of nonterminals, the sequence of participating matches of 𝑵 in a match A is the sequence obtained by restricting the elaboration of A to matches of members of 𝑵.