Tokenising

This phase of processing takes a character sequence (the input), and either:

  • produces a sequence of fine-grained tokens, or
  • determines that lexical analysis should fail.

The analysis depends on the Rust edition which is in effect when the input is processed.

So strictly speaking, the edition is a second parameter to the process described here.

Tokenisation is described using two operations:

  • pretokenising, which extracts a pretoken from the start of the input
  • reprocessing, which converts that pretoken to a fine-grained token

Either operation can cause lexical analysis to fail.
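
As a rough sketch (not part of the specification), the two operations might be given the following Rust signatures. The names `Edition`, `Pretoken`, `FineGrainedToken`, and `LexError` are hypothetical; note how the edition appears as a second parameter to each operation.

```rust
// A minimal sketch of the shapes involved, not part of the specification.
// The names `Edition`, `Pretoken`, `FineGrainedToken`, and `LexError` are
// hypothetical; the contents of the token types are elided.

#[derive(Clone, Copy)]
enum Edition {
    E2015,
    E2021,
    E2024,
}

struct Pretoken {/* kind and extent elided */}
struct FineGrainedToken {/* kind and extent elided */}
struct LexError;

/// Pretokenising: take one pretoken off the front of the input,
/// returning it together with the remaining input.
fn extract_pretoken(input: &str, edition: Edition) -> Result<(Pretoken, &str), LexError> {
    unimplemented!()
}

/// Reprocessing: convert one pretoken into one fine-grained token.
fn reprocess(pretoken: Pretoken, edition: Edition) -> Result<FineGrainedToken, LexError> {
    unimplemented!()
}
```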

Process

The process is to repeat the following steps until the input is empty:

  1. extract a pretoken from the start of the input
  2. reprocess that pretoken

If no step determines that lexical analysis should fail, the output is the sequence of fine-grained tokens produced by the repetitions of the second step.

Note: Each fine-grained token corresponds to one pretoken, representing exactly the same characters from the input; reprocessing doesn't involve any combination or splitting.

Note: It doesn't make any difference whether we treat this as one pass with interleaved pretoken-extraction and reprocessing, or as two passes. The comparable implementation uses a single interleaved pass, which means that when it reports an error, it describes the earliest part of the input which caused trouble.
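
For illustration only, here is a sketch of the interleaved single-pass form of the process, reusing the hypothetical types and operation signatures from the sketch above; propagating the first error is what makes the reported error describe the earliest troublesome part of the input.

```rust
// A rough illustration (not the specification) of the interleaved single-pass
// form of the process, reusing the hypothetical `Edition`, `FineGrainedToken`,
// `LexError`, `extract_pretoken`, and `reprocess` from the earlier sketch.
fn tokenise(mut input: &str, edition: Edition) -> Result<Vec<FineGrainedToken>, LexError> {
    let mut output = Vec::new();
    // Repeat until the input is empty.
    while !input.is_empty() {
        // Step 1: extract a pretoken from the start of the input.
        let (pretoken, rest) = extract_pretoken(input, edition)?;
        // Step 2: reprocess that pretoken into exactly one fine-grained token.
        output.push(reprocess(pretoken, edition)?);
        input = rest;
    }
    // No step failed: the output is the sequence of fine-grained tokens.
    Ok(output)
}
```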

Finding the first non-whitespace token

This section defines a variant of the tokenisation process which is used in the definition of Shebang removal.

The process of finding the first non-whitespace token in a character sequence (the input) is:

  1. if the input is empty, the result is no token
  2. extract a pretoken from the start of the input
  3. reprocess that pretoken
  4. if the resulting fine-grained token is not a token representing whitespace, the result is that token
  5. otherwise, return to step 1

If any step determines that lexical analysis should fail, the result is no token.

For this purpose a token representing whitespace is any of:

  • a Whitespace token
  • a LineComment token whose style is non-doc
  • a BlockComment token whose style is non-doc
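
Again for illustration only, here is a sketch of this variant process, reusing the same hypothetical definitions as the earlier sketches; the predicate `represents_whitespace` is assumed to implement the list just given.

```rust
// A sketch of the variant process, using the same hypothetical definitions as
// the earlier sketches. The helper `represents_whitespace` is assumed to
// implement the list above (a Whitespace token, or a LineComment or
// BlockComment token whose style is non-doc).
fn first_non_whitespace_token(mut input: &str, edition: Edition) -> Option<FineGrainedToken> {
    // Step 1: if the input is empty, the result is no token.
    while !input.is_empty() {
        // Steps 2 and 3: extract and reprocess a pretoken.
        // If either step fails, the result is no token.
        let (pretoken, rest) = extract_pretoken(input, edition).ok()?;
        let token = reprocess(pretoken, edition).ok()?;
        // Step 4: a non-whitespace token is the result.
        if !represents_whitespace(&token) {
            return Some(token);
        }
        // Step 5: otherwise continue with the remaining input.
        input = rest;
    }
    None
}

/// Hypothetical predicate matching the definition above.
fn represents_whitespace(token: &FineGrainedToken) -> bool {
    unimplemented!()
}
```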