Tokenising
This phase of processing takes a character sequence (the input), and either:
- produces a sequence of fine-grained tokens; or
- reports that lexical analysis failed.
The analysis depends on the Rust edition which is in effect when the input is processed.
So strictly speaking, the edition is a second parameter to the process described here.
Tokenisation is described using two operations:
- Pretokenising extracts pretokens from the character sequence.
- Reprocessing converts pretokens to fine-grained tokens.
Either operation can cause lexical analysis to fail.
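The relationship between the two operations can be sketched as Rust signatures. Everything below (the type names, the Edition variants, and the function names extract_pretoken and reprocess) is illustrative only, an assumption for exposition, and is not taken from the comparable implementation:

```rust
/// Illustrative edition parameter; the variant names are assumptions.
#[derive(Clone, Copy)]
pub enum Edition {
    E2015,
    E2021,
    E2024,
}

/// Placeholder token types; the real definitions carry a kind,
/// an extent, and attributes.
pub struct Pretoken;
pub struct FineToken;

/// Indicates that lexical analysis failed.
pub struct LexError;

/// Pretokenising: extract one pretoken from the start of the input,
/// returning it with the rest of the input, or None if the input is
/// empty. The edition is the "second parameter" mentioned above.
pub fn extract_pretoken(
    input: &str,
    edition: Edition,
) -> Result<Option<(Pretoken, &str)>, LexError> {
    // Body elided: this stands in for the pretokenising rules.
    let _ = (input, edition);
    todo!()
}

/// Reprocessing: convert one pretoken to one fine-grained token,
/// or fail.
pub fn reprocess(pretoken: Pretoken, edition: Edition) -> Result<FineToken, LexError> {
    let _ = (pretoken, edition);
    todo!()
}
```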
Process
The process is to repeat the following steps until the input is empty:
- extract a pretoken from the start of the input
- reprocess that pretoken
If no step determines that lexical analysis should fail, the output is the sequence of fine-grained tokens produced by the repetitions of the second step.
Note: Each fine-grained token corresponds to one pretoken, representing exactly the same characters from the input; reprocessing doesn't involve any combination or splitting.
Note: It doesn't make any difference whether we treat this as one pass with interleaved pretoken-extraction and reprocessing, or as two passes. The comparable implementation uses a single interleaved pass, which means that when it reports an error it describes the earliest part of the input which caused trouble (as sketched below).
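Using the illustrative declarations from the sketch above, the single interleaved pass might look like this (again a sketch under those assumptions, not the comparable implementation's actual code):

```rust
/// Tokenising as one interleaved pass: extract a pretoken, then
/// reprocess it immediately, repeating until the input is empty.
/// Because reprocessing happens before the next extraction, any
/// error is reported for the earliest troublesome part of the input.
pub fn tokenise(mut input: &str, edition: Edition) -> Result<Vec<FineToken>, LexError> {
    let mut tokens = Vec::new();
    while let Some((pretoken, rest)) = extract_pretoken(input, edition)? {
        tokens.push(reprocess(pretoken, edition)?);
        input = rest;
    }
    Ok(tokens)
}
```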
Finding the first non-whitespace token
This section defines a variant of the tokenisation process which is used in the definition of Shebang removal.
The process of finding the first non-whitespace token in a character sequence (the input) is:
1. if the input is empty, the result is no token
2. extract a pretoken from the start of the input
3. reprocess that pretoken
4. if the resulting fine-grained token is not a token representing whitespace, the result is that token
5. otherwise, return to step 1
If any step determines that lexical analysis should fail, the result is no token. A sketch of this loop appears after the list below.
For this purpose a token representing whitespace is any of:
- a Whitespace token
- a LineComment token whose style is non-doc
- a BlockComment token whose style is non-doc
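A sketch of this search, reusing the illustrative declarations from the first sketch; is_whitespace_token is a hypothetical predicate standing in for the list just given:

```rust
/// Hypothetical predicate: true for a Whitespace token and for
/// LineComment/BlockComment tokens whose style is non-doc.
fn is_whitespace_token(token: &FineToken) -> bool {
    let _ = token;
    todo!()
}

/// Finding the first non-whitespace token. An empty input, or any
/// lexical-analysis failure, yields no token (None).
pub fn first_non_whitespace_token(mut input: &str, edition: Edition) -> Option<FineToken> {
    loop {
        // Steps 1-2: stop if the input is empty; otherwise extract a
        // pretoken (treating a lexical error as "no token").
        let (pretoken, rest) = extract_pretoken(input, edition).ok()??;
        // Step 3: reprocess that pretoken.
        let token = reprocess(pretoken, edition).ok()?;
        // Step 4: the first token that isn't whitespace is the result.
        if !is_whitespace_token(&token) {
            return Some(token);
        }
        // Step 5: skip the whitespace token and continue.
        input = rest;
    }
}
```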