Processing that happens before tokenising
This document's description of tokenising takes a sequence of characters as input.
That sequence of characters is derived from an input sequence of bytes by performing the steps listed below in order.
Normally the input sequence of bytes is the contents of a single source file.
Decoding
The input sequence of bytes is interpreted as a sequence of characters represented using the UTF-8 Unicode encoding scheme.
If the input sequence of bytes is not a well-formed UTF-8 code unit sequence, the input is rejected.
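The decoding step can be sketched with the standard library's UTF-8 validation. This is only an illustration: the function name `decode` is hypothetical, and a real implementation would report a diagnostic rather than return a bare error.

```rust
use std::str::Utf8Error;

/// Hypothetical helper: interpret the input bytes as UTF-8.
/// Returns an error (the input is rejected) if the bytes are not
/// a well-formed UTF-8 code unit sequence.
fn decode(bytes: &[u8]) -> Result<&str, Utf8Error> {
    std::str::from_utf8(bytes)
}
```

`std::str::from_utf8` borrows rather than copies, so the later steps can work on slices of the original input.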
Byte order mark removal
If the first character in the sequence is U+FEFF (BYTE ORDER MARK), it is removed.
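As a minimal sketch, byte order mark removal amounts to stripping at most one leading U+FEFF. The function name `remove_bom` is an assumption made for this example.

```rust
/// Hypothetical helper: drop a single leading U+FEFF, if present.
/// Any later U+FEFF characters are left in place.
fn remove_bom(input: &str) -> &str {
    input.strip_prefix('\u{FEFF}').unwrap_or(input)
}
```

Only the first character is examined, so an input beginning with two byte order marks keeps the second one.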
CRLF normalisation
Each pair of characters CR immediately followed by LF is replaced by a single LF character.
Note: It's not possible for two such pairs to overlap, so this operation is unambiguously defined.
Note: Other occurrences of the character CR are left in place. It's still possible for the sequence CRLF to be passed on to the tokeniser: that will happen if the input contained the sequence CRCRLF.
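The behaviour described in the notes above can be checked with a short sketch. The function name `normalise_crlf` is hypothetical; it relies on `str::replace`, which replaces non-overlapping matches from left to right.

```rust
/// Hypothetical helper: replace each CR LF pair with a single LF.
/// A CR that is not immediately followed by LF is left untouched.
fn normalise_crlf(input: &str) -> String {
    input.replace("\r\n", "\n")
}
```

For an input containing CR CR LF, only the second CR pairs with the LF, so the result still contains the sequence CR LF.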
Shebang removal
Shebang removal is performed if:
- the remaining sequence begins with the characters #!; and
- the result of finding the first non-whitespace token with the characters following the #! as input is not a Punctuation token whose mark is the [ character.
If shebang removal is performed:
- the characters up to and including the first LF character are removed from the sequence;
- if the sequence did not contain an LF character, all characters are removed from the sequence.
Note: The check for [ prevents an inner attribute at the start of the input being removed. See #70528 and #71487 for history.
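The shebang rule can be sketched as follows. Both function names are hypothetical, and `next_token_is_open_bracket` is a deliberately simplified stand-in for "finding the first non-whitespace token": it skips only whitespace as defined by `str::trim_start`, whereas the real check runs the tokeniser on the remaining input.

```rust
/// Hypothetical, simplified stand-in for the tokeniser-based check:
/// skip leading whitespace and report whether `[` comes next.
fn next_token_is_open_bracket(rest: &str) -> bool {
    rest.trim_start().starts_with('[')
}

/// Hypothetical helper: remove a leading shebang line, unless the input
/// looks like it starts with an inner attribute such as `#![allow(unused)]`.
fn remove_shebang(input: &str) -> &str {
    // Shebang removal only applies when the input begins with `#!`.
    let rest = match input.strip_prefix("#!") {
        Some(rest) => rest,
        None => return input,
    };
    // `#!` followed by `[` is treated as an inner attribute and kept.
    if next_token_is_open_bracket(rest) {
        return input;
    }
    // Remove everything up to and including the first LF,
    // or the whole input if there is no LF.
    match input.find('\n') {
        Some(lf) => &input[lf + 1..],
        None => "",
    }
}
```

Because the check looks past whitespace, including line breaks, an input such as `#!` on one line followed by `[x]` on the next is also left unchanged in this sketch.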