Processing that happens before tokenising

This document's description of tokenising takes a sequence of characters as input.

That sequence of characters is derived from an input sequence of bytes by performing the steps listed below in order.

Normally the input sequence of bytes is the contents of a single source file.

Decoding

The input sequence of bytes is interpreted as a sequence of characters represented using the UTF-8 Unicode encoding scheme.

If the input sequence of bytes is not a well-formed UTF-8 code unit sequence, the input is rejected.
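As a minimal sketch of this step, the decoding can be expressed with the standard library's strict UTF-8 validation. The function name `decode` is hypothetical and chosen for illustration only; it is not taken from any compiler source.

```rust
/// Hypothetical helper: interpret the raw bytes of a source file as UTF-8.
/// Returns an error (that is, the input is rejected) if the bytes are not
/// well-formed UTF-8.
fn decode(input: &[u8]) -> Result<&str, std::str::Utf8Error> {
    // `std::str::from_utf8` validates the whole slice and does not
    // substitute replacement characters for invalid sequences.
    std::str::from_utf8(input)
}
```

A strict check is used here rather than a lossy conversion such as `String::from_utf8_lossy`, because the text above says ill-formed input is rejected, not repaired.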

Byte order mark removal

If the first character in the sequence is U+FEFF (BYTE ORDER MARK), it is removed.
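A short sketch of this rule, assuming the input has already been decoded to a `&str`. The name `strip_bom` is hypothetical. Only a single leading U+FEFF is removed; any later occurrence is left in place.

```rust
/// Hypothetical helper: remove one leading U+FEFF, if present.
fn strip_bom(input: &str) -> &str {
    // `strip_prefix` returns `None` when the prefix is absent, in which
    // case the input is passed through unchanged.
    input.strip_prefix('\u{FEFF}').unwrap_or(input)
}
```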

CRLF normalisation

Each pair of characters CR immediately followed by LF is replaced by a single LF character.

Note: It's not possible for two such pairs to overlap, so this operation is unambiguously defined.

Note: Other occurrences of the character CR are left in place. It's still possible for the sequence CRLF to be passed on to the tokeniser: that will happen if the input contained the sequence CRCRLF.
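The replacement described above can be sketched with `str::replace`, which substitutes non-overlapping matches from left to right. The name `normalise_crlf` is hypothetical.

```rust
/// Hypothetical helper: replace each CR LF pair with a single LF.
/// A CR that is not immediately followed by LF is left in place.
fn normalise_crlf(input: &str) -> String {
    input.replace("\r\n", "\n")
}
```

With this sketch, the input CR CR LF becomes CR LF: the second CR pairs with the LF and is consumed, while the first CR is left untouched, which matches the second note above.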

Shebang removal

Shebang removal is performed if:

  • the remaining sequence begins with the characters #!; and
  • the result of finding the first non-whitespace token with the characters following the #! as input is not a Punctuation token whose mark is the [ character.

If shebang removal is performed:

  • the characters up to and including the first LF character are removed from the sequence; or
  • if the sequence does not contain an LF character, all characters are removed from the sequence.

Note: The check for [ prevents an inner attribute at the start of the input being removed. See #70528 and #71487 for history.
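The following is a simplified sketch of the shebang rule, not a faithful implementation. The name `strip_shebang` is hypothetical. As a loudly stated assumption, it approximates "the first non-whitespace token" by skipping whitespace characters with `char::is_whitespace` and looking at the next character. A complete implementation would run the tokeniser on the characters after `#!`, so this sketch may give different results when, for example, a comment appears between `#!` and `[`.

```rust
/// Hypothetical helper: remove a leading shebang line, unless the input
/// appears to start with an inner attribute such as `#![allow(unused)]`.
fn strip_shebang(input: &str) -> &str {
    // Shebang removal is only considered if the input begins with `#!`.
    let rest = match input.strip_prefix("#!") {
        Some(rest) => rest,
        None => return input,
    };

    // ASSUMPTION: approximate "first non-whitespace token" by the first
    // non-whitespace character. Comments are not handled here.
    let next = rest.chars().find(|c| !c.is_whitespace());
    if next == Some('[') {
        // Treated as the start of an inner attribute: keep the input.
        return input;
    }

    match input.find('\n') {
        // Remove everything up to and including the first LF.
        Some(i) => &input[i + 1..],
        // No LF at all: every character is removed.
        None => "",
    }
}
```

Under these assumptions, `#!/usr/bin/env run` followed by a newline is dropped, while `#![allow(unused)]` is passed through unchanged.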