Character encoding
The SGF format is defined as containing ASCII-encoded data, possibly with non-ASCII characters in Text and SimpleText property values. The low-level Sgfmill functions for loading and serialising SGF data work with Python bytes or bytes-like objects.
The encoding used for Text and SimpleText property values is given by the CA root property (if that isn’t present, the encoding is ISO-8859-1).
In order for an encoding to be used in Sgfmill, it must exist as a Python built-in codec, and it must be compatible with ASCII (at least whitespace, \, ], and : must be in the usual places). Behaviour is unspecified if a non-ASCII-compatible encoding is requested.
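The ASCII-compatibility requirement can be approximated with a quick check before an encoding name is handed to Sgfmill. The sketch below uses only the Python standard library; is_ascii_compatible and SGF_STRUCTURAL are hypothetical names, not part of Sgfmill. The check is necessary rather than sufficient: it does not detect multi-byte encodings that reuse these byte values inside other characters.

```python
import codecs

# Characters that must keep their ASCII byte values: whitespace, \, ] and :
SGF_STRUCTURAL = " \t\r\n\\]:"


def is_ascii_compatible(encoding_name):
    """Return True if the codec exists and encodes SGF's structural
    characters to the same bytes as ASCII does.

    Hypothetical helper for illustration; not part of Sgfmill.
    """
    try:
        codec = codecs.lookup(encoding_name)
    except LookupError:
        return False  # not a known Python codec
    try:
        encoded, _ = codec.encode(SGF_STRUCTURAL)
    except (UnicodeError, TypeError):
        return False  # codec cannot encode plain text at all
    return encoded == SGF_STRUCTURAL.encode("ascii")
```

Under these assumptions, "UTF-8" and "ISO-8859-1" pass the check, while "UTF-16" fails because it emits a byte order mark and two bytes per character.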
When encodings are passed as parameters (or returned from functions), they are represented using the names or aliases of Python built-in codecs (eg "UTF-8" or "ISO-8859-1"). See standard encodings for a list.
Values of the CA property are interpreted in the same way.
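Because several spellings can name the same codec, it can help to compare encoding names through Python's codec registry rather than as plain strings. This is a minimal sketch using the standard codecs module; same_codec is a hypothetical helper, and whether Sgfmill normalises names in exactly this way is an assumption.

```python
import codecs


def same_codec(name_a, name_b):
    """Return True if two encoding names or aliases resolve to the same
    Python built-in codec. Raises LookupError for an unknown name.

    Hypothetical helper for illustration; not part of Sgfmill.
    """
    return codecs.lookup(name_a).name == codecs.lookup(name_b).name
```

For example, same_codec("ISO-8859-1", "latin1") is true, so a CA value of either spelling selects the same Python codec.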
The raw property encoding
Each Sgf_game and Tree_node has a fixed raw property encoding, which is the encoding used internally to store the property values.
The Tree_node.get_raw() and Tree_node.set_raw() methods use the raw property encoding.
When an SGF game is loaded from a bytes-like object, the raw property encoding is taken from the CA root property (unless overridden).
Improperly encoded property values will not be detected until they are accessed (get() will raise ValueError; use get_raw() to retrieve the actual bytes).
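The "decode on access, fall back to the raw bytes" pattern described above can be sketched without Sgfmill. The function below is a hypothetical stand-in that works on plain bytes; it relies only on the fact that Python's UnicodeDecodeError is a subclass of ValueError.

```python
def decode_or_raw(raw, encoding):
    """Return the decoded text, or the untouched bytes if they are not
    valid in the given encoding.

    Hypothetical helper for illustration; not part of Sgfmill.
    """
    try:
        return raw.decode(encoding)
    except ValueError:  # UnicodeDecodeError is a subclass of ValueError
        return raw
```

With Sgfmill itself, the analogous code would presumably wrap a call to get() in try/except ValueError and call get_raw() in the handler.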
When an SGF game is created from a Python string (which contains Unicode characters), the raw property encoding is always UTF-8.
Changing the CA property
When an SGF game is serialised, the encoding represented by the CA root property is used. This target encoding will be the same as the raw property encoding unless CA has been changed since the Sgf_game was created.
When the raw property encoding and the target encoding match, the raw property values are included unchanged in the output (even if they are improperly encoded).
Otherwise, if any raw property value is improperly encoded, UnicodeDecodeError is raised, and if any property value can’t be represented in the target encoding, UnicodeEncodeError is raised.
If the target encoding doesn’t identify a Python codec, ValueError is raised. The behaviour of serialise() is unspecified if the target encoding isn’t ASCII-compatible (eg, UTF-16).
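The serialisation rules in this section can be summarised as a small function on a single raw value. This is a sketch under stated assumptions, not Sgfmill's implementation: transcode is a hypothetical name, and comparing encodings by their normalised codec names is an assumption about how "match" is decided.

```python
import codecs


def transcode(raw, raw_encoding, target_encoding):
    """Convert raw property bytes from the raw property encoding to the
    target encoding, following the rules described above.

    Hypothetical helper for illustration; not part of Sgfmill.
    """
    try:
        raw_name = codecs.lookup(raw_encoding).name
        target_name = codecs.lookup(target_encoding).name
    except LookupError as exc:
        # An encoding name that doesn't identify a Python codec.
        raise ValueError(str(exc)) from exc
    if raw_name == target_name:
        # Encodings match: bytes pass through unchanged, valid or not.
        return raw
    # May raise UnicodeDecodeError (improperly encoded input) or
    # UnicodeEncodeError (character not representable in the target).
    return raw.decode(raw_encoding).encode(target_encoding)
```

Note that an invalid byte sequence passes through silently when the two encodings match, and is reported only when a real conversion is attempted.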
Transcoding
Because changing the CA property has no effect until you serialise the game, it doesn’t broaden the set of characters you can use when you set() a property.
If you plan to save a file as UTF-8 and want to be able to set arbitrary strings, you can ensure the raw property encoding is UTF-8 by changing CA and reloading the game:
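from sgfmill import sgf
game = sgf.Sgf_game.from_bytes(...)
# Changing CA takes effect only on serialisation, so serialise and reload.
game.get_root().set("CA", "utf-8")
game = sgf.Sgf_game.from_bytes(game.serialise())
# The raw property encoding is now UTF-8, so any Unicode string can be set.
game.get_root().set("PB", "本因坊秀策")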