Programming Python (71 page)

BOOK: Programming Python

Unicode and the Text widget

The application of all this to tkinter Text displays is
straightforward: if we open in binary mode to read bytes, we don’t
need to be concerned about encodings in our own code; tkinter
interprets the data as expected, at least for these two encodings:

>>> from tkinter import Text
>>> t = Text()
>>> t.insert('1.0', open('ldata', 'rb').read())
>>> t.pack()                               # string appears in GUI OK
>>> t.get('1.0', 'end')
'AÄBäC\n'
>>>
>>> t = Text()
>>> t.insert('1.0', open('udata', 'rb').read())
>>> t.pack()                               # string appears in GUI OK
>>> t.get('1.0', 'end')
'AÄBäC\n'

It works the same if we pass a str fetched in text mode, but we then
need to know the encoding type on the Python side of the fence,
because reads will fail if the encoding type doesn’t match the stored
data:

>>> t = Text()
>>> t.insert('1.0', open('ldata', 'r', encoding='latin-1').read())
>>> t.pack()
>>> t.get('1.0', 'end')
'AÄBäC\n'
>>>
>>> t = Text()
>>> t.insert('1.0', open('udata', 'r', encoding='utf-8').read())
>>> t.pack()
>>> t.get('1.0', 'end')
'AÄBäC\n'
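
The mismatch failure mentioned above can be reproduced without a GUI. The following is a minimal sketch, not code from the book: it assumes a scratch file (the name ldata-demo is hypothetical) that stands in for the Latin-1 ldata file used in the listings.

```python
import os
import tempfile

text = 'AÄBäC\n'

# Store the text as Latin-1 bytes in a scratch file (a stand-in for 'ldata').
path = os.path.join(tempfile.mkdtemp(), 'ldata-demo')
with open(path, 'wb') as f:
    f.write(text.encode('latin-1'))

# Reading with the matching encoding recovers the original str.
with open(path, 'r', encoding='latin-1') as f:
    matched = f.read()

# Reading with a mismatched encoding fails here: byte 0xC4 followed by 'B'
# is not a valid UTF-8 sequence.
try:
    with open(path, 'r', encoding='utf-8') as f:
        f.read()
    mismatch_error = None
except UnicodeDecodeError as exc:
    mismatch_error = exc
```

Note that a mismatch is not always this loud. Reading UTF-8 data as Latin-1, for instance, raises no exception, because every byte value is valid Latin-1; it simply yields the wrong characters.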

Either way, though, the fetched content is always a Unicode
str, so binary mode really only
addresses loads: we still need to know an encoding to store, whether
we write in text mode directly or write in binary mode after manual
encoding:

>>> c = t.get('1.0', 'end')
>>> c                                                # content is str
'AÄBäC\n'
>>> open('cdata', 'wb').write(c)                     # binary mode needs bytes
TypeError: must be bytes or buffer, not str
>>> open('cdata', 'w', encoding='latin-1').write(c)  # each write returns 6
>>> open('cdata', 'rb').read()
b'A\xc4B\xe4C\r\n'
>>> open('cdata', 'w', encoding='utf-8').write(c)    # different bytes on files
>>> open('cdata', 'rb').read()
b'A\xc3\x84B\xc3\xa4C\r\n'
>>> open('cdata', 'w', encoding='utf-16').write(c)
>>> open('cdata', 'rb').read()
b'\xff\xfeA\x00\xc4\x00B\x00\xe4\x00C\x00\r\x00\n\x00'
>>> open('cdata', 'wb').write( c.encode('latin-1') ) # manual encoding first
>>> open('cdata', 'rb').read()                       # same but no \r on Win
b'A\xc4B\xe4C\n'
>>> open('cdata', 'w', encoding='ascii').write(c)    # still must be compatible
UnicodeEncodeError: 'ascii' codec can't encode character '\xc4' in position 1: o

Notice the last test here: like manual encoding, file writes can
still fail if the data cannot be encoded in the target scheme. Because
of that, programs may need to recover from exceptions or try
alternative schemes; this is especially true on platforms where ASCII
may be the default platform encoding.
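
One way to "try alternative schemes" is to walk a list of candidate encodings until one can represent the text. This is only a sketch under stated assumptions: the function name encode_with_fallbacks and the candidate order are illustrative choices, not something the book prescribes.

```python
def encode_with_fallbacks(text, encodings=('ascii', 'latin-1', 'utf-8')):
    """Return (encoding_name, data) for the first encoding that can represent text."""
    for name in encodings:
        try:
            return name, text.encode(name)
        except UnicodeEncodeError:
            continue                      # this scheme cannot hold the text: try next
    raise ValueError('none of the candidate encodings can represent the text')


# 'Ä' and 'ä' rule out ASCII, so the Latin-1 candidate is the first to succeed.
name, data = encode_with_fallbacks('AÄBäC\n')
```

The resulting bytes can then be written in binary mode, and the chosen name should be remembered so that the file can be decoded the same way later.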

The problem with treating text as bytes

The prior sections’
rules may seem complex, but they boil down to the
following:

Unless strings always use the platform default, we need to know
encoding types to read or write in text mode and to manually decode or
encode for binary mode.

We can use almost any encoding to write new files as long as it can
handle the string’s characters, but must provide one that is
compatible with the existing data’s binary format on reads.

We don’t need to know the encoding mode to read text as bytes in
binary mode for display, but the str content returned by the Text
widget still requires us to encode to write on saves.

So why not always load text files in binary mode to display them in a
tkinter Text widget? While binary mode input files seem to side-step
encoding issues for display, passing text to tkinter as bytes instead
of str really just delegates the encoding issue to the Tk library,
which imposes constraints of its own.

More specifically, opening input files in binary mode to read
bytes may seem to support viewing arbitrary types of text, but it has
two potential downsides:

It shifts the burden of deciding encoding type from our script to the
Tk GUI library. The library must still determine how to render those
bytes and may not support all encodings possible.

It allows opening and viewing data that is not text in nature, thereby
defeating some of the purpose of the validity checks performed by text
decoding.

The first point is probably the most crucial here. In experiments I’ve
run on Windows, Tk seems to correctly handle raw bytes strings encoded
in ASCII, UTF-8 and Latin-1 format, but not UTF-16 or others such as
CP500. By contrast, these all render correctly if decoded in Python to
str before being passed on to Tk. In programs intended for the world
at large, this wider support is crucial today. If you’re able to know
or ask for encodings, you’re better off using str both for display and
saves.
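
The reason decoding first is more robust can be seen without Tk at all. In the sketch below (an illustration, not the author's experiment), the same text has very different byte-level forms in each encoding, but decoding each with its own name always yields the identical str, which is what would then be passed to the Text widget's insert method.

```python
text = 'AÄBäC\n'
encodings = ('latin-1', 'utf-8', 'utf-16', 'cp500')

# The raw bytes differ from one encoding to the next...
raw = {name: text.encode(name) for name in encodings}

# ...but decoding with the matching name always recovers the same str.
decoded = {name: data.decode(name) for name, data in raw.items()}

for name in encodings:
    print(name, raw[name], repr(decoded[name]))
```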

To some degree, regardless of whether you pass in str or bytes,
tkinter GUIs are subject to the constraints imposed by the underlying
Tk library and the Tcl language it uses internally, as well as any
imposed by the techniques Python’s tkinter uses to interface with Tk.
For example:

Tcl, the internal implementation language of the Tk library, stores
strings internally in UTF-8 format, and decrees that strings passed in
to and returned from its C API be in this format.

Tcl attempts to convert byte strings to its internal UTF-8 format, and
generally supports translation using the platform and locale encodings
in the local operating system with Latin-1 as a fallback.

Python’s tkinter passes bytes strings to Tcl directly, but copies
Python str Unicode strings to and from Tcl Unicode string objects.

Tk inherits all of Tcl’s Unicode policies, but adds additional font
selection policies for display.

In other words, GUIs that display text in tkinter are somewhat at the
mercy of multiple layers of software, above and beyond the Python
language itself. In general, though, Unicode is broadly supported by
Tk’s Text widget for Python str, but not for Python bytes. As you can
probably tell, though, this story quickly becomes very low-level and
detailed, so we won’t explore it further in this book; see the Web and
other resources for more on tkinter, Tk, and Tcl, and the interfaces
between them.

Other binary mode considerations

Even in contexts where it’s sufficient, using binary mode files to
finesse encodings for display is more complicated than you might
think. We always need to be careful to write output in binary mode,
too, so what we read is what we later write: if we read in binary
mode, content end-lines will be \r\n on Windows, and we don’t want
text-mode files to expand this to \r\r\n. Moreover, there’s another
difference in tkinter for str and bytes. A str read from a text-mode
file appears in the GUI as you expect, and end-lines are mapped on
Windows as usual:

C:\...\PP4E\Gui\Tour> python
>>> from tkinter import *
>>> T = Text()                                       # str from text-mode file
>>> T.insert('1.0', open('jack.txt').read())         # platform default encoding
>>> T.pack()                                         # appears in GUI normally
>>> T.get('1.0', 'end')[:75]
'000)  All work and no play makes Jack a dull boy.\n001)  All work and no pla'

If you pass in a bytes obtained from a binary-mode file, however, it’s
odd in the GUI on Windows: there’s an extra space at the end of each
line, which reflects the \r that is not stripped by binary mode files:

C:\...\PP4E\Gui\Tour> python
>>> from tkinter import *
>>> T = Text()                                       # bytes from binary-mode
>>> T.insert('1.0', open('jack.txt', 'rb').read())   # no decoding occurs
>>> T.pack()                                         # lines have space at end!
>>> T.get('1.0', 'end')[:75]
'000)  All work and no play makes Jack a dull boy.\r\n001)  All work and no pl'

To use bytes to allow for arbitrary text but make the text appear as
expected by users, we also have to strip the \r characters at line end
manually. This assumes that a \r\n combination doesn’t mean something
special in the text’s encoding scheme, though data in which this
sequence does not mean end-of-line will likely have other issues when
displayed. The following avoids the extra end-of-line spaces: we open
for input in binary mode for undecoded bytes, but drop \r:

C:\...\PP4E\Gui\Tour> python
>>> from tkinter import *                            # use bytes, strip \r if any
>>> T = Text()
>>> data = open('jack.txt', 'rb').read()
>>> data = data.replace(b'\r\n', b'\n')
>>> T.insert('1.0', data)
>>> T.pack()
>>> T.get('1.0', 'end')[:75]
'000)  All work and no play makes Jack a dull boy.\n001)  All work and no pla'

To save content later, we can either add the \r characters back on
Windows only, manually encode to bytes, and save in binary mode; or we
can open in text mode to make the file object restore the \r if needed
and encode for us, and write the str content string directly. The
second of these is probably simpler, as we don’t need to care about
platform differences.
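
The two save options can be compared side by side. This is a sketch only: the file names are hypothetical, UTF-8 is an arbitrary choice of encoding, and os.linesep is used here as one way to "add the \r back on Windows only."

```python
import os
import tempfile

content = 'line one\nline two\n'          # str as fetched from a Text widget
workdir = tempfile.mkdtemp()

# Option 1: binary mode. Restore platform end-lines and encode by hand.
manual_path = os.path.join(workdir, 'manual.txt')
with open(manual_path, 'wb') as f:
    f.write(content.replace('\n', os.linesep).encode('utf-8'))

# Option 2: text mode. The file object maps '\n' and encodes for us.
textmode_path = os.path.join(workdir, 'textmode.txt')
with open(textmode_path, 'w', encoding='utf-8') as f:
    f.write(content)

with open(manual_path, 'rb') as f:
    bytes_manual = f.read()
with open(textmode_path, 'rb') as f:
    bytes_textmode = f.read()
```

Both files end up with the same bytes on any platform, which is why the text-mode form is usually the simpler choice.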

Either way, though, we still face an encoding step: we can either rely
on the platform default encoding or obtain an encoding name from user
interfaces. In the following, for example, the text-mode file converts
end-lines and encodes to bytes internally using the platform default.
If we care about supporting arbitrary Unicode types or run on a
platform whose default does not accommodate characters displayed, we
would need to pass in an explicit encoding argument (the Python slice
operation here has the same effect as fetching through Tk’s “end-1c”
position specification):

...continuing prior listing...
>>> content = T.get('1.0', 'end')[:-1]               # drop added \n at end
>>> open('copyjack.txt', 'w').write(content)         # use platform default
12500                                                # text mode adds \r on Win
>>> ^Z
C:\...\PP4E\Gui\Tour> fc jack.txt copyjack.txt
Comparing files jack.txt and COPYJACK.TXT
FC: no differences encountered

Supporting Unicode in PyEdit (ahead)

We’ll see a use case of accommodating the Text widget’s Unicode
behavior in the larger PyEdit example of Chapter 11. Really,
supporting Unicode just means supporting arbitrary Unicode encodings
in text files on opens and saves; once in memory, text processing can
always be performed in terms of str, since that’s how tkinter returns
content. To support Unicode, PyEdit will open both input and output
files in text mode with explicit encodings whenever possible, and fall
back on opening input files in binary mode only as a last resort. This
avoids relying on the limited Unicode support Tk provides for display
of raw byte strings.

To make this policy work, PyEdit will accept encoding names from a
wide variety of sources and allow the user to configure which to
attempt. Encodings may be obtained from user dialog inputs,
configuration file settings, the platform default, the prior open’s
encoding on saves, and even internal program values (parsed from email
headers, for instance). These sources are attempted until the first
that succeeds, though it may also be desirable to limit encoding
attempts to just one such source in some contexts.
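
The "attempt sources until the first that succeeds" policy might be sketched roughly as follows. This is not PyEdit's actual code: the function read_text, the variable names, and the order of candidates are all hypothetical, chosen only to illustrate the idea, with a binary-mode read as the last resort.

```python
import locale
import os
import tempfile


def read_text(path, candidates):
    """Return (content, encoding_used); encoding_used is None for the bytes fallback."""
    for name in candidates:
        if not name:                      # skip sources that supplied nothing
            continue
        try:
            with open(path, 'r', encoding=name) as f:
                return f.read(), name
        except (UnicodeError, LookupError):
            continue                      # wrong or unknown encoding: try the next
    with open(path, 'rb') as f:           # last resort: undecoded bytes, \r dropped
        return f.read().replace(b'\r\n', b'\n'), None


# Demo on a scratch file holding Latin-1 data.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(demo_path, 'wb') as f:
    f.write('AÄBäC\n'.encode('latin-1'))

user_choice = None                        # e.g. nothing typed into a dialog
config_value = 'utf-8'                    # e.g. a configuration file setting
candidates = [user_choice, config_value, 'latin-1',
              locale.getpreferredencoding(False)]
content, used = read_text(demo_path, candidates)
```

Order matters in a scheme like this: Latin-1 can decode any byte sequence, so it never fails and any candidates listed after it are never reached. A successful decode also shows only that the bytes are valid in that encoding, not that it is the one the file was written with.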

Watch for this code in Chapter 14. Frankly, PyEdit in this edition
originally read and wrote files in text mode with platform default
encodings. I didn’t consider the implications of Unicode on PyEdit
until the PyMailGUI example’s Internet world raised the specter of
arbitrary text encodings. If it seems that strings are a lot more
complicated than they used to be, it’s probably only because your
scope has been too narrow.
