Authors: Mark Lutz

Tags: #COMPUTERS / Programming Languages / Python

Programming Python (185 page)


Chapter 19. Text and Language

“See Jack Hack. Hack, Jack, Hack”

In one form or another, processing text-based information is one of
the more common tasks that applications need to perform. This can include
anything from scanning a text file by columns to analyzing statements in a
language defined by a formal grammar. Such processing is usually called
parsing: analyzing the structure of a text string. In this chapter, we’ll
explore ways to handle language and
text-based information and summarize some Python development concepts in
sidebars along the way. In the process, we’ll meet string methods, text
pattern matching, XML and HTML parsers, and other tools.

Some of this material is advanced, but the examples are small to
keep this chapter short. For instance, recursive descent parsing is
illustrated with a simple example to show how it can be implemented in
Python. We’ll also see that it’s often unnecessary to write custom parsers
for each language processing task in Python. They can usually be replaced
by exporting APIs for use in Python programs, and sometimes by a single
built-in function call. Finally, this chapter closes by presenting
PyCalc—a calculator GUI written in Python, and the last major Python
coding example in this text. As we’ll see, writing calculators isn’t much
more difficult than juggling stacks while scanning text.

Strategies for Processing Text in Python

In the grand
scheme of things, there are a variety of ways to handle text
processing and language analysis in Python:

Expressions: Built-in string object expressions
Methods: Built-in string object method calls
Patterns: Regular expression pattern matching
Parsers: markup: XML and HTML text parsing
Parsers: grammars: Custom language parsers, both handcoded and generated
Embedding: Running Python code with eval and exec built-ins
And more: Natural language processing

For simpler tasks, Python’s built-in
string object is often all we really need. Python strings
can be indexed, concatenated, sliced, and processed with both string
method calls and built-in functions. Our main emphasis in this chapter is
mostly on higher-level tools and techniques for analyzing textual
information and language, but we’ll briefly explore each of these
techniques in turn. Let’s get
started.

Note

Some readers may have come to this chapter seeking coverage of
Unicode text, too, but this topic is not presented here. For a look at
Python’s Unicode support, see Chapter 2’s discussion of string tools,
Chapter 4’s discussion of text and binary file distinctions and
encodings, and Chapter 9’s coverage of text in tkinter GUIs. Unicode
also appears in various Internet and database topics throughout this
book (e.g., email encodings).

Because Unicode is a core language topic, all these chapters will
also refer you to the fuller coverage of Unicode in Learning Python,
Fourth Edition. Most of the topics in this chapter, including string
methods and pattern matching, apply to Unicode automatically simply
because the Python 3.X str string type is Unicode, whether ASCII or
wider.

String Method Utilities

The first stop on
our text and language tour is the most basic: Python’s
string objects come with an array of text processing tools, and serve as
your first line of defense in this domain. As you undoubtedly know by now,
concatenation, slicing, formatting, and other string expressions are
workhorses of most programs (I’m including the newer format method in
this category, as it’s really just an alternative to the % expression):

>>> 'spam eggs ham'[5:10]                        # slicing: substring extraction
'eggs '
>>> 'spam ' + 'eggs ham'                         # concatenation (and *, len(), [ix])
'spam eggs ham'
>>> 'spam %s %s' % ('eggs',  'ham')              # formatting expression: substitution
'spam eggs ham'
>>> 'spam {} {}'.format('eggs',  'ham')          # formatting method: % alternative
'spam eggs ham'
>>> 'spam = "%-5s", %+06d' % ('ham',  99)        # more complex formatting
'spam = "ham  ", +00099'
>>> 'spam = "{0:<5}", {1:+06}'.format('ham',  99)
'spam = "ham  ", +00099'

These operations are covered in core language resources such as
Learning Python. For the purposes of this chapter, though, we’re
interested in more powerful tools: Python’s string object methods
include a wide variety of text-processing utilities that go above and
beyond string expression operators. We saw some of these in action early
on in Chapter 2, and have been using them ever since. For instance, given
an instance str of the built-in string object type, operations like the
following are provided as object method calls:

str.find(substr): Performs substring searches
str.replace(old, new): Performs substring substitutions
str.split(delimiter): Chops up a string around a delimiter or whitespace
str.join(iterable): Puts substrings together with delimiters between
str.strip(): Removes leading and trailing whitespace
str.rstrip(): Removes trailing whitespace only, if any
str.rjust(width): Right-justifies a string in a fixed-width field
str.upper(): Converts to uppercase
str.isupper(): Tests whether the string is uppercase
str.isdigit(): Tests whether the string is all digit characters
str.endswith(substr-or-tuple): Tests for a substring (or a tuple of alternatives) at the end
str.startswith(substr-or-tuple): Tests for a substring (or a tuple of alternatives) at the front

This list is representative but partial, and some of these methods
take additional optional arguments. For the full list of string methods,
run a dir(str) call at the Python interactive prompt and run
help(str.method) on any method for some quick documentation. The Python
library manual and reference books such as Python Pocket Reference also
include an exhaustive list.

Moreover, in Python today all normal string methods apply to both
bytes and str strings. The latter makes them applicable to arbitrarily
encoded Unicode text, simply because the str type is Unicode text, even
if it’s only ASCII. These methods originally appeared as functions in the
string module, but are only object methods today; the string module is
still present because it contains predefined constants (e.g.,
string.ascii_uppercase), as well as the Template substitution interface
in 2.4 and later, one of the techniques discussed in the next section.
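As a quick illustration of the method calls listed above, here is a minimal sketch that runs several of them on one sample line. The sample text and the variable names are assumptions made up for this demo, not part of the book’s examples package.

```python
# A small tour of common string methods, applied to one sample line.
line = '  spam,eggs,ham  \n'              # hypothetical input text

clean = line.strip()                      # drop leading/trailing whitespace
fields = clean.split(',')                 # chop around the comma delimiter
where = clean.find('eggs')                # offset of substring, or -1 if absent
swapped = clean.replace('eggs', 'toast')  # substring substitution
joined = '+'.join(fields)                 # glue substrings back together
padded = fields[0].rjust(8)               # right-justify in an 8-column field
shouted = fields[0].upper()               # uppercase copy

print(clean, fields, where, swapped, joined, repr(padded), shouted)
print(shouted.isupper(), '42'.isdigit(), clean.endswith(('ham', 'jam')))

# The same kinds of calls work on bytes, given bytes arguments.
raw = b'  spam,eggs,ham  \n'
raw_fields = raw.strip().split(b',')
print(raw_fields)
```

Note that bytes methods expect bytes arguments (b',' rather than ','); mixing the two types raises a TypeError.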

Templating with Replacements and Formats

By way of review, let’s take a quick look at string methods in the
context of some of their most common use cases. As we saw when
generating HTML forwarding pages in Chapter 6, the string replace method
is often adequate by itself as a string templating tool: we can compute
values and insert them at fixed positions in a string with simple
replacement calls:

>>> template = '---$target1---$target2---'
>>> val1 = 'Spam'
>>> val2 = 'shrubbery'
>>> template = template.replace('$target1', val1)
>>> template = template.replace('$target2', val2)
>>> template
'---Spam---shrubbery---'

As we also saw when generating HTML reply pages in the CGI scripts
of Chapters 15 and 16, the string % formatting operator is also a
powerful templating tool, especially when combined with dictionaries:
simply fill out a dictionary with values and apply multiple
substitutions to the HTML string all at once:

>>> template = """
... ---
... ---%(key1)s---
... ---%(key2)s---
... """
>>>
>>> vals = {}
>>> vals['key1'] = 'Spam'
>>> vals['key2'] = 'shrubbery'
>>> print(template % vals)

---
---Spam---
---shrubbery---

Beginning with Python 2.4, the string module’s Template feature
is essentially a simplified and limited variation of the
dictionary-based format scheme just shown, but it allows some additional
call patterns which some may consider simpler:

>>> vals
{'key2': 'shrubbery', 'key1': 'Spam'}
>>> import string
>>> template = string.Template('---$key1---$key2---')
>>> template.substitute(vals)
'---Spam---shrubbery---'
>>> template.substitute(key1='Brian', key2='Loretta')
'---Brian---Loretta---'

See the library manual for more on this extension. Although the
string datatype does not itself support the pattern-directed text
processing that we’ll meet later in this chapter, its tools are powerful
enough for many tasks.
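One detail the session above does not show is what happens when a value is missing. The following sketch, which goes slightly beyond the text, assumes a dictionary that lacks one key and contrasts the Template class’s substitute and safe_substitute methods; the variable names are made up for this demo.

```python
import string

vals = {'key1': 'Spam'}                   # 'key2' is deliberately missing

template = string.Template('---$key1---$key2---')

# substitute() raises KeyError when a placeholder has no value
try:
    template.substitute(vals)
except KeyError as exc:
    missing = exc.args[0]
    print('missing key:', missing)

# safe_substitute() leaves unknown placeholders in place instead
partial = template.safe_substitute(vals)
print(partial)
```

Which behavior is preferable depends on the task: a hard failure catches typos in keys early, while the lenient form allows a template to be filled in over several passes.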

Parsing with Splits and Joins

In terms of this
chapter’s main focus, Python’s built-in tools for
splitting and joining strings around tokens turn out to be especially
useful when it comes to parsing text:

str.split(delimiter?, maxsplits?): Splits a string into a list of
substrings, using either whitespace (tabs, spaces, newlines) or an
explicitly passed string as a delimiter. maxsplits limits the number of
splits performed, if passed.
delimiter.join(iterable): Concatenates a sequence or other iterable of
substrings (e.g., list, tuple, generator), adding the subject separator
string between each.

These two are among the most powerful of string methods. As we saw
in Chapter 2, split chops a string into a list of substrings and join
puts them back together:

>>> 'A B C D'.split()
['A', 'B', 'C', 'D']
>>> 'A+B+C+D'.split('+')
['A', 'B', 'C', 'D']
>>> '--'.join(['a', 'b', 'c'])
'a--b--c'
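The optional split limit and join’s support for arbitrary iterables are not exercised in the session above. Here is a short sketch of both, using a made-up record string; the limit is passed positionally as the second argument.

```python
record = 'bob: developer: likes spam: and eggs'   # hypothetical record

# Without a limit, every ': ' is a split point
all_parts = record.split(': ')
print(all_parts)

# With a limit of 2, only the first two delimiters are used,
# so the rest of the line stays intact in the last field
name, job, notes = record.split(': ', 2)
print(name, job, notes, sep=' | ')

# join accepts any iterable of strings, including a generator
digits = '-'.join(str(n) for n in range(4))
print(digits)
```

Limiting the split count is handy when only the leading fields are structured and the remainder is free-form text that may itself contain the delimiter.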

Despite their simplicity, they can handle surprisingly complex
text-parsing tasks. Moreover, string method calls are very fast because
they are implemented in C language code. For instance, to quickly
replace all tabs in a file with four periods, pipe the file into a
script that looks like this:

from sys import *
stdout.write(('.' * 4).join(stdin.read().split('\t')))

The split call here divides input around tabs, and the join puts
it back together with periods where tabs had been. In this case, the
combination of the two calls is equivalent to using the simpler global
replacement string method call as follows:

stdout.write(stdin.read().replace('\t', '.' * 4))

As we’ll see in the next section, splitting strings is sufficient
for many text-parsing goals.
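To check the equivalence claimed above without piping a file, the two codings can be compared on an in-memory string. This is only a sketch; the sample text is an assumption chosen to include two adjacent tabs.

```python
text = 'a\tb\t\tc\n'                      # hypothetical tab-separated text

via_split_join = ('.' * 4).join(text.split('\t'))
via_replace = text.replace('\t', '.' * 4)

print(repr(via_split_join))
print(via_split_join == via_replace)
```

The two agree even for adjacent tabs because splitting on an explicit delimiter keeps the empty strings between them, unlike a no-argument split, which collapses runs of whitespace.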

Summing Columns in a File

Let’s look next at
some practical applications of string splits and joins. In
many domains, scanning files by columns is a fairly common task. For
instance, suppose you have a file containing columns of numbers output
by another system, and you need to sum each column’s numbers. In Python,
string splitting is the core operation behind solving this problem, as
demonstrated by Example 19-1. As an added bonus, it’s easy to make the
solution a reusable tool in Python by packaging it as an importable
function.

Example 19-1. PP4E\Lang\summer.py

#!/usr/local/bin/python

def summer(numCols, fileName):
    sums = [0] * numCols                             # make list of zeros
    for line in open(fileName):                      # scan file's lines
        cols = line.split()                          # split up columns
        for i in range(numCols):                     # around blanks/tabs
            sums[i] += eval(cols[i])                 # add numbers to sums
    return sums

if __name__ == '__main__':
    import sys
    print(summer(eval(sys.argv[1]), sys.argv[2]))    # '% summer.py cols file'

Notice that we use file iterators here to read line by line,
instead of calling the file readlines method explicitly (recall from
Chapter 4 that iterators avoid loading the entire file into memory all
at once). The file itself is a temporary object, which will be
automatically closed when garbage collected.
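If relying on garbage collection to close the file is a concern, one alternative is a with statement, which closes the file when its block exits. The sketch below is a variation, not the book’s code: the function name summer_closing is hypothetical, it converts with float rather than eval, and it runs against a small temporary file created just for the demo.

```python
import os
import tempfile

def summer_closing(numCols, fileName):
    """Variant of summer() that closes its file explicitly (hypothetical name)."""
    sums = [0] * numCols
    with open(fileName) as file:              # closed on block exit, even on errors
        for line in file:                     # still reads one line at a time
            cols = line.split()
            for i in range(numCols):
                sums[i] += float(cols[i])     # float() used here instead of eval()
    return sums

# Demo on a small temporary file
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('1 5 10\n2 10 20\n')
    path = tmp.name

result = summer_closing(3, path)
os.remove(path)
print(result)
```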

As usual for properly architected scripts, you can both import
this module and call its function, and run it as a shell tool from the
command line. The summer.py script calls split to make a list of strings
representing the line’s columns, and eval to convert column strings to
numbers. Here’s an input file that uses both blanks and tabs to separate
columns, and the result of turning our script loose on it:

C:\...\PP4E\Lang> type table1.txt
1       5       10    2   1.0
2       10      20    4   2.0
3       15      30    8    3
4       20      40   16   4.0

C:\...\PP4E\Lang> python summer.py 5 table1.txt
[10, 50, 100, 30, 10.0]

Also notice that because the summer script uses eval to convert
file text to numbers, you could really store arbitrary Python
expressions in the file. Here, for example, it’s run on a file of
Python code snippets:

C:\...\PP4E\Lang> type table2.txt
2     1+1          1<<1           eval("2")
16    2*2*2*2      pow(2,4)       16.0
3     len('abc')   [1,2,3][2]     {'spam':3}['spam']

C:\...\PP4E\Lang> python summer.py 4 table2.txt
[21, 21, 21, 21.0]

Summing with zips and comprehensions

We’ll revisit eval later in this chapter, when we explore
expression evaluators. Sometimes its power is more than we want: if we
can’t be sure that the strings that we run this way won’t contain
malicious code, for instance, it may be necessary to run them with
limited machine access or use more restrictive conversion tools.
Consider the following recoding of the summer function (this is in file
summer2.py in the examples package if you care to experiment with it):

def summer(numCols, fileName):
    sums = [0] * numCols
    for line in open(fileName):                     # use file iterators
        cols = line.split(',')                      # assume comma-delimited
        nums = [int(x) for x in cols]               # use limited converter
        both = zip(sums, nums)                      # avoid nested for loop
        sums = [x + y for (x, y) in both]           # 3.X: zip is an iterable
    return sums

This version uses int for its conversions from strings to support
only numbers, and not arbitrary and possibly unsafe expressions.
Although the first four lines of this coding are similar to the
original, for variety this version also assumes the data is separated
by commas rather than whitespace, and runs list comprehensions and zip
to avoid the nested for loop statement. This version is also
substantially trickier than the original and so might be less desirable
from a maintenance perspective. If its code is confusing, try adding
print call statements after each step to trace the results of each
operation. Here is its handiwork:

C:\...\PP4E\Lang> type table3.txt
1,5,10,2,1
2,10,20,4,2
3,15,30,8,3
4,20,40,16,4

C:\...\PP4E\Lang> python summer2.py 5 table3.txt
[10, 50, 100, 30, 10]
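The int converter rejects floating-point columns such as 1.0. One possible middle ground, offered here as an assumption rather than anything in the book’s examples package, is the standard library’s ast.literal_eval, which accepts Python literals but not calls or names. The helper names to_number and sum_columns below are hypothetical.

```python
import ast

def to_number(text):
    """Convert text to an int or float, rejecting anything else (hypothetical helper)."""
    value = ast.literal_eval(text.strip())        # literals only: no calls, no names
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError('not a number: %r' % text)
    return value

def sum_columns(lines, numCols):
    """Sum comma-delimited columns, in the style of summer2.py."""
    sums = [0] * numCols
    for line in lines:
        nums = [to_number(col) for col in line.split(',')]
        sums = [x + y for (x, y) in zip(sums, nums)]
    return sums

totals = sum_columns(['1,5,1.0', '2,10,2.5'], 3)
print(totals)

# A code snippet that eval would happily run is refused here
try:
    to_number('__import__("os")')
except ValueError as exc:
    print('rejected:', exc)
```

This keeps mixed integer and floating-point data working while refusing text like the function calls in table2.txt.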

Summing with dictionaries

The summer logic so far works, but it can be even more general: by
making the column numbers a key of a dictionary rather than an offset
in a list, we can remove the need to pass in a number-columns value
altogether. Besides allowing us to associate meaningful labels with
data rather than numeric positions, dictionaries are often more
flexible than lists in general, especially when there isn’t a fixed
size to our problem. For instance, suppose you need to sum up columns
of data stored in a text file where the number of columns is not known
or fixed:

C:\...\PP4E\Lang> python
>>> print(open('table4.txt').read())
001.1 002.2 003.3
010.1 020.2 030.3 040.4
100.1 200.2 300.3

Here, we cannot preallocate a fixed-length list of sums because
the number of columns may vary. Splitting on whitespace extracts the
columns, and float converts to numbers, but a fixed-size list won’t
easily accommodate a set of sums (at least, not without extra code to
manage its size). Dictionaries are more convenient here because we can
use column positions as keys instead of using absolute offsets. The
following code demonstrates interactively (it’s also in file summer3.py
in the examples package):

>>> sums = {}
>>> for line in open('table4.txt'):
...     cols = [float(col) for col in line.split()]
...     for pos, val in enumerate(cols):
...         sums[pos] = sums.get(pos, 0.0) + val
...
>>> for key in sorted(sums):
...     print(key, '=', sums[key])
...
0 = 111.3
1 = 222.6
2 = 333.9
3 = 40.4
>>> sums
{0: 111.3, 1: 222.6, 2: 333.90000000000003, 3: 40.4}

Interestingly, most of this code uses tools added to Python over
the years: file and dictionary iterators, comprehensions, dict.get, and
the enumerate and sorted built-ins were not yet formed when Python was
new. For related examples, also see the tkinter grid examples in
Chapter 9 for another case of eval table magic at work. That chapter’s
table sums logic is a variation on this theme, which obtains the number
of columns from the first line of a data file and tailors its
summations for display in a GUI.
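As one more variation on the dictionary-based sums, the standard library’s collections.defaultdict can stand in for the dict.get call by supplying a 0.0 starting value for any new column position. This is a sketch, not the contents of summer3.py: the function name sum_ragged is hypothetical, and it takes a list of lines rather than a file so that it runs on its own.

```python
from collections import defaultdict

def sum_ragged(lines):
    """Sum whitespace-separated columns when rows vary in length (hypothetical name)."""
    sums = defaultdict(float)                 # missing positions start at 0.0
    for line in lines:
        for pos, col in enumerate(line.split()):
            sums[pos] += float(col)
    return dict(sums)                         # plain dict for the caller

rows = ['1 2 3', '10 20 30 40', '100 200 300']   # made-up data for the demo
totals = sum_ragged(rows)
for key in sorted(totals):
    print(key, '=', totals[key])
```

Because any iterable of lines works, an open file object could be passed in place of the list.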
