Parsing and Unparsing Rule Strings

Splitting
comes in handy for dividing text into columns, but it can
also be used as a more general parsing tool—by splitting more than once
on different delimiters, we can pick apart more complex text. Although
such parsing can also be achieved with more powerful tools, such as the
regular expressions we’ll meet later in this chapter, split-based
parsing is simpler to code in quick prototypes, and may run
faster.

For instance,
Example 19-2
demonstrates one way
that splitting and joining strings can be used to parse sentences in a
simple language. It is taken from a rule-based expert system shell
(holmes) that is written in Python and included in this book’s examples
distribution (more on holmes in a moment). Rule strings in holmes take
the form:

"rule  if , ... then , ..."

Tests and conclusions are conjunctions of terms (“,” means “and”).
Each term is a list of words or variables separated by spaces;
variables start with ?. To use a rule, it is translated to an
internal form—a dictionary with nested lists. To display a rule, it is
translated back to the string form. For instance, given the call:

rules.internal_rule('rule x if a ?x, b then c, d ?x')

the conversion in function internal_rule proceeds as follows:

string = 'rule x if a ?x, b then c, d ?x'
i = ['rule x', 'a ?x, b then c, d ?x']
t = ['a ?x, b', 'c, d ?x']
r = ['', 'x']
result = {'rule':'x', 'if':[['a','?x'], ['b']], 'then':[['c'], ['d','?x']]}

We first split around the if, then around the then, and finally
around rule. The result is the three substrings that were separated
by the keywords. Test and conclusion substrings are split around “,”
first and spaces last. join is used later to convert back (unparse)
to the original string for display. Example 19-2 is the concrete
implementation of this scheme.

Example 19-2. PP4E\Lang\rules.py

def internal_rule(string):
    i = string.split(' if ')
    t = i[1].split(' then ')
    r = i[0].split('rule ')
    return {'rule': r[1].strip(), 'if': internal(t[0]), 'then': internal(t[1])}

def external_rule(rule):
    return ('rule ' + rule['rule'] +
            ' if ' + external(rule['if']) +
            ' then ' + external(rule['then']) + '.')

def internal(conjunct):
    res = []                                # 'a b, c d'
    for clause in conjunct.split(','):      # -> ['a b', ' c d']
        res.append(clause.split())          # -> [['a','b'], ['c','d']]
    return res

def external(conjunct):
    strs = []
    for clause in conjunct:                 # [['a','b'], ['c','d']]
        strs.append(' '.join(clause))       # -> ['a b', 'c d']
    return ', '.join(strs)                  # -> 'a b, c d'

Today we could use list comprehensions and generator expressions
to gain some conciseness here. The internal and external functions,
for instance, could be recoded to simply (see file rules2.py):

def internal(conjunct):
    return [clause.split() for clause in conjunct.split(',')]

def external(conjunct):
    return ', '.join(' '.join(clause) for clause in conjunct)

to produce the desired nested lists and string by combining two
steps into one. This form might run faster; we’ll leave it to the reader
to decide whether it is more difficult to understand. As usual, we can
test components of this module interactively:

>>> import rules
>>> rules.internal('a ?x, b')
[['a', '?x'], ['b']]
>>> rules.internal_rule('rule x if a ?x, b then c, d ?x')
{'then': [['c'], ['d', '?x']], 'rule': 'x', 'if': [['a', '?x'], ['b']]}
>>> r = rules.internal_rule('rule x if a ?x, b then c, d ?x')
>>> rules.external_rule(r)
'rule x if a ?x, b then c, d ?x.'

Parsing by splitting strings around tokens like this takes you
only so far. There is no direct support for recursive nesting of
components, and syntax errors are not handled very gracefully. But for
simple language tasks like this, string splitting might be enough, at
least for prototyping or experimental systems. You can always add a more
robust rule parser later or reimplement rules as Python code or
classes.
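For instance, a more defensive front end could verify a rule’s keywords
before splitting, and report malformed strings with a clearer message
than the IndexError the raw splits would raise. The following is a
minimal sketch along those lines, assuming the internal_rule function of
Example 19-2 is in scope; the check_rule name and its error messages are
illustrative only, not part of the book’s rules module:

def check_rule(string):
    # hypothetical validating wrapper: require each keyword exactly once
    for keyword in ('rule ', ' if ', ' then '):
        if string.count(keyword) != 1:
            raise ValueError('malformed rule: need one %r' % keyword)
    if string.index(' if ') > string.index(' then '):
        raise ValueError('malformed rule: "then" appears before "if"')
    return internal_rule(string)            # parse as before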

More on the holmes expert system shell

So how are these rules actually used? As mentioned, the rule parser we
just met is part of the Python-coded holmes expert system shell.
Holmes is an old system, written around 1993, before Python 1.0. It
implemented both forward and backward chaining inference over rule
sets. For example, the rule:

rule pylike if ?X likes coding, ?X likes spam then ?X likes Python

can be used both to prove whether someone likes Python (chaining
backward, from “then” to “if”), and to deduce that someone likes
Python from a set of known facts (chaining forward, from “if” to
“then”). Deductions may span multiple rules, multiple clauses
represent conjunctions, and rules that name the same conclusion
represent alternatives. holmes also performs simple pattern-matching
along the way to assign the variables that appear in rules (e.g., ?X),
and it is able to explain both its deductions and questions.

Holmes also served as proof of concept for Python in general at
a time when such evidence was still sometimes necessary, and at last
check it still worked unchanged on Python 2.X platforms. However, its
code no longer reflects modern Python best practice. Because of this,
I no longer maintain this system today. In fact, it has suffered from
bit rot so much and for so long that I’ve opted not to revisit it for
this edition at all. Its original Python 0.X code is included in the
book examples package, but it has not been ported to Python 3.X form,
and does not accurately reflect Python as it exists today.

That is, holmes is an ex-system. It has ceased to be. And it
won’t be discussed further here. For more modern AI tools and support
for Python, search the Web. This is a fun field to explore, but holmes
is probably best left to the foggy mists of Python prehistory (and
souls brave enough to try the
port).

Lesson 1: Prototype and Migrate

If you care about
performance, try to use the string object’s methods
rather than things such as regular expressions whenever you can.
Although this is a broad rule of thumb and can vary from release to
release, some string methods may be faster because they have less
work to do.
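As a quick check on your own Python, the standard library’s timeit
module can time the two approaches; this is a measuring sketch, not a
claim about any particular release, and numbers will vary by version
and platform:

import timeit

# time a fixed-delimiter split coded with a string method and with re
print(timeit.timeit("'aaa--bbb--ccc'.split('--')", number=100000))
print(timeit.timeit("re.split('--', 'aaa--bbb--ccc')",
                    setup='import re', number=100000))

The string method typically wins for simple fixed delimiters, but the
gap changes over time, which is why measuring beats assuming.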

In fact, we can learn something from Python’s own history
here. Today’s string object methods began life as Python-coded
functions in the original string module. Due to their prevalence,
they were later optimized by moving their implementation to the C
language. When you imported string, it internally replaced most of
its content with functions imported from the strop C extension
module; strop methods were reportedly 100 to 1,000 times faster than
their Python-coded equivalents at the time.

The result was dramatically faster performance for string client
programs without impacting the interface. That is, string module
clients became instantly faster without having to be modified for the
new C-based module. Of course, these operations evolved further and
were finally moved to string object methods, their only form today.
But this reflects a common pattern in Python work. A similar migration
path was applied to the pickle module we met in Chapter 17—the later
cPickle recoding in Python 2.X and _pickle in 3.X are compatible but
much faster.

This is a great lesson about Python development: modules can
be coded quickly in Python at first and translated to C later for
efficiency if required. Because the interface to Python and C
extension modules is identical (both are imported modules with
callable function attributes), C translations of modules are
backward compatible with their Python prototypes. The only impact of
the translation of such modules on clients usually is an improvement
in performance.
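The Python 2.X import idiom that exploited such C recodings shows how
transparent this can be to clients; a small sketch of the general
pattern (in 3.X the pickle module does the equivalent internally with
_pickle, so no such dance is required):

# Python 2.X idiom: prefer the C-coded module when present,
# fall back on the compatible Python-coded original otherwise
try:
    import cPickle as pickle
except ImportError:
    import pickle

data = pickle.dumps([1, 2, 3])      # client code uses one name either way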

There is normally no need to move every module to C for delivery of an
application: you can pick and choose performance-critical modules
(such as string and pickle) for translation and leave others coded in
Python. Use the timing and profiling techniques discussed in
Chapter 18 to isolate which modules will give the most improvement
when translated to C. Once you do, the next chapter shows how to go
about writing C-based extension modules.

Regular Expression Pattern Matching

Splitting and
joining strings is a simple way to process text, as long as
it follows the format you expect. For more general text analysis tasks
where the structure of your data is not so rigidly defined, Python
provides regular expression matching utilities. Especially for the kinds
of text associated with domains such as the Internet and databases today,
this flexibility can be a powerful ally.

Regular expressions are simply strings that define patterns to be
matched against other strings. Supply a pattern and a string and ask
whether the string matches your pattern. After a match, parts of the
string matched by parts of the pattern are made available to your
script. That is, matches not only give a yes/no answer; they can also
pick out substrings.

Regular expression pattern strings can be complicated (let’s be
honest—they can be downright gross to look at). But once you get the hang
of them, they can replace larger handcoded string search routines—a single
pattern string generally does the work of dozens of lines of manual string
scanning code and may run much faster. They are a concise way to encode
the expected structure of text and extract portions of it.

The re Module

In Python, regular expressions are not part of the syntax of the
Python language itself, but they are supported by the re standard
library module that you must import to use. The module defines
functions for running matches immediately, compiling pattern strings
into pattern objects, matching these objects against strings, and
fetching matched substrings after a match. It also provides tools for
pattern-based splitting, replacing, and so on.

The re module implements a rich regular expression pattern syntax that
tries to be close to that used to code patterns in the Perl language
(regular expressions are a feature of Perl worth emulating). For
instance, re supports the notions of named groups; character classes;
and nongreedy matches—regular expression pattern operators that match
as few characters as possible (other operators always match the
longest possible substring). The re module has also been optimized
repeatedly, and in Python 3.X supports matching for both bytes byte
strings and str Unicode strings. The net effect is that Python’s
pattern support uses Perl-like patterns, but is invoked with a
different top-level module interface.
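For a quick preview of the nongreedy forms just mentioned, compare a
greedy and a nongreedy repetition on the same text; this pair is
illustrative only, not one of the book’s listings:

>>> import re
>>> re.match('<(.*)>', '<a><b>').group(1)       # greedy: as much as possible
'a><b'
>>> re.match('<(.*?)>', '<a><b>').group(1)      # nongreedy: as little as possible
'a'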

I need to point out up front that regular expressions are complex
tools that cannot be covered in depth here. If this area sparks your
interest, the text Mastering Regular Expressions, written by
Jeffrey E. F. Friedl (O’Reilly), is a good next step to take. We won’t
be able to cover pattern construction itself well enough here to turn
you into an expert. Once you learn how to code patterns, though, the
top-level interface for performing matches is straightforward. In
fact, it is so easy to use that we’ll jump right into some live
examples before getting into more details.

First Examples

There are two basic ways to kick off matches: through top-level
function calls and via methods of precompiled pattern objects. The
latter precompiled form is quicker if you will be applying the same
pattern more than once—to all lines in a text file, for instance. To
demonstrate, let’s do some matching on the following strings (see file
re-interactive.txt for all the interactive code run in this section):

>>> text1 = 'Hello spam...World'
>>> text2 = 'Hello spam...other'

The match performed in the following code does not precompile: it
executes an immediate match to look for all the characters between the
words Hello and World in our text strings:

>>> import re
>>> matchobj = re.match('Hello(.*)World', text2)
>>> print(matchobj)
None

When a match fails as it does here (the text2 string doesn’t end in
World), we get back the None object, which is Boolean false if tested
in an if statement.
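That supports the usual truth-test idiom; a small illustrative
fragment, not part of the original session:

>>> if re.match('Hello(.*)World', text2):      # None tests false
...     print('matched')
... else:
...     print('no match')
...
no match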

In the pattern string we’re using here in the first argument to
re.match, the words Hello and World match themselves, and (.*) means
any character (.) repeated zero or more times (*). The fact that it is
enclosed in parentheses tells Python to save away the part of the
string matched by that part of the pattern as a group—a matched
substring available after the match. To see how, we need to make a
match work:

>>> matchobj = re.match('Hello(.*)World', text1)
>>> print(matchobj)
<_sre.SRE_Match object at 0x009D6520>
>>> matchobj.group(1)
' spam...'

When a match succeeds, we get back a match object, which has
interfaces for extracting matched substrings—the group(1) call returns
the portion of the string matched by the first, leftmost,
parenthesized portion of the pattern (our (.*)). As mentioned,
matching is not just a yes/no answer; by enclosing parts of the
pattern in parentheses, it is also a way to extract matched
substrings. In this case, we’ve parsed out the text between Hello and
World. Group number 0 is the entire string matched by the
pattern—useful if you want to be sure your pattern is consuming all
the text you think it is.
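Group 0 can be fetched the same way; a brief continuation along those
lines (not in the original session):

>>> matchobj.group(0)               # group 0 is the entire match
'Hello spam...World'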

The interface for precompiling is similar, but the pattern is implied
in the pattern object we get back from the compile call:

>>> pattobj = re.compile('Hello(.*)World')
>>> matchobj = pattobj.match(text1)
>>> matchobj.group(1)
' spam...'

Again, you should precompile for speed if you will run the pattern
multiple times, and you normally will when scanning files line by
line. Here’s something a bit more complex that hints at the generality
of patterns. This one allows for zero or more blanks or tabs at the
front ([ \t]*), skips one or more after the word Hello ([ \t]+),
captures characters in the middle ((.*)), and allows the final word to
begin with an upper- or lowercase letter ([Ww]); as you can see,
patterns can handle wide variations in data:

>>> patt = '[ \t]*Hello[ \t]+(.*)[Ww]orld'
>>> line = ' Hello spamworld'
>>> mobj = re.match(patt, line)
>>> mobj.group(1)
'spam'
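To show the line-by-line case that the precompile advice targets, here
is a minimal sketch applying the same str pattern to every line of a
file; the data.txt filename and its contents are assumptions made for
illustration:

import re

pattobj = re.compile('[ \t]*Hello[ \t]+(.*)[Ww]orld')   # compile once
with open('data.txt') as infile:
    for line in infile:                                 # match per line
        mobj = pattobj.match(line)
        if mobj:                                        # skip nonmatches
            print(mobj.group(1))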

Notice that we matched a str pattern to a str string in the last
listing. We can also match bytes to bytes in order to handle data such
as encoded text, but we cannot mix the two string types (a constraint
which is true in Python in general—Python wouldn’t have the encoding
information needed to know how to convert between the raw bytes and
the Unicode text):

>>> patt = b'[ \t]*Hello[ \t]+(.*)[Ww]orld'     # both as bytes works too
>>> line = b' Hello spamworld'                  # and returns bytes groups
>>> re.match(patt, line).group(1)               # but cannot mix str/bytes
b'spam'
>>> re.match(patt, ' Hello spamworld')
TypeError: can't use a bytes pattern on a string-like object
>>> re.match('[ \t]*Hello[ \t]+(.*)[Ww]orld', line)
TypeError: can't use a string pattern on a bytes-like object

In addition to the tools these examples demonstrate, there are methods
for scanning ahead to find a match (search), scanning to find all
matches (findall), splitting and replacing on patterns, and so on. All
have analogous module and precompiled call forms. The next section
turns to a few examples to demonstrate more of the basics.

String Operations Versus Patterns

Notice how the preceding example skips optional whitespace and allows
for uppercase or lowercase letters. This underscores why you may want
to use patterns in the first place—they support more general kinds of
text than string object methods can. Here’s another case in point:
we’ve seen that string methods can split on and replace a substring,
but they don’t suffice if the delimiter might be more than one
alternative:

>>> 'aaa--bbb--ccc'.split('--')
['aaa', 'bbb', 'ccc']
>>> 'aaa--bbb--ccc'.replace('--', '...')        # string methods use fixed strings
'aaa...bbb...ccc'
>>> 'aaa--bbb==ccc'.split(['--', '=='])
TypeError: Can't convert 'list' object to str implicitly
>>> 'aaa--bbb==ccc'.replace(['--', '=='], '...')
TypeError: Can't convert 'list' object to str implicitly

Patterns can do similar work, but also can handle alternatives
directly, by virtue of their pattern matching syntax. In the
following, the syntax --|== matches either string -- or string ==; the
syntax [-=] matches either the character - or the character = (a
character set); and the form (?:) can be used to group nested parts of
a pattern without forming a saved substring group (split treats groups
specially):

>>> import re
>>> re.split('--', 'aaa--bbb--ccc')
['aaa', 'bbb', 'ccc']
>>> re.sub('--', '...', 'aaa--bbb--ccc')        # single string case
'aaa...bbb...ccc'
>>> re.split('--|==', 'aaa--bbb==ccc')          # split on -- or ==
['aaa', 'bbb', 'ccc']
>>> re.sub('--|==', '...', 'aaa--bbb==ccc')     # replace -- or ==
'aaa...bbb...ccc'
>>> re.split('[-=]', 'aaa-bbb=ccc')             # single char alternative
['aaa', 'bbb', 'ccc']
>>> re.split('(--)|(==)', 'aaa--bbb==ccc')      # split includes groups
['aaa', '--', None, 'bbb', None, '==', 'ccc']
>>> re.split('(?:--)|(?:==)', 'aaa--bbb==ccc')  # expr part, not group
['aaa', 'bbb', 'ccc']

Similarly, splits can extract simple substrings for fixed delimiters,
but patterns can also handle surrounding context like brackets, mark
parts as optional, ignore whitespace, and more. In the next tests,
\s* means zero or more whitespace characters (a character class); \s+
means one or more of the same; /? matches an optional slash; [a-z] is
any lowercase letter (a range); (.*?) means a saved substring of zero
or more of any character again—but only as many as needed to match the
rest of the pattern (nongreedily); and the groups method is used to
fetch the substrings matched by the parenthesized subpatterns all at
once:

>>> 'spam/ham/eggs'.split('/')
['spam', 'ham', 'eggs']
>>> re.match('(.*)/(.*)/(.*)', 'spam/ham/eggs').groups()
('spam', 'ham', 'eggs')
>>> re.match('<(.*)>/<(.*)>/<(.*)>', '<spam>/<ham>/<eggs>').groups()
('spam', 'ham', 'eggs')
>>> re.match('\s*<(.*)>/?<(.*)>/?<(.*)>', '  <spam>/<ham><eggs>').groups()
('spam', 'ham', 'eggs')
>>> 'Hello pattern world!'.split()
['Hello', 'pattern', 'world!']
>>> re.match('Hello\s*([a-z]*)\s+(.*?)\s*!', 'Hellopattern world !').groups()
('pattern', 'world')

In fact, there’s more than one way to match. The findall method
provides generality that leaves string objects in the dust—it locates
all occurrences of a pattern and returns all the substrings it matched
(or a list of tuples for multiple groups). The search method is
similar but stops at the first match—it’s like match plus an initial
forward scan. In the following, string object finds locate just one
specific string, but patterns can be used to locate and extract
bracketed text anywhere in a string, even pairs with optional text
between:

>>> '<spam>/<ham>/<eggs>'.find('ham')                    # find substring offset
8
>>> re.findall('<(.*?)>', '<spam>/<ham>/<eggs>')         # find all matches/groups
['spam', 'ham', 'eggs']
>>> re.findall('<(.*?)>', '<spam> / <ham><eggs>')
['spam', 'ham', 'eggs']
>>> re.findall('<(.*?)>/?<(.*?)>', '<spam>/<ham> ... <eggs><cheese>')
[('spam', 'ham'), ('eggs', 'cheese')]
>>> re.search('<(.*?)>/?<(.*?)>', 'todays menu: <spam>/<ham>...<eggs>').groups()
('spam', 'ham')

Especially when using findall, the (?s) operator comes in handy to
force . to match end-of-line characters in multiline text; without it,
. matches everything except line ends. The following searches look for
two adjacent bracketed strings with arbitrary text between, with and
without skipping line breaks:

>>> re.findall('<(.*?)>.*<(.*?)>', '<spam> \n <ham>\n<eggs>')       # stop at \n
[]
>>> re.findall('(?s)<(.*?)>.*<(.*?)>', '<spam> \n <ham>\n<eggs>')   # greedy
[('spam', 'eggs')]
>>> re.findall('(?s)<(.*?)>.*?<(.*?)>', '<spam> \n <ham>\n<eggs>')  # nongreedy
[('spam', 'ham')]

To make larger patterns more mnemonic, we can even associate names
with matched substring groups by using the (?P<name>) pattern syntax
and fetch them by name after matches, though this is of limited
utility for findall. The next tests look for strings of “word”
characters (\w) separated by a /—this isn’t much more than a string
split, but parts are named, and search and findall both scan ahead:

>>> re.search('(?P<part1>\w*)/(?P<part2>\w*)', '...aaa/bbb/ccc]').groups()
('aaa', 'bbb')
>>> re.search('(?P<part1>\w*)/(?P<part2>\w*)', '...aaa/bbb/ccc]').groupdict()
{'part1': 'aaa', 'part2': 'bbb'}
>>> re.search('(?P<part1>\w*)/(?P<part2>\w*)', '...aaa/bbb/ccc]').group(2)
'bbb'
>>> re.search('(?P<part1>\w*)/(?P<part2>\w*)', '...aaa/bbb/ccc]').group('part2')
'bbb'
>>> re.findall('(?P<part1>\w*)/(?P<part2>\w*)', '...aaa/bbb ccc/ddd]')
[('aaa', 'bbb'), ('ccc', 'ddd')]

Finally, although basic string operations such as slicing and splits
are sometimes enough, patterns are much more flexible. The following
uses [^ ] to match any character not in the set following the ^, and
escapes a dash within a [] alternative set using \- so it’s not taken
to be a character set range separator. It runs equivalent slices,
splits, and matches, along with a more general match that the other
two cannot approach:

>>> line = 'aaa bbb ccc'
>>> line[:3], line[4:7], line[8:11]             # slice data at fixed offsets
('aaa', 'bbb', 'ccc')
>>> line.split()                                # split data with fixed delimiters
['aaa', 'bbb', 'ccc']
>>> re.split(' +', line)                        # split on general delimiters
['aaa', 'bbb', 'ccc']
>>> re.findall('[^ ]+', line)                   # find non-delimiter runs
['aaa', 'bbb', 'ccc']
>>> line = 'aaa...bbb-ccc / ddd.-/e&e*e'        # handle generalized text
>>> re.findall('[^ .\-/]+', line)
['aaa', 'bbb', 'ccc', 'ddd', 'e&e*e']

At this point, if you’ve never used pattern syntax in the past
your head may very well be spinning (or have blown off entirely!).
Before we go into any further examples, let’s dig into a few of the
details underlying the
re
module and
its patterns.
