Authors: Mark Lutz

Tags: #COMPUTERS / Programming Languages / Python

Programming Python (158 page)

BOOK: Programming Python


Step 3: Putting It All Together—A New Reply Script

There’s one last
step on our path to software maintenance nirvana: we must
recode the reply page script itself to import data that was factored out
to the common module and import the reusable form mock-up module’s
tools. While we’re at it, we move code into functions (in case we ever
put things in this file that we’d like to import in another script), and
all HTML code to triple-quoted string blocks. The result is
Example 15-23. Changing HTML is
generally easier when it has been isolated in single strings like this,
instead of being sprinkled throughout a program.

Example 15-23. PP4E\Internet\Web\cgi-bin\languages2reply.py

#!/usr/bin/python
"""
Same, but for easier maintenance, use HTML template strings, get the
Language table and input key from common module file, and get reusable
form field mockup utilities module for testing.
"""

import cgi, sys
from formMockup import FieldMockup                   # input field simulator
from languages2common import hellos, inputkey        # get common table, name
debugme = False

hdrhtml = """Content-type: text/html\n
<TITLE>Languages</TITLE>
<H1>Syntax</H1><HR>"""

langhtml = """
<H3>%s</H3><P><PRE>
%s
</PRE></P><BR>"""

def showHello(form):                                 # HTML for one language
    choice = form[inputkey].value                    # escape lang name too
    try:
        print(langhtml % (cgi.escape(choice),
                          cgi.escape(hellos[choice])))
    except KeyError:
        print(langhtml % (cgi.escape(choice),
                          "Sorry--I don't know that language"))

def main():
    if debugme:
        form = {inputkey: FieldMockup(sys.argv[1])}  # name on cmd line
    else:
        form = cgi.FieldStorage()                    # parse real inputs

    print(hdrhtml)
    if not inputkey in form or form[inputkey].value == 'All':
        for lang in hellos.keys():
            mock = {inputkey: FieldMockup(lang)}     # not dict(n=v) here!
            showHello(mock)
    else:
        showHello(form)
    print('<HR>')

if __name__ == '__main__': main()

When global debugme is set to True, the script can be tested
offline from a simple command line as before:

C:\...\PP4E\Internet\Web\cgi-bin> python languages2reply.py Python
Content-type: text/html

<TITLE>Languages</TITLE>
<H1>Syntax</H1><HR>

<H3>Python</H3><P><PRE>
print('Hello World')
</PRE></P><BR>
<HR>
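The formMockup module is not listed in this section, so the following is only a minimal sketch of the idea behind it. It assumes that a mocked field needs nothing more than a value attribute, and the names FieldMockup, inputkey, hellos, and lookup here are illustrative stand-ins, not the book's actual files.

```python
class FieldMockup:
    """Hypothetical stand-in for one cgi.FieldStorage field."""
    def __init__(self, text):
        self.value = text                  # same attribute real fields expose

inputkey = 'language'                      # assumed form field name
hellos = {'Python': "print('Hello World')"}    # assumed one-entry table

def lookup(form):
    """Fetch the snippet for the chosen language from any dict-like form."""
    choice = form[inputkey].value
    return hellos.get(choice, "Sorry--I don't know that language")

# A plain dictionary of mock fields can stand in for parsed form inputs.
form = {inputkey: FieldMockup('Python')}
print(lookup(form))
```

Because the script only indexes the form by key and reads .value, a dictionary of such objects is enough to exercise the reply logic without a web server.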

When run online using either the page in Figure 15-25 or an explicitly
typed URL with query parameters, we get the same reply pages we saw for
the original version of this example (we won’t repeat them here again).
This transformation changed the program’s architecture, not its user
interface. Architecturally, though, both the input and reply pages are
now created by Python CGI scripts, not static HTML files.

Most of the code changes in this version of the reply script are
straightforward. If you test-drive these pages, the only differences
you’ll find are the URLs at the top of your browser (they’re different
files, after all), extra blank lines in the generated HTML (ignored by
the browser), and a potentially different ordering of language names in
the main page’s pull-down selection list.

Again, this selection list ordering difference arises because this
version relies on the order of the Python dictionary’s keys list, not on
a hardcoded list in an HTML file. Dictionaries, you’ll recall,
arbitrarily order entries for fast fetches; if you want the selection
list to be more predictable, simply sort the keys list before iterating
over it using the list sort method or the sorted function introduced in
Python 2.4:

for lang in sorted(hellos):               # dict iterator instead of .keys()
    mock = {inputkey: FieldMockup(lang)}
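As a small illustration of the two sorting options just mentioned, here is a sketch that uses a made-up three-entry table in place of the real hellos dictionary. In Python 3.7 and later, dictionaries keep insertion order, so sorting matters mainly when that order is not the one you want to display.

```python
# Hypothetical subset of a language table; values are placeholders.
hellos = {'Python': '...', 'C': '...', 'Java': '...'}

# Option 1: sorted() accepts any iterable and returns a new sorted list.
names = sorted(hellos)

# Option 2: build a list of keys, then sort it in place with list.sort().
keys = list(hellos.keys())
keys.sort()

print(names)      # ['C', 'Java', 'Python']
print(keys)       # ['C', 'Java', 'Python']
```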

Faking Inputs with Shell Variables

If you’re familiar with shells, you might also be able to test
CGI scripts from the command line on some platforms by setting the
same environment variables that HTTP servers set, and then launching
your script. For example, we might be able to pretend to be a web
server by storing input parameters in the QUERY_STRING environment
variable, using the same syntax we employ at the end of a URL string
after the ?:

$ setenv QUERY_STRING "name=Mel&job=trainer,+writer"
$ python tutor5.py
Content-type: text/html

<TITLE>tutor5.py</TITLE>
<H1>Greetings</H1>
<HR>
<H4>Your name is Mel</H4>
<H4>You wear rather (unknown) shoes</H4>
<H4>Your current job: trainer, writer</H4>
<H4>You program in (unknown)</H4>
<H4>You also said:</H4>
<P>(unknown)</P>
<HR>

Here, we mimic the effects of a GET style form submission or explicit
URL. HTTP servers place the query string (parameters) in the shell
variable QUERY_STRING. Python’s cgi module finds them there as though
they were sent by a browser. POST-style inputs can be simulated with
shell variables too, but it’s more complex, so much so that you may be
better off not bothering to learn how. In fact, it may be more robust in
general to mock up inputs with Python objects (e.g., as in
formMockup.py). But some CGI scripts may have additional environment or
testing constraints that merit unique treatment.

[64] Assuming, of course, that this module can be found on the
Python module search path when those scripts are run. Since Python
searches the current directory for imported modules by default, this
generally works without sys.path changes if all of our files are in our
main web directory. For other applications, we may need to add this
directory to PYTHONPATH or use package (directory path) imports.

More on HTML and URL Escapes

Perhaps the subtlest change in the last section’s rewrite is that, for
robustness, this version’s reply script (Example 15-23) also calls
cgi.escape for the language name, not just for the language’s code
snippet. This wasn’t required in languages2.py (Example 15-20) for the
known language names in our selection list table. However, it is not
impossible that someone could pass the script a language name with an
embedded HTML character as a query parameter. For example, a URL such
as:

http://localhost/cgi-bin/languages2reply.py?language=a<b

embeds a < in the language name parameter (the name is a<b). When
submitted, this version uses cgi.escape to properly translate the < for
use in the reply HTML, according to the standard HTML escape conventions
discussed earlier; here is the reply text generated:

<TITLE>Languages</TITLE>
<H1>Syntax</H1><HR>

<H3>a&lt;b</H3><P><PRE>
Sorry--I don't know that language
</PRE></P><BR>
<HR>

The original version in Example 15-18 doesn’t escape the language name,
such that the embedded <b is interpreted as an HTML tag (which makes
the rest of the page render in bold font!). As you can probably tell by
now, text escapes are pervasive in CGI scripting: even text that you may
think is safe must generally be escaped before being inserted into the
HTML code in the reply stream.
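To make the point concrete, here is a hedged sketch of escaping a user-supplied name before it goes into a reply template. It uses html.escape, the newer name for this tool, because cgi.escape is no longer available in recent Python releases. The template text and the render function are assumptions made for illustration, not the book's code.

```python
import html

# Hypothetical reply template with two %s slots: a name and a code snippet.
langhtml = "<H3>%s</H3><P><PRE>\n%s\n</PRE></P><BR>"

def render(choice, text):
    """Escape both user-visible strings before inserting them into HTML."""
    safe_choice = html.escape(choice, quote=False)   # < > & only
    safe_text = html.escape(text, quote=False)
    return langhtml % (safe_choice, safe_text)

reply = render('a<b', "Sorry--I don't know that language")
print(reply)
```

The name a<b comes out as a&lt;b in the reply text, so a browser shows the three characters literally instead of treating <b as the start of a bold tag.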

In fact, because the Web is a text-based medium that combines
multiple language syntaxes, multiple formatting rules may apply: one for
URLs and another for HTML. We met HTML escapes earlier in this chapter;
URLs, and combinations of HTML and URLs, merit a few additional
words.

URL Escape Code Conventions

Notice that in the prior section, although it’s wrong to embed an
unescaped < in the HTML code reply, it’s perfectly all right to include
it literally in the URL string used to trigger the reply. In fact, HTML
and URLs define completely different characters as special. For
instance, although & must be escaped as &amp; inside HTML code, we have
to use other escaping schemes to code a literal & within a URL string
(where it normally separates parameters). To pass a language name like
a&b to our script, we have to type the following URL:

http://localhost/cgi-bin/languages2reply.py?language=a%26b

Here, %26 represents &: the & is replaced with a % followed by the
hexadecimal value (0x26) of its ASCII code value (38). Similarly, as we
suggested at the end of Chapter 13, to name C++ as a query parameter in
an explicit URL, + must be escaped as %2b:

http://localhost/cgi-bin/languages2reply.py?language=C%2b%2b

Sending C++ unescaped will not work, because + is special in URL
syntax: it represents a space. By URL standards, most nonalphanumeric
characters are supposed to be translated to such escape sequences, and
spaces are replaced by + signs. Technically, this convention is known as
the application/x-www-form-urlencoded query string format, and it’s part
of the magic behind those bizarre URLs you often see at the top of your
browser as you surf the Web.
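The escape codes above can be checked at a prompt. The following sketch computes %26 by hand and then uses the standard library's urllib.parse functions to show what a server-side parser would make of the escaped and unescaped forms of C++.

```python
from urllib.parse import quote_plus, parse_qs

# % followed by the two-digit hex value of the character's code.
amp_code = '%' + format(ord('&'), '02X')
print(amp_code)                              # %26

# quote_plus applies the same rule (it emits uppercase hex digits).
print(quote_plus('a&b'))                     # a%26b
print(quote_plus('C++'))                     # C%2B%2B

# Decoding: %2b comes back as +, but a literal + is read as a space.
escaped = parse_qs('language=C%2b%2b')
unescaped = parse_qs('language=C++')
print(escaped)                               # {'language': ['C++']}
print(unescaped)                             # {'language': ['C  ']}
```

Hex digits in escapes are case-insensitive when decoded, so %2b and %2B both stand for +.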

Python HTML and URL Escape Tools

If you’re like me, you probably don’t have the hexadecimal value
of the ASCII code for & committed to memory (though Python’s
hex(ord(c)) can help). Luckily, Python provides tools that automatically
implement URL escapes, just as cgi.escape does for HTML escapes. The
main thing to keep in mind is that HTML code and URL strings are written
with entirely different syntax, and so employ distinct escaping
conventions. Web users don’t generally care, unless they need to type
complex URLs explicitly; browsers handle most escape code details
internally. But if you write scripts that must generate HTML or URLs,
you need to be careful to escape characters that are reserved in either
syntax.

Because HTML and URLs have different syntaxes, Python provides two
distinct sets of tools for escaping their text. In the standard Python
library:

cgi.escape escapes text to be embedded in HTML.

urllib.parse.quote and quote_plus escape text to be embedded in URLs.

The urllib.parse module also has tools for undoing URL escapes
(unquote, unquote_plus), but HTML escapes are undone during HTML parsing
at large (e.g., by Python’s html.parser module). To illustrate the two
escape conventions and tools, let’s apply each tool set to a few simple
examples.

Note

Somewhat inexplicably, Python 3.2 developers have opted to move
and rename the cgi.escape function used throughout this book to
html.escape, to make use of its longstanding original name deprecated,
and to alter its quoting behavior slightly. This is despite the fact
that this function has been around for ages and is used in almost every
Python CGI-based web script: a glaring case of a small group’s notion of
aesthetics trouncing widespread practice in 3.X and breaking working
code in the process. You may need to use the new html.escape name in a
future Python version; that is, unless Python users complain loudly
enough (yes, hint!).

Escaping HTML Code

As we saw earlier, cgi.escape translates code for inclusion within
HTML. We normally call this utility from a CGI script, but it’s just as
easy to explore its behavior interactively:

>>> import cgi
>>> cgi.escape('a < b > c & d "spam"', 1)
'a &lt; b &gt; c &amp; d &quot;spam&quot;'

>>> s = cgi.escape("1<2 <b>hello</b>")
>>> s
'1&lt;2 &lt;b&gt;hello&lt;/b&gt;'

Python’s cgi module automatically converts characters that are special
in HTML syntax according to the HTML convention. It translates <, >,
and &, and with an extra true argument, ", into escape sequences of
the form &X;, where the X is a mnemonic that denotes the original
character. For instance, &lt; stands for the “less than” operator (<)
and &amp; denotes a literal ampersand (&).
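For readers on a Python where only the newer html.escape name is available, the following sketch shows the same translations. The one behavioral difference to be aware of is the quote argument: html.escape turns it on by default and, when it is on, also translates single quotes.

```python
import html

text = 'a < b > c & d "spam"'

# quote=False: only <, >, and & are translated.
basic = html.escape(text, quote=False)
print(basic)            # a &lt; b &gt; c &amp; d "spam"

# Default (quote=True): double quotes are translated as well.
full = html.escape(text)
print(full)             # a &lt; b &gt; c &amp; d &quot;spam&quot;

# With quoting on, single quotes become a numeric escape.
single = html.escape("it's")
print(single)           # it&#x27;s
```

Quoting matters mostly for text placed inside attribute values, where an unescaped quote character would end the attribute early.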

There is no unescaping tool in the CGI module, because HTML escape code
sequences are recognized within the context of an HTML parser, like the
one used by your web browser when a page is downloaded. Python comes
with a full HTML parser, too, in the form of the standard module
html.parser. We won’t go into details on the HTML parsing tools here
(they’re covered in Chapter 19 in conjunction with text processing), but
to illustrate how escape codes are eventually undone, here is the HTML
parser module at work reading back the preceding output:

>>> import cgi, html.parser
>>> s = cgi.escape("1<2 <b>hello</b>")
>>> s
'1&lt;2 &lt;b&gt;hello&lt;/b&gt;'
>>>
>>> html.parser.HTMLParser().unescape(s)
'1<2 <b>hello</b>'

This uses a utility method on the HTML parser class to unquote. In
Chapter 19, we’ll see that using this class for more substantial work
involves subclassing to override methods run as callbacks during the
parse upon detection of tags, data, entities, and more. For more on
full-blown HTML parsing, watch for the rest of this story in Chapter 19.
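The unescape method on the parser class has since been dropped from newer Python releases. A module-level html.unescape function, available since Python 3.4, does the same job. The sketch below repeats the round trip with the html module's function pair.

```python
import html

original = "1<2 <b>hello</b>"

escaped = html.escape(original)         # text safe to embed in a page
print(escaped)                          # 1&lt;2 &lt;b&gt;hello&lt;/b&gt;

restored = html.unescape(escaped)       # what an HTML parser would recover
print(restored)                         # 1<2 <b>hello</b>
```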

Escaping URLs

By contrast, URLs reserve other characters as special and must
adhere to different escape conventions. As a result, we use different
Python library tools to escape URLs for transmission. Python’s
urllib.parse module provides two tools that do the translation work for
us: quote, which implements the standard %XX hexadecimal URL escape code
sequences for most nonalphanumeric characters, and quote_plus, which
additionally translates spaces to + signs. The urllib.parse module also
provides functions for unescaping quoted characters in a URL string:
unquote undoes %XX escapes, and unquote_plus also changes plus signs
back to spaces. Here is the module at work, at the interactive prompt:

>>> import urllib.parse
>>> urllib.parse.quote("a & b #! c")
'a%20%26%20b%20%23%21%20c'

>>> urllib.parse.quote_plus(r"C:\stuff\spam.txt")
'C%3A%5Cstuff%5Cspam.txt'

>>> x = urllib.parse.quote_plus("a & b #! c")
>>> x
'a+%26+b+%23%21+c'

>>> urllib.parse.unquote_plus(x)
'a & b #! c'

URL escape sequences embed the hexadecimal values of nonsafe
characters following a % sign (this is usually their ASCII codes). In
urllib.parse, nonsafe characters are usually taken to include everything
except letters, digits, and a handful of safe special characters (any in
'_.-'), but the two tools differ on forward slashes, and you can extend
the set of safe characters by passing an extra string argument to the
quote calls to customize the translations:

>>> urllib.parse.quote_plus("uploads/index.txt")
'uploads%2Findex.txt'
>>> urllib.parse.quote("uploads/index.txt")
'uploads/index.txt'
>>>
>>> urllib.parse.quote_plus("uploads/index.txt", '/')
'uploads/index.txt'
>>> urllib.parse.quote("uploads/index.txt", '/')
'uploads/index.txt'
>>> urllib.parse.quote("uploads/index.txt", '')
'uploads%2Findex.txt'
>>>
>>> urllib.parse.quote_plus(r"uploads\index.txt")
'uploads%5Cindex.txt'
>>> urllib.parse.quote(r"uploads\index.txt")
'uploads%5Cindex.txt'
>>> urllib.parse.quote_plus(r"uploads\index.txt", '\\')
'uploads\\index.txt'

Note that Python’s cgi module also translates URL escape sequences back
to their original characters and changes + signs to spaces during the
process of extracting input information. Internally, cgi.FieldStorage
automatically calls urllib.parse tools which unquote if needed to parse
and unescape parameters passed at the end of URLs. The upshot is that
CGI scripts get back the original, unescaped URL strings, and don’t
need to unquote values on their own. As we’ve seen, CGI scripts don’t
even need to know that inputs came from a URL at all.
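The unquoting step can be seen in isolation by calling the urllib.parse query parsers directly. This is only a sketch of the kind of parsing described above, not a trace of what cgi.FieldStorage does internally. The query strings are the ones used earlier in this section.

```python
from urllib.parse import parse_qs, parse_qsl

# Same text a server would place in QUERY_STRING for a GET request.
query = "name=Mel&job=trainer,+writer"

fields = parse_qs(query)            # dict: each name maps to a list of values
print(fields)                       # {'name': ['Mel'], 'job': ['trainer, writer']}

# parse_qsl returns (name, value) pairs instead; %26 is decoded back to &.
pairs = parse_qsl("language=a%26b")
print(pairs)                        # [('language', 'a&b')]
```

Values arrive already decoded, which is why the reply scripts in this chapter never call unquote themselves.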

Escaping URLs Embedded in HTML Code

We’ve seen how to escape text inserted into both HTML and URLs.
But what do we do for URLs inside HTML? That is, how do we escape when
we generate and embed text inside a URL, which is itself embedded inside
generated HTML code? Some of our earlier examples used hardcoded URLs
with appended input parameters inside hyperlink tags; the file
languages2.py, for instance, prints HTML that includes a URL:

Because the URL here is embedded in HTML, it must at least be
escaped according to HTML conventions (e.g., any < characters must
become &lt;), and any spaces should be translated to + signs per URL
conventions. A cgi.escape(url) call followed by the string
url.replace(" ", "+") would take us this far, and would probably suffice
for most cases.

That approach is not quite enough in general, though, because HTML
escaping conventions are not the same as URL conventions. To robustly
escape URLs embedded in HTML code, you should instead call
urllib.parse.quote_plus on the URL string, or at least most of its
components, before adding it to the HTML text. The escaped result also
satisfies HTML escape conventions, because urllib.parse translates more
characters than cgi.escape, and the % in URL escapes is not special to
HTML.
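Here is a rough sketch of why the HTML-escape-plus-replace shortcut falls short for a value such as a&b. The received function is a hypothetical helper that imitates a browser undoing HTML escapes and a server parsing the query, and html.escape stands in for cgi.escape.

```python
import html
from urllib.parse import quote_plus, parse_qs

value = 'a&b'                               # parameter value to transmit
base = 'languages2reply.py?language='

# Shortcut: HTML-escape the whole URL, then turn spaces into + signs.
naive = html.escape(base + value).replace(' ', '+')

# Robust: URL-escape the component before it is added to the URL text.
robust = base + quote_plus(value)

def received(href_text):
    """Undo HTML escapes as a browser would, then parse the query string."""
    url = html.unescape(href_text)
    return parse_qs(url.split('?', 1)[1])

print(received(naive))      # {'language': ['a']}
print(received(robust))     # {'language': ['a&b']}
```

In the shortcut version the & inside the value survives as a parameter separator, so the server sees only a as the language name.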

HTML and URL conflicts: &

But there is one more astonishingly subtle (and thankfully rare)
wrinkle: you may also have to be careful with & characters in URL
strings that are embedded in HTML code (e.g., within hyperlink tags).
The & symbol is both a query parameter separator in URLs (?a=1&b=2)
and the start of escape codes in HTML (&lt;). Consequently, there is a
potential for collision if a query parameter name happens to be the
same as an HTML escape sequence code. The query parameter name amp, for
instance, that shows up as &amp=1 in parameters two and beyond on the
URL may be treated as an HTML escape by some HTML parsers, and
translated to &=1.

Even if parts of the URL string are URL-escaped, when more than
one parameter is separated by a &, the & separator might also have to
be escaped as &amp; according to HTML conventions. To see why, consider
the following HTML hyperlink tag with query parameter names name, job,
amp, sect, and lt:

<A HREF="file.py?name=a&job=b&amp=c&sect=d&lt=e">hello</A>

When rendered in most browsers tested, including Internet
Explorer on Windows 7, this URL link winds up looking incorrectly like
this (the S character in the first of these is really a non-ASCII
section marker):

file.py?name=a&job=b&=cS=d<=e           result in IE
file.py?name=a&job=b&=c%A7=d%3C=e       result in Chrome (0x3C is <)

The first two parameters are retained as expected (name=a, job=b),
because name is not preceded with an & and &job is not recognized as a
valid HTML character escape code. However, the &amp, &sect, and &lt
parts are interpreted as special characters because they do name valid
HTML escape codes, even without a trailing semicolon.

To see this for yourself, open the example package’s
test-escapes.html file in your browser, and highlight or select its
link; the query names may be taken as HTML escapes. This text appears to
parse correctly in Python’s own HTML parser module described earlier
(unless the parts in question also end in a semicolon); that might
help for replies fetched manually with urllib.request, but not when
rendered in browsers:

>>> from html.parser import HTMLParser
>>> html = open('test-escapes.html').read()
>>> HTMLParser().unescape(html)
'<HTML>\n<A HREF="file.py?name=a&job=b&amp=c&sect=d&lt=e">hello</A>\n</HTML>'
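On newer Python versions the picture appears to differ from the session above. The module-level html.unescape function follows HTML5 rules and, as far as I can tell, does translate the semicolon-less names, much as the browsers did. The sketch below applies it to the link text alone. Treat it as an observation about html.unescape, not about the older parser method used in the text.

```python
import html

link = 'file.py?name=a&job=b&amp=c&sect=d&lt=e'

# &job is left alone (no such escape code), but &amp, &sect, and &lt
# are decoded even though no semicolon follows them.
decoded = html.unescape(link)
print(decoded)
```

The decoded text matches the mangled form shown for Internet Explorer, with a real section mark character where that listing shows S.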

Avoiding conflicts

What to do then? To make this work as expected in all cases, the
& separators should generally be escaped if your parameter names may
clash with an HTML escape code:

<A HREF="file.py?name=a&amp;job=b&amp;amp=c&amp;sect=d&amp;lt=e">hello</A>

Browsers render this fully escaped link as expected (open
test-escapes2.html to test), and Python’s HTML parser does the right
thing as well:

file.py?name=a&job=b&amp=c&sect=d&lt=e      result in both IE and Chrome

>>> h = '<A HREF="file.py?name=a&amp;job=b&amp;amp=c&amp;sect=d&amp;lt=e">hello</A>'
>>> HTMLParser().unescape(h)
'<A HREF="file.py?name=a&job=b&amp=c&sect=d&lt=e">hello</A>'

Because of this conflict between HTML and URL syntax, most
server tools (including Python’s urllib.parse query-parameter parsing
tools employed by Python’s cgi module) also allow a semicolon to be used
as a separator instead of &. The following link, for example, works the
same as the fully escaped URL, but does not require an extra HTML
escaping step (at least not for the ;):

file.py?name=a;job=b;amp=c;sect=d;lt=e
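Whether a semicolon is accepted depends on the parser and its version. In recent Python releases, to the best of my knowledge, urllib.parse.parse_qs splits only on & by default and takes a separator argument to select a different character. The sketch below assumes a Python whose parse_qs accepts that argument.

```python
from urllib.parse import parse_qs

query = 'name=a;job=b;amp=c;sect=d;lt=e'

# Ask explicitly for semicolon-separated parameters
# (assumes a Python version that supports the separator argument).
fields = parse_qs(query, separator=';')
print(fields)
# {'name': ['a'], 'job': ['b'], 'amp': ['c'], 'sect': ['d'], 'lt': ['e']}
```

If your server-side tools do not split on semicolons, escaping the & separators for HTML remains the portable choice.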

Python’s html.parser unescape tool allows the semicolons to pass
unchanged, too, simply because they are not significant in HTML code.
To fully test all three of these link forms for yourself at once, place
them in an HTML file, open the file in your browser using its
http://localhost/badlink.html URL, and view the links when followed.
The HTML file in Example 15-24 will suffice.

Example 15-24. PP4E\Internet\Web\badlink.html

<HTML><BODY>
<p><a href=
"cgi-bin/badlink.py?name=a&job=b&amp=c&sect=d&lt=e">unescaped</a>
<p><a href=
"cgi-bin/badlink.py?name=a&amp;job=b&amp;amp=c&amp;sect=d&amp;lt=e">escaped</a>
<p><a href=
"cgi-bin/badlink.py?name=a;job=b;amp=c;sect=d;lt=e">alternative</a>
</BODY></HTML>

When these links are clicked, they invoke the simple CGI script
in Example 15-25. This script displays the inputs sent from the client
on the standard error stream to avoid any additional translations (for
our locally running web server in Example 15-1, this routes the printed
text to the server’s console window).

Example 15-25. PP4E\Internet\Web\cgi-bin\badlink.py

import cgi, sys
form = cgi.FieldStorage()      # print all inputs to stderr; stdout=reply page
for name in form.keys():
    print('[%s:%s]' % (name, form[name].value), end=' ', file=sys.stderr)

Following is the (edited for space) output we get in our local
Python-coded web server’s console window for following each of the
three links in the HTML page in turn using Internet Explorer. The
second and third yield the correct parameters set on the server as a
result of the HTML escaping or URL conventions employed, but the
accidental HTML escapes cause serious issues for the first unescaped
link, because the client’s HTML parser translates these in unintended
ways (results are similar under Chrome, but the first link displays the
non-ASCII section mark character with a different escape sequence):

mark-VAIO - - [16/Jun/2010 10:43:24] b'[:c\xa7=d<=e] [job:b] [name:a] '
mark-VAIO - - [16/Jun/2010 10:43:24] CGI script exited OK
mark-VAIO - - [16/Jun/2010 10:43:27] b'[amp:c] [job:b] [lt:e] [name:a] [sect:d]'
mark-VAIO - - [16/Jun/2010 10:43:27] CGI script exited OK
mark-VAIO - - [16/Jun/2010 10:43:30] b'[amp:c] [job:b] [lt:e] [name:a] [sect:d]'
mark-VAIO - - [16/Jun/2010 10:43:30] CGI script exited OK

The moral of this story is that unless you can be sure that the
names of all but the leftmost URL query parameters embedded in HTML
are not the same as the name of any HTML character escape code like
amp, you should generally either use a semicolon as a separator, if
supported by your tools, or run the entire URL through cgi.escape after
escaping its parameter names and values with urllib.parse.quote_plus:

>>> link = 'file.py?name=a&job=b&amp=c&sect=d&lt=e'

# escape for HTML
>>> import cgi
>>> cgi.escape(link)
'file.py?name=a&amp;job=b&amp;amp=c&amp;sect=d&amp;lt=e'

# escape for URL
>>> import urllib.parse
>>> elink = urllib.parse.quote_plus(link)
>>> elink
'file.py%3Fname%3Da%26job%3Db%26amp%3Dc%26sect%3Dd%26lt%3De'

# URL satisfies HTML too: same
>>> cgi.escape(elink)
'file.py%3Fname%3Da%26job%3Db%26amp%3Dc%26sect%3Dd%26lt%3De'

Having said that, I should add that some examples in this book
do not escape & URL separators embedded within HTML simply because their
URL parameter names are known not to conflict with HTML escapes. In
fact, this concern is likely to be rare in practice, since your program
usually controls the set of parameter names it expects. This is not,
however, the most general solution, especially if parameter names may be
driven by a dynamic database; when in doubt, escape much and often.
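One way to follow the "escape much and often" advice is to wrap both steps in a small helper. The make_href function below is a hypothetical sketch, not part of the book's examples. It uses urllib.parse.urlencode to URL-escape each name and value and join them with &, then html.escape (the newer name for cgi.escape) to make the result safe inside an HTML attribute.

```python
import html
from urllib.parse import urlencode, parse_qs

def make_href(script, params):
    """Build script?query text that is safe to embed in an HTML attribute."""
    query = urlencode(params)                 # URL-escape names and values
    return html.escape(script + '?' + query)  # then escape & for HTML

params = {'name': 'a', 'job': 'b', 'amp': 'c', 'sect': 'd', 'lt': 'e'}
href = make_href('file.py', params)
print(href)   # file.py?name=a&amp;job=b&amp;amp=c&amp;sect=d&amp;lt=e

# Round trip: undo the HTML escapes, then parse the query as a server would.
url = html.unescape(href)
roundtrip = parse_qs(url.split('?', 1)[1])
print(roundtrip)
```

The round trip returns every parameter under its original name, including the three that collide with HTML escape codes.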

“How I Learned to Stop Worrying and Love the Web”

Lest the HTML and URL formatting rules sound too clumsy (and
send you screaming into the night!), note that the HTML and URL
escaping conventions are imposed by the Internet itself, not by
Python. (As you’ve learned by now, Python has a different mechanism
for escaping special characters in string constants with
backslashes.) These rules stem from the fact that the Web is based
on the notion of shipping formatted text strings around the planet,
and are almost surely influenced by the tendency of different
interest groups to develop very different notations.

You can take heart, though, in the fact that you often don’t
need to think in such cryptic terms; when you do, Python automates
the translation process with library tools. Just keep in mind that
any script that generates HTML or URLs dynamically probably needs to
call Python’s escaping tools to be robust. We’ll see both the HTML
and the URL escape tool sets employed frequently in later examples
in this chapter and the next. Moreover, web development frameworks
and tools such as Zope and others aim to get rid of some of the
low-level complexities that CGI scripters face. And as usual in
programming, there is no substitute for brains; amazing technologies
like the Internet come at an inevitable cost in complexity.
