Read Programming Python Online

Authors: Mark Lutz

Tags: #COMPUTERS / Programming Languages / Python

Programming Python (161 page)

BOOK: Programming Python
4.06Mb size Format: txt, pdf, ePub
ads

This file uploaded and saved in the uploads directory is identical
to the original (run an
fc
command on Windows to
verify this). Incidentally, we can also verify the upload with the
getfile
program we wrote in the prior
section. Simply access the selection page to type the pathname of the
file on the server, as shown in
Figure 15-35
.

Figure 15-35. Verifying putfile with getfile—selection

If the file upload is successful, the resulting viewer page we
will obtain looks like
Figure 15-36
. Since the user
“nobody” (CGI scripts) was able to write the file, “nobody” should be
able to view it as well (bad grammar perhaps, but true
nonetheless).

Figure 15-36. Verifying putfile with getfile—response

Notice the URL in this page’s address field—the browser translated
the
/
character we typed into the
selection page to a
%2F
hexadecimal
escape code before adding it to the end of the URL as a parameter. We
met URL escape codes like this earlier in this chapter. In this case,
the browser did the translation for us, but the end result is as if we
had manually called one of the
urllib.parse
quoting functions on the file
path string.

Technically, the
%2F
escape
code here represents the standard URL translation for non-ASCII
characters, under the default encoding scheme browsers employ. Spaces
are usually translated to
+
characters as well. We can often get away without manually translating
most non-ASCII characters when sending paths explicitly (in typed URLs).
But as we saw earlier, we sometimes need to be careful to escape
characters (e.g.,
&
) that have
special meaning within URL strings with
urllib.parse
tools.

Handling client path formats

In the end, the
putfile.py
script stores
the uploaded file on the server within a hardcoded
uploaddir
directory, under the filename at the
end of the file’s path on the client (i.e., less its client-side
directory path). Notice, though, that the
splitpath
function in this script needs to
do extra work to extract the base name of the file on the right. Some
browsers may send up the filename in the directory path format used on
the
client
machine; this path format may not be
the same as that used on the server where the CGI script runs. This
can vary per browser, but it should be addressed for
portability.

The standard way to split up paths,
os.path.split
, knows how to extract the base
name, but only recognizes path separator characters used on the
platform on which it is running. That is, if we run this CGI script on
a Unix machine,
os.path.split
chops
up paths around a
/
separator. If a
user uploads from a DOS or Windows machine, however, the separator in
the passed filename is
\
, not
/
. Browsers running on some
Macintosh platforms may send a path that is more different
still.

To handle client paths generically, this script imports
platform-specific path-processing modules from the Python library for
each client it wishes to support, and tries to split the path with
each until a filename on the right is found. For instance,
posixpath
handles paths sent from Unix-style
platforms, and
ntpath
recognizes
DOS and Windows client paths. We usually don’t import these modules
directly
since
os.path.split
is automatically loaded with the correct one for the underlying
platform, but in this case, we need to be specific since the path
comes from another machine. Note that we could have instead coded the
path splitter logic like this to avoid some split calls:

def splitpath(origpath):                                    # get name at end
basename = os.path.split(origpath)[1] # try server paths
if basename == origpath: # didn't change it?
if '\\' in origpath:
basename = origpath.split('\\')[-1] # try DOS clients
elif '/' in origpath:
basename = origpath.split('/')[-1] # try Unix clients
return basename

But this alternative version may fail for some path formats
(e.g., DOS paths with a drive but no backslashes). As is, both options
waste time if the filename is already a base name (i.e., has no
directory paths on the left), but we need to allow for the more
complex cases generically.

This upload script works as planned, but a few caveats are worth
pointing out before we close the book on this example:

  • Firstly,
    putfile
    doesn’t
    do anything about cross-platform incompatibilities in
    file
    names
    themselves. For instance,
    spaces in a filename shipped from a DOS client are not translated
    to nonspace characters; they will wind up as spaces in the
    server-side file’s name, which may be legal but are difficult to
    process in some scenarios.

  • Secondly, reading line by line means that this CGI script is
    biased toward uploading text files, not binary datafiles. It uses
    a
    wb
    output open mode to retain
    the binary content of the uploaded file, but it assumes the data
    is text in other places, including the reply page. See
    Chapter 4
    for more about binary file
    modes. This is all largely a moot point in Python 3.1, though, as
    binary file uploads do not work at all (see
    ); in future
    release, though, this would need to be addressed.

If you run into any of these limitations, you will have crossed
over into the domain of suggested exercises.

CGI File Upload Limitations in 3.1

Regrettably, I need to document the fact that Python’s
standard library support for CGI file uploads is partially broken in
Python 3.1, the version used for this edition. In short, the
cgi
module’s internal parsing step fails today
with an exception if any binary file data or incompatible text file
data is uploaded. This exception occurs before the script has a
chance to intervene, making simple workarounds nonviable. CGI
uploads worked in Python 2.X because strings handled bytes, but fail
in 3.X today.

This regression stems in part from the fact that the
cgi
module uses the
email
package’s parser to extract incoming
multipart data for files, and is thus crippled by some of the very
same
email
package issues we
explored in detail in
Chapter 13
—its
email parser requires
str
for the
full text of a message to be parsed, but this is invalid for some
CGI upload data. As mentioned in
Chapter 13
, the data transmitted for CGI
file uploads might have
mixed
text and binary
data—including raw binary data that is
not
MIME-encoded
, text of any
encoding, and even arbitrary combinations of these. The current
email
package’s requirement to
decode this to
str
for parsing is
utterly incompatible, though the
cgi
module’s own code seems suspect for
some cases as well.

If you want to see for yourself how data is actually uploaded
by browsers, see and run the HTML and Python files named
test-cgiu-uploads-bug*
in the examples
package to upload text, binary, and mixed type files:

  • test-cgi-uploads-bug.html/py
    attempts
    to parse normally, which works for some text files but always
    fails for binary files with a Unicode decoding error

  • test-cgi-uploads-bug0.html/py
    tries
    binary mode for the input stream, but always fails with type
    errors for both text and binary because of
    email
    ’s
    str
    requirement

  • test-cgi-uploads-bug1.html/py
    saves
    the input stream for a single file

  • test-cgi-uploads-bug.html/py
    saves
    the input stream for multiple files

The last two of these scripts simply read the data in binary
mode and save it in binary mode to a file for inspection, and
display two headers passed in environment variables which are used
for parsing (a “multipart/form-data” content type and boundary,
along with a content length). Trying to parse the saved input data
with the
cgi
module fails unless
the data is entirely text that is compatible with that module’s
encoding assumptions. Really, because the data can mix text and raw
binary arbitrarily, a correct parser will need to read it as bytes
and switch between text and binary processing freely.

It seems likely that this will be improved in the future, but
perhaps not until Python 3.3 or later. Nearly two years after 3.0’s
release, though, this book project has found itself playing the role
of beta tester more often than it probably should. This primarily
derives from the fact that implications of the Python 3.X
str
/
bytes
dichotomy were not fully resolved in
Python’s own libraries prior to release. This isn’t meant to
disparage people who have contributed much time and effort to 3.X
already, of course. As someone who remembers 0.X, though, this
situation seems less than ideal.

Writing a replacement for the
cgi
module and the
email
package code it uses—the only true
viable workaround—is not practical given this book project’s
constraints. For now, the CGI scripts that perform file uploads in
this book will only work with text files, and then only with text
files of compatible encodings. This extends to email attachments
uploaded to the PyMailCGI webmail case study of the next chapter—yet
another reason why that example was not expanded with new
functionality in this edition as much as the preceding chapter’s
PyMailGUI. Being unable to attach images to emails this way is a
severe functional limitation, which limits scope in general.

For updates on the probable fix for this issue in the future,
watch this book’s website (described in the
Preface
). A fix seems likely to be incompatible with
current library module APIs, but short of writing every new system
from scratch, such is reality in the real world of software
development. (And no, “running away more” is not an option…)

More Than One Way to Push Bits over the Net

Finally, let’s discuss some context. We’ve seen three
getfile
scripts at this point in the book. The
one in this chapter is different from the other two we wrote in earlier
chapters, but it accomplishes a similar goal:

  • This chapter’s
    getfile
    is
    a server-side CGI script that displays files over the
    HTTP protocol (on port 80).

  • In
    Chapter 12
    , we built a client-
    and server-side
    getfile
    to
    transfer with raw sockets (on port 50001).

  • In
    Chapter 13
    , we implemented a
    client-side
    getfile
    to ship over
    FTP (on port 21).

Really, the
getfile
CGI script
in this chapter simply displays files only, but it can be considered a
download tool when augmented with cut-and-paste operations in a web
browser. Moreover, the CGI- and HTTP-based
putfile
script here is also different from the
FTP-based
putfile
in
Chapter 13
, but it can be considered an
alternative to both socket and FTP uploads.

The point to notice is that there are a variety of ways to ship
files around the
Internet—
sockets,
FTP, and HTTP (web pages) can move files between computers. Technically
speaking, we can transfer files with other techniques and protocols,
too—Post Office Protocol (POP) email,
Network News Transfer Protocol (NNTP) news, Telnet, and
so on
.

Each technique has unique properties but does similar work in the
end: moving bits over the Net. All ultimately run over sockets on a
particular port, but protocols like FTP and HTTP add additional
structure to the socket layer, and application models like CGI add both
structure and programmability.

In the next chapter, we’re going to use what we’ve learned here to
build a more substantial application that runs entirely on the
Web—PyMailCGI, a web-based email tool, which allows us to send and view
emails in a browser, process email attachments, and more. At the end of
the day, though, it’s mostly just bytes over sockets, with a user
interface.

CGI Downloads: Forcing the Issue

In
Example 15-27
, we
wrote a script named
getfile.py
, a Python CGI
program designed to display any public server-side file, within a web
browser (or other recipient) on the requesting client machine. It uses
a Content type of
text/plain
or
text/html
to make the requested
file’s text show up properly inside a browser. In the description, we
compared
getfile.py
to a generalized CGI download
tool, when augmented with cut-and-paste or save-as
interactions.

While true,
getfile.py
was intended to
mostly be a file display tool only, not a CGI download demo. If you
want to truly and directly download a file by CGI (instead of
displaying it in a browser or opening it with an application), you can
usually force the browser to pop up a Save As dialog for the file on
the client by supplying the appropriate Content-type line in the CGI
reply.

Browsers decide what to do with a file using either the file’s
suffix (e.g.,
xxx.jpg
is interpreted as an
image), or the Content-type line (e.g.,
text/html
is HTML code). By using various
MIME header line settings, you can make the datatype unknown and
effectively render the browser clueless about data handling. For
instance, a Content type of
application/octet-stream
in the CGI reply
generally triggers the standard Save As dialog box in a
browser.

This strategy is sometimes frowned on, though, because it leaves
the true nature of the file’s data ambiguous—it’s usually better to
let the user/client decide how to handle downloaded data, rather than
force the Save As behavior. It also has very little to do with Python;
for more details, consult a CGI-specific text, or try a web search on
“CGI download.”

BOOK: Programming Python
4.06Mb size Format: txt, pdf, ePub
ads

Other books

A Matter of Breeding by J Sydney Jones
FIRE AND FOG by Unknown
Night After Night by Phil Rickman
Tempted by the Highland Warrior by Michelle Willingham
Remember by Karen Kingsbury
The Accidental TV Star by Evans, Emily
Airships by Barry Hannah, Rodney N. Sullivan
Lady of the Gun by Adams, Faye