
Version 1135 (2006-06-25 18:03:00)
License
Copyright © 2005-06 Python Software Foundation
Permission is hereby granted, free of charge, to any person
obtaining a copy of this software and associated documentation files
(the "Software"), to deal in the Software without restriction,
including without limitation the rights to use, copy, modify, merge,
publish, distribute, sublicense, and/or sell copies of the Software,
and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:
The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Introduction
Introduction
- This course will teach you how to design, build, maintain, and share programs efficiently
- Focus on the equivalent of good laboratory technique
- The 20% of ideas that account for 80% of real world use
- Software carpentry, rather than software engineering
- About putting an extension on the house, rather than building the Channel Tunnel
- Everything that will make you more productive will also improve the quality of what you build
- Help computational science deserve the second half of its name
Self Assessment
- 1 for “yes”, 0 for “no” or “not relevant”, and -1 if you don't understand the question
- Do you use version control?
- Can you rebuild everything with one command?
- Do you build the software from scratch daily?
- Do you have an automated test suite?
- Do you run the suite before checking in changes?
- Do you know how much code your tests cover?
- Do you have a bug database?
- Do you use a symbolic debugger?
- Do you use assertions and other defensive programming techniques?
- Can you trace everything you release (not just software) back to its origins?
- Do you document as you program?
- Do you keep your documentation in your source files?
- Can you set up a development environment on a fresh machine without heroic effort?
- Is there a searchable archive of discussions about the project?
- Do you use a style checker to ensure that your software is written in a uniform, readable way?
- Your score:
- Negative: you will find this course challenging, but rewarding
- 0-5: this course will fill in the gaps in what you already know
- 6-10: you will be able to apply the ideas in this course to your own projects immediately
- 11 and up: you should probably go straight to the primary material in the Bibliography
The State of Play
- Computers are as important to scientists as telescopes and test tubes
- Simulate things that are too big, too small, too fast, too slow, or too dangerous to study in the lab
- Analyze volumes of data that would overwhelm human “computers”
- Many scientists now spend much of their time writing and maintaining software
- But most have never been taught how to do this efficiently
- It's a long way from the loops and arrays of first year to simulating bone development in foetal marsupials
- Like being shown how to differentiate polynomials, then expected to invent the rest of calculus
- As a result, scientists spend far more time programming than they ought to
- And they still don't know how trustworthy their programs are
Meeting Standards
- Experimental results are only publishable if they are believed to be correct and reproducible
- Equipment calibrated, samples uncontaminated, relevant steps recorded, etc.
- In practice, rely on expectations and cultural norms
- Drilled into people starting with their first high school chemistry class
- Only actually check work that is already under suspicion
- How well do computational scientists meet these standards?
- Correctness of code rarely questioned
- We all know programs are buggy…
- …but when was the last time you saw a paper rejected because of concerns over software quality?
- Reproducibility often nonexistent
- How many people can reproduce, much less trace, each computational result in their thesis?
The Grass Isn't That Much Greener
- The bad news is that things aren't that much better in industry
- Commercial projects of all sizes routinely go over time and over budget
- What they deliver is often incomplete, riddled with bugs, and not what the customer actually wanted
- How is this possible?
- Low expectations
- Like American cars in the 1970s
- Lack of accountability
- Hard to sue software developers
- Most shrink-wrap licenses effectively say, “This CD could be blank, and we wouldn't have to give you back your money.”
Hidden in Plain Sight
- The good news is that we've had solutions for these problems for years
- They just aren't evenly distributed
- This is one of the reasons good programmers are up to 28 times better than bad ones [Glass 2002]
- Improving quality improves productivity
- The more effort you put into making sure it's right the first time, the less total time it'll take to get it built
- The tools and techniques that help you write better code also help you write more code, faster
- Version control
- Test-driven development
- Task automation
- Symbolic debuggers
- And more that we'll meet later
The Times They Are A-Changin'
- The current situation is clearly unstable
- The only direction standards and expectations can go is up
- One high-profile lawsuit could set a precedent for the whole industry
- Change can happen almost overnight
- Like the American car market when German and Japanese imports appeared in the 1970s
- Offshoring is currently the biggest pressure
- Jobs that can move overseas will
- As in manufacturing, only work that adds significant value will remain
- Which means that programmers in affluent societies either move up, or move out
- This course is just as relevant if you're in Mexico, India, China, or Hungary
- “Ship, then fix” doesn't work if you're eleven time zones away from the customer
- Your chances of getting a second project are much better if you deliver the first one on spec and on time
This Course
- Introduce some basic tools
- Instant gratification
- Serve as a guide to good practice
- Every software tool implicitly embodies someone's ideas about how a task should be done
- Show how to build tools like these
- What goes into the software
- How to create it
- Don't expect you to write your own version control system or bug tracker
- But do expect you to automate common tasks, make your products extensible, etc.
- Show how to apply these ideas to other tasks
- Most modern well-engineered applications are implemented as specialized sets of tools
- (Where “well-engineered” means flexible, reliable, and maintainable)
- Keep in mind that it's impossible to cram an undergraduate degree and several years of industry experience into one course
- If you really want to improve, you have to treat this course as a starting block, not a finishing line
Setting Up
- Some previous programming experience
- for loops, if/then/else, …
- Function calls, arrays, file I/O, …
- If you haven't done at least this much, you probably won't understand the problems this course is trying to solve
- Some tools
- Python (version 2.3 or higher)
- A Unix shell
- Use Cygwin if you're on Windows
- An editor (preferably one with syntax highlighting and macros)
- We'll look at integrated development environments later in the course
- Subversion
- Some time
- You can only learn by doing
- So expect to spend 2-3 hours outside class for each lecture
A Note on Tool Choice
- Most of the tools used in this course are freely available under various open source licenses
- That doesn't mean that free tools are always best
- “Linux is only free if your time has no value.” (Jamie Zawinski)
- Commercial tools are often more complete than their free counterparts
- Not least because there are some jobs people will only do if they're paid to
- If you're willing to spend $200 on a good chair, you should be willing to spend $200 on software, too
- If you want to use open source software:
- Be aware that many companies provide for-pay technical support
- Read [Fogel 2005] (and possibly [Rosen 2005])
- Think about contributing something yourself
- Great way to make contacts and learn new skills
- Best way to get exactly what you want
Contributing
- This course is open source, too
- Contributions are very welcome
Recommended Reading
- [Glass 2002] summarizes what we actually know about programmers' productivity
- [Hunt & Thomas 1999] and [Gunderloy 2004] are about the things that distinguish good programmers from bad ones
- A book on the Unix shell
- A book on Python
- [Lutz & Ascher 2003] is the standard introduction
- [Fehily 2006] is more approachable
- [Pilgrim 2004] is also good, and is available on-line for free
- [Langtangen 2004] is a comprehensive introduction for scientists and engineers
- Goes into much more detail than this course will
- But doesn't address broader issues, such as programming practices
- [Doar 2005] describes what a complete development environment ought to contain
- By the time this course is over, you should understand what each tool he describes does, and why you want to use it
- See the Bibliography for (many) others
- Check out the Online Resources as well
Typographic Conventions
Summary
- Twenty-five hours of instruction, and one hundred hours of practice, can change your life for the better
- Let's get started
Exercises
Exercise 2.1:
What is the largest software project you have ever worked on?
How well did it meet its original objectives? What is the most
important thing you learned from it?
Exercise 2.2:
Write a point-form list of the programming tools you use on a
regular basis. When and how did you learn each one? How
proficient do you think you are with each? Compared to
whom?
Exercise 2.3:
Suppose you have been given one week to write a program to
translate old-style configuration files to a new syntax. Write
a point-form description of how you would go about it.
Exercise 2.4:
Rewrite the following fragment of code to make it more
readable. Don't worry about the fact that you don't know the
language it's written in; feel free to use any functions or
language features you're familiar with from other languages.
i = open('oldconfig.cnf', 'r');
ll = i.readlines();
for j in 0..len(ll) {
if len(j) > 0 {
if not defined(r) r = new list;
r.append(j);
}
}
sort(r);
print 'longest line is', r[0];
Exercise 2.5:
What are the errors in the function shown below? Don't worry
about the lack of variable declarations: this language doesn't
need them. Note that, like C and Java, this language uses 0 as
the first index for lists.
# Calculate a running sum of a list of numbers.
# If the input values are [1, 2, 3], the final values are [1, 3, 6].
def running_sum(values) {
i = 1;
while (i < len(values)) {
values[i] = values[i] + values[i-1];
}
}
Exercise 2.6:
A sub-contractor in Euphoristan has just written a function
that takes two lists of phone numbers (represented as strings),
and returns all those in the first list that are not in
the second. You only have a few minutes to test it before she
goes off-line for the weekend; what are the first half-dozen
test cases you would try?
Shell Basics
Introduction
- Most modern tools have a graphical user interface (GUI)
- They're easier to use
- A picture is worth a thousand words
- But command-line user interfaces (CLUIs) still have their place
- Easier to build a simple CLUI than a simple GUI
- Higher action-to-keystroke ratio
- Once you're over the learning curve
- Easier to see and understand what the computer is doing on your behalf
- Which is part of what this course is about
- Most important: it's easier to combine CLUI tools than GUI tools
- Small tools, combined in many ways, can be very powerful
- This lecture focuses on Unix
- Because while there are good Unix emulators for Windows, there aren't good Windows emulators for Unix
You Can Skip This Lecture If...
The Shell
- The most important command-line tool is the command shell
- Usually just called “the shell”
- Looks (and works) like an interactive terminal circa 1980
![[A Shell in Action]](./img/shell01/shell_screenshot.png)
Figure 3.1: A Shell in Action
- Manages a user's interactions with the operating system by:
- Reading commands from the keyboard
- Executing those commands…
- …or running another program
- Displaying the output
- The shell is just one program among many
- Many different shells have been written
- The Bourne shell, called sh, is an ancestor of many of them
- It's still a lowest common denominator that you can always rely on
- We'll use bash (the Bourne Again Shell) in this course
- Even on Windows (thanks to Cygwin)
The Shell is Not the Operating System
- The operating system is not just another program
- Automatically loaded when the computer boots up
- Runs everything else (including shells)
![[Operating System]](./img/shell01/operating_system.png)
Figure 3.2: Operating System
- The OS manages the computer's hardware
- Provides a common interface to different chips, disks, network cards, etc.
- So that user-level applications can run anywhere
- The OS also keeps track of what programs are running, what privileges they have, etc.
- Which makes it crucial to security
The File System
- The file system is the set of files and directories the computer can access
- “Everything that doesn't go away when you reboot”
- Data is stored in files
- By convention, files have two-part names, like notes.txt or home.html
- Most operating systems allow you to associate a filename extension with an application
- E.g., .txt is associated with an editor, and .html with a web browser
- But this is all just convention: you can call files (almost) anything you want
- Files are stored in directories (often called folders)
- Directories can contain other directories, too
- Results in the familiar directory tree
![[A Directory Tree]](./img/shell01/directory_tree.png)
Figure 3.3: A Directory Tree
- Everything in a particular directory must have a unique name
- Otherwise, how would you identify it?
- But items in different directories can have the same name
- On Unix, the file system has a unique root directory called /
- Every other directory is a child of it, or a child of a child, etc.
- On Windows, every drive has its own root directory
- So C:\home\gvwilson\notes.txt is different from J:\home\gvwilson\notes.txt
- When you're using Cygwin, you can also write C:\home\gvwilson as c:/home/gvwilson
- Or as /cygdrive/c/home/gvwilson
- Some Unix programs give ":" a special meaning, so Cygwin needed a way to write paths without it…
- Note: file and directory names are case-sensitive on Unix, but case-insensitive on Windows
- So don't ever rely on case differences when naming things
Paths
- A path is a description of how to find something in a file system
- An absolute path describes a location from the root directory down
- Equivalent to a street address
- Always starts with "/"
- /home/hpotter is Harry Potter's home directory
- /courses/swc/web/lec/shell.html is this file
- A relative path describes how to find something from some other location
- Equivalent to saying, “Four blocks north, and seven east”
- From /courses/swc, the relative path to this file is web/lec/shell.html
- Every program (including the shell) has a current working directory
- “Where am I?”
- Relative paths are deciphered relative to this location
- It can change while a program is running
- Two special directory names:
- "." (pronounced “dot”) is the current directory
- ".." (pronounced “dot dot”) is the directory one level up
- Also called the parent directory
- In /courses/swc/data, .. is /courses/swc
- In /courses/swc/data/elements, .. is /courses/swc/data
![[Parent Directories]](./img/shell01/parent_directory.png)
Figure 3.4: Parent Directories
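The behaviour of absolute paths, relative paths, "." and ".." can be tried out safely in a throwaway directory. The sketch below is only an illustration: it assumes a POSIX shell with `mktemp` available, and the tree it builds merely imitates the /courses/swc example rather than using the real course files.

```shell
# Build a small throwaway tree that mirrors the /courses/swc example
# (everything under $root is illustrative only).
root=$(mktemp -d)
mkdir -p "$root/courses/swc/data/elements"

cd "$root/courses/swc/data/elements"   # absolute path: spelled out from the top
cd ..                                  # relative path: up to the parent directory
parent=$(basename "$PWD")
echo "after 'cd ..' we are in: $parent"

cd ./elements                          # "." is the current directory
child=$(basename "$PWD")
echo "after 'cd ./elements' we are in: $child"

cd ../..                               # two levels up
grandparent=$(basename "$PWD")
echo "after 'cd ../..' we are in: $grandparent"

cd / && rm -rf "$root"                 # clean up
```

Because relative paths are resolved against the current working directory, the same `cd ..` lands somewhere different each time the working directory changes.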
Navigating the File System
- Easiest way to learn basic Unix commands is to see them in action
- Type pwd (short for “print working directory”) to find out where you are
- Unfortunately, most Unix commands have equally cryptic names
- Then type ls (for “listing”) to see what's in the current directory
- To see what's in the data directory, type ls data
- Or:
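A possible session using these commands is sketched here. The directory layout is made up for the demonstration, and the last step (changing into the directory before listing it) is just one assumed alternative to `ls data`, not necessarily the one the original slide showed.

```shell
# Throwaway directory with a made-up layout, so nothing real is touched.
work=$(mktemp -d)
mkdir "$work/data"
touch "$work/notes.txt" "$work/data/earth.txt" "$work/data/venus.txt"

cd "$work"
pwd                      # where am I?
ls                       # what is in the current directory?
listing=$(ls data)       # what is in data, without going there?

cd data                  # one alternative: go there first...
same_listing=$(ls)       # ...then list the current directory
echo "$same_listing"

cd / && rm -rf "$work"   # clean up
```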
Execution Cycle
- When you type a command like ls, the OS:
- Reads characters from the keyboard
- Passes them to the shell (because it's the currently active window)
- The shell:
- Breaks the line of text it receives into words
- Looks for a program with the same name as the first word
- See in a moment how the shell knows where to look
- Runs that program
- That program's output goes back to the shell…
- …which gives it to the OS…
- …which displays it on the screen
- (Actually, the OS hands it to the window manager, which takes care of the display)
![[Running a Program]](./img/shell01/running_program.png)
Figure 3.5: Running a Program
- All well-designed software systems work this way
- Break the task down into pieces
- Write a tool that solves each sub-problem
- Hook 'em up
- Allows you to:
- Develop and test components independently
- Replace or re-use components incrementally
- Add new components as you go along
Providing Options
- Can make ls produce more informative output by giving it some flags
- By convention, flags for Unix tools start with "-", as in "-c" or "-l"
- Some flags take arguments (such as filenames)
- Show directories with trailing slash
- Show all files and directories, including those whose names begin with .
- Flags can be combined
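As an illustration of these flags, the sketch below uses `-F` (mark directories with a trailing slash) and `-a` (include names beginning with "."), both standard `ls` options. The directory and file names are invented for the example.

```shell
demo=$(mktemp -d)
mkdir "$demo/data"
touch "$demo/notes.txt" "$demo/.hidden"

cd "$demo"
plain=$(ls)          # no flags: names beginning with "." are skipped
slashes=$(ls -F)     # -F: directories get a trailing slash
everything=$(ls -a)  # -a: include ".", ".." and other dot-names
both=$(ls -aF)       # combined flags: same as ls -a -F

echo "$both"
cd / && rm -rf "$demo"   # clean up
```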
Creating Files and Directories
- Rather than messing with the course files, let's create a temporary directory and play around in there
- Note: no output (but -v (“verbose”) would tell mkdir to print a confirmation message)
- Go into that directory: no files there yet
- Use the editor of your choice to create a file called earth.txt with the following contents:
- Easiest way to create a similar file venus.txt is to copy earth.txt and edit it
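The steps above might look like the following sketch. Several things here are assumptions: the directory name `tmp` is hypothetical, the contents of earth.txt are inferred from the diff output in Exercise 3.7 rather than taken from the slide, and a here-document stands in for the editor so the example can run unattended.

```shell
playground=$(mktemp -d)        # keeps the demonstration away from real files
cd "$playground"

mkdir tmp                      # prints nothing on success
cd tmp
before=$(ls)                   # empty: no files there yet

# Stand-in for the editor step; contents inferred from Exercise 3.7.
cat > earth.txt <<'EOF'
Name: Earth
Period: 365.26 days
Inclination: 0.00
Eccentricity: 0.02
EOF

cp earth.txt venus.txt         # copy first, then edit venus.txt by hand
after=$(ls)
echo "$after"

cd / && rm -rf "$playground"   # clean up
```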
Looking at Files
- Check the contents of the file using cat (short for “concatenate”)
- Compare the sizes of the two files using:
- ls -l (“-l” meaning “long form”)
- wc (for “word count”)
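A short sketch of these three commands follows. The two sample files are created on the spot with illustrative contents, so the sizes reported will not match the ones from the lecture.

```shell
scratch=$(mktemp -d)
cd "$scratch"
# Illustrative sample files (not the lecture's exact contents).
printf 'Name: Earth\nPeriod: 365.26 days\n' > earth.txt
printf 'Name: Venus\nPeriod: 224.70 days\nSatellites: 0\n' > venus.txt

contents=$(cat earth.txt)     # cat prints a file's contents
echo "$contents"

ls -l earth.txt venus.txt     # long form: the size in bytes is one of the columns
wc earth.txt venus.txt        # lines, words, and characters per file, plus a total

cd / && rm -rf "$scratch"     # clean up
```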
Basic Tools
| Command | Description |
|---|---|
| man | Documentation for commands. |
| cat | Concatenate and display text files. |
| cd | Change working directory. |
| clear | Clear the screen. |
| cp | Copy files and directories. |
| date | Display the current date and time. |
| diff | Show differences between two text files. |
| echo | Print arguments. |
| head | Display the first few lines of a file. |
| ls | List files and directories. |
| mkdir | Make directories. |
| more | Page through a text file. |
| mv | Move (rename) files and directories. |
| od | Display the bytes in a file. |
| passwd | Change your password. |
| pwd | Print current working directory. |
| rm | Remove files. |
| rmdir | Remove directories. |
| sort | Sort lines. |
| tail | Display the last few lines of a file. |
| uniq | Remove adjacent duplicate lines. |
| wc | Count lines, words, and characters in a file. |

Table 3.1: Basic Command-Line Tools
- Exercise: what are the native Windows equivalents of each of these?
Summary
- Command-line tools will be with us for a long time
- Easiest way to do many simple tasks
- Easiest way to see what the computer is actually doing
- Often the only thing you can rely on having on a new machine
- A handful of basic commands will get you a long way
Exercises
Exercise 3.1:
Suppose ls shows you this:
Makefile biography.txt data enrolment.txt programs thesis
What argument(s) will make it print the names in reverse, like this:
thesis programs enrolment.txt data biography.txt Makefile
Exercise 3.2:
What does the command cd ~ do? What about cd ~hpotter?
Exercise 3.3:
What command will show you the first 10 lines of a file? The first 25? The last 12?
Exercise 3.4:
What do the commands pushd, popd,
and dirs do? Where do their names come from?
Exercise 3.5:
How would you send the file earth.txt to the
default printer? How would you check it made it (other than
wandering over to the printer and standing there)?
Exercise 3.6:
The instructor wants you to use a hitherto unknown command
for manipulating files. How would you get help on this command?
Exercise 3.7:
diff finds and displays the differences between
two text files. For example, if you modify earth.txt to
create a new file earth2.txt that contains:
Name: Earth
Period: 365.26 days
Inclination: 0.00 degrees
Eccentricity: 0.02
Satellites: 1
you can then compare the two files like this:
$ diff earth.txt earth2.txt
3c3
< Inclination: 0.00
---
> Inclination: 0.00 degrees
4a5
> Satellites: 1
(The rather cryptic header "3c3" means that line 3 of
the first file must be changed to get line 3 of the second;
"4a5" means that a line is being added after line 4 of the
original file.)
What flag(s) should you give diff to tell it to
ignore changes that just insert or delete blank lines? What if
you want to ignore changes in case (i.e., treat lowercase and
uppercase letters as the same)?
More Shell
Introduction
- The shell is more than just a clumsy way to move around a file system
- It's a component-based programming environment
- Small tools that each do one job…
- …can be connected together to create ad hoc solutions to larger problems
- A good model, even when you're building large GUI or web applications
You Can Skip This Lecture If...
- You know what stdin and stdout are
- You know what a process is
- You know what a pipe is
- You know what $PATH is
- You know what -rwxr-xr-x means
Wildcards
- Some characters (called wildcards) mean special things to the shell
- Note:
- The shell expands wildcards, not individual applications
- ls can't tell whether it was invoked as ls *.txt or as ls earth.txt venus.txt
- Wildcards only work with filenames, not with command names
- ta* does not find the tabulate command
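The point that the shell, not the application, expands wildcards can be checked with a small experiment. This is a minimal sketch in a scratch directory; the file names are only examples.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
touch earth.txt venus.txt notes.dat

# The shell replaces *.txt with the matching names before ls starts,
# so these two commands hand ls exactly the same arguments.
with_wildcard=$(ls *.txt)
spelled_out=$(ls earth.txt venus.txt)

# echo makes the expansion visible without involving ls at all.
expanded=$(echo *.txt)
echo "the shell expanded *.txt to: $expanded"

cd / && rm -rf "$sandbox"   # clean up
```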
Redirecting Input and Output
- A running program is called a process
- Every process automatically has three connections to the outside world:
- Standard input (stdin): by default, the keyboard
- Standard output (stdout): by default, the screen
- Standard error (stderr): also the screen by default, used for error messages
- You can tell the shell to connect standard input and standard output to files instead
- command < input_file reads from input_file instead of from the keyboard
- Don't need to use this very often, because most Unix commands let you specify the input file (or files) as command-line arguments
- command > output_file writes to output_file instead of to the screen
- Only “normal” output goes to the file, not error messages
- command < input_file > output_file does both
![[Redirecting Standard Input and Output]](./img/shell02/redirection.png)
Figure 4.1: Redirecting Standard Input and Output
- Note that redirection takes effect command-by-command, rather than permanently
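A minimal sketch of both kinds of redirection, plus the behaviour of error messages, is shown below. The file names are invented, and `|| true` is added only so that the script keeps going after `ls` reports a failure.

```shell
area=$(mktemp -d)
cd "$area"
printf 'one two three\nfour five\n' > input.txt   # five words of sample text

wc -w < input.txt > count.txt     # read from input.txt, write to count.txt
count=$(cat count.txt)
echo "word count saved in count.txt: $count"

# Error messages travel on standard error, so > does not capture them:
# the complaint about the missing file still appears on the screen.
ls no_such_file > listing.txt || true
saved=$(cat listing.txt)          # empty: only "normal" output was redirected

cd / && rm -rf "$area"            # clean up
```

The next command typed after these reads from the keyboard and writes to the screen again, since each redirection applies to one command only.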
Redirection Examples
- Save number of words in all text files to words.len:
- Try typing cat > junk.txt
- No input file specified, so cat reads from the keyboard
- Output sent to a file
- The world's dumbest text editor
- When you're done, use rm junk.txt to get rid of the file
- Don't type rm * unless you're really, really sure that's what you want to do…
- Don't redirect out to the same file, e.g. sort words > words
- The shell sets up redirection before running the command
- Redirecting out to an existing file truncates it (makes it empty)
- sort then goes and reads the empty file
- Contents of words are lost
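The truncation problem is easy to reproduce in a scratch directory. The sketch below first saves word counts to words.len, then contrasts a safe way to sort a file in place (via a second, hypothetical file name `words.sorted`) with the unsafe one; the sample data is invented.

```shell
lab=$(mktemp -d)
cd "$lab"
printf 'pear\napple\nfig\n' > words
printf 'a b c\n' > one.txt
printf 'd e\n' > two.txt

wc -w *.txt > words.len          # save the word counts in words.len
cat words.len

# Safe: send the sorted output to a different file, then rename it.
sort words > words.sorted
mv words.sorted words
sorted=$(cat words)

# Unsafe: the shell empties the file before sort gets a chance to read it.
cp words precious
sort precious > precious
lost=$(cat precious)             # empty: the contents are gone

cd / && rm -rf "$lab"            # clean up
```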
Pipes
- Suppose you want to use the output of one program as the input of another
- E.g., use wc -w *.txt to count the words in some files, then sort -n to sort numerically
- The obvious solution is to send output of first command to a temporary file, then read from that file
- The right answer is to use a pipe
- Can chain any number of commands together
- Any program that reads from standard input and writes to standard output can use redirection and pipes
- Such programs are often called filters
- If your programs work like filters, you (and other people) can combine them with standard tools
- A combinatorial explosion of goodness
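The two approaches can be compared side by side. This sketch uses three invented files and a hypothetical temporary file name (`counts.tmp`); the final line shows that pipes chain, using `head` to keep only the first line of the sorted output.

```shell
bench=$(mktemp -d)
cd "$bench"
printf 'a b c d\n' > long.txt
printf 'a\n' > short.txt
printf 'a b\n' > medium.txt

# Temporary-file version: works, but needs cleaning up afterwards.
wc -w *.txt > counts.tmp
sort -n counts.tmp
rm counts.tmp

# Pipe version: wc's standard output becomes sort's standard input.
wc -w *.txt | sort -n

# Pipes chain: keep only the first line, i.e. the file with the fewest words.
shortest=$(wc -w *.txt | sort -n | head -n 1)
echo "fewest words: $shortest"

cd / && rm -rf "$bench"          # clean up
```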
Environment Variables
- The OS stores some environment variables for every process
- Like variables in a program, each has a name and a value
- By convention, names are all upper case
- Values are always strings
- Type set at the command prompt to get a listing:
- Get a variable's value by putting "$" in front of its name
- So ls $HOME is the same as ls /home/rweasley (if you're Ron Weasley)
- Use the echo command to print out a variable's value
- If a variable hasn't been defined, its value is the empty string (rather than an error)
| Name | Typical Value | Notes |
|---|---|---|
| COLUMNS | 80 | The width in characters of the current display window |
| EDITOR | /bin/edit | Preferred editor |
| HOME | /home/rweasley | The current user's home directory |
| HOMEDRIVE | C: | The current user's home drive (Windows only) |
| HOSTNAME | ishad | This computer's name |
| HOSTTYPE | i686 | What kind of computer this is |
| LINES | 60 | The height in characters of the current display |
| OS | Windows_NT | What operating system is running |
| PATH | /home/rweasley/bin:/usr/local/bin:/usr/bin:/bin:/Python24/ | Where to look for programs |
| PWD | /home/rweasley/swc/lec | Present working directory (sometimes CWD, for current working directory) |
| SHELL | /bin/bash | What shell is being run |
| TEMP | /tmp | Where to store temporary files |
| USER | rweasley | The current user's ID |

Table 4.1: Important Environment Variables
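Reading variables with "$" and `echo` can be sketched as follows. The values printed depend on the machine, and the variable name `NO_SUCH_VARIABLE_SWC` is made up to show what happens when nothing has been defined (this assumes the shell's default settings, under which an undefined variable is not treated as an error).

```shell
echo "home directory: $HOME"
echo "search path:    $PATH"

# An undefined variable expands to the empty string rather than causing an error.
unset NO_SUCH_VARIABLE_SWC             # hypothetical name; make sure it is not set
undefined="[$NO_SUCH_VARIABLE_SWC]"
echo "undefined variable gives: $undefined"
```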
Setting Environment Variables
- Different shells have different syntaxes for setting environment variables
- For Bash, use this:
- Setting an environment variable only affects that program (i.e., that shell)
- If you want programs run from that shell to inherit the value, you must export it:
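In Bash (and other Bourne-style shells), setting and exporting might look like the sketch below. The variable name `SWC_PLANET` is hypothetical, and `sh -c` is used here simply as a convenient child process for checking what gets inherited.

```shell
unset SWC_PLANET                        # hypothetical name; start from a clean slate
SWC_PLANET=earth                        # set: visible in this shell only

in_shell=$SWC_PLANET
in_child=$(sh -c 'echo "$SWC_PLANET"')  # a child process does not see it yet

export SWC_PLANET                       # export: children now inherit the value
in_child_after=$(sh -c 'echo "$SWC_PLANET"')

echo "this shell:          $in_shell"
echo "child before export: [$in_child]"
echo "child after export:  [$in_child_after]"
```

Note that there are no spaces around the "=" in the assignment; with spaces, the shell would try to run a command called SWC_PLANET.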
Configuration
- To set a variable's value automatically when you log in, edit ~/.bashrc
- Many applications look for personal configuration files in the user's home directory
- By convention, their names begin with “.” so that a normal ls won't show them
- Once upon a time, the “rc” at the end meant “run commands”
How the Shell Finds Programs
- The PATH environment variable defines the shell's search path
- When you run a command like broom, the shell:
- Splits $PATH into components to get a list of directories
- Unix uses “:” as a separator
- Windows uses “;”
- Looks for the program in each directory in left-to-right order
- Runs the first one that it finds
- Example
- PATH is /home/rweasley/bin:/usr/local/bin:/usr/bin:/bin:/Python24
- Both /usr/local/bin/broom and /home/rweasley/bin/broom exist
- /home/rweasley/bin/broom will be run when you type broom at the command prompt
- Can run the other one by specifying the path, instead of just the command name
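The broom example can be imitated without touching real system directories. In this sketch both broom programs are tiny hypothetical scripts placed in a scratch tree, and `command -v` (a standard shell built-in) is used to ask which file the search would pick. It assumes the temporary directory allows programs to be executed.

```shell
base=$(mktemp -d)
mkdir -p "$base/home/bin" "$base/local/bin"

# Two different programs with the same name (both hypothetical).
printf '#!/bin/sh\necho "personal broom"\n' > "$base/home/bin/broom"
printf '#!/bin/sh\necho "system broom"\n'   > "$base/local/bin/broom"
chmod +x "$base/home/bin/broom" "$base/local/bin/broom"

old_path=$PATH
PATH="$base/home/bin:$base/local/bin:$PATH"   # searched left to right

found=$(command -v broom)          # which file would the shell run?
first=$(broom)                     # runs the one in home/bin
other=$("$base/local/bin/broom")   # an explicit path bypasses the search

echo "$found"
echo "$first / $other"

PATH=$old_path                     # restore the original search path
rm -rf "$base"                     # clean up
```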
Common Search Path Entries
- /bin, /usr/bin: core tools like ls
- Note: the word “bin” comes from “binary”, which is geekspeak for “a compiled program”
- /usr/local/bin: optional (but common) tools, like the gcc C compiler
- $HOME/bin: tools you have built for yourself
- Remember, $HOME is your home directory
- It is also common to include . (the current working directory) in your path
- Allows you to run a program in the current directory using whatever, instead of ./whatever
Cygwin on Windows
- Cygwin does things a little differently
- Uses the notation /cygdrive/c/somewhere instead of Windows' C:/somewhere
- Because the colon in C:/somewhere would clash with the colons in the PATH variable
- By default, Cygwin treats C:/cygwin as the root of its file system
- So /home/rweasley is a synonym for C:/cygwin/home/rweasley
- Yes, it can be confusing
- But then, it is trying to make one operating system look like another
File Ownership and Permissions
- On Unix, every user belongs to one or more groups
- The groups command will show you which ones you are in
- Every file is owned by a particular user and a particular group
- Can assign read (r), write (w), and execute (x) permissions independently to user, group, and others
- Read: can look at contents, but not modify them
- Write: can modify contents
- Execute: can run the file (e.g., it's a program)
- ls -l shows this information
- Along with the file's size and a few other things
- Permissions displayed as three rwx triples
- “Missing” permissions shown by "-"
- So rw-rw-r-- means:
- Everyone else can read, but not write
- No one can execute
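To see the rw-rw-r-- pattern in an actual listing, a file can be given exactly those permissions and inspected with `ls -l`. This is only a sketch: the file name is an example, and `cut` is used to keep just the first ten characters of the listing (one file-type character, "-" for an ordinary file, followed by the three triples).

```shell
vault=$(mktemp -d)
cd "$vault"
touch notes.txt

chmod u=rw,g=rw,o=r notes.txt          # user and group: read+write; others: read only
mode=$(ls -l notes.txt | cut -c1-10)   # keep the type character and the three triples
echo "$mode"

cd / && rm -rf "$vault"                # clean up
```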
Directory Permissions
- Execute permission means something different for directories
- Allows you to “go into” a directory, but does not mean you can read its contents
- If tools has permission rwx--x--x, then:
- If someone other than the owner does ls tools, permission is denied
- But anyone who wants to can run tools/pfold
Changing Permissions
- Change permissions using chmod
- chmod u+x broom allows broom's owner to run it
- chmod o-r notes.txt takes away the world's read permission for notes.txt
- Any set of shell commands can be turned into a program!
- If it's worth doing again, it's worth automating
- Create a file called nojunk
- Change permissions to rwxr-xr-x
- Run it with ./nojunk
- Or if $HOME/bin is in your search path, move it there
- Don't call your temporary test programs test
- There's already /usr/bin/test
- Your PATH may cause that program to run instead of yours
- Confusion results, so use something else, e.g. ./try
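Turning commands into a program might go as sketched below. The contents of nojunk are not shown in these notes, so the one-line script used here (delete files ending in .junk) is purely a hypothetical stand-in. The sketch also assumes the scratch directory allows programs to be executed.

```shell
shop=$(mktemp -d)
cd "$shop"

# A hypothetical nojunk: the real script's contents are not given in the notes.
# This version deletes files ending in .junk from the current directory.
printf '#!/bin/sh\nrm -f ./*.junk\n' > nojunk

chmod u=rwx,g=rx,o=rx nojunk             # i.e. rwxr-xr-x
mode=$(ls -l nojunk | cut -c1-10)

touch a.junk b.junk keep.txt
./nojunk                                 # run it from the current directory
left=$(ls)
echo "$mode"
echo "$left"

cd / && rm -rf "$shop"                   # clean up
```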
Ownership and Permission: Windows
- Of course, it all works differently on Windows
- Not better or worse, just differently
- Windows XP uses access control lists
- Every file and directory has a list of (who, what) pairs
- “Who” can be a group
- Some versions of Unix provide ACLs as well, but many tools don't understand them
- Older versions of Windows (such as Windows 95 and Windows 2000) are fundamentally insecure, and shouldn't be used
- Cygwin does its best to make the Windows model look like Unix's
- When you trip over the differences, please consult a system administrator
More Advanced Tools
| Command | Description |
|---|---|
| chmod | Change file and directory permissions. |
| du | Print the disk space used by files and directories. |
| find | Find files with names that match patterns, that are of a certain age or size, etc. |
| grep | Print lines matching a pattern. |
| gunzip | Uncompress a file. |
| gzip | Compress a file. |
| lpr | Send a file to a printer. |
| lprm | Remove a print job from a printer's queue. |
| lpq | Check the status of a printer's queue. |
| ps | Display running processes. |
| tar | Archive files. |
| which | Find the path to a program. |
| who | See who is logged in. |
| xargs | Execute a command for each line of input. |

Table 4.2: Advanced Command-Line Tools
Summary
- The shell is as powerful as most programming languages
- Actually has features that most programming languages don't
- But its limits are as important as its capabilities
- As soon as you need functions or data structures, you should switch to Basic Scripting
Exercises
Exercise 4.1:
-rwxr-xr-x 1 aturing cambridge 69 Jul 12 09:17 mars.txt
-rwxr-xr-x 1 ghopper usnavy 71 Jul 12 09:15 venus.txt
According to the listing of the data directory above,
who can read the file earth.txt? Who can write it (i.e.,
change its contents or delete it)? When was earth.txt
last changed? What command would you run to allow everyone to
edit or delete the file?
Exercise 4.2:
Suppose you want to remove all files whose names (not including
their extensions) are of length 3, start with the letter a, and
have .txt as extension. What command would you use? For
example, if the directory contains three files a.txt,
abc.txt, and abcd.txt, the command should remove
abc.txt , but not the other two files.
Exercise 4.3:
You're worried your data files can be read by your
nemesis, Dr. Evil. How would you check whether or not he can,
and if necessary change permissions so only you can read or
write the files?
Exercise 4.4:
What's the difference between the commands cd HOME
and cd $HOME?
Exercise 4.5:
Suppose you want to list the names of all the text files in the
data directory that contain the word "carpentry". What
command or commands could you use?
Exercise 4.6:
Suppose you have written a program called analyze. What
command or commands could you use to display the first ten lines of
its output? What would you use to display lines 50-100? To send
lines 50-100 to a file called tmp.txt?
Exercise 4.7:
The command ls data > tmp.txt writes a listing of
the data directory's contents into tmp.txt. Anything
that was in the file before the command was run is overwritten. What
command could you use to append the listing to tmp.txt
instead?
Exercise 4.8:
What command(s) would you use to find out how many
subdirectories there are in the lectures directory?
Exercise 4.9:
What does rm *.ch do? What about rm
*.[ch]?
Exercise 4.10:
What command(s) could you use to find out how many instances of
a program are running on your computer at once? For example, if you
are on Windows, what would you do to find out how many instances of
svchost.exe are running? On Unix, what would you do to
find out how many instances of bash are running?
Exercise 4.11:
A colleague asks for your data files. How would you
archive them to send as one file? How could you compress them?
Exercise 4.12:
You have changed a text file on your home PC, and mailed
it to yourself at the university. What steps can you take to see
what changes you made, compared with the master copy in
your home directory?
Exercise 4.13:
How would you change your password?
Exercise 4.14:
grep is one of the more useful tools in the
toolbox. It finds lines in files that match a pattern and
prints them out. For example, assume the files
earth.txt and venus.txt contain lines like
this:
Name: Earth
Period: 365.26 days
Inclination: 0.00
Eccentricity: 0.02
grep can extract lines containing the text
"Period" from all the files:
$ grep Period *.txt
earth.txt:Period: 365.26 days
venus.txt:Period: 224.70 days
Search strings can use regular
expressions, which will be discussed in Regular Expressions. grep takes many
options as well; for example, grep -c /bin/bash
/etc/passwd reports how many lines in /etc/passwd
(the Unix password file) contain the string
/bin/bash, which in turn tells me how many users are
using bash as their shell.
Suppose all you wanted was a list of the files that
contained lines matching a pattern, rather than the matches
themselves—what flag or flags would you give to
grep? What if you wanted the line numbers of
matching lines?
Exercise 4.15:
Suppose you wanted ls to sort its output by
filename extension, i.e., to list all .cmd files before
all .exe files, and all .exe's before all
.txt files. What command or commands would you
use?
Exercise 4.16:
What does the alias command do? When would
you use it?
Version Control
Introduction
- Four things distinguish professional programmers from amateurs:
- Using a version control system
- Automating repetitive tasks
- Systematic testing
- Using debugging aids rather than print statements
- This lecture introduces the first of these
- Even if it's the only thing you take away from this course, you'll be more productive than you are now
You Can Skip This Lecture If...
- You know what a repository is
- You know how to commit changes
- You know how to merge conflicts
- You know how to roll back a set of changes
- You know what a branch is
Problem #1: Collaboration
- What if two or more people want to edit the same file at the same time?
- Option 1: make them take turns
- But then only one person can be working at any time
- And how do you enforce the rule?
- Option 2: patch up differences afterwards
- Requires a lot of re-working
- Stuff always gets lost
Solution: Version Control
- The right solution is to use a version control system
- Keep the master copy of the file in a central repository
- Each author edits a working copy
- When they're ready to share their changes, they commit them to the repository
- Other people can then do an update to get those changes
![[Managing Multi-Author Collaboration]](./img/version/multi_author_collab.png)
Figure 5.1: Managing Multi-Author Collaboration
- This is also a good way for one person to manage files on multiple machines
- Keep one working copy on each machine: your personal laptop, the lab machine, and the departmental server
- No more mailing yourself files, or carrying around a USB drive (and forgetting to copy things onto it)
Problem #2: Undoing Changes
- Often want to undo changes to a file
- Start work, realize it's the wrong approach, want to get back to starting point
- Like “undo” in an editor…
- …but keep the whole history of every file, forever
- Also want to be able to see who changed what, when
- The best way to find out how something works is often to ask the person who wrote it
Solution: Version Control (Again)
- Have the version control system keep old revisions of files
- And have it record who made the change, and when
- Authors can then roll back to a particular revision or time
![[Version Control as a Time Machine]](./img/version/time_machine.png)
Figure 5.2: Version Control as a Time Machine
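The "time machine" idea can be sketched as a toy in Python 3 syntax. The class and method names are invented for this illustration; a real system such as Subversion versions whole directory trees and stores changes far more compactly than keeping full copies.

```python
import datetime

class ToyRepository:
    """Keeps every revision of one file, with its author and timestamp."""

    def __init__(self):
        self.revisions = []  # list of (author, when, content) tuples

    def commit(self, author, content):
        """Store a new revision and return its number (starting at 1)."""
        self.revisions.append((author, datetime.datetime.now(), content))
        return len(self.revisions)

    def content_at(self, revision):
        """Return the file's content as of the given revision."""
        return self.revisions[revision - 1][2]

    def log(self):
        """Return (revision, author, when) for every change."""
        return [(number + 1, author, when)
                for number, (author, when, _) in enumerate(self.revisions)]

repo = ToyRepository()
repo.commit("ron", "Io\nEuropa\n")
repo.commit("harry", "Io\nEuropa\nGanymede\n")
print(repo.content_at(1))                      # roll back to the first revision
print([(n, who) for n, who, _ in repo.log()])  # who changed what
```

Nothing is ever deleted: rolling back only reads an older entry, so later revisions stay available.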
Which Version Control System?
- Many systems are available commercially
- If you have a large group, or a budget, Perforce is excellent
- CVS and Subversion are:
- Open source
- Reliable
- Well documented
- CVS has been around since the 1980s
- Very popular, but showing its age
- Flaw #1: it keeps track of each file separately
- So there's no reliable way to ask, “Which files were changed together?”
- Flaw #2: you can create new directories, but can't delete old ones
- Subversion has been developed from 2000 onward as a workalike replacement
- Feels the same, but eliminates CVS's major weaknesses
- Many projects have already switched
Basic Use
- Ron and Hermione each have a working copy of the solarsystem project repository
- See below how they got it
- Ron wants to add some information about Jupiter's moons
- That afternoon, Hermione runs svn update on her working copy
![[The Basic Edit/Update Cycle]](./img/version/edit_update_cycle.png)
Figure 5.3: The Basic Edit/Update Cycle
How To Do It
- One way to use Subversion is to type commands in a shell
- A lowest common denominator that will work almost everywhere
- RapidSVN is a GUI that runs on Windows, Linux, and Mac
- Well, maybe “walks” is a better description: Version 0.9 isn't particularly fast
![[RapidSVN]](./img/version/rapidsvn.png)
Figure 5.4: RapidSVN
- TortoiseSVN is a Windows shell extension
- Integrates with the file browser, rather than running separately
![[TortoiseSVN]](./img/version/tortoisesvn.png)
Figure 5.5: TortoiseSVN
Resolving Conflicts
- Back to the problem of conflicting edits (or, more simply, conflicts)
- Option 1: only allow one person to have a writeable copy at any time
- Option 2: let people edit, and resolve conflicts afterward by merging files
Example of Resolving
- Ron and Hermione are both synchronized with version 151 of the repository
- Ron edits moons.txt and commits his changes to create version 152
- Simultaneously, Hermione edits her copy of moons.txt
- When she tries to commit, Subversion tells her there's a conflict
- A race condition: two or more would-be writers racing to get their changes in first
![[Merging Conflicts]](./img/version/conflict_merge.png)
Figure 5.6: Merging Conflicts
Example of Resolving (continued)
- Subversion puts Hermione's changes and Ron's in moons.txt
- Subversion also creates:
- moons.txt.mine: contains Hermione's changes
- moons.txt.r151: the file before either set of changes
- moons.txt.r152: the most recent version of the file in the repository
Example of Resolving (continued)
- At this point, Hermione can:
- Run svn revert moons.txt to throw away her changes
- Copy one of the three temporary files on top of moons.txt
- Edit moons.txt to remove the conflict markers
- Once she's done, she runs:
- svn resolved moons.txt to let Subversion know she's done
- svn commit to commit her changes (creating version 153 of the repository)
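Before running svn resolved, it is worth checking that no conflict markers are left in the file. The sketch below, in Python 3 syntax, shows one way to do that; the function name and the sample file contents are made up for the illustration.

```python
def has_conflict_markers(text):
    """Report whether any line starts with a conflict marker."""
    markers = ("<<<<<<<", "=======", ">>>>>>>")
    return any(line.startswith(markers) for line in text.splitlines())

# Hypothetical contents of moons.txt after a conflicting update
conflicted = """Io
<<<<<<< .mine
Europa (Hermione's note)
=======
Europa (Ron's note)
>>>>>>> .r152
Ganymede
"""

# The same file after Hermione has edited it by hand
resolved = "Io\nEuropa (merged note)\nGanymede\n"

print(has_conflict_markers(conflicted))
print(has_conflict_markers(resolved))
```

This is only a heuristic: a file that legitimately contains a line of equals signs (an underlined heading, for example) would be reported as conflicted.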
Starvation
- But what happens if Ginny commits another set of changes while Hermione is resolving?
- And then Harry commits yet another set?
- Starvation: Hermione never gets a turn because someone else always gets there first
- This is a management problem, not a technical one
- Break the file(s) up into smaller pieces
- Give people clearer responsibilities
- The version control system is trying to tell you that people on your team are working at cross purposes
Binary Files
- Subversion can only merge conflicts in text files
- Source code, HTML—basically, anything you can edit with Notepad, Vi, or Emacs
- But images, video clips, Microsoft Word, and other formats aren't plain text
- When there's a conflict, Subversion saves your copy and the master copy side by side in your working directory
- Up to you to resolve the differences
- It's not Subversion's fault
- Most creators of non-text formats don't provide a way to find or merge differences between files
Reverting
- After doing some more work, Ron decides he's on the wrong path
- svn diff shows him which files he has changed, and what those changes are
- He hasn't committed anything yet, so he uses svn revert to discard his work
- I.e., throw away any differences between his working copy and the master as it was when he started
- Synchronizes with where he was, not with any changes other people have made since then
- If you find yourself reverting repeatedly, you should probably go and do something else for a while…
Rolling Back
- Now Ron decides that he doesn't like the changes Harry just made to moons.txt
- Wants to do the equivalent of “undo”
- svn log shows recent history
- Current revision is 157
- He wants to revert to revision 156
- svn merge -r 157:156 moons.txt will do the trick
- The argument to the -r flag specifies the revisions involved
- Merging allows him to keep some of Harry's changes if he wants to
- Revision 157 is still in the repository
![[Rolling Back]](./img/version/rollback.png)
Figure 5.7: Rolling Back
Creating and Checking Out
- To create a repository:
- Decide where to put it (e.g., /svn/rotor)
- Go into the containing directory and create the repository:
cd /svn
svnadmin create rotor
- Can then check out a working copy
- Directly through the file system: svn checkout file:///svn/rotor
- Through a web server: svn checkout http://www.hogwarts.edu/svn/rotor
- Note: requires your system administrator to configure the web server properly
- Only use svn checkout once, to initialize your working copy
- After that, use svn update in that directory
- If you only want part of the repository, use svn co http://www.hogwarts.edu/svn/rotor/engine/dynamics
Subversion Command Reference
| Name | Purpose | Example |
|---|---|---|
| svn add | Add files and/or directories to version control. | svn add newfile.c newdir |
| svn checkout | Get a fresh working copy of a repository. | svn checkout https://your.host.name/rotor/repo rotorproject |
| svn commit | Send changes from working copy to repository (inverse of update). | svn commit -m "Comment on the changes" |
| svn delete | Delete files and/or directories from version control. | svn delete oldfile.c |
| svn help | Get help (in general, or for a particular command). | svn help update |
| svn log | Show history of recent changes. | svn log --verbose *.c |
| svn merge | Merge two different versions of a file into one. | svn merge -r 18:16 spin.c |
| svn mkdir | Create a new directory and put it under version control. | svn mkdir newmodule |
| svn rename | Rename a file or directory, keeping track of history. | svn rename temp.txt release_notes.txt |
| svn revert | Undo changes to working copy (i.e., resynchronize with repository). | svn revert spin.h |
| svn status | Show the status of files and directories in the working copy. | svn status |
| svn update | Bring changes from repository into working copy (inverse of commit). | svn update |
Table 5.1: Common Subversion Commands
Reading Subversion Output
- svn status compares your working copy with the repository
- svn update prints one line for each file or directory it does something to
Summary
- Version control is one of the things that distinguishes professionals from amateurs
- And successful projects from failures
- Everything that a human being had to create should be under version control
- You'll see the benefits almost immediately
Exercises
Exercise 5.1:
Follow the instructions given to you by your instructor to
check out a copy of the Subversion repository you'll be using in
this course. Unless otherwise noted, the exercises below
assume that you have done this, and that your working copy is in
a directory called course. You will submit all of your
exercises in this course by checking files into your
repository.
Exercise 5.2:
Create a file course/ex01/bio.txt (where
course is the root of your working copy of your
Subversion repository), and write a short biography of yourself
(100 words or so) of the kind used in academic journals,
conference proceedings, etc. Commit this file to your
repository. Remember to provide a meaningful comment when
committing the file!
Exercise 5.3:
What's the difference between mv and svn
mv? Put the answer in a file called
course/ex01/mv.txt and commit your changes.
Once you have committed your changes, type svn
log in your course directory. If you didn't know
what you'd just done, would you be able to figure it out from
the log messages? If not, why not?
Exercise 5.4:
In this exercise, you'll simulate the actions of two
people editing a single file. To do that, you'll need to check
out a second copy of your repository. One way to do this is to
use a separate computer (e.g., your laptop, your home computer,
or a machine in the lab). Another is to make a temporary
directory, and check out a second copy of your repository there.
Please make sure that the second copy isn't inside the first, or
vice versa—Subversion will become very confused.
Let's call the two working copies Blue and Green. Do the
following:
a) Create Blue/ex01/planets.txt, and add the
following lines:
Mercury
Venus
Earth
Mars
Jupiter
Saturn
Commit the file.
b) Update the Green working copy. (You should get a copy of
planets.txt.)
c) Change Blue/ex01/planets.txt so that it reads:
1. Mercury
2. Venus
3. Earth
4. Mars
5. Jupiter
6. Saturn
Commit the changes.
d) Edit Green/ex01/planets.txt so that its contents
are as shown below. Do not do svn update
before editing this file, as that will spoil the
exercise.
Mercury 0
Venus 0
Earth 1
Mars 2
Jupiter 16 (and counting)
Saturn 14 (and counting)
e) Now, in Green, do svn update. Subversion
should tell you that there are conflicts in planets.txt.
Resolve the conflicts so that the file contains:
1. Mercury 0
2. Venus 0
3. Earth 1
4. Mars 2
5. Jupiter 16
6. Saturn 14
Commit the changes.
f) Update the Blue working copy, and check that
planets.txt now has the same content as it has in the
Green working copy.
Exercise 5.5:
Add another line or two to course/ex01/bio.txt and
commit those changes. Then, use svn merge to restore
the original contents of your biography
(course/ex01/bio.txt), and commit the result. (When you
are done, bio.txt should look the way it did when you first
committed it in Exercise 5.2.) Note: the purpose
of this exercise is to teach you how to go back in time to get
old versions of files—while it would be simpler in this
case just to edit bio.txt, you can't (reliably) do that
when you've made larger changes, to multiple files, over a
longer period of time.
Exercise 5.6:
Subversion allows users to set properties on files and
directories using svn propset, and to inspect their
values using svn propget. Describe three properties
you might want to change on a file or directory, and how you
might use them in your current project.
Automated Builds
Introduction
- Most languages require you to compile programs before running them
- Typing gcc -c -Wall -ansi -I/pkg/chempak/include dat2csv.c once is bad enough
- Typing it dozens of times as you edit and debug is tedious and error-prone
- Most large programs contain dependencies
- Module A uses modules B and C, B uses D and E, C uses E and F, etc.
- If E changes, ought to recompile B and C, then A
- Rule #2: Anything worth repeating is worth automating
- A standard way and place to save project-related commands…
- …that keeps track of what depends on what
You Can Skip This Lecture If...
- You know what a Makefile is
- You know how to write a rule
- You know how dependencies affect the order of command execution
- You know how to define macros
- You know how to use automatic variables
- You know how to write a generic rule
Automate, Automate, Automate
- Tools that manage repetitive tasks and their dependencies are usually called build tools
- Originally developed to rebuild software packages
- Can equally well be used to update web site content, run backups, etc.
- Such a tool must have:
- A way to describe what things to do
- A way to specify the dependencies between them
Make
- Most widely used build tool is Make
- Invented at Bell Labs in 1975 by Stuart Feldman [Feldman 1979]
- He went on to become a vice-president at IBM, which shows you how far a good tool can take you
- The good news: Make is freely available for every major platform, and very well documented
- The bad news is Make's syntax
- Over 30 years, it has grown into a little programming language (see Rule #11)
- We will ignore advanced features for now
- Look at a better way to solve these problems in Backward, Forward, and Sideways
Our Example
- Running example: Nigel is studying organic fullerene production
- Each experiment produces 20-30 files
- Want to:
- Generate tables showing the results for particular trials using a program called dat2csv
- Update a file showing the correlation between concentrations and yields based on those tables
Hello, Make
- Put the following into a Makefile called hello.mk:
- Must indent with a tab character: not eight spaces, or a mix of spaces and tabs
- Yes, it's a wart, but we're stuck with it
- Run make -f hello.mk
- Make sees that the CSV file depends on the data file
- Since the CSV file doesn't exist, Make runs dat2csv hydroxyl_422.dat > hydroxyl_422.csv
- Run make -f hello.mk again
- hydroxyl_422.csv is newer than hydroxyl_422.dat, so Make does not run the command again
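The decision Make takes in the two runs above comes down to comparing file timestamps. Here is a minimal sketch of that check in Python 3 syntax; the function name needs_update is invented, and the files are created in a temporary directory purely so the example can run anywhere.

```python
import os
import tempfile

def needs_update(target, prerequisite):
    """True if the target is missing or older than its prerequisite."""
    if not os.path.exists(target):
        return True
    return os.path.getmtime(target) < os.path.getmtime(prerequisite)

with tempfile.TemporaryDirectory() as workdir:
    dat = os.path.join(workdir, "hydroxyl_422.dat")
    csv = os.path.join(workdir, "hydroxyl_422.csv")
    with open(dat, "w") as f:
        f.write("raw data\n")

    first = needs_update(csv, dat)    # the CSV file doesn't exist yet

    with open(csv, "w") as f:
        f.write("converted data\n")
    # Force the CSV file's timestamp to be clearly later than the data file's
    later = os.path.getmtime(dat) + 10
    os.utime(csv, (later, later))

    second = needs_update(csv, dat)   # the CSV file is now newer

print(first, second)
```

Because the test relies on timestamps alone, a target that is newer than its prerequisite is treated as up to date whatever its contents are.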
Terminology
![[Structure of a Make Rule]](./img/build/rule_structure.png)
Figure 6.1: Structure of a Make Rule
- hydroxyl_422.csv is the target of the rule
- hydroxyl_422.dat is its prerequisite
- The compilation command is the rule's action
- Make runs them on your behalf, just as the shell runs the command you type
Multiple Targets
- Makefiles usually contain multiple rules
- When you run make -f double.mk, only hydroxyl_422.csv is compiled
- The first rule in the Makefile specifies the default target
- Unless you tell it otherwise, that's all Make will update
- Have to run make -f double.mk methyl_422.csv to build methyl_422.csv
Phony Targets
Dependencies
- Note how one target can depend on others
- all depends on hydroxyl_422.csv and methyl_422.csv
- Each of these depends on (i.e., must be newer than) the corresponding .dat file
- Can visualize dependencies as a directed graph
- Each file is represented by a node
- Dependencies are then the graph's arcs
![[Visualizing Dependencies]](./img/build/visualize_depend.png)
Figure 6.2: Visualizing Dependencies
Updating Dependencies
- Make's built-in processing cycle:
- Follow links top-down to find direct and indirect dependencies
- Execute actions bottom-up to update
- Make can execute actions in any order it wants to, as long as it doesn't violate dependency ordering
- Could update either hydroxyl_422.csv or methyl_422.csv first
- But has to update both before “updating” all
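The top-down search and bottom-up update can be sketched as a depth-first walk over the dependency graph. The code below uses Python 3 syntax; the dictionary mirrors the running example, while the function name update_order is invented and this is not how Make itself is implemented.

```python
# Each target maps to the list of things it depends on
DEPENDS = {
    "all": ["hydroxyl_422.csv", "methyl_422.csv"],
    "hydroxyl_422.csv": ["hydroxyl_422.dat"],
    "methyl_422.csv": ["methyl_422.dat"],
}

def update_order(target, depends, done=None):
    """Return the targets in an order that respects their dependencies."""
    if done is None:
        done = []
    for prerequisite in depends.get(target, []):  # follow links top-down
        update_order(prerequisite, depends, done)
    if target not in done:                        # record targets bottom-up
        done.append(target)
    return done

order = update_order("all", DEPENDS)
print(order)
```

This sketch picks one valid order and assumes the graph has no cycles; any order that handles both .csv files before all would be equally acceptable.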
Conventions
- If you run make with no arguments, it automatically looks for a file called Makefile
- So most projects use that name for their Makefile
- And remember, without an explicit target name, make only updates the first one it finds
- Typical phony targets in a typical Makefile include:
- "all": recompile everything
- "clean": delete all temporary files, and everything produced by compilation
- "install": copy files to system directories
- Many open source packages can be installed by typing:
make configure
make
make test
make install
Automatic Variables
- Make defines automatic variables to represent parts of rules
- Values re-set for each rule
- Unfortunately, names are very cryptic
| Variable | Meaning |
|---|---|
| "$@" | The rule's target |
| "$<" | The rule's first prerequisite |
| "$?" | All of the rule's out-of-date prerequisites |
| "$^" | All prerequisites |
Table 6.1: Automatic Variables in Make
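To make the cryptic names concrete, here is a small imitation of the substitution in Python 3 syntax. The function expand is invented for this illustration and is not part of Make; it simply replaces each automatic variable in an action string with the matching part of the rule.

```python
def expand(action, target, prerequisites, out_of_date):
    """Substitute Make-style automatic variables into an action string."""
    replacements = {
        "$@": target,                    # the rule's target
        "$<": prerequisites[0],          # the first prerequisite
        "$?": " ".join(out_of_date),     # out-of-date prerequisites
        "$^": " ".join(prerequisites),   # all prerequisites
    }
    for name, value in replacements.items():
        action = action.replace(name, value)
    return action

command = expand("summarize $^ > $@",
                 "hydroxyl_all.csv",
                 ["hydroxyl_422.csv", "hydroxyl_480.csv"],
                 ["hydroxyl_480.csv"])
print(command)
```

The sketch assumes that file names never contain a dollar sign; Make's own expansion rules are more involved.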
Automatic Variables Example
Pattern Rules
- Most files of similar type in a project are processed the same way
- E.g., typically compile all C# or Java files with the same options
- Write a pattern rule to describe the general case
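In Make, a pattern rule uses "%" as a wildcard that stands for the same non-empty stem in the target and in its prerequisites. The matching step can be sketched in Python 3 syntax as follows; match_pattern is a made-up helper, not something Make provides.

```python
def match_pattern(pattern, filename):
    """Return the stem that '%' matches, or None if there is no match."""
    prefix, suffix = pattern.split("%", 1)
    long_enough = len(filename) > len(prefix) + len(suffix)  # stem is non-empty
    if long_enough and filename.startswith(prefix) and filename.endswith(suffix):
        return filename[len(prefix):len(filename) - len(suffix)]
    return None

stem = match_pattern("%.csv", "hydroxyl_422.csv")
prerequisite = "%.dat".replace("%", stem)
print(stem)
print(prerequisite)
```

One rule written with "%.csv" and "%.dat" therefore covers every data file in the project, rather than needing one rule per file.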
Adding More Dependencies
- Now create a summary for each set of experiments
- Use summarize to combine data from hydroxyl_422.csv and hydroxyl_480.csv
- Output is hydroxyl_all.csv
- Perform same calculation for methyl files
- Updated Makefile is a simple extension of what we've seen before:
Tidying Up
- What happens when this file is executed for the first time?
- Make automatically removes intermediate files created by pattern rules when it's done
- Question: how do you prevent this?
Defining Macros
Passing Values to Make
- Sometimes useful to pass values into Make when invoking it
- E.g., change the input directory
- Instead of editing the Makefile, specify name=value pairs on the command line
- Define a macro with the default value
- Override it when you want to
- So: make -f macro.mk sets INPUT_DIR to /lab/gamma2100
- But make INPUT_DIR=/newlab -f macro.mk uses /newlab
- Make also looks at environment variables
- You can refer to ${HOME} in a Makefile without having defined it
VAL = original
echo :
@echo "VAL is" ${VAL}
$ make -f env.mk echo
VAL is original
$ make VAL=changed -f env.mk echo
VAL is changed
Functions
- GNU Make has many built-in functions
- Not part of the standard, but GNU Make is the most widely used version around
- Example: use addprefix and addsuffix to build a list of filenames
- Turn hydroxyl into /tmp/hydroxyl_all.csv and methyl into /tmp/methyl_all.csv
INPUT_DIR = /lab/gamma2100
OUTPUT_DIR = /tmp
CHEMICALS = hydroxyl methyl
SUMMARIES = $(addprefix ${OUTPUT_DIR}/,$(addsuffix _all.csv,${CHEMICALS}))
all : ${SUMMARIES}
${OUTPUT_DIR}/%_all.csv : ${OUTPUT_DIR}/%_422.csv ${OUTPUT_DIR}/%_480.csv
@summarize $^ > $@
${OUTPUT_DIR}/%.csv : ${INPUT_DIR}/%.dat
@dat2csv $< > $@
clean :
@rm -f *.csv
Commonly-Used Functions
| Function | Purpose |
|---|---|
| $(addprefix prefix,filenames) | Add a prefix to each filename in a list |
| $(addsuffix suffix,filenames) | Add a suffix to each filename in a list |
| $(dir filenames) | Extract the directory name portion of each filename in a list |
| $(filter pattern,text) | Keep words in text that match pattern |
| $(filter-out pattern,text) | Keep words in text that don't match pattern |
| $(patsubst pattern,replacement,text) | Replace words in text that match pattern with replacement |
| $(sort text) | Sort the words in text, removing duplicates |
| $(strip text) | Remove leading and trailing whitespace from text |
| $(subst from,to,text) | Replace from with to in text |
| $(wildcard pattern) | Create a list of filenames that match a pattern |
Table 6.2: Commonly-Used Functions
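For readers who find list-processing code easier to follow than Make's syntax, the sketch below imitates three of these functions in Python 3 syntax. The Python functions are written for this illustration only; they approximate what the Make functions do on simple lists of words.

```python
def addprefix(prefix, names):
    """Imitates $(addprefix prefix,names)."""
    return [prefix + name for name in names]

def addsuffix(suffix, names):
    """Imitates $(addsuffix suffix,names)."""
    return [name + suffix for name in names]

def make_sort(words):
    """Imitates $(sort text): sorted order, duplicates removed."""
    return sorted(set(words))

chemicals = ["hydroxyl", "methyl"]
summaries = addprefix("/tmp/", addsuffix("_all.csv", chemicals))
print(summaries)
print(make_sort(["methyl", "hydroxyl", "methyl"]))
```

As in the Makefile that defines SUMMARIES, the calls nest: the suffix is added first, and the prefix is then applied to the result.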
Pros and Cons
- Pro
- Simple things are simple to do…
- …and not too difficult to read…
- …especially compared to the alternatives
- Con
- The syntax is unpleasant
- Complex things are difficult to read…
- …and even more difficult to debug
- Best you can do is use echo to print things as Make executes
- Not really very portable
- Hands commands to the shell for execution
- But commands use different flags on different operating systems
- Do you use del or rm to delete files?
Alternatives
- Ant: primarily for Java, but equivalent tools now exist for .NET
- Less platform-dependent, but just as hard to read and debug
- Integrated development environments
- Most hide the details in idiosyncratic configuration files
- Even harder than Makefiles to customize if you're not using the GUI
- SCons
- Lets users describe dependencies and actions in a real programming language
- More powerful and debuggable, but steeper learning curve
- Once builds are automated, the next step is to run them continuously
- Every time someone checks something into version control, rebuild the software (or site), and re-run tests
- See CruiseControl and Bitten
Summary
- Two rules for healthy software projects:
- Every repetitive task is done through the build system
- Never commit anything to the version control repository that breaks the build
- Remember: a Makefile is a program
- So give your build the same careful attention you'd give any other programming problem
Exercises
Exercise 6.1:
Make gets definitions from environment variables,
command-line parameters, and explicit definitions in Makefiles.
What order does it check these in?
Basic Scripting
Introduction
- Two things determine time to solution:
- How long it takes to write a program (human time)
- How long it takes that program to run (machine time)
- Different languages make different tradeoffs between these [Prechelt 2000]
- High-level languages let programmers express their thoughts more quickly…
- …but the more abstract the language, the more slowly it runs
![[Human Time vs. Machine Time]](./img/py01/human_vs_machine_time.png)
Figure 7.1: Human Time vs. Machine Time
- This series of lectures introduces a versatile high-level language called Python
- A good way to build tools and crunch data
You Can Skim This Lecture If...
- You know Python, Perl, Ruby, Tcl, or Rexx
- Same basic ideas
- Different syntax
Python's Strengths
- As flexible as the shell, but with real data structures
- Don't have to squeeze everything into lists of strings
- Freely available for many platforms
- Widely used and well documented
- Much easier to learn and read than Perl
- Material that took three days to teach in Perl took only two to teach in Python
- Follow-up surveys showed significantly higher retention rates
Python's Weaknesses
- Nimble languages like Python are slower than sturdy languages like Fortran, C/C++, or Java
- Factor of 20 is common
- But it's relatively easy to call libraries in those other languages from Python
- Doesn't have as many numerical or statistical tools as MATLAB
- But its Numeric package isn't bad
- Not as widely used as Perl
Why Another Language?
- This course isn't really about Python
- It's about solving software engineering problems
- But we have to write the examples in something
- Might as well choose something useful
- And it puts everyone on an equal footing
- For more information:
Execution Cycle
- Sturdy languages use a two-step execution cycle
- Compile source code, putting machine-oriented form in file
- Run the contents of that file on top of an operating system or virtual machine
- Nimble languages combine these two steps
- Compiler and virtual machine are the same program
- Load source code, translate into more compact form if necessary, and execute
![[Sturdy vs. Nimble Execution]](./img/py01/sturdy_vs_nimble.png)
Figure 7.2: Sturdy vs. Nimble Execution
- This is why sturdy programs run faster, but nimble programs are faster to write
- Compiling gives the computer a chance to optimize
- Load-and-go makes human turnaround faster
- And as we'll see in More on Objects, it permits some powerful high-level programming tricks
Running Python Programs
- Interactively, like the shell
$ python
Python 2.4.2 (#67, Sep 28 2005, 12:41:11) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> print 124/28
4
>>> print 124.0/28.0
4.4285714285714288
>>> ^D
"^D" represents control-D, which is Unix's way of saying “end of input”
- Obviously don't have to retype program every time you want to run it
Execution Shortcuts
- On Unix, make #!/usr/bin/python the first line of the program
- This tells Unix to execute /usr/bin/python with the rest of the file as its input
- Of course, this doesn't work if Python is installed somewhere else
- Use which python to find out
- Better: use #!/usr/bin/env python as the first line
- On Windows, associate .py files with Python
- Happens automatically when you run the Python Windows installer
- Double-clicking on anything ending in .py will then run it
Variables
- Variables are names for values
- No types: a variable is just a name, and can refer to different types of values at different times
Possible Mistakes
- Must give a variable a value before using it
planet = "Sedna"
print plant # note the misspelling
Traceback (most recent call last):
File "lec/inc/py01/undefined_var.py", line 2, in ?
print plant # note the misspelling
NameError: name 'plant' is not defined
- Unlike some languages, Python doesn't provide a default value
- Because doing so can hide a lot of errors
- Note: anything from "#" to the end of the line is a comment
- Variables don't have types, but values do
- Python complains if you try to operate on incompatible values
x = "two" # "two" is a string
y = 2 # 2 is an integer
print x * y # multiplying a string concatenates it repeatedly
print x + y # but you can't add an integer and a string
twotwo
Traceback (most recent call last):
File "lec/inc/py01/add_int_str.py", line 4, in ?
print x + y # but you can't add an integer and a string
TypeError: cannot concatenate 'str' and 'int' objects
Printing
- The print statement prints zero or more values to standard output
- Automatically puts a newline at the end
- So print on its own just prints a blank line
- Putting a comma at the end of the line suppresses the newline
planet = "Pluto"
num_moons = 1
moon = "Charon"
print planet, "has", num_moons, "satellite",
print "and its name is", moon
Pluto has 1 satellite and its name is Charon
Quoting
- Use either single or double quotes to create strings
- Each string must start and end with the same kind of quote
- But different strings in the same program can use different kinds of quotes
print "He said, \"It ain't what you know, it's what you can.\""
He said, "It ain't what you know, it's what you can."
- Use triple quotes (of either kind) to create a multi-line string
print "Sedna was discovered in 2004"
print 'It takes 10,500 years to circle the sun.'
print '''The tiny world may be part of the Oort Cloud,
a shell of icy proto-comets left over from
the formation of the Solar System.'''
Converting Values to Strings
- The built-in function str converts things to strings
print "Diameter: " + str(1280) + "-" + str(1760) + " km"
Diameter: 1280-1760 km
- Use
int, float, etc. to convert values to other types
print int(12.3)
print float(4)
12
4.0
- Send comments
Escape Sequences
- Use escape sequences to put special characters in strings
- Borrowed directly from C
- Most common are tab "\t" and newline "\n"
| Expression | Meaning |
|---|---|
| \\ | backslash |
| \' | single quote |
| \" | double quote |
| \b | backspace |
| \n | newline |
| \r | carriage return |
| \t | tab |
Table 7.1: Character Escape Sequences
Numbers
- 14 is an integer (32 bits long on most machines)
- 14.0 is double-precision floating point (64 bits long)
- 1+4j is a complex number (two 64-bit values)
- Use x.real and x.imag to get the real and imaginary parts
- 123456789L is a long integer
- Arbitrary length: uses as much memory as it needs to
- Operations are several times slower
Arithmetic
- Python borrows C's numeric operators
- And adds ** for exponentiation
| Name | Operator | Use | Value | Notes |
|---|---|---|---|---|
| Addition | + | 35 + 22 | 57 | |
| | | 'Py' + 'thon' | 'Python' | |
| Subtraction | - | 35 - 22 | 13 | |
| Multiplication | * | 3 * 2 | 6 | |
| | | 'Py' * 2 | 'PyPy' | 2 * 'Py' also gives 'PyPy' |
| Division | / | 3.0 / 2 | 1.5 | |
| | | 3 / 2 | 1 | Integer division rounds down: -3 / 2 is -2, not -1 |
| Exponentiation | ** | 2 ** 0.5 | 1.4142135623730951 | |
| Remainder | % | 13 % 5 | 3 | |
Table 7.2: Numeric Operators in Python
- Python also has C's in-place operators
- x += 3 does the same thing as x = x + 3
- 5 += 3 is an error, since you can't assign a new value to 5…
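A few of the operator rules above can be checked directly. The following is a minimal sketch, not part of the original notes; the variable names are illustrative, and it uses // for integer division so that it behaves the same on the Python 2 of these notes and on newer interpreters.

```python
# Repetition works with the string on either side of the operator.
left_repeat = 'Py' * 2
right_repeat = 2 * 'Py'

# Floor division rounds toward negative infinity, not toward zero.
floor_positive = 3 // 2
floor_negative = -3 // 2

# In-place addition is shorthand for x = x + 3.
x = 10
x += 3

print(left_repeat)
print(floor_negative)
print(x)
```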
Booleans
- True and False are true and false (d'oh)
- Empty string and 0 are considered false
- Just as 3 is equivalent to 3.0
- (Almost) everything else is true
- Combine Booleans using and, or, not
| Expression | Result | Notes |
|---|---|---|
| True or False | True | |
| True and False | False | |
| 'a' or 'b' | 'a' | or is true if either side is true, so it stops after evaluating 'a' |
| 'a' and 'b' | 'b' | and is only true if both sides are true, so it doesn't stop until it has evaluated 'b' |
| 0 or 'b' | 'b' | 0 is false, but 'b' is true |
| 0 and 'b' | 0 | Since 0 is false, Python can stop evaluating there |
| 0 and (1/0) | 0 | 1/0 would be an error, but Python never gets that far |
| (x and 'set') or 'not set' | It depends | If x is true, this expression's value is 'set'; if x is false, it is 'not set' |
Table 7.3: Boolean Operators in Python
Short-Circuit Evaluation
- and and or are short-circuit operators
- Evaluate expressions left to right
- Stop as soon as answer is known
- Result is the last thing evaluated (rather than True or False)
![[Short-Circuit Evaluation]](./img/py01/short_circuit_eval.png)
Figure 7.5: Short-Circuit Evaluation
- Can be used to create expressions like val = cond and left or right
- If cond is true, val is assigned left (provided left is itself true)
- If cond is false, val is assigned right
- It works, but it's hard to read
- Clever isn't always smart
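One reason to be wary of the cond and left or right idiom is that it quietly gives the wrong answer when left is a false value. The sketch below illustrates this; the helper name and_or is hypothetical, and the last line assumes an interpreter that supports conditional expressions (Python 2.5 or later), which these notes do not otherwise rely on.

```python
def and_or(cond, left, right):
    """The 'cond and left or right' idiom, wrapped in a function."""
    return cond and left or right

# Works as expected when left is a true value.
typical = and_or(True, 'set', 'not set')
fallback = and_or(False, 'set', 'not set')

# Surprise: cond is true, but left is 0 (false), so 'or' moves on to right.
surprise = and_or(True, 0, 'not set')

# A conditional expression picks by cond alone, whatever left's value is.
safe = 0 if True else 'not set'

print(surprise)
print(safe)
```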
Comparisons
- Python borrows C's comparison operators, too
- But allows you to chain comparisons together, just as in mathematics
| Expression | Value |
|---|---|
| 3 < 5 | True |
| 3.0 < 5 | True |
| 3 != 5 | True |
| 3 == 5 | False |
| 3 < 5 <= 7 | True |
| 3 < 5 >= 2 | True (but please don't write this; it's hard to read) |
| 3+2j < 5 | Error: can only use == and != on complex numbers |
Table 7.4: Comparison Operators in Python
- Note the difference between assignment and testing for equality
- Use a single equals sign = for assignment
- Use a double equals sign == to test if two things have equal values
String Comparisons
- Characters are encoded as numbers: digits come before uppercase letters, all of which come before lowercase letters
- Punctuation is mixed in between, just to make matters difficult
- Strings are compared character by character from first to last until:
- One character is less than another
- One string runs out of characters
| Expression | Value |
|---|---|
| 'abc' < 'def' | True |
| 'abc' < 'Abc' | False |
| 'ab' < 'abc' | True |
| '0' < '9' | True |
| '100' < '2' | True |
Table 7.5: String Comparisons in Python
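The '100' < '2' case matters in practice when numbers are read from a file as text. As a rough illustration (the readings list is made up for this sketch), sorting the strings directly compares them character by character, while converting with int compares them as numbers:

```python
readings = ['100', '2', '30']

# As strings: '1' < '2' < '3', so '100' sorts first.
as_text = sorted(readings)

# As numbers: each item is converted with int before comparing.
as_numbers = sorted(readings, key=int)

# ord gives the numeric code of a single character:
# digits, then uppercase letters, then lowercase letters.
codes = [ord('0'), ord('A'), ord('a')]

print(as_text)
print(as_numbers)
print(codes)
```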
Conditionals
- Python uses if, elif (not else if), and else
- Use a colon and indentation to show nesting
a = 3
if a < 0:
    print 'less'
elif a == 0:
    print 'equal'
else:
    print 'greater'
greater
Why Indentation?
- Why doesn't Python use begin/end or {…}?
- Because studies showed that indentation is what everyone actually pays attention to
- Just count the number of warnings in C/Java books about misleading indentation
- Doesn't matter how much you use, but:
- Everything in the block must be indented the same amount
- And most people use four spaces
- Please do not use tabs
- Python interprets them as (up to) eight characters
- But different editors may display them differently
While Loops
- Do something repeatedly as long as some condition is true
- Again, use colon and indentation to show nesting
num_moons = 3
while num_moons > 0:
    print num_moons
    num_moons -= 1
3
2
1
- Do the “something” zero times if the condition is false the first time it is tested
print 'before'
num_moons = -1
while num_moons > 0:
    print num_moons
    num_moons -= 1
print 'after'
before
after
- If the condition is always true, the loop never ends
num_moons = 3
while num_moons > 0:
    print num_moons
    # oops --- forgot to subtract one
3
3
3
⋮
Break and Continue
- Can break out of the middle of a loop using break
num_moons = 3
while True: # Looks like an infinite loop...
    print num_moons
    num_moons -= 1
    if num_moons <= 1:
        break # ...but there's a way out
3
2
- Can skip to the next iteration using continue
num_moons = 5
while num_moons > 0:
    print 'top:', num_moons
    num_moons -= 1
    if (num_moons % 2) == 0:
        continue
    print '...bottom:', num_moons
top: 5
top: 4
...bottom: 3
top: 3
top: 2
...bottom: 1
top: 1
- Don't abuse these
- A single test at the top of the loop makes code much easier to read
String Formatting
- Python's % operator formats output
- Left side is a string specifying how things are to be formatted
- Right side is the values to be formatted
- One value on its own…
- …or several values in parentheses, separated by commas
- 'here %s go' % 'we' creates "here we go"
- The "%s" in the left string means “insert a string here”
- Creates a new string (since strings are immutable)
Format Specifiers
- 'left %d right %d' % (-1, 1) creates "left -1 right 1"
- "%d" stands for “decimal integer”
- '%04d' % 13 creates "0013"
- The leading zero means “pad with zeroes”
- The 4 means “at least 4 characters wide”
- '[%-4d]' % 13 creates "[13  ]"
- The minus sign means “left-justify within the field”
- '%6.4f %%' % 37.2 creates "37.2000 %"
- A floating point number, at least six characters wide, padded on the left with spaces, with four digits after the decimal point
- "%%" is translated into a single "%"
- Just as "\\" is how you represent a single "\" in a string
Supported Formats
| Format | Meaning | Example | Output |
|---|---|---|---|
| "d" | Signed decimal integer | '%d %d' % (13, 15) | "13 15" |
| "o" | Unsigned octal (base-8) | '%o %o' % (13, 15) | "15 17" |
| "x" | Lower case unsigned hexadecimal (base-16) | '%x %x' % (13, 15) | "d f" |
| "X" | Upper case unsigned hexadecimal (base-16) | '%X %X' % (13, 15) | "D F" |
| "e" | Lower case exponential floating point | '%e' % 123.45 | "1.234500e+02" |
| "E" | Upper case exponential floating point | '%E' % 123.45 | "1.234500E+02" |
| "f" | Decimal floating point (six digits after the decimal point by default) | '%f' % 123.45 | "123.450000" |
| "s" | String (converts other types using str()) | '%s %s %s' % ('nickel', 28, 58.69) | "nickel 28 58.69" |
Table 7.6: String Formats in Python
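Width, padding, and precision are easiest to absorb by trying them. The snippet below is a small sketch that exercises several of the specifiers described in these notes; the variable names are chosen only for illustration.

```python
padded = '%04d' % 13              # zero-padded to four characters
left_justified = '[%-4d]' % 13    # left-justified in a four-character field
percent = '%6.4f %%' % 37.2       # four digits after the point, literal %
default_float = '%f' % 123.45     # six digits after the point by default
exponent = '%e' % 123.45
hexadecimal = '%x %X' % (255, 255)
mixed = '%s has %d moon' % ('Pluto', 1)

print(padded)
print(left_justified)
print(percent)
print(default_float)
print(mixed)
```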
Summary
- Scripting languages are increasingly popular because they optimize human time
- Which is now more expensive than machine time in all but a handful of cases
- Python is one of the cleanest scripting languages around
- A good tool in its own right
- An excellent way to build other tools
Strings, Lists, and Files
Introduction
- To recap:
- Python is a nimble language
- Ideal for building tools and crunching data
- Has the usual data types and control flow constructs
- This lecture describes:
- After this lecture, you will be able to use Python to crunch simple data formats
You Can Skip This Lecture If...
- You know that you can't modify a string in place
- You know what str[1:-1] means
- You know what a method is
- You know how to read data from a file line by line
Strings
- A string is an immutable sequence of characters
- Sequence means that it can be indexed
- Indices start at 0 (as in C, Java, and C#)
- So text[0] is the first character of text
- The built-in function len returns the length of a string
- So the last character of text has index len(text)-1
element = "boron"
i = 0
while i < len(element):
    print element[i]
    i += 1
b
o
r
o
n
- Note: there is no separate data type for characters
- A character is simply a string of length 1
Immutability
- Immutable means that it cannot be modified once it has been created
- Why?
- Of course, you can assign a new string value to a variable
element = 'gold'
print 'element is', element
element = 'lead'
print 'element is now', element
element is gold
element is now lead
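To see immutability in action, try assigning to a single character. This is a minimal sketch (the outcome variable is just a label for the result): the assignment is rejected with a TypeError, and the usual workaround is to build a new string and rebind the variable.

```python
element = 'gold'

# Item assignment on a string is rejected with a TypeError.
try:
    element[0] = 'b'
    outcome = 'modified in place'
except TypeError:
    outcome = 'TypeError'

# Build a new string instead and rebind the variable to it.
element = 'b' + element[1:]

print(outcome)
print(element)
```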
Slicing
- text[start:end] takes a slice out of text
- Creates a new string containing the characters of text from start up to (but not including) end
element = "helium"
print element[1:3], element[:2], element[4:]
el he um
- Sometimes helps to think of indices as being between elements
![[Visualizing Indices]](./img/py02/indices.png)
Figure 8.1: Visualizing Indices
Bounds Checking
Negative Indices
- Negative indices count backward from the end of the string
- x[-1] is the last character
- x[-2] is the second-to-last character
element = "carbon"
print element[-2], element[-4], element[-6]
o r c
- A lot easier to read than x[len(x)-1]
- Again, it helps to visualize the indices as lying between the characters
![[Visualizing Negative Indices]](./img/py02/negative_indices.png)
Figure 8.2: Visualizing Negative Indices
Consequences
- text[1:2] is either:
- The second character in text…
- …or the empty string (if text doesn't have a second character)
- So text[2:1] is always the empty string
- So is text[1:1]
- From index 1, up to but not including index 1
- text[1:-1] is everything except the first and last characters
- Which may again be the empty string
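These slicing rules can be confirmed with a short experiment. The sketch below uses made-up strings; it also contrasts slicing, which silently truncates, with plain indexing, which raises IndexError when the index is out of range.

```python
text = 'ab'
short = 'a'

second_char = text[1:2]       # the second character
missing = short[1:2]          # no second character: empty string, no error
backwards = text[2:1]         # start is past end: empty string
empty_range = text[1:1]       # from 1 up to but not including 1
inner = 'helium'[1:-1]        # everything except first and last
inner_of_short = text[1:-1]   # nothing left in the middle

# Indexing, unlike slicing, does check bounds.
try:
    short[1]
    indexing = 'ok'
except IndexError:
    indexing = 'IndexError'

print(inner)
print(indexing)
```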
Methods
- A method is a function that is tied to a particular object
- Almost everything in Python has methods
- Numbers are the only important exception
- To call a method meth of object obj, type obj.meth()
String Methods
| Method | Purpose | Example | Result |
|---|---|---|---|
| capitalize | Capitalize first letter of string. | "text".capitalize() | "Text" |
| lower | Convert all letters to lowercase. | "aBcD".lower() | "abcd" |
| upper | Convert all letters to uppercase. | "aBcD".upper() | "ABCD" |
| strip | Remove leading and trailing whitespace (blanks, tabs, newlines, etc.). | " a b ".strip() | "a b" |
| lstrip | Remove whitespace at left (leading) edge of string. | " a b ".lstrip() | "a b " |
| rstrip | Remove whitespace at right (trailing) edge of string. | " a b ".rstrip() | " a b" |
| count | Count how many times one string appears in another. | "abracadabra".count("ra") | 2 |
| find | Return the index of the first occurrence of one string in another, or -1. | "abracadabra".find("ra") | 2 |
| | | "abracadabra".find("xyz") | -1 |
| replace | Replace occurrences of one string with another. | "abracadabra".replace("ra", "-") | "ab-cadab-" |
Table 8.1: String Methods
Notes on String Methods
- These methods don't have to be called on constant strings
- In fact, they usually aren't
element = 'helium'
print element.upper()
print element.replace('el', 'afn')
print 'element after calls:', element
HELIUM
hafnium
element after calls: helium
- These methods create new strings
- They cannot change the strings they're called on because strings are immutable
Chaining Method Calls
- Method calls can be chained together
- If the result of one method call is an object, you can immediately call a method on it
element = "cesium"
print ':' + element.upper()[4:7].center(10) + ':'
:    UM    :
- Use this in moderation
- Long chains of method calls are hard to read and debug
Testing for Membership
- Use in to check whether one string appears in another
- Simpler than the find method
- But it doesn't tell you where the substring occurs
print "ant" in "tantalum"
print "mat" in "tantalum"
True
False
Lists
- A list is a mutable sequence of objects
- Mutable means that, unlike a string, it can be changed in place
- Of objects means that lists can hold anything and everything
- Think of it as a one-dimensional array, or vector, that automatically resizes itself as needed
- Write lists by putting values in square brackets
- The empty list is written []
- Index and slice as you would a string
- As with strings, Python checks bounds when indexing, but truncates when slicing
gases = ['He', 'Ne', 'Ar', 'Kr']
print gases
print gases[0], gases[-1]
['He', 'Ne', 'Ar', 'Kr']
He Kr
Modifying Lists
- Assign a new value to a list element using x[i] = v
gases = ['He', 'Ne', 'Ar', 'Kr']
print 'before:', gases
gases[0] = 'H'
gases[-1] = 'Xe'
print 'after:', gases
before: ['He', 'Ne', 'Ar', 'Kr']
after: ['H', 'Ne', 'Ar', 'Xe']
- The slot must already exist
$ python
>>> gases = ['He', 'Ne', 'Ar', 'Kr']
>>> print 'before:', gases
before: ['He', 'Ne', 'Ar', 'Kr']
>>> gases[10] = 'Ra'
IndexError: list assignment index out of range
- Use append to add an element to the end of a list
- Grows the list as needed
characters = []
print characters
for c in 'aeiou':
    characters.append(c)
    print characters
[]
['a']
['a', 'e']
['a', 'e', 'i']
['a', 'e', 'i', 'o']
['a', 'e', 'i', 'o', 'u']
Concatenation
- Adding strings (or lists) creates a new string (or list) with all the content of the originals
element = 'carbon'
mass = '14'
print element + '-' + mass
lanthanides = ['Ce', 'Pr', 'Nd']
actinides = ['Th', 'Pa', 'U']
all_elements = lanthanides + actinides
print all_elements
carbon-14
['Ce', 'Pr', 'Nd', 'Th', 'Pa', 'U']
- Can't concatenate a string and a list
- But list(text) creates a list whose elements are the characters of the string text
water = 'H2O'
print 'before conversion:', water
water = list(water)
print 'after conversion:', water
before conversion: H2O
after conversion: ['H', '2', 'O']
Deleting List Elements
- del deletes a list element
- Shortens the list (which can cause problems if you're looping over it at the time)
organics = ['H', 'C', 'O', 'N']
print 'original:', organics
del organics[2]
print 'after deleting item 2:', organics
del organics[-2:]
print 'after deleting the last two remaining items:', organics
original: ['H', 'C', 'O', 'N']
after deleting item 2: ['H', 'C', 'N']
after deleting the last two remaining items: ['H']
- Can delete slices, too
organics = ['H', 'C', 'O', 'N']
print 'original:', organics
del organics[1:-1]
print 'after deleting the middle:', organics
original: ['H', 'C', 'O', 'N']
after deleting the middle: ['H', 'N']
- del is a statement, not an operator
- Doesn't “return” the modified list
List Methods
- Like strings, lists are objects, and have methods
- In the examples below, metals is initially ['gold', 'iron', 'lead', 'gold']
| Method | Purpose | Example | Result |
|---|---|---|---|
| append | Add to the end of the list. | metals.append('tin') | ['gold', 'iron', 'lead', 'gold', 'tin'] |
| count | Count how many times something appears in the list. | metals.count('gold') | 2 |
| index | Find the first occurrence of something in the list. | metals.index('iron') | 1 |
| | | metals.index('sulfur') | Error: ValueError |
| insert | Insert something into the list. | metals.insert(2, 'silver') | ['gold', 'iron', 'silver', 'lead', 'gold'] |
| remove | Remove the first occurrence of something from the list. | metals.remove('gold') | ['iron', 'lead', 'gold'] |
| reverse | Reverse the list in place. | metals.reverse() | ['gold', 'lead', 'iron', 'gold'] |
| sort | Sort the list in place. | metals.sort() | ['gold', 'gold', 'iron', 'lead'] |
Table 8.2: List Methods
Notes on List Methods
- index reports an error if the item can't be found
- reverse and sort change the list, and return None
- None is the object equivalent of zero
- Like 0 and the empty string, it is equivalent to False
- x = x.reverse() is a common error
- It reverses x, but then sets x to None, so all data is lost
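The following sketch shows these behaviours with a small made-up list: index raising ValueError for a missing item, sort returning None while changing the list in place, and the x = x.reverse() mistake throwing the data away.

```python
metals = ['gold', 'iron', 'lead', 'gold']

position = metals.index('iron')

# index raises ValueError rather than returning -1.
try:
    metals.index('sulfur')
    lookup = 'found'
except ValueError:
    lookup = 'ValueError'

# sort changes the list itself and returns None.
returned = metals.sort()

# The common error: the reversed list is lost, and x ends up as None.
x = ['a', 'b', 'c']
x = x.reverse()

print(metals)
print(returned)
print(x)
```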
For Loops
- Python's for loops over the content of a collection (such as a string or list)
- for c in some_string assigns c each character of some_string
- for v in some_list assigns v each value of some_list
for c in 'lead':
    print '/' + c + '/',
print
for v in ['he', 'ar', 'ne', 'kr']:
    print v.capitalize()
/l/ /e/ /a/ /d/
He
Ar
Ne
Kr
- Of course, you can use any name you like for the loop index variable
- Why loop over content rather than indices? Because it's usually what you want to do
Ranges
- The built-in function range creates the list [start, start+1, ..., end-1]
- Stops at end-1 to be consistent with x[start:end]
print 'up to 5:', range(5)
print '2 to 5:', range(2, 5)
print '2 to 10 by 2:', range(2, 10, 2)
print '10 to 2:', range(10, 2)
print '10 to 2 by -2:', range(10, 2, -2)
up to 5: [0, 1, 2, 3, 4]
2 to 5: [2, 3, 4]
2 to 10 by 2: [2, 4, 6, 8]
10 to 2: []
10 to 2 by -2: [10, 8, 6, 4]
- Note the special cases range(end) and range(start, end, step)
- Note also that range may generate an empty list
Ranged Loops
- To loop from 0 to N-1, use for i in range(N)
- To loop over the indices of a list or string, use for i in range(len(sequence))
element = 'sulfur'
for i in range(len(element)):
    print i, element[i]
0 s
1 u
2 l
3 f
4 u
5 r
Membership
- x in c works element-by-element on lists
- So 3 in [1, 2, 3, 4] is True
- But [2, 3] in [1, 2, 3, 4] is False
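The difference between list membership and string membership is worth seeing side by side. In this sketch (variable names are illustrative), a list is only "in" another list if it is itself one of the elements, whereas a string is "in" another string whenever it occurs as a substring.

```python
values = [1, 2, 3, 4]

single = 3 in values                  # 3 is one of the elements
sublist = [2, 3] in values            # no element is the list [2, 3]
nested = [2, 3] in [1, [2, 3], 4]     # here [2, 3] is an element
substring = 'ant' in 'tantalum'       # strings match substrings

print(single)
print(sublist)
print(nested)
print(substring)
```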
Nesting Lists
- Lists can contain other lists
- E.g., use a list containing two lists to represent a line
![[Line Segment]](./img/py02/line_segment.png)
Figure 8.3: Line Segment
- Indexing from left to right selects elements from the outside in
elements = [['H', 'Li', 'Na'], ['F', 'Cl']]
print 'first item in outer list:', elements[0]
print 'second item of second sublist:', elements[1][1]
first item in outer list: ['H', 'Li', 'Na']
second item of second sublist: Cl
Aliasing
- Nested lists are objects in their own right
- The outer list stores a reference to the inner list
- But the inner list does not know that it's being referred to
- Subscripting the outer list creates an alias for the inner list
- Another name for the same data
- Changes made through either reference update the same data
elements = [['H', 'Li'], ['F', 'Cl']]
gases = elements[1]
print 'before'
print 'elements:', elements
print 'gases:', gases
gases[1] = 'Br'
print 'after'
print 'elements:', elements
before
elements: [['H', 'Li'], ['F', 'Cl']]
gases: ['F', 'Cl']
after
elements: [['H', 'Li'], ['F', 'Br']]
![[Aliasing In Action]](./img/py02/aliasing.png)
Figure 8.4: Aliasing In Action
Indexing vs. Slicing
- Indexing and slicing return different types of things for lists
- Indexing a list returns a reference to the list element
- Slicing returns a new list containing the selected elements of the original list
- Changes to a slice do not affect the original list
metals = ['Cr', 'Mn', 'Fe', 'Co', 'Ni', 'Cu', 'Zn']
middle = metals[2:-2]
print 'before'
print 'metals:', metals
print 'middle:', middle
middle[0] = 'Al'
del middle[1]
print 'after'
print 'metals:', metals
print 'middle:', middle
before
metals: ['Cr', 'Mn', 'Fe', 'Co', 'Ni', 'Cu', 'Zn']
middle: ['Fe', 'Co', 'Ni']
after
metals: ['Cr', 'Mn', 'Fe', 'Co', 'Ni', 'Cu', 'Zn']
middle: ['Al', 'Ni']
![[Slicing Lists]](./img/py02/slice_copy.png)
Figure 8.5: Slicing Lists
- Note that copying only goes one level deep
- Don't have to worry about this with strings, since they're immutable
- Draw pictures when you have to
- And if your pictures are complicated, simplify your code
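"Copying only goes one level deep" can be made concrete with nested lists. The sketch below is illustrative only: slicing the outer list gives a new outer list, but the inner lists are still shared, so changes made through the copy to an inner list show up in the original.

```python
elements = [['H', 'Li'], ['F', 'Cl']]

alias = elements        # another name for the same outer list
shallow = elements[:]   # new outer list holding the same inner lists

shallow.append(['Ne'])  # changes only the copy's outer list
shallow[1][1] = 'Br'    # reaches through to the shared inner list

print(elements)
print(shallow)
```

After this runs, elements has gained the 'Br' but not the ['Ne'], which is exactly the situation where drawing a picture helps.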
Tuples
- Python has a second type of list, called a tuple
- Just like a normal list, but immutable (i.e., can't be changed after creation)
- Written using parentheses instead of square brackets: (1, 2, 3) instead of [1, 2, 3]
- Empty tuple is ()
- Tuple with one element must be written with a comma, as in (55,)
- Because (55) has to be just the integer 55, or the mathematicians will get upset
- Why? Because there are times when Python needs to know that a sequence's values aren't going to change
- One of Python's few warts…
Multi-Valued Assignment
- Don't actually need the parentheses around a tuple
- 1, 2, 3 is the same as (1, 2, 3)
- Allows multi-valued assignment
- left, right = "gold", "lead" assigns "gold" to left, and "lead" to right
- Python converts lists to tuples when necessary
- left, middle, right = ["gold", "iron", "lead"] works
- left, right = ["gold", "iron", "lead"] doesn't
- Number of targets on the left must match the number of values on the right
- Often used to exchange values
- left, right = right, left does a safe swap
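A short sketch of multi-valued assignment, including what happens when the counts don't match (Python raises ValueError). The names result, first, middle, and last are introduced here only for illustration.

```python
left, right = "gold", "lead"

# Swap without a temporary variable.
left, right = right, left

# Unpacking a list works when the counts match...
first, middle, last = ["gold", "iron", "lead"]

# ...and raises ValueError when they don't.
try:
    a, b = ["gold", "iron", "lead"]
    result = 'unpacked'
except ValueError:
    result = 'ValueError'

print(left)
print(right)
print(result)
```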
Unpacking Structures in Loops
- Use multi-valued assignment in for loops to unpack structures on the fly
elements = [
    ['H', 'hydrogen', 1.008],
    ['He', 'helium', 4.003],
    ['Li', 'lithium', 6.941],
    ['Be', 'beryllium', 9.012]
]
for (symbol, name, weight) in elements:
    print name + ' (' + symbol + '): ' + str(weight)
hydrogen (H): 1.008
helium (He): 4.003
lithium (Li): 6.941
beryllium (Be): 9.012
- Two of the reasons nimble languages are productive are that they:
- Let you write complex data structures directly
- Let you take them apart easily
Files
- Use the built-in function open to open a file
- First argument is path
- Second is "r" (for read) or "w" (for write)
input_file = open('count_bytes.py', 'r')
content = input_file.read()
input_file.close()
print len(content), 'bytes in file'
121 bytes in file
- Result is a file object with methods for input and output
| Method | Purpose | Example |
|---|---|---|
| close | Close the file; no more reading or writing is allowed. | input_file.close() |
| read | Read N bytes from the file, returning the empty string once the end of the file has been reached. | next_block = input_file.read(1024) |
| | If N is not given, read the rest of the file. | rest = input_file.read() |
| readline | Read the next line of text from the file, returning the empty string once the end of the file has been reached. | line = input_file.readline() |
| readlines | Return the remaining lines in the file as a list, or an empty list at the end of the file. | rest = input_file.readlines() |
| write | Write a string to a file; write does not automatically append a newline. | output_file.write("Element 8: Oxygen") |
| writelines | Write each string in a list to a file (without appending newlines). | output_file.writelines(["H", "He", "Li"]) |
Table 8.3: File Methods
Copying a File
input_file = open('file.txt', 'r')
output_file = open('copy.txt', 'w')
line = input_file.readline()
while line:
    output_file.write(line)
    line = input_file.readline()
input_file.close()
output_file.close()
- First statement opens file.txt for reading, and assigns the file object to input_file
- Second statement opens copy.txt for writing, and assigns the file object to output_file
- Program then tries to read a line from input_file
- If the file is empty, line is assigned the empty string
- As long as there are lines in the input, the program:
- Writes the most recent line to output_file
- Reads the next line
- After the input file is exhausted, the program closes the files
- Python will close the files automatically when the program exits…
- …but it's good practice to tidy up your toys when you're done playing with them
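On interpreters newer than the one these notes were written for (the with statement arrived in Python 2.5/2.6), the tidying up can be left to the language. The sketch below is one possible variant, not the notes' own method: it copies a file line by line inside with blocks, which close each file automatically. The scratch directory and sample contents are invented so that the example does not touch real data.

```python
import os
import shutil
import tempfile

# Hypothetical scratch files so the sketch is self-contained.
work_dir = tempfile.mkdtemp()
source_path = os.path.join(work_dir, 'file.txt')
copy_path = os.path.join(work_dir, 'copy.txt')

with open(source_path, 'w') as output_file:
    output_file.write('hydrogen\nhelium\nlithium\n')

# Copy line by line; both files are closed when the blocks end,
# even if an error occurs part-way through.
with open(source_path, 'r') as input_file:
    with open(copy_path, 'w') as output_file:
        for line in input_file:
            output_file.write(line)

with open(copy_path, 'r') as check_file:
    copied = check_file.read()

# Remove the scratch directory now that the copy has been checked.
shutil.rmtree(work_dir)

print(len(copied))
```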
Looping Over Files
- Looping over an input file returns lines of text
input_file = open('count_lines.py', 'r')
count = 0
for line in input_file:
    count += 1
input_file.close()
print count, 'lines in file'
6 lines in file
- Each line includes its terminating newline (or carriage return + newline on Windows)
- Meaningless to loop over an output file
Other Ways To Copy Files
- Could also use this:
- Or this:
input_file = open('file.txt', 'r')
output_file = open('copy.txt', 'w')
for line in input_file:
    line = line.rstrip()
    print >> output_file, line
input_file.close()
output_file.close()
- print >> file sends print's output to a file
- Remember that it automatically appends an end-of-line marker
- Which is why the program above strips whitespace off the end of the string before printing it
Summary
- The basic features of most modern programming languages are the same
- Strings, lists, file I/O, …
- Only issue is how they're presented
- Python's syntax is clean and consistent
- You'll soon start to wonder why other languages still rely on curly braces…
Exercises
Exercise 8.1:
What does "aaaaa".count("aaa") return? Why?
Exercise 8.2:
What do each of the following five code fragments do? Why?
x = ['a', 'b', 'c', 'd']
x[0:2] = []

x = ['a', 'b', 'c', 'd']
x[0:2] = ['q']

x = ['a', 'b', 'c', 'd']
x[0:2] = 'q'

x = ['a', 'b', 'c', 'd']
x[0:2] = 99

x = ['a', 'b', 'c', 'd']
x[0:2] = [99]
Exercise 8.3:
What does 'a'.join(['b', 'c', 'd']) return? If you have
a list of strings, how can you concatenate them in a single statement?
Why do you think join is written this way, rather than as
['b', 'c', 'd'].join('a')?
Functions and Libraries
Introduction
- A language should not include everything anyone could ever want
- Instead, it should allow developers to express every abstraction they want [Steele 1999]
- Define functions to create higher-level operations
- Group them in libraries to keep them manageable
You Can Skip This Lecture If...
- You know how to define a function in Python
- You know what default parameter values are
- You know what import does
- You are familiar with the sys, math, and os libraries
Defining Functions
- Define a new function using def
- Parameter names follow in parentheses
def double(x):
    return x * 2
print double(5)
print double(['basalt', 'granite'])
10
['basalt', 'granite', 'basalt', 'granite']
- Cannot declare types for parameters
Returning Values
- Finish the function at any time using return
- Functions with return statements scattered through them are hard to understand
- Have to read the function line by line to figure out what it might do
- In general:
- Use early returns at the start of the function to handle special cases
- And then one return at the end to handle the general case
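As a rough illustration of that guideline, the hypothetical function below (average is not part of the notes) deals with its one special case up front and then has a single return for the general case. It returns a float in both cases, so callers don't need to test the type of the result.

```python
def average(values):
    # Special case handled up front with an early return.
    if not values:
        return 0.0
    # General case: one return at the end.
    total = 0.0
    for v in values:
        total += v
    return total / len(values)

print(average([]))
print(average([1, 2, 3]))
```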
Everything Returns Something
- Functions without explicit return statements return None
- And return on its own is the same as return None
def hello():
    print 'HELLO'
def world():
    print 'WORLD'
    return
print hello()
print world()
HELLO
None
WORLD
None
- The more consistent functions are about the types of the things they return, the better
- If a function can return None, an integer, or a list, the caller will have to write an if statement
Scope
- Python manages variables using a call stack
# Global variable.
rock_type = 'unknown'
# Function that creates local variable.
def classify(rock_name):
    if rock_name in ['basalt', 'granite']:
        rock_type = 'igneous'
    elif rock_name in ['sandstone', 'shale']:
        rock_type = 'sedimentary'
    else:
        rock_type = 'metamorphic'
    print 'in function, rock_type is', rock_type
# Call the function to prove that it uses its local 'rock_type'.
print "before function, rock_type is", rock_type
classify('sandstone')
print "after function, rock_type is", rock_type
before function, rock_type is unknown
in function, rock_type is sedimentary
after function, rock_type is unknown
![[Call Stack]](./img/py03/call_stack.png)
Figure 9.1: Call Stack
- When a function is called, Python creates a new stack frame
- A table of name/value pairs
- Parameters are just local variables that are automatically initialized
- When a variable is referenced, Python looks for it in:
- The top stack frame, then
- The global variables
Parameter Passing Rules
- Python copies variables' values when passing them to functions
- But remember: variables hold references to lists
- So the parameters are aliases
- Not an issue for strings, numbers, and Booleans, since they are immutable
def add_salt(first, second):
    first += "salt"
    second += ["salt"]
str = "rock"
seq = ["gneiss", "shale"]
print "before"
print "str is:", str
print "seq is:", seq
add_salt(str, seq)
print "after"
print "str is:", str
print "seq is:", seq
before
str is: rock
seq is: ['gneiss', 'shale']
after
str is: rock
seq is: ['gneiss', 'shale', 'salt']
![[Parameter Passing]](./img/py03/parameter_passing.png)
Figure 9.2: Parameter Passing
Making Copies
- To pass a copy of a list into a function, slice it
- values[:] is the same as values[0:len(values)]…
- …which is a slice of values that includes the entire list…
- …and slicing creates a new list
def add_salt(first, second):
    first += "salt"
    second += ["salt"]
str = "rock"
seq = ["gneiss", "shale"]
print "before"
print "str is:", str
print "seq is:", seq
add_salt(str, seq[:])
print "after"
print "str is:", str
print "seq is:", seq
before
str is: rock
seq is: ['gneiss', 'shale']
after
str is: rock
seq is: ['gneiss', 'shale']
![[Passing Slices]](./img/py03/passing_slices.png)
Figure 9.3: Passing Slices
Default Parameter Values
- You can specify default values for parameters when defining a function
- Just “assign” some value to the parameter in the definition
def total(values, start=0, end=None):
    # If no values given, total is zero.
    if not values:
        return 0
    # If no end specified, use the entire sequence.
    if end is None:
        end = len(values)
    # Calculate.
    result = 0
    for i in range(start, end):
        result += values[i]
    return result
- The parameters actually passed when the function is called are matched up left to right
numbers = [10, 20, 30]
print "numbers being added:", numbers
print "total(numbers, 0, 3):", total(numbers, 0, 3)
print "total(numbers, 2):", total(numbers, 2)
print "total(numbers):", total(numbers)
numbers being added: [10, 20, 30]
total(numbers, 0, 3): 60
total(numbers, 2): 30
total(numbers): 60
- All parameters with defaults must come after all parameters without them
- Otherwise, matching values to parameters would be ambiguous
Functions Are Objects
- A function is just another object
- Happens to be an object you can call, just as strings and lists happen to be objects you can index
- def is just a shorthand for “create a function, and assign it to a variable”
def circumference(r):
    return 2 * 3.14159 * r
circ = circumference
print 'circumference(1.0):', circumference(1.0)
print 'circ(2.0):', circ(2.0)
circumference(1.0): 6.28318
circ(2.0): 12.56636
![[Functions As Objects]](./img/py03/function_objects.png)
Figure 9.4: Functions As Objects
- This means you can:
- Redefine functions (just as you can reassign values to variables)
- Create aliases for functions
- Pass functions as parameters
- Store functions in lists
Function Object Examples
- Example: apply a function to each value in a list
def apply_to_list(function, values):
    result = []
    for v in values:
        temp = function(v)
        result.append(temp)
    return result
radii = [0.1, 1.0, 10.0]
print apply_to_list(circumference, radii)
[0.62831800000000004, 6.2831799999999998, 62.831800000000001]
- Example: apply several functions to a single value
def area(r):
    return 3.14159 * r * r
def color(r):
    return "unknown"
def apply_each(functions, value):
    result = []
    for f in functions:
        temp = f(value)
        result.append(temp)
    return result
functions = [circumference, area, color]
print apply_each(functions, 1.0)
[6.2831799999999998, 3.1415899999999999, 'unknown']
Function Attributes
- Every function has an attribute called __name__
- The name it was originally defined with
- Handy when debugging
def sedimentary(rock_name):
    return rock_name in ['sandstone', 'shale']
sed = sedimentary
print 'original name:', sedimentary.__name__
print 'name of alias:', sed.__name__
original name: sedimentary
name of alias: sedimentary
- Python uses double underscores to mark reserved names
Creating Modules
- Every Python file is automatically also a module (or library)
- Refer to its contents as geology.thing
- Just like the methods and attributes of an object
- Put this in geology.py
- And this in analysis.py
- When analysis.py runs, it prints this
Module Scope
- Each module is its own scope
- A function looks up a name in its own local variables first, then in the module it was defined in, and finally among Python's built-in names
- Put this in outer.py
- And this in inner.py
- Running outer.py produces this:
Other Ways to Import
- import geology as g, then call g.print_version()
- from geology import print_version, then call print_version()
- from geology import * imports everything from geology
- Almost always a bad idea
- The next version of the module might add a function with the same name as something you're importing from elsewhere
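The name-clash risk can be seen with two standard modules that both define `sqrt`. This is only a sketch: it uses `math` and `cmath` in place of the lecture's `geology` module, and it runs the star-imports inside a scratch dictionary (via `exec`) so that the replacement is easy to inspect. The parenthesised `print` calls let it run on Python 3.

```python
import math
import cmath

# Run two star-imports in a scratch namespace so the effect is visible.
scratch = {}
exec("from math import *", scratch)
first = scratch["sqrt"]          # math.sqrt
exec("from cmath import *", scratch)
second = scratch["sqrt"]         # now cmath.sqrt: replaced without any warning

print(first is math.sqrt)
print(second is cmath.sqrt)
print(second(-1))                # a complex result; math.sqrt(-1) raises ValueError
```

Code written against the first `sqrt` keeps running after the second import, but with different behavior for negative inputs, which is exactly the kind of surprise explicit imports avoid.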
Import Executes Statements
- import is a statement
- Executed when Python encounters it, just like any other statement
- The statements in a module are executed as it is loaded
- Assignment and def are statements
- You can use conditionals, loops, and anything else, too
- Put this in geology.py
print 'loading geology module'
def rock_type(rock_name):
    if rock_name in ['basalt', 'granite']:
        return 'igneous'
    elif rock_name in ['sandstone', 'shale']:
        return 'sedimentary'
    else:
        return 'metamorphic'
print 'geology module loaded'
- Then run analysis.py:
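A self-contained way to watch a module's statements run at import time is sketched below. It is an assumption-laden stand-in, not the lecture's analysis.py: it writes a shortened module to a temporary directory under the made-up name `geology_demo`, imports it twice, and captures what was printed. It relies on Python 3 library features (`contextlib.redirect_stdout`, `importlib.invalidate_caches`).

```python
import contextlib
import importlib
import io
import os
import sys
import tempfile

# Stand-in for the lecture's geology.py, written to a scratch directory.
# The module name 'geology_demo' is made up to avoid clashing with real modules.
SOURCE = '''print('loading geology module')
def rock_type(rock_name):
    if rock_name in ['basalt', 'granite']:
        return 'igneous'
    return 'other'
print('geology module loaded')
'''

scratch_dir = tempfile.mkdtemp()
with open(os.path.join(scratch_dir, 'geology_demo.py'), 'w') as out:
    out.write(SOURCE)
sys.path.insert(0, scratch_dir)
importlib.invalidate_caches()

captured = io.StringIO()
with contextlib.redirect_stdout(captured):
    import geology_demo      # the module's statements run here...
    import geology_demo      # ...but not again: it is already loaded

print(captured.getvalue())
print(geology_demo.rock_type('granite'))
```

The two messages appear once each even though the import statement is executed twice, because Python keeps loaded modules in a cache and reuses them.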
Knowing Who You Are
- Inside a module, __name__ is set to:
- The module's name, if it is being imported
- Or the string "__main__", if it is the main program
- Often used to include self-tests in the module
- When the module is run from the command line, the self-tests are executed
- When it's loaded by other code, the tests are skipped
def is_rock(name):
    return name in ['basalt', 'granite', 'sandstone', 'shale']
if __name__ == '__main__':
    tests = [['basalt', True], ['gingerale', False],
             [12345678, False], ['sandstone', True]]
    for (value, expected) in tests:
        actual = is_rock(value)
        if actual == expected:
            print 'pass'
        else:
            print 'fail'
$ python self_test.py
pass
pass
pass
pass
$ python
>>> import self_test
>>> self_test.is_rock('sugar')
False
The System Library
- Most commonly used library in Python is the system library sys
- Information about the Python interpreter (e.g., version number and copyright notice)
- Information about the environment (e.g., what operating system the program is running on)
- Advanced features that mere mortals should never meddle with
| Type | Name | Purpose | Example | Result |
|---|---|---|---|---|
| Data | argv | The program's command line arguments | sys.argv[0] | "myscript.py" (or whatever your program is called) |
| | maxint | Largest positive value that can be represented by Python's basic integer type | sys.maxint | 2147483647 |
| | path | List of directories that Python searches when importing modules | sys.path | ['/home/greg/pylib', '/Python24/lib', '/Python24/lib/site-packages'] |
| | platform | What type of operating system Python is running on | sys.platform | "win32" |
| | stdin | Standard input | sys.stdin.readline() | (Typically) the next line of input from the keyboard |
| | stdout | Standard output | sys.stdout.write('****') | (Typically) print four stars to the screen |
| | stderr | Standard error | sys.stderr.write('Program crashing!\n') | Print an error message to the screen |
| | version | What version of Python this is | sys.version | "2.4 (#60, Feb 9 2005, 19:03:27) [MSC v.1310 32 bit (Intel)]" |
| Function | exit | Exit from Python, returning a status code to the operating system | sys.exit(0) | Terminates program with status 0 |
Table 9.1: The Python Runtime System Library
Command-Line Arguments
- sys.argv contains the program's command-line arguments
- Program's name is always sys.argv[0]
import sys
for i in range(len(sys.argv)):
    print i, sys.argv[i]
$ python command_line.py
0 command_line.py
$ python command_line.py first second
0 command_line.py
1 first
2 second
Standard I/O
- sys.stdin and sys.stdout are standard input and output
- Normally connected to the keyboard and screen
- If you redirect, or use a pipe, the operating system connects them to files or other programs
- sys.stderr is connected to standard error
import sys
count = 0
for line in sys.stdin.readlines():
    count += 1
sys.stdout.write('read ' + str(count) + ' lines\n')
$ python standard_io.py < standard_io.py
read 7 lines
The Python Search Path
- sys.path is the list of places Python is allowed to look to find modules for import
- Initialized from the PYTHONPATH environment variable
- Directory containing the program being run is automatically put at the start of this list
- If sys.path is ['/home/swc/lib', '/Python24/lib'], then import geology will try:
- ./geology.py
- /home/swc/lib/geology.py
- /Python24/lib/geology.py
- Then fail
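The search order can be modeled in a few lines. This is a simplified sketch, not how the interpreter is implemented: the helper name `candidate_files` is made up here, and real imports also consider packages, compiled files, and built-in modules.

```python
import os.path


def candidate_files(module_name, script_dir, search_path):
    """List the files an import would try, in order (simplified model)."""
    directories = [script_dir] + list(search_path)
    return [os.path.join(d, module_name + '.py') for d in directories]


for candidate in candidate_files('geology', '.', ['/home/swc/lib', '/Python24/lib']):
    print(candidate)
```

Because the script's own directory comes first, a local file called geology.py hides any module of the same name installed elsewhere on the path.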
Exiting
- sys.exit terminates the program
- Returns an integer status code to the operating system
- 0 indicates successful execution (“zero errors”)
- Non-zero is an error code
- Yes, it's the opposite of what you'd expect…
- If you don't exit explicitly, Python returns 0
- So please use sys.exit(1) or something similar so that the operating system knows something's gone wrong
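One common way to follow this advice is to have a `main` function compute the status and pass it to `sys.exit` in one place. The sketch below is an illustration under assumptions: the function name `main`, the file-reading task, and the deliberately nonexistent path are all invented for the example.

```python
import sys


def main(args):
    """Return 0 on success, 1 if any file could not be read."""
    status = 0
    for filename in args:
        try:
            with open(filename) as source:
                source.read()
        except IOError as error:
            # Report the problem on standard error, not standard output.
            sys.stderr.write('cannot read %s: %s\n' % (filename, error))
            status = 1
    return status


# A path that is assumed not to exist, to show the failure status.
status = main(['/no/such/directory/missing.txt'])
print('status: ' + str(status))
```

A real script would end with `sys.exit(main(sys.argv[1:]))`, so that shell scripts and build tools calling it can tell success from failure.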
The Math Library
- Much of Python's standard library is just wrappers around standard C libraries
- Example: the math library
| Type | Name | Purpose | Example | Result |
|---|---|---|---|---|
| Constant | e | Constant | e | 2.71828… |
| | pi | Constant | pi | 3.14159… |
| Function | ceil | Ceiling | ceil(2.5) | 3.0 |
| | floor | Floor | floor(-2.5) | -3.0 |
| | exp | Exponential | exp(1.0) | 2.71828… |
| | log | Logarithm | log(4.0) | 1.38629… |
| | | | log(4.0, 2.0) | 2.0 |
| | log10 | Base-10 logarithm | log10(4.0) | 0.60205… |
| | pow | Power | pow(2.5, 2.0) | 6.25 |
| | sqrt | Square root | sqrt(9.0) | 3.0 |
| | cos | Cosine | cos(pi) | -1.0 |
| | asin | Arc sine | asin(-1.0) | -1.5707… |
| | hypot | Euclidean norm $\sqrt{x^2 + y^2}$ | hypot(2, 3) | 3.60555… |
| | degrees | Convert from radians to degrees | degrees(pi) | 180.0 |
| | radians | Convert from degrees to radians | radians(45) | 0.78539… |
Table 9.2: The Python Math Library
Working with the File System
- The os module is an interface between Python and the operating system
- Tries to hide the differences between different operating systems
- But there's only so much it can do
| Type | Name | Purpose | Example | Result |
|---|---|---|---|---|
| Constant | curdir | The symbolic name for the current directory. | os.curdir | . on Linux or Windows. |
| | pardir | The symbolic name for the parent directory. | os.pardir | .. on Linux or Windows. |
| | sep | The separator character used in paths. | os.sep | / on Linux, \ on Windows. |
| | linesep | The end-of-line marker used in text files. | os.linesep | \n on Linux, \r\n on Windows. |
| Function | listdir | List the contents of a directory. | os.listdir('/tmp') | The names of all the files and directories in /tmp (except . and ..). |
| | mkdir | Create a new directory. | os.mkdir('/tmp/scratch') | Make the directory /tmp/scratch. Use os.makedirs to make several directories at once. |
| | remove | Delete a file. | os.remove('/tmp/workingfile.txt') | Delete the file /tmp/workingfile.txt. |
| | rename | Rename (or move) a file or directory. | os.rename('/tmp/scratch.txt', '/home/swc/data/important.txt') | Move the file /tmp/scratch.txt to /home/swc/data/important.txt. |
| | rmdir | Remove a directory. | os.rmdir('/home/swc') | Probably not something you want to do… Use os.removedirs to remove several directories at once. |
| | stat | Get information about a file or directory. | os.stat('/home/swc/data/important.txt') | Find out when important.txt was created, how large it is, etc. |
Table 9.3: The Python Operating System Library
import sys, os
print 'initial working directory:', os.getcwd()
os.chdir(sys.argv[1])
print 'moved to:', os.getcwd()
print 'contents:', os.listdir(os.curdir)
$ python os_example.py ~/swc
initial working directory: /home/dmalfoy/swc/lec/inc/py03
moved to: /home/dmalfoy/swc
contents: ['.svn', 'conf', 'config.mk', 'data', 'depend.mk', 'thesis']
File and Directory Status
- os.stat returns an object whose members have information about a file or directory, including:
- st_size: size in bytes
- st_atime: time of most recent access
- st_mtime: time of most recent modification
import sys
import os
for filename in sys.argv[1:]:
    status = os.stat(filename)
    print filename, status.st_size, status.st_atime
$ python stat_file.py . stat_file.py
. 0 1137971715
stat_file.py 141 1137971715
Manipulating Pathnames
- os has a submodule called os.path
- Manipulate pathnames correctly and efficiently
- Do not write your own functions for this—the rules are trickier than you think
| Type | Name | Purpose | Example | Result |
|---|---|---|---|---|
| Function | abspath | Create normalized absolute pathnames. | os.path.abspath('../jeevan/bin/script.py') | /home/jeevan/bin/script.py (if executed in /home/gvwilson) |
| | basename | Return the last portion of a path (i.e., the filename, or the last directory name). | os.path.basename('/tmp/scratch/junk.data') | junk.data |
| | dirname | Return all but the last portion of a path. | os.path.dirname('/tmp/scratch/junk.data') | /tmp/scratch |
| | exists | Return True if a pathname refers to an existing file or directory. | os.path.exists('./scribble.txt') | True if there is a file called scribble.txt in the current working directory, False otherwise. |
| | getatime | Get the last access time of a file or directory (like os.stat). | os.path.getatime('.') | 1112109573 (which means that the current directory was last read or written at 10:19:33 EST on March 29, 2005). |
| | getmtime | Get the last modification time of a file or directory (like os.stat). | os.path.getmtime('.') | 1112109502 (which means that the current directory was last modified 71 seconds before the time shown above). |
| | getsize | Get the size of something in bytes (like os.stat). | os.path.getsize('py03.swc') | 29662. |
| | isabs | True if its argument is an absolute pathname. | os.path.isabs('tmp/data.txt') | False |
| | isfile | True if its argument identifies an existing file. | os.path.isfile('tmp/data.txt') | True if a file called ./tmp/data.txt exists, and False otherwise. |
| | isdir | True if its argument identifies an existing directory. | os.path.isdir('tmp') | True if the current directory has a subdirectory called tmp. |
| | join | Join pathname fragments to create a full pathname. | os.path.join('/tmp', 'scratch', 'data.txt') | "/tmp/scratch/data.txt" |
| | normpath | Normalize a pathname (i.e., remove redundant slashes, uses of . and .., etc.). | os.path.normpath('tmp/scratch/../other/file.txt') | "tmp/other/file.txt" |
| | split | Return both of the values returned by os.path.dirname and os.path.basename. | os.path.split('/tmp/scratch.dat') | ('/tmp', 'scratch.dat') |
| | splitext | Split a path into two pieces root and ext, such that ext is the last piece beginning with a ".". | os.path.splitext('/tmp/scratch.dat') | ('/tmp/scratch', '.dat') |
Table 9.4: The Python Pathname Library
import os
print 'does /home/swc exist?', os.path.exists('/home/swc')
print 'is it a directory?', os.path.isdir('/home/swc')
print 'what is its configuration directory?', os.path.join('/home/swc', 'conf')
print 'where is the configuration file?', os.path.split('/home/swc/conf/current.conf')
does /home/swc exist? True
is it a directory? True
what is its configuration directory? /home/swc\conf
where is the configuration file? ('/home/swc/conf', 'current.conf')
Summary
- The real measure of a programming language is how well it supports modularization
- Use functions, libraries, and Object-Oriented Programming to keep programs comprehensible
- Remember, you're really writing them for other human beings
Exercises
Exercise 9.1:
Write a function that takes two strings called text and
fragment as arguments, and returns the number of times
fragment appears in the second half of text. Your
function must not create a copy of the second half of
text. (Hint: read the documentation for string.count.)
Exercise 9.2:
What does the Python keyword global do?
What are some reasons not to write code that uses it?
Exercise 9.3:
Python allows you to import all the functions and variables in a
module at once, making them local names. For example, if the
module is called values, and contains a variable called
Threshold and a function called limit, then after
the statement from values import *, you can then refer
directly to Threshold and limit, rather than having
to use values.Threshold or values.limit. Explain
why this is generally considered a bad thing to do, even though it
reduces the amount programmers have to type.
Exercise 9.4:
sys.stdin, sys.stdout, and sys.stderr are
variables, which means that you can assign to them. For example,
if you want to change where print sends its output, you can
do this:
import sys
print 'this goes to stdout'
temp = sys.stdout
sys.stdout = open('temporary.txt', 'w')
print 'this goes to temporary.txt'
sys.stdout = temp
Do you think this is a good programming practice? When and why
do you think its use might be justified?
Exercise 9.5:
os.stat(path) returns an object whose members describe
various properties of the file or directory identified by
path. Using this, write a function that will determine
whether or not a file is more than one year old.
Exercise 9.6:
Write a Python program that takes as its arguments two years (such
as 1997 and 2007), and prints out the number of days between the 15th
of each month from January of the first year until December of the
last year.
Exercise 9.7:
Write a simple version of which in Python.
Your program should check each directory on the caller's path
(in order) to find an executable program that has the name given
to it on the command line.
Exercise 9.8:
In the default parameter value example, why does total
use a default value of None for end, rather than
an integer such as 0 or -1?
Exercise 9.9:
What does the * in front of the parameter extras
mean in the following code example?
def total(*extras):
    result = 0
    for e in extras:
        result += e
    return result
Hint: look at the following three examples:
print total()
print total(19)
print total(2, 3, 5)
Exercise 9.10:
Use the os.path, stat, and time modules
to write a program that finds all files in a directory whose
names end with a specific suffix, and which are more than a
certain number of days old. For example, if your program is run
as oldfiles /tmp .backup 10, it will print a list of
all files in the /tmp directory whose names end in
.backup that are more than 10 days old.
Exercise 9.11:
The Strings, Lists, and Files lecture ended by
showing several different ways to copy files using Python. Read
the documentation for the shutil module, and see if
there's a simpler way.
Exercise 9.12:
Consider the short program shown below:
def add_and_max(new_value, collection=[]):
    collection.append(new_value)
    return max(collection)
print 'first call:', add_and_max(22)
print 'second call:', add_and_max(9)
print 'third call:', add_and_max(15)
What do you expect its output to be? What is its actual
output? Why?
Style
Introduction
- Good programming style is hard to teach because the rules tend to be either banal or overly restrictive
- “Make methods short, but not too short” isn't useful
- Neither is, “No method shall be longer than sixty lines.”
- Doesn't help that the strength of programmers' opinions is inversely proportional to the amount of data they have
- But readable code is less likely to contain errors, and much easier to maintain
- This lecture presents and motivates style guidelines
- For a comprehensive guide to structuring large Unix applications, see [Spinellis 2003]
You Can Skip This Lecture If...
- You know what 7±2 has to do with readability
- You know what a docstring is, and what it should contain
- You know what traceability is
- You know how to automatically check for style violations
Reading is Learning
- People doing creative work in almost every field routinely inspect, dissect, and critique what's come before
- But most student programmers only ever read short fragments in textbooks
- Like reading sonnets, then trying to write a novel
- Knowing how to read code is as useful as knowing how to read a proof
- Have to do it in order to figure out how to make specific changes to specific programs
- A good way to learn new things
Seven Plus or Minus
- The average person's short term memory can hold 7±2 items [Hock 2004]
- Seven random digits (as in phone numbers)
- Seven tasks that still have to be done
- If we try to remember more than that, we:
- Make mistakes, or
- Create chunks so that we can remember things at a higher level
- Common chord progressions in music
- “Castled kingside” instead of the positions of five separate pieces
![[Chunking in Short-Term Memory]](./img/style/castling_chunked.png)
Figure 10.1: Chunking in Short-Term Memory
The Mind's Eye
- [Chase & Simon 1973] studied what happened when novice and master chess players were shown actual and random positions
- Masters remember actual positions better
- But they were worse at remembering random ones
- Their minds “see” patterns that aren't there
![[Actual Chess Position]](./img/style/chess_actual.png)
Figure 10.2: Actual Chess Position
![[Retention of Actual Chess Position]](./img/style/chess_actual_graph.png)
Figure 10.3: Retention of Actual Chess Position
![[Random Chess Position]](./img/style/chess_random.png)
Figure 10.4: Random Chess Position
![[Retention of Random Chess Position]](./img/style/chess_random_graph.png)
Figure 10.5: Retention of Random Chess Position
What Does This Have to Do With Programming?
- When reading and writing code, you have to keep a bunch of facts straight for a short period of time
- What do this function's parameters mean?
- What does this loop's index refer to?
- The more odds and ends readers have to keep track of, the more errors they will make
- Goal of style rules is therefore to reduce the number of things the reader has to juggle mentally
- The greater a difference is, the more likely we are to notice it
- So every semantic difference ought to be visually distinct…
- …and every difference in naming or layout ought to mean something
- Most important thing is to be consistent
- Anything consistent is readable after a while
- Just watch kids learning to read French, Punjabi, and Korean
Python Style Guide
- Taken from PEP-008: Python Style Guide
- Stick to this unless you have hard data that proves something else is better
- Basic layout
- Indent blocks using four spaces
- Keep lines less than 80 characters long
- Separate functions with two blank lines
- Separate logical chunks of long functions with a single blank line
- Put comments on lines of their own, rather than to the right of code
| Rule | Good | Bad |
|---|---|---|
| No whitespace immediately inside parentheses | max(candidates[sublist]) | max( candidates[ sublist ] ) |
| …or before the parenthesis starting indexing or slicing | | max (candidates [sublist] ) |
| No whitespace immediately before comma or colon | if limit > 0: print minimum, limit | if limit > 0 : print minimum , limit |
| Use space around arithmetic and in-place operators | x += 3 * 5 | x+=3*5 |
| No spaces when specifying default parameter values | def integrate(func, start=0.0, interval=1.0) | def integrate(func, start = 0.0, interval = 1.0) |
| Never use names that are distinguished only by "l", "1", "0", or "O" | tempo_long and tempo_init | tempo_l and tempo_1 |
| Short lower-case names for modules (i.e., files) | geology | Geology or geology_package |
| Upper case with underscores for constants | TOLERANCE or MAX_AREA | Tolerance or MaxArea |
| Camel case for class names | SingleVariableIntegrator | single_variable_integrator |
| Lowercase with underscores for function and method names | divide_region | divRegion |
| …and member variables | max_so_far | maxSoFar |
| Use is and is not when comparing to special values | if current is not None: | if current != None: |
| Use isinstance when checking types | if isinstance(current, Rock): | if type(current) == Rock: |
Table 10.1: Basic Python Style Rules
- Rules relating to classes will become clear in the Object-Oriented Programming lecture
Naming
- Names of files, classes, methods, variables, and other things are the most visible clue to purpose
- A variable called temperature shouldn't be used to store the number of pottery shards found at a dig site
- Choose names that are both meaningful and readable
- current_surface_temperature_of_probe is meaningful, but not readable
- cstp is easier to read, but hard to understand…
- …and easy to confuse with ctsp
- If you must abbreviate, be consistent
- curr_ave_temp instead of current_average_temperature is OK…
- …but only if no one else is using curnt_av_tmp
Scope and Size
- The smaller the scope of a name, the more compact it can be
- It's OK to use i and j for indices in tightly-nested for loops
- But not OK if the loop bodies are several pages long
- Of course, they shouldn't be anyway…
- The wider the scope of a name, the more descriptive it has to be
- Call a class ExperimentalRecord, rather than ER or ExpRec
The Difference It Makes
- Before:
import sys, os
import reader, splitter, transpose
a=[]
b=[]
c=[]
d=sys.argv[1]
a=reader.rdlines(d)
b=splitter.splitsec(a)
c=d.split('.')
for i in range(len(b)):
    if os.path.isfile('%s.%d.dat'%(c[0],i+1)):
        print '%s.%d.dat already exists!'%(c[0],i+1)
        break
    else:
        output=file('%s.%d.dat'%(c[0],i+1),'w')
        print>>output,transpose.txpose(b[i])
        output.close()
- After:
import sys, os
import reader, splitter, transpose
input_file_name = sys.argv[1]
lines = reader.read_lines_from_file(input_file_name)
sections = splitter.split_into_sections(lines)
file_name_stem = input_file_name.split('.')[0]
for i in range(len(sections)):
    output_file_name = '%s.%d.dat' % (file_name_stem, i+1)
    if os.path.isfile(output_file_name):
        print '%s already exists!' % output_file_name
        break
    else:
        output = file(output_file_name, 'w')
        print >> output, transpose.transpose(sections[i])
        output.close()
Function Length
- Every function should do exactly one job
- Should be able to describe that job in a single memorable sentence
- If that sentence is five phrases joined with “and”, the function should be split up
- A good way to judge size and scope is the notion of a program slice
- The subset of names in scope at a particular statement that are needed to understand what that statement does
- If the slice is much smaller than the method itself, the method is probably bloated
- A thousand one-line methods are not an improvement over one thousand-line method
What Does This Function Do?
- Key questions:
- What are the function's arguments for?
- What does it return?
- What side effects does it have?
# What's missing, and what's extra?
def diff_filelist(dir_path, manifest,
                  ignore=[os.curdir, os.pardir, '.svn']):
    def show_diff(title, diff):
        if diff:
            print title
            for d in diff:
                print '\t' + d
    expected = Set()
    inf = open(manifest, 'r')
    for line in inf:
        expected.add(line.strip())
    inf.close()
    actual = Set()
    contents = os.listdir(dir_path)
    for c in contents:
        if c not in ignore:
            actual.add(c)
    show_diff('missing:', expected - actual)
    show_diff('surplus:', actual - expected)
Ways to Answer the Question
- Read comments
- Not particularly helpful in this case
- Guess based on names
- Find the difference between two file lists?
- But dir_path suggests “directory path”
- Read the function body
- Define a helper function that displays diff if it isn't empty
- Make a set holding the lines in the manifest file
- Make a set holding the contents of the directory specified by dir_path
- Use the helper function to show the differences between these two sets
- Notice how “explaining” is the same as “creating chunks”
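For comparison, here is one way the same job might look once the key questions are answered in the code itself. It is a variant written for this note, not the original author's version: it uses the built-in `set`, returns its results instead of printing them (so it has no side effects), skips blank manifest lines, and replaces the mutable default list with a tuple. The demonstration files are created in temporary directories.

```python
import os
import tempfile


def diff_filelist(dir_path, manifest, ignore=('.svn',)):
    """Compare a directory's contents with a manifest file.

    dir_path : directory whose entries are examined
    manifest : text file listing one expected name per line
    ignore   : names to leave out of the comparison

    Returns (missing, surplus) as two sorted lists of names.
    """
    with open(manifest, 'r') as inf:
        expected = set(line.strip() for line in inf if line.strip())
    # os.listdir never reports '.' or '..', so only '.svn' needs ignoring.
    actual = set(c for c in os.listdir(dir_path) if c not in ignore)
    return sorted(expected - actual), sorted(actual - expected)


# Build a scratch directory holding a.txt and b.txt, and a manifest
# (kept in a separate directory) that expects a.txt and c.txt.
scratch = tempfile.mkdtemp()
for name in ['a.txt', 'b.txt']:
    open(os.path.join(scratch, name), 'w').close()
manifest_path = os.path.join(tempfile.mkdtemp(), 'manifest.txt')
with open(manifest_path, 'w') as out:
    out.write('a.txt\nc.txt\n')

missing, surplus = diff_filelist(scratch, manifest_path)
print('missing: ' + str(missing))
print('surplus: ' + str(surplus))
```

With the docstring in place, a reader no longer has to reconstruct the purpose of each argument from the function body.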
Other Sources of Information
- External documentation
- Calls to the function
- Use grep or “Find in Files” to search for others
- A few examples of how something is used are often sufficient
- Except when they're misleading
Idioms
- Every language (human or computer) has idioms
- E.g., “face the music” does not mean “look at the orchestra”
- Older versions of Python couldn't iterate directly over a file using for line in input:
- You can figure out what the code is doing even if you don't know the idiom…
- …but knowing the idiom makes it a lot simpler for your brain to chunk what it's reading
- Learn idioms by reading:
- Books (e.g., the “Effective” series from Addison-Wesley)
- Other people's code
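The file-reading idiom mentioned above can be put side by side with its older equivalent. This is a sketch only: it reads from an in-memory `io.StringIO` object rather than a real file so that it is self-contained, and the sample text is invented.

```python
import io

text = u'basalt\ngranite\nshale\n'

# Older style: call readline until it returns an empty string.
older = []
source = io.StringIO(text)
line = source.readline()
while line:
    older.append(line.strip())
    line = source.readline()

# Idiomatic style: let the file object hand out lines itself.
idiomatic = []
for line in io.StringIO(text):
    idiomatic.append(line.strip())

print(older == idiomatic)
```

Both loops produce the same list, but the second can be read as a single chunk: "for each line in the input".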
Style Tools
- Many tools exist to check and enforce style guidelines
- Make these tools part of your project's build
- Don't check anything into version control unless it passes a style check
- Take their warnings very seriously when debugging
- If someone is sloppy enough to make style mistakes, they're probably making serious mistakes too
Python Style Tools
- PyLint parses programs to create an abstract syntax tree
![[Abstract Syntax Tree]](./img/style/annotated_syntax_tree.png)
Figure 10.6: Abstract Syntax Tree
- Then searches the tree for matches to problem patterns
- PyChecker imports the module (or modules)
- If the code doesn't run, PyChecker can't analyze it
- Both tools allow users to:
- Customize built-in checks (e.g., specify maximum length of a function)
- Define entirely new checks
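The idea of searching a syntax tree for a problem pattern can be tried with Python's standard `ast` module. This is not how PyLint or PyChecker is implemented; it is a toy check, assuming a made-up rule ("no more than a given number of top-level statements in a function body") and a made-up helper name, `long_functions`.

```python
import ast

# Sample source to inspect; both functions are invented for the example.
SOURCE = '''
def short(x):
    return x + 1

def rambling(x):
    a = x
    b = a
    c = b
    d = c
    return d
'''


def long_functions(source, max_statements):
    """Return names of functions whose bodies exceed max_statements."""
    tree = ast.parse(source)
    too_long = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and len(node.body) > max_statements:
            too_long.append(node.name)
    return too_long


print(long_functions(SOURCE, 3))
```

Because the check works on the parsed tree rather than on raw text, it is not fooled by comments, blank lines, or unusual layout.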
Documentation
- Requirements
- What needs the software is supposed to meet
- “If more than 100 events arrive in a second, the seismograph interface must store them in a queue” tells you to look for a seismograph interface, and a queue that feeds it data
- User guide
- Actually just another way to specify requirements
- Architectural descriptions
- Architecture is what you draw on the whiteboard when explaining the program to other people
- Shows relationships between major modules, data flow, etc.
- Gives other programmers a mental map of how everything fits together
More On Documentation
- At start of functions (or classes, or methods) to explain what they're for
- “This function returns the first pair of non-overlapping subsequences that match the input pattern, or null otherwise”
- The programmer's guide to the software
- Embedded in code to explain tricky bits
- “This is a while loop, instead of a for, because it may delete items from the list as it goes”
- In general, if you need to explain the code, you ought to simplify it instead
Traceability
- Traceability is the key to reproducibility
- …which is the key to debuggability
- Where did your data come from?
- What did you do to it?
- Using which versions of which programs?
- Always include version control information in source files
- Most version control systems will update text like $Revision: 421$ when you submit changes
- The file carries its identity with it when it's printed, archived, or emailed
- Usually put the keyword inside a string, so that version information is available inside the program
Tracing Data
- Every original data file should go under version control
- Record author and date of last change as well as revision number
- Carry this information through when processing files
Embedding Documentation
- Embedded documentation is more likely to be up to date than external documentation
- Javadoc translates specially-formatted comments into HTML
- Java
/**
 * Returns the least common ancestor of two species based on DNA
 * comparison, with certainty no less than the specified threshold.
 * Note that getConcestor(X, X, t) returns X for any threshold.
 *
 * @param left one of the base species for the search
 * @param right the other base species for the search
 * @param threshold the degree of certainty required
 * @return the common ancestor, or null if none is found
 * @see Species
 */
public Species getConcestor(Species left, Species right, float threshold) {
    ...implementation...
}
- Documentation web page
getConcestor
public Species getConcestor(Species left, Species right, float threshold)
Returns the least common ancestor of two species based on DNA
comparison, with certainty no less than the specified threshold. Note
that getConcestor(X, X, t) returns X for any threshold.
Parameters:
left - one of the base species for the search
right - the other base species for the search
threshold - the degree of certainty required
Returns:
the common ancestor, or null if none is found
See Also:
Species
Docstrings
- Python uses documentation strings (or docstrings) instead of specially-formatted comments
- A string at the start of a module or function that isn't assigned to anything becomes the object's __doc__ attribute
- Unlike a comment, it's there at runtime
'''This module provides functions that search and compare genomes.
All functions assume that their input arguments are in valid CCSN-2
format; unless specifically noted, they do not modify their arguments,
print, or have other side effects.
'''
__version__ = '$Revision: 497$'
def get_concestor(left, right, threshold):
    '''Find the least common ancestor of two species.
    This function searches for a least common ancestor based on DNA
    comparison with certainty no less than the specified threshold.
    If one can be found, it is returned; otherwise, the function
    returns None. get_concestor(X, X, t) returns X for any threshold.
    left : one of the base species for the search
    right : the other base species for the search
    threshold : the degree of certainty required
    '''
    pass # implementation would go here
$ python
>>> import genome
>>> print genome.__doc__
This module provides functions that search and compare genomes.
All functions assume that their input arguments are in valid CCSN-2
format; unless specifically noted, they do not modify their arguments,
print, or have other side effects.
>>> print genome.get_concestor.__doc__
Find the least common ancestor of two species.
This function searches for a least common ancestor based on DNA
comparison with certainty no less than the specified threshold.
If one can be found, it is returned; otherwise, the function
returns None. get_concestor(X, X, t) returns X for any threshold.
left : one of the base species for the search
right : the other base species for the search
threshold : the degree of certainty required
- Python's Docutils will extract, format, and cross-reference docstrings
Summary
- Code and documentation decay over time [Eick et al 2001]
- To prevent this, must:
- Make good style a habit
- Back it up with automated checks
- Remember, we don't actually write programs for the benefit of computers
- It takes a lot of very sophisticated software to translate our programs into a form computers understand
- We write programs for other people
- Our colleagues
- Our future selves
Quality Assurance
Introduction
- The more you invest in quality, the less time it takes to develop working software [Glass 2002]
- Quality is not just testing
- “Trying to improve the quality of software by doing more testing is like trying to lose weight by weighing yourself more often.” (Steve McConnell)
- Quality is:
- Designed in
- Monitored and maintained through the whole software lifecycle
- This lecture looks at basic things every developer can do to maintain quality
You Can Skip This Lecture If...
- You know that no amount of testing can prove that software is correct
- You know what unit testing, integration testing, and regression testing are
- You know what a fixture is
- You know what an exception is, and how to raise one
- You know what test-driven design is
- You know what defensive programming is
- You know what design by contract is
Limits to Testing
- Suppose you have a function that compares two 7-digit phone numbers, and returns True if the first is greater than the second
- (10^7)^2 = 10^14 possible inputs
- At ten million tests per second, that's about 116 days
- If they're 7-character alphabetic strings (26 letters, one case), it's roughly 200,000 years
- Then you move on to the second function…
- And how do you know that your tests are correct?
- All a test can do is show that there may be a bug
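The arithmetic behind these estimates is easy to check. A minimal sketch, assuming every ordered pair of inputs is tried at ten million tests per second, and assuming a 26-letter, single-case alphabet for the string version (the alphabet size is an assumption, not something the notes state):

```python
# Exhaustive-testing estimates for a function that takes two arguments.
TESTS_PER_SECOND = 10 ** 7
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY

def seconds_to_test(values_per_argument):
    '''Time needed to try every ordered pair of argument values.'''
    pairs = values_per_argument ** 2
    return pairs / float(TESTS_PER_SECOND)

phone_days = seconds_to_test(10 ** 7) / SECONDS_PER_DAY
string_years = seconds_to_test(26 ** 7) / SECONDS_PER_YEAR

print('7-digit numbers: %.0f days' % phone_days)
print('7-letter strings: %.0f years' % string_years)
```

Changing the alphabet size or the test rate shifts the totals, but not the conclusion: exhaustive testing is out of reach even for trivial functions.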
Terminology
- A unit test exercises one component in isolation
- Developer-oriented: tests the program's internals
- An integration test exercises the whole system
- User-oriented: tests the software's overall behavior
- Regression testing is the practice of rerunning tests to check that the code still works
- I.e., make sure that today's changes haven't broken things that were working yesterday
- Programs that don't have regression tests are difficult (sometimes impossible) to maintain [Feathers 2005]
Test Results and Specifications
- Any test can have one of three outcomes:
- Pass: the actual outcome matches the expected outcome
- Fail: the actual outcome is different from what was expected
- Error: something went wrong inside the test (i.e., the test contains a bug)
- In this case, we don't learn anything about the system being tested
- A specification is something that tells you how to classify a test's result
- You can't test without some sort of specification
- Discuss ways of creating specifications below, and in The Development Process
Structuring Tests
- How to write tests so that:
- It's easy to add or change tests
- It's easy to see what's been tested, and what hasn't
- A test consists of a fixture, an action, and an expected result
- A fixture is something that a test is run on
- Can be as simple as a single value, or as complex as a networked database
- Every test should be independent
- I.e., the outcome of one test shouldn't depend on what happened in another test
- Otherwise, faults in early tests can distort the results of later ones
- So each test:
- Creates a fresh instance of the fixture
- Performs the operation
- Checks and records the result
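The fixture/action/expected-result structure can be sketched as follows. The function names (`make_fixture`, `test_sort`, `test_append`) are hypothetical and chosen only for illustration; the point is that each test builds its own fixture, so the tests give the same results in any order.

```python
def make_fixture():
    '''Build a fresh fixture so that no test sees another test's changes.'''
    return [3, 1, 2]

def test_sort():
    fixture = make_fixture()        # fresh fixture
    fixture.sort()                  # action
    return fixture == [1, 2, 3]     # compare with expected result

def test_append():
    fixture = make_fixture()        # fresh fixture, unaffected by test_sort
    fixture.append(4)               # action
    return fixture == [3, 1, 2, 4]  # compare with expected result

# Run every test and record its outcome by name.
results = {}
for test in [test_sort, test_append]:
    results[test.__name__] = test()

for name in sorted(results):
    print('%s: %s' % (name, results[name]))
```

If both tests shared one list, `test_append` would pass or fail depending on whether `test_sort` had already run.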
A Simple Example
- Test string.startswith
- Specification: returns True if the string starts with the given prefix, and False otherwise
- Hm… What if the prefix is the empty string?
- Store the tests in a table
- Easy to read and add to
Tests = [
    # String  Prefix  Expected
    ['a',     'a',    True],
    ['a',     'b',    False],
    ['abc',   'a',    True],
    ['abc',   'ab',   True],
    ['abc',   'abc',  True],
    ['abc',   'abcd', False],
    ['abc',   '',     True]
]
- String and prefix are the fixture
- Now run them
passes = 0
failures = 0
for (s, p, expected) in Tests:
    actual = s.startswith(p)
    if actual == expected:
        passes += 1
    else:
        failures += 1
print 'passed', passes, 'out of', passes+failures, 'tests'
- Hm… Where's the code to handle and report errors in the tests themselves?
Catching Errors
- Python uses exceptions for error handling
- Separates normal operation from error handling
- Makes both easier to read
- Structured like if/else
- Code for healthy case goes in a try block
- Error handling code goes in a matching except block
- When something goes wrong in the try block, Python raises an exception
- Can add an optional else block
- Executed when things don't go wrong inside the try block
Simple Exception Example
- Try dividing by zero and some non-zero values:
for num in [-1, 0, 1]:
    try:
        inverse = 1/num
    except:
        print 'inverting', num, 'caused error'
    else:
        print 'inverse of', num, 'is', inverse
inverse of -1 is -1
inverting 0 caused error
inverse of 1 is 1
![[Flow of Control in Try/Except/Else]](./img/qa/try_except_else.png)
Figure 11.1: Flow of Control in Try/Except/Else
Exception Objects
- When Python raises an exception, it creates an object to hold information about what went wrong
- Typically contains an error message
- Can choose which errors to handle by specifying an exception type in the except statement
- E.g., handle division by zero, but not out-of-bounds list index
# Note: mix of numeric and non-numeric values.
values = [0, 1, 'momentum']
# Note: top index will be out of bounds.
for i in range(4):
    try:
        print 'dividing by value', i
        x = 1.0 / values[i]
        print 'result is', x
    except ZeroDivisionError, e:
        print 'divide by zero:', e
    except IndexError, e:
        print 'index error:', e
    except:
        # Note: e here is left over from the earlier ZeroDivisionError.
        print 'some other error:', e
dividing by value 0
divide by zero: float division
dividing by value 1
result is 1.0
dividing by value 2
some other error: float division
dividing by value 3
index error: list index out of range
- The except blocks are tested in order: whichever matches first wins
- If a “naked” except appears, it must come last (since it catches everything)
- Generally better to use except Exception, e so that you have the exception object
Exception Hierarchy
- Exceptions are organized in a hierarchy
- E.g., ZeroDivisionError, OverflowError, and FloatingPointError are all types of ArithmeticError
- A handler for the general type catches all its specific sub-types
| Name | | | Purpose |
|---|---|---|---|
| Exception | | | Root of exception hierarchy |
| | ArithmeticError | | Illegal arithmetic operation |
| | | FloatingPointError | Generic error in floating point calculation |
| | | OverflowError | Result too large to represent |
| | | ZeroDivisionError | Attempt to divide by zero |
| | IndexError | | Bad index to sequence (out of bounds or illegal type) |
| | TypeError | | Illegal type (e.g., trying to add integer and string) |
| | ValueError | | Illegal value (e.g., math.sqrt(-1)) |
| | EnvironmentError | | Error interacting with the outside world |
| | | IOError | Unable to create or open file, read data, etc. |
| | | OSError | No permissions, no such device, etc. |

Table 11.1: Common Exception Types in Python
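A short sketch of how a handler for a general type catches its sub-types. The helper `classify` and the three small functions are hypothetical names used only for this illustration. It uses the `except X as e` spelling (Python 2.6 and later) rather than the older `except X, e` form shown elsewhere in these notes.

```python
def classify(operation):
    '''Run operation() and report which handler caught its exception.'''
    try:
        operation()
    except ArithmeticError as e:
        # Catches ZeroDivisionError, OverflowError, FloatingPointError.
        return 'arithmetic: ' + type(e).__name__
    except Exception as e:
        # Catches everything else derived from Exception.
        return 'other: ' + type(e).__name__
    else:
        return 'no error'

def divide_by_zero():
    return 1 / 0

def bad_index():
    return [1, 2, 3][10]

def fine():
    return 1 + 1

print(classify(divide_by_zero))
print(classify(bad_index))
print(classify(fine))
print(issubclass(ZeroDivisionError, ArithmeticError))
```

Because handlers are tried in order, the general `except Exception` clause has to come after the more specific one; reversing them would mean the `ArithmeticError` handler never runs.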
Functions and Exceptions
- Each time Python enters a try/except block, it pushes the except handlers on a stack
- Just like the function call stack
![[Stacking Exception Handlers]](./img/qa/exception_stack.png)
Figure 11.2: Stacking Exception Handlers
- When an exception is raised, Python searches this stack for the top-most matching handler
- Often means jumping out of the middle of a function
def invert(vals, index):
    try:
        vals[index] = 10.0/vals[index]
    except ArithmeticError, e:
        print 'inner exception handler:', e
def each(vals, indices):
    try:
        for i in indices:
            invert(vals, i)
    except IndexError, e:
        print 'outer exception handler:', e
# Once again, the top index will be out of bounds.
values = [-1, 0, 1]
print 'values before:', values
each(values, range(4))
print 'values after:', values
values before: [-1, 0, 1]
inner exception handler: float division
outer exception handler: list index out of range
values after: [-10.0, 0, 10.0]
Raising Exceptions
- Use raise to trigger exception processing
- Specify the type of exception you're raising using raise Exception('this is an error message')
- Please make your error messages more informative…
for i in range(4):
    try:
        if (i % 2) == 1:
            raise ValueError('index is odd')
        else:
            print 'not raising exception for %d' % i
    except ValueError, e:
        print 'caught exception for %d' % i, e
not raising exception for 0
caught exception for 1 index is odd
not raising exception for 2
caught exception for 3 index is odd
Exceptional Style
- Always use exceptions to report errors instead of returning None, -1, False, or some other value
- Allows callers to separate normal code from error handling
- And sooner or later, your function will probably actually want to return that “special” value
- Note: Python's own str.find breaks this rule
- Returns -1 if something can't be found
- Throw low, catch high
- I.e., throw lots of very specific exceptions…
- …but only catch them where you can actually take corrective action
- Because every application handles errors differently
- If someone is using your library in a GUI, you don't want to be printing to stderr
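"Throw low, catch high" might look like the sketch below. Everything in it (`ConfigError`, `parse_threshold`, `load_threshold`, the `messages` list) is hypothetical and exists only to illustrate the pattern: the low-level function raises a specific exception and never prints, while the caller decides what handling means.

```python
class ConfigError(Exception):
    '''Hypothetical, specific exception raised by the low-level code.'''
    pass

messages = []   # stands in for whatever the application uses to report errors

def parse_threshold(text):
    '''Low level: detect the problem and raise, but do not print or exit.'''
    try:
        value = float(text)
    except ValueError:
        raise ConfigError('threshold %r is not a number' % (text,))
    if not (0.0 <= value <= 1.0):
        raise ConfigError('threshold %r is outside [0, 1]' % (value,))
    return value

def load_threshold(text, default=0.5):
    '''High level: this is where corrective action is possible.'''
    try:
        return parse_threshold(text)
    except ConfigError as e:
        messages.append(str(e))   # a GUI might show a dialog box instead
        return default

print(load_threshold('0.9'))
print(load_threshold('high'))
print(messages)
```

A command-line tool and a GUI could both reuse `parse_threshold` unchanged, and differ only in what their `except ConfigError` blocks do.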
Handling Errors in Tests
- Now know how to check for errors in tests
- Wrap the test in try/except
Tests = [
    ['a',   'a', False],  # wrong expected value
    ['a',   1,   False],  # wrong type
    ['abc', 'a', True]    # everything legal
]
passes = failures = errors = 0
for (s, p, expected) in Tests:
    try:
        actual = s.startswith(p)
        if actual == expected:
            passes += 1
        else:
            failures += 1
    except:
        errors += 1
print 'tests:', passes + failures + errors
print 'passes:', passes
print 'failures:', failures
print 'errors:', errors
tests: 3
passes: 1
failures: 1
errors: 1
- Note the deliberate errors in the test cases to exercise the testing code
Test-Driven Design
- Tests are actually specifications
- “Given these inputs, this code should behave the following way”
- So write the tests first, then the application code
- Sounds backward, but:
- A great way to clarify specifications
- I write the tests
- “All” you have to do is write code that passes those tests
- Gives programmers a definite goal
- Coding is finished when all tests pass
- Particularly useful when trying to fix bugs in old code, as it forces you to figure out how to re-create the bug
- Helps prevent the “one more feature” syndrome
- Ensures that tests actually get written
- People are often too tired, or too rushed, to test after coding
- Helps clarify the Application Programming Interface (API) before it is set in stone
- If something is awkward to test, it can be redesigned before it's written
TDD Example
- I want you to write a function that calculates a running sum of the values in a list
- Doesn't specify whether to create a new list, or overwrite the input
- Doesn't specify how to handle errors
- You'd probably prefer something like this:
Tests = [
    [[],        [],         'empty list'],
    [[1],       [1],        'single value'],
    [[1, 3],    [1, 4],     'two values'],
    [[1, 3, 7], [1, 4, 11], 'three values'],
    [[-1, 1],   [-1, 0],    'negative values'],
    [[1, 3.0],  [1, 4.0],   'mixed types'],
    ["string",  ValueError, 'non-list input'],
    [['a'],     ValueError, 'non-numeric value']
]
- If the expected result is an exception, pass only if that exception is raised
- If the test doesn't pass, print the comment so that the programmer knows what to look at
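A minimal sketch of a runner for a table like this, along with one implementation that satisfies it. Both `run` and `running_sum` are illustrative names. The choice to return a new list rather than overwrite the input is an assumption, since the table itself does not settle that question.

```python
def running_sum(values):
    '''Return a new list of running totals; the input is not modified.'''
    if not isinstance(values, list):
        raise ValueError('expected a list, got %r' % (values,))
    result = []
    total = 0
    for v in values:
        if isinstance(v, bool) or not isinstance(v, (int, float)):
            raise ValueError('non-numeric value %r' % (v,))
        total += v
        result.append(total)
    return result

Tests = [
    [[],        [],         'empty list'],
    [[1],       [1],        'single value'],
    [[1, 3],    [1, 4],     'two values'],
    [[1, 3, 7], [1, 4, 11], 'three values'],
    [[-1, 1],   [-1, 0],    'negative values'],
    [[1, 3.0],  [1, 4.0],   'mixed types'],
    ["string",  ValueError, 'non-list input'],
    [['a'],     ValueError, 'non-numeric value']
]

def run(tests, func):
    '''Return the comments of all tests that func does not pass.'''
    failed = []
    for (fixture, expected, comment) in tests:
        try:
            ok = (func(fixture) == expected)
        except Exception as e:
            # Pass only if an exception class was expected and e matches it.
            ok = isinstance(expected, type) and isinstance(e, expected)
        if not ok:
            failed.append(comment)
    return failed

print(run(Tests, running_sum))
```

An empty list from `run` means every test passed; otherwise the comments say which cases to look at.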
Design by Contract
- Functions ought to carry their specifications around with them
- Keeping specification and implementation together makes both easier to understand
- And improves the odds that programmers will keep them in sync
- A function is defined by:
- Its pre-conditions: what must be true in order for the function to work correctly
- Its post-conditions: what the function guarantees will be true if its pre-conditions are met
- May also have invariants: things that are true throughout the execution of the function
- Leads to a style of programming called design by contract
- Pre- and post-conditions constrain how the function can evolve
- Can only ever relax pre-conditions (i.e., take a wider range of input)…
- …or tighten post-conditions (i.e., produce a narrower range of output)
- Tightening pre-conditions, or relaxing post-conditions, would violate the function's contract with its callers
Assertions
- Normally specify pre- and post-conditions using assertions
- A statement that something is true at a particular point in a program
- If the assertion's condition is not met, Python raises an
AssertionError exception
- For example:
- Note that the post-condition isn't as exacting as it should be
- Doesn't check that left is less than or equal to all other values, or that right is greater than or equal to them
- The code to check the condition exactly is as likely to contain errors as the function itself
- Which is one of the reasons design by contract isn't as popular as it might be
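A minimal sketch of pre- and post-conditions written as assertions, assuming a hypothetical function `find_range` that returns the smallest and largest values in a list as `left` and `right`. The function and its name are illustrative, not taken from the notes.

```python
def find_range(values):
    '''Return (left, right), the smallest and largest values in a list.'''
    # Pre-conditions: the input must be a non-empty list.
    assert isinstance(values, list), 'input must be a list'
    assert len(values) > 0, 'input must not be empty'

    left = right = values[0]
    for v in values[1:]:
        if v < left:
            left = v
        elif v > right:
            right = v

    # Post-condition (deliberately weak): the bounds are in order.
    assert left <= right, 'bounds out of order'
    return left, right

print(find_range([3, 1, 4, 1, 5]))
try:
    find_range([])
except AssertionError as e:
    print('contract violated: %s' % e)
```

A stricter post-condition would loop over `values` again to compare every element against `left` and `right`, which amounts to writing the function a second time.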
Defensive Programming
- You can (and should) use assert liberally
- Even if you don't practice design by contract
- Defensive programming is like defensive driving
- Program as if the rest of the world is out to get you
- “Fail early, fail often”
- The less distance there is between the error and you detecting it, the easier it will be to find and fix
- Good practice: every time you fix a bug, put in an assertion and a comment
- Because if you made the error, the right code can't be obvious
- And you should protect yourself against someone “simplifying” the bug back in
def can_transmute(element):
    '''Can this element be turned into gold?'''
    # Bug #172: make sure the input is actually an element.
    assert is_valid_element(element)
    # Gold is trivial.
    if element is Gold:
        return True
    # Trans-uranic metals and halogens are impossible.
    if (element.atomic_number > Uranium.atomic_number) or \
       (element in Halogens):
        return False
    # Look for a sequence of steps that leads to gold.
    steps = search_transmutations(element, Gold)
    if steps == []:
        return False
    else:
        # Bug #201: must be at least two elements in sequence.
        assert len(steps) >= 2
        return True
Summary
- The real goal of quality assurance isn't to find bugs: it's to figure out where they're coming from, so that they can be prevented
- But without testing, no one (including you) has any right to rely on the program's output
- Only way to ensure quality is to design it in
Sets, Dictionaries, and Complexity
Introduction
- The world is not made of lists
- Just because it's the first data structure you meet, doesn't make it the right one for every task
- This lecture introduces two other fundamental data structures
- Allow you to create programs that are simpler and more efficient
- Also look at what “efficient” really means
You Can Skip This Lecture If...
- You know what a set is
- You understand why a set's elements must be immutable
- You know what O-notation is
- You know what a dictionary is
Sets
- A set is an unordered collection of distinct items
- Unordered: items are looked up by value, rather than location
- Distinct: any value appears at most once
- Fundamental in mathematics, but an afterthought in most programming languages
- The set type is built in to Python 2.4 and higher
- Create a new set by calling set()
- Then insert and remove values, test for membership, etc.
vowels = set()
for char in 'aieoeiaoaaeieou':
    vowels.add(char)
print vowels
set(['a', 'i', 'e', 'u', 'o'])
Set Operations
- Like other objects, sets have methods
- Many of which can be expressed using operators as well
| Method | Purpose | Example | Result | Alternative Form |
|---|---|---|---|---|
| Example values: | ten = set(range(10)) | lows = set([0, 1, 2, 3, 4]) | odds = set([1, 3, 5, 7, 9]) | |
| add | Add an element to a set | lows.add(9) | None | lows is now set([0, 1, 2, 3, 4, 9]) |
| clear | Remove all elements from the set | lows.clear() | None | lows is now set() |
| difference | Create a set with elements that are in one set, but not the other | lows.difference(odds) | set([0, 2, 4]) | lows - odds |
| intersection | Create a set with elements that are in both arguments | lows.intersection(odds) | set([1, 3]) | lows & odds |
| issubset | Are all of one set's elements contained in another? | lows.issubset(ten) | True | lows <= ten |
| issuperset | Does one set contain all of another's elements? | lows.issuperset(odds) | False | lows >= odds |
| remove | Remove an element from a set | lows.remove(0) | None | lows is now set([1, 2, 3, 4]) |
| symmetric_difference | Create a set with elements that are in exactly one set | lows.symmetric_difference(odds) | set([0, 2, 4, 5, 7, 9]) | lows ^ odds |
| union | Create a set with elements that are in either argument | lows.union(odds) | set([0, 1, 2, 3, 4, 5, 7, 9]) | lows \| odds |

Table 12.1: Set Methods and Operators
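The equivalence between the method and operator forms can be checked directly. This sketch uses the same example values as the table; the `checks` dictionary is just a convenient way to collect the comparisons.

```python
ten = set(range(10))
lows = set([0, 1, 2, 3, 4])
odds = set([1, 3, 5, 7, 9])

# Each entry is True if the method, the operator, and the expected
# value from the table all agree.
checks = {
    'difference':
        lows.difference(odds) == lows - odds == set([0, 2, 4]),
    'intersection':
        lows.intersection(odds) == lows & odds == set([1, 3]),
    'symmetric_difference':
        lows.symmetric_difference(odds) == lows ^ odds == set([0, 2, 4, 5, 7, 9]),
    'union':
        lows.union(odds) == lows | odds == set([0, 1, 2, 3, 4, 5, 7, 9]),
    'issubset':
        lows.issubset(ten) == (lows <= ten) == True,
    'issuperset':
        lows.issuperset(odds) == (lows >= odds) == False,
}

for name in sorted(checks):
    print('%-22s %s' % (name, checks[name]))
```

One difference worth knowing: the method forms accept any iterable (for example `lows.union([5, 7])`), while the operator forms require both operands to be sets.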
Set Example
- Have several files with observations of birds
- Want to find out which species have been seen
- Program:
lines = [
    'canada goose', 'canada goose', 'long-tailed jaeger', 'canada goose',
    'snow goose', 'canada goose', 'canada goose', 'northern fulmar'
]
seen = set()
for line in lines:
    seen.add(line.strip())
for bird in seen:
    print bird
northern fulmar
snow goose
long-tailed jaeger
canada goose
- Note: for loops over the values in the set
How Set Values Are Stored
- Implementation goal is to make lookup as quick as possible
- Without making insertion and removal expensive
- Use a hash table
![[Hashing]](./img/py04/hashing.png)
Figure 12.1: Hashing
- Calculate a hash code for the object being inserted
- Store the value at that location in an array
- If the hash function is good, collisions will be rare
- When they occur, chain values together in a sub-list
- Result: looking up a value takes constant time, regardless of how many values are being stored
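The scheme described above can be sketched as a toy class. `ChainedSet` is a hypothetical name, and this is only an illustration of hashing with chained sub-lists: Python's built-in `set` handles collisions differently internally, and a real table also grows as it fills so that chains stay short.

```python
class ChainedSet(object):
    '''Toy hash set: an array of buckets, each holding a chain of values.'''

    def __init__(self, num_buckets=8):
        self.buckets = [[] for i in range(num_buckets)]

    def _chain(self, value):
        # The hash code determines which bucket the value belongs in.
        return self.buckets[hash(value) % len(self.buckets)]

    def add(self, value):
        chain = self._chain(value)
        if value not in chain:      # only this one chain is searched
            chain.append(value)

    def __contains__(self, value):
        return value in self._chain(value)

    def remove(self, value):
        self._chain(value).remove(value)

birds = ChainedSet()
for name in ['goose', 'tern', 'goose', 'hawk']:
    birds.add(name)

print('goose' in birds)
print('eider' in birds)
print(sum(len(chain) for chain in birds.buckets))
```

Lookup only ever examines one chain, which is why its cost does not depend on the total number of values as long as the chains stay short.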
Immutability
- This only works if a value's hash code never changes after it is inserted
- If it does, the value will be in the wrong place
![[Misplaced Values]](./img/py04/misplaced_values.png)
Figure 12.2: Misplaced Values
- Python therefore only allows sets to contain immutable values
- Booleans, numbers, strings, tuples…
- …but not lists
values = set()
values.add('birds')
print values
values.add(('Canada', 'goose'))
print values
values.add(['snow', 'goose'])
print values
Traceback (most recent call last):
File "mutable_in_set.py", line 8, in ?
values.add(['snow', 'goose'])
File "/usr/lib/python2.3/sets.py", line 521, in add
self._data[element] = True
TypeError: list objects are unhashable
- This is one of the reasons tuples were invented
- Allow you to store multi-part values like ("snow", "goose")
Frozen Sets
- What about sets of sets?
- Sets themselves have to be mutable, so that values can be inserted and removed
- Python also provides “frozen” sets
- No changes allowed after creation
- So they're almost always initialized from a collection of some kind
$ python
>>> birds = set()
>>> arctic = frozenset(['goose', 'tern'])
>>> birds.add(arctic)
>>> print birds
set([frozenset(['goose', 'tern'])])
>>> arctic.add('eider')
AttributeError: 'frozenset' object has no attribute 'add'
A Note on Language Design
- Many languages allow mutable elements in sets, and trust users not to modify them after insertion
- Which is a rich source of hard-to-find bugs
- Could also have values keep track of which sets they're in
- “Move” them when their values change
- Would make all programs slow for the benefit of a few
- Every software system contains tradeoffs like this
Efficiency
- So is using sets worthwhile?
- Imagine storing species names in a list instead of a set
- Each if name in seen check requires N/2 comparisons on average
- So building up those N values requires N(N+1)/4 comparisons
- Only requires N for a set
- Difference is dramatic
![[List vs. Set Performance]](./img/py04/list_vs_set.png)
Figure 12.3: List vs. Set Performance
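The quadratic growth can be seen by counting comparisons rather than timing anything. This sketch assumes the worst case, where every name is new, so each membership check scans the whole list and the total is N(N-1)/2. The function name `build_list` and the made-up species names are for illustration only.

```python
def build_list(names):
    '''Build a list of distinct names, counting equality comparisons.'''
    seen = []
    comparisons = 0
    for name in names:
        found = False
        for existing in seen:       # linear search through the list
            comparisons += 1
            if existing == name:
                found = True
                break
        if not found:
            seen.append(name)
    return seen, comparisons

for n in [100, 200, 400]:
    names = ['species-%d' % i for i in range(n)]
    seen, comparisons = build_list(names)
    print('%4d distinct names: %6d comparisons' % (n, comparisons))
```

Doubling the number of names roughly quadruples the number of comparisons, whereas a set needs about one hash lookup per name.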
Complexity Curves
- Can get better performance out of the list if we keep it sorted
- K checks is enough to find a value in a list of length 2^K
- So a list containing N values can be searched in log₂ N computational steps
- Building a list of N values therefore requires roughly N log₂ N steps
![[List vs. Set Performance Revisited]](./img/py04/logarithmic.png)
Figure 12.5: List vs. Set Performance Revisited
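A minimal sketch of keeping the list sorted, using the standard library's `bisect` module to do the binary search. The helper name `add_sorted` is hypothetical.

```python
import bisect

def add_sorted(seen, name):
    '''Insert name into the sorted list seen unless it is already there.'''
    # Binary search: about log2(N) comparisons to find the position.
    i = bisect.bisect_left(seen, name)
    if i == len(seen) or seen[i] != name:
        seen.insert(i, name)        # inserting here keeps the list sorted

seen = []
for name in ['snow goose', 'canada goose', 'northern fulmar', 'canada goose']:
    add_sorted(seen, name)

print(seen)
```

The N log₂ N figure counts comparisons only; `list.insert` also has to shift later elements to make room, so this is still not as cheap as a set.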
Algorithmic Complexity
- The relationship between problem size and running time is called algorithmic complexity
- Usually described in terms of upper bounds
- If f(x) < k·g(x) for large x and some constant k, then f(x) is O(g(x))
- For example:
- Something that takes the same time, regardless of data size, is O(1)
- If the time grows as the logarithm of the data size, it is O(log N)
- If the time is proportional to the number of values, it is O(N)
- Storing species names in a list is O(N^2)
- Because if you throw away the constant 4, the difference between N(N+1)/4 = (N^2 + N)/4 and N^2 becomes insignificant as N grows large
Motivating Dictionaries
- Suppose you want to count how often each species of bird is seen
- Can't store (name, count) in set…
- …because then you couldn't look up a species' count unless you already knew what it was
- Could fall back on lists of pairs…
- Better solution: store extra data with each element of a set
- A dictionary associates one value with each of its keys
- An unordered mutable collection
- Also called maps, hashes, and associative arrays
- Often visualized as two-column table
![[Dictionaries as Tables]](./img/py04/dict_as_table.png)
Figure 12.6: Dictionaries as Tables
Creating and Indexing
- Create a dictionary by putting key/value pairs inside {}
- E.g., {'Newton':1642, 'Darwin':1809}
- Empty dictionaries are written {}
- Look up the value associated with a key using []
birthday = {
'Newton' : 1642,
'Darwin' : 1809
}
print "Darwin's birthday:", birthday['Darwin']
print "Newton's birthday:", birthday['Newton']
Darwin's birthday: 1809
Newton's birthday: 1642
- Can only access keys that are present
- Just as you can't index elements of a list that aren't there
birthday = {
'Newton' : 1642,
'Darwin' : 1809
}
print birthday['Turing']
Traceback (most recent call last):
File "key_error.py", line 5, in ?
print birthday['Turing']
KeyError: 'Turing'
Updating Dictionaries
- Assigning to a dictionary key:
- Creates a new entry if the key is not already in dictionary
- Overwrites the previous value if the key is already present
birthday = {}
birthday['Darwin'] = 1809
birthday['Newton'] = 1942 # oops
birthday['Newton'] = 1642
print birthday
{'Darwin': 1809, 'Newton': 1642}
- Remove an entry using del d[k]
- Can only remove entries that are actually present
birthday = {
'Newton' : 1642,
'Darwin' : 1809,
'Turing' : 1912
}
print 'Before deleting Turing:', birthday
del birthday['Turing']
print 'After deleting Turing:', birthday
del birthday['Faraday']
print 'After deleting Faraday:', birthday
Before deleting Turing: {'Turing': 1912, 'Newton': 1642, 'Darwin': 1809}
After deleting Turing: {'Newton': 1642, 'Darwin': 1809}
Traceback (most recent call last):
File "dict_del.py", line 10, in ?
del birthday['Faraday']
KeyError: 'Faraday'
Membership and Loops
- Test whether a key k is in a dictionary d using k in d
- Once again, inconsistent with behavior of lists, but useful
birthday = {
'Newton' : 1642,
'Darwin' : 1809
}
for name in ['Newton', 'Turing']:
    if name in birthday:
        print name, birthday[name]
    else:
        print 'Who is', name, '?'
Newton 1642
Who is Turing ?
- for k in d loops over the dictionary's keys (rather than its values)
- Different from lists, where for loops over the values, rather than indices
birthday = {
'Newton' : 1642,
'Darwin' : 1809,
'Turing' : 1912
}
for name in birthday:
    print name, birthday[name]
Turing 1912
Newton 1642
Darwin 1809
Dictionary Methods
- Yes, dictionaries are objects too…
| Method | Purpose | Example | Result |
|---|---|---|---|
| clear | Empty the dictionary. | d.clear() | Returns None, but d is now empty. |
| get | Return the value associated with a key, or a default value if the key is not present. | d.get('x', 99) | Returns d['x'] if "x" is in d, or 99 if it is not. |
| keys | Return the dictionary's keys as a list. Entries are guaranteed to be unique. | birthday.keys() | ['Turing', 'Newton', 'Darwin'] |
| items | Return a list of (key, value) pairs. | birthday.items() | [('Turing', 1912), ('Newton', 1642), ('Darwin', 1809)] |
| values | Return the dictionary's values as a list. Entries may or may not be unique. | birthday.values() | [1912, 1642, 1809] |
| update | Copy keys and values from one dictionary into another. | See the example below. | |

Table 12.2: Dictionary Methods in Python
- Example:
birthday = {
'Newton' : 1642,
'Darwin' : 1809,
'Turing' : 1912
}
print 'keys:', birthday.keys()
print 'values:', birthday.values()
print 'items:', birthday.items()
print 'get:', birthday.get('Curie', 1867)
temp = {
'Curie' : 1867,
'Hopper' : 1906,
'Franklin' : 1920
}
birthday.update(temp)
print 'after update:', birthday
birthday.clear()
print 'after clear:', birthday
keys: ['Turing', 'Newton', 'Darwin']
values: [1912, 1642, 1809]
items: [('Turing', 1912), ('Newton', 1642), ('Darwin', 1809)]
get: 1867
after update: {'Curie': 1867, 'Darwin': 1809, 'Franklin': 1920, 'Turing': 1912, 'Newton': 1642, 'Hopper': 1906}
after clear: {}
Counting Frequency
- So, back to our birds…
- Use species names as keys in a dictionary
- The value associated with each key is the number of times it has been seen so far
# Data to count.
names = ['tern','goose','goose','hawk','tern','goose', 'tern']
# Build a dictionary of frequencies.
freq = {}
for name in names:
    # Already seen, so increment count by one.
    if name in freq:
        freq[name] = freq[name] + 1
    # Never seen before, so add to dictionary.
    else:
        freq[name] = 1
# Display.
print freq
{'goose': 3, 'tern': 3, 'hawk': 1}
A Slight Simplification
- Can simplify this code using dict.get
- Get either the count associated with the key, or 0, then add one to it
freq = {}
for name in names:
    freq[name] = freq.get(name, 0) + 1
print freq
{'goose': 3, 'tern': 3, 'hawk': 1}
Imposing Order
- A dictionary's keys are unordered (just like the elements in a set)
- Remember, we deliberately randomize (hash) in order to make lookup fast
- So, to print counts in alphabetic order:
- Get the list of keys
- Sort that list
- Loop over it
keys = freq.keys()
keys.sort()
for k in keys:
    print k, freq[k]
goose 3
hawk 1
tern 3
Inverting a Dictionary
- But how to print in order of frequency?
- Need to invert the dictionary
- I.e., swap the keys and values
- But there might be collisions, since values aren't guaranteed to be unique
- What is the inverse of {'a':1, 'b':1, 'c':1}?
- Solution: store a list of values instead of just a single value
- Use dict.get(key, []) instead of dict.get(key, 0)
inverse = {}
for (key, value) in freq.items():
    seen = inverse.get(value, [])
    seen.append(key)
    inverse[value] = seen
keys = inverse.keys()
keys.sort()
for k in keys:
    print k, inverse[k]
1 ['hawk']
3 ['goose', 'tern']
![[Inverting a Dictionary]](./img/py04/invert_dict.png)
Figure 12.7: Inverting a Dictionary
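The same inversion can be written a little more compactly with `dict.setdefault`, which returns the value stored under a key and inserts a default first if the key is missing. This is offered as an alternative sketch, not as the approach the notes take. The `freq` values are copied from the earlier example, and the loop prints the most frequently seen species first.

```python
freq = {'goose': 3, 'tern': 3, 'hawk': 1}

# setdefault returns the list stored under count, creating it if needed,
# so the new name can be appended in the same statement.
inverse = {}
for (name, count) in freq.items():
    inverse.setdefault(count, []).append(name)

# Highest counts first; names are sorted so the output is predictable.
for count in sorted(inverse, reverse=True):
    print('%d %s' % (count, sorted(inverse[count])))
```

Sorting the names inside each entry matters because dictionaries give no guarantee about the order in which `items()` returns its pairs.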
Another Way to Do It
Formatting Strings with Dictionaries
- Complex string formatting can be hard to understand
- Especially if one value needs to be used several times
- Instead of a tuple, "%" can take a dictionary as its right argument
- Use "%(varname)s" inside the format string to identify what's to be substituted
birthday = {
'Newton' : 1642,
'Darwin' : 1809,
'Turing' : 1912
}
entry = '%(name)s: %(year)s'
for (name, year) in birthday.items():
    temp = {'name' : name, 'year' : year}
    print entry % temp
Turing: 1912
Newton: 1642
Darwin: 1809
Extra Keyword Arguments
- Consider this example:
def settings(title, **kwargs):
    print 'title:', title
    for key in kwargs:
        print '  %s: %s' % (key, kwargs[key])
settings('nothing extra')
settings('colors', red=0.0, green=0.5, blue=1.0)
title: nothing extra
title: colors
blue: 1.0
green: 0.5
red: 0.0
- The ** in front of kwargs means “Put any extra keyword arguments in a dictionary, and assign it to kwargs”
- Allows you to create functions that can handle arbitrary arguments
Extra Positional Arguments
- Can do something similar with extra positional (unnamed) arguments
def sum(*values):
    result = 0.0
    for v in values:
        result += v
    return result
print "no values:", sum()
print "single value:", sum(3)
print "five values:", sum(3, 4, 5, 6, 7)
no values: 0.0
single value: 3.0
five values: 25.0
- The single * in front of values means “Put any extra unnamed arguments in a tuple, and assign it to values”
- Can have at most one * argument per function
- Question: what does ** mean? How and why would you use it?
Summary
- The world isn't made of lists
- Other basic data structures can make your programs much simpler, and much more efficient
- And learning a few advanced features of whatever language you're using can do the same
Debugging
Introduction
- You're going to spend half your professional life debugging
- So you should learn how to do it systematically
- Talk about tools first
- They'll make everything else less painful
- Then some techniques
- Assume for now that you built the right thing the wrong way
- Requirements errors are actually a major cause of software project failure
- But out of scope for this course
You Can Skip This Lecture If...
- You know how to use a symbolic debugger
- Just knowing what one is doesn't count
- You know what a breakpoint is
- You know what conditional breakpoints are good for
- You know what logging is
- You follow Agans' Rules
What's Wrong with Print Statements
- Many people still debug by adding print statements to their programs
- It's error-prone
- Adding print statements is a good way to add typos
- Particularly when you have to modify the block structure of your program
- And time-consuming
- All that typing…
- And (if you're using Java, C++, or Fortran) all that recompiling…
- And can be misleading
- Moves things around in memory, changes execution timing, etc.
- Common for bugs to hide when print statements are added, and reappear when they're removed
Symbolic Debuggers
- A debugger is a program that runs another program on your behalf
- Sometimes called a symbolic debugger because it shows you the source code you wrote, rather than raw machine code
- While the target program (or debuggee) is running, the debugger can:
- Pause, resume, or restart the target
- Display or change values
- Watch for calls to particular functions, changes to particular variables, etc.
- Do not need to modify the source of the target program!
- Depending on your language, you may need to compile it with different flags
- And yes, the debugger modifies the target's layout in memory, and execution speed…
- …but a lot less than print statements…
- …with a lot less effort from you
Debugger Features
- Interactive debuggers typically show:
- The source code
- The call stack
- The values of variables that are currently in scope
- I.e., global variables, parameters to the current function call, and local variables in that function
- A panel displaying what your program has printed to standard output and/or standard error
- We'll use WingIDE in this lecture
![[A Debugger in Action]](./img/debugging/debugger_in_action.png)
Figure 13.1: A Debugger in Action
Kinds of Debuggers
- There may be several ways to get into the debugger
- Launch the debugger, load the target program, and start work
- Run the debugger with the target program as a command-line argument
- Switch into debugging mode in the middle of an interactive session
- Sometimes also do post mortem debugging
- When a program fails badly, it creates a core dump
- Copies all of its internal state to a file on disk
- Load that dump into the debugger, and see where the program was when it terminated
- Not as good as watching it run…
- …but sometimes the best you can do
Integrated Development Environments
- Debuggers are usually part of integrated development environments (IDEs)
- These usually contain many other tools as well, including:
- A class browser that presents an outline of the project's modules, classes, functions, variables, etc.
![[Source Browser]](./img/debugging/source_browser.png)
Figure 13.2: Source Browser
- A code assistant that presents context-sensitive help and documentation
![[Code Assistant]](./img/debugging/code_assistant.png)
Figure 13.3: Code Assistant
- Tools like this are available for every modern language
Command-Line Debuggers
- Many of today's debuggers are GUIs wrapped around older command-line debuggers
- Most widely used of these is GDB
- Supports many languages, on many platforms
- But no one ever said it was easy to learn
- Python comes with a simple debugger called pdb
- Can be invoked by calling pdb.set_trace() inside a program
import pdb
base = "Na"
pdb.set_trace()
acid = "Cl"
salt = base + acid
print salt
$ python lec/inc/debugging/set_trace.py
> /swc/lec/inc/debugging/set_trace.py(7)?()
-> acid = "Cl"
(Pdb) n
> /swc/lec/inc/debugging/set_trace.py(8)?()
-> salt = base + acid
(Pdb) n
> /swc/lec/inc/debugging/set_trace.py(9)?()
-> print salt
(Pdb) n
NaCl
--Return--
Inspecting Values
- Use the debugger to set breakpoints in the target program
- Tells the target program to pause when it reaches particular lines of code
- When the target program is paused, the debugger can display the contents of its memory
![[Inspecting Values]](./img/debugging/inspecting_values.png)
Figure 13.4: Inspecting Values
- Most debuggers can also evaluate expressions using the current values of variables
- E.g., type in 2*x<0, debugger displays False
Controlling Execution
- The debugger can also control how the target program runs, e.g., execute it one statement at a time
- This allows you to see:
- How values are changing
- Which branches the program is actually taking
- Which functions are actually being called
- Debuggers really should be called “inspectors”
Under the Hood
- Programs are “just” data
![[Programs As Data]](./img/debugging/programs_as_data.png)
Figure 13.5: Programs As Data
- Some bytes hold values that represent instructions
- Each statement in the source program typically corresponds to several instructions
- The compiler (or interpreter) keeps track of which lines of code produced which instructions
- Other bytes hold constants and variables
- The static space contains constant strings, magic numbers, etc.
- The call stack holds function parameters and local variables
- The heap is all dynamically-created objects
- Some registers
- An instruction pointer that keeps track of what to execute next
- A stack pointer to keep track of the call stack
- Miscellaneous other registers for doing arithmetic, etc.
Implementing Breakpoints
- To set a breakpoint on a particular line, the debugger:
- Figures out which instructions were produced for the statement on that line
- Copies the first of those instructions to a safe place
- Replaces it with a HALT instruction
![[Creating a Breakpoint]](./img/debugging/setting_breakpoint.png)
Figure 13.6: Creating a Breakpoint
- When the target program reaches the HALT instruction, it signals the debugger
- Which can then inspect the target program's memory
- When the user wants the program to resume, the debugger:
- Puts the instruction at the breakpoint location back in place
- Tells the program to execute that instruction
- Replaces the instruction with a HALT once again
- Tells the program to run normally
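The steps above can be sketched with a toy model in which the "program" is just a list of instruction strings. This is only an illustration of the save, overwrite, and restore cycle. ToyDebugger and its instruction names are hypothetical, and real debuggers work on machine instructions with help from the operating system.

```python
HALT = 'HALT'

class ToyDebugger(object):
    '''Toy model of breakpoints; the program is a list of strings.'''

    def __init__(self, program):
        self.program = program
        self.saved = {}        # address -> original instruction
        self.executed = []     # instructions actually run, in order
        self.pc = 0            # instruction pointer

    def set_breakpoint(self, address):
        # Copy the original instruction to a safe place, then overwrite it.
        self.saved[address] = self.program[address]
        self.program[address] = HALT

    def run(self):
        '''Run until a HALT is reached (return its address)
        or the program ends (return None).'''
        while self.pc < len(self.program):
            if self.program[self.pc] == HALT:
                return self.pc
            self.executed.append(self.program[self.pc])
            self.pc += 1
        return None

    def resume(self):
        '''Step past the breakpoint at self.pc, re-arm it, keep running.'''
        address = self.pc
        self.program[address] = self.saved[address]   # put instruction back
        self.executed.append(self.program[address])   # execute just that one
        self.pc += 1
        self.program[address] = HALT                  # HALT once again
        return self.run()                             # run normally

dbg = ToyDebugger(['load x', 'add 1', 'store x', 'print x'])
dbg.set_breakpoint(2)
stopped_at = dbg.run()             # pauses before 'store x'
at_pause = list(dbg.executed)      # what had run when the program paused
finished = dbg.resume()            # runs to the end of the program
```

Because the HALT is put back after the single step, the breakpoint would fire again if the program looped back to the same address.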
Inspecting More Values
- Debugger lets you move up and down the call stack
- Allows you to run to the problem, then figure out how you got there
- Also lets you modify values
- If you have a theory about why a bug is occurring:
- Run to that point
- Set variables' values (e.g., set max_temp to -1)
- Resume execution
- Sometimes used to test error handling code
- Easier to change time_spent_waiting to 600 seconds in debugger than to pull out the network cable and wait…
Conditional Breakpoints and Watchpoints
- The program only stops at a conditional breakpoint if some condition is met
- E.g., loop index greater than 100, or filename argument is None
- Much more efficient than single-stepping from the start of the program
- Some debuggers also support watchpoints
- Have the debugger watch every write to memory
- Halt when anything, anywhere, modifies a particular variable
- Slow the program down a lot
- Typically a factor of 100 or more
- But sometimes the only practical way to find out when a particular list value is being overwritten
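A rough sketch of how a watchpoint on a local variable might be approximated in pure Python, using the standard sys.settrace hook. It re-checks the variable before every line of the watched function, which hints at why watchpoints slow programs down so much. The names watch_local and accumulate are hypothetical, and real debuggers use different (often hardware-assisted) mechanisms.

```python
import sys

def watch_local(func, var_name):
    '''Call func() and return a list of (line_number, value) pairs,
    one for each time the local variable var_name is seen to change.'''
    changes = []
    unset = object()
    last_seen = [unset]

    def tracer(frame, event, arg):
        if frame.f_code is not func.__code__:
            return None                  # do not trace other functions
        if event in ('line', 'return'):
            current = frame.f_locals.get(var_name, unset)
            if current is not unset and (last_seen[0] is unset
                                         or current != last_seen[0]):
                changes.append((frame.f_lineno, current))
                last_seen[0] = current
        return tracer

    sys.settrace(tracer)
    try:
        func()
    finally:
        sys.settrace(None)               # always switch tracing off again
    return changes

def accumulate():
    # Hypothetical target function.
    total = 0
    for i in range(3):
        total += i
    return total

changes = watch_local(accumulate, 'total')
values = [value for (line, value) in changes]   # successive values of total
```

A real watchpoint would pause the program at each change instead of recording it. A conditional breakpoint is the same idea with a test such as "total > 100" in place of "has the value changed?".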
Logging
- Sometimes printing is the right thing to do
- Collecting information for later analysis (e.g., web server logs)
- Not expecting anything to go wrong, but want to be able to trace execution leading up to fault if one does occur
- Many systems use logging to record information in a structured, manageable way
- Separate different levels of information
- Debugging vs. warning vs. critical
- Separate information about different things
- Send information to different destinations
- Files vs. database vs. sys admin's pager
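A minimal sketch of these three kinds of separation, using Python's standard logging module. The logger names (demo.database, demo.network) are hypothetical, and two in-memory streams stand in for real destinations such as a file and a pager.

```python
import io
import logging

# Two in-memory "destinations" standing in for, say, a log file and a pager.
everything = io.StringIO()
urgent = io.StringIO()

formatter = logging.Formatter('%(levelname)s %(name)s: %(message)s')

all_handler = logging.StreamHandler(everything)
all_handler.setLevel(logging.DEBUG)       # this destination gets every level
all_handler.setFormatter(formatter)

urgent_handler = logging.StreamHandler(urgent)
urgent_handler.setLevel(logging.ERROR)    # this one gets ERROR and above only
urgent_handler.setFormatter(formatter)

# One logger per subsystem keeps information about different things separate.
db_log = logging.getLogger('demo.database')
net_log = logging.getLogger('demo.network')
for log in (db_log, net_log):
    log.setLevel(logging.DEBUG)
    log.propagate = False    # keep this demo's messages out of the root logger
    log.addHandler(all_handler)
    log.addHandler(urgent_handler)

db_log.debug('opened connection')
net_log.warning('slow response')
db_log.error('query failed')
```

After this runs, everything holds all three messages, while urgent holds only the ERROR line.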
Logging Levels
- Every system is different, but the following are fairly standard
- DEBUG: only want to see it when debugging a problem
- Be careful not to leave anything in a released product that you don't want customers to be able to turn on
- INFO: information about normal operations
- The sort of thing that goes into a web server log
- WARNING: something that a human being should pay attention to
- E.g., failed login attempt, or site not found
- ERROR: something has gone wrong inside the software
- CRITICAL: something has gone very wrong inside the software
- System is about to crash, reactor is about to melt down, etc.
Logging Example
- Want to store WARNING-level messages and above in a file
- Format the date as year-month-day hour:minutes:seconds
- Automatically display the logging level as well
import logging
# Example values: datafile and machine_name are hypothetical, the rest match the output below.
datafile = 'spells.dat'
machine_name = 'server01'
user_id = 'hpotter'
villain = 'dmalfoy'
spell_id = 172
curse = 'Confusius'
logging.basicConfig(level=logging.WARNING,
                    format='%(asctime)s %(levelname)s %(message)s',
                    datefmt='%Y-%b-%d %H:%M:%S',
                    filename='logging_example.out',
                    filemode='w')
logging.debug('Last file opened: %s', datafile)
logging.info('User %s logged in normally on %s', user_id, machine_name)
logging.warning('%s attempted to log in as %s', villain, user_id)
logging.error('No such spell (spell ID %04d)', spell_id)
logging.critical('Failed to cast %s', curse)
2006-Feb-02 16:19:02 WARNING dmalfoy attempted to log in as hpotter
2006-Feb-02 16:19:02 ERROR No such spell (spell ID 0172)
2006-Feb-02 16:19:02 CRITICAL Failed to cast Confusius
- Only messages at the requested level or higher appear
- And all the normal string formatting operations work
Agans' Rules
- Many people make debugging harder than it needs to be by:
- Using inadequate tools
- Not going about it systematically
- Becoming impatient
- Agans' Rules [Agans 2002] describe how to apply the scientific method to debugging
- Observe a failure
- Invent a hypothesis explaining the cause
- Test the hypothesis by running an experiment (i.e., a test)
- Repeat until the bug has been found
Rule 0: Get It Right the First Time
- The simplest bugs to fix are the ones that don't exist
- Design, reflect, discuss, then code
- “A week of hard work can sometimes save you an hour of thought.”
- Design and build your code with testing and debugging in mind
- Minimize the amount of “spooky action at a distance”
- Minimize the number of things programmers have to keep track of at any one time
- Train yourself to do things right, so that you'll code well even when you're tired, stressed, and facing a deadline
- “Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it?” (Brian Kernighan)
Rule 1: What Is It Supposed to Do?
- First step is knowing what the problem is
- “It doesn't work” isn't good enough
- What exactly is going wrong?
- How do you know?
- You will learn a lot by following execution in a debugger and trying to anticipate what the program is going to do next
- Requires you to know how the software is supposed to behave
- Is this case covered by the specification?
- If not:
- Do you have enough knowledge to extrapolate?
- Do you have the right to do so?
- Try not to let what you want to see influence what you actually observe
Rule 2: Is It Plugged In?
- Are you actually exercising the problem that you think you are?
- Are you giving it the right test data?
- Is it configured the way you think it is?
- Is it the version you think it is?
- Has the feature actually been implemented yet?
- Why are you sure?
- Maybe the reason you can't isolate the problem is that it's not there
- Another argument in favor of automatic regression tests
- Guaranteed to rerun the test the same way each time
- Also a good argument against automatic regression tests
- If the test is wrong, it will generate the same misleading result each time
Rule 3: Make It Fail
- You can only debug things when they go wrong
- So find a test case that makes the code fail every time
- Then try to find a simpler one
- Or start with a trivially simple test case that passes, then add complexity until it fails
- Each experiment becomes a test case
- So that you can re-run all of them with a single command
- How else are you going to know that the bug has actually been fixed?
- Use the scientific method
- Formulate a hypothesis, make a prediction, conduct an experiment, repeat
- Remember, it's computer science, not computer flip-a-coin
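A small sketch of "each experiment becomes a test case", using the standard unittest module. The function average and the bug it once had (crashing on an empty list) are invented for illustration; the point is that both the simple passing case and the simplest failing case found are kept and can be re-run together.

```python
import unittest

def average(values):
    # Hypothetical function under investigation. The guard below is the
    # fix; before it was added, an empty list caused a ZeroDivisionError.
    if not values:
        return 0.0
    return sum(values) / float(len(values))

class TestAverage(unittest.TestCase):

    def test_simple_case_that_always_passed(self):
        self.assertEqual(average([1, 2, 3]), 2.0)

    def test_empty_list_used_to_fail(self):
        # The simplest input found that reproduced the failure.
        self.assertEqual(average([]), 0.0)

def run_all():
    '''Re-run every recorded experiment with a single call.'''
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestAverage)
    return unittest.TextTestRunner(verbosity=0).run(suite)

result = run_all()
```

If result.wasSuccessful() is true after a change, every recorded experiment still behaves as expected, which is how you know the bug has stayed fixed.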
Alternatives
- What if you can't make it fail reliably?
- Problem involves timing, network load, etc.
- Or you just don't know enough about the cause yet
- Use post-mortem inspection
- But then you have to reason backwards to figure out why the program crashed
- Or logging
- But this can distort the program's behavior
- And you'll have to wade through a lot of irrelevant information
Rule 4: Divide and Conquer
- The smaller the gap between cause and effect, the easier the relationship is to see
- So once you have a test that makes the system fail, use it to isolate the faulty subsystem
- Examine the input of the code that's failing
- If that's wrong, look at the preceding code's input, and so on
- Use assert to check things that ought to be right
- “Fail early, fail often”
- A good way to stop yourself from introducing new bugs as you fix old ones
- When you do fix the bug, see whether you can add assertions to prevent it reappearing
- If you made the mistake once, odds are that you, or someone, will make it again
- Another argument against duplicated code
- Few things are as frustrating as fixing a bug, only to have it crop up again elsewhere
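A minimal sketch of using assert to narrow the gap between cause and effect. The function mean_temperature and its rules (at least one reading, none below zero kelvin) are assumptions made up for this example.

```python
def mean_temperature(readings):
    '''Average a list of temperature readings in kelvin.'''
    # Check the input here, so that a bad value fails at once instead of
    # producing a wrong answer somewhere further downstream.
    assert len(readings) > 0, 'no readings supplied'
    assert all(r >= 0.0 for r in readings), 'negative kelvin temperature'
    return sum(readings) / float(len(readings))

ok = mean_temperature([280.0, 290.0])

try:
    mean_temperature([])
    failed_early = False
except AssertionError:
    failed_early = True
```

With the assertions in place, an empty list is reported at the point where it is first used, rather than surfacing later as a division by zero.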
Rule 5: Change One Thing at a Time, For a Reason
- Replacing random chunks of code is unlikely to do much good
- If you got it wrong the first time, what makes you think you'll get it right the second? Or the ninth?
- So always have a hypothesis before making a change
- Every time you make a change, re-run all of your tests immediately
- The more things you change at once, the harder it is to know what's responsible for what
- And the harder it is to keep track of what you've done, and what effect it had
- Changes can also often uncover (or introduce) new bugs
Rule 6: Write It Down
- Science works because scientists keep records
- “Did left followed by right with an odd number of lines cause the crash? Or was it right followed by left? Or was I using an even number of lines?”
- Records particularly useful when getting help
- People are more likely to listen when you can explain clearly what you did
Rule 7: Be Humble
- If you can't find it in 15 minutes, ask for help
- Just explaining the problem aloud is often enough
- “Never debug standing up.” (Gerald Weinberg)
- Don't keep telling yourself why it should work: if it doesn't, it doesn't
- Never debug while grinding your teeth, either…
- Keep track of your mistakes
- Just as runners keep track of their time for the 100 meter sprint
- “You cannot manage what you cannot measure.” (Bill Hewlett)
- And read [Zeller 2006] to learn more
Summary
- Debugging is not a black art
- Like medical diagnosis, it's a skill that can be studied and improved
- You're going to spend a lot of time doing it: you might as well learn how to do it well
Object-Oriented Programming
Introduction
- Suppose you want to simulate a small ecosystem, such as a tidal pool, that contains many different kinds of things
- Plants (don't move)
- Fish (swim in three dimensions)
- Crawly things (cling to the surface most of the time)
- The procedural way to do it uses a type-switch
for time in simulation_period:
    for thing in world:
        if type(thing) is plant:
            update_plant(thing, time)
        elif type(thing) is fish:
            update_fish(thing, time)
        elif type(thing) is creepy_crawly:
            update_creepy_crawly(thing, time)
        # ...
- But:
- Every time you add a new type of thing, you have to find and update all the type-switches
- It's very easy to make a cut-and-paste mistake
Objects to the Rescue
- Object-oriented programming (OOP) solves both problems
- Seems like a small change, but it allows programmers to think and design at a higher level
- Also allows them to make more powerful mistakes…
- Take two lectures to introduce OOP
- Ideas apply to all modern languages
- But there's more variation in form than there is with loops, conditionals, and functions
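A sketch of what the object-oriented version of the tidal pool loop might look like. Each class supplies its own update method, so the loop needs no type-switch, and adding a new kind of thing means adding one class. The class names and the strings returned are illustrative assumptions only.

```python
class Plant(object):
    def update(self, time):
        return 'plant grows at t=%d' % time

class Fish(object):
    def update(self, time):
        return 'fish swims at t=%d' % time

class CreepyCrawly(object):
    def update(self, time):
        return 'crawler clings at t=%d' % time

def simulate(world, simulation_period):
    '''Update every thing at every time step; no type-switch needed.'''
    events = []
    for time in simulation_period:
        for thing in world:
            events.append(thing.update(time))
    return events

events = simulate([Plant(), Fish(), CreepyCrawly()], range(2))
```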
You Can Skip This Lecture If...
- You know what a class is
- You know what methods and member variables are
- You know what encapsulation, inheritance, and polymorphism are
Abstract Data Types
- Modern languages encourage programmers to define abstract data types (ADTs)
- “Abstract” because they hide the details of their implementation
- Programmers interact with them through a limited set of operations, rather than by manipulating data directly
- Fewer things can go wrong
- Easier to read resulting code
- Makes code easier to maintain, since internals can be changed without changing calling code
Classes and Instances
- An ADT is usually created by defining a class that specifies:
- How the ADT stores state (its member variables)
- What it can do (its methods)
- And yes, classes are also objects, just like functions
- Programmers then create objects that are instances of that class
- Each object of a particular ADT shares the class's methods, but has its own members
- So changes to one object do not affect the state of others
![[Memory Model for Classes and Objects]](./img/oop01/classes_and_objects.png)
Figure 14.1: Memory Model for Classes and Objects
Defining a Class
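A minimal sketch of a class definition, assuming that the Empty class used in the next section has no members or methods of its own:

```python
class Empty(object):
    '''A class with no members or methods of its own.'''
    pass

first = Empty()
second = Empty()
```

The keyword class introduces the definition, object is the parent class, and pass is a placeholder for an empty body. Each call to Empty() builds a distinct object.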
Creating an Instance
- Create a new instance of the class by calling the class name as if it were a function
if __name__ == '__main__':
    first = Empty()
    second = Empty()
    print 'first has id', id(first)
    print 'second has id', id(second)
first has id 5086860
second has id 5086892
- id returns a unique identifier for the object (in CPython, its memory address)
- Doesn't mean anything: just distinguishes objects
- Note how main body of program is put in a block under if __name__ == '__main__':
- Otherwise, it will be executed when other programs import the class
Methods
- Give the class methods by defining functions inside it
- The object itself is always passed to the method as its first argument
- Universally called self
- Unlike this in C++ and Java, the name is just a convention
- But everyone uses it, and you should too
- object.method(argument) is equivalent to:
- Find the class C that object is an instance of
- Call C.method(object, argument)
class Greeting(object):
    def say(self, name):
        print 'Hello, %s!' % name
if __name__ == '__main__':
    greet = Greeting()
    greet.say('object')
Hello, object!
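The equivalence between object.method(argument) and C.method(object, argument) can be checked directly. In this sketch, text is a hypothetical variant of say that returns the string instead of printing it, so that the two spellings can be compared.

```python
class Greeting(object):
    def text(self, name):
        # Hypothetical variant of say() that returns its message.
        return 'Hello, %s!' % name

greet = Greeting()
via_object = greet.text('object')             # the usual spelling
via_class = Greeting.text(greet, 'object')    # what it is equivalent to
```

Both calls run the same function with greet bound to self, so they produce the same result.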
Creating Members
- Every object is a new scope for variable names
- Just like a module, or a function call
- The values in an object's scope are its members
- Create members using dotted notation: self.x = 3
- Gives the current object a new member x with the value 3
- Or overwrites the existing member x with the value 3
import math
class Point(object):
    def set_values(self, x, y):
        self.x = x
        self.y = y
    def get_values(self):
        return (self.x, self.y)
    def norm(self):
        return math.sqrt(self.x ** 2 + self.y ** 2)
if __name__ == '__main__':
    p = Point()
    p.set_values(1.2, 3.5)
    print 'p is', p.get_values()
    print 'norm is', p.norm()
p is (1.2, 3.5)
norm is 3.7
![[Creating a Simple Point]](./img/oop01/simple_point.png)
Figure 14.2: Creating a Simple Point
Encapsulation
- Encapsulation is one of the three defining principles of OOP
- Programs are much easier to write, read, and maintain if object members are only ever accessed by methods
- But unlike C++, Java, and C#, Python doesn't allow programmers to hide methods or data members
p = Point()
p.x = 3.5
p.y = 4.25
print 'point is', p.get_values()
point is (3.5, 4.25)
- Any function or method can see and modify any object's internals
- Resist the temptation to program this way!
- If you manipulate an object's internals directly, you have to change your program when you change the object's implementation
Constructors
- If a class has a method called __init__, Python will call it when building new instances
- Hence the name constructor
- Simpler than creating a blank object, then initializing its members
- And there's less chance the programmer will forget to do the initialization
import math
class Point(object):
    def __init__(self, x=0, y=0):
        self.reset(x, y)
    def reset(self, x, y):
        assert (type(x) is int) and (x >= 0), 'x is not non-negative integer'
        assert (type(y) is int) and (y >= 0), 'y is not non-negative integer'
        self.x = x
        self.y = y
    def get(self):
        return (self.x, self.y)
    def norm(self):
        return math.sqrt(self.x ** 2 + self.y ** 2)
if __name__ == '__main__':
    p = Point(1, 1)
    print 'point is initially', p.get()
    p.reset(1, 1)
    print 'p moved to', p.get()
point is initially (1, 1)
p moved to (1, 1)
Constructor Style
- A class can only have one constructor
- Some languages allow classes to have several, distinguished by argument types
- But since Python doesn't use type declarations, this wouldn't work
- It's good style to create all of the object's members in the constructor
- So that people only have to look in one place to find what members exist
- Note how the class checks values before changing the object's state
- Remember: fail early, fail often
Special Methods
- __init__ is just one example of a special method
- All have names beginning and ending with double underscore
- Give programmers a way to make their data types look like those built into Python
- Most widely used is __str__
- When Python needs a text representation of an object, it:
- Calls __str__ if it exists, or
- Creates a default representation that shows the object's location in memory
class Point(object):
    # ...
    def __str__(self):
        return '(%4.2f, %4.2f)' % (self.x, self.y)
if __name__ == '__main__':
    p = Point(3, 4)
    print 'point is', p
point is (3.00, 4.00)
New Classes from Old
- Suppose we have a class Organism that represents living things
- Common name, scientific name, …
- Want to create a class Mammal
- Body temperature, gestation period, …
- Wrong: copy Organism's definition and add more members and methods
- “Anything repeated in two or more places will eventually be wrong in at least one.”
- Right: use inheritance
- The second defining principle of OOP
- Derive a child class from a parent
- The child has all the members and methods of its parent, plus whatever else we give it
Inheritance Example
class Organism(object):
    def __init__(self, common_name, sci_name):
        self.common_name = common_name
        self.sci_name = sci_name
    def get_common_name(self):
        return self.common_name
    def get_sci_name(self):
        return self.sci_name
    def __str__(self):
        return '%s (%s)' % (self.common_name, self.sci_name)
class Mammal(Organism):
    def __init__(self, common_name, sci_name, body_temp, gest_period):
        Organism.__init__(self, common_name, sci_name)
        self.body_temp = body_temp
        self.gest_period = gest_period
    def get_body_temp(self):
        return self.body_temp
    def get_gest_period(self):
        return self.gest_period
    def __str__(self):
        extra = ' %4.2f degrees / %d days' % (self.body_temp, self.gest_period)
        return Organism.__str__(self) + extra
if __name__ == '__main__':
    creature = Mammal('wolf', 'canis lupus', 38.7, 63)
    print creature
wolf (canis lupus) 38.70 degrees / 63 days
![[Memory Model for Inheritance]](./img/oop01/inheritance.png)
Figure 14.3: Memory Model for Inheritance
Overriding Methods
- Mammal's constructor calls Organism's to initialize the organism-ish bits of the object
- And Mammal defines its own __str__ method
- Overrides the one defined by Organism
- Mammal.__str__ calls Organism.__str__ for the same reason that Mammal.__init__ calls Organism.__init__
- Python always calls the most specific method
- Keep the memory model in mind when figuring out what this will be
Polymorphism
- Polymorphism means “having more than one form”
- In object-oriented programming, it means handling specific objects in generic ways
- The third and final defining principle of OOP
- Derive a new class Bird from Organism
- As long as it only uses common methods, a single piece of code can work with both mammals and birds
class Bird(Organism):
    def __init__(self, common_name, sci_name, incubate_period):
        Organism.__init__(self, common_name, sci_name)
        self.incubate_period = incubate_period
    def get_incubate_period(self):
        return self.incubate_period
    def __str__(self):
        extra = ' %d days' % self.incubate_period
        return Organism.__str__(self) + extra
if __name__ == '__main__':
    creatures = [
        Bird('loon', 'gavia immer', 27),
        Mammal('grizzly bear', 'ursus arctos horribilis', 38.0, 210)
    ]
    for c in creatures:
        print c
loon (gavia immer) 27 days
grizzly bear (ursus arctos horribilis) 38.00 degrees / 210 days
Duck Typing
- Most languages only permit polymorphism via inheritance
- Lowest common ancestor of two classes defines how interchangeable they are
- In Python, any two classes that define the same set of methods can be used interchangeably
- Duck typing: “If it walks like a duck, and quacks like a duck, it might as well be a duck.”
class Mineral(object):
    def __init__(self, common_name, sci_name, formula):
        self.common_name = common_name
        self.sci_name = sci_name
        self.formula = formula
    def get_common_name(self):
        return self.common_name
    def get_sci_name(self):
        return self.sci_name
    def __str__(self):
        return '%s/%s: %s' % (self.common_name, self.sci_name, self.formula)
if __name__ == '__main__':
    things = [
        Mammal('arctic hare', 'Lepus arcticus', 40.1, 50),
        Mineral("fool's gold", 'iron pyrite', 'FeS2')
    ]
    for t in things:
        print t.get_common_name(), 'is', t.get_sci_name()
arctic hare is Lepus arcticus
fool's gold is iron pyrite
- Allows you to create plug-in replacements for files, strings, and other classes after the fact
- But makes it harder to figure out exactly what can be used in place of what
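A sketch of a plug-in replacement for a file. The function write_report only ever calls write on its argument, so any object with a suitable write method will do. LineCollector and write_report are hypothetical names; io.StringIO is the standard library's in-memory text file.

```python
import io

class LineCollector(object):
    '''Not a file, but it has the one method write_report needs.'''
    def __init__(self):
        self.chunks = []
    def write(self, text):
        self.chunks.append(text)

def write_report(stream, names):
    # Works with anything that has a write(text) method.
    for name in names:
        stream.write(name + '\n')

real = io.StringIO()
fake = LineCollector()
for target in (real, fake):
    write_report(target, ['loon', 'grizzly bear'])
```

Nothing declares that LineCollector is "file-like", which is both the convenience and the cost: the only way to know what can stand in for what is to read the code that uses it.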
The Liskov Substitution Principle
- The Liskov Substitution Principle states that it must always be possible to use an instance of a child class in place of an instance of its parent
- Means that Child.meth may ignore some of Parent.meth's pre-conditions, but may not impose more
- Equivalently, Child.meth accepts everything that Parent.meth did, and possibly more
- So any code that could call Parent.meth correctly is guaranteed to call Child.meth correctly too
- And Child.meth must satisfy all the post-conditions of Parent.meth, and may impose more
- So Child.meth's possible output is a subset of Parent.meth's
- And any code that works correctly on the output of Parent.meth will still work if given an instance of Child instead
- The same constraint applies when a class evolves over time
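A small sketch of the principle, with invented contracts. Parent.meth requires a non-negative number and promises a non-negative result. Child.meth accepts more (negative numbers too) and promises more (a non-negative whole number), so code written against Parent keeps working when handed a Child.

```python
import math

class Parent(object):
    def meth(self, x):
        # Pre-condition: x is a non-negative number.
        # Post-condition: the result is a non-negative number.
        assert x >= 0, 'Parent.meth requires x >= 0'
        return math.sqrt(x)

class Child(Parent):
    def meth(self, x):
        # Weaker pre-condition: negative x is accepted as well.
        # Stronger post-condition: the result is a non-negative whole number.
        return float(math.floor(math.sqrt(abs(x))))

def client(obj):
    # Written against Parent's contract only.
    result = obj.meth(16)
    assert result >= 0
    return result

from_parent = client(Parent())
from_child = client(Child())
extra = Child().meth(-9)      # allowed by Child, not by Parent
```

Reversing the relationship would break substitution: if Child.meth rejected inputs that Parent.meth accepted, client could fail when given a Child.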
Tidal Pools Revisited
- How to represent the creatures in a tidal pool?
- Each species is a class
- Use inheritance to separate plants from animals
- Derive both Plant and Animal from Organism
- How to handle movement?
- Either give Organism two methods: can_move and move
- Plant.can_move() returns False
- Plant.move() raises an exception
- Or give Organism one method: move
- Plant.move() does nothing
- The second option simplifies code
- Uses polymorphism instead of a conditional
- On the other hand, the existence of Plant.move implies that plants can do something they can't
- Can't really choose between them without knowing what the rest of the code needs
- Usually doesn't make sense to design one class on its own
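The two designs for movement can be compared side by side. This is only a sketch: the V1 and V2 suffixes, the position member, and the step functions are invented for the comparison, and for brevity the methods are defined on the child classes.

```python
class Organism(object):
    def __init__(self):
        self.position = 0       # hypothetical member, for illustration

# Design 1: two methods, so callers must ask before moving.
class PlantV1(Organism):
    def can_move(self):
        return False
    def move(self):
        raise RuntimeError('plants cannot move')

class FishV1(Organism):
    def can_move(self):
        return True
    def move(self):
        self.position += 1

def step_v1(world):
    for thing in world:
        if thing.can_move():    # a conditional in every caller
            thing.move()

# Design 2: one method, and plants quietly do nothing.
class PlantV2(Organism):
    def move(self):
        pass

class FishV2(Organism):
    def move(self):
        self.position += 1

def step_v2(world):
    for thing in world:
        thing.move()            # polymorphism instead of a conditional

world_v1 = [PlantV1(), FishV1()]
world_v2 = [PlantV2(), FishV2()]
step_v1(world_v1)
step_v2(world_v2)
```

Both versions leave the plant where it was and move the fish one step. The difference lies in where the knowledge that plants do not move is kept.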
Class, Responsibility, Collaborator
- Many programmers use CRC cards when designing OO systems
- Stands for “class, responsibility, collaborator”
- Standard 3×5 index cards
- Top is the class name
- Left side is point-form description of what the class can do
- Right side lists other classes that this one interacts with
![[CRC Cards]](./img/oop01/crc.png)
Figure 14.4: CRC Cards
- Designed so that you won't take them too seriously
- Lay them out on a table
- Talk through your program's execution
- Move cards around, scribble new responsibilities and collaborators on them
- Create new cards as needed
Summary
- Classes and objects are just another way to modularize programs
- But used well, they can make programs much simpler, and much more adaptable
- Remember: the goal is to simplify, not to dazzle
More on Objects
Introduction
You Can Skip This Lecture If...
- You know what an overloaded operator is
- You know what a static member is
- You know what a design pattern is
Length
- We've already met some special methods, like __init__ and __str__
- Usually not called directly
- Instead, Python automatically invokes them at specific times (object creation and string creation)
- There are lots of other special methods
- For example, if obj has a __len__ method, Python calls it whenever it sees len(obj)
class Recent(object):
    def __init__(self, number=3):
        self.number = number
        self.items = []
    def __str__(self):
        return str(self.items)
    def add(self, item):
        self.items.append(item)
        self.items = self.items[-self.number:]
    def __len__(self):
        return len(self.items)
if __name__ == '__main__':
    history = Recent()
    for era in ['Permian', 'Triassic', 'Jurassic', 'Cretaceous', 'Tertiary']:
        history.add(era)
        print len(history), history
1 ['Permian']
2 ['Permian', 'Triassic']
3 ['Permian', 'Triassic', 'Jurassic']
3 ['Triassic', 'Jurassic', 'Cretaceous']
3 ['Jurassic', 'Cretaceous', 'Tertiary']
Overloading Operators
- The expression "a + b" is “just” a shorthand for add(a, b)
- Or, if a is an object, for a.add(b)
- But since people might actually want to use the name add, Python spells this method __add__
- If a class defines a method __add__, it is called whenever something is +'d to the object
- I.e., x + y calls x.__add__(y)
class Recent(object):
    # ...__init__, __str__, and __len__ as before
    def __add__(self, item):
        self.items.append(item)
        self.items = self.items[-self.number:]
        return self
if __name__ == '__main__':
    history = Recent()
    for era in ['Permian', 'Triassic', 'Jurassic', 'Cretaceous', 'Tertiary']:
        history = history + era
        print len(history), history
1 ['Permian']
2 ['Permian', 'Triassic']
3 ['Permian', 'Triassic', 'Jurassic']
3 ['Triassic', 'Jurassic', 'Cretaceous']
3 ['Jurassic', 'Cretaceous', 'Tertiary']
Commutativity
- 2 + x and x + 2 don't always do the same thing
- Classes can define right-hand versions of operators, e.g., __radd__ instead of __add__
- If the object on the left has an __add__ method, call that
- Otherwise, if the object on the right has an __radd__ method, call that
- Otherwise, try Python's built-ins
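A sketch of the left-hand and right-hand lookup, using a made-up Metres class. In x + 2 the object is on the left, so __add__ is used. In 2 + x the integer on the left does not know how to add a Metres object, so Python falls back on the object's __radd__.

```python
class Metres(object):
    '''Hypothetical length type, used only for this illustration.'''
    def __init__(self, value):
        self.value = value
    def __float__(self):
        return float(self.value)
    def __add__(self, other):
        # Called for Metres + something.
        return Metres(self.value + float(other))
    def __radd__(self, other):
        # Called for something + Metres, when the left-hand operand
        # cannot handle the addition itself.
        return Metres(float(other) + self.value)

x = Metres(1.5)
left = x + 2       # uses Metres.__add__
right = 2 + x      # uses Metres.__radd__
both = x + x       # uses Metres.__add__; float(other) calls __float__
```

Here the two orders give the same answer because the class was written that way; nothing in the language forces __add__ and __radd__ to agree.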
Other Special Methods
- (Almost) every aspect of an object's behavior can be overridden by defining the right method(s)
| Method | Purpose |
|---|---|
| __lt__(self, other) | Less than comparison; __le__, __ne__, and others are used for less than or equal, not equal, etc. |
| __call__(self, args…) | Called for obj(3, "lithium") |
| __len__(self) | Object “length” |
| __getitem__(self, key) | Called for obj[3.14] |
| __setitem__(self, key, value) | Called for obj[3.14] = 2.17 |
| __contains__ | Called for "lithium" in obj |
| __add__ | Called for obj + value; use __mul__ for obj * value, etc. |
| __int__ | Called for int(obj); use __float__ and others to convert to other types |

Table 15.1: Special Methods
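A short sketch exercising several of the methods in the table at once. Inventory is a hypothetical container invented for this example, and what its __call__ does (add to a quantity) is an arbitrary choice.

```python
class Inventory(object):
    '''Hypothetical container mapping chemical names to quantities.'''
    def __init__(self):
        self.amounts = {}
    def __setitem__(self, key, value):      # inv[key] = value
        self.amounts[key] = value
    def __getitem__(self, key):             # inv[key]
        return self.amounts[key]
    def __contains__(self, key):            # key in inv
        return key in self.amounts
    def __len__(self):                      # len(inv)
        return len(self.amounts)
    def __call__(self, key, extra):         # inv(key, extra)
        self.amounts[key] = self.amounts.get(key, 0) + extra
        return self.amounts[key]

inv = Inventory()
inv['lithium'] = 3               # __setitem__
has_lithium = 'lithium' in inv   # __contains__
how_many = len(inv)              # __len__
topped_up = inv('lithium', 2)    # __call__
current = inv['lithium']         # __getitem__
```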
Example: Sparse Vector
- A vector is sparse if most of its entries are zero
- Use a dictionary to record non-zero values and their indices
- No point padding eleven actual values with nine million zeroes
- Overload operators to make the object look like a “real” vector:
- Addition: create a new vector with a non-zero value wherever either operand had a non-zero value
- Dot product: add up products of matching non-zero values
- Length: return one more than the index of the largest non-zero value
- “One more” to be consistent with Python's lists
How Long is a Sparse Vector?
- What is the length of v after the following operations?
- This isn't really a programming question
- “Largest current index” and “largest index ever seen” can both be implemented
- The latter is easy, so we'll use that
Vector Behavior
- Construction creates an empty sparse vector
- Define __len__, __getitem__, and __setitem__ to make it behave like a list
- Exercise: implement del sparse[index]
class SparseVector(object):
    '''Implement a sparse vector.  If a value has not been set
    explicitly, its value is zero.'''
    def __init__(self):
        '''Construct a sparse vector with all zero entries.'''
        self.data = {}
    def __len__(self):
        '''The length of a vector is one more than the largest index.'''
        if self.data:
            return 1 + max(self.data.keys())
        return 0
    def __getitem__(self, key):
        '''Return an explicit value, or 0.0 if none has been set.'''
        if key in self.data:
            return self.data[key]
        return 0.0
    def __setitem__(self, key, value):
        '''Assign a new value to a vector entry.'''
        if type(key) is not int:
            raise KeyError('non-integer index to sparse vector')
        self.data[key] = value
Dot Product
- The other object (on the right side of "*") is usually called other
- No reason to insist that it be a sparse vector
- Could equally well be a list of values
- So loop over our indices, and multiply by corresponding values in other object
- Any index not encountered in this loop doesn't matter, since it corresponds to something that's zero
- And make __rmul__ do the same thing as __mul__
    def __mul__(self, other):
        '''Calculate dot product of a sparse vector with something else.'''
        result = 0.0
        for k in self.data:
            result += self.data[k] * other[k]
        return result
    def __rmul__(self, other):
        return self.__mul__(other)
Addition
- Trickier than multiplication: result is non-zero wherever either argument is non-zero
- Don't want to loop over all the zeroes of either argument
- Solution: if the other object is a sparse vector, cheat
- I.e., reach inside it, and rely on details of its implementation
    def __add__(self, other):
        '''Add something to a sparse vector.'''
        # Initialize result with all non-zero values from this vector.
        result = SparseVector()
        result.data.update(self.data)
        # If the other object is also a sparse vector, add non-zero values.
        if isinstance(other, SparseVector):
            for k in other.data:
                result[k] = result[k] + other[k]
        # Otherwise, use brute force.
        else:
            for i in range(len(other)):
                result[i] = result[i] + other[i]
        return result
    # Right-hand add does the same thing as left-hand add.
    __radd__ = __add__
Testing
- The class isn't written until the tests are finished
- Exercise: replace the print statements with assertions
if __name__ == '__main__':
    x = SparseVector()
    x[1] = 1.0
    x[3] = 3.0
    x[5] = 5.0
    print 'len(x)', len(x)
    for i in range(len(x)):
        print '...', i, x[i]
    y = SparseVector()
    y[1] = 10.0
    y[2] = 20.0
    y[3] = 30.0
    print 'x + y', x + y
    print 'y + x', y + x
    print 'x * y', x * y
    print 'y * x', y * x
    z = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
    print 'x + z', x + z
    print 'x * z', x * z
    print 'z + x', z + x
len(x) 6
... 0 0.0
... 1 1.0
... 2 0.0
... 3 3.0
... 4 0.0
... 5 5.0
x + y [0.0, 11.0, 20.0, 33.0, 0.0, 5.0]
y + x [0.0, 11.0, 20.0, 33.0, 0.0, 5.0]
x * y 100.0
y * x 100.0
x + z [0.0, 1.1, 0.2, 3.3, 0.4, 5.5]
x * z 3.5
z + x [0.0, 1.1, 0.2, 3.3, 0.4, 5.5]
Static Data Members
- Sometimes want to share data between all instances of a class
- Constants, a count of the number of class instances created, etc.
- Any data members defined inside the class block belong to the class as a whole
class Counter(object):
    num = 0 # Number of Counter objects created.
    def __init__(self, name):
        Counter.num += 1
        self.name = name
if __name__ == '__main__':
    print 'initial count', Counter.num
    first = Counter('first')
    print 'after creating first object', Counter.num
    second = Counter('second')
    print 'after creating second object', Counter.num
initial count 0
after creating first object 1
after creating second object 2
Static Methods
- Can also create static methods
- Just like a function, but put inside the class definition for clarity
- Define the method without the self parameter
- Since it isn't tied to any particular instance of the class
- Put @staticmethod in front of it
- A decorator
- Powerful, but beyond the scope of this course
class Experiment(object):
    already_done = {}
    @staticmethod
    def get_results(name, *params):
        if name in Experiment.already_done:
            return Experiment.already_done[name]
        exp = Experiment(name, *params)
        exp.run()
        Experiment.already_done[name] = exp
        return exp
    def __init__(self, name, *params):
        self.name = name
        self.params = params
    def run(self):
        # ...body of the experiment omitted...
        pass
if __name__ == '__main__':
    first = Experiment.get_results('anti-gravity')
    second = Experiment.get_results('time travel')
    third = Experiment.get_results('anti-gravity')
    print 'first ', id(first)
    print 'second', id(second)
    print 'third ', id(third)
first 5120204
second 5120396
third 5120204
Design Patterns
- Style rules describe what code should look like line by line
- Design patterns are how we describe larger patterns
- A standard solution to a commonly-occurring problem
- That isn't specific enough to be captured once and for all in a library routine or framework
- Idea developed from the 1960s on by the (building) architect Christopher Alexander
- For example, it's hard to define what a porch is, but the basic idea comes up everywhere the climate is warm
- Introduced to programmers in [Gamma et al 1995]
- Still a bestseller, but not particularly approachable
The Singleton Pattern
- Problem: want to ensure that there's only ever one instance of a particular class
- E.g. the controller for a radio telescope antenna
- Considerations:
- There must be exactly one instance of the class
- All objects that use the class must have access to that instance
- Solution:
- Create objects by calling a function instead of the class's constructor
- Have the function store a reference to the first object it creates
- Have it return that same object on every subsequent call
Singleton Implementation
class AntennaClass(object):
    '''Singleton that controls a radio telescope.'''
    # The unique instance of the class.
    instance = None
    # The constructor fails if an instance already exists.
    def __init__(self, max_rotation):
        assert AntennaClass.instance is None, 'Trying to create a second instance!'
        self.max_rotation = max_rotation
        AntennaClass.instance = self
# Make the creation function look like a class constructor.
def Antenna(max_rotation):
    '''Create and store an AntennaClass instance, or return the one
    that has already been created.'''
    if AntennaClass.instance is not None:
        return AntennaClass.instance
    return AntennaClass(max_rotation)
Demonstration
first = Antenna(23.5)
print 'first instance:', id(first)
second = Antenna(47.25)
print 'second instance:', id(second)
first instance: 10685200
second instance: 10685200
The Visitor Pattern
- Problem: want an easy way to walk around a complex structure
- E.g. visit each value in a list of lists of lists exactly once
- Considerations:
- Many different operations may need to be performed
- Structure is complex enough that visiting elements is error-prone
- The types of objects in the structure, and the ways they are connected, are fixed
- Solution:
- Create a class that knows how to get to each value in turn
- Give it an empty method that is called once for each value
- Users derive from this class and fill in the method
- An all-in-one version of the framework shown earlier
Visitor Implementation
class NestedListVisitor(object):
    '''Visit each element in a list of nested lists.'''
    def __init__(self, data):
        '''Construct, but do not run.'''
        assert type(data) is list, 'Only works on lists!'
        self.data = data
    def run(self):
        '''Iterate over all values.'''
        self.recurse(self.data)
    def recurse(self, current):
        '''Loop over a particular list or sub-list (not meant
        to be called by users).'''
        if type(current) is list:
            for v in current:
                self.recurse(v)
        else:
            self.visit(current)
    def visit(self, value):
        '''Users should fill this method in.'''
        pass
Demonstration
class MaxOfN(NestedListVisitor):
    def __init__(self, data):
        NestedListVisitor.__init__(self, data)
        self.max = None
        self.count = 0
    def visit(self, value):
        self.count += 1
        if self.max is None:
            self.max = value
        else:
            self.max = max(self.max, value)
test_data = [['gold', 'lead'], 'zinc', [['silver', 'iron'], 'mercury']]
test = MaxOfN(test_data)
test.run()
print 'max:', test.max
print 'count:', test.count
max: zinc
count: 6
The Abstract Factory Pattern
- Problem: application doesn't know the specific types of objects it wants to create until runtime
- If the chromatograph is an RCT-100, create an RCT-100 controller and an RCT-100 configuration panel
- If it's a Subalta 4C, create a Subalta 4C controller and configuration panel
- Considerations:
- Objects can be grouped by category and family
![[]](./img/oop02/factory_type_family.png)
- New categories or families may appear later
- Solution:
- Create a class that knows how to build an instance of each category for a particular family
- Create another class that stores instances of these builder classes, and calls their methods when asked to
- Adding a new family is easy, but adding a new category requires changes to every builder
Abstract Factory Builder
class AbstractFamily(object):
    '''Builders for particular families derive from this.'''
    def __init__(self, family):
        self.family = family
    def get_name(self):
        # The family name doubles as the builder's name.
        return self.family
    def make_controller(self):
        raise NotImplementedError('make_controller missing')
    def make_configuration_panel(self):
        raise NotImplementedError('make_configuration_panel missing')
Abstract Factory Manager
class FactoryManager(object):
    '''Manage builders by family.'''
    def __init__(self, current_family=None):
        self.builders = {}
        self.family = current_family
    def set_family(self, family):
        assert family, 'Empty family'
        self.family = family
    def add(self, builder):
        name = builder.get_name()
        self.builders[name] = builder
    def make_controller(self):
        self._check_state()
        return self.builders[self.family].make_controller()
    def make_configuration_panel(self):
        self._check_state()
        return self.builders[self.family].make_configuration_panel()
    def _check_state(self):
        assert self.family, 'No family specified'
        assert self.family in self.builders, 'Unknown family: %s' % self.family
Demonstration
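The demonstration code for this slide is missing from these notes. The sketch below shows one way the builder and manager might be used together. The RCT-100 and Subalta 4C product and builder classes are hypothetical stand-ins, and the base class and manager are restated in condensed form only so that the example runs on its own.

```python
class AbstractFamily(object):
    '''Builders for particular families derive from this.'''
    def __init__(self, family):
        self.family = family
    def get_name(self):
        return self.family
    def make_controller(self):
        raise NotImplementedError('make_controller missing')
    def make_configuration_panel(self):
        raise NotImplementedError('make_configuration_panel missing')


class FactoryManager(object):
    '''Manage builders by family (condensed).'''
    def __init__(self, family=None):
        self.builders = {}
        self.family = family
    def set_family(self, family):
        assert family, 'Empty family'
        self.family = family
    def add(self, builder):
        self.builders[builder.get_name()] = builder
    def make_controller(self):
        self._check_state()
        return self.builders[self.family].make_controller()
    def make_configuration_panel(self):
        self._check_state()
        return self.builders[self.family].make_configuration_panel()
    def _check_state(self):
        assert self.family in self.builders, 'Unknown family: %s' % self.family


# Hypothetical products: one controller and one panel per family.
class RCT100Controller(object): pass
class RCT100Panel(object): pass
class Subalta4CController(object): pass
class Subalta4CPanel(object): pass


class RCT100Family(AbstractFamily):
    def __init__(self):
        AbstractFamily.__init__(self, 'RCT-100')
    def make_controller(self):
        return RCT100Controller()
    def make_configuration_panel(self):
        return RCT100Panel()


class Subalta4CFamily(AbstractFamily):
    def __init__(self):
        AbstractFamily.__init__(self, 'Subalta 4C')
    def make_controller(self):
        return Subalta4CController()
    def make_configuration_panel(self):
        return Subalta4CPanel()


manager = FactoryManager()
manager.add(RCT100Family())
manager.add(Subalta4CFamily())
manager.set_family('Subalta 4C')    # in a real program, decided at run time
controller = manager.make_controller()
panel = manager.make_configuration_panel()
print(type(controller).__name__)
print(type(panel).__name__)
```

The calling code never names a concrete class: switching instruments means changing the string passed to `set_family`.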
The Command Pattern
- Problem: want to be able to control the operation of a complex object
- Turn on the robot arm, move it, lower it, move it again, etc.
- Considerations:
- Do not want to have to write an entirely new program for each sequence of operations
- Want to be able to add new operations
- Would like to be able to undo operations
- Solution:
- Create one class for each distinct operation
- Give the class do, undo, and redo methods
- Create instances of these classes to represent particular commands
- Create lists of these instances to control the robot arm
Base Command Class
class AbstractCommand(object):
    '''Base class for commands.'''
    def is_undoable(self):
        return False # by default, can't undo/redo operations
    def do(self, robot):
        # Report the concrete class, since commands have no name attribute.
        raise NotImplementedError("Don't know how to do %s" % self.__class__.__name__)
    def undo(self, robot):
        pass
    def redo(self, robot):
        pass
A Particular Command
class MoveCommand(AbstractCommand):
    '''Move the robot arm.'''
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z
    def is_undoable(self):
        return True
    def do(self, robot):
        robot.translate(self.x, self.y, self.z)
    def undo(self, robot):
        robot.translate(-self.x, -self.y, -self.z)
    def redo(self, robot):
        self.do(robot)
Demonstration
robot = Robot()
commands = [MoveCommand(5.0, 2.0, 2.3),
            RotateCommand(-90.0, 0.0, 0.0),
            MoveCommand(1.0, 2.0, 2.0),
            CloseHandCommand()]
for c in commands:
    c.do(robot)
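Undo was one of the stated considerations, so it may help to see it exercised. This is only a sketch: the `Robot` class here is a hypothetical stand-in that tracks a position and nothing else, and the history list is one possible way to remember what has been done.

```python
class Robot(object):
    '''Hypothetical stand-in for the real arm: records position only.'''
    def __init__(self):
        self.position = [0.0, 0.0, 0.0]
    def translate(self, dx, dy, dz):
        self.position[0] += dx
        self.position[1] += dy
        self.position[2] += dz


class MoveCommand(object):
    '''Move the robot arm (same shape as the class above).'''
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z
    def is_undoable(self):
        return True
    def do(self, robot):
        robot.translate(self.x, self.y, self.z)
    def undo(self, robot):
        robot.translate(-self.x, -self.y, -self.z)


robot = Robot()
history = []                        # commands that have been carried out
for command in [MoveCommand(5.0, 2.0, 2.5), MoveCommand(1.0, 2.0, 2.0)]:
    command.do(robot)
    history.append(command)
after_do = list(robot.position)

# Undo in reverse order, skipping anything that cannot be undone.
while history:
    command = history.pop()
    if command.is_undoable():
        command.undo(robot)
after_undo = list(robot.position)
```

Keeping the commands as objects is what makes this possible: a plain sequence of function calls leaves nothing behind to reverse.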
A Few Others
- Cache: store temporary copies of objects locally to improve performance
- State: record state of program as object so that it can be re-started
- Null Object: use an object that does nothing in place of null
- Saves testing that object isn't null before doing operations
- Adapter: wrap one object in another to give the first a different interface
- Usually used to give a new library an interface that's compatible with an old one
- Proxy: use one object as an interface to another
- Typically, the proxy is local, and the real object is on another machine
Summary
- Overloading, design patterns, and other advanced concepts serve two purposes:
- Communication: a concise way for designers to communicate with each other
- Education: gives them a way to communicate what they know to newcomers
- Don't expect to connect them all to your own experience the first time
- But keep them in mind as you look at new problems
Unit Testing
Introduction
- Unit testing follows a pattern
- Setup and teardown
- Lots of small, independent tests
- Reporting
- Combine tests into test suites, and test suites into larger suites
- See a pattern, build a framework
- Write shared code once
- Encourage people to work a certain way
- I.e., make it easy for them to do things right
JUnit and Its Children
- JUnit is a testing framework originally written by Kent Beck and Erich Gamma in 1997
- Made testing easy enough that programmers actually started doing it
- Now integrated into almost all Java IDEs
- Widely imitated:
- Workalikes are available for C++, Perl, .NET, etc.
- Once you know one, you can easily learn and use the others
- Add-ons for measuring test execution times, recording tests, testing web applications, etc.
- This lecture introduces Python's version, called unittest
You Can Skip This Lecture If...
- You know what a test suite is
- You know what setup and teardown are
- You know how to test for exceptions
- You know how to test I/O
- You know what stubs and mock objects are
The Big Idea
- Define one method for each test
- Method name must begin with “test”
- Method must not take any parameters (other than self)
- Shouldn't return anything
- Group related tests together in classes
- Which must be derived from unittest.TestCase
- Call unittest.main(), which:
- Searches the module (i.e., the file) to find all classes derived from unittest.TestCase
- Runs methods whose names begin with “test” in an order you shouldn't rely on (by default, sorted by method name)
- Another reason not to make tests dependent on each other
- Counts and reports the passes, fails, and errors
Checking
- Actually check things inside test methods using methods provided by TestCase
- Allows the framework to distinguish between test assertions, and normal assert statements
- Since the code being tested might use the latter
- Checking methods include:
- assert_(condition): check that something is true (note the underscore; newer versions of unittest call this assertTrue)
- assertEqual(a, b): check that two things are equal
- assertNotEqual(a, b): the reverse of the above
- assertRaises(exception, func, …args…): call func with arguments (if provided), and check that it raises the right exception
- fail(): signal an unconditional failure
Example: Checking Addition
import unittest
class TestAddition(unittest.TestCase):
    def test_zeroes(self):
        self.assertEqual(0 + 0, 0)
        self.assertEqual(5 + 0, 5)
        self.assertEqual(0 + 13.2, 13.2)
    def test_positive(self):
        self.assertEqual(123 + 456, 579)
        self.assertEqual(1.2e20 + 3.4e20, 3.5e20)
    def test_mixed(self):
        self.assertEqual(-19 + 20, 1)
        self.assertEqual(999 + -1, 998)
        self.assertEqual(-300.1 + -400.2, -700.3)
if __name__ == '__main__':
    unittest.main()
.F.
======================================================================
FAIL: test_positive (__main__.TestAddition)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test_addition.py", line 12, in test_positive
self.assertEqual(1.2e20 + 3.4e20, 3.5e20)
AssertionError: 4.6e+20 != 3.5e+20
----------------------------------------------------------------------
Ran 3 tests in 0.000s
FAILED (failures=1)
- The typing mistake is easily fixed
Running Sums
- You want to test a function that calculates a running sum of the values in a list
- Given [a, b, c, …], it produces [a, a+b, a+b+c, …]
- Test cases:
- Empty list
- Single value
- Long list with mix of positive and negative values
- Hm…is it supposed to:
- Return a new list?
- Modify its argument in place and return that?
- Modify its argument and return None?
- Your tests can only ever be as good as (your understanding of) the spec
- Assume for now that it's supposed to return a new list
Flawed Implementation
- First implementation
def running_sum(seq):
    result = seq[0:1]
    for i in range(2, len(seq)):
        result.append(result[i-1] + seq[i])
    return result
class SumTests(unittest.TestCase):
    def test_empty(self):
        self.assertEqual(running_sum([]), [])
    def test_single(self):
        self.assertEqual(running_sum([3]), [3])
    def test_double(self):
        self.assertEqual(running_sum([2, 9]), [2, 11])
    def test_long(self):
        self.assertEqual(running_sum([-3, 0, 3, -2, 5]), [-3, -3, 0, -2, 3])
F.E.
======================================================================
ERROR: test_long (__main__.SumTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File "running_sum_wrong.py", line 22, in test_long
self.assertEqual(running_sum([-3, 0, 3, -2, 5]), [-3, -3, 0, -2, 3])
File "running_sum_wrong.py", line 7, in running_sum
result.append(result[i-1] + seq[i])
IndexError: list index out of range
======================================================================
FAIL: test_double (__main__.SumTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File "running_sum_wrong.py", line 19, in test_double
self.assertEqual(running_sum([2, 9]), [2, 11])
AssertionError: [2] != [2, 11]
----------------------------------------------------------------------
Ran 4 tests in 0.001s
FAILED (failures=1, errors=1)
- One failure, one error
- Use this information to guide your diagnosis of the problem
Check and Re-check
- Fix the function and rerun the tests
def running_sum(seq):
    result = seq[0:1]
    for i in range(1, len(seq)):
        result.append(result[i-1] + seq[i])
    return result
....
----------------------------------------------------------------------
Ran 4 tests in 0.000s
OK
- Most first attempts to fix bugs are wrong, or introduce new bugs [McConnell 2004]
- Continuous testing catches these mistakes while they're still fresh
Is This Cost-Effective?
- Should you really go to this much effort to test a simple function?
- Took less than a minute to write the four tests
- Uncovered one gap in the requirements, and one error in the first implementation
- Able to verify the fix almost instantly
- Sounds pretty good to me…
- Did you notice that we aren't checking that the input list isn't modified?
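One way to close that gap is to keep a copy of the argument and compare it after the call. The sketch below assumes the corrected `running_sum` and the "returns a new list" reading of the spec. The test class name is made up, and the suite is run by hand rather than through `unittest.main()` so the outcome can be inspected.

```python
import unittest

def running_sum(seq):
    '''Corrected version: builds and returns a new list.'''
    result = seq[0:1]
    for i in range(1, len(seq)):
        result.append(result[i-1] + seq[i])
    return result

class UnmodifiedInputTests(unittest.TestCase):

    def test_input_unchanged(self):
        original = [-3, 0, 3, -2, 5]
        argument = list(original)       # the copy that is handed to the function
        running_sum(argument)
        self.assertEqual(argument, original)

    def test_result_is_new_list(self):
        argument = [1, 2]
        self.assertTrue(running_sum(argument) is not argument)

# Run the two tests and keep the outcome for inspection.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(UnmodifiedInputTests)
result = unittest.TestResult()
suite.run(result)
print(result.wasSuccessful())
```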
Eliminating Redundancy
- Setting up a fixture can often be more work than writing the test
- The more complex the data structures, the less often you want to have to type them in
- If the test class defines a setUp method, unittest calls it before running each test
- And if there's a tearDown method, it is run after each test
- Example: test a method that removes atoms from molecules
class TestThiamine(unittest.TestCase):
    def setUp(self):
        self.fixture = Molecule(C=12, H=20, O=1, N=4, S=1)
    def test_erase_nothing(self):
        nothing = Molecule()
        self.fixture.erase(nothing)
        self.assertEqual(self.fixture['C'], 12)
        self.assertEqual(self.fixture['H'], 20)
        self.assertEqual(self.fixture['O'], 1)
        self.assertEqual(self.fixture['N'], 4)
        self.assertEqual(self.fixture['S'], 1)
    def test_erase_single(self):
        self.fixture.erase(Molecule(H=1))
        self.assertEqual(self.fixture, Molecule(C=12, H=19, O=1, N=4, S=1))
    def test_erase_self(self):
        self.fixture.erase(self.fixture)
        self.assertEqual(self.fixture, Molecule())
.E.
======================================================================
ERROR: test_erase_self (__main__.TestThiamine)
----------------------------------------------------------------------
Traceback (most recent call last):
File "setup.py", line 49, in test_erase_self
self.fixture.erase(self.fixture)
File "setup.py", line 21, in erase
for k in other.atoms:
RuntimeError: dictionary changed size during iteration
----------------------------------------------------------------------
Ran 3 tests in 0.000s
FAILED (errors=1)
- Erasing a molecule from itself doesn't work
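The traceback points at a loop over `other.atoms` that changes the dictionary while iterating over it, which is exactly what happens when `other` is the molecule itself. The `Molecule` class is not shown in these notes, so the version below is a hypothetical reconstruction. It illustrates one possible fix: take a snapshot of the items before modifying anything.

```python
class Molecule(object):
    '''Hypothetical reconstruction: maps element symbols to atom counts.'''

    def __init__(self, **atoms):
        self.atoms = dict(atoms)

    def __getitem__(self, symbol):
        return self.atoms.get(symbol, 0)

    def __eq__(self, other):
        return self.atoms == other.atoms

    def erase(self, other):
        '''Remove the atoms in 'other' from this molecule.'''
        # Copy the items first, so that deleting entries from self.atoms
        # is safe even when 'other' is this very object.
        for symbol, count in list(other.atoms.items()):
            remaining = self.atoms.get(symbol, 0) - count
            if remaining > 0:
                self.atoms[symbol] = remaining
            else:
                self.atoms.pop(symbol, None)


thiamine = Molecule(C=12, H=20, O=1, N=4, S=1)
thiamine.erase(Molecule(H=1))
after_single = thiamine['H']
thiamine.erase(thiamine)            # no longer raises RuntimeError
after_self = thiamine.atoms
```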
Testing Exceptions
- Testing that code fails in the right way is just as important as testing that it does the right thing
- Otherwise, someone will do something wrong some day, and the code won't report it
- In Python, use TestCase.assertRaises to check that a specific function raises a specific exception
- In most languages, have to use try/except yourself
- Run the test
- If execution goes on past it, it didn't raise an exception at all (failure)
- If the right exception is caught, the test passed
- If any other exception is caught, the test failed
Manual Exception Testing Example
- Example: manually test error handling in a function that finds all values in a double-ended range
- Raises ValueError if the range is empty, or if the set of values is empty
class TestInRange(unittest.TestCase):
    def test_no_values(self):
        try:
            in_range([], 0.0, 1.0)
        except ValueError:
            pass
        else:
            self.fail()
    def test_bad_range(self):
        try:
            in_range([0.0], 4.0, -2.0)
        except ValueError:
            pass
        else:
            self.fail()
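For comparison, the same two checks can be written with `assertRaises`. The `in_range` function is not shown in these notes, so the implementation below is a hypothetical one, written only so that the tests have something to call.

```python
import unittest

def in_range(values, low, high):
    '''Hypothetical implementation: return the values lying in [low, high].'''
    if not values:
        raise ValueError('no values')
    if low > high:
        raise ValueError('empty range')
    return [v for v in values if low <= v <= high]

class TestInRange(unittest.TestCase):

    def test_no_values(self):
        # Passes only if the call raises ValueError.
        self.assertRaises(ValueError, in_range, [], 0.0, 1.0)

    def test_bad_range(self):
        self.assertRaises(ValueError, in_range, [0.0], 4.0, -2.0)

# Run by hand so the outcome can be inspected.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestInRange)
result = unittest.TestResult()
suite.run(result)
print(result.wasSuccessful())
```

Note that the function and its arguments are passed separately: writing `in_range([], 0.0, 1.0)` inside the call would raise the exception before `assertRaises` had a chance to catch it.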
Testing I/O
- Input and output often seem hard to test
- Store a bunch of input files in a subdirectory?
- Create temporary files when tests are run?
- The best answer is to do I/O on strings instead of files
- Python's StringIO and cStringIO modules can read and write strings instead of files
- Similar packages exist for C++, Java, and other languages
- This only works if the function being tested takes streams as arguments, rather than filenames
- If the function opens and closes the file, no way for you to substitute a fake file
- You have to design code to make it testable
I/O Testing Example
- Example: find lines where two files differ
- Input: two streams (which might be open files or StringIO wrappers around strings)
- Output: another stream (i.e., a file, or a StringIO)
class TestDiff(unittest.TestCase):
    def wrap_and_run(self, left, right, expected):
        left = StringIO(left)
        right = StringIO(right)
        actual = StringIO()
        diff(left, right, actual)
        self.assertEqual(actual.getvalue(), expected)
    def test_empty(self):
        self.wrap_and_run('', '', '')
    def test_lengthy_match(self):
        # Named 'text' rather than 'str' to avoid hiding the built-in type.
        text = '''\
a
b
c
'''
        self.wrap_and_run(text, text, '')
    def test_single_line_mismatch(self):
        self.wrap_and_run('a\n', 'b\n', '1\n')
    def test_middle_mismatch(self):
        self.wrap_and_run('a\nb\nc\n', 'a\nx\nc\n', '2\n')
- As a side effect, we've made the function itself more useful
- People can now use it to compare strings to strings, or strings to files
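The function under test is not shown in these notes. A minimal sketch that would satisfy the tests above might look like the following. It is written for Python 3, where `StringIO` lives in the `io` module, and its handling of inputs of unequal length is an assumption.

```python
from io import StringIO

def diff(left, right, output):
    '''Write the 1-based number of every line where the two streams differ.'''
    left_lines = left.readlines()
    right_lines = right.readlines()
    longest = max(len(left_lines), len(right_lines))
    for i in range(longest):
        # Assumption: a line missing from one stream counts as a difference.
        a = left_lines[i] if i < len(left_lines) else None
        b = right_lines[i] if i < len(right_lines) else None
        if a != b:
            output.write('%d\n' % (i + 1))

def run_diff(left_text, right_text):
    '''Helper: wrap two strings as streams and return the report as a string.'''
    out = StringIO()
    diff(StringIO(left_text), StringIO(right_text), out)
    return out.getvalue()

same = run_diff('a\nb\nc\n', 'a\nb\nc\n')
single = run_diff('a\n', 'b\n')
middle = run_diff('a\nb\nc\n', 'a\nx\nc\n')
```

Because `diff` only calls `readlines` and `write`, it works equally well with open files and with in-memory streams.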
Stubs and Mock Objects
- A stub is a placeholder for a function or method you haven't written yet
- Always returns the same value (or a random one)
- Created so that you don't have to wait until your whole program is written before running and testing it
- Eventually replaced with real code
- Mock objects are more sophisticated
- A mock object has the same interface as the object whose place it takes
- But return values of methods are hard-coded
- E.g., use a dictionary of possible argument values to look up the correct response, instead of consulting a database
- Used to isolate components during testing
- Use a real instance of the object under suspicion, and mock replacements for everything else
- Not thrown away once the program is working
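As an illustration of the dictionary-lookup idea, the sketch below replaces a database with a mock that answers from canned values. Every name in it (`MockDatabase`, `lookup`, `heaviest`) and every number is made up for the example.

```python
class MockDatabase(object):
    '''Mock with the same (hypothetical) lookup interface as a real database.'''

    def __init__(self, canned):
        self.canned = canned        # query -> hard-coded response
        self.queries = []           # record of calls, handy for assertions

    def lookup(self, name):
        self.queries.append(name)
        return self.canned[name]


def heaviest(database, names):
    '''Code under test: relies only on the lookup interface.'''
    return max(names, key=database.lookup)


# Made-up weights; no real database is consulted.
mock = MockDatabase({'Hebridean Black': 3200.0, 'Common Welsh Green': 2100.0})
answer = heaviest(mock, ['Common Welsh Green', 'Hebridean Black'])
```

Since the mock also records how it was called, a test can check not just the answer but which queries the code under test actually made.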
Test Performance
- Making tests run fast is another reason to use stubs, mock objects, and other tricks
- Reinitializing a database on disk can take 1-2 seconds
- So 500 tests take 10 minutes to run
- Makes it impractical for developers to re-run the tests after every small code change
- “Test performance” can also mean “test how fast the target code is”
- Record how long it takes to run the test suite
- Sudden increases or decreases may signal bugs
- Even if they don't, you probably want to know that your code is four times slower than it used to be before you ship it
Choosing Test Cases
- Human beings are creatures of habit
- Tend to make the same kinds of errors over and over again
- So test for those first
- Once you start testing for habitual errors, you become more conscious of them, and make them less often
- A catalog of errors
- Numbers: zero, largest, smallest magnitude, most negative
- Structures: empty, exactly one element, maximum number of elements
- Duplicate elements (e.g., the letter "J" appears three times in a string)
- Aliased elements (e.g., a list contains two references to another list)
- Circular structures (e.g., a list that contains a reference to itself)
- Searching: no match found, one match found, multiple matches found, everything matches
- Code like x = find_all(structure)[0] is almost always wrong
- Should also check aliased matches (same thing found multiple times)
Example: Rectangle Overlap
- Want to test a function that calculates the overlap between two rectangles
Solution
- Assume for the moment that Rect is correct
- I.e., that it has been tested elsewhere
- Each fixture will be a pair of rectangles
- The test will be to pass them to overlap, and see if the output is correct
- In this example, “boundary case” and “corner case” can be taken literally
![[Rectangle Overlap Test Cases]](./img/unit/rectangle_overlap.png)
Figure 16.1: Rectangle Overlap Test Cases
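A few of these cases can be written down directly. Neither `Rect` nor `overlap` is defined in these notes, so both are hypothetical here: `Rect` is a simple named tuple of corner coordinates, and `overlap` is assumed to return `None` when two rectangles merely touch.

```python
import unittest
from collections import namedtuple

# Hypothetical representation: lower-left and upper-right corners.
Rect = namedtuple('Rect', ['x0', 'y0', 'x1', 'y1'])

def overlap(a, b):
    '''Return the overlapping Rect, or None if the area of overlap is zero.'''
    x0, y0 = max(a.x0, b.x0), max(a.y0, b.y0)
    x1, y1 = min(a.x1, b.x1), min(a.y1, b.y1)
    if x0 >= x1 or y0 >= y1:
        return None
    return Rect(x0, y0, x1, y1)

class TestOverlap(unittest.TestCase):

    def setUp(self):
        self.unit = Rect(0, 0, 1, 1)

    def test_identical(self):
        self.assertEqual(overlap(self.unit, self.unit), self.unit)

    def test_partial(self):
        self.assertEqual(overlap(self.unit, Rect(0.5, 0.5, 2, 2)),
                         Rect(0.5, 0.5, 1, 1))

    def test_contained(self):
        inner = Rect(0.25, 0.25, 0.75, 0.75)
        self.assertEqual(overlap(self.unit, inner), inner)

    def test_shared_edge(self):         # a literal boundary case
        self.assertEqual(overlap(self.unit, Rect(1, 0, 2, 1)), None)

    def test_shared_corner(self):       # a literal corner case
        self.assertEqual(overlap(self.unit, Rect(1, 1, 2, 2)), None)

    def test_disjoint(self):
        self.assertEqual(overlap(self.unit, Rect(2, 2, 3, 3)), None)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestOverlap)
result = unittest.TestResult()
suite.run(result)
print(result.wasSuccessful())
```

Whether touching rectangles should count as overlapping is a question for the spec. Writing the two boundary tests forces that decision to be made explicitly.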
What Tests To Write First
- Tests you expect to succeed
- Boundary cases (e.g., sort the empty list, or a list of one value)
- Simplest interesting case (e.g., sort a list of two values)
- General case (e.g., sort a list of nine values)
- If duplicate values are allowed, make sure you test with them
- Tests you expect to fail
- Invalid input (e.g., passed a dictionary instead of a list)
- Remember, error handling is part of the interface too
- Sanity tests
- Make sure data structures remain consistent
- If there is redundant information, check it against itself
Summary
- A good framework does more than just cut down on typing
- Guides you toward solutions that other developers have already discovered
- The better you are at testing (and using testing frameworks), the more productive you will be
Exercises
Exercise 16.1:
Python has another unit testing module called doctest.
It searches files for sections of text that look like interactive
Python sessions, then re-executes those sections and checks the
results. A typical use is shown below.
def ave(values):
    '''Calculate an average value, or 0.0 if 'values' is empty.
    >>> ave([])
    0.0
    >>> ave([3])
    3.0
    >>> ave([15, -1.0])
    7.0
    '''
    sum = 0.0
    for v in values:
        sum += v
    return sum / float(max(1, len(values)))
if __name__ == '__main__':
    import doctest
    doctest.testmod()
Convert a handful of the tests you have written for other
questions in this lecture to use doctest. Do you prefer it
to unittest? Why or why not? Do you think doctest
makes it easier to test small problems? Large ones? Would it be
possible to write something similar for C, Java, Fortran, or
Mathematica?
Regular Expressions
Introduction
- How to count the blank lines in a file?
- Most people consider a line with just spaces and tabs to be blank
- But examining characters one by one is tedious
- More complex patterns (like telephone numbers or email addresses) are hard to describe in code
- Use regular expressions (REs) instead
- Represent patterns as strings
- Just like the "*" in the shell's *.txt
- Warning: the notation is ugly
- Have to use what's on the keyboard, instead of inventing new symbols the way mathematicians do
You Can Skip This Lecture If...
- You know what a regular expression is
- You understand the difference between «*» and «+»
- You know how and why to compile an RE
- You know how to find out which part of a string matched which part of an RE
- You know how to get all of an RE's matches with one method call
A Simple Example
- The simplest kind of RE matches a fixed string of characters
- Similar to the in operator
import re
dragons = [
    ['CTAGGTGTACTGATG', 'Antipodean Opaleye'],
    ['AAGATGCGTCCGTAT', 'Common Welsh Green'],
    ['AGTCGTGCTCGTTATATC', 'Hebridean Black'],
    ['ATGCGTCGTCGATTATCT', 'Hungarian Horntail'],
    ['CCGTTAGGGCTAAATGCT', 'Norwegian Ridgeback']
]
for (dna, name) in dragons:
    if re.search('ATGCGT', dna):
        print name
Common Welsh Green
Hungarian Horntail
This or That
- Modify the regular expression a little
import re
dragons = [
    ['CTAGGTGTACTGATG', 'Antipodean Opaleye'],
    ['AAGATGCGTCCGTAT', 'Common Welsh Green'],
    ['AGTCGTGCTCGTTATATC', 'Hebridean Black'],
    ['ATGCGTCGTCGATTATCT', 'Hungarian Horntail'],
    ['CCGTTAGGGCTAAATGCT', 'Norwegian Ridgeback']
]
for (dna, name) in dragons:
    if re.search('ATGCGT|GCT', dna):
        print name
Common Welsh Green
Hebridean Black
Hungarian Horntail
Norwegian Ridgeback
- The vertical bar «|» means “or”
- So this RE matches any string containing either "ATGCGT" or "GCT"
Precedence
- What about matching either "ATA" or "ATC" (both of which code for isoleucine)?
- «ATA|C» will not work: it matches either "ATA" or "C"
- «ATA|ATC» will work, but it's a bit redundant
- Solution: use parentheses, just as in math
import re
tests = [
    ['ATA', True],
    ['xATCx', True],
    ['ATG', False],
    ['AT', False],
    ['ATAC', True]
]
for (dna, expected) in tests:
    actual = re.search('AT(A|C)', dna) is not None
    assert actual == expected
- Note that there's no output: the asserts will crash the program if any of the tests fail
Escaping Special Characters
- How to match an actual "|", "(", or ")"?
- Solution is to use «\|», «\(», or «\)» in the RE
- And of course «\\» to match a backslash
- But in order to put a backslash in a Python string, you have to escape it
- So the written form of the RE is "\\|", "\\(", "\\)", or "\\\\"
- What you type in is being compiled twice:
- Once by Python to create a string
- Once by the regular expression library to create the RE
![[Double Compilation of Regular Expressions]](./img/re/double_compilation.png)
Figure 17.1: Double Compilation of Regular Expressions
Raw Strings
- To help keep things readable, Python supports raw strings
- Written as r'abc' or r"this\nand\nthat"
- Inside a raw string, a backslash is just a backslash
- So r'\n' is a string containing the two characters "\" and "n", not a newline
- Raw strings are not automatically converted into REs
- But that is their most common use
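A quick check makes the difference concrete. The variable names below are arbitrary, and the RE is just the "match a parenthesis" pattern from the escaping discussion, written both ways.

```python
import re

plain = '\n'        # one character: a newline
raw = r'\n'         # two characters: a backslash and an "n"
lengths = (len(plain), len(raw))

# The same RE written twice: with doubled backslashes, and as a raw string.
escaped = '\\(|\\)'
raw_re = r'\(|\)'
same_pattern = (escaped == raw_re)

# Either form matches a literal parenthesis.
found = re.search(raw_re, 'f(x)').group()
```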
Sequences
- In the shell, "*" matches zero or more characters
- In an RE, «*» is an operator that means, “match zero or more occurrences of a pattern”
- Comes after the pattern, not before
- Example: match any strand of DNA in which "TTA" and "CTA" are separated by any number of "G"
import re
tests = [
    ['TTACTA', True], # separated by zero G's
    ['TTAGCTA', True], # separated by one G
    ['TTAGGGCTA', True], # separated by three G's
    ['TTAXCTA', False], # an X in the way
    ['TTAGCGCTA', False], # an embedded C in the way
]
for (dna, expected) in tests:
    actual = re.search('TTAG*CTA', dna) is not None
    assert actual == expected
- Note that the RE matches "TTACTA" because «G*» can match zero occurrences of "G"
![[Zero or More]](./img/re/star_match.png)
Figure 17.2: Zero or More
- «+» matches one or more (i.e., won't match the empty string)
Making Something Optional
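The body of this slide is missing from these notes. Going by the «?» operator mentioned under Character Sets, the usual way to make something optional is to follow it with «?», which matches zero or one occurrence of the preceding pattern. A small sketch, with DNA strings made up for illustration:

```python
import re

tests = [
    ['TTACTA', True],       # zero G's: the G is optional
    ['TTAGCTA', True],      # exactly one G
    ['TTAGGCTA', False],    # two G's is one too many
]

results = []
for (dna, expected) in tests:
    actual = re.search('TTAG?CTA', dna) is not None
    assert actual == expected
    results.append(actual)
```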
Character Sets
- Use «[]» to match sets of characters
- The expression «[abcd]» matches exactly one "a", "b", "c", or "d"
- Can be abbreviated as «[a-d]»
- Often combined with «*», «+», or «?»
- «[aeiou]+» matches any non-empty sequence of vowels
- Example: find lines containing numbers
import re
lines = [
    "Charles Darwin (1809-82)",
    "Darwin's principal works, The Origin of Species (1859)",
    "and The Descent of Man (1871) marked a new epoch in our",
    "understanding of our world and ourselves. His ideas",
    "were shaped by the Beagle's voyage around the world in",
    "1831-36."
]
for line in lines:
    if re.search('[0-9]+', line):
        print line
Charles Darwin (1809-82)
Darwin's principal works, The Origin of Species (1859)
and The Descent of Man (1871) marked a new epoch in our
1831-36.
- Try writing this without using regular expressions…
Abbreviations
- Some character sets occur so often that they have abbreviations
| Sequence | Equivalent | Explanation |
|---|---|---|
| «\d» | «[0-9]» | Digits |
| «\s» | «[ \t\r\n\f\v]» | Whitespace |
| «\w» | «[a-zA-Z0-9_]» | Word characters (i.e., those allowed in variable names) |
Table 17.1: Regular Expression Escapes in Python
Special Cases
- «[^abc]» means “anything except the characters in this set”
- «.» means “any character except the end of line”
- «\b» matches the break between word and non-word characters
- Doesn't consume any actual characters
![[Word/Non-Word Breaks]](./img/re/word_nonword_break.png)
Figure 17.5: Word/Non-Word Breaks
- Example: find words that end in a vowel
- Use string.split to break on spaces and newlines before applying RE
import re
words = '''Born in New York City in 1918, Richard Feynman earned a
bachelor's degree at MIT in 1939, and a doctorate from Princeton in
1942. After working on the Manhattan Project in Los Alamos during
World War II, he became a professor at CalTech in 1951. Feynman won
the 1965 Nobel Prize in Physics for his work on quantum
electrodynamics, and served on the commission investigating the
Challenger disaster in 1986.'''.split()
end_in_vowel = set()
for w in words:
    if re.search(r'[aeiou]\b', w):
        end_in_vowel.add(w)
for w in end_in_vowel:
    print w
a
Prize
degree
became
doctorate
the
he
Anchoring
- How to find blank lines?
- re.search(r'\s*', line) will match "start end", since «\s*» can match zero characters anywhere
- Use anchors
- «^» matches the beginning of the string
- «$» matches the end
- Neither consumes any characters
![[Anchoring Matches]](./img/re/match_anchor.png)
Figure 17.6: Anchoring Matches
- Examples:
| Pattern | Text | Result |
|---|---|---|
| «b+» | "abbc" | Matches |
| «^b+» | "abbc" | Fails (string doesn't start with b) |
| «c$» | "abbc" | Matches (string ends with c) |
| «^a*$» | "aabaa" | Fails (something other than "a" between start and end of string) |
Table 17.2: Regular Expression Anchors in Python
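Putting the anchors to work answers the question that opened this lecture: how to count blank lines. The function name and the sample lines below are made up for illustration.

```python
import re

def count_blank(lines):
    '''Count lines that contain nothing but whitespace.'''
    blank = 0
    for line in lines:
        # Anchored at both ends: the whole line must be whitespace.
        if re.search(r'^\s*$', line):
            blank += 1
    return blank

sample = ['start', '', '   ', '\t \n', 'start end', 'end']
number_blank = count_blank(sample)

# Without anchors, the pattern matches every line, because \s* is
# satisfied by zero characters.
unanchored = sum(1 for line in sample if re.search(r'\s*', line))
```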
Extracting Matches
- Problem: want to find comments in a data file
- A comment starts with a "#", and extends to the end of the line
- First try: If the RE matches, split on the "#"
import re
lines = '''Date: 2006-03-07
On duty: HP # 01:30 - 03:00
Observed: Common Welsh Green
On duty: RW #03:00-04:30
Observed: none
On duty: HG # 04:30-06:00
Observed: Hebridean Black
'''.split('\n')
for line in lines:
    if re.search('#', line):
        comment = line.split('#')[1]
        print comment
01:30 - 03:00
03:00-04:30
04:30-06:00
- Output is inconsistent
split followed by strip seems clumsy
- Send comments
Match Objects
- Result of re.search is actually a match object that records what matched, and where
- mo.group() returns the whole string that matched the RE
- mo.start() and mo.end() are the indices of the match's location
import re
text = 'abbcb'
for pattern in ['b+', 'bc*', 'b+c+']:
    match = re.search(pattern, text)
    print '%s / %s => "%s" (%d, %d)' % \
        (pattern, text, match.group(), match.start(), match.end())
b+ / abbcb => "bb" (1, 3)
bc* / abbcb => "b" (1, 2)
b+c+ / abbcb => "bbc" (1, 4)
- Send comments
Match Groups
- Every parenthesized subexpression in the RE is a group
- Group 0 is the entire match
- Text that matched Nth parentheses (counting from left) is group N
- mo.group(3) is the text that matched the third subexpression, mo.start(3) is where it started
- Extracting comments is now easy:
import sys, re
lines = '''Date: 2006-03-07
On duty: HP # 01:30 - 03:00
Observed: Common Welsh Green
On duty: RW #03:00-04:30
Observed: none
On duty: HG # 04:30-06:00
Observed: Hebridean Black
'''.split('\n')
for line in lines:
    match = re.search(r'#\s*(.+)', line)
    if match:
        comment = match.group(1)
        print comment
01:30 - 03:00
03:00-04:30
04:30-06:00
- Send comments
Reversing Columns
- REs are the power tools of text processing
- Can do things in one line that would otherwise take many lines of code
- Example: reverse two-column data
import re
def reverse_columns(line):
    match = re.search(r'^\s*(\d+)\s+(\d+)\s*$', line)
    if not match:
        return line
    return match.group(2) + ' ' + match.group(1)
tests = [
    ['10 20', 'easy case'],
    [' 30 40 ', 'padding'],
    ['60 70 80', 'too many columns'],
    ['90 end', 'non-numeric']
]
for (fixture, title) in tests:
    actual = reverse_columns(fixture)
    print '%s: "%s" => "%s"' % (title, fixture, actual)
easy case: "10 20" => "20 10"
padding: " 30 40 " => "40 30"
too many columns: "60 70 80" => "60 70 80"
non-numeric: "90 end" => "90 end"
- Send comments
Compiling
- The RE library compiles patterns into a more concise form for matching
- Each regular expression becomes a finite state machine
- Library follows the arcs in the FSM as it reads characters
- Drawing FSMs is a good way to debug REs
![[Regular Expressions as Finite State Machines]](./img/re/re_fsm.png)
Figure 17.7: Regular Expressions as Finite State Machines
- You can improve a program's performance by compiling the RE once, and re-using the compiled form
- Use re.compile(pattern) to get the compiled RE
- Its methods have the same names and behavior as the functions in the re module
- E.g., matcher.search(text) searches text for matches to the RE that was compiled to create matcher
- Send comments
Finding Title Case Words
- Example: find all Title Case words in a document
import re
# Put pattern outside 'find_all' so that it's only compiled once.
pattern = re.compile(r'\b([A-Z][a-z]*)\b(.*)')
def find_all(line):
    result = []
    match = pattern.search(line)
    while match:
        result.append(match.group(1))
        match = pattern.search(match.group(2))
    return result
lines = [
    'This has several Title Case words',
    'on Each Line (Some in parentheses).'
]
for line in lines:
    print line
    for word in find_all(line):
        print '\t', word
This has several Title Case words
This
Title
Case
on Each Line (Some in parentheses).
Each
Line
Some
- Send comments
Finding All Matches
- Notice how the function gets all matches:
- Pattern captures what we want in group 1, and everything else on the line in group 2
- Each time there's a match, continue the search in the remainder captured in group 2
- Much easier to use the findall method
import re
lines = [
    'This has several Title Case words',
    'on Each Line (Some in parentheses).'
]
pattern = re.compile(r'\b([A-Z][a-z]*)\b')
for line in lines:
    print line
    for word in pattern.findall(line):
        print '\t', word
This has several Title Case words
This
Title
Case
on Each Line (Some in parentheses).
Each
Line
Some
- Send comments
Reference Material
| Pattern | Matches | Doesn't Match | Explanation |
|---|---|---|---|
| «a*» | "", "a", "aa", … | "A", "b" | «*» means “zero or more”; matching is case sensitive |
| «b+» | "b", "bb", … | "" | «+» means “one or more” |
| «ab?c» | "ac", "abc" | "a", "abbc" | «?» means “optional” (zero or one) |
| «[abc]» | "a", "b", or "c" | "ab", "d" | «[…]» means “one character from a set” |
| «[a-c]» | "a", "b", or "c" | | Character ranges can be abbreviated |
| «[abc]*» | "", "ac", "baabcab", … | | Operators can be combined: zero or more choices from "a", "b", or "c" |
Table 17.3: Regular Expression Operators
| Method | Purpose | Example | Result |
|---|---|---|---|
| split | Split a string on a pattern. | re.split('\\s*,\\s*', 'a, b ,c , d') | ['a', 'b', 'c', 'd'] |
| findall | Find all matches for a pattern. | re.findall('\\b[A-Z][a-z]*', 'Some words in Title Case.') | ['Some', 'Title', 'Case'] |
| sub | Replace matches with new text. | re.sub('\\d+', 'NUM', 'If 123 is 456') | "If NUM is NUM" |
Table 17.4: Regular Expression Object Methods
- Send comments
But Wait, There's More
- We've only scratched the surface
- Regular expressions have proved to be too useful to remain clean and elegant
- For example, use «pat{N}» to match exactly N occurrences of a pattern
- More generally, «pat{M,N}» matches between M and N occurrences
- Most important thing is to build up complex REs one step at a time
- Write something that matches part of what you're looking for
- Test it
- Add to it
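The write, test, extend cycle above can be sketched with counted repetition. This assumes Python 3, and the date format is just an illustrative target, not something the lecture prescribes:

```python
import re

# Step 1: match a four-digit year, and test it.
year = r'\d{4}'
step1 = re.search(year, 'Date: 2006-03-07').group()

# Step 2: add a two-digit month and day, and test again.
date = year + r'-\d{2}-\d{2}'
step2 = re.search(date, 'Date: 2006-03-07').group()

# Step 3: anchor the pattern so that nothing else may surround the date.
whole_date = '^' + date + '$'
step3 = [bool(re.search(whole_date, s))
         for s in ['2006-03-07', '2006-3-7', 'Date: 2006-03-07']]

# {M,N}: runs of two or three "a"s (a run of four yields one match of three).
runs = re.findall(r'a{2,3}', 'a aa aaa aaaa')

print(step1, step2, step3, runs)
```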
- Send comments
Summary
- Regular expressions are available in almost every language
- As a library: C/C++, Java, …
- Built into the language: Perl, Ruby, …
- Syntax varies slightly, but the ideas are the same
- For a broader tutorial, see [Wilson 2005]
- Send comments
Exercises
Exercise 17.1:
By default, regular expression matches are
greedy: the first term in the RE
matches as much as it can, then the second part, and so on. As a
result, if you apply the RE «X(.*)X(.*)» to the string
"XaX and XbX", the first group will contain "aX and Xb",
and the second group will be empty.
It's also possible to make REs match
reluctantly, i.e., to have the
parts match as little as possible, rather than as much. Find out
how to do this, and then modify the RE in the previous paragraph
so that the first group winds up containing "a", and the
second group " and XbX".
Exercise 17.2:
What is the easiest way to write a case-insensitive regular expression?
(Hint: read the documentation on compilation options.)
Exercise 17.3:
What does the VERBOSE option do when compiling a regular
expression? Use it to rewrite some of the REs in this lecture in
a more readable way.
Exercise 17.4:
What does the DOTALL option do when compiling a regular
expression? Use it to get rid of the call to
string.split in the example that finds words ending in
vowels.
Send comments
Binary Data
Introduction
- All data is stored as 1's and 0's
- But those 1's and 0's may represent:
- Characters that can be displayed as text
- Something else
- That “something else” is (misleadingly) called binary data
- Usually means “anything you can't manipulate with a standard text editor”
- This lecture describes how binary values are stored and manipulated
- Please, don't write code to manipulate binary formats unless you absolutely have to
- Good libraries exist for working with every image, sound, and video format out there
- Send comments
You Can Skip This Lecture If...
- You know what two's complement is
- You know what bit shifting is
- You know that roundoff errors are not random
- You know how to pack and unpack binary values
- Send comments
Why Binary?
- Size: "10239472" is 8 bytes long, but the 32-bit integer it represents is 4 bytes
- Speed: takes dozens of operations to add the integer represented by "34" to the one represented by "57"
- Hardware interfaces: someone has to convert the electrical signal from the gas chromatograph to a readable number
- Lack of anything better
- It's possible to represent images as text (ASCII art, PostScript)
- But sound? Or movies?
- Send comments
How Numbers Are Stored
- Positive numbers stored in base-2 format
- 1001₂ is (1×2³)+(0×2²)+(0×2¹)+(1×2⁰) = 9
- Could use sign-and-value for negative numbers
- First bit is 0 for positive, 1 for negative
- 0011₂ is 3₁₀, and 1011₂ is -3₁₀
- Problem: there are two zeroes (0000 and 1000)
- Send comments
Two's Complement
- Almost all computers use two's complement instead
- “Roll over” when going below zero, like a car's odometer
- 1111₂ is -1₁₀, 1110₂ is -2₁₀, etc.
- 1000₂ is the most negative 4-bit integer, 0111₂ the most positive
![[Two's Complement]](./img/binary/twos_complement.png)
Figure 18.1: Two's Complement
- Asymmetric: there is one more negative number than positive
- Since there has to be room for 0 in the middle
- Can still tell whether a number is positive or negative by looking at the first bit
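A small sketch of the odometer idea, assuming Python 3. The helper names to_twos_complement and from_twos_complement are invented here for illustration:

```python
def to_twos_complement(value, bits=4):
    '''Return the bit pattern (as a string) that stores value in the given width.'''
    # Masking with 2**bits - 1 makes negative values "roll over" like an odometer.
    return format(value & ((1 << bits) - 1), '0%db' % bits)

def from_twos_complement(pattern):
    '''Interpret a string of 0s and 1s as a signed two's complement integer.'''
    value = int(pattern, 2)
    if pattern[0] == '1':              # first bit set means the number is negative
        value -= 1 << len(pattern)
    return value

for n in (-8, -2, -1, 0, 7):
    print(n, to_twos_complement(n))
```

With 4 bits the representable range is -8 to 7, which is the asymmetry noted above.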
- Send comments
Bitwise Operators
- Like most languages, Python has four operators that work on bits
| Name | Symbol | Purpose | Example |
|---|---|---|---|
| And | & | 1 if both bits are 1, 0 otherwise | 0110 & 1010 = 0010 |
| Or | \| | 1 if either bit is 1 | 0110 \| 1010 = 1110 |
| Xor | ^ | 1 if the bits are different, 0 if they're the same | 0110 ^ 1010 = 1100 |
| Not | ~ | Flip each bit | ~0110 = 1001 |
Table 18.1: Bitwise Operators in Python
- The name “xor” is short for “exclusive or”, i.e., either/or
- Use these to write a function that displays the bits in an integer
def format_bits(val, width=1):
    '''Create a base-2 representation of an integer.'''
    result = ''
    while val:
        if val & 0x01:
            result = '1' + result
        else:
            result = '0' + result
        val = val >> 1
    if len(result) < width:
        result = '0' * (width - len(result)) + result
    return result
tests = [
    [ 0, None, '0'],
    [ 0, 4, '0000'],
    [ 5, None, '101'],
    [19, 8, '00010011']
]
for (num, width, expected) in tests:
    if width is None:
        actual = format_bits(num)
    else:
        actual = format_bits(num, width)
    print '%4d %8s %10s %10s' % (num, width, expected, actual)
0 None 0 0
0 4 0000 0000
5 None 101 101
19 8 00010011 00010011
- Send comments
Shifting
- Shifting an integer's bits left N places written as x << N
- Each leftward shift corresponds to multiplying by 2
- Just as shifting a decimal number left corresponds to multiplying by 10
- Example: 3<<2 is 0011₂<<2, or 1100₂, which is 12
- Shifting a number right corresponds to division by 2 (throwing away the remainder)
- 7₁₀>>1 is 0111₂>>1, or 0011₂, which is 3₁₀
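A quick check of the two examples above, assuming Python 3 (where // is integer division) and restricted to non-negative values:

```python
shifted_left = 3 << 2       # 0011 becomes 1100
shifted_right = 7 >> 1      # 0111 becomes 0011

# Shifting left by n multiplies by 2**n; shifting right divides by 2**n
# and throws away the remainder.
same_as_multiply = all((x << n) == x * 2**n for x in range(16) for n in range(5))
same_as_divide = all((x >> n) == x // 2**n for x in range(16) for n in range(5))

print(shifted_left, shifted_right)
```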
- Send comments
Cautions
- Shifting is not more efficient than multiplication and division on modern computers
- What happens if the top bit changes value as a result of a shift?
- 6₁₀<<1 = 0110₂<<1 = 1100₂
- On a 4-bit machine, this is -4₁₀, not 12₁₀
- Some machines preserve the sign bit when shifting down
- So 1100₂>>1 = 1110₂, instead of 0110₂
- Depends on the hardware being used
- Java provides a separate operator for this
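Python's own integers grow as needed, so seeing these effects requires simulating a fixed width. The sketch below assumes Python 3 and a hypothetical 4-bit machine; BITS, MASK, and as_signed are names invented for the example:

```python
BITS = 4
MASK = (1 << BITS) - 1          # 1111 in binary

def as_signed(pattern):
    '''Interpret the low BITS bits of pattern as a two's complement value.'''
    pattern &= MASK
    if pattern & (1 << (BITS - 1)):     # top bit set: value is negative
        return pattern - (1 << BITS)
    return pattern

unbounded = 6 << 1                  # Python integers simply get bigger
wrapped = as_signed(6 << 1)         # in 4 bits, the top bit has changed value
sign_preserving = -4 >> 1           # Python keeps the sign when shifting down
raw_pattern = (-4 & MASK) >> 1      # shifting the raw pattern 1100 gives 0110

print(unbounded, wrapped, sign_preserving, raw_pattern)
```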
- Send comments
Setting and Clearing Bits
- Can use bitwise and, or, and not to set specific bits to 1 or 0
- Do the same things to bits that logical operations do to Booleans
- To set the ith bit of x to 1:
- Create a value mask in which bit i is 1 and all others are 0
- Use x = x | mask
- To set the ith bit of x to 0:
- Create a value mask in which bit i is 1 and all others are 0
- Negate it using ~, so that the ith bit is 0, and all the others are 1
- Use x = x & mask
![[Setting and Clearing Bits]](./img/binary/setting_clearing_bits.png)
Figure 18.2: Setting and Clearing Bits
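The two recipes can be sketched as follows. This assumes Python 3 and numbers bits from 0 at the least significant end (an assumption of this example). set_bit and clear_bit are illustrative names:

```python
def set_bit(x, i):
    '''Return x with bit i set to 1.'''
    mask = 1 << i           # bit i is 1, all others are 0
    return x | mask

def clear_bit(x, i):
    '''Return x with bit i set to 0.'''
    mask = ~(1 << i)        # bit i is 0, all others are 1
    return x & mask

x = 0b0100
x = set_bit(x, 0)           # 0101
x = clear_bit(x, 2)         # 0001
print(format(x, '04b'))
```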
- Send comments
Bit Flags
- Can use bitwise operators to store several Boolean flags in a single integer
- Slower than storing each in a separate variable
- But uses much less space
- Example: need to record whether a sample contains any mercury, phosphorus, or chlorine
- Define constants to test for particular elements
- Use bit 1 for mercury, bit 2 for phosphorus, bit 3 for chlorine
![[Using Bits to Record Sets of Flags]](./img/binary/bit_flags.png)
Figure 18.3: Using Bits to Record Sets of Flags
# hex binary
MERCURY = 0x01 # 0001
PHOSPHORUS = 0x02 # 0010
CHLORINE = 0x04 # 0100
# Sample contains mercury and chlorine
sample = MERCURY | CHLORINE
print 'sample: %04x' % sample
# Check for various elements
for (flag, name) in [[MERCURY, "mercury"],
                     [PHOSPHORUS, "phosphorus"],
                     [CHLORINE, "chlorine"]]:
    if sample & flag:
        print 'sample contains', name
    else:
        print 'sample does not contain', name
sample: 0005
sample contains mercury
sample does not contain phosphorus
sample contains chlorine
- Send comments
Floating Point
- Floating point numbers are (much) more complicated
- A 32-bit float has:
- One bit for the sign
- 23 bits for the mantissa (or value)
- 8 bits for the exponent
![[Floating Point Representation]](./img/binary/float_rep.png)
Figure 18.4: Floating Point Representation
- Floating point numbers are not real numbers
- Fixed number of bits per value means that only a limited set of values can be represented
- If the actual value isn't in that set, you must settle for the closest available approximation
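One way to see the "closest available approximation" is to push a value through a 32-bit float and back. The sketch assumes Python 3 and uses the struct module described later in this lecture. as_float32 is a name made up for the example:

```python
import struct

def as_float32(x):
    '''Round x to the nearest value a 32-bit float can hold.'''
    return struct.unpack('<f', struct.pack('<f', x))[0]

exact = as_float32(0.5)     # a power of two survives unchanged
approx = as_float32(0.1)    # 0.1 is not in the set, so a neighbour comes back

print(repr(exact), repr(approx))
```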
- Send comments
Floating Point Spacing
- Consequence #1: values are unevenly spaced
- Less absolute precision for numbers with larger magnitudes
- Example: 1 sign, 3 mantissa, 2 exponent bits for each number
![[Uneven Spacing of Floating-Point Numbers]](./img/binary/uneven_spacing.png)
Figure 18.5: Uneven Spacing of Floating-Point Numbers
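The uneven spacing can be measured directly for 32-bit floats. This is a sketch assuming Python 3 and positive, finite inputs. next_float32 is a hypothetical helper that adds 1 to the raw bit pattern to step to the next representable value:

```python
import struct

def next_float32(x):
    '''Return the smallest 32-bit float greater than x (x positive and finite).'''
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    return struct.unpack('<f', struct.pack('<I', bits + 1))[0]

gap_near_1 = next_float32(1.0) - 1.0
gap_near_1024 = next_float32(1024.0) - 1024.0

print(gap_near_1, gap_near_1024)
```

The gap near 1024 is 1024 times larger than the gap near 1: the relative precision is the same, but the absolute precision is not.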
- Send comments
Floating Point Roundoff
- Consequence #2: roundoff errors
- 6-bit system can represent 6, and ¼, but not 5¾
- So 6 - 0.25 is 6, not 5.75
- And if 6 - 0.25 - 0.25 - 0.25 - 0.25 is evaluated left to right, the answer is still 6
- This is not random
- Happens exactly the same way every time
- But it is very hard to reason about
- Which is why people get Ph.D.'s in numerical analysis
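The same effect shows up with Python's 64-bit floats, just at larger magnitudes. A sketch assuming Python 3; the numbers are chosen for illustration, since near 1e16 adjacent values are 2 apart:

```python
big = 1e16

# Each subtraction of 0.5 rounds straight back to 1e16, so nothing changes.
left_to_right = (((big - 0.5) - 0.5) - 0.5) - 0.5

# Adding the small values together first gives 2.0, which is big enough to register.
grouped = big - (0.5 + 0.5 + 0.5 + 0.5)

print(left_to_right == big, grouped == big)
```

Running it again gives exactly the same answers: the error depends on the order of operations, not on chance.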
- Send comments
Binary I/O
- I/O routines seen so far are line-based
- Can also use byte-oriented routines
- f.read(N) reads (up to) next N bytes
- Result is returned as a string, but there's no guarantee its contents are characters
- If the file f is empty (or already at its end), returns an empty string
- f.write(str) writes the bytes in the string str
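A minimal sketch of byte-oriented reading, assuming Python 3, where binary reads return bytes objects rather than strings. An in-memory io.BytesIO stands in for a real file, and the data is the 8-byte signature that begins every PNG image:

```python
import io

f = io.BytesIO(b'\x89PNG\r\n\x1a\n')

first = f.read(4)       # exactly four bytes are available
rest = f.read(100)      # asks for 100, gets the four that are left
at_end = f.read(100)    # nothing left: an empty result, not an error

print(repr(first), repr(rest), repr(at_end))
```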
- Send comments
Binary I/O Mode
- Caution: must open files in binary mode on Windows
- input = open(filename, 'rb') (and similarly for output)
- Otherwise, the low-level routines Python relies on convert Windows line endings "\r\n" to Unix-style "\n"…
- …which is an unkind thing to do to an image
- Example: open a file using "r", then in "rb"
- Identical on Unix, but different on Windows
import sys
print sys.platform
for mode in ('r', 'rb'):
    f = open('open_binary.py', mode)
    s = f.read(40)
    f.close()
    print repr(s)
cygwin
'import sys\r\nprint sys.platform\r\nfor mode'
linux
'import sys\nprint sys.platform\nfor mode in '
- Send comments
Packing and Unpacking
- In C and Fortran, an integer is a raw 32-bit value
fwrite(&array, sizeof(int), 3, file) will write 3 4-byte integers to a file
- Python, Java, and other languages usually don't use raw values
- There's no guarantee that things like lists are stored contiguously in memory…
- …so programs need to pack data into contiguous bytes for writing…
- …and unpack those bytes to recreate the structures when reading
![[C Storage vs. Python Storage]](./img/binary/c_vs_python_storage.png)
Figure 18.6: C Storage vs. Python Storage
- Send comments
Packing Data
- Packing looks a lot like formatting a string
- A format specifies the data types being packed (including sizes, where appropriate)
- This format exactly determines how much memory is required by the packed representation
- The result of packing is a chunk of bytes
- Stored as a string in Python
- But as mentioned above, it's not a string of characters
![[Packing Data]](./img/binary/pack_data.png)
Figure 18.7: Packing Data
- Send comments
Unpacking Data
- Unpacking reverses this process
- Read bytes from a “string” according to a format
- Use the data in these bytes to create Python data structures
- Return the result as a tuple of values
- Send comments
The struct Module
- Use Python's struct module to pack and unpack
- pack(fmt, v1, v2, …) packs the values v1, v2, etc. according to fmt, returning a string
- unpack(fmt, str) unpacks the values in str according to fmt, returning a tuple
import struct
fmt = 'hh' # two 16-bit integers
x = 31
y = 65
binary = struct.pack(fmt, x, y)
print 'binary representation:', repr(binary)
normal = struct.unpack(fmt, binary)
print 'back to normal:', normal
binary representation: '\x1f\x00A\x00'
back to normal: (31, 65)
- Send comments
Hexadecimal Characters
- What's '\x1f\x00A\x00'?
- If Python finds a character in a string that doesn't have a printable representation, it prints a 2-digit hexadecimal (base-16) escape sequence
- Uses the letters A-F (or a-f) to represent the digits from 10 to 15
- So this string represents the four bytes ['\x1f', '\x00', 'A', '\x00']
- 1f₁₆ is (1×16 + 15), or 31
- ASCII code for the letter "A" is 65₁₀
- Send comments
Format Specifiers
| Format | Meaning |
|---|---|
| "c" | Single character (i.e., string of length 1) |
| "B" | Unsigned 8-bit integer |
| "h" | Short (16-bit) integer |
| "i" | 32-bit integer |
| "f" | 32-bit float |
| "d" | Double-precision (64-bit) float |
| "s" | String of fixed size (see below) |
Table 18.2: Packing Format Specifiers
- Any format can be preceded by a count
- E.g., "4i" is four integers
- How much data is packed is specified by the format
- Can pack the lowest 8 or 16 bits of an integer using "B" or "h" instead of the full 32
- Send comments
Calculating Sizes
- Must always specify the size of strings
- E.g., "4s" for a 4-character string
- Otherwise, how would unpack know how much data to use?
- calcsize(fmt) calculates how large (in bytes) the data produced using fmt will be
- Data sizes can vary from platform to platform
- And the computer is better at doing arithmetic than you are
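A short sketch of letting calcsize do the arithmetic, assuming Python 3. The "<" prefix, which asks struct for standard sizes with no padding, is used here so that the numbers are the same on every platform; the record layout is invented for the example:

```python
import struct

size_short = struct.calcsize('<h')
size_int = struct.calcsize('<i')

# An integer, a 4-byte string, and a double: 4 + 4 + 8 bytes.
record_format = '<i4sd'
size_record = struct.calcsize(record_format)
packed = struct.pack(record_format, 7, b'abcd', 2.5)

# Without the prefix, sizes and padding follow the local C compiler's rules.
native_size = struct.calcsize('i4sd')

print(size_short, size_int, size_record, native_size)
```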
- Send comments
Endianness
- Note that the least significant byte of the integer comes first
- This is called little-endian, and is used by all Intel processors
- Other chips put the most significant byte first (big-endian)
- If you move data from one architecture to another, it's your responsibility to flip the bytes…
- …because the machine doesn't know what the bytes mean
import struct
packed = struct.pack('4c', 'a', 'b', 'c', 'd')
print 'packed string:', repr(packed)
left16, right16 = struct.unpack('hh', packed)
print 'as two 16-bit integers:', left16, right16
all32 = struct.unpack('i', packed)
print 'as a single 32-bit integer', all32[0]
float32 = struct.unpack('f', packed)
print 'as a 32-bit float', float32[0]
packed string: 'abcd'
as two 16-bit integers: 25185 25699
as a single 32-bit integer 1684234849
as a 32-bit float 1.67779994081e+22
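The struct module can also do the byte flipping: a format that starts with "<" is packed little-endian, and one that starts with ">" is packed big-endian. A sketch, assuming Python 3:

```python
import struct

little = struct.pack('<i', 1)       # least significant byte first
big = struct.pack('>i', 1)          # most significant byte first

# Reading little-endian bytes as if they were big-endian gives the wrong number.
misread = struct.unpack('>i', little)[0]

print(repr(little), repr(big), misread)
```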
- Send comments
Packing Variable-Length Data
- How to store a variable-length vector of integers?
- Store the number of elements in a fixed-size header
- Then store that many integers one by one
![[Packing a Variable-Length Vector]](./img/binary/pack_vec.png)
Figure 18.8: Packing a Variable-Length Vector
- Packing is easy:
def pack_vec(vec):
    buf = struct.pack('i', len(vec))
    for v in vec:
        buf += struct.pack('i', v)
    return buf
- Send comments
Unpacking Variable-Length Data
- Unpacking is a little harder
- Have to step up to the right location in the packed string on each pass through the unpacking loop
def unpack_vec(buf):
    # Get the count of the number of elements in the vector.
    int_size = struct.calcsize('i')
    count = struct.unpack('i', buf[0:int_size])[0]
    # Get 'count' values, one by one.
    pos = int_size
    result = []
    for i in range(count):
        v = struct.unpack('i', buf[pos:pos+int_size])
        result.append(v[0])
        pos += int_size
    return result
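Since any format can be preceded by a count, the loops can be replaced by a format built on the fly. This is an alternative sketch rather than the lecture's own code. It assumes Python 3, and pack_vec_once and unpack_vec_once are names invented here:

```python
import struct

def pack_vec_once(vec):
    '''Pack the length and all the values with a single call.'''
    return struct.pack('i%di' % len(vec), len(vec), *vec)

def unpack_vec_once(buf):
    '''Read the count, then unpack that many integers in one call.'''
    int_size = struct.calcsize('i')
    count = struct.unpack('i', buf[:int_size])[0]
    body = buf[int_size:int_size + count * int_size]
    return list(struct.unpack('%di' % count, body))

for original in ([], [-5], [1, 2, 3]):
    assert unpack_vec_once(pack_vec_once(original)) == original
```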
- Send comments
Dynamic Formats
- Problem: what if you want to pack strings, but don't know their length in advance?
- Solution: create the format string on the fly, and save the string's length as well as its characters
def pack_strings(strings):
    result = ''
    for s in strings:
        length = len(s)
        format = 'i%ds' % length
        result += struct.pack(format, length, s)
    return result
- Send comments
Unpacking Dynamic Formats
- Unpacking is the same as it was for vectors
def unpack_strings(buf):
    int_size = struct.calcsize('i')
    pos = 0
    result = []
    while pos < len(buf):
        length = struct.unpack('i', buf[pos:pos+int_size])[0]
        pos += int_size
        format = '%ds' % length
        s = struct.unpack(format, buf[pos:pos+length])[0]
        pos += length
        result.append(s)
    return result
- Send comments
Metadata
- Metadata literally means “data about data”
- I.e., data that describes other data, such as the date it was collected, or its format
- When creating binary files, put a header at the start of the file that describes the format of the data the file contains
- Advantages:
- One parser handles all data files
- Can't lose the format: programs come and go, but data is forever
- Disadvantages:
- Slower (generality always is)
- Reader is more complicated than a single special-purpose reader would be…
- …but simpler than the sum of all the special-purpose readers you'd have to write…
- …and you only have to debug it once
- Send comments
Metadata File Structure
- Files have a three-part structure:
- Integer (fixed size) recording the length of the metadata
- Metadata (N bytes) describing the format of the records in the file
- The records themselves
![[Structure of a Binary File With Metadata]](./img/binary/metadata.png)
Figure 18.9: Structure of a Binary File With Metadata
- Send comments
Packing with Metadata
- First step is to store a list of identically-structured records to a file
def store(outf, format, values):
    '''Store a list of lists, each of which has the same structure.'''
    length = struct.pack('i', len(format))
    outf.write(length)
    outf.write(format)
    for v in values:
        temp = [format] + v
        binary = struct.pack(*temp)
        outf.write(binary)
- Notice how struct.pack is called
- It takes each value to be packed as a separate argument, rather than taking a list of values
- First argument has to be the format
- So create a list with the format, and the values to be packed, and apply struct.pack to it
- Common pattern when using variable number of arguments
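The argument-unpacking pattern in isolation, as a sketch assuming Python 3. pack_record is an illustrative name, and the "<" prefix is only there to make the byte layout predictable:

```python
import struct

def pack_record(fmt, values):
    '''Pack a list of values by spreading it across pack's arguments.'''
    return struct.pack(fmt, *values)

written_out = struct.pack('<ih', 17, 18)

# Build one list holding the format and the values, then apply pack to it.
args = ['<ih'] + [17, 18]
from_list = struct.pack(*args)

# Or keep the format separate and spread only the values.
from_helper = pack_record('<ih', [17, 18])

print(from_list == written_out, from_helper == written_out)
```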
- Send comments
Unpacking with Metadata
- Second step is to unpack the bytes created by store
- Read the size of the metadata, then the metadata, then the data
def retrieve(inf):
    '''Retrieve data from a self-describing file.'''
    data = inf.read(struct.calcsize('i'))
    format_length = struct.unpack('i', data)[0]
    format = inf.read(format_length)
    record_size = struct.calcsize(format)
    result = []
    while True:
        data = inf.read(record_size)
        if not data:
            break
        values = list(struct.unpack(format, data))
        result.append(values)
    return result
- Send comments
Testing
- Final step is to test that everything works
- Just as important as steps 1 and 2
from cStringIO import StringIO
tests = [
    ['i', [[17]]],
    ['ii', [[17, 18]]],
    ['ii', [[17, 18], [19, 20], [21, 22]]],
    ['if', [[17, 18.0], [19, 20.0]]]
]
for (format, original) in tests:
    storage = StringIO()
    store(storage, format, original)
    storage.seek(0)
    final = retrieve(storage)
    assert original == final
- Note that there's no output: tests should only ask for attention when something goes wrong
- Send comments
Summary
- Binary data is to programming what chemistry is to biology
- You don't want to spend any more time thinking at its level than you have to…
- …but when you do have to, there's no substitute
- Remember: libraries already exist to handle (almost) every binary format ever created
- The easiest code to debug is the code you didn't actually have to write
- Send comments
XML
Introduction
- XML is becoming the standard way to store everything from web pages to astronomical data
- Bewildering variety of tools for dealing with it
- And more appearing every day
- This lecture describes how to process and modify XML
- Warning: the standards are more complex than they should have been
- Reading:
- Send comments
You Can Skip This Lecture If...
- You know what the difference is between HTML and XML
- You know what elements, attributes, and entities are
- You know how to create a hyperlink in an HTML page
- You know what DOM is
- You know how to search an XML document
- Send comments
In the Beginning
- 1969-1986: Standard Generalized Markup Language (SGML)
- Developed by Charles Goldfarb and others at IBM
- A way of adding information to medical and legal documents so that computers could process them
- Very complex specification (over 500 pages)
- 1989: Tim Berners-Lee creates HyperText Markup Language (HTML) for the World Wide Web
- Much (much) simpler than SGML
- Anyone could write it, so everyone did
- Send comments
The Modern Era
- Problem: HTML had a small, fixed set of tags
- Everyone wanted to add new ones
- Solution: create a standard way to define a set of tags, and the relationships between them
- First version of XML standardized in 1998
- A set of rules for defining markup languages
- Much more complex than HTML, but still simpler than SGML
- New version of HTML called XHTML was also defined
- Like HTML, but obeys all XML rules
- Still a lot of non-XML compliant HTML out there
- Send comments
Formatting Rules
- A basic XML document contains elements and text
- Full spec allows for external entity references, processing instructions, and other fun
- Elements are shown using tags
- Must be enclosed in angle brackets "<>"
- Full form: <tagname>…</tagname>
- Short form (if the element doesn't contain anything): <tagname/>
- Send comments
Document Structure
- Elements must be properly nested
- If Y starts inside X, Y must end before X ends
- So <X>…<Y>…</Y></X> is legal…
- …but <X>…<Y>…</X></Y> is not
- Every document must have a single root element
- I.e., a single element must enclose everything else
- Specific XML dialects may restrict which elements can appear inside which others
- XHTML is very liberal
- MathML (Mathematical Markup Language) is stricter
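The nesting and single-root rules can be checked by handing text to a parser. This sketch assumes Python 3 and uses the minidom parser introduced later in this lecture. is_well_formed is a helper invented for the example:

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError

def is_well_formed(text):
    '''Return True if the parser accepts text as an XML document.'''
    try:
        xml.dom.minidom.parseString(text)
        return True
    except ExpatError:
        return False

nested = is_well_formed('<X>a<Y>b</Y></X>')     # Y ends before X ends
crossed = is_well_formed('<X>a<Y>b</X></Y>')    # X ends while Y is still open
two_roots = is_well_formed('<X/><Y/>')          # no single enclosing element

print(nested, crossed, two_roots)
```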
- Send comments
Text
- Text is normal printable text
- Must use escape sequences to represent "<" and ">"
- In XML, written &name;
| Sequence | Character |
|---|---|
| &lt; | < |
| &gt; | > |
| &quot; | " |
| &amp; | & |
Table 19.1: XML Character Escapes
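Python's standard library can apply these escapes, so there is no need to do it by hand. A sketch assuming Python 3; the sample strings are made up:

```python
from xml.sax.saxutils import escape, unescape

text = 'if a < b && b > c'
escaped = escape(text)              # handles &, < and > by default
restored = unescape(escaped)

# Other characters can be requested through the optional second argument.
quoted = escape('say "hi"', {'"': '&quot;'})

print(escaped)
print(quoted)
```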
- Send comments
XHTML
- Most common use of XML is still XHTML (the XML version of hypertext)
- Basic tags:
| Tag | Usage |
|---|---|
| <html> | Root element of entire HTML document. |
| <body> | Body of page (i.e., visible content). |
| <h1> | Top-level heading. Use <h2>, <h3>, etc. for second- and third-level headings. |
| <p> | Paragraph. |
| <em> | Emphasized text; browser or editor will usually display it in italics. |
| <address> | Address of document author (also usually displayed in italics). |
Table 19.2: Basic XHTML Tags
- Send comments
Sample XHTML Page
Critique of HTML/XHTML
- HTML and XHTML mix semantics and display
- <h1/> (level-1 heading) is semantic (meaning)
- <i/> (italics) is display (formatting)
- Now generally considered a bad thing
- Send comments
Attributes
- Elements can be customized by giving them attributes
- Enclosed in the opening tag
- <h1 align="center">A Centered Heading</h1>
- <p id="disclaimer" align="center">This planet provided as-is.</p>
- An attribute name may appear at most once in any element
- Like keys in a dictionary
- So <p align="left" align="right">…</p> is illegal
- Values must be quoted
- Old-style browsers accepted <p align=center>…</p>, but modern parsers will reject it
- Must use escape sequences for angle brackets, quotes, etc. inside values
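These attribute rules can be tried out with the minidom parser introduced later in this lecture. A sketch assuming Python 3; try_parse is a helper invented for the example:

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError

def try_parse(text):
    '''Return the root element, or None if the parser rejects the text.'''
    try:
        return xml.dom.minidom.parseString(text).documentElement
    except ExpatError:
        return None

good = try_parse('<p id="disclaimer" align="center">This planet provided as-is.</p>')
align = good.getAttribute('align')

# Escape sequences inside a value are turned back into characters.
title = try_parse('<p title="5 &lt; 6 &amp; &quot;so on&quot;">x</p>').getAttribute('title')

duplicated = try_parse('<p align="left" align="right">x</p>')
unquoted = try_parse('<p align=center>x</p>')

print(align, title, duplicated, unquoted)
```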
- Send comments
Attributes Vs. Elements
- Use attributes when:
- Each value can occur at most once for any element
- The order of the values doesn't matter
- Those values have no internal structure
- In all other cases, use nested elements
- If you have to parse an attribute's value to figure out what it means, use an element instead
- Send comments
More XHTML Tags
- Well-written HTML pages have a <head/> element as well as a <body/>
- Contains metadata about the page
- Well-written pages also use comments (just like code)
- Send comments
Lists and Tables
- Use <ul/> for an unordered (bulleted) list, and <ol/> for an ordered (numbered) one
- Each list item is wrapped in <li/>
- Use <table/> for tables
- Each row is wrapped in <tr/> (for “table row”)
- Within each row, column items are wrapped in <td/> (for “table data”)
- Note: tables are often used to force multi-column layout, as well as for tabular data
- Send comments
Example
Images
Links
The Document Object Model
- The Document Object Model (DOM) is a cross-language standard for representing XML documents as trees
- One node for each element, attribute, or text
- Pro:
- Much easier to manipulate trees than strings
- Same basic model in many different languages (which lowers the learning cost)
- Con:
- Needs a lot of memory for large documents
- Generic standard doesn't take advantage of the more advanced features of some languages
- Python's standard library includes a simple implementation of DOM called minidom
- Fast, sturdy, and well documented…
- …if you understand all the terminology, and know more or less what you're looking for
- Send comments
The Basics
- Every DOM tree has a single root representing the document as a whole
- Doesn't correspond to anything that's actually in the document
- This element has a single child, which is the root node of the document
- It, and other element nodes, may have three types of children:
- Other elements
- Text nodes
- Attribute nodes
- Send comments
DOM Tree Example
More On Tree Structure
- Every node keeps track of what its parent is
- Allows programs to search up the tree, as well as down
- Note: it's easy to forget that text and attributes are stored in nodes of their own
- Other Python libraries like ElementTree use dictionaries instead
- Pro: makes simple things a little simpler
- Con: not (yet) part of the standard library
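A small sketch of moving up and down a tree, assuming Python 3 and the minidom library. Note that in minidom specifically, attribute nodes are reached through getAttributeNode rather than through an element's list of children:

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString(
    '<planet name="Mercury"><period units="days">87.97</period></planet>')

root = doc.documentElement          # the <planet> element
period = root.firstChild            # the <period> element
text = period.firstChild            # the text node holding "87.97"
name = root.getAttributeNode('name')

# Every node knows its parent, so programs can search upward too.
print(text.data, text.parentNode.tagName, period.parentNode.tagName)
print(name.name, name.value)
```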
- Send comments
Creating a Tree
- Usual way to create a DOM tree is to parse a file
<?xml version="1.0" encoding="utf-8"?>
<planet name="Mercury">
<period units="days">87.97</period>
</planet>
import xml.dom.minidom
doc = xml.dom.minidom.parse('mercury.xml')
print doc.toxml('utf-8')
<?xml version="1.0" encoding="utf-8"?>
<planet name="Mercury">
<period units="days">87.97</period>
</planet>
- Send comments
Converting to Text
- The toxml method can be called on the document, or on any element node, to create text
- DOM trees always store text as Unicode, so when you're converting the tree to text, you must tell the library how to represent characters
- This means that strings taken from XML documents are Unicode, not ASCII
import xml.dom.minidom
my_xml = '''<name>Donald Knuth</name>'''
my_doc = xml.dom.minidom.parseString(my_xml)
name = my_doc.documentElement.firstChild.data
print 'name is:', name
print 'but name in full is:', repr(name)
name is: Donald Knuth
but name in full is: u'Donald Knuth'
- Note the u in front of the string the second time it is printed
- A simple print statement converts the Unicode string to ASCII for display
- Send comments
Other Ways To Create Documents
- Can also create a tree by parsing a string
import xml.dom.minidom
src = '''<planet name="Venus">
<period units="days">224.7</period>
</planet>'''
doc = xml.dom.minidom.parseString(src)
print doc.toxml('utf-8')
<?xml version="1.0" encoding="utf-8"?>
<planet name="Venus">
<period units="days">224.7</period>
</planet>
- Or by building a tree by hand
import xml.dom.minidom
impl = xml.dom.minidom.getDOMImplementation()
doc = impl.createDocument(None, 'planet', None)
root = doc.documentElement
root.setAttribute('name', 'Mars')
period = doc.createElement('period')
root.appendChild(period)
text = doc.createTextNode('686.98')
period.appendChild(text)
print doc.toxml('utf-8')
<?xml version="1.0" encoding="utf-8"?>
<planet name="Mars"><period>686.98</period></planet>
- Notice that the output of the preceding example wasn't nicely indented
- Because we didn't create text nodes containing carriage returns and blanks
- Most machine-generated XML doesn't
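When readable output matters, minidom can add the whitespace at serialization time instead. A minimal sketch, written for Python 3 (where print is a function), that rebuilds the Mars tree from the hand-built example and compares toxml with toprettyxml; the two-space indent is an arbitrary choice:

```python
import xml.dom.minidom

# Build the same tree as in the hand-built example.
impl = xml.dom.minidom.getDOMImplementation()
doc = impl.createDocument(None, 'planet', None)
root = doc.documentElement
root.setAttribute('name', 'Mars')
period = doc.createElement('period')
period.appendChild(doc.createTextNode('686.98'))
root.appendChild(period)

# toxml() writes the tree exactly as stored: no line breaks between elements.
compact = root.toxml()

# toprettyxml() adds a newline and one indent step per nesting level on output;
# the tree itself is not modified.
pretty = root.toprettyxml(indent='  ')
print(pretty)
```

The pretty-printed form is for human readers only; parsing it back produces extra whitespace text nodes that the original tree did not have.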
The Details
- xml.dom.minidom is really just a wrapper around other platform-specific XML libraries
- Have to reach inside it and get the underlying implementation object to create the document node
- That node then knows how to create other elements in the document
- Middle argument to createDocument specifies the type of the document's root node
- Documentation explains what the first and third arguments to createDocument are
- Add new nodes to existing ones by:
- Asking the document to create the node
- Appending it to a node that's already part of the tree
- Set attributes of element nodes using setAttribute(attributeName, newValue)
- Remember, all attribute values are strings
- If you want to store an integer or a Boolean, you have to convert it yourself
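A small sketch of that conversion, written for Python 3. The attribute names moons and ringed and their values are made up for illustration:

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString('<planet name="Saturn"/>')
planet = doc.documentElement

# Convert non-string values to text before storing them.
planet.setAttribute('moons', str(18))      # hypothetical attribute and value
planet.setAttribute('ringed', str(True))   # hypothetical attribute

# Values come back as strings, so convert them again when reading.
moons = int(planet.getAttribute('moons'))
ringed = planet.getAttribute('ringed') == 'True'

print(planet.toxml())
```

The program that reads the file has to know which attributes hold numbers or flags; nothing in the XML itself records the type.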
Finding Nodes
- Often want to do things to all elements of a particular type
- E.g., find all <experimenter/> nodes, extract names, and print a sorted list
- Use the getElementsByTagName method to do this
- Returns a list of all the descendants of a node with the specified tag
import xml.dom.minidom
src = '''<heavenly_bodies>
<planet name="Mercury"/>
<planet name="Venus"/>
<planet name="Earth"/>
<moon name="Moon"/>
<planet name="Mars"/>
<moon name="Phobos"/>
<moon name="Deimos"/>
</heavenly_bodies>'''
doc = xml.dom.minidom.parseString(src)
for node in doc.getElementsByTagName('moon'):
    print node.getAttribute('name')
Moon
Phobos
Deimos
- Question: what happens if you add or delete nodes while looping over this list?
Walking a Tree
- Often want to visit each node in the tree
- E.g., print an outline of the document showing element nesting
- Node's type is stored in a member variable called nodeType
- ELEMENT_NODE, TEXT_NODE, ATTRIBUTE_NODE, DOCUMENT_NODE
- If a node is an element, its children are stored in a read-only list called childNodes
- If a node is a text node, the actual text is in the member data
Recursive Tree Walker
import xml.dom.minidom
src = '''<solarsystem>
<planet name="Mercury"><period units="days">87.97</period></planet>
<planet name="Venus"><period units="days">224.7</period></planet>
<planet name="Earth"><period units="days">365.26</period></planet>
</solarsystem>
'''
def walkTree(currentNode, indent=0):
    spaces = ' ' * indent
    if currentNode.nodeType == currentNode.TEXT_NODE:
        print spaces + 'TEXT' + ' (%d)' % len(currentNode.data)
    else:
        print spaces + currentNode.tagName
        for child in currentNode.childNodes:
            walkTree(child, indent+1)
doc = xml.dom.minidom.parseString(src)
walkTree(doc.documentElement)
solarsystem
 TEXT (1)
 planet
  period
   TEXT (5)
 TEXT (1)
 planet
  period
   TEXT (5)
 TEXT (1)
 planet
  period
   TEXT (6)
 TEXT (1)
- Traversing a tree like this is just one of many recurring patterns in object-oriented programming
Modifying the Tree
- Modifying trees in place is a little bit tricky
- Helps to draw lots of pictures
- Example: want to emphasize the first word of each paragraph
- Get the text node below the paragraph
- Take off the first word
- Insert a new <em/> element whose only child is a text node containing that word
![[Modifying the DOM Tree]](./img/xml/modify_tree.png)
Figure 19.6: Modifying the DOM Tree
Complications
- But what if the first child of the paragraph already has some markup around it?
- E.g., what if the paragraph starts with a link?
- Could just wrap the first child with <em/>
- But if (for example) the link contains several words, this will look wrong
- We'll ignore this problem for now
Solution
- Step 1: find all the paragraphs using getElementsByTagName, and iterate over them
- Step 2: break the paragraph text into pieces, and handle each piece in turn
- Create a new node for each piece
- Push it onto the front of the paragraph's child list
- Once they've all been handled, get rid of the original text node
import re
def emphasizeText(doc, para, textNode):
    # Look for optional spaces, a word, and the rest of the paragraph.
    m = re.match(r'^(\s*)(\S*)\b(.*)$', str(textNode.data))
    if not m:
        return
    leadingSpace, firstWord, restOfText = m.groups()
    if not firstWord:
        return
    # If there's text after the first word, re-save it.
    if restOfText:
        restOfText = doc.createTextNode(restOfText)
        para.insertBefore(restOfText, para.firstChild)
    # Emphasize the first word.
    emph = doc.createElement('em')
    emph.appendChild(doc.createTextNode(firstWord))
    para.insertBefore(emph, para.firstChild)
    # If there's leading space, re-save it.
    if leadingSpace:
        leadingSpace = doc.createTextNode(leadingSpace)
        para.insertBefore(leadingSpace, para.firstChild)
    # Get rid of the original text.
    para.removeChild(textNode)
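The test on the next slide calls emphasize(doc), which covers Step 1 but is not shown. A sketch of what it might look like, written for Python 3; the name emphasize_text and the decision to skip paragraphs that do not start with text are assumptions, and the per-paragraph helper is repeated in condensed form only so that the block runs on its own:

```python
import re
import xml.dom.minidom

def emphasize_text(doc, para, text_node):
    """Condensed version of the per-paragraph step: wrap the first word in <em>."""
    m = re.match(r'^(\s*)(\S*)\b(.*)$', text_node.data)
    if not m or not m.group(2):
        return
    leading, first_word, rest = m.groups()
    # Insert the new pieces in document order, just before the original text node.
    if leading:
        para.insertBefore(doc.createTextNode(leading), text_node)
    emph = doc.createElement('em')
    emph.appendChild(doc.createTextNode(first_word))
    para.insertBefore(emph, text_node)
    if rest:
        para.insertBefore(doc.createTextNode(rest), text_node)
    # Get rid of the original text.
    para.removeChild(text_node)

def emphasize(doc):
    """Step 1: find every paragraph and emphasize its first word."""
    for para in doc.getElementsByTagName('p'):
        first = para.firstChild
        # Assumption: paragraphs that start with markup are left alone.
        if first is not None and first.nodeType == first.TEXT_NODE:
            emphasize_text(doc, para, first)

doc = xml.dom.minidom.parseString(
    '<body><p>First paragraph.</p><p><a>link</a> text</p></body>')
emphasize(doc)
print(doc.documentElement.toxml())
```

Leaving markup-first paragraphs untouched matches the earlier decision to ignore that complication for now.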
Not Finished Yet
- Step 3: test it
- Yes, it really is part of the program
if __name__ == '__main__':
    src = '''<html><body>
<p>First paragraph.</p>
<p>Second paragraph contains <em>emphasis</em>.</p>
<p>Third paragraph.</p>
</body></html>'''
    doc = xml.dom.minidom.parseString(src)
    emphasize(doc)
    print doc.toxml('utf-8')
<?xml version="1.0" encoding="utf-8"?>
<html><body>
<p><em>First</em> paragraph.</p>
<p><em>Second</em> paragraph contains <em>emphasis</em>.</p>
<p><em>Third</em> paragraph.</p>
</body></html>
Summary
- There's a lot of hype in hypertext
- Haven't yet heard anyone claim that XML will cure the common cold, but I'm sure it's been said
- Pros:
- One set of rules for people to learn
- Never have to write a parser again
- At least, the low-level syntactic bits—still need to figure out what all those tags mean
- Cons:
- Raw XML is hard to read
- Particularly if it has been generated by a machine
- A lot of data isn't actually trees
- When storing a 2D matrix or a table, you have to organize data by row or by column…
- …either of which makes the other hard to access
- There are a lot of complications and subtleties
- Most applications ignore most of them
- Which means that they fail (usually badly) when confronted with something outside the subset they understand
- Like Inglish speling, it's here to stay
Relational Databases
Introduction
- Text and XML have their place, but most of the data people really care about is stored in relational databases
- The bad news: it's a huge topic
- The documentation for most commercial databases would fill the entire room
- No matter what room you're in
- The good news: you only need to know a little to get most things done
- A few key ideas
- A little syntax
- [Fehily 2003] is a good tutorial and reference guide
You Can Skip This Lecture If...
- You know what a table is
- You know how to select data from a table
- You know how to aggregate data
- You know what a nested query is
- You know what primary and foreign keys are
- You know how to do inner and outer joins
History
- Originated with E. F. Codd's work in the late 1960s and early 1970s
- By the 1980s, Oracle and IBM's DB2 dominated the market
- Open source alternatives like MySQL and PostgreSQL emerged in the 1990s
- Now have commercial support and competitive performance
- SQLite is a lightweight alternative for small jobs
- Originally designed for use in small web sites
- Stores entire database in a single file on disk (which simplifies backup and recovery)
When To Use A Database
- When you have lots of data
- When you need to ask complex questions
- E.g., “Find all experiments done with the Mark VII that had yields greater than 30%, that didn't use cadmium disulfide as a reagent”
- Relational database can answer these questions directly
- Many people try to use spreadsheets as simple databases
- Works for small data sets like course grades
- But they don't scale up, and search capabilities are primitive
- Increasingly common to store images, video clips, and other data
- Almost always store information about this data as well to support search and retrieval
Getting Started
- A database is a collection of zero or more tables, each of which:
- Has a name
- Stores a single relation (i.e., a set of information of a particular kind)
- Each table has a fixed set of named columns
- All the values in a column have the same type
- Each table has zero or more rows
Example: Experimental Data
![[Database Tables]](./img/db/database_tables.png)
Figure 20.1: Database Tables
- Use these tables as a running example
Using SQL
- Interact with database management system (DBMS) using a specialized language called SQL
- Every vendor implements its own extensions to the standard
- Not case sensitive: gravity, Gravity and GRAVITY are considered the same
- Three approaches:
- Use an interactive GUI
- Put commands in a file, and give it to the DBMS
- E.g., sqlite experiments.db < find_names.sql
- Have a program written in another language (such as Python or Java) send strings containing commands to the database manager
![[Interacting with a DBMS]](./img/db/dbms_interaction.png)
Figure 20.2: Interacting with a DBMS
Creating Tables
Inserting Data
- To insert values into a table, specify the name of the table, and the values to be inserted
- Each INSERT creates a new row
- Rows do not have to be unique
INSERT INTO Person VALUES("skol", "Kovalevskaya", "Sofia");
INSERT INTO Person VALUES("mlom", "Lomonosov", "Mikhail");
INSERT INTO Person VALUES("dmitri", "Mendeleev", "Dmitri");
INSERT INTO Person VALUES("ivan", "Pavlov", "Ivan");
Simple Queries
- Suppose we want to get everyone's name and login ID
- Write a query that specifies what we want, and where to find it
SELECT Person.FirstName, Person.LastName, Person.Login FROM Person;
Sofia|Kovalevskaya|skol
Mikhail|Lomonosov|mlom
Dmitri|Mendeleev|dmitri
Ivan|Pavlov|ivan
Sorting
- How about sorting rows by login ID?
SELECT Person.FirstName, Person.LastName, Person.Login
FROM Person
ORDER BY Person.Login;
Dmitri|Mendeleev|dmitri
Ivan|Pavlov|ivan
Mikhail|Lomonosov|mlom
Sofia|Kovalevskaya|skol
- Note: some SQL commands are multi-word, such as ORDER BY
Selection
- Frequently want only a subset of data
SELECT Experiment.ProjectId, Experiment.ExperimentId, Experiment.Hours
FROM Experiment
WHERE Experiment.Hours < 0;
1737|1|-1.0
1737|2|-1.5
- Use WHERE to specify conditions that rows must satisfy to be included in results
- Works on each row independently: cannot be used to compare one row to another
Joins
- Project IDs aren't particularly readable
- Want to look up the corresponding names, and display those instead
- A join is a query that combines information from two or more tables
- Most common kind is an inner join, which matches rows of the first with rows of the second based on common values
- Other variants include cross join, outer join, and self join
- Conceptually, an inner join:
- Constructs the cross product of the tables
- Discards rows that don't meet the selection criteria
- Selects columns from the surviving rows
![[Inner Joins]](./img/db/inner_join.png)
Figure 20.3: Inner Joins
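The three conceptual steps can be acted out on plain Python lists. This is only a sketch of the idea (real databases avoid building the full cross product), written for Python 3. The rows are a made-up subset, and the 7.0 hours value in particular is hypothetical:

```python
from itertools import product

# Hypothetical sample rows.
projects = [(1214, 'Antigravity'), (1737, 'Time Travel')]      # (ProjectId, ProjectName)
experiments = [(1214, 1, 7.0), (1737, 1, -1.0), (1737, 2, -1.5)]  # (ProjectId, ExperimentId, Hours)

# 1. Construct the cross product: every project row paired with every experiment row.
pairs = list(product(projects, experiments))

# 2. Discard pairs that don't meet the selection criteria (matching project IDs).
matched = [(p, e) for p, e in pairs if p[0] == e[0]]

# 3. Select columns from the surviving rows.
result = [(p[1], e[1], e[2]) for p, e in matched]

for row in result:
    print(row)
```

With 2 projects and 3 experiments the cross product has 6 pairs, of which 3 survive the matching step.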
Example: Translating IDs
- Rewrite the previous query to replace project IDs with names
SELECT Project.ProjectName, Experiment.ExperimentId, Experiment.Hours
FROM Project INNER JOIN Experiment
WHERE (Project.ProjectId = Experiment.ProjectId)
AND (Experiment.Hours < 0);
Time Travel|1|-1.0
Time Travel|2|-1.5
- What just happened:
- Construct cross product of Project and Experiment (which has 3×6=18 rows)
- Throw away rows for which the project IDs don't match (reduces data to 6 rows)
- And rows for which hours are not negative (reduces data to 2 rows)
- Show project name, experiment ID, and experiment hours
Keys and Constraints
- One or more values in each record form its primary key
- A table may also contain one or more foreign keys
- A value (or set of values) in one table that identifies a record in another
- For example, the values in the Login column in Involved identify records in the Person table
- Can (and should) specify such constraints explicitly, so that the DBMS can enforce them
- Simple form:
CREATE TABLE Person(
Login TEXT NOT NULL,
LastName TEXT NOT NULL,
FirstName TEXT NOT NULL,
PRIMARY KEY (Login)
);
- Named form:
CREATE TABLE Experiment(
ProjectId INTEGER NOT NULL,
ExperimentId INTEGER NOT NULL,
NumInvolved INTEGER NOT NULL,
ExperimentDate DATE,
Hours REAL NOT NULL,
CONSTRAINT Experiment_Key PRIMARY KEY (ProjectId, ExperimentId)
);
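To see the DBMS enforcing such constraints, the sketch below creates the Person table in a throwaway in-memory SQLite database and tries two bad inserts. It assumes Python 3 and the standard library's sqlite3 module (not the pysqlite2 package used elsewhere in this lecture); the rejected rows are invented for the demonstration:

```python
import sqlite3

connection = sqlite3.connect(':memory:')   # throwaway in-memory database
connection.execute('''CREATE TABLE Person(
    Login     TEXT NOT NULL,
    LastName  TEXT NOT NULL,
    FirstName TEXT NOT NULL,
    PRIMARY KEY (Login)
)''')
connection.execute("INSERT INTO Person VALUES('skol', 'Kovalevskaya', 'Sofia')")

# A second row with the same primary key violates the constraint.
try:
    connection.execute("INSERT INTO Person VALUES('skol', 'Someone', 'Else')")
    duplicate_rejected = False
except sqlite3.IntegrityError as error:
    duplicate_rejected = True
    print('rejected:', error)

# So does a row that leaves a NOT NULL column empty.
try:
    connection.execute("INSERT INTO Person VALUES('anon', NULL, 'Nobody')")
    null_rejected = False
except sqlite3.IntegrityError as error:
    null_rejected = True
    print('rejected:', error)

row_count = connection.execute('SELECT COUNT(*) FROM Person').fetchone()[0]
connection.close()
```

Only the first, valid row ends up in the table.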
Eliminating Duplicates
- How to find out who has done experiments for each project?
SELECT Project.ProjectName, Involved.Login
FROM Project, Involved
WHERE Project.ProjectId = Involved.ProjectId;
Antigravity|mlom
Antigravity|mlom
Teleportation|dmitri
Teleportation|skol
Teleportation|ivan
Teleportation|mlom
Time Travel|skol
Time Travel|skol
Time Travel|ivan
- User mlom appears twice for the Antigravity project because he did two experiments for it
- Use the DISTINCT keyword to eliminate duplicates
SELECT DISTINCT Project.ProjectName, Involved.Login
FROM Project, Involved
WHERE Project.ProjectId = Involved.ProjectId;
Antigravity|mlom
Teleportation|dmitri
Teleportation|skol
Teleportation|ivan
Teleportation|mlom
Time Travel|skol
Time Travel|ivan
Aggregation
- Often need to aggregate (combine) values from different rows
- Sum, maximum, average, etc.
- Example: how much time has Mikhail spent on antigravity experiments?
SELECT SUM(Experiment.Hours)
FROM Involved INNER JOIN Experiment
WHERE (Involved.Login = "mlom")
AND (Involved.ProjectId = 1214)
AND (Involved.ProjectId = Experiment.ProjectId)
AND (Involved.ExperimentId = Experiment.ExperimentId);
15.8
Grouping
- It would be tedious to write a separate query to total each scientist's hours
- SQL doesn't have loops
- Although some vendors provide non-standardized equivalents
- Use GROUP BY to apply aggregation function to specific subsets of rows
SELECT Involved.Login, SUM(Experiment.Hours)
FROM Involved INNER JOIN Experiment
WHERE (Involved.ProjectId = Experiment.ProjectId)
AND (Involved.ExperimentId = Experiment.ExperimentId)
GROUP BY Involved.Login;
dmitri|7
ivan|5.5
mlom|23.0
skol|4.5
- Note: negative hours on time travel experiments really mess up budgeting…
Self Joins
- How to find people who have done experiments for two (or more) projects?
- First attempt: use AND
SELECT DISTINCT Person.Login
FROM Person INNER JOIN Involved
WHERE (ProjectId = 1214) AND (ProjectId = 1709);
- Doesn't work because ProjectId cannot simultaneously be 1214 and 1709
- Second attempt: use OR
SELECT DISTINCT Person.Login
FROM Person INNER JOIN Involved
WHERE (ProjectId = 1214) OR (ProjectId = 1709);
skol
mlom
dmitri
ivan
- Doesn't work because it includes rows where information about different people has been joined
Using Self Joins
- The right solution is to join the Involved table with itself, so that we have two project IDs in the same row
- Then select rows where the person is the same, but the project IDs are different
- Have to create temporary aliases for the two versions of the table
SELECT DISTINCT A.Login
FROM Involved A CROSS JOIN Involved B
WHERE (A.Login = B.Login)
AND (A.ProjectId != B.ProjectId);
mlom
skol
ivan
Who Has Worked Together?
- Which pairs of people have performed experiments together?
Null
- Real-world data always has holes in it
- Some people don't have cell phone numbers, some authors' birth dates are unknown…
- Can represent this in a database using the special value NULL
- Not the same as zero, empty string, False, etc.
- Instead, it means “nothing known at all”
- Database designers argue about whether NULL is a good idea or not
- Does it mean “no value”, “value not known”, or something else?
Operations on Nulls
- Check to see if a value is null using IS NULL
- The result of almost any computation involving NULL is NULL
- 2 + NULL is NULL, NULL = NULL is NULL, etc.
- Exception: under SQL's three-valued logic, False AND NULL is False, and True OR NULL is True, since the unknown operand cannot change the result
Managing Nulls
- By default, columns may contain NULL, but this can be prohibited when the table is created
CREATE TABLE Experiment(
ProjectId INTEGER NOT NULL,
ExperimentId INTEGER NOT NULL,
NumInvolved INTEGER NOT NULL,
ExperimentDate DATE,
Hours REAL NOT NULL
);
- Queries must take possibility of NULL into account
- Experiment.ExperimentDate <> '1901-05-01' selects experiments that weren't conducted on May 1, 1901, but leaves out experiments whose date is NULL (since comparing NULL with anything, even another NULL, yields NULL rather than true)
- To include those as well, have to use (Experiment.ExperimentDate <> '1901-05-01') OR (Experiment.ExperimentDate IS NULL)
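How NULL behaves in arithmetic and comparisons is easy to check directly. A sketch using Python 3's standard sqlite3 module and an in-memory database; the cut-down Experiment table and its three rows are hypothetical, and other database systems may differ in details:

```python
import sqlite3

connection = sqlite3.connect(':memory:')

# Arithmetic with NULL gives NULL, which Python sees as None.
arithmetic, = connection.execute('SELECT 2 + NULL').fetchone()

# Comparing NULL with NULL is also NULL, not true.
comparison, = connection.execute('SELECT NULL = NULL').fetchone()

# A cut-down, hypothetical Experiment table with one missing date.
connection.execute('CREATE TABLE Experiment(ExperimentId INTEGER, ExperimentDate DATE)')
connection.executemany('INSERT INTO Experiment VALUES (?, ?)',
                       [(1, '1901-05-01'), (2, '1901-06-15'), (3, None)])

# The row with the NULL date does not pass the <> test.
without_nulls = connection.execute(
    "SELECT ExperimentId FROM Experiment WHERE ExperimentDate <> '1901-05-01' "
    "ORDER BY ExperimentId").fetchall()

# Asking for NULL dates explicitly brings that row back.
with_nulls = connection.execute(
    "SELECT ExperimentId FROM Experiment WHERE ExperimentDate <> '1901-05-01' "
    "OR ExperimentDate IS NULL ORDER BY ExperimentId").fetchall()
connection.close()

print(arithmetic, comparison, without_nulls, with_nulls)
```

Note that the date is written as a quoted string: unquoted, 1901-05-01 would be treated as integer subtraction.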
Database Design
- Database design is a sizeable topic in its own right
- Make sure the relationships are correct
- Make sure the database performs well
- Very dependent on exactly which vendor database is being used
- But all commercial-grade DBMSes have powerful optimizers
- Most important thing is to normalize the data
- I.e., conform to a set of rules called normal forms
- Details are beyond the scope of this course
Normal Forms
- First normal form: values do not have any internal structure
- I.e., you shouldn't have to parse them in order to use them
- This is why first names and last names are stored separately
- Second normal form: tables don't contain redundant information
- Natural to combine Experiment and Involved tables into one
![[A Combined Table]](./img/db/combined_table.png)
Figure 20.4: A Combined Table
- But now some information appears in two or more places…
- …which makes updates, consistency checks, and queries all harder to do
- Make up attributes (like InvolvedID) to relate these tables
Nested Queries
- How to find everyone who hasn't been experimenting with time travel?
- Select rows of Involved where ProjectId is not 1737?
SELECT DISTINCT Involved.Login
FROM Involved
WHERE (Involved.ProjectId != 1737);
mlom
dmitri
skol
ivan
- Wrong answer: Kovalevskaya and Pavlov have worked on time travel, but also on other projects
- Solution requires use of nested queries
- Database manager runs the inner query first, then applies the outer query to the inner query's result
Nested Query Example
- Strategy:
- Nested query finds all the people we don't want
- Outer query subtracts them from the set containing everyone
SELECT DISTINCT Login
FROM Involved
WHERE Login NOT IN
(SELECT DISTINCT Login
FROM Involved
WHERE Involved.ProjectId = 1737);
mlom
dmitri
More Uses for Nested Queries
- This strategy is useful for many other things as well
- Example: how many people have done experiments for exactly one project?
- Solution: find the people who've done experiments for none, or for two or more, and subtract them from everyone
SELECT DISTINCT Login
FROM Involved
WHERE Login NOT IN
(SELECT DISTINCT A.Login
FROM Involved A INNER JOIN Involved B
WHERE (A.Login = B.Login)
AND (A.ProjectId != B.ProjectId));
dmitri
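Both nested queries can be tried on a trimmed copy of Involved. The sketch below (Python 3, standard sqlite3 module, in-memory database) keeps only the ProjectId and Login columns and loads the distinct project/login pairs shown under Eliminating Duplicates. Matching 1709 to Teleportation is an inference from the other queries, not something the slides state:

```python
import sqlite3

connection = sqlite3.connect(':memory:')
# Trimmed table: the real Involved table has more columns.
connection.execute('CREATE TABLE Involved(ProjectId INTEGER, Login TEXT)')
connection.executemany('INSERT INTO Involved VALUES (?, ?)', [
    (1214, 'mlom'),
    (1709, 'dmitri'), (1709, 'skol'), (1709, 'ivan'), (1709, 'mlom'),
    (1737, 'skol'), (1737, 'ivan'),
])

# Everyone, minus the people found by the inner query (time travellers).
not_time_travel = connection.execute('''
    SELECT DISTINCT Login FROM Involved
    WHERE Login NOT IN
        (SELECT DISTINCT Login FROM Involved WHERE ProjectId = 1737)
    ORDER BY Login''').fetchall()

# Everyone, minus the people the self join finds on two or more projects.
exactly_one = connection.execute('''
    SELECT DISTINCT Login FROM Involved
    WHERE Login NOT IN
        (SELECT DISTINCT A.Login
         FROM Involved A INNER JOIN Involved B
         WHERE (A.Login = B.Login) AND (A.ProjectId != B.ProjectId))
    ORDER BY Login''').fetchall()
connection.close()

print(not_time_travel, exactly_one)
```

ORDER BY is added only to make the output order predictable; it does not change which rows are selected.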
Using Other Languages
- Usually don't write entire application in SQL, or run SQL in sub-shell
- Instead, embed SQL in the programming language of your choice
- Need the right driver to connect to the database
- Procedure:
- Establish a connection between the program and the DBMS
- Typically a socket, but other methods are used as well
- Create a pointer into the database called a cursor
- Send queries, and loop over results
![[Using Databases from Programs]](./img/db/using_dbms.png)
Figure 20.5: Using Databases from Programs
Example: Database Access from Python
- Example: get the names of all the scientists into a Python program
from pysqlite2 import dbapi2 as sqlite
connection = sqlite.connect("example.db")
cursor = connection.cursor()
cursor.execute("SELECT FirstName, LastName FROM Person ORDER BY LastName;")
results = cursor.fetchall()
for r in results:
    print r
cursor.close()
connection.close()
("Sofia", "Kovalevskaya")
("Mikhail", "Lomonosov")
("Dmitri", "Mendeleev")
("Ivan", "Pavlov")
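The same connect, cursor, execute, fetch sequence also works with the sqlite3 module that ships in Python's standard library. The sketch below is written for Python 3 and, so that it runs without an existing example.db, creates and fills the Person table in an in-memory database first; that setup step is an addition, not part of the example above:

```python
import sqlite3

connection = sqlite3.connect(':memory:')   # use a file name for a persistent database
cursor = connection.cursor()
cursor.execute('CREATE TABLE Person(Login TEXT, LastName TEXT, FirstName TEXT)')

people = [('skol', 'Kovalevskaya', 'Sofia'), ('mlom', 'Lomonosov', 'Mikhail'),
          ('dmitri', 'Mendeleev', 'Dmitri'), ('ivan', 'Pavlov', 'Ivan')]
# The ? placeholders let the driver handle quoting of the values.
cursor.executemany('INSERT INTO Person VALUES (?, ?, ?)', people)
connection.commit()

cursor.execute('SELECT FirstName, LastName FROM Person ORDER BY LastName')
results = cursor.fetchall()
for row in results:
    print(row)

cursor.close()
connection.close()
```

Each row comes back as a tuple, with one entry per selected column.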
Concurrency
- The biggest challenge in database programming isn't formulating queries; it's handling concurrency
- Two or more things happening at once
- In the database world, one user changing the database while another is making a query
- Need to prevent race conditions, in which the final state of the system depends on the random order of operations
- First user: get current balance of grant #19823, add $100.00, save result
- Second user: get current balance, add $200.00, save
- If the operations are interleaved, final result could be $100, $200, or $300 added to account
![[Race Conditions]](./img/db/race_condition.png)
Figure 20.6: Race Conditions
- Also need to guard against failure
- Step 1: remove $100.00 from grant #19823
- Step 2: add $100.00 to grant #17928
- Don't want money to disappear if computer goes down in between
Transactions
- Solution to both problems is to use a transaction
- A set of operations which either all take effect as if nothing else was going on, or do not change the database
- Transactions must be ACID:
- Atomic: either all are performed, or none
- Consistent: database is in a legal state when the transaction ends
- Isolated: no operation outside the transaction sees the database in any intermediate state
- Durable: once the user is notified that the operation has completed, its effects are permanent
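A sketch of the atomic part, using the grant transfer from the previous slide. It assumes Python 3's standard sqlite3 module; the Grants table, its CHECK constraint, and the starting balances are all hypothetical. Used as a context manager, a sqlite3 connection commits if the block completes and rolls back if it raises:

```python
import sqlite3

connection = sqlite3.connect(':memory:')
# Hypothetical table: balances are not allowed to go negative.
connection.execute('CREATE TABLE Grants(GrantId INTEGER PRIMARY KEY, '
                   'Balance REAL NOT NULL CHECK (Balance >= 0))')
connection.executemany('INSERT INTO Grants VALUES (?, ?)',
                       [(19823, 150.0), (17928, 0.0)])
connection.commit()

def transfer(connection, source, target, amount):
    """Move amount between grants; both updates happen or neither does."""
    with connection:   # commit on success, roll back on any exception
        connection.execute('UPDATE Grants SET Balance = Balance + ? WHERE GrantId = ?',
                           (amount, target))
        connection.execute('UPDATE Grants SET Balance = Balance - ? WHERE GrantId = ?',
                           (amount, source))

def balances(connection):
    return dict(connection.execute('SELECT GrantId, Balance FROM Grants'))

transfer(connection, 19823, 17928, 100.0)        # succeeds
after_first = balances(connection)

try:
    transfer(connection, 19823, 17928, 100.0)    # second update would overdraw 19823
except sqlite3.IntegrityError:
    pass
after_second = balances(connection)

print(after_first, after_second)
```

The failed transfer leaves no trace: the credit to grant 17928 made by its first update is undone along with everything else.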
Example: Changing User ID
- Change Kovalevskaya's login ID from "skol" to "kovalev"
BEGIN TRANSACTION;
UPDATE Person
SET Login = "kovalev"
WHERE Login = "skol";
UPDATE Involved
SET Login = "kovalev"
WHERE Login = "skol";
END TRANSACTION;
SELECT *
FROM Person
WHERE (Login = "kovalev") OR (Login = "skol");
SELECT *
FROM Involved
WHERE (Login = "kovalev") OR (Login = "skol");
kovalev|Kovalevskaya|Sofia
1709|1|2|kovalev
1737|1|1|kovalev
1737|2|1|kovalev
- No query can run after Person changes, but before Involved changes
- If database goes down in the middle, any changes made are discarded
Using Transactions
- Transactions can be used for queries, but should always be used for updates
- Why not use transactions everywhere?
- Because they require the database to serialize some operations
- Which slows the system down
Testing
- Unit testing programs that use databases is just like testing other programs…
- …except slower
- Creating a fresh fixture for each test means erasing and re-creating the entire database
- If it takes two seconds to run each unit test, developers are not going to re-run 1000 tests after each small program change
- Solutions:
- Create the fixture once, and clone it each time it's needed
- Store the database in memory, rather than on disk
- Not very useful in a production system (since all data is lost when the program ends)
- But much, much faster
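Both solutions can be combined. A sketch assuming Python 3.7 or later (for Connection.backup) and the standard sqlite3 module; the helper names make_fixture and clone, and the one-row fixture, are made up for illustration:

```python
import sqlite3

def make_fixture():
    """Build the fixture database once (in memory, so nothing touches disk)."""
    fixture = sqlite3.connect(':memory:')
    fixture.execute('CREATE TABLE Person(Login TEXT PRIMARY KEY, '
                    'LastName TEXT, FirstName TEXT)')
    fixture.execute("INSERT INTO Person VALUES('skol', 'Kovalevskaya', 'Sofia')")
    fixture.commit()
    return fixture

def clone(fixture):
    """Copy the fixture into a fresh in-memory database for a single test."""
    copy = sqlite3.connect(':memory:')
    fixture.backup(copy)
    return copy

FIXTURE = make_fixture()

def test_delete_person():
    db = clone(FIXTURE)
    db.execute("DELETE FROM Person WHERE Login = 'skol'")
    assert db.execute('SELECT COUNT(*) FROM Person').fetchone()[0] == 0
    db.close()

def test_fixture_unchanged():
    # The deletion in the other test only touched its own clone.
    db = clone(FIXTURE)
    assert db.execute('SELECT COUNT(*) FROM Person').fetchone()[0] == 1
    db.close()

test_delete_person()
test_fixture_unchanged()
```

Each test gets its own copy, so tests cannot interfere with one another no matter what order they run in.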
Advanced Topics
- A stored procedure is a piece of compiled code stored in the database itself
- Trades flexibility for efficiency
- A trigger is a procedure that automatically runs when a table is modified
- E.g., send mail whenever a new dataset is entered
- And then there's the problem of ensuring referential integrity
- I.e., that references within and between database entries are consistent
Summary
- The world isn't made of tables any more than it's made of lists
- But sooner or later, every home-grown spreadsheet or XML substitute needs the capabilities of a DBMS
- Even simple queries can get you into trouble if they miss important data
- Or include data they shouldn't
- Treat queries written in SQL with the same respect you'd give any other program
Spreadsheets
Introduction
- A spreadsheet is a simple way to do tabular calculations
- The first one, VisiCalc, was the original “compelling application”
- Widely used in science as a simple database, and for statistics
- Most widely used spreadsheet today is Microsoft Excel
- This lecture will use Gnumeric instead
- Principles and UI are the same
You Can Skip This Lecture If...
- You can enter and format data in a spreadsheet
- You know how to calculate totals, averages, and other aggregate values
- You can create a chart
- You know how conditionals work
- You know how to create lookup tables
First Steps
- Run Gnumeric
![[Empty Spreadsheet]](./img/spreadsheets/gnumeric_empty.png)
Figure 21.1: Empty Spreadsheet
- Spreadsheet consists of cells with unique ids like A1 or C4
- Enter data by typing in cells
- [enter] moves to the next row, [tab] moves to next column
- Use mouse or arrow keys to move around
Entering Data
- Studying the long-term effects of programming on human memory
- How well do people remember names, faces, and where they live after they've been programming for several years?
- How does performance change with repetition?
- Want to use a spreadsheet to store data and do calculations
- Enter the values shown below
![[Raw Scores]](./img/spreadsheets/scores_when_first_entered.png)
Figure 21.2: Raw Scores
Formatting Data
- Spreadsheets can help make data more readable
- I.e., easier to understand
- Select the six numbers using either the mouse or the keyboard
- Select Format...Cells
- Under the “Number” category, change the number of decimal places to 1
- While we're here:
- Select the “1” label (to select the first row)
- Select Format...Cells again
- Under the “Font” tab, change the format to bold
- Do the same for the title column
![[Scores After Formatting Titles]](./img/spreadsheets/scores_after_formatting_titles.png)
Figure 21.3: Scores After Formatting Titles
Formulas
- Often want to calculate new values from old
- In this case, want to calculate each subject's overall performance
- Weighting is 30% for preliminary result, 70% for final result
- Select E2 (first empty cell in Alan Turing's row)
- Click on the “=” button to enter a formula
- Enter (0.3*C2)+(0.7*D2)
- C2 and D2 are references to other cells
- 84.6 appears in cell E2
![[Scores With A Formula]](./img/spreadsheets/scores_with_formula.png)
Figure 21.4: Scores With A Formula
Replicating Formulas
- Could type similar formulas into cells E3 and E4
- But it would be tedious to do this for a hundred test subjects
- Instead, select E2, copy, and paste into E3, then into E4
- Spreadsheet automatically adjusts relative references
- Turns C2 into C3 or C4 as required
- Grades are 84.6, 71.9, and 86.3
![[Copying Formulas]](./img/spreadsheets/scores_copying_formulas.png)
Figure 21.5: Copying Formulas
Built-In Functions
- What about calculating averages and other aggregate statistics?
- Don't want to have to type in (0.01*A1+0.01*A2+...+0.01*A100)
- Especially if we might add another row or column later on
- Solution is to use built-in functions
- Select C5
- Enter the formula AVERAGE(C2:C4)
- C2:C4 is a range of cells
- Copy and paste into D5 to calculate the average final score
- Add MAX below average, along with a label
- Spreadsheets are programs, and programs should be documented
- Then insert a row above average for MIN
- The spreadsheet does the right thing with relative references when rows are inserted or deleted
![[Minimum, Average, and Maximum Scores]](./img/spreadsheets/scores_min_ave_max.png)
Figure 21.6: Minimum, Average, and Maximum Scores
Commonly-Used Functions
| Function | Purpose |
|---|---|
| AND(e1,e2,...) | True if all expressions are true; false otherwise |
| AVERAGE(values) | Return the average of the given values (which may be a range) |
| DATE(year,month,day) | Return the number of days since January 1, 1900 for the given date |
| INDEX(array,row,col) | Return the section of an array indexed by row and column indices |
| LOOKUP(value,lookup_vector,result_vector) | Find a value in a lookup vector, and return the corresponding entry from the result vector |
| NOT(e) | True if the expression is 0; false otherwise |
| OR(e1,e2,...) | True if any expression is true; false otherwise |
| RAND() | Return a random value between 0 and 1 |
| REPLACE(old,start,num,new) | Replace part of a string |
| ROUND(number) | Round off a number |
| SIN(e) | Return the sine of an expression |
| TODAY() | Return the number of days since January 1, 1900 for today |

Table 21.1: Gnumeric Functions
Dependencies
- Spreadsheets update dependencies between cells automatically
- Just like Make updates dependencies between files
- A declarative programming language
- Example:
- Copy the equation for overall grade from E4 into E5, E6, and E7
- Change John von Neumann's final score from 68 to 86
- Several cells' values immediately change
![[Dependencies Between Cells]](./img/spreadsheets/scores_dependencies_highlighted.png)
Figure 21.7: Dependencies Between Cells
- This is why people use spreadsheets: the data is the program
Conditionals
- Who remembered enough to be able to carry out simple tasks?
- Field studies show that people need at least 75% recall
- Syntax for a conditional is IF(condition,true_value,false_value)
- First argument to function must be a Boolean expression
- Second is the function's value if the first argument is true
- Third is its value if the first argument is false
- Put IF(E2>=75,"success","failure") in cell F2
- Copy and paste into F3 and F4
![[Conditionals]](./img/spreadsheets/scores_conditional.png)
Figure 21.8: Conditionals
Multi-Valued Conditionals
- Want more specific diagnosis than just “success” or “failure”
- Could use nested conditional expressions
IF(E2<70,"Failure",IF(E2<80,"Marginal",IF(E2<86,"Good","Excellent")))
- But it would be hard to read
- And what if there were a hundred options?
Lookup Tables
- Use a lookup table instead
- Find a value in one row, and use the contents of a corresponding cell from another row
- Syntax is LOOKUP(value,lookup_vector,result_vector)
- value is a single cell
- lookup_vector is part of a single row or column
- If the value isn't in the lookup vector, the spreadsheet uses the nearest cell whose value is less than or equal to it
- So 84.6 falls back to the cell containing the value 80
- lookup_vector must be sorted in order for this to work
- The result vector must be exactly the same length as the lookup vector
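The lookup rule can be sketched in Python with a binary search. This is an illustration of the rule as described above, not Gnumeric's actual implementation; the cutoffs and labels are taken from the nested IF on the previous slide, on the assumption that the lookup table uses the same ones:

```python
from bisect import bisect_right

def lookup(value, lookup_vector, result_vector):
    """Return the result paired with the largest lookup entry <= value."""
    if len(lookup_vector) != len(result_vector):
        raise ValueError('vectors must be the same length')
    # bisect_right only works if lookup_vector is sorted in ascending order.
    position = bisect_right(lookup_vector, value)
    if position == 0:
        # Nothing in the vector is <= value; a spreadsheet would typically
        # report an error in this case.
        raise LookupError('value is smaller than every entry')
    return result_vector[position - 1]

cutoffs = [0, 70, 80, 86]                                # assumed lookup vector
labels = ['Failure', 'Marginal', 'Good', 'Excellent']    # assumed result vector

print(lookup(84.6, cutoffs, labels))
```

Adding a new category means adding one entry to each vector, rather than another level of nesting.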
Lookup Table Example
- Put cutoffs and evaluations in rows 9 and 10
- Put LOOKUP(E2,B9:E9,B10:E10) in cell F2
![[Looking Up Results]](./img/spreadsheets/scores_lookup_results.png)
Figure 21.9: Looking Up Results
Absolute References
- Copy and paste into F3 to look up letter grade for John von Neumann
- Displays #N/A, meaning “not available”
- Formula has been adjusted to use B10:E10 as the lookup vector and B11:E11 as the result vector
- But these cells are empty
![[Lookup Failure]](./img/spreadsheets/scores_lookup_error.png)
Figure 21.10: Lookup Failure
Adjusting The Formula
- Use absolute references to prevent this
- $B$9 means cell B9 even when the formula is copied, rows and columns are inserted or deleted, etc.
- So change the formula for F2 to LOOKUP(E2,$B$9:$E$9,$B$10:$E$10)
![[Absolute References in Formulas]](./img/spreadsheets/scores_absolute_references.png)
Figure 21.11: Absolute References in Formulas
A Larger Data Set
- Open solarsystem.csv
- Orbital information about planets and satellites
- Data stored as comma-separated values
Name,Position,Orbits,Distance,Period,Inclination,Eccentricity
,,,x1000km,days,degrees,degrees
Sun,-,-,-,-,-,-
Mercury,1,Sun,57910,87.97,7.00,0.21
Venus,2,Sun,108200,224.70,3.39,0.01
Earth,3,Sun,149600,365.26,0.00,0.02
Mars,4,Sun,227940,686.98,1.85,0.09
Jupiter,5,Sun,778330,4332.71,1.31,0.05
Saturn,6,Sun,1429400,10759.50,2.49,0.06
Uranus,7,Sun,2870990,30685.00,0.77,0.05
- Click on the row below the two title rows, and select View...Freeze Panes
- Scrolling will now leave the title rows in place
![[Scrolling the Solar System]](./img/spreadsheets/solarsystem_scroll.png)
Figure 21.12: Scrolling the Solar System
Creating Charts
- How are period and distance related?
- The best (often, the only) way to understand large data sets is to view them graphically
- To create a chart:
- Select the range D4:E78 (i.e., everything except the Sun)
- Select Insert...Chart
- Choose an XY scatterplot
![[Basic Chart]](./img/spreadsheets/solarsystem_with_basic_chart.png)
Figure 21.13: Basic Chart
Customizing The Display
- Too many points crowded against left edge of chart
- Not surprising: values for Pluto and Jupiter's innermost satellites span orders of magnitude
- Try log-log plot
- Each ten-fold increase in scale is a single unit on the axis
- Can do it by setting options when creating the chart
- More informative in this case to do it by hand
Creating A Log-Log Chart
- Set I4 to log(D4), then copy, select I5:I78, and paste
- Pastes into all selected cells at once
- Set J4 to log(E4), copy, and paste into
- Error: J28 and others display #NUM!
![[Error Creating Log-Log Plot]](./img/spreadsheets/solarsystem_log_log_error.png)
Figure 21.14: Error Creating Log-Log Plot
Fixing The Error
- The data uses negative periods for retrograde motion
- Have to scrub the data in order to use it
- Change formula to log(abs(E4))
- Now plot columns I and J
- Add X and Y axis titles along the way
![[Log-Log Plot of Distances and Periods]](./img/spreadsheets/solarsystem_log_log.png)
Figure 21.15: Log-Log Plot of Distances and Periods
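The same scrubbing step can be sketched in Python 3. This is only an illustration: the helper name `log_abs` and the sample periods are made up, and it assumes the spreadsheet's log is a base-10 logarithm.

```python
import math

def log_abs(value):
    """Return log10 of the magnitude of value, or None when the magnitude is zero."""
    magnitude = abs(value)      # retrograde motion is stored as a negative period
    if magnitude == 0:
        return None             # the logarithm of zero is undefined
    return math.log10(magnitude)

periods = [87.97, 365.26, -5.88]   # the last value is a made-up retrograde period
print([log_abs(p) for p in periods])
```

Taking the absolute value first is what keeps the negative entries from producing the equivalent of #NUM!.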
Analysis
- Looks like several straight lines lying beside one another
- The rightmost has nine points…
- …and there are nine planets
- Do the other lines correspond to Jupiter, Saturn, and other orbited bodies?
- Often use spreadsheets for this kind of exploratory data analysis
- Look for things that might be patterns
- Then apply real statistical tools to see if they are
- It's not science until you do the second step
- The human eye is very good at “seeing” patterns that aren't there
Programming
- FIXME: describe how to manipulate Gnumeric data from Python
Summary
Exercises
Exercise 21.1:
Spreadsheets use conditional expressions, rather than
conditional statements. C/C++, Java, and Python also support
conditional expressions. How are they written? When should you
use them? When shouldn't you?
Exercise 21.2:
$B$9 is an absolute reference to the cell B9. What does
the expression $B9 refer to? What about B$9?
When would you use expressions like these?
Integration
Introduction
- Good programmers don't write programs: they assemble them
- Combine tools and libraries that others have written
- Thereby creating something that others can then recombine
- This lecture explores various ways to combine things
- Helps a lot to design with combination in mind
You Can Skip This Lecture If...
- You know how to run an external program from Python
- You know how to load a module dynamically
- You know how to inspect the contents of a module or class
- You know how to call a C function from Python…
- …and a Python function from C
Running External Programs
- There are lots of old command-line programs in the world
- And lots of GUI programs that have command-line interfaces
- The older they are, the stronger the argument for leaving them alone
- The less you change, the less will break
- Instead, run the program as-is from Python (or some other high-level language)
- Talk to the web, create 3D graphics, etc., in Python
- Run the legacy program to do the calculation, and parse its output
The subprocess Module
- Python's subprocess module lets you run external programs
- Connect to their standard input, output, and error (just like pipes)
- Capture their return codes
- Defines a single class called Popen
- Takes up to 14 (!) options
- Common cases only use three or four of these
- Does its best to behave the same on Unix and Windows
- Read the documentation before doing anything particularly tricky
Running In Place
- Simple usage is Popen("cmd"), where "cmd" is the program to be run
- New process created
- Inherits the parent's stdin, stdout, stderr, working directory, and environment variables
import subprocess
subprocess.Popen("date")
Mon Apr 3 09:05:39 EST 2006
Running With Arguments
- Pass command-line arguments by giving Popen a list
import subprocess
subprocess.Popen(["date", "-u"])
Mon Apr 3 13:06:27 UTC 2006
- Can also:
- Specify a working directory for the child process (the cwd parameter)
- Provide or override environment variables (the env parameter)
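A small sketch of the cwd and env parameters, written for Python 3 rather than the Python 2 used in these notes. The child is another Python interpreter so that the example does not depend on any particular command being installed; the variable name GREETING is made up.

```python
import os
import subprocess
import sys
import tempfile

workdir = tempfile.gettempdir()
child_env = dict(os.environ, GREETING="hello")   # GREETING is a made-up variable

# The child prints the variable it was given and the directory it is running in.
script = "import os; print(os.environ['GREETING']); print(os.getcwd())"
child = subprocess.Popen([sys.executable, "-c", script],
                         cwd=workdir, env=child_env,
                         stdout=subprocess.PIPE)
output, _ = child.communicate()
lines = output.decode().splitlines()
print(lines[0])            # value of GREETING as seen by the child
print(child.returncode)    # 0 means the child exited normally
```

Copying os.environ and adding to it, rather than passing a nearly empty dictionary, keeps settings such as PATH available to the child.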
Capturing Output
- Often useful to run a program and capture its output
- E.g., legacy program prints records from a database
- Do this by:
- Setting Popen's stdout parameter to PIPE
- Reading from the object's stdout member
import subprocess
SQL = 'select * from Person'
child = subprocess.Popen(['sqlite3', 'experiment.db', SQL],
                         stdout=subprocess.PIPE)
lines = child.stdout.readlines()
for line in lines:
    line = line.strip().split('|')
    print '%s %s (%s)' % (line[1], line[2], line[0])
Kovalevskaya Sofia (skol)
Lomonosov Mikhail (mlom)
Mendeleev Dmitri (dmitri)
Pavlov Ivan (ivan)
- Note: the SQL is passed as a single argument
Providing Input
- Can also pipe data to a child by setting stdin to PIPE
- Example: compress output on its way to a file
- In reality, better to use zlib, gzip, or bz2 libraries
def pipe_write(filename, lines):
    child = subprocess.Popen(['gzip', '-c'],
                             stdin=subprocess.PIPE,
                             stdout=subprocess.PIPE)
    for line in lines:
        child.stdin.write(line)
    child.stdin.close()
    result = child.stdout.read()
    return result
- Create a child process to run gzip -c
- -c meaning “write result to standard output”
- Send data by writing to child's stdin
- Once all data has been sent, read compressed result from child's stdout
Deadlock
- Example above doesn't scale to large data sets
- Operating system can only buffer a limited amount of data
- You can increase this limit, but can't make it infinite
- Program can deadlock
- Parent and child are each waiting for the other to read
- So neither does
- Solution is to use Popen.communicate
- Sends data to the child process's stdin
- Reads from stdout and stderr at the same time
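A sketch of communicate in use, written for Python 3. The child here is a stand-in (a Python one-liner that upper-cases its input), chosen only so the example runs anywhere; the payload is made up.

```python
import subprocess
import sys

# Stand-in child: a Python one-liner that upper-cases whatever it reads.
script = "import sys; sys.stdout.write(sys.stdin.read().upper())"
payload = ("spam and eggs\n" * 50000).encode()   # roughly 700 KB of input

child = subprocess.Popen([sys.executable, "-c", script],
                         stdin=subprocess.PIPE,
                         stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE)
# communicate() feeds stdin and drains stdout and stderr together,
# so the parent never sits blocked on one pipe while another fills up.
out, err = child.communicate(payload)
print(len(out), child.returncode)
```

The call returns only after the child has exited, so the return code is available immediately afterwards.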
Pros and Cons
- Pro:
- Often the quickest thing to set up
- Involves fewest changes to legacy code (i.e., least risk)
- Con:
- Legacy application may not expose the needed functionality
- Managing parent/child interactions is tricky
- I.e., easy to break, and hard to debug
Plan B: Integrating with C
- Been saying since the first lecture that you should:
- Write the first version in Python (or some other very high-level language)
- Find out whether it's fast enough, and if not, what's slowing it down
- Optimize only those parts that you need to
- The most effective way to optimize programs written in high-level languages is to find more efficient algorithms
- The second most effective way is often to rewrite core modules in a low-level language like C
- Also a good way to handle inherited code: wrapping tried-and-trusted C or Fortran in Python is safer and easier than rewriting
- And faster
- If you don't speak C, [Kernighan & Ritchie 1998] is the standard introduction
How Python Represents Objects
- Python represents things using a C structure of type PyObject
- Include Python.h to get its definition
![[PyObject]](./img/integrate/pyobject.png)
Figure 22.1: PyObject
- The type code tells the interpreter how to interpret the rest of the structure
- The union is large enough to hold any basic value
- In Python's case, the two 64-bit values needed to store a complex number
- The reference count keeps track of how many other objects are pointing to this one
- When a thing is created, its reference count is initialized to 1
- When its count drops to 0, Python can garbage collect it
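Reference counts can be watched from Python itself. The sketch below is written for Python 3 and relies on sys.getrefcount, which is specific to the CPython interpreter; the absolute numbers vary between versions, so only the changes are meaningful.

```python
import sys

thing = ["a", "fresh", "list"]
before = sys.getrefcount(thing)   # may include a temporary reference held by the call

alias = thing                     # a second name now points at the same object
after_alias = sys.getrefcount(thing)

del alias                         # removing the name releases that reference
after_del = sys.getrefcount(thing)

print(before, after_alias, after_del)
```

The count rises by one when the alias is created and falls back when it is deleted.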
Calling Conventions
- Every C function that the interpreter calls must take two arguments:
- self is NULL for pure functions, and an object for methods
- args is a variable-length tuple of arguments
- Use the function PyArg_ParseTuple to extract arguments' values
- The function returns NULL to signal an error
- Otherwise, uses Py_BuildValue to build a Python structure with the result value
- Example: take an integer and return three times its value
#include <Python.h>
/* Triple an integer value. */
static PyObject * triple(PyObject * self, PyObject * args)
{
    int val;
    /* "i" means "expect exactly one integer argument". */
    if (!PyArg_ParseTuple(args, "i", &val)) {
        return NULL;
    }
    val = val * 3;
    return Py_BuildValue("i", val);
}
Boilerplate
- Need some boilerplate to bring this function to the Python interpreter's attention
/* Table of module contents (handed back to Python at initialization). */
static PyMethodDef contents[] = {
    {"triple", triple, METH_VARARGS},
    {NULL, NULL}
};
/* Initialization function. */
void inittriple()
{
    Py_InitModule("triple", contents);
}
- The array contents has one entry for each function
- The initialization function:
- Has a name of the form initXYZ for the module XYZ
- Calls Py_InitModule to pass the table of module contents to the interpreter
Loading and Calling
- Compile the C to create a shared library (.dll on Windows, .so on Unix)
- Put the shared library in a directory that's on Python's search path
- Then import and use as if it were written in Python
import triple
print triple.triple(11)
33
What About C++?
- Connecting Python and C++ is harder
- C++ has many features that don't have Python equivalents (e.g., templates)
- Many of the analogous features have different semantics (e.g., exceptions)
- Every compiler and platform has its own Application Binary Interface
- Much wider variation than there is for C libraries
- Boost.Python does the best it can (which is pretty good)
SWIG
- It's simpler with SWIG (the Simplified Wrapper and Interface Generator)
- SWIG can also generate wrappers for Perl, Java, and other high-level languages
- The more you plan for change, the less often you'll have to change your plan
- Similar tools exist to connect Python to Fortran (e.g. F2PY and Pyfort)
Integrating the Other Way
- Can also go the other way, and embed Python in C/C++
- Every large application eventually needs an interactive command interpreter
- To embed:
- Initialize a Python interpreter object
- Convert application values into Python objects
- Pass the interpreter a string containing the code to be executed, and the values to execute it on
- Unwrap the result
- Much less common than wrapping
- Multilanguage programming isn't simple
- And multilanguage debugging is downright hard
- But both are often simpler (as well as more efficient) than the alternatives
Loading Modules
- What happens when Python executes import stuff?
- Look in the directories listed in PYTHONPATH for stuff.py
- Read into memory
- Compile
- Create a module object to keep track of the things just compiled
![[Loading a Module]](./img/integrate/loading_module.png)
Figure 22.2: Loading a Module
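Some of this machinery is visible from inside a running program. The Python 3 sketch below uses the standard json module purely as an example of something to import.

```python
import sys
import json   # any standard-library module would do for this illustration

# The directories that are searched, in order; PYTHONPATH entries show up here.
for directory in sys.path[:3]:
    print(directory)

# Once loaded, the module object is cached under its name...
module = sys.modules["json"]
print(type(module).__name__)

# ...so a second import hands back the same object instead of recompiling.
import json as json_again
print(json_again is module)
```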
Plugin Frameworks
- All modern languages let you do this programmatically
- Load code (pre-compiled or not) from a file on disk
- Add it to your program
- Call it as if it had always been there
- Which is why most modern programs are built as frameworks
- The “program” knows how to load modules and pass data between them
- Modules provide different image processing operations, alternative ocean circulation models, etc.
- This modularization helps development as well as usability
- Better testability: well-defined interfaces between self-contained objects
- Easier maintenance: can replace things one at a time
Manual Loading
- Use the __import__ function to load a file
- Resulting module object stores its contents in a dictionary (its __dict__)
- Note that the module must be on Python's search path
- Use vars to find out what a module object contains
- Can also be applied to classes and class instances
- Using code to examine other code is called reflection
- Example: list the contents of a Python file
import sys, os
def list_contents(module_name):
    print module_name
    if os.path.dirname(module_name) not in sys.path:
        sys.path.append(os.path.dirname(module_name))
    try:
        module = __import__(module_name)
        for name in vars(module):
            print '\t' + name
    except ImportError:
        print >> sys.stderr, 'Unable to import %s' % module_name
if __name__ == '__main__':
    for module_name in sys.argv[1:]:
        list_contents(module_name)
$ python lister.py lister
lister
__builtins__
__name__
__file__
list_contents
__doc__
- Note: adding a module's path to sys.path is not something you should do in general…
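vars works on classes and instances as well as modules. The sketch below is written for Python 3 and uses importlib.import_module, the documented wrapper around __import__; the Planet class is hypothetical and exists only to have something to inspect.

```python
import importlib

module = importlib.import_module("string")   # load a module by name at run time
print(sorted(name for name in vars(module) if not name.startswith("_"))[:5])

class Planet(object):
    """Hypothetical class used only to show reflection on classes."""
    moons = 0
    def __init__(self, name):
        self.name = name

print("moons" in vars(Planet))     # class attributes live in the class
print(vars(Planet("Mars")))        # instance attributes live in the instance
```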
Using Manual Loading
- Have several different user interfaces, or finite difference grids, or…
- Load code based on specification in a configuration file
- Makes it easy to add new options after the fact
class LoaderError(Exception):
    pass

def loader(config_file):
    result = {}
    imported = {}
    infile = open(config_file, 'r')
    # Each line of the configuration file is: name module function
    for line in infile:
        name, module, func = line.split()
        if name in result:
            raise LoaderError('Trying to set name %s twice' % name)
        if module not in imported:
            imported[module] = __import__(module)
        if func not in vars(imported[module]):
            raise LoaderError('Function %s not in module %s' % (func, module))
        # Store the function object itself, not just its name.
        result[name] = getattr(imported[module], func)
    infile.close()
    return result
Manipulating Namespaces
- Can take this one step further and make dynamically-loaded objects look like “normal” variables
- Function globals returns a dictionary of global variables
- Adding items to this “creates” new global variables
$ python
Python 2.4.1 (#1, May 27 2005, 18:02:40)
[GCC 3.3.3 (cygwin special)] on cygwin
>>> G = globals()
>>> G
{'__builtins__': <module '__builtin__' (built-in)>, '__name__': '__main__', '__doc__': None, 'G': {...}}
>>> a = 1
>>> G
{'__builtins__': <module '__builtin__' (built-in)>, '__name__': '__main__', 'a': 1, '__doc__': None, 'G': {...}}
>>> del G['a']
>>> G
{'__builtins__': <module '__builtin__' (built-in)>, '__name__': '__main__', '__doc__': None, 'G': {...}}
>>> G['b'] = 2
>>> G
{'b': 2, 'G': {...}, '__builtins__': <module '__builtin__' (built-in)>, '__name__': '__main__', '__doc__': None}
>>> def double(x):
...     return 2 * x
...
>>> G['d'] = double
>>> del G['double']
>>> d
<function double at 0x4d68b4>
>>> G
{'b': 2, 'd': <function double at 0x4d68b4>, 'G': {...}, '__builtins__': <module '__builtin__' (built-in)>, '__name__': '__main__', '__doc__': None}
- You can and should do all of this in C, Fortran, Java, C#, etc.
- Mechanics depend on language and operating system
Summary
- Re-using is usually more productive than rewriting
- Use new code to run old
- Open up the old code so that it can be called from the new
- Remember that programs are just data
- Program source is just text in a file
- A running program is just a data structure in memory
- Take advantage of this to make your programs leaner, cleaner, and easier to maintain
Web Client Programming
Introduction
- The Internet is changing everything
- Distributed programs are different from unitary ones
- Distributed teams work differently from collocated ones
- This lecture looks at how to build programs that get data from the web
- If you want to know more, see [Goerzen 2004]
You Can Skip This Lecture If...
- You know what TCP, UDP, and DNS stand for
- You know what a socket is
- You know how HTTP requests and responses are formatted
- You know how to append parameters to a URL
- You know what screen scraping is, and why you shouldn't do it
Small Pieces, Loosely Joined
- The Unix command line was the world's first component object model
- Programmers build small pieces, then connect them in arbitrary ways
- Key features:
- Low cost of entry: it's easy to add one more tool to the toolbox
- Common data format: stream of strings
- Common communication protocol: stdin, stdout, and zero/nonzero exit codes
- The Web grew so quickly because it replicated these strengths
- Everything used HTML (data format) over HTTP (communication protocol)
Distributed Is Different
- Distributed systems are fundamentally different from unitary systems
- Small programs (like the ones in this lesson) can ignore these differences…
- …but every industrial-strength application eventually has to deal with them
- Difference #1: concurrency
- As in databases, means “several things happening at once”
- Can lead to:
- Deadlock: A is waiting for B while B is waiting for A
- Race conditions: final result depends on whether A or B goes last
Partial Failure
- Difference #2: partial failure
- One component fails while others are still healthy
- If you've waited five seconds for a web site to respond, should you assume that it's down, or keep waiting?
- Both differences make distributed applications much harder to debug than unitary ones
- Often have heisenbugs (which only appear intermittently)
- And it's usually impossible to get a complete picture of the system's state
- Only way to get a distributed system right is to build it right in the first place
Under the Hood
Sockets
- Using IP, processes communicate through sockets
- Each socket is one end of a point-to-point communication channel
- Provides the same kind of read and write operations as files
- The socket's host address identifies a machine
- Consists of four 8-bit numbers, like "24.153.22.195"
- The Domain Name System (DNS) gives these symbolic names like "www.third-bit.com"
- Use nslookup to talk to DNS directly
- The socket's port is just a number in the range 0-65535
- 0-1023 are reserved for the operating system's use
![[Sockets]](./img/client/sockets.png)
Figure 23.1: Sockets
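The "four 8-bit numbers" can be seen directly with the standard socket module. This Python 3 sketch only converts between representations, so it needs no network connection.

```python
import socket

packed = socket.inet_aton("24.153.22.195")   # dotted quad -> 4 raw bytes
octets = list(packed)                        # each entry is one 8-bit number
print(octets)
print(socket.inet_ntoa(packed))              # and back to the text form
```

Each of the four values fits in the range 0-255, which is why an address written this way is 32 bits long.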
Client/Server vs. Peer-to-Peer
Socket Client
import sys, socket
buffer_size = 1024 # bytes
host = '127.0.0.1' # local machine
port = 19073 # hope nobody else is using it...
message = 'ping!' # what to send
# AF_INET means 'Internet socket'.
# SOCK_STREAM means 'TCP'.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect((host, port))
# Send the message.
sock.send(message)
# Receive and display the reply.
data = sock.recv(buffer_size)
print 'client received', `data`
# Tidy up.
sock.close()
client received 'pong!'
Socket Server
import sys, socket
buffer_size = 1024 # bytes
host = '' # empty string means 'this machine'
port = 19073 # must agree with client
# Create and bind a socket.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind((host, port))
# Wait for a connection request (queue at most one pending connection).
s.listen(1)
sock, addr = s.accept()
print 'Connected by', addr
# Receive and display a message.
data = sock.recv(buffer_size)
print 'server saw', str(data)
# Replace vowels in reply.
data = data.replace('i', 'o')
sock.send(data)
# Tidy up.
sock.close()
Connected by ('127.0.0.1', 1297)
server saw ping!
The Hypertext Transfer Protocol
- The Hypertext Transfer Protocol (HTTP) specifies how programs exchange documents over the web
![[HTTP Request Cycle]](./img/client/http_cycle.png)
Figure 23.2: HTTP Request Cycle
- Clients are typically browsers, such as Firefox
- Apache is the most widely used server, but many others exist
- The client sends a request specifying what it wants
- The server sends the contents of the file in reply
- HTTP is a stateless protocol
- Server doesn't remember anything between requests
- Every image in a web page must be requested and downloaded separately
HTTP Request Line
- An HTTP request has three parts
![[HTTP Request]](./img/client/http_request.png)
Figure 23.3: HTTP Request
- HTTP method is almost always one of:
- "GET": to fetch information
- "POST": to submit form data or upload files
- URL identifies the thing the request wants
- Typically a path to a file, such as /index.html
- But it's entirely up to the server how to interpret the URL
- HTTP version is usually "HTTP/1.0"
- Occasionally see "HTTP/1.1"
Headers
- An HTTP header is a key/value pair
- "Accept: text/html"
- "Accept-Language: en, fr"
- "If-Modified-Since: 16-May-2005"
- Unlike a dictionary, a key may appear any number of times
- So a request can specify that it's willing to accept several types of content
Body
- The body is any extra data associated with the request
- Used with web forms, to upload files, etc.
- Must be a blank line between the last header and the start of the body
- Signals the end of the headers
- Forgetting it is a common mistake
- The "Content-Length" header tells the server how many bytes to read
- Note: there's no magic in any of this
- An HTTP request is just text: any program that wants to can create or parse one
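Since a request is just text, it can be assembled with ordinary string operations. The Python 3 sketch below is one way to do it; the function name `build_request` and the path are made up, and lines are joined with carriage return plus line feed, which is the line ending the HTTP specification calls for.

```python
def build_request(method, path, headers, body=""):
    """Assemble request line, headers, blank line, and body into one string."""
    lines = ["%s %s HTTP/1.0" % (method, path)]
    if body:
        # Content-Length counts bytes, so measure the encoded body.
        headers = headers + [("Content-Length", str(len(body.encode("utf-8"))))]
    for key, value in headers:
        lines.append("%s: %s" % (key, value))
    lines.append("")                      # blank line marks the end of the headers
    return "\r\n".join(lines) + "\r\n" + body

request = build_request("POST", "/cgi-bin/register.py",       # hypothetical path
                        [("Accept", "text/html"), ("Accept", "text/plain")],
                        "name=Ada")
print(request)
```

Headers are passed as a list of pairs rather than a dictionary so that the same key can appear more than once.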
HTTP Response
![[HTTP Response]](./img/client/http_response.png)
Figure 23.4: HTTP Response
- HTTP version, headers, and body have the same form, and mean the same thing
- Status code is a number indicating what happened
- 200: everything worked
- 404: page not found
- Status phrase repeats that information in a human-readable phrase (like “OK” or “not found”)
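Taking a response apart is equally mechanical. This is a minimal Python 3 sketch that assumes a complete, well-formed response is already in memory as a string; the function name `parse_response` and the sample response are made up.

```python
def parse_response(raw):
    """Split a raw HTTP response into (version, code, phrase, headers, body)."""
    head, _, body = raw.partition("\r\n\r\n")     # blank line separates headers from body
    lines = head.split("\r\n")
    version, code, phrase = lines[0].split(" ", 2)
    headers = []                                  # a list, because keys may repeat
    for line in lines[1:]:
        key, _, value = line.partition(":")
        headers.append((key.strip(), value.strip()))
    return version, int(code), phrase, headers, body

raw = ("HTTP/1.1 404 Not Found\r\n"
       "Content-Type: text/html\r\n"
       "Content-Length: 9\r\n"
       "\r\n"
       "<p>no</p>")
print(parse_response(raw))
```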
HTTP Response Codes
| Code | Name | Meaning |
|---|---|---|
| 100 | Continue | Client should continue sending data |
| 200 | OK | The request has succeeded |
| 204 | No Content | The server has completed the request, but doesn't need to return any data |
| 301 | Moved Permanently | The requested resource has moved to a new permanent location |
| 307 | Temporary Redirect | The requested resource is temporarily at a different location |
| 400 | Bad Request | The request is badly formatted |
| 401 | Unauthorized | The request requires authentication |
| 404 | Not Found | The requested resource could not be found |
| 408 | Request Timeout | The server gave up waiting for the client |
| 500 | Internal Server Error | An error occurred in the server that prevented it fulfilling the request |
| 601 | Connection Timed Out | Not a standard HTTP code: the server did not respond before the connection timed out |

Table 23.1: HTTP Response Codes
HTTP Example
- Fetch a page from the course site
- Request has no headers, so the blank line that signals “end of headers” is right after the request line
import sys, socket
buffer_size = 1024
# Request line followed by the blank line that ends the (empty) headers.
HttpRequest = '''GET /greeting.html HTTP/1.0

'''
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(('www.third-bit.com', 80))
sock.send(HttpRequest)
# Keep reading until the server closes the connection.
response = ''
while True:
    data = sock.recv(buffer_size)
    if not data:
        break
    response += data
sock.close()
print response
HTTP/1.1 200 OK
Date: Fri, 03 Mar 2006 18:12:55 GMT
Server: Apache/2.0.54 (Debian GNU/Linux)
Last-Modified: Fri, 03 Mar 2006 18:12:23 GMT
Content-Length: 92
Content-Type: text/html
<html>
<head><title>Greeting Page</title></head>
<body>
<h1>Greetings!</h1>
</body>
</html>
- Note: the double parentheses in the call to sock.connect are deliberate
- Method's argument is a (host, port) tuple
Fetching Pages
- Opening sockets, constructing HTTP requests, and parsing responses is tedious
- So most languages provide libraries to do the work for you
- In Python, that library is called urllib
- urllib.urlopen(URL) does what your browser would do if you gave it the URL
- Parse it to figure out what server to connect to
- Connect to that server
- Send an HTTP request
- Returns an object that looks like a file, from which to read response data
urllib Example
- Read a page the easy way
- Note: readlines wouldn't do the right thing if the thing being read was an image
- Might try to convert “line endings”
- Use read to grab the bytes in that case
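A minimal sketch of reading a page this way, written for Python 3, where the function these notes call urllib.urlopen lives in urllib.request. The helper name `fetch` is made up, and a data: URL is used so the example runs without a network connection; an http:// URL is handled the same way.

```python
from urllib.request import urlopen   # Python 3 home of urlopen

def fetch(url):
    """Return the raw bytes found at url."""
    instream = urlopen(url)
    try:
        return instream.read()       # read() is safe for text and binary data alike
    finally:
        instream.close()

# A data: URL carries its own content, so nothing is sent over the network.
print(fetch("data:text/plain,Greetings!"))
```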
Building A Spider
- A web spider is a program that can explore the web on its own
- Fetch a page, extract all the external links, visit those pages…
- That, a search engine, and a few billion dollars, and you're Google
import sys, urllib, re
url = sys.argv[1]
instream = urllib.urlopen(url)
page = instream.read()
instream.close()
links = re.findall(r'href=\"[^\"]+\"', page)
temp = set()
for x in links:
    x = x[6:-1] # strip off 'href="' and '"'
    if x.startswith('http://'):
        temp.add(x)
links = list(temp)
links.sort()
for x in links:
    print x
$ python spider.py http://www.google.ca
http://groups.google.ca/grphp?hl=en&tab=wg&ie=UTF-8
http://news.google.ca/nwshp?hl=en&tab=wn&ie=UTF-8
http://scholar.google.com/schhp?hl=en&tab=ws&ie=UTF-8
http://www.google.ca/fr
Passing Parameters
- Sometimes want to provide extra information as part of a URL
- Example: when searching on Google, have to specify what the search terms are
- Could do this as part of the URL
- Amazon puts ISBNs in URLs
- More flexible to add parameters to the URL
- http://www.google.ca?q=Python searches for pages related to Python
- "?" separates the parameters from the rest of the URL
- If there are multiple parameters, they are separated from each other by "&"
- E.g., http://www.google.ca/search?q=Python&client=firefox
Special Characters
- What if you want to include "?" or "&" in a parameter?
- Same problem (and solution) as including a quote in a string, or <> in XML
- URL encode special characters using "%" followed by a 2-digit hexadecimal code
- And replace spaces with "+"

| Character | Encoding |
|---|---|
| "#" | %23 |
| "$" | %24 |
| "%" | %25 |
| "&" | %26 |
| "+" | %2B |
| "," | %2C |
| "/" | %2F |
| ":" | %3A |
| ";" | %3B |
| "=" | %3D |
| "?" | %3F |
| "@" | %40 |

Table 23.2: URL Encoding
Encoding Example
- To search Google for “grade = A+”, use http://www.google.ca/search?q=grade+%3D+A%2B
- urllib has functions to make this easy
- urllib.quote(str) replaces special characters in str with escape sequences
- urllib.unquote(str) replaces escape sequences with characters
- urllib.urlencode(params) takes a dictionary and constructs the entire query parameter string
import urllib
print urllib.urlencode({'surname' : 'Von Neumann', 'forename' : 'John'})
surname=Von+Neumann&forename=John
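The round trip can be checked directly. This sketch is written for Python 3, where these functions live in urllib.parse; it uses the quote_plus and unquote_plus variants, which also translate between spaces and "+".

```python
from urllib.parse import quote_plus, unquote_plus, urlencode

encoded = quote_plus("grade = A+")
print(encoded)                    # grade+%3D+A%2B
print(unquote_plus(encoded))      # grade = A+

# A list of pairs keeps the parameters in a predictable order.
print(urlencode([("q", "grade = A+"), ("client", "firefox")]))
```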
Screen Scraping (And Why Not)
- Suppose you want to write a script that actually does search Google
- Construct a URL: easy
- Send it and read the response: no problem
- Parse the response: there's a lot of junk on the page…
- Many first-generation web applications relied on screen scraping
- “Parse” the HTML with regular expressions
- Hard to get right if the page layout is complex
- And whenever the layout changes, the application breaks
Web Services
- Modern web services separate data from presentation
- When a client sends a request, it indicates that it wants machine-readable XML, rather than human-readable HTML
- Much easier to parse
- Much less likely to change over time
![[Web Services]](./img/client/web_services.png)
Figure 23.5: Web Services
- Many web services use the Simple Object Access Protocol (SOAP) standard
- Despite its name, it's anything but simple
- Luckily, there are libraries to hide the details for most widely-used web services
Example: Amazon
- Amazon has defined an API for web services
- You need to get a license key in order to use it
- They're free
- But they allow Amazon to throttle requests to one per second per client
- PyAmazon turns parameters into a URL, and converts the XML reply into Python objects
import sys, amazon
# Format multiple authors' names nicely.
def prettyName(arg):
    if type(arg) in (list, tuple):
        arg = ', '.join(arg[:-1]) + ' and ' + arg[-1]
    return arg
if __name__ == '__main__':
    # Get information.
    key, asin = sys.argv[1], sys.argv[2]
    amazon.setLicense(key)
    items = amazon.searchByASIN(asin)
    # Handle errors.
    if not items:
        print 'Nothing found for', asin
        sys.exit(1)  # nothing to display, so stop here
    if len(items) > 1:
        print len(items), 'items found for', asin
    # Display information.
    item = items[0]
    productName = item.ProductName
    ourPrice = item.OurPrice
    authors = prettyName(item.Authors.Author)
    print '%s: %s (%s)' % (authors, productName, ourPrice)
$ python findbook.py 123ABCDEFGHIJKL4MN56 0974514071
Greg Wilson: Data Crunching : Solve Everyday Problems Using Java, Python, and more. ($18.87)
- Note: much more code devoted to creating human-readable output than to getting the information
Summary
- Most computers now spend more time communicating than they do calculating
- Every few years, we put another layer on top of the pile of protocols to make communication easier
- TCP to HTTP to web services to…?
- Getting information from the web is now (almost) as easy as getting it from a file
- See in the next lecture how to provide information to others
Web Server Programming
Introduction
- Most of the web's power comes from the fact that browsers can interact with programs
- More accurately, browsers can ask web servers to run programs on their behalf
- This lecture looks at what to do if you receive an HTTP request
- Very important that you go through the lecture on Security before putting your programs on the web
You Can Skip This Lecture If...
- You know what CGI stands for
- You know how web servers communicate with CGI programs
- You know what MIME types are
- You know how to create an HTML form
- You know how to get HTML form data from an HTTP request
- You know when and how to create cookies
The Pluggable Web
- Users want to make the web do different things
- How to let them write programs that handle HTTP requests?
- Option #1: Require them to write socket-level code
- Complicated and error-prone
- Can only have one program listening to a socket at a time
- Option #2: have the web server accept the HTTP request, and then run the user's code
- Recompiling the web server every time someone wants to add functionality would be a pain
- So define a protocol that lets web servers run other programs
The CGI Protocol
- The Common Gateway Interface (CGI) protocol specifies:
- How a web server passes information to a program
- How that program passes information back to the web server
- CGI does not specify:
- A particular language
- You can use Fortran, the shell, C, Java, Perl, Python…
- How the web server figures out what program to run
- Each web server has its own rules
- We'll (briefly) talk about Apache's
From Server To CGI
- Web server runs the CGI by creating a new process
![[CGI Data Processing Cycle]](./img/server/cgi_round_trip.png)
Figure 24.1: CGI Data Processing Cycle
- Web server passes some information to the CGI process through environment variables
| Name | Purpose | Example |
|---|---|---|
| REQUEST_METHOD | What kind of HTTP request is being handled | GET or POST |
| SCRIPT_NAME | The path to the script that's executing | /cgi-bin/post_photo.py |
| QUERY_STRING | The query parameters following "?" in the URL | name=mydog.jpg&expires=never |
| CONTENT_TYPE | The type of any extra data being sent with the request | image/jpeg |
| CONTENT_LENGTH | How much extra data is being sent with the request (in bytes) | 17290 |

Table 24.1: Important CGI Environment Variables
- The web server may also send CONTENT_LENGTH bytes to the CGI on standard input
- E.g., when a file is being uploaded
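A sketch of how a CGI program might collect this information, written for Python 3. The function name `read_request` is made up; it takes the environment and the input stream as parameters so it can be tried with stand-in values instead of a real web server.

```python
import io
from urllib.parse import parse_qs

def read_request(environ, stdin):
    """Collect what a web server hands to a CGI program."""
    params = parse_qs(environ.get("QUERY_STRING", ""))
    length = int(environ.get("CONTENT_LENGTH") or 0)
    # Read exactly CONTENT_LENGTH bytes; there may be no end-of-file to rely on.
    body = stdin.read(length) if length else b""
    return environ.get("REQUEST_METHOD", "GET"), params, body

fake_env = {"REQUEST_METHOD": "POST",              # stand-in for os.environ
            "QUERY_STRING": "name=mydog.jpg&expires=never",
            "CONTENT_LENGTH": "4"}
print(read_request(fake_env, io.BytesIO(b"JPEG-and-more")))
```

In a real CGI program the two arguments would be os.environ and the binary standard input stream.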
From CGI To Server
- The CGI program sends data back to the web server by printing it to standard output
- The web server then forwards this directly to the client
- Which means that the CGI program is responsible for creating headers
- Note: none of this works unless the web server has been configured to run the CGI
- By default, modern servers won't do this unless they're told they can
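What the CGI program prints is therefore a complete reply: headers, a blank line, then the body. A minimal Python 3 sketch follows; the function name `hello_cgi` and the page content are made up, and the output stream is a parameter only so the result can be inspected.

```python
import sys

def hello_cgi(out=sys.stdout):
    """Write a complete CGI reply: headers, blank line, then the body."""
    out.write("Content-Type: text/html\n")
    out.write("\n")                       # blank line ends the headers
    out.write("<html><body><h1>Hello, CGI!</h1></body></html>\n")

hello_cgi()
```

Leaving out the blank line is the most common reason a browser shows an error instead of the page.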
MIME Types
- Clients and servers need a way to specify data types to each other
- Remember, bytes are just bytes: the browser doesn't magically know how to interpret them
- Multipurpose Internet Mail Extensions standard specifies how to do this
- Organizes data types into families, and provides a two-part name for each type
- Use the "Content-Type" header to specify the MIME type of the data being sent

| Family | Specific Type | Describes |
|---|---|---|
| Text | text/html | Web pages |
| Image | image/jpeg | JPEG-format image |
| Audio | audio/x-mp3 | MP3 audio file |
| Video | video/quicktime | Apple Quicktime video format |
| Application-specific data | application/pdf | Adobe PDF document |

Table 24.2: Example MIME Types
- Send comments
Hello, CGI
Invoking a CGI
- Invoke it by going to http://www.yourserver.com/cgi-bin/hello_cgi.py
- By convention, CGI programs are put in a cgi-bin directory
- Browser displays the simple HTML page generated by the program
![[Basic CGI Output]](./img/server/hello_cgi.png)
Figure 24.2: Basic CGI Output
- Send comments
Generating Dynamic Content
- But the whole point of CGI is to generate content dynamically
- E.g., show a list of environment variables and their values
#!/usr/bin/env python
import os, cgi
# Headers and an extra blank line
print 'Content-type: text/html'
print
# Body
print '<html><body>'
keys = os.environ.keys()
keys.sort()
for k in keys:
    print '<p>%s: %s</p>' % (cgi.escape(k), cgi.escape(os.environ[k]))
print '</body></html>'
- You'll use this frequently when debugging…
![[Environment Variable Output]](./img/server/show_env.png)
Figure 24.3: Environment Variable Output
- Send comments
Forms
- Next step is to allow users to enter data
- Without manually editing URLs to append parameters
- HTML forms allow users to enter text, choose items from lists, etc.
- Not nearly as sophisticated as desktop interfaces
- Although programmers are doing more every day (particularly using Javascript)
- Send comments
Creating Forms
- Create a form using a <form>…</form> element
- action attribute specifies the URL to send data to
- method attribute specifies the type of HTTP request to send
- Usually "POST" for HTML forms
- Inside the form, can have:
- <select/> elements to let users choose values from a list
- List items specified using <option/> elements
- <input/> elements for other kinds of data
- If type is "text", get a one-line text entry box
- If type is "checkbox", get an on/off checkbox
- "submit" and "reset" create buttons to submit the form, or reset the data to initial values
- Send comments
A Simple Form
Parameter Names
- Each <input/> element has a name attribute
- These become the names of the parameters that the client sends to the server
- The input elements' values are the parameters' values
- Submitting the form shown above with default values produces:
- os.environ['REQUEST_METHOD']: "POST"
- os.environ['SCRIPT_NAME']: "/cgi-bin/simple_form.py"
- os.environ['CONTENT_TYPE']: "application/x-www-form-urlencoded"
- os.environ['CONTENT_LENGTH']: "80"
- Standard input: sequence=GATTACA&search_type=Similarity+match&program=FROG-11&program=Bayes-Hart
- Send comments
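To see how the data sent on standard input maps back to parameter names and values, here is a small Python 3 sketch that decodes the form data from the example above using the standard library's `urllib.parse.parse_qs`. It is only an illustration of the encoding, not part of the original form-handling code.

```python
from urllib.parse import parse_qs

body = ('sequence=GATTACA&search_type=Similarity+match'
        '&program=FROG-11&program=Bayes-Hart')

# parse_qs always maps each name to a list of values, and decodes
# '+' back into a space.
params = parse_qs(body)
for name in sorted(params):
    print(name, params[name])

# The length of the data is what the server reports to the CGI.
print(len(body))
```

Note that `program` appears twice in the data, so it ends up with two values; this is how multiple selections from a list reach the server.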
Handling Forms
- Could handle form data directly
- Read and parse environment variables
- Read extra data from standard input
- But the mechanics are the same each time, so use Python's cgi module instead
- Defines a dictionary-like object called FieldStorage
- Keys are parameter names
- Values are either strings (if there's a single value associated with the parameter) or lists (if there are many)
- When a FieldStorage object is created, it reads and stores information contained in the URL and environment
- Which means that a CGI program should only ever create one
- Program can read extra data from sys.stdin
- Send comments
Form Handling Example
- Example: show the parameters sent to a script
#!/usr/bin/env python
import cgi
print 'Content-type: text/html'
print
print '<html><body>'
form = cgi.FieldStorage()
for key in form.keys():
    value = form.getvalue(key)
    if isinstance(value, list):
        value = '[' + ', '.join(value) + ']'
    print '<p>%s: %s</p>' % (cgi.escape(key), cgi.escape(value))
print '</body></html>'
| URL | Value of a | Value of b |
|---|---|---|
| http://www.third-bit.com/swc/show_params.py?a=0 | "0" | None |
| http://www.third-bit.com/swc/show_params.py?a=0&b=hello | "0" | "hello" |
| http://www.third-bit.com/swc/show_params.py?a=0&b=hello&a=22 | ["0", "22"] | "hello" |

Table 24.3: Example Parameter Values
- Send comments
Development Tips
- During development, add import cgitb; cgitb.enable() to the top of the program
- cgitb is the CGI traceback module
- When enabled, it will create a web page showing a stack trace when something goes wrong in your script
- Testing whether a FieldStorage value is a string or a list is tedious
- In almost all cases, you'll know whether to expect one value or many
- Use FieldStorage.getfirst(name) to get the unique value
- Returns the first, if there are many
- FieldStorage.getlist(name) always returns a list of values
- Empty list if there's no data associated with name
- If there's only one value, get a single-item list
- Send comments
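The difference between "first value" and "list of values" can be illustrated without a running web server. The Python 3 sketch below defines two stand-in helpers, `getfirst` and `getlist`, that are modeled on the FieldStorage methods named above but operate on the output of `urllib.parse.parse_qs`; they are assumptions made for this example, not the cgi module's own code.

```python
from urllib.parse import parse_qs

def getlist(params, name):
    """Always return a list: empty if `name` is absent."""
    return list(params.get(name, []))

def getfirst(params, name, default=None):
    """Return the first value for `name`, or `default` if absent."""
    values = params.get(name, [])
    return values[0] if values else default

params = parse_qs('a=0&b=hello&a=22')
print(getfirst(params, 'a'))   # first of several values
print(getlist(params, 'b'))    # single value, still a list
print(getlist(params, 'c'))    # no data, so an empty list
```

Because each helper has a single return type, the calling code never needs an `isinstance` check.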
Maintaining State
- Often want to change the data a server is managing, as well as read it
- Update a description of an experiment, change your preferred email address, etc.
- The industrial-strength solution is to use a three-tier architecture
![[Three Tier Architecture]](./img/server/three_tier_architecture.png)
Figure 24.5: Three Tier Architecture
- CGI program stuffs parameters from HTTP requests into SQL queries
- Runs the queries
- Translates results into HTML to send back to the client
- Send comments
Maintaining State in Files
- Simple programs can often get away with using files
- The CGI program re-reads the file each time it processes a request
- And re-writes it if there have been any updates
- Example: append messages to a web page
- Script checks the incoming parameters to decide what to do
- If newmessage is there, append it, and display results
- If newmessage isn't there, someone's visiting the page, rather than submitting the form
# Get existing messages.
infile = open('messages.txt', 'r')
lines = [x.rstrip() for x in infile.readlines()]
infile.close()
# Add more data?
form = cgi.FieldStorage()
if form.has_key('newmessage'):
    lines.append(form.getfirst('newmessage'))
    outfile = open('messages.txt', 'w')
    for line in lines:
        print >> outfile, line
    outfile.close()
- Send comments
HTML Generation
HTML Templating
- A lot of this program is devoted to copying values into an HTML template
- There are lots of good systems out there, in many languages, for doing this
- Kid in Python
- Java Server Pages (JSPs) in Java
- Please do not write one of your own
- Send comments
What About Concurrency?
- What happens if two users try to save messages at the same time?
- I/O is typically slower than processing
- So most web servers try to overlap operations
- Race condition:
- First instance of message_form.py opens messages.txt, reads lines, closes file
- Second instance opens messages.txt, reads the same lines, closes file
- First instance re-opens file, writes out original data plus one new line
- Second instance re-opens file, writes out original plus a different new line
- First instance's message has been lost!
- Send comments
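The interleaving described above can be replayed step by step in a few lines. This Python 3 sketch uses an in-memory list in place of messages.txt and runs the two "instances" in the unlucky order by hand; the variable names and message texts are invented for illustration.

```python
# The shared "file": one message already saved.
stored = ['first post']

# Both instances read the file before either one writes it back.
first_copy = list(stored)
second_copy = list(stored)

# First instance appends its message and rewrites the file.
first_copy.append('message from user A')
stored = first_copy

# Second instance does the same, starting from its stale copy.
second_copy.append('message from user B')
stored = second_copy

print(stored)
```

Neither instance did anything wrong on its own; the update is lost only because the read and the write of one instance were separated by the other's.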
File Locking
- Solution is to lock the file
- As the name implies, gives one process exclusive rights to the file
- After the first process acquires the lock, any other process that tries to read or write the file is suspended until the first releases it
- Mechanics are different on different operating systems
- But the Python Cookbook includes a generic file locking function that works on both Unix and Windows
- Send comments
Implementing Locking
import cgi, fcntl
# Get existing messages.
msgfile = open('messages.txt', 'r+')
fcntl.flock(msgfile.fileno(), fcntl.LOCK_EX)
lines = [x.rstrip() for x in msgfile.readlines()]
# Add more data?
form = cgi.FieldStorage()
if form.has_key('newmessage'):
    lines.append(form.getfirst('newmessage'))
    msgfile.seek(0)
    for line in lines:
        print >> msgfile, line
    # Discard any leftover bytes from the old contents.
    msgfile.truncate()
# Make sure buffered data reaches the file before the lock is released.
msgfile.flush()
# Unlock and close.
fcntl.flock(msgfile.fileno(), fcntl.LOCK_UN)
msgfile.close()
- Send comments
Who Are You?
- How to maintain state on the client?
- Need to know which shopping cart to display for a particular user
- HTTP is a stateless protocol
- If a client makes a second (or third, or fourth…) request, server has no reliable way of connecting it to the first one
- Can guess based on client address, elapsed time, etc.
- Send comments
Cookies
- Solution is for the server to create a cookie
- A string that is sent to the client in an HTTP response header
- Client saves it (either in memory or on disk)
![[Cookies]](./img/server/cookies.png)
Figure 24.6: Cookies
- The next time the client sends a request to the site, it sends the cookie back to the server
- Like giving someone a claim check for their luggage
- Send comments
Creating Cookies
- Represent cookies in Python using Cookie.SimpleCookie
- Do not use SmartCookie: it is potentially insecure
- When creating, add values to a cookie as if it were a dictionary
- Convert it to a string (e.g., by printing it) to create the required HTTP header
- When the cookie comes back:
- Get the value of the environment variable "HTTP_COOKIE"
- Create a SimpleCookie
- Pass the "HTTP_COOKIE" value to the cookie's load method
- Send comments
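The steps above can be sketched as follows. This example is written for Python 3, where the Python 2 `Cookie` module is named `http.cookies`; the cookie name `session` and its value are made up, and a plain dictionary stands in for `os.environ` so the sketch runs outside a web server.

```python
from http.cookies import SimpleCookie

# Creating: treat the cookie like a dictionary, then convert it to
# text to get the HTTP header line.
outgoing = SimpleCookie()
outgoing['session'] = 'abc123'   # made-up value for illustration
header = outgoing.output()
print(header)

# Coming back: the server puts the client's cookie text in
# HTTP_COOKIE; a plain dict stands in for os.environ here.
environ = {'HTTP_COOKIE': 'session=abc123'}
incoming = SimpleCookie()
incoming.load(environ.get('HTTP_COOKIE', ''))
print(incoming['session'].value)
```

The header line must be printed along with the other headers, before the blank line that separates headers from the page body.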
Cookie Example
Cookie Tips
- Can control how long a cookie is valid by setting an expiry value
- Either the number of seconds from now
- Or the time it should expire (in UTC)
- Use time.asctime(time.gmtime()) to create the value
- Do not put sensitive information in cookies
- Browsers store them in files on disk
- Villains can watch network traffic, and steal data
- Cookies should instead be random values that act as keys into server-side information
- Send comments
Summary
- CGI is an example of event-driven programming
- The framework invokes your code at specific times, and passes it specific information
- What happens the rest of the time isn't your concern
- At least, until something goes wrong, and you have to debug it
- Simple CGI programs can accomplish a lot
- The entire first generation of web applications were built this way
- But they can easily become very complicated
- Send comments
Exercises
Exercise 24.1:
One way to test a CGI application is to send it HTTP
requests, and examine the responses. Write a program that takes
a hostname, port, and partial URL as command-line parameters,
and sends the URL to the server identified by the hostname and
port. The program should display the status code, reason,
headers, and response page (if any) that are returned by the web
server.
For example, if your program is run as httptest
localhost 80 /greeting.html, it should send a request for
/greeting.html to a web server running on port 80 on the
local machine, and display something like:
STATUS: 200
REASON: OK
HEADERS:
content-length [49]
server ['Apache/2.0.54 (Debian GNU/Linux) DAV/2 SVN/1.1.4 mod_python/3.1.3 Python/2.3.5']
last-modified ['Wed, 19 Apr 2006 13:59:19 GMT']
date ['Sun, 30 Apr 2006 14:12:13 GMT']
content-type ['text/html']
PAGE:
<html>
<body>
<h1>Hello, CGI!</h1>
</body>
</html>
What are the pros and cons of testing a CGI application this
way?
Exercise 24.2:
Another way to test a CGI application is to construct a mock
container to take the place of the web server. As described in
the lecture, CGI applications read data from environment
variables and standard input; by using the subprocess
module described in the Integration lecture, you can run the CGI yourself, passing it
whatever test data you want. Write a program that does this.
(For bonus marks, explain how you would test the mock
container…)
Exercise 24.3:
The third way to test a CGI application is to construct a
mock container that calls the CGI directly, rather than creating
a new process and passing it data through environment variables
and standard input. In order for this to work, the CGI program
must import specially-crafted versions of the sys and
os libraries that provide the CGI with data from the
testing program, rather than reading it from the real
sources:
if testing:
    import test_sys as sys
    import test_os as os
else:
    import sys, os
What other changes must be made to the CGI application to
allow it to be tested this way? What are the pros and cons of
making such changes?
Send comments
Security
Evil Exists
- Computer security is a collective responsibility
- A system is only as strong as its weakest component
- If you are creating CGI scripts, or sending data over the web, you are putting others at risk as well as yourself
- Impossible to cover anything more than the basics in this lecture
- Send comments
You Can Skip This Lecture If...
- You understand the tradeoff between convenience and security
- You know that computer security is not primarily a technological problem
- You know what authentication, authorization, and access control are
- You never trust user input
- You know what public-key cryptography, HTTPS, and SSH are
- Send comments
What Are We Trying to Do?
- Goal: let everyone who should be able to do something do it easily…
- …while blocking people who shouldn't be able to…
- …and gathering information about their attempts
- Most people are trustworthy most of the time
- Preventing legitimate users from doing things annoys them
- If people are sufficiently annoyed, they'll turn security off, or find ways around it
- But we must account for the villainous minority
- Any system that relies on trust will attract abuse
- Keeping track of how villains are trying to break in is (almost) as important as preventing them
- You can't fix holes unless you know they exist
- Often need an audit trail in order to take legal or disciplinary action
- Send comments
Technology Alone Is Not A Solution
- Many successful attacks rely on social engineering
- Call up your bank, and see if you can get your credit card balance without your PIN
- Helps if you sound like a grandmother who is close to tears because her poodle has just been hit by a car
- Second way to attack a system is to get a job with the company running it
- Many companies choose not to press charges, rather than deal with bad publicity after a security failure
- So burn an extra copy of credit card data while backing up the server…
- …or take notes of all the “to be fixed later” points that come up during the security audit of the web site
- Send comments
More Ways Security Can Fail
- And then there's carelessness
- Many people don't bother to change the default password on their wireless router
- Many more choose easily-guessed passwords
- Where “easy” means “can be found by a clever program running for a couple of hours”
- Remember: once one villain builds a tool, they can all use it
- In fact, technology can make systems less secure
- Imagine a facial recognition system that works correctly 99% of the time
- So one person in a hundred is mistakenly identified as a potential terrorist
- 300,000 passengers a day in a busy airport means one false alarm every 30 seconds
- Do you think the guards will still be paying attention to the alarms on Tuesday?
- Send comments
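The false-alarm figure quoted above can be checked with a few lines of arithmetic. This Python 3 sketch assumes, for simplicity, that passengers arrive evenly around the clock, that each is scanned once, and that every one of them is innocent, so each error is a false alarm.

```python
passengers_per_day = 300000
errors_per_hundred = 1          # "works correctly 99% of the time"
seconds_per_day = 24 * 60 * 60

false_alarms_per_day = passengers_per_day * errors_per_hundred // 100
seconds_between_alarms = seconds_per_day / false_alarms_per_day
print(false_alarms_per_day, seconds_between_alarms)
```

That is 3,000 false alarms a day, or one roughly every 29 seconds, which is where the "every 30 seconds" figure comes from.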
How to Think About Security
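- Security systems are responsible for authentication (establishing who you are), authorization (deciding what you're allowed to do), and access control (enforcing those rules)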
- When analyzing security, look for ways to compromise the three A's
- Convince the system you are:
- Some other regular user (if you're trying to buy stuff with someone else's credit card)
- An administrator (or someone else with special privileges)
- Convince it that you're allowed to do something you're not
- E.g., give yourself administrative privileges
- Circumvent its enforcement of the rules
- E.g., take advantage of a browser bug that lets Javascript in a page make copies of your cookies
- Send comments
Risk Assessment
- First step is always risk assessment
- What could an attacker do?
- How much would it cost?
- Example: WebDTR is a password-protected web interface to a database of drug trial results
| Risk | Importance | Discussion |
|---|---|---|
| Denial of service | Minor | Researchers can wait until the system comes back up |
| Data in database destroyed | Minor | Restore from backup |
| Unauthorized data access | Major | If competitors access data, competitive advantage may be lost |
| Backups corrupted, so that data is permanently lost | Major | Redoing trials may cost millions of dollars |
| Data corrupted, and corruption not immediately detected | Critical | Researchers may make recommendations or diagnoses that lead to injury or death |

Table 25.1: Risk Assessment
- Send comments
Thinking Like A Villain
- Good judgment comes from experience
- But experience is just the name we give to our mistakes when talking to our grandchildren
- The books listed in the introduction describe attacks that have worked in the past
- Use these to guide your analysis of your system
- Send comments
Example: Don't Trust Your Input
- Anyone who knows the URL of a web application can send it data
- And can study its HTTP requests and responses
- There is therefore no guarantee that the HTTP request you receive was generated from your form
- The input provided for a selection list may not be one of the values you offered
- The input for a text field may be longer than the maximum you specified
- Some parameters may be missing from QUERY_STRING, while unexpected ones may be present
- QUERY_STRING may not even be formatted according to the HTTP specification
- Send comments
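One way to act on the points above is to check every incoming value against what the form could legitimately have produced, and reject everything else. The Python 3 sketch below does this for a two-field form; the function name `validate`, the allowed search types, and the length limit are all hypothetical values chosen for illustration.

```python
from urllib.parse import parse_qs

ALLOWED_SEARCH_TYPES = {'Similarity match', 'Exact match'}  # hypothetical
MAX_SEQUENCE_LENGTH = 40                                    # hypothetical

def validate(query_string):
    """Return (sequence, search_type) or raise ValueError."""
    # strict_parsing makes parse_qs reject malformed input instead of
    # silently skipping it.
    try:
        params = parse_qs(query_string, strict_parsing=True)
    except ValueError:
        raise ValueError('malformed query string')
    if set(params) != {'sequence', 'search_type'}:
        raise ValueError('missing or unexpected parameters')
    if len(params['sequence']) != 1 or len(params['search_type']) != 1:
        raise ValueError('repeated parameters')
    sequence = params['sequence'][0]
    search_type = params['search_type'][0]
    if len(sequence) > MAX_SEQUENCE_LENGTH:
        raise ValueError('sequence too long')
    if search_type not in ALLOWED_SEARCH_TYPES:
        raise ValueError('unknown search type')
    return sequence, search_type

print(validate('sequence=GATTACA&search_type=Similarity+match'))
```

The checks are written as a whitelist: the code lists what is acceptable rather than trying to enumerate everything an attacker might send.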
Attacking URLs
- Attacker looks at WebDTR URLs
- Before logging in: http://www.webdtr.com
- After logging in: http://www.webdtr.com/display.py?user=cdarwin
- Looks for a cookie from webdtr.com: none present
- Conclusion: user ID is being stored in the URL
- Try surfing to http://www.webdtr.com/display.py?user=bmcclintock
- Yup, we're in…
- Send comments
Leaking Information
- Now try http://www.webdtr.com/display.py?user=nobody
- Result is an error page saying “no such user”
- Which means we have a way to see who's authorized to use the system
- I.e., whose password it might be worth cracking
- What about the URL http://www.webdtr.com/display.py?user= (with no user name at all)?
- Result is a page containing a stack trace
- Developer left cgitb (or its equivalent) enabled in the production system
- Doesn't help normal users: stack trace doesn't tell them what they did wrong
- But it does help attackers by telling them what functions are being called, what libraries are in use, etc.
- Every piece of information that leaks out of the application helps attackers find vulnerabilities
- Send comments
SQL Injection
- New version of WebDTR uses secure connections and encrypted cookies to close the holes identified above
- The URL used to look up a result is http://www.webdtr.com/display.py?testid=178923
- Set testid to "1);UPDATE Results SET result=FALSE WHERE (id=*"
- Whole query is then "SELECT date,result FROM Results WHERE (id=1);UPDATE Results SET result=FALSE WHERE (id=*)"
- Oops
- Mistake #1: CGI program has a capability (updating the database) it doesn't actually need
- Mistake #2: application failed to validate its input
- Should have checked that testid's value was an integer, and in range
- Send comments
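A hedged sketch of how the second mistake could be avoided: validate `testid` before using it, and pass it to the database as a parameter rather than pasting it into the SQL text. The example is Python 3 and uses the standard `sqlite3` module with an in-memory database as a stand-in; the function name `lookup_result`, the `max_id` limit, and the sample row are invented, while the table and column names follow the query shown above.

```python
import sqlite3

def lookup_result(connection, testid_text, max_id=10**9):
    """Look up one test result, treating testid_text as untrusted."""
    # Validate first: the value must be an integer in range.
    # isdecimal() accepts only digits, so signs, spaces, and SQL
    # fragments are all rejected here.
    if not (testid_text.isascii() and testid_text.isdecimal()):
        raise ValueError('testid must be an integer')
    testid = int(testid_text)
    if not 1 <= testid <= max_id:
        raise ValueError('testid out of range')
    # The "?" placeholder makes the driver pass the value as data,
    # never as part of the SQL text.
    cursor = connection.execute(
        'SELECT date, result FROM Results WHERE id = ?', (testid,))
    return cursor.fetchone()

# Throwaway in-memory database standing in for the real one.
db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE Results (id INTEGER, date TEXT, result INTEGER)')
db.execute("INSERT INTO Results VALUES (178923, '2006-04-19', 1)")

print(lookup_result(db, '178923'))
try:
    lookup_result(db, '1);UPDATE Results SET result=0 WHERE (id=id')
except ValueError as error:
    print('rejected:', error)
```

This addresses only the input-validation mistake. Fixing the first mistake means connecting to the database with an account that has no permission to update it, which is a matter of database configuration rather than code.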
Attacking Defaults and Denial of Service
- Another attack is to see if default accounts or passwords are still enabled
- Try logging in with "admin" and "admin", or "guest" and "guest", etc.
- Better yet, write a small script to try this
- Helps (the attacker) if the results distinguish between “no such user” and “invalid password”
- Can use a script like this to run a denial of service (DoS) attack
- Flood the server with login requests, so that legitimate users can't get access
- Or their connections time out even if they do
- Send comments
Phishing
- Phishing is increasingly common
- Trick users into giving away sensitive information
- Email someone you believe is a user of the system
- “System crashed last night, click here to reset your password”
- The link actually sends them to http://www.webbdtr.com
- Did you notice the difference in the host name?
- Phony site shows them the same login page as the real one
- Records their password, then redirects them to the real system
- Send comments
Attacking Data Entry
- How is the database updated?
- Files mailed in by clinicians are formatted and concatenated by a Python script
- Results temporarily stored in /tmp/webdtr/0001.tmp, /tmp/webdtr/0002.tmp, etc.
- Administrator periodically runs another Python script to load this data into the database
- Backups are run twice a week
- Attack #1: mail in a file full of fake data
- Administrator “authenticates” messages just by looking at sender address
- Which is very easy to fake
- Attack #2: modify or replace one or the other Python script
- Attack #3: create a file /tmp/webdtr/9999.tmp
- Does the script that loads the database check that sequence numbers are consecutive?
- Does it check who created or owns the file?
- Send comments
Timed Attacks
- See a message on the WebDTR mailing list saying that the program now checks for attack #3 above
import os
from stat import ST_UID

class SecurityException(Exception):
    '''Raised when a file fails a security check.'''

def read_file(filename, required_uid):
    '''Read submission data from a file, checking that the file
    is owned by the specified user.'''
    owner = os.stat(filename)[ST_UID]
    if owner != required_uid:
        raise SecurityException('%s has incorrect owner' % filename)
    stream = open(filename, 'r')
    data = stream.read()
    stream.close()
    return data
- There's a tiny window of opportunity between when the program checks ownership, and when it opens the file
- Write a script that loops over files, deleting them and creating new ones in their place
- Low chance of success on any one try…
- …but computers are very patient
- Send comments
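One common way to close the window described above is to open the file first and then check the ownership of the file that was actually opened, instead of checking the name and opening it afterwards. The Python 3 sketch below shows the idea; `read_file_safely` is a hypothetical replacement written for this example, not the WebDTR code, and it does not address other issues such as symbolic links.

```python
import os
import tempfile

class SecurityException(Exception):
    """Raised when a submission file fails an ownership check."""

def read_file_safely(filename, required_uid):
    """Open first, then check the owner of the file actually opened."""
    descriptor = os.open(filename, os.O_RDONLY)
    with os.fdopen(descriptor, 'rb') as stream:
        # fstat inspects the open file itself, so swapping the name
        # on disk after this point cannot change the answer.
        owner = os.fstat(stream.fileno()).st_uid
        if owner != required_uid:
            raise SecurityException('%s has incorrect owner' % filename)
        return stream.read()

# Try it on a scratch file owned by the current user.
with tempfile.NamedTemporaryFile(delete=False) as scratch:
    scratch.write(b'trial data')
my_uid = os.stat(scratch.name).st_uid
print(read_file_safely(scratch.name, my_uid))
os.remove(scratch.name)
```

The check and the read now refer to the same open file, so there is no gap in which an attacker can substitute a different one.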
Securing HTTP
- HTTP sends data as cleartext
- As is too often the case, security was ignored in HTTP's original design
- Netscape later developed HTTPS (Secure HTTP) to protect confidential information
- Uses a different port (443 instead of 80) and protocol (https in URL instead of http)
- Encrypts data between the browser and the web server
- Does not guarantee secure storage on the server
- Far too many web sites store sensitive information in databases as cleartext
- Gives villains another point of attack
- Send comments
Cryptography 101
- Encryption is the process of obscuring information so that it can't be read without special knowledge
- An algorithm for encrypting and decrypting is called a cipher
- Original and encrypted messages are called plaintext and ciphertext respectively
- All classical (pre-1970s) ciphers are symmetric
- Same key is used for both encryption and decryption
- Which means that the key can only be shared among trusted parties
- Send comments
Public-Key Cryptography
- Asymmetric ciphers have two keys
- Each undoes the other's effects
- (Practically) impossible to determine one given the other
- Asymmetric systems are often called public key cryptography systems
- Note: symmetric encryption is typically many times faster than asymmetric encryption
- Usual scheme these days is to use asymmetric encryption (slow) to exchange a one-time symmetric key
- Then use the symmetric key (fast) for the rest of the conversation
- Send comments
Sending and Receiving
- Anyone who wants to send a message to you encrypts it using your public key
- You're the only one who can decrypt it, since only you hold the matching private key
- Look up their public key in order to encrypt your reply
![[Secure Communication with Asymmetric Keys]](./img/security/public_keys.png)
Figure 25.1: Secure Communication with Asymmetric Keys
- Send comments
Digital Signatures
- Key pairs can also be used to sign messages
- Encrypt message using your private key, and append the result to the original message
- Recipients use your public key to decrypt the signature
- If it matches the message, you must have been the sender
- Also guarantees that the clear text was not changed
- Which is more than a regular signature can do
- In practice, encrypt a digest of the original message
- Practically impossible for someone to construct a message that has a given digest
![[Signing a Message]](./img/security/digital_signatures.png)
Figure 25.2: Signing a Message
- Send comments
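The sign-then-verify cycle can be demonstrated with textbook RSA arithmetic. The Python 3 sketch below (it needs Python 3.8 or later for the modular inverse) is a toy only: the key is built from two well-known Mersenne primes, there is no padding, and all names are invented for the example. It must never be used for real security, in line with the advice never to design your own ciphers.

```python
import hashlib

# Toy key pair built from two well-known Mersenne primes. These are
# far too small and too special to be secure: illustration only.
P, Q = 2**61 - 1, 2**89 - 1
N = P * Q
PUBLIC_EXPONENT = 65537
PRIVATE_EXPONENT = pow(PUBLIC_EXPONENT, -1, (P - 1) * (Q - 1))

def digest(message):
    """Reduce a message to a number smaller than N."""
    return int.from_bytes(hashlib.sha256(message).digest(), 'big') % N

def sign(message):
    """'Encrypt' the digest with the private key."""
    return pow(digest(message), PRIVATE_EXPONENT, N)

def verify(message, signature):
    """'Decrypt' the signature with the public key and compare."""
    return pow(signature, PUBLIC_EXPONENT, N) == digest(message)

message = b'Trial 178923: result confirmed'
signature = sign(message)
print(verify(message, signature))
print(verify(b'Trial 178923: result REJECTED', signature))
```

Signing the digest rather than the whole message keeps the signature a fixed size no matter how long the message is.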
Securing Login
- Another flaw in HTTP is its built-in password handling (called basic authentication)
- Sends the user name and password as cleartext
- Solution is simple: never use HTTP basic authentication
- And never have users submit ID and password via a form over plain HTTP, since the form data isn't encrypted
- Alternative:
- Have user provide ID and password over secure connection
- Use a random number as a cookie
- Do not just use a sequence of integer session IDs: too easy for attackers to fabricate
- Give that to the client to track the session
- When it comes back, use it as a key into a dictionary of active sessions
- Send comments
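The alternative outlined above might look something like this in Python 3. The names `active_sessions`, `start_session`, and `user_for` are made up for the sketch, and a real application would also need to expire sessions and keep the dictionary somewhere that survives between CGI invocations.

```python
import secrets

active_sessions = {}   # session ID -> user ID, kept on the server

def start_session(user_id):
    """Create an unguessable session ID after a successful login."""
    session_id = secrets.token_hex(16)   # 128 random bits as hex text
    active_sessions[session_id] = user_id
    return session_id

def user_for(session_id):
    """Return the user for a cookie value, or None if it is unknown."""
    return active_sessions.get(session_id)

cookie_value = start_session('cdarwin')
print(user_for(cookie_value))
print(user_for('42'))   # a guessed ID is not a key in the dictionary
```

The `secrets` module draws from the operating system's cryptographic random source, which is why it is used here instead of `random`.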
Red Queen Race
- If villains can snoop on network traffic, they can hijack sessions
- Insert a copy of your cookie into their message
- Also vulnerable to replay attacks
- Copy the cookie (or an entire message) and re-send it later
- Useful if the message means “open the vault door”
- Note: none of this helps if there is spyware on the client machine
- These days, this is much more likely than someone sniffing network traffic
- Keep your anti-virus protection and spyware monitors up to date, and run them regularly
- What else is there for your machine to do at 3:00 a.m.?
- Send comments
It Isn't Just The Web
- (In)security isn't just a feature of web-based applications
- How do you know the software you've installed on your machine is reliable?
- How would you find out if it had been tampered with during production?
- C and C++ have vulnerabilities that other languages don't
- Best-known are buffer overflow attacks:
- Attacker sends more data than the program has allocated memory to receive
- “Extra” bytes overwrite the program itself
- If those bytes' values correspond to machine instructions, the attacker can change the program's behavior
- Send comments
Summary
- Remember that technology doesn't solve security problems: it just moves them around
- Never rely on keeping your techniques secret to ensure security
- Never design your own ciphers
- Use 3DES or AES for symmetric encryption
- And RSA, DSA, or EC-DSA for public-key
- Most important: security has to be designed in from the start
- Send comments
The Development Process
Introduction
- There's more to building a house than nailing boards together
- Have to make sure the pipes are put in before the drywall goes up
- Satisfy building code regulations
- Make sure everyone on the team is productive (not just busy)
- This lecture covers the equivalent topics for small-team software development
- 12×12: up to a dozen people, working for up to a year
- All of these ideas apply to people working on their own for two weeks or more
- Send comments
You Can Skip This Lecture If...
- You don't care if anyone else can ever use your software
- You enjoy being frustrated and unproductive
- You're sure that people you've never met will be able to use and modify your software two years from now
- Send comments
Design vs. Agility
- Two camps currently dominate the debate about software development
- Big Design Up Front (BDUF): measure twice, cut once
- Think through users' needs, design, and possible problems before starting to code
- Agile: lots of small steps, with continuous testing and refactoring
- “No battle plan ever survives contact with the enemy.” (Helmuth von Moltke)
- Both are responses to Boehm's Curve
![[Boehm's Curve]](./img/dev01/boehm_curve.png)
Figure 26.1: Boehm's Curve
- BDUF: prevent problems from happening at all
- The cheapest bug to fix is one that doesn't exist
- Agile: catch problems while you're still at the low-cost end of the curve
- Differences in practice are much less than the differences in rhetoric
- Send comments
Project Lifecycle
- Very few individuals or teams stick to textbook rules
- Teams always adapt processes to local needs and personalities
- Remember: reality matters more than rulebooks
- No matter what the official process is, most well-run medium-sized projects follow a similar path
![[Project Lifecycle]](./img/dev01/project_lifecycle.png)
Figure 26.2: Project Lifecycle
- Send comments
Step 0: Vision
- A vision statement is a one- or two-sentence summary of the project
- Also called an elevator pitch
- Helps keep everyone pointed in the same direction
- A good way for project members to introduce themselves at conferences and trade shows
- Exercise: have everyone on the team replace the bits in italics with words of their own
- Do this independently, then compare answers
| Part | Boilerplate | Example |
|---|---|---|
| Problem statement | The problem of | only being able to simulate invasion percolation on regular 2D grids |
| Target market | affects | scientists who work with composite materials, |
| Impact | who currently | have to extrapolate from regular models. |
| Solution | Our solution, | a set of enhancements to InvPerc, |
| Key technical feature | | handles any structure that can be represented as non-overlapping regions. |
| Competition | Unlike | PI2D and other simulators, |
| Differentiator | it | can read standard CAD files as well as IP2-format grid files. |

Table 26.1: Vision Statement Template
- Just as important for solo projects!
- Send comments
Step 1: Gathering Requirements
- Single biggest cause of project failure is failing to get the requirements right
- Boehm's Curve again: building the wrong thing is the most expensive mistake you can make
- Start by asking what problem the software is supposed to solve
- What do you want to be able to do that you can't right now?
- What does the existing software do that you don't want it to?
- What does it make you do that you don't want to?
- Organize requirements as point-form list
- Give each one a unique name
- And keep the list under version control
- Send comments
What Requirements Are and Aren't
- Good requirements are complete and unambiguous
- “The system will reformat data files as they are submitted” is neither
- Instead:
- Only users who have logged in by providing a valid user name and password can upload files
- The system must allow users to upload files via a secure web form
- The system must accept files up to 16MB in size
- The system must accept files in PDB and RJCS-1 format
- The system must convert files to RJCS-2 format before storing them
- The system must present users with an error message page if an uploaded file cannot be parsed
- Etc.
- A contract amongst the various stakeholders
- Overly formal for two-person research prototypes
- But essential for distributed teams
- Send comments
Step 2: From Requirements to Features
- Figure out what features you need
- What do you have to build in order to accomplish XYZ?
- How will you tell that it's working?
- Yet another point-form list…
- Relationship between requirements and features can be very complex
- One feature can (help) satisfy many requirements
- One requirement may require many features
- Traceability once again:
- Why does each feature exist?
- How is each requirement being satisfied?
- Who said so? When?
- Send comments
Waterfalls And Why Not
- Pause for a moment…
- This looks like the start of the waterfall model [Royce 1970]
- Describes development as flowing through several distinct phases
- Requirements analysis to design to implementation to testing to maintenance
![[The Waterfall Model]](./img/dev01/waterfall_model.png)
Figure 26.3: The Waterfall Model
- But:
- If different people are responsible for different phases, then no one has to deal with the consequences of their mistakes
- Whoever is responsible for testing has to make up all the lost time from the previous phases
- Time lag: it can take a long time for changes in requirements to filter through to the finished product
- No one actually ever works this way in real life anyway
- Send comments
The Spiral Model
- The spiral model [Boehm 1988] wraps this around itself
![[The Spiral Model]](./img/dev01/spiral_model.png)
Figure 26.4: The Spiral Model
- Go through the waterfall cycle over and over again, each time on a larger scale
- Royce actually advocated doing this too, but most people have forgotten that
- Key ideas:
- The code teaches you about the problem
- Customers can only find out what they actually want by playing with a working system
- But Boehm still envisaged:
- Cycles lasting from six months to two years
- And division of labor
Enter the Extremists
- Extreme Programming (XP) arose in the 1990s to cope with:
- Ever-changing requirements
- Internet time
- Six-month iterations were longer than the lifespan of the average dot-com
- Web-based delivery: it's possible to “ship” a new version whenever you want
- Basic ideas:
Pitfalls
- Requires a lot of self-discipline to stop it degenerating into pure hackery
- Funding agencies are understandably reluctant to fund a project whose deliverables will be made up along the way
- On the other hand, this is a good description of research…
- Not as well suited to large projects or teams
- But this is changing as web-based collaboration tools improve
Step 3: Analysis & Estimation
- Next step is analysis & estimation (A&E)
- How can each feature be implemented?
- And how long will it take?
- Where possible, investigate two or more options
- Plan A: only solve three quarters of the problem, but can be implemented in a week
- Plan B: does everything and more, but will take three months
- Write throw-away code to become familiar with new libraries and tools
- Keep it under version control
- But do not let it find its way into the application
Where Estimates Come From
- Time estimates come from experience
- You should be able to guess how much code you'll have to write to implement something
- If you can't, you should think it through in more detail, or write some more throwaway code
- Keep track of how long it takes your team to build things
- Remember to include time for testing, debugging, and documenting
- Based on average developers and average days, not the best and their best
- Your first estimates will be far too optimistic
- But the more you estimate, the better your estimates will become
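One way to make use of the record of how long things actually took is to scale new estimates by the team's past ratio of actual to estimated time. This is only a sketch of that idea, not a method the course prescribes, and the history figures are invented.

```python
def correction_factor(history):
    """history is a list of (estimated_days, actual_days) pairs for finished tasks."""
    total_estimated = sum(estimated for estimated, _ in history)
    total_actual = sum(actual for _, actual in history)
    return total_actual / total_estimated


def adjusted_estimate(raw_days, history):
    """Scale a fresh estimate by how far off past estimates were."""
    return raw_days * correction_factor(history)


# Made-up history: every task took longer than estimated.
history = [(2, 3), (5, 8), (3, 4)]
factor = correction_factor(history)
four_day_guess = adjusted_estimate(4, history)
```

A factor well above 1 is the "far too optimistic" pattern described above; it should drift toward 1 as the team practices estimating.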
What Goes Into An A&E
- Title page, with feature name and document revision history
- Abstract: two or three sentences that will tell someone browsing dozens of A&Es whether this is one they should read
- Background: summarize the problem for a reasonably knowledgeable developer
- Imagine writing for who you were when you started the project
- Don't bother explaining what the Internet is…
- For each alternative:
- What is it, and how will it work?
- How long will it take to create? To test? To document? To add to the build and installer?
- What impact will it have on other features?
- How certain are you of your estimates?
- References to other A&Es, URLs, pointers to prototypes in the repository, etc.
- In practice, every A&E is different, because every problem is different
- Requiring team members to express their ideas in a fixed way just leads to fewer ideas
Reviews
- The whole point of writing the A&E is so that other team members can look for holes in your ideas
- Just like reviewing scientific papers
- Most important thing to look for is people glossing over hard bits
- “We parse the XML configuration file, and then a miracle happens” is not a good A&E
- Assign specific A&Es to specific people for review
- Ensures that reviews actually get done…
- …and that knowledge and decisions are communicated
- It's not just for coders
- Make sure the people who have to test, document, deploy, maintain, and support all get their say
- Expect to revise the A&E two or three times
- Stop when everyone is confident the feature can be built in the time estimated
- End result is “just enough” design
What Can Go Wrong with A&Es
- Too much formality
- Make the format fit the particular design problem, instead of trying to squeeze everything into a standard template
- You are (probably) not a lawyer: don't try to write like one
- Don't bother to include detailed change history in the document
- That's what version control is for
- Not enough formality
- Everything you leave out because “everyone knows it” is something you might trip over later
- Analysis paralysis
- You can second-guess yourself forever
- Once you believe you know how to implement everything, stop writing and start coding
Step 4: Prioritization
- Now it's time to prioritize
- Which features are most cost-effective to develop?
- There's never time to do them all
- Usual way to do this is to build a 3×3 grid
- Rank each feature low-medium-high on importance and effort
- More honest than the false accuracy of a 1-10 scale
![[Ranking Features]](./img/dev01/ranking_features.png)
Figure 26.5: Ranking Features
Step 5: Scheduling
- Can now draw up a schedule
- Throw out everything below the diagonal of the priority matrix
- Only big choice remaining is whether to do big items first, or little ones
- Remember to take dependencies into account
- End result is a list of who's doing what, when
- Schedule people at 80% of capacity to allow for sick time, interruptions, etc.
- Yes, it contains a lot of guesswork…
- …but it's better than nothing…
- …and estimates improve with practice
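The 3×3 ranking and the "throw out everything below the diagonal" rule can be sketched in a few lines. The code assumes that "below the diagonal" means a feature's effort outranks its importance (the figure is the authority on this), and the feature names are hypothetical.

```python
RANK = {"low": 1, "medium": 2, "high": 3}

# (feature, importance, effort): made-up examples.
features = [
    ("batch upload", "high", "medium"),
    ("undo", "medium", "high"),
    ("dark theme", "low", "low"),
    ("PDF export", "high", "high"),
    ("plugin API", "low", "high"),
]


def keep(importance, effort):
    # Assumption: a feature is below the diagonal when it costs more than it is worth.
    return RANK[importance] >= RANK[effort]


def shortlist(candidates):
    """Drop features below the diagonal; list the most important, cheapest ones first."""
    kept = [f for f in candidates if keep(f[1], f[2])]
    return sorted(kept, key=lambda f: (-RANK[f[1]], RANK[f[2]]))


schedule_candidates = [name for name, _, _ in shortlist(features)]
```

The sort order here (importance first, then effort) is one answer to the "big items first, or little ones" question; reversing the key gives the other.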
Science Fiction Scheduling
- Do not shave time estimates to make them fit
- If you do this, developers will start padding their estimates…
- …or supplying random numbers, secure in the knowledge that they won't be able to make the deadline anyway
- Making up “science fiction schedules” is a very common mistake
- Yes, people will complain if their feature doesn't make it onto the schedule
- But putting it in the schedule when you know that schedule is fiction won't actually make them any happier
- It's better to live up to small promises than break big ones
Step 6: Development
- Now it's time to test and code
- Remember to do them in this order
- Expect to refine design during early stages of construction
- If you're still refining the design a week before you're due to ship, something has gone wrong
- Take time to refactor old code while adding new stuff
- Your skills (and coding style) improve over time
- Or the person working on the feature in Version 3.2 knows something the Version 3.1 author didn't
- The problem changes over time
- A good solution to last year's requirements may not be a good solution to this year's
- Describe day-to-day activities in the Teamware
Tracking Progress
- Make sure the schedule is always up to date
- Every developer writes a few bullet points every week
- Doing this at 9:00 a.m. Monday works better than asking for it at 4:45 on Friday
- Describe tasks in terms of verifiable deliverables
- Things that other people can inspect or test
- Always mark tasks as “done” or “not done”, rather than “X% complete”
- If you allow percentages, then many tasks will be 90% done for 90% of the lifetime of the project
- Instead, break tasks down into subtasks that are at most a few days long, and either are or are not completed
Burn Rate
- The real purpose of a schedule is to tell you when to start cutting corners
- Keep track of how quickly items are actually being finished
- The project's burn rate
![[Burn Rate]](./img/dev01/burn_rate.png)
Figure 26.6: Burn Rate
- If it is 75% of what you predicted, you can:
- Move the completion date back
- Replace some tasks with smaller, lower-priority ones
- Hope that you will miraculously become more productive
- The sooner you do this, the happier you (and your intended users) will be
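The burn-rate arithmetic is simple enough to automate. The sketch below assumes tasks are counted as whole units (done or not done, as recommended under Tracking Progress) and uses invented numbers.

```python
import math


def burn_rate(tasks_done, weeks_elapsed):
    """Tasks actually finished per week so far."""
    return tasks_done / weeks_elapsed


def weeks_remaining(tasks_left, rate):
    """Whole weeks needed to finish the remaining tasks at the given rate."""
    return math.ceil(tasks_left / rate)


planned_rate = 4.0                                   # tasks per week in the schedule
actual_rate = burn_rate(tasks_done=18, weeks_elapsed=6)
ratio = actual_rate / planned_rate                   # fraction of the predicted pace

planned_finish = weeks_remaining(30, planned_rate)
likely_finish = weeks_remaining(30, actual_rate)
```

Comparing `likely_finish` with `planned_finish` tells you how far to move the completion date, or how many tasks to swap for smaller ones, if you would rather not rely on a miracle.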
Step 7: Finishing
- Stop adding new features three-quarters of the way through the project
- No matter how much testing you do as you go along, you'll need time to fix things at the end
- Shift resources into integration testing and documentation
- If you're only starting to build the installer now, you've left it too late
- Installation and upgrade code can be as complex as the application itself
- Do design, and budget time, when writing A&E
- Do not ask for a “big push”
- People can only be productive for 40 hours a week [Robinson 2005]
- Any more than that, and the mistakes they make will actually cost you time overall
After the Party's Over
- Always do a post mortem after the project finishes
- What went right (that you want to do again)?
- What went wrong (that you want to avoid next time)?
- Often helps to bring in an outsider to facilitate
- Feedback is only as useful as it is honest
- Update the A&Es to reflect what was actually built
- Forces team members to examine what they got wrong (so that they can improve)
- Provides a starting point for the next round of development
Summary
- BDUF and XP are diametrically opposed, but both improve productivity
- So either the way most people develop software is the worst possible…
- …or what really matters is having a process—any process—so that you have some rules to play by…
- …and something to improve
Exercises
Exercise 26.1:
Does your manager know when you expect to complete your
current task? How inaccurate the schedule currently is?
Exercise 26.2:
Can you find out when your manager expects you to complete
your current task (without asking her directly)? When team members
expect to complete their current tasks (without asking them directly)?
Who would be affected if you slipped a week?
Teamware
Introduction
- No programmer is an island
- We inherit projects from other people
- They depend on tools and libraries we didn't write
- So you ought to think about the person who's going to take over your work before you move on
- As distributed collaboration becomes the norm, good team skills become even more important
- All of which also make you more productive individually
- This lecture introduces core skills by showing a typical developer's day
- Uses an open source web-based project management portal called DrProject
- But the point is how it's used
You Can Skip This Lecture If...
- You know how to use wikis, weblogs, and mailing lists
- You know what an issue tracker is
- You know how to file a ticket
Motivation
- There's a lot more to a project than just software
- Who's working on what?
- When are things due?
- What decisions did we make last week?
- How do I install this thing anyway?
- Software project management portals like DrProject gather all this information together
- Much of the value in information lies in the links between items
Architecture
- A single DrProject installation manages one or more projects
- Each project's files are stored in a separate Subversion repository
- Everything else is stored in a single PostgreSQL database
- Relies on host to handle authentication
- So there's one less password for users to remember
- Each user has a specific role with respect to each project
- These roles define what people can do
Getting Started
- Ginny sits down to start work at 9:00 a.m. on Wednesday morning
- Her group is working on Version 3.2 of GeneMagic
- She spent Monday and Tuesday tracking down a bug for a group at another university who are using an earlier version of the software
- So the first thing she wants to know is what the rest of the team have done recently
- Goes to the group's DrProject site and looks at the event log
![[The Event Log]](./img/dev02/event_log.png)
Figure 27.1: The Event Log
- Chronological listing of Subversion commits, mail messages, etc.
- Filters let her control what kinds of events she sees, and from how long ago
Blogging
- Weblogs (or blogs) started off as on-line journals
- Author updates a file with new journal entries
- File written in an XML format called RSS
- Blog-reading software downloads file at irregular intervals
- If anything has changed since the last time, display the titles of new articles
![[How Blogs Work]](./img/dev02/how_blogs_work.png)
Figure 27.2: How Blogs Work
- A publish-subscribe system
- If you no longer want to read a blog, stop polling for updates
- Every DrProject project's event log is also available as a blog
- Very convenient way to keep up with several projects at once
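The polling step described above (download the file, show only what is new) can be sketched with the standard library. The two feed snapshots are invented, inline stand-ins for what a reader would download, and using each item's `guid` to recognize entries it has already seen is one common choice rather than something the slides specify.

```python
import xml.etree.ElementTree as ET

# Two made-up snapshots of the same RSS file, as a reader might see on successive polls.
FEED_FIRST_POLL = """<rss version="2.0"><channel><title>GeneMagic events</title>
<item><title>Ron committed revision 93</title><guid>event-1</guid></item>
</channel></rss>"""

FEED_SECOND_POLL = """<rss version="2.0"><channel><title>GeneMagic events</title>
<item><title>Ginny closed ticket 22</title><guid>event-2</guid></item>
<item><title>Ron committed revision 93</title><guid>event-1</guid></item>
</channel></rss>"""


def new_titles(feed_xml, seen):
    """Return titles of items not seen on earlier polls, and remember them."""
    root = ET.fromstring(feed_xml)
    fresh = []
    for item in root.iter("item"):
        guid = item.findtext("guid")
        if guid not in seen:
            seen.add(guid)
            fresh.append(item.findtext("title"))
    return fresh


seen_guids = set()
first = new_titles(FEED_FIRST_POLL, seen_guids)
second = new_titles(FEED_SECOND_POLL, seen_guids)
third = new_titles(FEED_SECOND_POLL, seen_guids)
```

Unsubscribing needs no cooperation from the publisher: the reader simply stops calling `new_titles` for that feed.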
Repository Browser
- Ginny notices that Ron has committed changes to a couple of files she was working on
- Event log shows her his comments
- But she wants to take a closer look at the changes
- The repository browser is a read-only viewer for the project's Subversion repository
![[Browsing Directories and Files]](./img/dev02/repo_browser_dirs_files.png)
Figure 27.3: Browsing Directories and Files
- Read-only because allowing the browser to commit files would require giving the web server permission to mess with the file system
- If you don't know why this is a bad thing, please re-read the lecture on security
Viewing Revision History
- The browser can also display the revision history of a file
![[Viewing Revision History]](./img/dev02/repo_browser_revision_history.png)
Figure 27.4: Viewing Revision History
Viewing Changesets
- Browser can also show particular changesets
![[Viewing File Changes]](./img/dev02/repo_browser_view_file_diff.png)
Figure 27.5: Viewing File Changes
- Seeing related changes together makes them easier to understand
Mailing Lists
- Ginny is puzzled by Ron's change, and wants to know if there was some discussion about it
- Takes a look at the project's mailing list
![[Mailing List Archive]](./img/dev02/mailing_list.png)
Figure 27.6: Mailing List Archive
Less Is More
- DrProject doesn't try to compete with existing mail clients
- No mailboxes, and no way to compose or send messages
- Instead, it creates one mailing list for each project
- Everyone who's a member of a project is automatically on that list
- Only project members can send messages to it
- Mailing lists aren't threaded by topic
- Unnecessary complexity for small projects
Managing Mail Addresses
- Every user must whitelist one or more email addresses with DrProject
- Specifies that mail from that address is allowed
![[Whitelisting an Email Address]](./img/dev02/whitelist_email_address.png)
Figure 27.7: Whitelisting an Email Address
- The opposite of blacklisting
- DrProject will accept mail from any whitelisted address
- User must specify which of those addresses to forward project mail to
- This way, people can send from any external mail account, but all project mail arrives in one place
Issue Tracker
- Now that Ginny is satisfied with Ron's changes, she wants to remind herself what she is supposed to be working on
- As a student, she kept a to-do list in a lab notebook
- And then in the Palm Pilot her grandmother gave her
- But those are easy to lose, and hard to share
- An issue tracker is a tool for managing a shared to-do list
- Each task, problem, or question is represented by a ticket in the database
- Often called a bug tracker
- Essential tool for managing long-lived or multi-person projects
- But only as useful as the information in it
- DrProject's issue tracker is simpler than most, in order to make entering data very easy
- Most large projects use more complicated tools, such as Bugzilla and Roundup
Creating and Viewing Tickets
- Click on “New Ticket” to add a ticket to the database
![[Creating a New Ticket]](./img/dev02/issue_tracker_new_ticket.png)
Figure 27.8: Creating a New Ticket
- To see what tickets are already in the system, follow the “View Tickets” link
![[Viewing Tickets]](./img/dev02/view_tickets.png)
Figure 27.9: Viewing Tickets
- Tickets can be sorted and filtered in various ways
When To Create, How To Use
- Create tickets for:
- All action items from meetings (convert them to tickets immediately)
- Everything that occurs to you while coding that would be a distraction from your flow
- Any missing or erroneous installation instructions
- Any problems you notice while chasing some other bug
- Use tickets to:
- Make sure people are working on the right problems (tackle in priority order)
- Create meeting agendas: going through the open high-priority bugs is a good way to keep people focused
How to Write Tickets
- A badly-written ticket is better than no ticket at all, but not by much
- The summary should be short and informative
- Aim is to help people who are looking at 100 summaries find the ones they care about
- “Bug in seq comp” is bad
- “Sequence comparison returns wrong probabilities for bivalves” is much better
- Description must provide all the information someone needs to know to address the issue
- Software configuration, package versions, operating system, etc.
- Sequence of steps or unit test case that triggers the fault
- Configuration files, input data, screenshots of the fault, etc.
- Remember, it may be weeks or months before the issue is addressed
Other Fields
- What type of ticket it is:
- Defect (i.e., a bug)
- Enhancement (feature request)
- Some other kind of task (e.g., a question that needs to be answered)
- How important it is
- When it needs to be finished
- See the discussion of milestones below
- Who is responsible
- Leave this as “nobody” if it's someone else's job to assign work
- Any keywords that might help with later searches
Updating Tickets
- Most important thing about a ticket is whether it is open or closed
- Number of open tickets shows how much work needs to be done
- And the rate at which new tickets are being created is a good indication of whether the project is stabilizing or not
- It's not unusual to re-open tickets after they've been closed
- E.g., the bug wasn't actually fixed, or the question wasn't completely answered
- Next most important thing is who the ticket is assigned to
- Default view in DrProject only shows those tickets that are assigned to you
- Add a comment to the ticket every time you change something in it
- Like commenting on changes you submit to version control
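The ticket fields and update rules described above map naturally onto a small record type. This is a sketch only: the field names, defaults, and the `update` method are assumptions made for illustration and are not DrProject's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Ticket:
    number: int
    summary: str                 # short and informative
    description: str             # everything needed to address the issue
    kind: str = "defect"         # "defect", "enhancement", or "task"
    priority: str = "medium"
    milestone: str = ""          # symbolic name, not a date
    owner: str = "nobody"        # left for whoever assigns work
    keywords: list = field(default_factory=list)
    is_open: bool = True
    comments: list = field(default_factory=list)

    def update(self, comment, **changes):
        """Change one or more fields, always recording why."""
        if not comment:
            raise ValueError("every change needs a comment")
        for name, value in changes.items():
            if not hasattr(self, name):
                raise AttributeError(name)
            setattr(self, name, value)
        self.comments.append(comment)


ticket = Ticket(
    22,
    "Sequence comparison returns wrong probabilities for bivalves",
    "Fails on the sample input attached to this ticket",
)
ticket.update("Ginny will look at this", owner="ginny")
ticket.update("Believed fixed", is_open=False)
ticket.update("Still fails for empty sequences", is_open=True)  # re-opened
```

Forcing a comment on every change mirrors the advice to treat ticket updates like commits to version control.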
Roadmap and Milestones
- Group tickets according to milestones (due dates)
- Milestone names are initially symbolic, like “version 2 beta release”
- That way, when the date changes, you don't have a milestone called “April 1” whose due date is May 15
- The roadmap lists all future milestones
![[Roadmap]](./img/dev02/roadmap.png)
Figure 27.10: Roadmap
- Bars show what fraction of tickets belonging to each have been closed
- Which is a good way to see if you're going to make the deadline or not…
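The roadmap bars are just the fraction of closed tickets per milestone. A minimal sketch, with made-up ticket data and a text bar standing in for the graphical one:

```python
from collections import defaultdict

# (milestone, is_open) for each ticket: invented data.
tickets = [
    ("version 2 beta release", False),
    ("version 2 beta release", False),
    ("version 2 beta release", False),
    ("version 2 beta release", True),
    ("version 2 final release", False),
    ("version 2 final release", True),
]


def milestone_progress(ticket_list):
    """Fraction of tickets closed for each milestone."""
    total = defaultdict(int)
    closed = defaultdict(int)
    for milestone, is_open in ticket_list:
        total[milestone] += 1
        if not is_open:
            closed[milestone] += 1
    return {m: closed[m] / total[m] for m in total}


def bar(fraction, width=20):
    """Render a fraction as a fixed-width text progress bar."""
    filled = round(fraction * width)
    return "#" * filled + "-" * (width - filled)


progress = milestone_progress(tickets)
```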
Priorities And Triage
- Whole team must agree on what ticket priorities mean in order for them to be useful
- Low: cosmetic UI problem, minor workflow annoyance with obvious workaround, etc.
- High: program crashes, returns wrong result, inadvertently launches nuclear assault on Iceland, etc.
- Medium: anything in between
- In the weeks and days leading up to a release, perform regular triage
- Move low-priority tickets to the next milestone
- Make sure someone is working on every high-priority one…
- …or that there's a workaround for the problem, or that it's mentioned in the release notes
Workflow
- Most issue trackers support (or impose) more complex workflow than DrProject's
- The larger the team, the more structure is required
- For example:
![[A Workflow for Larger Teams]](./img/dev02/complex_workflow.png)
Figure 27.11: A Workflow for Larger Teams
- Ticket is initially unassigned
- Project manager allocates it (and specifies when it is to be completed)
- Developer changes its status to “in progress” when she starts work
- Marks it “ready for test” when work is done
- QA marks it “ready for integration” once it passes tests
- Build manager closes it once it's in the installer
- It can also be suspended, rejected as a duplicate or irreproducible, re-opened, etc.
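A workflow like the one above is a small state machine: each status allows only certain next statuses. The sketch below follows the steps listed in the text; the status name "assigned" and the exact transitions for suspending, rejecting, and re-opening are assumptions, since the figure is the authority on those.

```python
# Allowed next states for each ticket status (partly assumed; see the note above).
TRANSITIONS = {
    "unassigned": {"assigned", "rejected"},
    "assigned": {"in progress", "suspended", "rejected"},
    "in progress": {"ready for test", "suspended"},
    "ready for test": {"ready for integration", "in progress"},
    "ready for integration": {"closed"},
    "suspended": {"assigned"},
    "rejected": {"assigned"},   # re-opened
    "closed": {"assigned"},     # re-opened
}


def advance(state, new_state):
    """Return new_state if the move is allowed, otherwise raise ValueError."""
    if new_state not in TRANSITIONS[state]:
        raise ValueError("cannot go from %r to %r" % (state, new_state))
    return new_state


# The path a ticket takes when nothing goes wrong.
state = "unassigned"
for step in ["assigned", "in progress", "ready for test",
             "ready for integration", "closed"]:
    state = advance(state, step)
```

Keeping the rules in a table rather than in scattered `if` statements makes it easy to loosen or tighten the process as the team changes size.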
Wiki
- The only high-priority ticket assigned to Ginny is to design the new data formatter
- She has some ideas, but wants to share them with the rest of the team before she starts coding
- She doesn't want to start another meandering email discussion
- DrProject provides each project with a wiki
- A simple web-based whiteboard
- Good for storing meeting minutes, developer-oriented documentation, etc.
Wiki Syntax
- Clicking the “Edit” link on a wiki page brings up a text edit box
![[Editing a Wiki Page]](./img/dev02/wiki_edit.png)
Figure 27.12: Editing a Wiki Page
- Wiki syntax is simpler than standard HTML
- Blank lines separate paragraphs
- Any word in CamelCase is automatically interpreted as a link to a page with that name
- = Title = creates a level-1 heading, == Subtitle == creates a level-2 heading, etc.
- Indented lines beginning with * become a point-form list
- Anything ending in .png, .jpg, or .gif is automatically displayed as an image
Saving Changes
- Once you've made changes, you can preview the page, or commit the changes
- Pages are saved in the database, rather than in Subversion
- Translated from wiki syntax into HTML when the page is viewed
- Note: no conflict resolution mechanism
- If two people try to edit a wiki page at the same time, the second one to commit will be denied
- So put large and/or frequently-updated documents under version control instead
Tying It All Together
- Why use a wiki, rather than storing documents in Subversion?
- Because wiki syntax offers shortcuts for referring to everything else in the project
- #22 links to ticket 22
- [94] links to revision 94 in the version control repository
- source:path/to/filename.txt links to a particular file in the Subversion repository
- source:path/to/filename.txt#94 links to a particular version of the file
- @41 links to email message 41
- You can use this same syntax (almost) everywhere:
- Tickets (to refer to changesets, file versions, email messages, etc.)
- Email messages (as long as you send plain text, rather than HTML)
- Subversion commit comments
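To make the shortcut syntax concrete, here is a rough sketch of how such references could be turned into hyperlinks. It is not DrProject's implementation: the URL layout (`/ticket/`, `/changeset/`, `/source/`, `/mail/`) is invented, and a real translator would also need to escape HTML and avoid false matches such as email addresses.

```python
import re

# One pattern with alternatives, so that the "#94" inside a source: link
# is never mistaken for a ticket reference.
LINK = re.compile(
    r"source:(?P<path>[\w./-]+)(?:#(?P<filerev>\d+))?"
    r"|#(?P<ticket>\d+)"
    r"|\[(?P<rev>\d+)\]"
    r"|@(?P<mail>\d+)"
)


def _to_link(match):
    """Build an anchor tag for whichever kind of reference matched."""
    if match.group("path"):
        url = "/source/" + match.group("path")
        if match.group("filerev"):
            url += "?rev=" + match.group("filerev")
    elif match.group("ticket"):
        url = "/ticket/" + match.group("ticket")
    elif match.group("rev"):
        url = "/changeset/" + match.group("rev")
    else:
        url = "/mail/" + match.group("mail")
    return '<a href="%s">%s</a>' % (url, match.group(0))


def linkify(text):
    """Replace every shortcut reference in text with a hyperlink."""
    return LINK.sub(_to_link, text)
```

Because the same function can be run over wiki pages, ticket text, plain-text email, and commit comments, one set of shortcuts works (almost) everywhere.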
Rules of the Road
- Every project is different, but most successful ones follow a few common-sense rules
- Be polite: the rest is details
- Keep email messages short and to the point
- Quote only as much of preceding messages as you absolutely have to
- Change the subject line to reflect topic drift
- Avoid “me too!” messages
- If it's clear that discussion isn't going to reach a consensus, call a vote
- Just calling a vote can help clarify the question
- +1 for “yes”, 0 for “don't care”, and -1 for “over my dead body”
- Decide in advance whether majority wins, or whether any -1's constitute a veto
More Rules
- Subversion commit comments should mention tickets
- E.g., "Fixes #456: normalizing bivalve sequences before comparison"
- Changeset IDs should appear in ticket comments
- It only takes a few keystrokes to add this information, but the payoff is tremendous
- When in doubt, think about what will make things easier to search for
- E.g., add keywords to tickets to identify project components, platforms, etc.
Summary
- Ginny is now busy writing up her design ideas
- Which DrProject will automatically hyperlink to old tickets, source files, and other pages
- As her team grows, it may need something larger than DrProject
- Which one you use is much less important than the fact that you use something
- And that you remember that technology is no substitute for politeness, good will, and common sense
Exercises
Exercise 27.1:
Can you find out what bugs are currently being worked on?
What feature requests have been deferred? Which files were changed
to fix a problem? What fixes are currently being tested? How long it
took to fix/implement something?
Exercise 27.2:
What is the status of the overnight build? The overnight
regression tests? The issue database? The team's
discussions?
Backward, Forward, and Sideways
Introduction
- This course has introduced you to the skills and tools that differentiate productive programmers from unproductive ones
- But we've really just scratched the surface
- This lecture looks at a few of the next steps you might want to take
Classic Mistakes
- People
- Adding people to a late project (just makes it later [Brooks 1995])
- Relying on heroics (don't scale [Robinson 2005])
- Lack of support from project sponsors
- Lack of user input
- Silver bullet syndrome
- Product
- Process
- Overly optimistic schedules
- Remember, products take three times longer to create than programs
- Short-changing upstream activities
- “We have to start coding right away, or we won't have time to fix all our bugs.”
- Failure to track progress
- Abandoning plan under pressure
Branching, Merging, and Tagging
- Sometimes want to work on several different versions of software at once
- Example: need to do bug fixes on Version 3 while making incompatible changes toward Version 4
- Or want two sets of developers to be able to write and test large changes independently, then put things back together
- All modern version control systems allow you to branch a repository
- Create a “parallel universe” which is initially the same as the original, but which evolves independently
- Can later merge changes from one branch to another
![[Branching and Merging]](./img/summary/branch_and_merge.png)
Figure 28.1: Branching and Merging
- Also common to create tags
- Symbolic labels that identify particular revisions, such as “Release_2.0”
- Makes it easy to go back to an important revision later
Managing Branches
- Much better than just copying all the source files
- The version control system remembers where the branch came from, and can trace its history back
- Example: fix a bug on one branch, merge the changes into other branches that have the same bug
- Warning: many people become over-excited about branching when they first start to use it
- Keeping track of what's going on where can be a considerable management overhead
- On a small project, very rare to need more than two active branches
Patching
- Often need a way to send or archive differences between two versions of a program
- E.g., someone finds and fixes a bug in your open source software…
- …but doesn't have permission to commit the change to your version control repository
- Common to use the patch program to do this
- Takes the output of diff and applies it to the original file to produce the modified file
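To see what a patch actually contains, the sketch below uses Python's `difflib` to produce the same unified format that `diff -u` emits. The file name `shapes.py` and its contents are made up for illustration.

```python
import difflib

# The original file and the contributor's fixed version, as lists of lines.
original = [
    "def area(width, height):\n",
    "    return width + height\n",
]
fixed = [
    "def area(width, height):\n",
    "    return width * height\n",
]

# A unified diff records only what changed, plus a little context to locate it.
patch_text = "".join(
    difflib.unified_diff(
        original, fixed, fromfile="a/shapes.py", tofile="b/shapes.py"
    )
)
```

Saved to a file, text like this is what a contributor would send you, and what you would feed to `patch` (typically as `patch -p1 < fix.diff` from the top of the source tree) to apply the change to your copy.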
A Better Way to Build
- Said back in Automated Builds that Make has turned into a clumsy programming language
- The same thing happens eventually to every other build management tool
- So why not start with a real programming language, and embed a build management tool in that?
- SCons combines the most useful features of Make with the full power of Python
- Instead of a Makefile, you write an SConstruct file
- Use function calls in that file to tell SCons what you want to build, and how
- Like Make, it has a rich set of default rules
- But when you want to do something complicated, you can use any feature of Python you want
- E.g., build up a list of filenames, fetch a build rule from a database, etc.
SCons Example
- Example: build either a normal or a debugging version of a program, and include an extra source file on Windows

    import sys

    # What does the program depend on?
    dependencies = ['file1.c', 'file2.c']
    if sys.platform == 'win32':
        # Windows builds need one extra source file
        dependencies.append('win32.c')

    # Which version are we building?
    if 'debug' in COMMAND_LINE_TARGETS:
        Program('hello_dbg', dependencies)
    else:
        Program('hello', dependencies)

- Pro:
- Don't have to learn a new language
- And the authors of the build tool don't have to create and maintain one, either
- There's a debugger
- Con:
- Much less widely used than Make or Ant
Persistence
- Often want to save the state of a running program
- E.g. checkpoint a long-running program so that it can be restarted in case of a crash
- Don't want to have to rewrite this code every time the program changes
- Use a persistence framework like Python's pickle module
- Walk through the objects in the program
- Save the atomic values (integers, strings, etc.) as they are
- Write lists, dictionaries, and other collections in a standard way
- Classes can define special methods to tell the framework what values to save
Pickling Example
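The original example for this slide is not available, so what follows is a stand-in sketch rather than the course's own code. It checkpoints a hypothetical `Simulation` object with `pickle`, and uses the special methods `__getstate__` and `__setstate__` to tell the framework which values to save.

```python
import pickle


class Simulation:
    """A made-up long-running computation that we want to checkpoint."""

    def __init__(self, step, positions):
        self.step = step
        self.positions = positions
        self.cache = {}  # derived values that are cheap to rebuild

    def __getstate__(self):
        # Save everything except the cache.
        state = self.__dict__.copy()
        del state["cache"]
        return state

    def __setstate__(self, state):
        # Restore the saved values, then start with an empty cache.
        self.__dict__.update(state)
        self.cache = {}


sim = Simulation(step=1200, positions=[0.5, 1.25, 3.0])
sim.cache["energy"] = 42.0

checkpoint = pickle.dumps(sim)        # bytes that could be written to a file
restored = pickle.loads(checkpoint)   # what a restart after a crash would do
```

Integers, floats, and the list are saved without any extra code; only the decision to skip `cache` had to be written by hand. Note that unpickling data from an untrusted source is unsafe, since it can execute arbitrary code.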
Object-Relational Mapping
- Increasingly popular alternative is to use an object/relational mapping framework like SQLObject
- Define mapping between classes and database tables
- Framework then creates code to translate objects to rows and back
- Usually keeps a cache of recently-used objects to improve performance
![[]](./img/summary/orm.png)
- Pro:
- Takes advantage of databases' strengths (high performance, concurrency control, etc.)
- Con:
- Objects don't naturally fit into tables
- E.g., how to represent many-to-many relationships?
- Can be very hard to debug when things go wrong
- I.e., when the generated code is doing what you said, but you asked it to do the wrong thing
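To show what a mapping framework automates, here is a hand-written sketch of the same idea using only `sqlite3`. It deliberately does not use SQLObject's API; the `Experiment` class, the `ExperimentMapper` class, and the table layout are all invented for illustration.

```python
import sqlite3


class Experiment:
    """A plain object that we want to store as one row of a table."""

    def __init__(self, name, trials, ident=None):
        self.name = name
        self.trials = trials
        self.ident = ident


class ExperimentMapper:
    """Translates Experiment objects to rows and back, with a small cache."""

    def __init__(self, connection):
        self.conn = connection
        self.cache = {}
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS experiment "
            "(id INTEGER PRIMARY KEY, name TEXT, trials INTEGER)"
        )

    def save(self, obj):
        cursor = self.conn.execute(
            "INSERT INTO experiment (name, trials) VALUES (?, ?)",
            (obj.name, obj.trials),
        )
        obj.ident = cursor.lastrowid
        self.cache[obj.ident] = obj
        return obj.ident

    def load(self, ident):
        if ident in self.cache:  # recently used: skip the database
            return self.cache[ident]
        row = self.conn.execute(
            "SELECT name, trials FROM experiment WHERE id = ?", (ident,)
        ).fetchone()
        if row is None:
            return None
        obj = Experiment(row[0], row[1], ident)
        self.cache[ident] = obj
        return obj


connection = sqlite3.connect(":memory:")
mapper = ExperimentMapper(connection)
saved_id = mapper.save(Experiment("bivalve comparison", 12))
```

A real framework generates the equivalent of `save` and `load` from the class definition, which is convenient until the generated SQL does something you did not expect.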
Web Development Frameworks
- Handwritten CGI applications are going out of style
- We've learned enough since the 1990s to write code at a higher level
- One widely used alternative is Java servlets
- User writes a class whose methods do application-specific work
- The servlet container loads the class, and calls those methods
- Ruby on Rails is another popular choice
- Python has several similar frameworks, such as Django and TurboGears
- But none are as widely used or as well documented
- Competition isn't always healthy…
Refactoring
- Refactoring is the process of cleaning up code
- Often described in terms of “code smells” and corresponding cures
- See [Fowler 1999] for a comprehensive catalog
- Very important to have unit tests in place before starting to refactor
- Without this, you have no way of knowing what else your refactoring might have broken
- [Feathers 2005] is an excellent guide to how to fit useful tests back onto inherited applications
Refactoring Examples
- Smell: method or function body runs to several pages
- Cure: use Extract Method to break the method into smaller meaningful pieces
- But do not break up arbitrarily, just to satisfy coding conventions
- Each new method should make sense on its own
- Smell: function or method with many parameters
- If a method takes eleven strings as parameters, sooner or later you'll pass them in the wrong order
- Cure: store values as members of the object, rather than passing them as parameters
- But it's bad to introduce members called param1 and param2
- Alternative cure: introduce a new object that combines parameters into one value
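The "alternative cure" above, combining parameters into one object, might look like this in Python. The scoring function and the `Scoring` class are invented examples, not code from the course.

```python
from dataclasses import dataclass


# Before: three integers in a row are easy to pass in the wrong order.
def score_columns(a, b, match, mismatch, gap):
    total = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


# After: the related values travel together under meaningful names.
@dataclass
class Scoring:
    match: int = 1
    mismatch: int = -1
    gap: int = -2


def score_alignment(a, b, scoring):
    total = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            total += scoring.gap
        elif x == y:
            total += scoring.match
        else:
            total += scoring.mismatch
    return total


before = score_columns("ACG-", "ACTT", 1, -1, -2)
after = score_alignment("ACG-", "ACTT", Scoring())
```

The behavior is unchanged, which is exactly what the unit tests written before refactoring should confirm.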
More Refactoring Examples
- Smell: duplicated code
- Cure: use Extract Method once again
- But what if code is only almost duplicated?
- Use Introduce Parameter to give callers a way to signal exactly what they want…
- …or use Pull Up Method to move shared code into parent…
- …and Form Template Method to have that shared code call something that each child class defines
- Smell: complex Boolean expressions in conditionals
- Cure: Introduce Explaining Variable to give sub-expressions meaningful names
- Particularly effective when it's used to turn nested if-then-elses into a lookup table
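Introduce Explaining Variable is easiest to appreciate side by side. In this sketch the rule for who may close a ticket is made up; the point is only that the second version names each sub-expression.

```python
# Before: one long Boolean expression that the reader must untangle.
def may_close_before(user, ticket):
    if (ticket["is_open"]
            and (user["name"] == ticket["owner"] or user["role"] == "manager")
            and not (ticket["priority"] == "high" and not ticket["tested"])):
        return True
    return False


# After: each sub-expression has a name that says what it means.
def may_close_after(user, ticket):
    is_responsible = (
        user["name"] == ticket["owner"] or user["role"] == "manager"
    )
    needs_testing = ticket["priority"] == "high" and not ticket["tested"]
    return ticket["is_open"] and is_responsible and not needs_testing
```

Both functions return the same answer for every input; the second is simply easier to review and to change.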
Refactoring Tools
- Refactoring complements design patterns
- But there's a difference: modern IDEs can do refactoring for you
- Highlight a method, say “Rename”, and the IDE finds and changes all the calls
- Able to do this because it continuously re-parses source code as you type
- Move methods up into parent class, split classes in two, and much, much more
- If this doesn't convince you to upgrade from a dumb editor, I don't know what will…
Code Reviews
- Code reviews are more effective at finding bugs than testing [Fagan 1986]
- A consequence of Boehm's Law: the earlier you find a problem, the cheaper it is to fix
- Unfortunately, very little has been written about how to read code
- Many open source projects require that changes be reviewed before being committed
- Not just for finding bugs: want to make sure that:
- Team members are following style guidelines
- They've implemented what they're supposed to
- You understand how it works well enough to maintain it when they're gone
- Diminishing returns as programmers become more experienced, and more familiar with a project's idioms
- But they're a great way for new team members to learn their way around
- And an equally great way for experienced members to catch newcomers' mistakes early
Reading Code
- Print it out: paper has 4-10 times the resolution of even the best screen
- And print out the ticket the work was done to resolve
- Sit somewhere comfortable
- Away from email, the web, your pager, and other distractions
- Trace execution
- Find an entry point (like main)
- Skip over argument parsing, file I/O, etc.
- Put a question mark beside everything that doesn't immediately make sense
- Draw pictures of data structures, data flow, etc.
- Once you're done, go back and try to answer your questions
- But don't cross them off: if you didn't understand it when you first read it, it needs to be clarified
- Finally, jot down overall comments
Code Review Checklist
- Is the code documented in a consistent and readable way?
- Are file, class, method, parameter, and variable names descriptive?
- Are all function inputs used? Are all required function outputs produced?
- Is the flow of control easy to follow? What about the class structure?
- Do conditionals cover all cases? Do all loops have an exit condition? Do they handle the zero-pass case?
- Do functions check that their inputs are valid?
- Do callers check return values for errors, or handle exceptions that might be thrown?
- Are errors handled in a standard manner? Are error messages descriptive and helpful?
- Are there any magic numbers or machine-specific filenames?
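A few of the checklist items can be seen in one small sketch. The functions, the limit, and the messages below are hypothetical, chosen only to show input validation, the zero-pass case, a named constant in place of a magic number, and a caller that handles the exception.

```python
MAX_READINGS = 1000  # named constant instead of a magic number (value is illustrative)


def mean_reading(readings):
    """Return the mean of `readings`, or None when there are none."""
    if readings is None:
        raise ValueError("readings must be a sequence, not None")
    if len(readings) > MAX_READINGS:
        raise ValueError(
            f"expected at most {MAX_READINGS} readings, got {len(readings)}")
    if not readings:
        return None  # zero-pass case: nothing to average
    total = 0.0
    for value in readings:
        total += value
    return total / len(readings)


def summarize(readings):
    """Caller that handles the error instead of ignoring it."""
    try:
        result = mean_reading(readings)
    except ValueError as err:
        return f"cannot summarize: {err}"
    return "no data" if result is None else f"mean = {result:.2f}"


print(summarize([1, 2, 3]))
print(summarize([]))
print(summarize(None))
```

A reviewer working through the checklist can tick off each question against a specific line, which is the point of writing code this way.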
User Interface Design
- Doesn't matter what your software does if no one can use it
- Usability and user interface design are not black arts
- Need a certain amount of natural talent to be a great UI designer
- But anyone who can learn to program can learn to avoid basic mistakes
- Basic rules from [Johnson 2000]:
- Focus on the users and their tasks, not the technology
- Consider function first, presentation later
- Conform to the users' view of the task
- Don't make anything more complicated than it already is
- Make it easy for users to learn new things
- Deliver information, not just data
- Design the interface to be responsive
- Try it out on users, then fix it!
Paper Prototyping
- Paper prototyping is the fastest and cheapest way to design an interface
- Create rough sketches showing menus, buttons, etc.
- Use Post-It notes for pulldowns
- Keep it rough: the more polished it is, the less likely people are to give you critical feedback
- Find a volunteer, then play computer
- Set them a task
- Show them what happens when they click, type, etc.
- Do not answer questions or provide hints, or get into a discussion of how to fix things
- Two or three sessions will be enough to tell you what you need to fix
Where To Go Next
- [Brand 1995] looks at how buildings can be designed to change and grow gracefully over time
- Everything he says is directly applicable to large programs
- [Steele 1999] shows you how a great computer scientist thinks about a particularly hard problem
- [Margolis & Fisher 2002] describes a project at Carnegie-Mellon University aimed at making Computer Science a more congenial environment for women and other underrepresented groups
- Note: while the gender ratio in computing as a whole is about six to one, the ratio in open source is closer to 200:1
- Internet Groupware for Scientific Collaboration talks about how much more the web could do for scientists
The Rules
- A week of hard work can sometimes save you an hour of thought.
- Anything worth repeating is worth automating.
- Anything repeated in two or more places will eventually be wrong in at least one.
- The three chief virtues of a programmer are laziness, impatience, and hubris.
- It's not what you know, it's what you can.
- The deadline isn't when you're supposed to finish; the deadline is when it starts to be late.
- Never debug standing up.
- Tools are signposts, not destinations.
- Not everything worth doing is worth doing well.
- Code unto others as you would have others code unto you.
- Every complex file format eventually turns into a badly-designed programming language.
- Tools are amplifiers: they allow good programmers to be better, and bad ones to be worse.
- They call it computer science because it's experimental.
- Programs come and go; data is forever.
- There's no such thing as one program.
- Discipline matters more than genius; reality matters more than rulebooks.
Conclusion
- “A good teacher has a value above pearls, but a good student has a value above rubies.”
- Thank you, and good luck
Acknowledgments
Support
- The Python Software Foundation, for the grant that made this work possible
- The University of Toronto, for letting me test this version of this course on its students
- YesLogic and WingIDE, for generously donating licenses for Prince (a spiffy XML-to-PDF converter) and Wing (an equally spiffy Python development environment) respectively
- The creators of Apache, Cygwin, DrProject, Firefox, Gnumeric, Make, Python, SQLite, Subversion, and all the other fine open source tools that made this course possible
Major Contributors
- Greg Wilson, who wrote the first version of these notes
- Brent Gorda, who helped create an early version of this course
- Andrew Lumsdaine and Peter Gottschling, who beta-tested this course in the fall of 2005
- Adam Goucher, who critiqued every lecture in detail
- Nick Discenza, for the diagrams
- Jeff Strunk, and the other good folks at Enthought, for providing a permanent home for this material
Comments and Corrections
| Ranjan Abhishek | Donald Altman | Jorge Aranda | David Ascher |
| Nick Barnes | Hossein Bidhendi | Cornelia Boldyreff | Stephane Bortzmeyer |
| José Brandao-Neto | Titus Brown | Josh Calahan | Guido Carballo |
| Ralph Corderoy | Jim Cordy | Michelle Craig | Mike Davis |
| Sean Dawson | Martin de Lasa | Jim DeWees | Tom Diamond |
| Simon Duane | Paul Dubois | Neil Ernst | Isaac Ezer |
| Hans Fangohr | Eric Firing | Mike Firth | Karl Fogel |
| Grig Gheorghiu | Brent Gorda | Peter Gottschling | Adam Goucher |
| Steve Graham | Steve Graham | Perry Greenfield | Paul Gries |
| Nick Groll | Alan Grosskurth | Goran Gugic | Danny Heap |
| Randy Heiland | Michael Hoffman | Jeremy Hoisak | Steve Holden |
| Frank Horowitz | Bowen Hui | Kent Johnson | Steven Johnson |
| Calahan Josh | Brandon King | Niels Klitgord | Harald Koch |
| Ryan Krauss | Deanna Langer | Chris Lasher | Christopher Lenz |
| Christian Lessig | Catherine Letondal | Michelle Levesque | Yonggang Liu |
| Vinicius Lobosco | Gary Loescher | Steve Loughran | Stephanie Ludi |
| Andrew Lumsdaine | Neil MacDonald | Laurie MacDougall | Edoardo Marcora |
| Ryan Maw | Gary McGraw | Luke McKinney | Christian Meesters |
| Keir Mierle | Simona Mindy | Ken Miura | Matthew Moelter |
| Andrew Mole | Kit-Sun Ng | Stephan Nies | Dirkjan Ochtman |
| Chris Poirier | Victor Putz | Irving Reid | Karen Reid |
| Michael Rennie | Arnold Rosenbloom | Mario Ruggier | Paul Salvini |
| Oliver Sander | Herb Schilling | Erich Schwarz | Diomidis Spinellis |
| Bill Spotz | Boris Steipe | James Stovall | Nick Stuifbergen |
| Jonathan Taylor | Diane Trout | Nicky Van Foreest | Tom Van Vleck |
| Jim Vickroy | Kristina Visscher | James White | Peter Wilkinson |
| Blake Winton | Dave Wortman | Kai Zhuang | |
Prior Art
- We didn't know it at the time, but [Hammond 1994] invented the term “software carpentry” several years before this course was conceived
Dedication
- For Frank Willison—I'm sorry this one was finished too late for you to tune up.
- And for Charles Darwin, John Scopes, and everyone who believes that the truth is more important than doctrine.
Bibliography
 | [Agans 2002]:
David J. Agans:
Debugging.
American Management Association,
2002,
0814471684.
Its first sentence says, “This book tells you how to find
out what's wrong with stuff, quick,” and that's exactly what
it does. In fifteen (very) short chapters, the author presents
nine simple rules to help you track down and fix problems in
software, hardware, or anything else. His war stories are
entertaining (although I think one or two are urban myths), and
his advice is eminently practical.
|
 | [Andrews & Whittaker 2006]:
Mike Andrews and James A. Whittaker:
How to Break Web Software.
Addison-Wesley,
2006,
0321369440.
This practical companion to [Whittaker 2003] catalogs things you can do to
break web-based applications.
|
 | [Beck & Cunningham 1989]:
Kent Beck and Ward Cunningham:
"A Laboratory for Teaching Object-Oriented Thinking",
SIGPLAN Notices,
vol. 24,
no. 10,
1989.
The first description of CRC cards.
|
 | [Boehm 1988]:
Barry Boehm:
"A Spiral Model of Software Development and Enhancement",
IEEE Computer,
1988.
Boehm's landmark description of spiral software development.
|
 | [Brand 1995]:
Stewart Brand:
How Buildings Learn.
Penguin USA,
1995,
0140139966.
This beautiful, thought-provoking book starts with the observation
that most architects spend their time re-working or extending
existing buildings, rather than creating new ones from scratch. Of
course, if Brand had written “program” instead of
“building”, and “programmer” where he'd
written “architect”, everything he said would have
been true of computing as well. A lot of software engineering
books try to convey the same message about allowing for change,
but few do it so successfully. By presenting examples ranging from
the MIT Media Lab to a one-room extension to a house, Brand
encourages us to see patterns in the way buildings change (or, to
adopt Brand's metaphor, the way buildings learn from their
environment and from use). Concurrently, he uses those insights to
argue that since buildings are always going to be modified, they
should be designed to accommodate unanticipated change.
|
 | [Brooks 1995]:
Frederick P. Brooks:
The Mythical Man Month: Essays on Software Engineering.
Addison-Wesley,
1995,
0201835959.
The classic text in software engineering, most famous for
its discussion of how adding people to a project that's late will
only make it later.
|
 | [Castro 2002]:
Elizabeth Castro:
HTML for the World Wide Web.
Peachpit Press,
2002,
0321130073.
A clean, clear, comprehensive guide to creating HTML for the web,
with good coverage of Cascading Style Sheets (CSS).
|
 | [Castro 2000]:
Elizabeth Castro:
XML for the World Wide Web.
Peachpit Press,
2000,
0201710986.
Like other books in Peachpit's Visual Quickstart series, this one
is beautifully designed, and easy to read without ever being
condescending. Its 16 chapters and 4 appendices are organized into
1- and 2-page explanations of particular topics, from writing
non-empty elements to namespaces, schemas, and XML
transformation. Throughout, Castro strikes a perfect balance
between “what”, “why”, and
“how”, and provides a surprising amount of detail
without ever overwhelming the reader.
|
 | [Chase & Simon 1973]:
W.G. Chase and H.A. Simon:
"Perception in chess",
Cognitive Psychology,
vol. 4,
pp. 55-81,
1973.
The original paper comparing the performance of novice and master
chess players when confronted with actual and random positions.
|
 | [Collins-Sussman et al 2004]:
Ben Collins-Sussman, Brian W. Fitzpatrick, and C. Michael Pilato:
Version Control with Subversion.
O'Reilly,
2004,
0596004486.
A good tutorial and reference guide for Subversion, which is also available on-line as
Version Control with Subversion.
|
 | [Doar 2005]:
Matt Doar:
Practical Development Environments.
O'Reilly,
2005,
0596007965.
Matt Doar has produced a practical guide to what should be in
every team's toolbox, how competing entries stack up, and how they
ought to be used. This book covers everything from configuration
management tools like CVS and Subversion, to build
tools (make, GNU's Autotools, Ant, Jam, and SCons), various
testing aids, bug tracking systems, documentation generators, and
we're still only at the halfway mark. He names names, provides
links, and treats free and commercial offerings on equal terms.
My copy currently has 28 folded-down corners, which is 28 more
than most books get.
|
 | [Eick et al 2001]:
Stephen G. Eick, Todd L. Graves, Alan F. Karr, J.S. Marron, and Audris Mockus:
"Does Code Decay? Assessing the Evidence from Change Management Data",
IEEE Transactions on Software Engineering,
vol. 27,
no. 1,
2001.
Analyzes the evolution of several million lines of telephone
switching software over fifteen years to show that code quality,
comprehensibility, and maintainability decline over time.
|
 | [Fagan 1986]:
Michael E. Fagan:
"Advances in Software Inspections",
IEEE Transactions on Software Engineering,
vol. 12,
no. 7,
1986.
Empirical data showing that code reviews are the most effective
way known to find bugs.
|
 | [Fehily 2006]:
Chris Fehily:
Python.
Peachpit Press,
2006,
0321423135.
A gentle introduction to Python, beautifully typeset, with lots of
helpful examples.
|
 | [Fehily 2003]:
Chris Fehily:
SQL.
Peachpit Press,
2003,
0321118030.
This very readable book describes the subset of SQL that covers
most real-world needs. While the book moves a little slowly in
some places, the examples are exceptionally clear.
|
 | [Feldman 1979]:
Stuart I. Feldman:
"Make—A Program for Maintaining Computer Programs",
Software: Practice and Experience,
vol. 9,
no. 4,
pp. 255-265,
1979.
The original description of Make. Last time I checked, Stu
Feldman was a vice president at IBM, which shows you just how far
a good tool can take you…
|
 | [Feathers 2005]:
Michael C. Feathers:
Working Effectively with Legacy Code.
Prentice-Hall PTR,
2005,
0131177052.
Most programmers spend most of their time fixing bugs, porting to
new platforms, adding new features—in short, changing
existing code. If that code is exercised by unit tests, then
changes can be made quickly and safely; if it isn't, they can't,
so your first job when you inherit legacy code should be to write
some. That's where this book comes in. Want to know three
different ways to inject a test into a C++ class without changing
the code? They're here. Want to know which classes or methods to
focus testing on? Read his discussion of pinch points. Need to
break inter-class dependencies in Java so that you can test one
module without having to configure the entire application? That's
in here too, along with dozens of other useful bits of
information. Everything is illustrated with small examples, all
of them clearly explained and to the point. There are lots of
simple diagrams, and a short glossary; all that's missing is hype.
|
 | [Fogel 2005]:
Karl Fogel:
Producing Open Source Software.
O'Reilly,
2005,
0596007590.
A community is more than just a bunch of people. It's a shared
set of values, and rules for how to behave. By this standard, the
open source community isn't just what some programmers choose to
do with their time, and why; it's also how they do it.
This book is an excellent guide to that “how”. Every
page offers practical advice; every point is made clearly and
concisely, and clearly draws upon the author's extensive personal
experience. Want to know how to earn commit privileges on a
project? It's here. Do you and other project members have
irreconcilable differences? Fogel explains when and how to fork,
and what the pros and cons are. Want to get your project more
attention? Want to take something closed, and open it up? It's
all here, and much more.
|
 | [Fowler 1999]:
Martin Fowler:
Refactoring.
Addison-Wesley Professional,
1999,
0201485672.
Like architects, most programmers spend most of their time
renovating, rather than creating something completely new on a
blank sheet of paper. This book presents and analyzes patterns
that come up again and again when programs are being
reorganized. Some of these are well-known, such as placing common
code in a utility method. Others, such as replacing temporary
objects with queries, or replacing constructors with factory
methods, are subtler, but no less important. Each entry includes a
section on motivation, the mechanics of actually carrying out the
transformation, and an example in Java.
|
 | [Friedl 2002]:
Jeffrey E. F. Friedl:
Mastering Regular Expressions.
O'Reilly,
2002,
0596002890.
The definitive programmer's guide to regular expressions.
|
 | [Gamma et al 1995]:
Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides:
Design Patterns.
Addison-Wesley,
1995,
0201633612.
The book that started the software design patterns movement. Much
of the discussion has been superseded by more recent books, and
the use of C++ and Smalltalk for examples feels a little dated,
but it is still a landmark in programming.
|
 | [Glass 2002]:
Robert L. Glass:
Facts and Fallacies of Software Engineering.
Addison-Wesley Professional,
2002,
0321117425.
I really wish someone had given me something like this book when I
took my first programming job. If nothing else, it would have been
a better way to start thinking about the profession I had stumbled
into than the “everybody knows” factoids that I soaked
up at coffee time. Some of what he says is well-known: good
programmers are up to N times better than bad ones (his value for
N is 28), reusable components are three times harder to build than
non-reusable ones, and so on. Other facts aren't part of the
zeitgeist, though they should be. For example, most of us know
that maintenance consumes 40-80% of software costs, but did you
know that roughly 60% of that is enhancements, rather than bug
fixes? Or that if more than 20-25% of a component has to be
modified, it is more efficient to re-write it from scratch? Best
of all, Glass backs up every statement he makes with copious
references to the primary literature; if you still disagree with
him, you'd better be sure you have as much evidence for your point
of view as he has for his.
|
 | [Goerzen 2004]:
John Goerzen:
Foundations of Python Network Programming.
APress,
2004,
1590593715.
This book looks at how to handle several common protocols,
including HTTP, SMTP, and FTP. Goerzen doesn't delve deeply
into their internals, but focuses instead on how to build clients
that use them. His approach is to build solutions to complex
problems one step at a time, explaining each addition or
modification along the way. He occasionally assumes more
background knowledge than most readers of this book are likely to
have, but only occasionally, and makes up for it by providing both
clear code, and clear explanations of why this particular function
has to do things in a particular order, or why that one really
ought to be multithreaded.
|
 | [Good 2005]:
Nathan A. Good:
Regular Expression Recipes.
APress,
2005,
159059441X.
A great how-to for regular expressions, with examples in many
different languages.
|
 | [Gunderloy 2004]:
Mike Gunderloy:
Coder to Developer.
Sybex,
2004,
078214327X.
This practical, readable book is subtitled “Tools and
Strategies for Delivering Your Software”, and that's exactly
what it's about. Project planning, source code control, unit
testing, logging, and build management are all there. Importantly,
so are newer topics, like building plugins for your IDE, code
generation, and things you can do to protect your intellectual
property. Everything is clearly explained, and illustrated with
well-chosen examples. While the focus is definitely on .NET,
Gunderloy covers a wide range of other technologies, both
proprietary and open source. I'm already using two new tools based
on references from this book, and plan to make the chapter on
“Working with Small Teams” required reading for my
students.
|
 | [Hammond 1994]:
Nick Hammond:
"Software Carpentry --- A Tool-Based Approach to Monte Carlo Radiation Transport",
Proc. 8th Int'l Conference on Radiation Shielding,
1994.
A prior use of the phrase “software carpentry”.
|
 | [Harold 2004]:
Elliotte Rusty Harold:
Effective XML.
Addison-Wesley,
2004,
0321150406.
This book explains which of XML's many features should be used
when: Item 12 tells you to store metadata in attributes, and then
spends six pages explaining why, while Item 24 analyzes the
strengths and weaknesses of various schema languages, and Item 38
covers character set encodings. It's more than most developers
will ever want to know, but when you need it, you really need it.
|
 | [Hock 2004]:
Roger R. Hock:
Forty Studies that Changed Psychology.
Prentice Hall,
2004,
0131147293.
In forty short chapters, Hock describes the turning points in our
understanding of how our minds work. The book isn't just about
psychology; you'll also learn a lot about how science gets done,
and about the scientists who do it.
|
 | [Humphrey 1996]:
Watts S. Humphrey:
Introduction to the Personal Software Process.
Addison-Wesley,
1996,
0201548097.
A methodology for improving programmers' productivity by having
them record and track just about everything they do. The idea has
a lot of merit, but in practice, the cost of record keeping can
outweigh the benefits.
|
 | [Hunt & Thomas 1999]:
Andrew Hunt and David Thomas:
The Pragmatic Programmer.
Addison-Wesley,
1999,
020161622X.
This book is about those things that make up the difference
between typing in code that compiles, and writing software that
reliably does what it's supposed to. Topics range from gathering
requirements through design, to the mechanics of coding, testing,
and delivering a finished product. The second section, for
example, covers “The Evils of Duplication”,
“Orthogonality”, “Reversibility”,
“Tracer Bullets”, “Prototypes and Post-It
Notes”, and “Domain Languages”, and illuminates
each with plenty of examples and short exercises.
|
 | [Johnson 2000]:
Jeff Johnson:
GUI Bloopers.
Morgan Kaufmann,
2000,
1558605827.
Most books on GUI design are long on well-meaning aesthetic
principles, but short on examples of what it means to put those
principles into practice. In contrast, GUI Bloopers presents case
study after case study: what's wrong with this dialog? What should
its creators have done instead? And, most importantly, why? The
net effect is to teach all of the same principles that other books
try to, but in a grounded, understandable way.
|
 | [Kernighan & Pike 1984]:
Brian W. Kernighan and Rob Pike:
The Unix Programming Environment.
Prentice Hall,
1984,
013937681X.
I have long believed that this book is the real secret to Unix's
success. It doesn't just show readers how to use Unix—it
explains why the operating system is built that way, and
how its “lots of little tools” philosophy keeps simple
tasks simple, while making hard ones doable.
|
 | [Kernighan & Ritchie 1998]:
Brian W. Kernighan and Dennis Ritchie:
The C Programming Language.
Prentice Hall PTR,
1998,
0131103628.
The classic description of the one programming language every
serious programmer absolutely, positively has to learn.
|
 | [Knuth 1998]:
Donald E. Knuth:
The Art of Computer Programming.
Addison-Wesley,
1998,
0201485419.
The lifework of the man who invented many of the basic concepts of
algorithm analysis, these massive tomes are like Everest:
awe-inspiring, but not for the weak of heart. Most readers will
find [Sedgewick 2001] much more
approachable.
|
 | [Langtangen 2004]:
Hans P. Langtangen:
Python Scripting for Computational Science.
Springer-Verlag,
2004,
3540435085.
The book's aim is to show scientists and engineers with little
formal training in programming how Python can make their lives
better. Regular expressions, numerical arrays, persistence, the
basics of GUI and web programming, interfacing to C, C++, and
Fortran: it's all here, along with hundreds of short example
programs. Some readers may be intimidated by the book's weight,
and the dense page layout, but what really made me blink was that
I didn't find a single typo or error. It's a great achievement,
and a great resource for anyone doing scientific programming.
|
 | [Lutz & Ascher 2003]:
Mark Lutz and David Ascher:
Learning Python.
O'Reilly,
2003,
0596002815.
This is not only the best introduction to Python on the market, it
is one of the best introductions to any programming language that
I have ever read. Lutz and Ascher cover the entire core of the
language, and enough of its advanced features and libraries to
give readers a feeling for just how powerful Python is. In keeping
with the spirit of the language itself, their writing is clear,
their explanations lucid, and their examples well chosen.
|
 | [Margolis & Fisher 2002]:
Jane Margolis and Allan Fisher:
Unlocking the Clubhouse.
MIT Press,
2002,
0262133989.
This book describes a project at Carnegie-Mellon University that
tried to figure out why so few women become programmers, and what
can be done to correct the imbalance. Its first six chapters
describe the many small ways in which we are all, male and female,
conditioned to believe that computers are “boy's
things”. Sometimes it's as simple as putting the computer
in the boy's room, because “he's the one who uses it
most”. Later on, the “who needs a social life?”
atmosphere of undergraduate computer labs drives many women away
(and many men, too). The last two chapters describe what the
authors have done to remedy the situation at high schools and
university. This work proves that by being conscious of the many
things that turn women off computing, and by viewing computer
science from different angles, we can attract a broader
cross-section of society, which can only make our discipline a
better place to be. The results are impressive: female
undergraduate enrolment at CMU rose by more than a factor of four
during their work, while the proportion of women dropping out
decreased significantly.
|
 | [Martelli 2005]:
Alex Martelli, Anna Ravenscroft, and David Ascher:
Python Cookbook.
O'Reilly,
2005,
0596007973.
A useful reference for every serious Python programmer, this book
is a collection of tips and tricks, some very simple, others so
complex that they require careful line-by-line reading. The book's
companion web site is updated regularly.
|
 | [Mason 2005]:
Mike Mason:
Pragmatic Version Control Using Subversion.
Pragmatic Bookshelf,
2005,
0974514063.
Yet another book from the folks at Pragmatic, this one is
everything you'll ever need to know about Subversion, which is on
its way to becoming the version control system of choice for open
source development.
|
 | [McConnell 2004]:
Steve McConnell:
Code Complete.
Microsoft Press,
2004,
0735619670.
This classic is a handbook of do's and don'ts for working
programmers. It covers everything from how to avoid common
mistakes in C to how to set up a testing framework, how to
organize multi-platform builds, and how to coordinate the members
of a team. In short, it is everything I wished someone had told
me before I started my first full-time programming job.
|
 | [McConnell 1996]:
Steve McConnell:
Rapid Development.
Microsoft Press,
1996,
1556159005.
This book describes what it takes to develop robust code quickly,
what mistakes are often made in the name of rapid development, and
how to identify and analyze potential risks. It includes a list
of 25 best practices, and discusses things that most other books
leave out (like recovering from disasters and dealing with
impossible demands). Unlike most “how to do it
better” books, it doesn't try to sell any particular practice
or style, which adds even more weight to McConnell's carefully
balanced opinions.
|
 | [Pilgrim 2004]:
Mark Pilgrim:
Dive Into Python.
APress,
2004,
1590593561.
A good introduction to Python, which is also available on-line at
Dive Into Python.
|
 | [Prechelt 2000]:
Lutz Prechelt:
"An Empirical Comparison of Seven Programming Languages",
IEEE Computer,
vol. 33,
no. 10,
pp. 23-29,
2000.
Some hard data on the relative effectiveness of C, C++, Java,
Perl, Python, Rexx, and Tcl.
|
 | [Ray & Ray 2003]:
Deborah S. Ray and Eric J. Ray:
Unix.
Peachpit Press,
2003,
0321170105.
A gentle introduction to Unix, with many examples.
|
 | [Robinson 2005]:
Evan Robinson:
"Why Crunch Mode Doesn't Work: 6 Lessons",
http://www.igda.org/articles/erobinson_crunch.php
(viewed 2006-02-26).
An incisive summary of the effect of fatigue on human
productivity, the conclusion of which is that crunch mode winds up
making projects later.
|
 | [Rosen 2005]:
Lawrence Rosen:
Open Source Licensing: Software Freedom and Intellectual Property Law.
Prentice Hall PTR,
2005,
0131487876.
If you're involved in open source software in any way, shape, or
form, then this book is a useful read. Its author is intimately
familiar with the field; here, he lays out a general background
for discussion of intellectual property, and the history of
free/open source software, then discusses what various popular
licenses actually mean. The book closes with chapters on topics
such as how to choose a license, litigation, and standards. The
writing is clear—exceptionally so by legal
standards—and he takes time to explain terms and
assumptions that most software developers won't have encountered
before. What's more, he doesn't seem to have any particular axes
to grind: the book is US-centric, but his treatment of the various
options open to today's developers is very even-handed.
|
 | [Royce 1970]:
W. W. Royce:
"Managing the Development of Large Software Systems",
Proceedings of IEEE WESCON,
1970.
The original description of the waterfall model of software
development.
|
 | [Schneier 2003]:
Bruce Schneier:
Beyond Fear.
Springer,
2003,
0387026207.
A thought-provoking look at how we are encouraged to think about
security, and how much security is actually desirable. For
example, he explains why security systems must not just work well,
but fail well, and why secrecy often undermines security instead
of enhancing it.
|
 | [Schneier 2005]:
Bruce Schneier:
Secrets and Lies.
Wiley,
2005,
0471453803.
Having written the standard book on cryptography, Schneier now
argues that technology alone can't solve most real security
problems. The book covers systems and threats, the technologies
used to protect and intercept data, and strategies for proper
implementation of security systems. Rather than blind faith in
prevention, Schneier advocates swift detection and response to an
attack, while maintaining firewalls and other gateways to keep out
the amateurs.
|
 | [Sedgewick 2001]:
Robert Sedgewick:
Algorithms in C, Parts 1-5.
Addison-Wesley Professional,
2001,
0201756080.
Far too many programmers still think and code as if resizeable
vectors and string-to-pointer hash tables were the only data
structures ever invented. These books are a guide to all the
other conceptual tools that working programmers ought to have at
their fingertips, from sorting and searching algorithms to
different kinds of trees and graphs. The analysis isn't as deep
as that in Knuth's monumental The Art of Computer Programming,
but that makes the book far more accessible.
And while the author's use of C may seem old-fashioned in an age
of Java and C#, it does ensure that nothing magical is hidden
inside an overloaded operator or virtual method call.
|
 | [Skoudis 2004]:
Ed Skoudis:
Malware.
Prentice-Hall,
2004,
0131014056.
This 647-page tome is a survey of harmful software, from viruses
and worms through Trojan horses, root kits, and even malicious
microcode. Each threat is described and analyzed in detail, and
the author gives plenty of examples to show exactly how the attack
works, and how to block (or at least detect) it. The writing is
straightforward, and the case studies in Chapter 10 are funny
without being too cute.
|
 | [Spinellis 2006]:
Diomidis Spinellis:
Code Quality.
Addison-Wesley,
2006,
0321166078.
A companion to the same author's earlier [Spinellis 2003],
this book concentrates on what distinguishes good code from bad. The first one
was great; this one is even better.
|
 | [Spinellis 2003]:
Diomidis Spinellis:
Code Reading.
Addison-Wesley,
2003,
0201799405.
The book's preface says it best: “The reading of code is
likely to be one of the most common activities of a computing
professional, yet it is seldom taught as a subject or formally
used as a method for learning how to design and program.”
Spinellis isn't the first person to make this point, but he is the
first person I know of to do something about it. In this book, he
walks through hundreds of examples of C, C++, Java, and Perl,
drawn from dozens of Open Source projects such as Apache, NetBSD,
and Cocoon. Each example illustrates a point about how programs
are actually built. How do people represent multi-dimensional
tables in C? How do people avoid nonreentrant code in signal
handlers? How do they create packages in Java? How can you
recognize that a data structure is a graph? A hashtable? That it
might contain a race condition? And on, and on, real-world issue
after real-world issue, each one analyzed and cross-referenced.
There's also a section on additional documentation sources, and a
chapter on tools that can help you make sense of whatever you've
just inherited.
|
 | [Steele 1999]:
Guy L. Steele Jr.:
"Growing a Language",
Journal of Higher-Order and Symbolic Computation,
vol. 12,
no. 3,
pp. 221-236,
1999.
The best (and wittiest) discussion ever published of how
programming languages ought to evolve.
|
 | [Spolsky 2004]:
Joel Spolsky:
Joel on Software.
APress,
2004,
1590593898.
Joel on Software collects
some of the witty, insightful articles Spolsky has blogged over
the past few years. His observations on hiring programmers,
measuring how well a development team is doing its job, the API
wars, and other topics are always entertaining and
informative. Over the course of forty-five short chapters, he
ranges from the specific to the general and back again, tossing
out pithy observations on the commoditization of the operating
system, why you need to hire more testers, and why NIH (the
not-invented-here syndrome) isn't necessarily a bad thing.
|
 | [Thompson & Chase 2005]:
Herbert H. Thompson and Scott G. Chase:
The Software Vulnerability Guide.
Charles River Media,
2005,
1584503580.
My current favorite guide to computer security for programmers,
this book walks through each major family of security holes in
turn: faulty permission models, bad passwords, macros, dynamic
linking and loading, buffer overflow, format strings and various
injection attacks, temporary files, spoofing, and more.
|
 | [Ullman & Liyanage 2004]:
Larry Ullman and Marc Liyanage:
C Programming.
Peachpit Press,
2004,
0321287630.
A gentle introduction to C, with many examples.
|
 | [Whittaker 2003]:
James A. Whittaker:
How to Break Software.
Addison-Wesley,
2003,
0201796198.
A slim catalog of things testers can do to break software.
|
 | [Whittaker & Thompson 2004]:
James A. Whittaker and Herbert H. Thompson:
How to Break Software Security.
Addison-Wesley,
2004,
0321194330.
This practical companion to [Whittaker 2003] catalogs things you can do to
test (and break) security measures in programs.
|
 | [Williams & Kessler 2003]:
Laurie Williams and Robert Kessler:
Pair Programming Illuminated.
Addison-Wesley,
2003,
0201745763.
A combination of an instruction manual, a summary of the authors'
empirical studies of pair programming's effectiveness, and
advocacy, this book is the reference guide for anyone who wants to
introduce pair programming into their development team.
|
 | [Wilson 2005]:
Greg Wilson:
Data Crunching.
Pragmatic Bookshelf,
2005,
0974514071.
Every day, all around the world, programmers have to recycle
legacy data, translate from one vendor's proprietary format into
another's, check that configuration files are internally
consistent, and search through web logs to see how many people
have downloaded the latest release of their product. It may not be
glamorous, but knowing how to do it efficiently is essential to
being a good programmer. This book describes the most useful data
crunching techniques, explains when you should use them, and shows
how they will make your life easier.
|
 | [Zeller 2006]:
Andreas Zeller:
Why Programs Fail: A Guide to Systematic Debugging.
Morgan Kaufmann,
2006,
1558608664.
This well-written, copiously-illustrated book from the creator of
DDD (a graphical front end for the GNU debugger) is a survey of
current and next-generation debugging tools. Some are old
friends, like bug trackers and symbolic debuggers. Others are
new: there's a detailed look at the pros and cons of replay
debugging, an automatic divide-and-conquer tool that can strip
test cases down to their essentials, and a whole chapter on how
dependency analysis and program slicing can be used to isolate
faults. If, ten years from now, debuggers have taken a
much-needed leap forward, much of the credit will go to this book.
|
Glossary
A
- absolute path:
A path that refers to a
particular location in a file system. Absolute paths are usually
written with respect to the file system's root directory, and begin with
either “/” (on Unix) or “\” (on Microsoft
Windows).
See also:
relative path.
- absolute reference:
A spreadsheet cell reference that
is not automatically adjusted when a formula is moved from one
location to another. Absolute references are created by putting
"$" in front of the row and/or column designation, as in
$C$4.
See also:
relative reference.
- abstract data type (ADT):
A specification of a set of values, and the operations that can
be performed on them. The term “abstract” means that
the implementation of the ADT is hidden from other code.
- abstract syntax tree (AST):
A data structure that represents the structure of a program or
program fragment. Its leaves are literals, such as numbers and
variable names, while its internal nodes represent higher-level
structures, such as loops and expressions.
- access control:
A way to specify who has permission to view, edit, delete, run,
or otherwise interact with something, by explicitly listing what
rights each individual or group has. This is in contrast with the
standard Unix authorization mechanism, which only
allows a fixed set of privileges to be listed for owner, one
group, and everyone else.
- access control list (ACL):
A list that explicitly describes who can do what to a file,
directory, or other entity. ACLs permit finer control over a
computer's resources than Unix's classic user/group/all system,
but are more complicated to administer.
- ACID:
An acronym for atomic, consistent, isolated, and durable, which
are the properties that a database
transaction must guarantee.
- acquire a lock:
To claim a lock in order to establish
exclusive access to some resource.
See also:
release a lock.
- action:
The steps a build tool must take to
bring a file or other object up to date.
See also:
dependency,
prerequisite,
target.
- actual outcome:
The actual result of a unit test.
If this matches the expected
outcome, the test passes.
- aggregate:
To create a single value by combining multiple values, e.g. by
adding or averaging.
- algorithmic complexity:
The rate at which the work performed by an algorithm grows as a
function of problem size, ignoring constant factors. Algorithmic
complexity is usually expressed using O-notation; for
example, the time required to compare each value in a list to each
other value is O(N²).
- alias:
A second (or subsequent) reference to a single piece of data.
Aliasing can make programs more difficult to understand, since
changes made through one reference “magically” affect
the other.
- analysis and estimation (A&E):
The step in a software development process in which developers
figure out how they're going to implement the desired features,
and how long they expect it will take. The term is also applied
to the summary documents this process produces.
- anchor:
An element of a regular
expression that matches a location, rather than a sequence
of characters.
«^» matches the beginning of a line,
«\b» matches the break between word and non-word
characters, and «$» matches the end of a line.
- Application Binary Interface (ABI):
The calling conventions, data structures, and other interface
elements that compiled code exposes to other programs.
See also:
Application Programming Interface (API).
- Application Programming Interface (API):
The source-level external interface that a library or operating
system provides for other programs to use.
See also:
Application Binary Interface (ABI).
- arc:
A connection between two nodes in a
graph. Arcs may be directed (i.e.,
unidirectional) or undirected (i.e., bidirectional).
- assertion:
An expression which is supposed to be true at a particular
point in a program. Programmers typically put assertions in their
code to check for errors; if the assertion fails (i.e., if the
expression evaluates as false), the program halts and produces an
error message.
- asymmetric cipher:
A cipher which has two keys, each of which undoes the other's effects.
See also:
symmetric cipher.
- atomic:
Not interruptible. An atomic operation is one that always takes
effect as a whole, no matter what else the system is doing.
- attribute:
An extra property added to an XML element. Attributes are represented as
name/value pairs; a given name may appear at most once for any
particular element.
- authentication:
The act of establishing someone's identity. This is almost
always done by requiring them to produce some credentials, such as
a password.
See also:
authorization,
access control.
- authorization:
The part of a computer security system that keeps track of
who's allowed to do what.
See also:
authentication,
access control.
- automatic variable:
In
Make, a variable whose value is automatically
redefined for each rule. Automatic variables include $@,
which holds the rule's target, and
$^, which holds its prerequisites. Automatic variables are
typically used in pattern rules.
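As a small illustration of the alias entry in this section, the sketch below binds two names to one Python list and shows a change made through one name appearing through the other. The variable names are made up for the example.

```python
# Two names bound to the same list: 'second' is an alias for 'first'.
first = [1, 2, 3]
second = first            # no copy is made here
second.append(4)          # a change made through the alias...
print(first)              # ...is visible through the original name

# Making an explicit copy breaks the aliasing.
independent = list(first)
independent.append(5)
print(first)              # unchanged by the append to the copy
```

Copying is the usual cure when aliasing is not wanted, at the cost of extra memory.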
B
- basic authentication:
A simple username/password authentication mechanism that is part
of the HTTP standard. It sends passwords as cleartext (actually,
as base-64 encoded text), so it should never be used.
- Big Design Up Front (BDUF):
A somewhat pejorative term applied to development processes that
rely on careful up-front analysis and design to prevent errors
from occurring.
- big-endian:
Having the most significant byte in the memory location with
the lowest address. In a big-endian system, the integer
0x12345678 is stored as [0x12, 0x34, 0x56, 0x78].
See also:
little-endian.
- binary data:
Non-textual data. All data is “binary”, in the
sense that it's represented as 1's and 0's, but many tools
distinguish between 1's and 0's that represent printable
characters, and 1's and 0's that don't.
- binary mode:
On Windows, Python (and some other programming languages) automatically
converts Windows-style line endings (carriage return followed by
newline) to Unix-style line endings (newline only) when reading
files, and converts them back when writing. This is appropriate for
textual data, but not for binary data, such as images.
If the file is in binary mode, this conversion is not done.
- binary search:
A search technique which divides the values being searched in
half at each step, just as a person would go to the middle of a
phonebook, then the middle of either the upper or lower half, and
so on when looking for a name. The algorithmic complexity of
binary search is O(log₂ N).
- bitwise operations:
Operations that act at the level of the bits making up a value,
rather than on what those bits mean. The four most common bitwise
operations are and, or, xor, and not.
- blacklist:
A list of addresses from which email will not be accepted, which
is part of an “allow unless forbidden” authorization policy.
See also:
whitelist.
- blog:
Short for weblog; an on-line diary or
forum to which authors append new content. Unlike mailing lists, blogs use a publish-subscribe model: readers
pull content when they want it, rather than having it sent to
them.
- boilerplate:
The standardized parts of a family of programs that don't change
from instance to instance.
- branch:
A separate line of development managed by a version control system.
Branches help projects manage incompatible sets of changes that
are being made concurrently.
See also:
merge.
- breakpoint:
A marker put in a program by a debugger that causes it to pause so that the
program's internal state can be inspected (and possibly
modified).
- buffer:
A block of memory used to store values temporarily in order to
“smooth out” communication.
- buffer overflow attack:
A method used to attack programs (primarily those written in C and
C++) that injects code by writing past the end of a buffer.
- bug tracker:
See issue tracker.
- build tool:
A piece of software, such as
Make, whose main
purpose is to rebuild software, documentation, web sites, and
other things after changes have been made.
- burn rate:
The rate at which project tasks are actually being completed.
Comparing a project's actual burn rate with its schedule tells
developers when to start scaling back their plans, and/or moving
their deadlines.
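The binary search entry in this section can be sketched in a few lines of Python. This is a minimal version that assumes the input list is already sorted; the function name and the sample data are invented for illustration.

```python
def binary_search(values, target):
    """Return the index of target in the sorted list values, or -1 if absent."""
    low, high = 0, len(values) - 1
    while low <= high:
        middle = (low + high) // 2
        if values[middle] == target:
            return middle
        elif values[middle] < target:
            low = middle + 1      # target can only be in the upper half
        else:
            high = middle - 1     # target can only be in the lower half
    return -1

names = ["Ada", "Alan", "Edsger", "Grace", "Linus"]   # already sorted
print(binary_search(names, "Grace"))
print(binary_search(names, "Guido"))
```

Each pass through the loop halves the range still being searched, which is where the logarithmic running time comes from.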
C
- cache:
A data structure, or portion of a disk, that stores temporary
copies of values. Caches are normally used when fetching items is
expensive: keeping copies of values that are likely to be needed
again close at hand can make a program much faster, at the cost of
requiring extra synchronization effort.
- call stack:
A data structure used to keep track of functions that are
currently being executed. Each time a function is called, a new
stack frame is put on the top of
the stack to hold that function's local variables. When the
function returns, the stack frame is discarded.
See also:
heap,
static space.
- camel case:
Text that is formatted with InternalCapitalLetters.
- catch exception:
To handle an exception.
See also:
raise exception.
- cell range:
An expression specifying a contiguous block of cells in a spreadsheet. The cell range
C4:E5, for example, includes all the cells in the rectangle
bounded by C4 (upper left corner) and E5 (lower right corner).
- chain:
A sequence of method calls, each of which uses the result of
the previous one, as in
"x".upper().center(5).
- changeset:
A set of changes to files committed
to a repository in a single
operation.
- checkpoint:
To save the state of a program so that it can be restarted later
(for example, in case of a computer crash). The saved state is
also called a checkpoint.
See also:
persistence.
- child class:
In object-oriented programming, a new class derived from an
existing one (called the parent
class).
See also:
inheritance.
- chunk:
A group of objects that are stored together in short-term
memory, such as the seven digits in a North American phone
number.
- cipher:
An algorithm used to encrypt and
decrypt data.
- ciphertext:
The encrypted form of a message.
Ciphertext is usually produced from plaintext by a combination of a cipher algorithm and a key.
- class:
A definition that specifies the properties of a set of objects.
- class browser:
A tool that shows an outline view of the classes making up a
software project, their methods, and their inheritance relations;
usually part of an integrated development
environment.
- client:
A software application that accesses data over a network. The
provider is called a server.
- client/server architecture:
An asymmetric system in which many clients communicate with a single centralized
server.
- code review:
The act of inspecting code to find errors, violations of style
guidelines, etc. While labor intensive, code reviews are a good
way to transfer knowledge between project members, and can be more
effective at finding bugs than testing.
- collision:
A situation in which two or more values are mapped to the same
location by a hash function.
Hash tables typically handle this by
storing colliding values in a sublist.
- conditional breakpoint:
A breakpoint that only causes the
program to pause under certain conditions. For example, a programmer may tell the debugger that the program is to
pause only when a certain function parameter is an empty string,
or when a loop index is greater than a specified value.
- comma-separated values:
A format for representing tabular data. Each row in the table
is represented by a line of text; the values in that row are
separated by commas.
- command-line flag:
A terse way to specify an option or setting to a command-line
program. By convention, Unix applications use a dash followed by
a single letter, such as
-v, or two dashes followed by a
word, such as --verbose, while DOS applications use a
slash, such as /V. Depending on the application, a flag
may be followed by a single argument, as in -o
/tmp/output.txt.
- commit:
To send changes from a working
copy to a version control system's
repository to create a new
revision of the affected
file(s). Changes must be committed in order for other users to
see them.
See also:
update.
- conflict:
A change made by one user of a version control system
that is incompatible with changes made by other users. Helping
users resolve conflicts is one
of the version control
system's major tasks.
- conflict marker:
A string such as
"<<<<<<",
"======", or ">>>>>>" put into a local
copy of a file by a version
control system to indicate where local changes overlap
with incompatible changes made by someone else. The version control system
will typically not allow the user to commit changes until all conflicts
have been resolved.
- Common Gateway Interface (CGI):
A protocol for communication between web
servers and external programs. The web server passes data
to the external program through environment variables and standard input, and reads the data to
be sent back to the client from the
external program's standard
output.
- component object model:
A software architecture that specifies how components communicate
with each other, without specifying how they are implemented.
Microsoft's COM is the most widely used desktop example; modern
web services are increasingly emulating its most important
features.
- concurrency:
The situation in which two or more things are going on at once.
See also:
serialization.
- connection:
A communication channel between a program and a database.
- constructor:
A special method called when creating a new instance of a class
that initializes the instance's state.
- cookie:
A short piece of text created by a server and passed to a client, which the client can later return in order to identify
itself. Cookies were invented to get around the fact that Hypertext Transfer Protocol (HTTP) is a stateless
protocol.
- core dump:
A file containing a byte-for-byte representation of the contents
of a program's memory. On some operating systems, programs produce
core dumps whenever they terminate abnormally (e.g., try to divide
by zero, or access memory that is out of bounds). Core dumps are
often used as the basis for post
mortem debugging.
- CRC cards:
A design aid used in object-oriented
programming, in which the responsibilities and
collaborators of each class are written out on a 3×5 index
card [Beck & Cunningham 1989].
- cross product:
A pairing of all elements of one set with all elements of another.
The cross product of two N-element vectors L and
R is an N×N matrix M, in which
M_{i,j} = L_i R_j.
- Cascading Style Sheets (CSS):
A language used to describe how HTML pages should be formatted for
display.
- current working directory:
The directory that relative
paths are calculated from; equivalently, the place
where files referenced by name only are searched for. Every
process has a current working
directory. The current working directory is usually referred to
using the shorthand notation
. (pronounced
“dot”).
- cursor:
A pointer into a database that keeps track of outstanding
transactions and other operations.
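The connection and cursor entries in this section can be seen together in a short sketch that uses Python's standard sqlite3 module with an in-memory database. The table name, column names, and values are hypothetical.

```python
import sqlite3

# Open a connection to a throwaway in-memory database.
connection = sqlite3.connect(":memory:")

# A cursor is the object through which statements are run and results read.
cursor = connection.cursor()
cursor.execute("CREATE TABLE readings (site TEXT, value REAL)")
cursor.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("north", 1.5), ("south", 2.5)],
)
connection.commit()   # make the inserted rows permanent

cursor.execute("SELECT site, value FROM readings ORDER BY site")
rows = cursor.fetchall()
print(rows)

cursor.close()
connection.close()
```

Other database drivers for Python follow broadly the same connect, cursor, execute, fetch pattern, though connection parameters differ.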
D
- data scrubbing:
Reformatting, rescaling, or otherwise cleaning up data to make it
easier to process.
- database column:
A set of data values of a particular type, one for each row in the
table.
See also:
database row.
- database row:
A set of related values making up a single entry in a database table.
See also:
database column,
record.
- database table:
A set of values in a relational
database that are organized into columns and rows.
- database management system (DBMS):
A software package that manages access to a relational database.
- dead code:
A block of code which can never be reached, such as the body of
an
if statement whose conditional expression is always
false. Dead code often occurs when programmers modifying an
inherited program leave something in because they're not sure it's
safe to take out.
- deadlock:
Any situation in which no one can proceed unless someone else
does first (analogous to having two locked boxes, each of which
holds the key to the other).
See also:
race condition.
- debuggee:
See target program.
- debugger:
A computer program that is used to control and inspect another
program (called the target
program). Most debuggers are symbolic debuggers
that show the target program's state in terms of the variables
that the programmer created, rather than showing the raw contents
of memory.
- declarative:
A programming system in which the relationships between values are
stated, rather than the algorithm used to compute or update those
values. Most widely-used programming languages are imperative, but build tools like
Make,
spreadsheets, and some high-level
programming languages, are declarative.
- decorator:
An advanced programming construct in Python that allows one
function to wrap or modify another.
- decryption:
The process of translating encrypted
ciphertext back into the original
plaintext.
See also:
cipher,
key.
- default target:
The target that a build
system will try to bring up to date if no other target is
specified.
- defensive programming:
The practice of checking input values, invariants, and other aspects of a program
in order to catch errors as early as possible.
- denial of service (DoS):
An attack designed to overwhelm a system so that it cannot service
legitimate requests. DoS does not destroy data, or reveal
secrets, but making a site or service unavailable can be just as
damaging.
- dependency:
In a build system, a file whose state some other file depends
on. If any of a file's dependencies are newer than the file
itself, the file must be updated. A file's dependencies are also
called its prerequisites.
See also:
action,
target.
- derive:
To create one class from another using inheritance.
- design by contract:
A design methodology in which programmers define checkable
interface specifications using pre-conditions, post-conditions, and invariants.
- design pattern:
A standard solution to a commonly-occurring problem.
- dictionary:
A mutable unordered collection that pairs each key with a single value. Dictionaries are also
known as maps, hashes, or associative arrays, and are typically
implemented using hash tables.
- digital signature:
A block of data attached to a message to prove that the message
was created by a particular person, and has not been tampered
with. Digital signatures are usually created by encrypting a message digest with the private key of an asymmetric cipher.
- directed graph:
A graph whose arcs have a direction, i.e., if an arc connects two nodes A and B, then it is possible to reach
B from A, but not necessarily possible to reach A from B.
Directed graphs are often used to visualize dependencies in build systems.
- directory tree:
File system directories are normally organized hierarchically:
each directory except the root has a single parent, and each
may have zero or more children. This means that directories may
be viewed as a tree. Since files may not contain directories or
other files, they are always leaf nodes of this tree.
- Domain Name System (DNS):
A system which maps human-readable names, such as
"pyre.third-bit.com", to numeric Internet
Protocol addresses, such as
"128.100.171.16".
- docstring:
Short for “documentation string”, this refers to
textual documentation embedded in Python programs. Unlike
comments, docstrings are preserved in the running program, and can
be examined in interactive sessions.
- document:
A well-formed instance of XML. Documents
can be represented as trees (using DOM), stored as files on disk,
etc.
- Document Object Model (DOM):
A cross-language standard for representing XML documents as
trees.
- drive:
A disk drive is a piece of computer hardware used to store data
on a rotating disk. In older operating systems, each drive was a
separate file system;
modern versions of Microsoft Windows still use this notion, placing
one or more file systems on
each physical drive, and giving each a separate drive letter (such
as the familiar
C:).
- driver:
A software module designed to communicate with an external device
or software package. A device driver is a piece of software that
can control a piece of hardware; a database driver is one that
knows how to open connections and send commands to a particular
database manager.
- duck typing:
An informal name for dynamic type systems that rely on
objects being able to do the same things, rather than on inheritance or formal specification of
properties. The term comes from the saying, “If it walks
like a duck, and quacks like a duck, it's a duck.”
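A minimal sketch of duck typing in Python follows. The two classes are unrelated by inheritance, yet the same function accepts both because each provides the method it needs. All class and function names are invented for the example.

```python
class Duck(object):
    def speak(self):
        return "Quack"

class Robot(object):
    # Not a subclass of Duck, but it can do the same thing.
    def speak(self):
        return "Beep"

def make_it_speak(thing):
    # No type check: anything with a speak() method will do.
    return thing.speak()

for thing in (Duck(), Robot()):
    print(make_it_speak(thing))
```

Passing an object without a speak method would fail only when the call is made, with an AttributeError.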
E
- element:
A named item in an XML document, which has a unique parent, and may
contain attributes, text, and other
elements.
See also:
tag (in XML).
- elevator pitch:
Another name for a vision
statement.
- embed:
To place code written in one programming language inside code
written in another.
- encapsulation:
The practice of hiding the implementation details of a class or
module; one of the three defining principles of object-oriented
programming.
See also:
inheritance,
polymorphism.
- encryption:
The process of translating plaintext
that anyone can understand into ciphertext that can only be understood by
someone possessing the correct cipher
and key.
- environment variable:
A named value associated with a running process by the
operating system. Typical environment variables include
HOME (the user's home directory) and PWD (the
process's present working directory). Environment variables are
typically used to specify things that many applications may want
to know, or to provide default configuration values.
- epoch:
The moment from which times are measured. On Unix, the epoch
is midnight, January 1, 1970; on Windows, the epoch is January 1,
1601 (further proof that Microsoft takes backward compatibility
very seriously).
- escape sequence:
A sequence of characters that represents some other character
or special entity.
"\t" and "\n" are escape sequences
in normal Python strings that represent tab and newline characters
respectively; "&lt;" and "&amp;" are escape
sequences in HTML and XML that represent the less-than sign and
ampersand.
- event-driven programming:
A style of programming in which a framework triggers events in the user's
program. Event-driven programming is used by most graphical user
interfaces, and by CGI programs.
- event log:
A chronological list of recent events in a project, such as repository updates, changes to tickets, messages to mailing lists, wiki page edits, and so on. A project's event
log is often provided as a blog.
- exception:
An object that represents an error condition. As a program
executes, it creates a stack of exception handlers. When an
exception is raised, the
program searches this stack for the top-most handler, which catches and handles the exception.
Exceptions typically contain information such as the file and line
where the error occurred, the type of the error, and an error
message.
- exception handler:
A block of code that deals with the error signaled by an exception.
See also:
catch exception,
raise exception.
- expected outcome:
The outcome a test must produce in order to pass. If the actual outcome is different, the test
fails.
- exponent:
The power by which the mantissa in
a floating-point number is multiplied. The exponent in
2.7×10³ is 3.
- Extreme Programming:
A programming methodology which emphasizes quick reaction over
forward planning. Its most widely-known components are probably
pair programming and
relentless refactoring.
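The exception, raise exception, and catch exception entries in this section can be tied together with a short Python sketch. The function names and the error message are made up for illustration.

```python
def invert(value):
    if value == 0:
        # Raise an exception object that describes the error.
        raise ValueError("cannot invert zero")
    return 1.0 / value

def invert_all(values):
    results = []
    for value in values:
        try:
            results.append(invert(value))
        except ValueError as error:
            # The exception handler: catch the error and record its message.
            results.append(str(error))
    return results

print(invert_all([2, 0, 4]))
```

Because the handler sits inside the loop, one bad value does not stop the remaining values from being processed.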
F
- feature creep:
Changes in the aims or scope of a project over time. The usual
result is that everyone spends so much time rewriting code that
the project never moves closer to completion.
- file system:
A set of files, directories, and I/O devices (such as
keyboards, screens, printers, and so on). A file system may be
spread across many physical devices, or many file systems may be
stored on a single physical device. The operating system will only allow
some file operations (such as copying, or creating symbolic links
or shortcuts) within a file system.
- filename extension:
The portion of a file's name that comes after the final
“.” character. By convention, this identifies the
file's type:
.txt means “text file”,
.png means “Portable Network Graphics file”,
and so on. These conventions are not enforced by most
operating systems: it is perfectly possible to name an MP3 sound
file homepage.html. Since many applications use filename
extensions to identify the MIME
type of the file, misnaming files may cause those
applications to fail.
- filter:
A program that transforms a stream of data. Many Unix
command-line tools are written as filters: they read data from
standard input, process
it, and write the result to standard output. Image
processing applications are often constructed by connecting
filters to one another.
- finite state machine (FSM):
A mathematical model of computation consisting of a finite
number of discrete states connected by transitions. FSMs are often
visualized as directed graphs, in
which the nodes are states, and the arcs are transitions.
See also:
regular expression (RE).
- fixture:
The particular configuration of a system that is the subject of
a unit test. It is a good practice
to create a fresh fixture for each test, so that the actions and
outcomes of early tests cannot affect later ones.
- foreign key:
One or more values in a database
table that identify a row
in another table.
- form:
A web page that allows users to enter data.
See also:
Common Gateway Interface (CGI),
Hypertext Transfer Protocol (HTTP).
- framework:
A library, or set of libraries, that implements the generic
parts of a family of applications. Developers customize the
framework for a particular application by replacing generic
placeholders with more specific code.
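One common way to write down the finite state machine described in this section is as a table of transitions. The sketch below models a hypothetical turnstile with two states and two kinds of event; the states, events, and function name are all assumptions made for the example.

```python
# Each (state, event) pair maps to the next state.
TRANSITIONS = {
    ("locked", "coin"): "unlocked",
    ("locked", "push"): "locked",
    ("unlocked", "coin"): "unlocked",
    ("unlocked", "push"): "locked",
}

def run(events, state="locked"):
    """Feed a sequence of events to the machine and return its final state."""
    for event in events:
        state = TRANSITIONS[(state, event)]
    return state

print(run(["coin", "push", "push"]))
```

Drawn as a directed graph, the two states would be the nodes and each dictionary entry would be one arc.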
G
- garbage collection:
Automatically reclaiming objects in memory when they are no longer
being used.
See also:
reference counting.
- gold plating:
Adding more features to the system than it needs, or making parts
of it much more elaborate than is required.
- graph:
A mathematical structure that consists of nodes connected by arcs.
Graphs may be directed (i.e., the arcs are unidirectional) or
undirected (i.e., the arcs are bidirectional), and are used to
represent everything from program structure to bus routes.
- greedy matching:
In a regular expression,
the policy of matching as much as possible, as early as
possible.
See also:
reluctant matching.
- group:
A sub-match in a regular
expression.
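The greedy matching and group entries in this section can be tried out with Python's standard re module. The sample strings below are invented; the "?" after a quantifier is how Python spells reluctant matching.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: ".+" grabs as much as it can, so the match runs to the last ">".
greedy = re.search(r"<.+>", text).group(0)

# Reluctant: ".+?" stops at the first ">" that lets the pattern succeed.
reluctant = re.search(r"<.+?>", text).group(0)

# Groups: each parenthesized sub-pattern captures its own sub-match.
match = re.search(r"(\d+)-(\d+)", "pp. 221-236")
first_page, last_page = match.group(1), match.group(2)

print(greedy)
print(reluctant)
print(first_page, last_page)
```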
H
- hash code:
The output of a hash
function. Typically, a hash code is a seemingly-random
integer, which is then used to determine where to put or look for an
object in a hash table.
- hash function:
A function which takes an object as its input, and produces an
integer value as its output. Good hash functions produce outputs
that are as random as possible, i.e., they have the property that
different inputs are likely to produce different outputs.
- hash table:
A data structure which allows programs to look up objects by
value, rather than by location. Hash tables do this by using
a hash function to calculate
seemingly-random identifiers for values, and using those as
indices into an array. Under normal conditions, it takes
constant time to find a value in a hash table.
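The idea can be sketched with a toy hash table in Python. This is an illustration only, not how Python's built-in dict is implemented: the class name ChainedHashTable, the bucket count, and the choice of chaining to handle collisions are all assumptions made for the example.

```python
class ChainedHashTable:
    """Toy hash table that resolves collisions by chaining."""

    def __init__(self, num_buckets=8):
        # The array that hash codes index into; each slot holds a list of pairs.
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket_for(self, key):
        # Turn the key's hash code into an index into the array.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket_for(key)
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # replace the old value
                return
        bucket.append((key, value))

    def get(self, key):
        for existing_key, value in self._bucket_for(key):
            if existing_key == key:
                return value
        raise KeyError(key)


table = ChainedHashTable()
table.put("carbon", 6)
table.put("oxygen", 8)
table.put("carbon", 12)  # overwrites the earlier entry
print(table.get("carbon"), table.get("oxygen"))
```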
- heap:
An area of memory out of which a program can dynamically
allocate blocks of various sizes in order to store values.
See also:
call stack,
static space.
- heisenbug:
A bug that hides when you are looking for it. Heisenbugs can arise in
sequential programs (for example, adding a
printf call to a
C program may move things around in memory so that the bug is no
longer triggered), but are much more common in concurrent programs.
- hexadecimal:
A base-16 numeric representation in which the letters A-F (or
a-f) are used to represent the “digits” 10-15. The
decimal integer 61 is 3D in hexadecimal:
3×16^1 + D(=13)×16^0.
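The same conversion can be checked in Python with the built-in int, hex, and format functions. The variable names are chosen for the example.

```python
# Parse the hexadecimal string "3D" as a base-16 number.
value = int("3D", 16)

# The same calculation done by hand: 3 sixteens plus 13 ones.
by_hand = 3 * 16**1 + 13 * 16**0

print(value, by_hand)
print(hex(61), format(61, "X"))
```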
- hijack:
To take control of a connection between a user and a web
application after the user has authenticated, e.g., to impersonate
a user after he or she logs in.
- host address:
A computer's Internet address.
- Hypertext Transfer Protocol (HTTP):
A set of rules for exchanging data (especially files) on the World
Wide Web.
- HTTP header:
A name/value pair at the start of an HTTP request or response.
Unlike dictionary keys, names are not required to be unique.
I
- idiom:
A manner of expression commonly used by native speakers of a
language. A programming language's idioms are the ways that most
programmers habitually express their ideas.
- immutable:
Unchangeable. The value of immutable data cannot be altered
after it has been created.
See also:
mutable.
- imperative:
A programming system in which the steps taken to calculate values
are specified explicitly. All widely-used programming languages
are imperative.
See also:
declarative.
- in-place operator:
An operator such as
+= that provides a shorthand
notation for the common case in which the variable being assigned
to is also an operand on the right hand side of the assignment.
The statement x += 3 means the same thing as x = x +
3.
- inheritance:
In object-oriented programming, the practice of defining a new
class as an extension or specialization
of an existing one.
See also:
encapsulation,
polymorphism.
- inner join:
A join in which rows are combined only
where values in corresponding columns satisfy some condition
(usually equality).
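A small sketch of an inner join, using Python's standard sqlite3 module with an in-memory database. The tables person and visit, and their contents, are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE visit (person_id INTEGER, site TEXT);
    INSERT INTO person VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Alan');
    INSERT INTO visit VALUES (1, 'DR-1'), (1, 'DR-3'), (2, 'MSK-4');
""")

# Rows are combined only where person.id equals visit.person_id,
# so Alan (who has no visits) does not appear in the result.
rows = conn.execute("""
    SELECT person.name, visit.site
    FROM person INNER JOIN visit ON person.id = visit.person_id
    ORDER BY person.name, visit.site
""").fetchall()
conn.close()

print(rows)
```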
- invariant:
An expression whose value doesn't change during the execution
of a program. For example, an invariant property of a loop
indexed by a variable
i might be that the value of the
variable M is always greater than or equal to the values of
the array elements whose indices are less than i.
See also:
pre-condition,
post-condition.
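The loop invariant described above can be written down as an assertion. In this sketch the function name largest is hypothetical; M and i follow the names used in the definition.

```python
def largest(values):
    """Return the largest element of a non-empty list."""
    M = values[0]
    for i in range(1, len(values)):
        # Invariant: M is >= every element whose index is less than i.
        assert all(M >= values[k] for k in range(i))
        if values[i] > M:
            M = values[i]
    return M


print(largest([3, 1, 4, 1, 5, 9, 2, 6]))
```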
- instance:
An object created from a specific
class is called an instance of that
class.
- instruction pointer:
A register that points at either
the instruction the program is currently executing, or the one
that it is to execute next (depending on the computer). When a
function is called, the instruction pointer's value is copied onto
the call stack, along with the
values of the function's parameters, so that the program can
return to the point of the call when the function finishes.
- Integrated Development Environment (IDE):
A program that combines several software development tools into
one. Typically, an IDE contains a “smart” editor
(that automatically indents and colorizes code), a build system
(for languages that need to be compiled), a class browser, a debugger, and a GUI designer.
- integration test:
A test that checks whether the parts of a program work
together.
See also:
unit test.
- Internet Protocol (IP):
A family of communication protocols, the most widely used of which
are UDP and TCP.
- invert:
To invert a dictionary is to
swap its keys and values; in mathematical terms, this is the same
as inverting the discrete function that the dictionary
represents. Any inversion algorithm must deal with the fact that
values are not guaranteed to be unique.
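One way to handle non-unique values is to map each value to a list of all the keys that had it. This is a sketch of that choice, not the only possible one; the function name invert and the sample data are assumptions.

```python
def invert(mapping):
    """Swap keys and values, collecting keys that share a value in a list."""
    inverted = {}
    for key, value in mapping.items():
        inverted.setdefault(value, []).append(key)
    return inverted


atomic_numbers = {"hydrogen": 1, "deuterium": 1, "helium": 2}
print(invert(atomic_numbers))
```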
- issue tracker:
A tool that keeps track of a project's outstanding work items, or
tickets; a to-do list for the project.
Issue trackers are sometimes called bug
trackers, since many of the items they record are bugs.
J
- join:
A database operation that combines values from two or more tables.
See also:
inner join.
- Java Server Page (JSP):
A Java-based template system, in which
programmers mix HTML and Java code in a single file. The file is
automatically translated to create a pure Java program that prints
pure HTML.
K
- key:
The data that is used to index a particular entry in a dictionary. In a phone book, for example,
people's names are keys.
L
- Liskov Substitution Principle:
The principle that it should be possible to use an instance of
a child class anywhere that
instances of any of its parent
classes can be used.
See also:
inheritance,
polymorphism,
post-condition,
pre-condition.
- literate programming:
The practice of writing computer programs using a mix of
natural language, mathematics, and code, in order to make them
easier for human beings to read. Tools are used to translate
literate programs into code (for compilation and execution) and
documentation (for human consumption).
- little-endian:
Having the least significant byte in the memory location with
the lowest address. In a little-endian system, the integer
0x12345678 is stored as [0x78, 0x56, 0x34, 0x12].
See also:
big-endian.
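Byte order can be inspected from Python with the standard struct module, where the format prefix "<" requests little-endian and ">" requests big-endian. The variable names are chosen for the example.

```python
import struct

value = 0x12345678

# Pack the same 4-byte unsigned integer in both byte orders.
little = struct.pack("<I", value)
big = struct.pack(">I", value)

print([hex(b) for b in little])  # least significant byte first
print([hex(b) for b in big])     # most significant byte first
```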
- lock:
A mechanism used to control access to resources in concurrent systems. If a process A tries to acquire a lock held by some other process B, A is forced to wait until B
releases it.
- logging:
The act of recording program events in a systematic way so that
they can be examined later; a morally-defensible refinement of the
practice of using
print statements to debug.
- long integer:
An integer whose value takes up as many words of computer
memory as necessary. Most programming languages use 32 bits to
represent integers, which permits values in the range
-2^31…2^31-1 (or -2147483648 to
2147483647). In contrast, a language will allocate as many words
of computer memory to a long integer's value as that value needs.
The advantage is that very large values can be represented and
manipulated; the disadvantage is that operating on such values is
much slower than operating on native ones.
- lookup table:
In a spreadsheet, a pair of rows or
columns in which the first is used to select a value from the
second.
M
- macro:
A variable in a Makefile.
- mailing list:
A set of addresses used to send email to many recipients at once.
Most mailing lists keep a searchable archive of past messages.
See also:
blacklist,
whitelist.
- Makefile:
A configuration file for
Make that describes what
depends on what, and how to bring
things up to date.
- mantissa:
The fractional part of a floating-point number. The mantissa
in 2.7×10^3 is 2.7.
- match object:
The object returned after a successful match by a regular expression that contains
information about which parts of the text were matched by which
parts of the RE.
- member:
A variable contained within an object
that stores part of the object's state.
- merge:
To combine the contents of two or more versions of a file in order
to resolve overlapping edits; also, to
combine material from two or more branches.
- message digest:
A fixed-length summary of a message whose value appears to be
random, so that it is practically impossible to construct a
message with a specific digest. Digital signatures are usually
created by encrypting message
digests using asymmetric
ciphers.
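A brief sketch using Python's standard hashlib module. SHA-256 is picked here only as a familiar example of a digest algorithm, and the messages are made up.

```python
import hashlib

# Two messages that differ in a single character.
digest_a = hashlib.sha256(b"open the vault").hexdigest()
digest_b = hashlib.sha256(b"open the vaulT").hexdigest()

# Both digests have the same fixed length, but their values are unrelated.
print(len(digest_a), digest_a)
print(len(digest_b), digest_b)
```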
- metadata:
Literally, “data about data”, i.e., data such as a
format descriptor, which describes other data.
- method:
In object-oriented programming, a function which is tied to a
particular object. Typically, each of
an object's methods implements one of the things it can do, or one
of the questions it can answer.
- milestone:
A date by which some work has to be completed. Milestones are
usually given symbolic names, such as “First Beta
Release”, to accommodate date changes.
- module:
A set of functions and variables that are grouped together to
make them more manageable. In Python, every source file is
automatically a module; in other languages, source files may
contain many modules, or a single module may span several
files.
- multi-valued assignment:
An assignment statement which changes several values at once.
For example,
a,b = 2,3 sets a to 2 and b to
3, while a,b = b,a swaps those variables' values.
- Multipurpose Internet Mail Extensions (MIME):
An Internet standard for the format of email that also
specifies which filename suffixes should be used to identify
particular types of content (such as
".png" for a PNG-format
image).
- mock object:
A stand-in for a real object that mimics behavior using a fixed
set of preprogrammed responses. Mock objects are used in testing
in order to isolate components, and/or improve performance.
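A minimal sketch using Mock from Python's standard unittest.mock module. The function average_temperature and the sensor's read method are hypothetical, invented to give the mock something to stand in for.

```python
from unittest.mock import Mock


def average_temperature(sensor, samples):
    """Average several readings from an object with a read() method."""
    return sum(sensor.read() for _ in range(samples)) / samples


# The mock replaces a real (slow or unavailable) sensor with
# a fixed sequence of preprogrammed responses.
fake_sensor = Mock()
fake_sensor.read.side_effect = [20.0, 22.0, 24.0]

result = average_temperature(fake_sensor, 3)
print(result, fake_sensor.read.call_count)
```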
- mutable:
Changeable. The value of mutable data can be updated in
place.
See also:
immutable.
N
- nested query:
A query whose results are used as input
by some other query.
- nimble language:
A language designed to facilitate rapid development, rather
than high performance or static safety checks. Nimble languages
are often called “scripting” or “agile”
languages, and include Python, Perl, Ruby, Tcl, Rexx, and
Scheme.
See also:
sturdy language.
- node:
An element in a graph that may be
connected to other nodes by arcs.
- normal form:
One of the conditions a database must satisfy to conform with best
practices.
- normalize:
To make a database satisfy widely-used normal forms.
O
- object:
A combination of data and functions (called methods) that are meant to work together.
In most programming languages, objects are instances of classes; each object represents one
“thing” that the program can operate on.
- object-oriented programming:
A way to structure programs as collections of objects that invoke one another's methods.
See also:
class,
procedural programming.
- object-relational mapping:
A persistence strategy that stores
objects in database tables, then translates the rows in those
tables back into objects as necessary.
- operating system:
The software responsible for managing a computer's hardware and
other processes. Operating systems are also responsible for
making different computers present the same interface to other
programs, so that applications like word processors and compilers
don't have to be re-written each time a new generation of chips
comes out. Popular desktop operating systems include Microsoft
Windows, Linux, and Mac OS X.
- operator overloading:
Redefining the behavior of a built-in operator, such as
+,
by overriding a specially-named
method. C++ and Python permit it; Java does not.
- optimistic concurrency:
Any scheme in which different processes are allowed to make
changes that may prove incompatible, so long as they resolve them later.
See also:
pessimistic concurrency.
- override:
To replace a method in a parent class with one in a child class.
See also:
inheritance,
polymorphism.
P
- pack:
To put data in a contiguous block of memory; also called
marshalling.
- packet:
The smallest unit of data exchange on a computer network. A
packet consists of a header specifying its length, destination,
and other values, and a payload containing the data to be
transmitted.
See also:
Internet Protocol (IP),
Transmission Control Protocol (TCP),
User Datagram Protocol (UDP).
- pair programming:
The practice of having two programmers sit together in front of
a single keyboard when writing code. Pair programming is part of
the core of extreme
programming; its advocates claim that it improves code
quality and intra-project communication [Williams & Kessler 2003].
- paper prototyping:
A user interface design technique in which designers create
low-fidelity sketches of UI features, then assign users tasks and
“play computer”.
- parent class:
In object-oriented programming, the class from which a new
child class is derived.
- parent directory:
The directory “above” a particular directory;
equivalently, the directory that “contains” the one in
question. Every directory in a file system except the root must have a unique parent. A
directory's parent is usually referred to using the shorthand
notation
.. (pronounced “dot dot”).
- path:
A non-empty string specifying a single file or directory.
Paths consist of zero or more directory names, optionally followed
by a filename. Directory and file names are separated by
“/” (on Unix) or “\” (on Microsoft
Windows). If the path begins with this character, it is an
absolute path; otherwise,
it is a relative path.
On Microsoft Windows, a path may optionally begin with a drive letter.
- pattern rule:
In
Make, a rule that specifies a general way to
manage an entire class of files. For example, a pattern rule
might specify how to compile any C file, rather than just a
particular C file. Pattern rules typically make use of automatic variables.
- peer-to-peer architecture:
A symmetric system in which all participants communicate equally.
- persistence:
Saving data structures on disk, or in other long-term storage, so
that they can be recreated later.
See also:
checkpoint,
object-relational mapping.
- Personal Software Process (PSP):
An approach to improving software development practices in which
programmers record how long they spend on every task, how many
errors they make, and so on.
- pessimistic concurrency:
Any scheme which prevents different processes from ever making
conflicting changes to a shared resource.
See also:
optimistic concurrency.
- phishing:
Tricking someone into providing information they shouldn't, e.g.,
by impersonating a trusted web site.
- phony target:
In a build system, a target
that does not correspond to a file or other object. Phony targets
are usually just symbolic names for sequences of actions.
- pipe:
A connection from the output of one program to the input of
another. When two or more programs are connected in this way,
they are called a “pipeline”.
- plaintext:
Data that has not been encrypted,
i.e., data that is in its original, readable, form.
See also:
cipher,
ciphertext.
- polymorphism:
A mechanism that allows objects of
different classes to be treated in the
same way. In most languages, polymorphism depends on inheritance, but some languages (such as
Python) allow duck typing, so that
classes without a common ancestor that implement the same methods
can be treated polymorphically.
See also:
encapsulation,
inheritance.
- port:
A non-negative integer that identifies a socket connection on a
particular machine. Ports 0-1023 are reserved for the operating
system's use.
- post mortem:
The final phase of a software project, in which the team discusses
what went right and what went wrong.
- post mortem debugging:
The act of debugging a program after it has terminated,
typically by inspecting a core
dump.
- post-condition:
A condition which a function or method guarantees will be true
if it terminates normally.
See also:
design by contract,
invariant,
pre-condition.
- pre-condition:
A condition which must be true at the start of a function or
method in order for it to execute correctly.
See also:
design by contract,
invariant,
post-condition.
- prerequisite:
In a build system, a file whose state some other file depends
on. If any of a file's prerequisites are newer than the file
itself, the file must be updated. A file's prerequisites are also
called its dependencies.
See also:
action,
target.
- primary key:
One or more columns in a database table whose values are
guaranteed to be unique for each row, i.e., whose values uniquely identify
the entry.
- principle of least privilege:
Granting users only the privileges they actually need in order to
accomplish a specific operation, and no others.
- private key:
One of the two keys used in an asymmetric cipher. The private key
is kept secret, while the public key
is shared with anyone the key's owner wishes to communicate with.
- procedural programming:
A way to structure programs that separates data (the
“what”) from functions (the “how”).
See also:
object-oriented programming.
- process:
A running instance of a program, containing code, variable
values, open files and network connections, and so on. Processes
are the “actors” that the operating system manages;
typically, the OS runs each process for a few milliseconds at a
time to give the impression that they are executing
simultaneously.
- program slice:
The subset of a program's statements which can affect the value
of a particular variable at some point in a program.
- public key:
One of the two keys used in an asymmetric cipher. The public key
is shared with anyone the key's owner wishes to communicate with,
while the private key is kept
secret.
- public key cryptography:
A cryptographic system based on an asymmetric cipher, in which the
keys used for encryption and decryption are different, and one cannot
be guessed or calculated from the other.
See also:
private key,
public key.
- publish-subscribe:
A technique for sharing content, in which an author makes the
material available, and readers download it when they want it
(rather than having it sent to them automatically).
Publish-subscribe is sometimes called “content pull”,
to distinguish it from the “content push” model of
mailing lists.
Q
- query:
A database operation that reads values, but does not modify
anything. Queries are expressed in a special-purpose language
called SQL.
R
- race condition:
A situation in which the final state of a system depends on
which of two or more competing processes modifies the state
last. For example, if two people make changes to a shared file,
the final contents of the file depend on who saves their changes
last. Race conditions are usually bugs, and are notoriously hard
to track down.
- raise exception:
To signal an error by creating an exception, and triggering the process by
which the program searches for a matching handler.
See also:
catch exception.
- raw string:
In Python, a string in which the backslash character represents
itself, rather than introducing an escape sequence. Raw strings are
written with a leading
r, as in r"a\nb".
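The difference is easy to see by comparing lengths. The variable names are chosen for the example.

```python
escaped = "a\nb"   # three characters: 'a', a newline, 'b'
raw = r"a\nb"      # four characters: 'a', a backslash, 'n', 'b'

print(len(escaped), len(raw))
```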
- record:
A synonym for a database row.
- refactor:
To rewrite or reorganize software in order to improve its
structure or readability [Fowler 1999].
- reference counting:
Keeping track of the number of references to an object while a
program is running, so that it can automatically be destroyed when
it is no longer in use. Reference counting is an easy way to do
garbage collection, but
isn't guaranteed to collect all objects: if A and B refer to each
other, but nothing else refers to them, their reference counts
will not be zero, and they will not be recycled.
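The cycle problem can be observed in CPython, which combines reference counting with a separate cycle detector exposed through the standard gc module. This sketch assumes CPython; other Python implementations manage memory differently and may behave otherwise. The class Node and the function make_cycle are invented for the example.

```python
import gc
import weakref


class Node:
    """Minimal object that can hold a reference to another Node."""


def make_cycle():
    a, b = Node(), Node()
    a.other, b.other = b, a   # a and b now refer to each other
    return weakref.ref(a)     # a weak reference does not keep a alive


gc.disable()                  # leave only reference counting active
watcher = make_cycle()
alive_with_refcounts_only = watcher() is not None

gc.collect()                  # explicitly run the cycle detector
alive_after_cycle_collection = watcher() is not None
gc.enable()

print(alive_with_refcounts_only, alive_after_cycle_collection)
```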
- referential integrity:
The internal consistency of values in a database. If an entry in
one table contains a foreign key,
but the record that key is supposed to
identify doesn't exist, referential integrity has been violated.
- reflection:
Having a program treat itself as data, i.e., examine or manipulate
its own state.
- register:
A small amount (typically only 4 or 8 bytes in size) of very
fast memory that is built into a microprocessor. Most modern
computer architectures only operate on values in registers; data
must be moved from memory into registers, and results moved the
other way. The term is also used to refer to variables in virtual machines that play a similar
role.
- regression test:
A test that checks whether things that used to work are still
working; equivalently, a test that checks whether errors that had
been eliminated have been reintroduced.
See also:
integration test,
unit test.
- regular expression (RE):
A pattern that specifies a set of character strings. In
programs, REs are most often used to find sequences of characters
in strings.
See also:
anchor,
group,
match object.
- relation:
In mathematical terms, a subset of the cross product of several sets; in human
terms, a set of values which are connected in some logical way.
- relational database:
A collection of data organized into tables, each of which is made up of
columns and rows.
- relative path:
A path that specifies the
location of a file or directory with respect to the current working
directory. Any path
that does not begin with a separator character
(“/” or “\”) is a relative path.
See also:
absolute path.
- relative reference:
A spreadsheet cell reference that
is automatically adjusted when a formula is moved from one
location to another. Relative references are created simply by
naming the cell, as in
C4.
See also:
absolute reference.
- release a lock:
To relinquish a lock in order to signal
that other processes may now use a
shared resource.
See also:
acquire a lock.
- reluctant matching:
In a regular expression,
the policy of matching as little as possible, while still
satisfying the match.
See also:
greedy matching.
- replay attack:
An attack in which messages are recorded, then played back at a
later date. For example, an attacker might record the signal that
means “open the vault”, then use it to fool the system
into opening the vault door several hours later.
- repository:
A central storage area where a version control system
stores old revisions of files,
along with information about who created them and when.
- repository browser:
A read-only interface (usually web-based) to a version control repository.
- resolve:
To eliminate the conflicts
between two or more incompatible changes to a file or set of files
being managed by a version
control system.
- revision:
A particular state of a file, or a set of files, being managed
by a version control
system.
- risk assessment:
The process of determining how a system's security could be
attacked, and what the effects of different failures would be.
- roadmap:
A display that shows a project's future milestones.
- role:
A description of what some class of users can and cannot do to a
system. A role is typically described by listing the actions its
members can perform; they simplify administration by making it
possible to redefine the capabilities of an entire group in a
single step.
- roll back:
To undo a set of revisions in a
version control system
in order to return content to a previous state.
- root directory:
The top-most directory in a file
system's directory
tree. Its name is the operating system's separator
character, i.e., “/” on Unix (including Linux and Mac
OS X), and “\” on Microsoft Windows.
- RSS:
An XML data format used for syndicating content, such as blogs: the acronym stands for Rich Site Summary,
RDF Site Summary, or Really Simple Syndication. Someone who
wishes to publish a blog creates an RSS file (typically using
off-the-shelf software) and places it on their web server.
Blog readers can then periodically check for updates, and, if there
are any, download and display the associated articles.
- rule:
In a build system, a specification of a target's prerequisites, and what action(s) to take to bring the target up to date.
S
- screen scraping:
Using a program to extract information from an HTML page intended
for human viewing. Screen scraping is a quick way to solve simple
problems, but breaks down when the pages are complex, or their
format changes frequently.
See also:
web services,
web spider.
- search path:
The list of directories that the operating system searches when
the user asks to run a program. The search path is usually stored
in the user's
PATH environment variable. On
Unix, entries are separated by “:”, while on Windows,
they are separated by “;”.
- seek:
To move to an arbitrary location in a file.
- sequence:
A set of objects arranged in a dense, linear fashion, so that
they may be referred to by their index. In Python, strings,
lists, and tuples are built-in sequence types, since the elements
of each may be referred to as
s[0], s[1], and so on
up to s[N-1], where N is the sequence's length.
- serialization:
The act of forcing operations to execute one at a time, instead of
concurrently.
- server:
A software application that provides data to other programs. The
consumer is called a client.
See also:
web server.
- servlet:
A Java class that is loaded and run by a servlet container to generate web
content. Servlets are an alternative to CGI scripts.
- servlet container:
A long-running server application, similar to a web server, that loads and runs Java
classes called servlets to produce web
content.
- shared library:
A compiled library that is loaded into memory at most once, and
whose contents are shared by all running programs that reference
it. Shared libraries are implemented on Windows by
.dll
files, and on Linux by .so files.
- shell:
A command-line user interface program, such as Bash (the
Bourne-Again Shell) or the Microsoft Windows DOS shell. Shells
commonly execute a read-evaluate-print cycle: when the user enters
a command in response to a prompt, the shell either executes the
command itself, or runs the program that the command has
specified. In either case, output is sent to the shell window,
and the user is prompted to enter another command. Most shells
include commands for looping, conditionals, and defining
functions, so that small (and sometimes large) programs can be
written by putting a sequence of shell commands in a file.
- short-circuit evaluation:
Evaluation of an expression from left to right that stops as
soon as the expression's final value is known. For example, if
x is false, the computer does not call the function
f in the expression x and f(x). Similarly, if
x is true, f does not have to be called in x or
f(x).
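A short sketch that records calls to f to show when it is skipped. The list calls exists only for the demonstration.

```python
calls = []


def f(x):
    """Record that we were called, then return True."""
    calls.append(x)
    return True


x = False
result_and = x and f(x)   # x is false, so f is never called

x = True
result_or = x or f(x)     # x is true, so f is never called

calls_after_short_circuits = list(calls)

result_needed = x and f(x)  # here the result depends on f, so f runs

print(calls_after_short_circuits, calls)
```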
- silver bullet:
A tool or technique that purports to solve a hard problem, but
which is too good to be true. The term comes from the myth that
only silver bullets can kill werewolves (in fact, ruthenium,
rhodium, and palladium are equally effective).
- Simple Object Access Protocol (SOAP):
A misleadingly-named standard for exchanging XML documents between
programs over the Internet. SOAP is the building block for most
modern web services.
- single-step:
To advance a program by one instruction, or one line, while
debugging.
See also:
step into,
step over.
- slice:
A regular subsequence of a larger sequence, such as the first five elements,
or every second element.
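Both examples can be written with Python's slice notation. The list letters is made up for the illustration.

```python
letters = list("abcdefghij")

first_five = letters[:5]      # elements at indices 0 through 4
every_second = letters[::2]   # elements at indices 0, 2, 4, ...

print(first_five)
print(every_second)
```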
- social engineering:
An attack based on deceiving users, or on relying on social
conventions. Posing as an old lady in distress, or as a bank
official who is checking information, are both attacks of this
kind.
- socket:
One end of an IP
communication channel.
- sparse:
Being mostly empty. A sparse vector or matrix is one in which
most values are zero.
- special method:
In Python, a method which has a
special meaning to the interpreter. By convention, these
methods' names begin and end with two underscore characters.
For example, if an object has a
__str__ method, Python
automatically calls it whenever it needs a text representation of
the object.
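A minimal sketch of a class that defines __str__. The class Point is hypothetical.

```python
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __str__(self):
        # Called by str() and print() when a text representation is needed.
        return f"({self.x}, {self.y})"


p = Point(1, 2)
print(p)
```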
- specification:
A formal or semi-formal description of what a piece of software
is supposed to do. Specifications may include everything from
English prose (“The system must be able to handle at least
100 requests per second”) to algebra so complex that neither
customers nor developers really understand it.
- spiral model:
A software development process which creates successively larger
prototypes on the way to delivering the final application. The
development of each prototype goes through the steps of the waterfall model.
- spreadsheet:
A program for manipulating tabular numeric data, or the data
manipulated in that way. Microsoft Excel is the most widely used
spreadsheet in the world, but many others (such as
Gnumeric) also exist.
- SQL:
A special-purpose language for describing operations on relational databases. SQL is
not actually an acronym for “Structured Query
Language”.
- stack frame:
A data structure that provides storage for a function's local
variables. Each time a function is called, a new stack frame is
created and put on the top of the call
stack. When the function returns, the stack frame is
discarded.
- stack pointer:
A register that points at the
top of the call stack.
- standard error:
A process's “other” default output stream,
typically used for error messages.
See also:
standard output.
- standard input:
A process's default input stream. In interactive command-line
applications, it is typically connected to the keyboard; in a
pipeline, it receives data from
the standard output of
the preceding process.
- standard output:
A process's default output stream. In interactive command-line
applications, data sent to standard output is displayed on the
screen; in a pipeline, it is
passed to the standard
input of the next process.
- starvation:
A situation in which a process never completes a task because
other processes are continually being given access to a resource
that the starving process needs. Starvation is not the same as
deadlock, although the
symptoms are similar.
- stateless protocol:
A communication protocol in which each basic operation is
independent of every other. HTTP is the
best-known example: servers do not
remember anything about clients between
requests.
- static space:
A portion of a program's memory reserved for storing values
that are allocated even before the program starts to run, such as
constant strings.
See also:
call stack,
heap.
- status code:
An integer value returned to the operating system by a program
when that program terminates, which indicates whether the program
terminated normally or abnormally. By convention, 0 is used to
indicate normal termination (“zero errors”), while
non-zero values indicate specific problems (e.g., 1 for
“file not found”, 2 for “no permission”,
etc.).
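A parent program can read a child's status code. This sketch uses Python's standard subprocess module and launches the current Python interpreter as the child; the status value 2 is arbitrary.

```python
import subprocess
import sys

# A child that finishes normally reports status 0.
ok = subprocess.run([sys.executable, "-c", "pass"])

# A child that calls sys.exit(2) reports status 2.
failed = subprocess.run([sys.executable, "-c", "import sys; sys.exit(2)"])

print(ok.returncode, failed.returncode)
```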
- step into:
To go into a function call when debugging.
See also:
single-step,
step over.
- step over:
To execute a function without going into it when debugging.
See also:
single-step,
step into.
- stored procedure:
A function or program that has been compiled and stored in a
database for more efficient execution.
- stub:
A temporary placeholder for a function or method that hasn't been
written yet. Stubs typically return the same value on every call,
or (less often) a random value.
- sturdy language:
A language that separates compilation from execution in order
to maximize performance, check safety conditions, or both. Sturdy
languages typically have longer turnaround times than nimble languages, but scale up to
very large problems better. Sturdy languages include C/C++,
Fortran, Java, and C#.
- submodule:
A module that is contained inside
another module. Large software libraries are divided into
submodules for the same reason that large programs are divided
into functions.
- suspended process:
A process which is not running. A
process may be suspended because some other process is using the
CPU, or because it is waiting to acquire
a lock.
- symmetric cipher:
A cipher in which a single key is used
for both encryption and decryption. Symmetric ciphers are less
secure than asymmetric ones,
but are typically much faster.
T
- tag (in version control):
A symbolic label in a version control system that uniquely
identifies a particular state of the repository.
See also:
branch.
- tag (in XML):
A textual representation of an XML element. Tags come in matched opening and
closing pairs, such as
<x> and </x>; if
the element the tag pair represents does not contain text or other
elements, the short form <x/> may be used.
See also:
branch.
- target:
In a build system, a thing that may be created or updated.
Targets typically have prerequisites that must be up to
date before the target itself can be updated. Targets may also be
symbolic, i.e., there may be targets that do not correspond to
files or other objects. In this case, the target is simply a
symbolic name for a set of actions.
See also:
action,
default target,
dependency,
phony target.
- target program:
The program being controlled by a debugger; also called the debuggee.
- template:
An outline of a web page, which a program then fills in with
specific content. In older systems, templates contain a mix of
program code and HTML; newer systems try to keep the two separate
in order to simplify maintenance.
- test suite:
A collection of unit tests. Tests
are grouped into test suites in order to make them easier to
manage, and so that developers can easily re-run
logically-connected sets of tests.
- test-driven development (TDD):
The practice of writing unit tests before writing
application code. TDD is a core practice in Extreme Programming, but has
been around since at least the 1970s. Its main advantage is that
it helps programmers clarify their ideas about what their code is
supposed to do before they have become emotionally attached to
that code. It also increases the odds of some tests actually
being written, and gives programmers a finish line to aim for:
when all the tests pass, the code must be done.
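A minimal sketch of a test suite built with Python's unittest module. The function running_sum and its tests are hypothetical; in a test-driven style the two test methods would be written first, and the function filled in until both pass.

```python
import unittest


def running_sum(values):
    """Hypothetical function under test: cumulative totals of a list."""
    total = 0
    result = []
    for v in values:
        total += v
        result.append(total)
    return result


class TestRunningSum(unittest.TestCase):
    def test_empty(self):
        self.assertEqual(running_sum([]), [])

    def test_several(self):
        self.assertEqual(running_sum([1, 2, 3]), [1, 3, 6])


# Group the related tests into one suite so they can be re-run together.
suite = unittest.TestLoader().loadTestsFromTestCase(TestRunningSum)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```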
- text:
The non-element content of an XML document; in an
HTML page, the text is what is displayed, while the tags control its formatting.
- three-tier architecture:
An architecture in which data is stored in a database (tier 1),
which is manipulated by a server (tier
2), and viewed in a web browser (tier 3).
- ticket:
A single work item in an issue
tracker. A ticket may describe a bug that needs to be
fixed, an enhancement that is to be added, a question that needs
to be answered, or any other task.
See also:
ticket, closed,
ticket, open.
- ticket, assigned:
A ticket that someone is currently responsible for.
- ticket, closed:
A ticket that has been completed.
(Note, however, that closed tickets may later be reopened if it
turns out that a bug fix doesn't work, or that an enhancement is
incomplete.)
- ticket, open:
A ticket that has not yet been
completed.
- traceability:
The ability to determine where a piece of code or data came
from, and/or how it was produced.
- transaction:
A set of operations which take effect in a reliable, consistent
manner. If a transaction cannot be completed (e.g., because of a
system failure), it is guaranteed to have no effect.
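The "no effect on failure" guarantee can be sketched with Python's sqlite3 module and an in-memory database. The account table, the names, and the simulated failure are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

try:
    # Used as a context manager, the connection commits if the block
    # succeeds and rolls back if an exception escapes from it.
    with conn:
        conn.execute(
            "UPDATE account SET balance = balance - 30 WHERE name = 'alice'")
        raise RuntimeError("simulated failure before the matching credit")
except RuntimeError:
    pass

# The half-finished transfer has been undone.
balances = dict(conn.execute("SELECT name, balance FROM account"))
print(balances)
```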
- Transmission Control Protocol (TCP):
A communication protocol in the IP family that provides reliable
in-order delivery of data. Programs communicating via TCP can
read and write as they would with files (at least, until something
goes wrong).
See also:
socket,
User Datagram Protocol (UDP).
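A minimal sketch of file-like reading and writing over TCP, using Python's socket module on the local loopback address. For brevity both ends live in one process, which is an assumption of this example; real programs would normally run the two ends separately. It needs Python 3.8 or later for socket.create_server.

```python
import socket

# Port 0 asks the operating system to pick any free port.
server = socket.create_server(("127.0.0.1", 0))
client = socket.create_connection(server.getsockname())
conn, _ = server.accept()

# Write to the connection as if it were a text file.
with client.makefile("w") as out:
    out.write("first\nsecond\n")

# Read the data back, in order, from the other end.
with conn.makefile("r") as inp:
    lines = [inp.readline().rstrip("\n") for _ in range(2)]

client.close()
conn.close()
server.close()
print(lines)
```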
- triage:
The process of sorting, prioritizing, and assigning tickets. As the project deadline approaches,
triage is done more frequently, in order to keep the team focused
on things that actually need to be done.
- trigger:
A procedure which is automatically invoked when a database table
is modified. The term is also applied to code that runs whenever
the contents of a version control repository are updated.
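A small sketch of a database trigger, using SQLite through Python's sqlite3 module. The table and trigger names (reading, audit_log, log_insert) are invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reading (id INTEGER PRIMARY KEY, value REAL);
    CREATE TABLE audit_log (reading_id INTEGER, note TEXT);

    -- Runs automatically after every insert into 'reading'.
    CREATE TRIGGER log_insert AFTER INSERT ON reading
    BEGIN
        INSERT INTO audit_log VALUES (NEW.id, 'inserted');
    END;
""")

# The program only touches 'reading'; the trigger fills in 'audit_log'.
conn.execute("INSERT INTO reading (value) VALUES (3.5)")
log = conn.execute("SELECT reading_id, note FROM audit_log").fetchall()
print(log)
```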
- tuple:
An immutable sequence.
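A short Python sketch of what "immutable" means in practice. The variable names are arbitrary.

```python
point = (3, 4)
x, y = point          # tuples can be unpacked like any other sequence

try:
    point[0] = 10     # but their elements cannot be reassigned
    rejected = False
except TypeError:
    rejected = True

# To "change" a tuple, build a new one instead.
moved = (point[0] + 1, point[1])
print(point, moved, rejected)
```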
- two's complement:
A way to represent signed integers in computer memory. The most
significant bit in positive integers is 0; the other bits are used
to store magnitude. Negative values “wrap around”,
like a car's odometer, so that -1 is the bit string 111…111,
-2 is 111…110, and so on, all the way to the most negative
number, which is 100…000. Note that two's complement is
asymmetric: since zero counts as a positive number, the absolute
value of the most negative number is one greater than the absolute
value of the most positive number. Put another way, N bits can
represent values from -2^(N-1) to 2^(N-1)-1. Thus, three bits can
represent the integers -4…3.
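The three-bit case can be checked with a short sketch. The helper names to_twos_complement and from_twos_complement are made up for this example.

```python
def to_twos_complement(value: int, bits: int) -> str:
    """Return the bit string for value in a bits-wide two's complement field."""
    lowest = -(1 << (bits - 1))
    highest = (1 << (bits - 1)) - 1
    if not lowest <= value <= highest:
        raise ValueError(f"{value} does not fit in {bits} bits")
    # Masking makes negative values wrap around, like an odometer.
    return format(value & ((1 << bits) - 1), f"0{bits}b")


def from_twos_complement(pattern: str) -> int:
    """Inverse of to_twos_complement: a leading 1 marks a negative value."""
    raw = int(pattern, 2)
    return raw - (1 << len(pattern)) if pattern[0] == "1" else raw


table = {v: to_twos_complement(v, 3) for v in range(-4, 4)}
print(table)
```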
- type-switch:
A procedural way to
implement polymorphism, in which
the program tests the type of the data, then chooses which
function to call.
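A minimal sketch of a type-switch in Python. The Circle and Square classes are hypothetical; an object-oriented design would instead give each class its own area method.

```python
import math


class Circle:
    def __init__(self, radius):
        self.radius = radius


class Square:
    def __init__(self, side):
        self.side = side


def area(shape):
    """Type-switch: test the type of the data, then choose the calculation."""
    if isinstance(shape, Circle):
        return math.pi * shape.radius ** 2
    elif isinstance(shape, Square):
        return shape.side ** 2
    raise TypeError(f"unknown shape: {type(shape).__name__}")


square_area = area(Square(2))
circle_area = area(Circle(1))
print(square_area, circle_area)
```

Every new shape requires another branch in area, which is the main maintenance cost of this style.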
U
V
- validate:
To check that input data is of the right type, in range, etc.
Failing to validate data is a common source of security problems.
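A small sketch of validating a value for both type and range. The function parse_percentage and the 0 to 100 limits are assumptions chosen for this example.

```python
def parse_percentage(text: str) -> float:
    """Check that text is a number (right type) between 0 and 100 (in range)."""
    try:
        value = float(text)
    except ValueError:
        raise ValueError(f"not a number: {text!r}") from None
    # NaN fails this comparison too, so it is rejected as out of range.
    if not 0.0 <= value <= 100.0:
        raise ValueError(f"out of range: {value}")
    return value


def is_valid(text: str) -> bool:
    try:
        parse_percentage(text)
        return True
    except ValueError:
        return False


print(parse_percentage("42.5"), is_valid("150"), is_valid("abc"))
```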
- verifiable deliverable:
A project task (such as implementation of a particular feature)
whose completion can be checked by an independent observer. Where
possible, features should be described in terms of verifiable
deliverables, so that there is some way to tell what's actually
done at any time.
- version control system:
A tool for managing changes to a set of files. Each set of
changes creates a new revision
of the files; the version control system allows users to recover
old revisions reliably, and
helps manage conflicting changes made by different users.
- virtual machine (VM):
A program that makes a computer behave as if it were some other
type of computer. Many modern programming languages, such as Java
and Python, run on virtual machines, rather than directly on the
computer's hardware. The main advantages are portability (once
the VM has been ported to a new machine, all of the programs
running on it will also run on that machine) and security (the VM
can enforce much more complicated safety rules than today's
hardware). The main disadvantage is speed: since the VM may have
to execute several physical instructions to simulate a single
logical instruction, programs running on a VM may be many times
slower than programs running natively.
- vision statement:
A one- or two-sentence summary of a project's purpose and plan;
also known as an elevator pitch.
W
- watchpoint:
A breakpoint that is associated
with a variable, or a region of memory, rather than with a
location in the program's source code. The program suspends
execution whenever any of the data associated with a watchpoint is
modified.
See also:
conditional breakpoint.
- waterfall model:
A software development process in which requirements analysis,
design, implementation, and testing are done strictly in that
order. The waterfall model is almost never used in practice;
instead, its main reason for existing is to give software
engineering professors something to critique.
- web services:
Software applications that exchange data with one another by sending
XML over HTTP. Most modern web services encode
data using the SOAP standard.
See also:
screen scraping.
- web server:
A server that handles HTTP requests.
- web spider:
A program that browses the web on its own by recursively following
links in the pages it finds.
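The recursive link-following idea can be sketched without touching the network. The PAGES dictionary below is a made-up, in-memory stand-in for the web, and LinkFinder and crawl are names invented for this example; a real spider would fetch each page over HTTP instead.

```python
from html.parser import HTMLParser

# Hypothetical in-memory "web": URL -> page content.
PAGES = {
    "/index": '<a href="/a">A</a> <a href="/b">B</a>',
    "/a": '<a href="/index">home</a> <a href="/c">C</a>',
    "/b": "no links here",
    "/c": '<a href="/a">back</a>',
}


class LinkFinder(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)


def crawl(url, seen=None):
    """Visit url, then recursively visit every page it links to."""
    seen = set() if seen is None else seen
    if url in seen or url not in PAGES:
        return seen          # already visited, or a dead link
    seen.add(url)
    finder = LinkFinder()
    finder.feed(PAGES[url])
    for link in finder.links:
        crawl(link, seen)
    return seen


visited = crawl("/index")
print(sorted(visited))
```

The seen set is what stops the spider from looping forever when pages link back to each other.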
- weblog:
See blog.
- whitelist:
A list of email addresses from which messages will be accepted,
which is part of a “forbid unless allowed” authorization policy.
See also:
blacklist.
- wiki:
A web site that allows users to edit pages in place, typically
using markup rules much simpler than those of standard HTML. The
name comes from the Hawaiian word for “quickly”.
- wildcard:
A character used in pattern matching. In the Unix shell, the
wildcard “*” matches zero or more characters, so that
*.txt matches all files whose names end in .txt.
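Shell-style wildcards can be tried out in Python with the standard fnmatch module. The file names below are invented for this example.

```python
import fnmatch

names = ["notes.txt", "data.csv", "readme.txt.bak", "a.txt"]

# "*" matches zero or more characters, so "*.txt" keeps only the
# names that end in ".txt".
matches = [n for n in names if fnmatch.fnmatchcase(n, "*.txt")]
print(matches)
```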
- working copy:
A personal copy of the files being managed by a version control system.
Changes the user makes to the working copy do not affect other
users until they are committed
to the repository.
X
Y
Z
Online Resources
Ant:
A Java-oriented build tool that uses XML files instead of Makefiles.
(Viewed 2006-02-20.)
Apache:
The main site for the most widely-used web server in the
world.
(Viewed 2005-07-26.)
Bitten:
A tool for running builds continuously in the background across
multiple machines.
(Viewed 2005-09-01.)
Boost.Python:
A C++ library that enables interoperability with Python.
(Viewed 2006-04-28.)
Bugzilla:
An industrial-strength issue tracking system that is widely used in
open source projects.
(Viewed 2006-03-01.)
CollabNet:
A software project management portal used in both commercial and
open source projects.
(Viewed 2006-03-01.)
CruiseControl:
A framework for managing continuous builds. Each time you update
something in your version control repository, CruiseControl
recompiles your code, re-runs your tests, and lets you (and your
teammates) know if you've broken anything.
(Viewed 2005-07-26.)
CVS:
A version control system that has been the backbone of the open source movement
almost since its inception. Subversion is slowly replacing it.
(Viewed 2006-02-17.)
Cygwin:
A Linux-like environment for Windows, which brings with it a lot
of other tools (like SSH and GNU Make).
(Viewed 2005-07-26.)
DB2:
A high-end commercial database management system from IBM.
(Viewed 2005-02-24.)
DDD:
The Data Display Debugger is a graphical front end for a variety of
debuggers, including GDB and the Python debugger.
(Viewed 2006-04-23.)
Dive Into Python:
The complete text of [Pilgrim 2004],
available on-line.
(Viewed 2005-09-06.)
Django:
A Python web application development framework with
some of the same capabilities as Ruby on Rails (but not
as many users, or as much documentation).
(Viewed 2006-04-30.)
Docutils:
Python's documentation utilities, which are designed to convert
plain text documentation (such as docstrings)
into HTML and other formats.
(Viewed 2006-04-28.)
DrProject:
An entry-level software project management portal derived from
Trac that has been tailored for classroom and
small-team use.
(Viewed 2006-03-01.)
Eclipse:
Originally developed by IBM for Java development, Eclipse is the
biggest open source development environment around these days.
There are literally hundreds of plugins for it, and hundreds of
thousands of users. It's not for the faint of heart (and it
definitely won't be happy on a four-year-old hand-me-down machine),
but it's one of the real power tools of modern programming.
(Viewed 2005-08-10.)
ElementTree:
An alternative XML manipulation library for Python that pays more
attention to the philosophy of the language than to standards like
DOM.
(Viewed 2006-04-28.)
F2PY:
An open source tool to connect Python and Fortran code.
(Viewed 2006-04-28.)
Firefox:
The best web browser around, where “best” means
“nicest interface”, “most extensible”, and
“least insecure”.
(Viewed 2005-07-26.)
GDB:
The GNU Project debugger is a program that watches and manipulates
other programs. It works with many languages, on many platforms;
when combined with DDD, it's actually not that hard
to use.
(Viewed 2006-04-23.)
Gnumeric:
A cross-platform open source spreadsheet.
(Viewed 2006-04-08.)
Internet Groupware for Scientific Collaboration:
While it is now several years old, Udell's examination of what
the web could be, and how it could help scientists collaborate more
effectively, is still as thought-provoking as it was when it first
appeared.
(Viewed 2005-07-26.)
JUnit:
A unit testing framework for Java that has inspired many workalikes
and extensions.
(Viewed 2006-02-22.)
Kid:
An HTML templating system for Python.
(Viewed 2006-03-06.)
Make:
The standard build tool for Unix. Users describe dependencies in a Makefile, along with the actions that must be
executed to update a file if it is older than any of its
dependencies. Make then determines which actions need to be
executed, and an order in which they may safely be run.
(Viewed 2006-03-01.)
Microsoft Visual Studio:
A full-featured IDE for Microsoft Windows development.
(Viewed 2006-02-20.)
MySQL:
The most popular open source database around (though many
discerning users prefer PostgreSQL).
(Viewed 2005-07-29.)
Oracle:
A high-end commercial database system produced by a company of the
same name.
(Viewed 2006-02-24.)
Politics and the English Language:
A brilliant description of how turgid language is often used as a
substitute for thought. The particular examples may be a little
dated, but Orwell's writing never is.
(Viewed 2006-04-28.)
PEP-008: Python Style Guide:
A semi-official guide to Python coding
conventions.
(Viewed 2005-07-26.)
Perforce:
An excellent commercial version control system.
(Viewed 2005-11-25.)
PostgreSQL:
The main site for an advanced open source relational database.
It may not be as popular as its main competitor, MySQL, but most
people who have used both have found PostgreSQL easier to work
with.
(Viewed 2005-07-26.)
PyAmazon:
A simple Python library for fetching data from Amazon.com.
(Viewed 2006-04-23.)
PyChecker:
A code checking tool for Python that complements
PyLint.
(Viewed 2005-09-06.)
Pyfort:
An open source tool to connect Python and Fortran code.
(Viewed 2006-04-28.)
PyLint:
A code checking tool for Python that complements
PyChecker.
(Viewed 2005-09-06.)
Python:
The main site for all things Python.
(Viewed 2005-07-26.)
Python Cookbook:
An ever-growing collection of Python tips and
tricks.
(Viewed 2005-07-29.)
Python Software Foundation:
A non-profit organization devoted to advancing open source
technology related to Python, and the main financial
sponsor of this course.
(Viewed 2005-07-26.)
RapidSVN:
A cross-platform GUI for Subversion.
(Viewed 2005-09-13.)
Roundup:
A bug-tracking system in which each ticket automatically becomes
a self-maintaining mailing list.
(Viewed 2006-04-28.)
Ruby:
A scripting language with many of the same capabilities as Python.
(Viewed 2006-04-30.)
Ruby on Rails:
A third-generation web application framework that simplifies programmers' lives
by emphasizing convention over configuration.
(Viewed 2006-04-30.)
SCons:
A powerful Python-based build management
tool.
(Viewed 2005-07-28.)
Seamonkey Code Reviewer's Guide:
A simple set of guidelines for code reviews from the Mozilla Foundation.
(Viewed 2006-03-01.)
SQLObject:
An object-relational mapping package for Python.
(Viewed 2006-03-01.)
Software Carpentry:
The permanent home for these notes.
(Viewed 2006-08-22.)
SourceForge:
A software project management portal whose main installation is a
clearing house for thousands of open source projects.
(Viewed 2006-02-19.)
SQLite:
A small, simple, and very fast relational database that can be
run on its own, or integrated into other applications.
(Viewed 2005-07-26.)
Subversion:
The main site for Subversion is aimed more at Subversion's
developers than at its users; if you're looking for a how-to,
[Mason 2005] is a good place to start.
(Viewed 2005-07-26.)
SWIG:
A tool for generating Perl, Python, and other bindings for
C programs.
(Viewed 2006-04-03.)
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!):
Joel Spolsky's 15-minute guide to character set encodings, and
what you have to do to deal with the fact that most of the world
doesn't use the standard American alphabet. This article is
reprinted in [Spolsky 2004].
(Viewed 2005-07-26.)
TortoiseSVN:
A GUI for Subversion that is integrated into the Windows file explorer.
(Viewed 2005-09-13.)
Trac:
An entry-level software project management portal that is much
easier to install, administer, and use than full-sized alternatives
like CollabNet and SourceForge.
(Viewed 2006-03-01.)
TurboGears:
A Python web application development framework with
some of the same capabilities as Ruby on Rails (but not
as many users, or as much documentation).
(Viewed 2006-04-30.)
University of Toronto:
Canada's biggest university, and the host institution for much of
this work.
(Viewed 2005-07-26.)
Version Control with Subversion:
A free on-line version of [Collins-Sussman et al 2004].
(Viewed 2006-01-11.)
WingIDE:
A commercial IDE targeted solely at Python
developers.
(Viewed 2005-07-29.)
YesLogic:
Makers of Prince, the document formatter and generator used to
produce the PDF version of these notes.
(Viewed 2005-10-05.)
Syllabus
Introduction:
introduction, self assessment, scientific programming today, comparison with experimental science, comparison with industry, solutions, changes on the horizon, course content, what you will need, open source vs. commercial tools, contributing, recommended reading, typographic conventions, summary.
Shell Basics:
introduction, the shell, shell vs. operating system, file system, absolute and relative paths, basic navigation commands, command execution cycle, command flags, creating files and directories, basic tools, summary.
More Shell:
introduction, wildcards, input, output, and redirection, pipes, environment variables, configuration, the PATH variable, file ownership and permissions, directory permissions, changing permissions, Windows ownership and permission, some more advanced tools, summary.
Version Control:
introduction, collaboration, version control systems, choosing a version control system, basic operations, command line and GUI clients, resolving conflicts, starvation, binary files, reverting, rolling back, creating repositories and checking out working copies, Subversion command reference, reading Subversion output, summary.
Automated Builds:
introduction, build tool requirements, introducing Make, basic features, structure of a Makefile, handling multiple targets, defining phony targets, dependencies, updating dependencies, conventions, automatic variables, pattern rules, dependencies once again, macros, getting information from the outside world, functions, pros and cons, alternatives, summary.
Basic Scripting:
motivation, Python's Strengths, Python's Weaknesses, why Python?, sturdy vs. nimble execution cycle, running Python, shortcuts, variables, printing, quoting, converting values to strings, escape sequences, numbers, arithmetic, Booleans, short-circuit evaluation, comparisons, conditionals, why indentation?, while loops, break and continue, string formatting, format specifiers, supported formats, summary.
Strings, Lists, and Files:
introduction, strings, slicing, bounds checking rules, negative indices, methods, string methods, chaining method calls, membership, lists, modifying lists, concatenation, deletion, list methods, for loops, ranges, list membership, nesting lists, aliasing, tuples, multi-valued assignment, unpacking structures loops, file I/O, file I/O example, looping over files, summary.
Functions and Libraries:
introduction, defining functions, returning values, variable scope, aliasing, default parameter values, functions are objects, function attributes, creating modules, module scope, other ways to import, import executes statements, the __name__ variable, system library, command-line arguments, standard I/O, search path, exiting the program, math library, file system programming, file and directory status, The os.path Module, summary.
Style:
why read code, cognition, Python style guide, naming, scope and size, example, function length, determining functionality, reading techniques, idioms, style-checking tools, documentation, traceability, embedding documentation, docstrings.
Quality Assurance:
introduction, limits to testing, terminology, test results and specifications, structuring tests, simple example, try and except, simple exception example, exceptions, exception hierarchy, exception handler stack, raising exceptions, when and how to use exceptions, handling errors in tests, test-driven design, design by contract, assertions, defensive programming, summary.
Sets, Dictionaries, and Complexity:
introduction, sets, set operations, example, implementation and implications, why set elements must be immutable, frozen sets, language design, quantifying efficiency, algorithmic complexity, motivating dictionaries, working with dictionaries, dictionary methods, counting frequency, ordering, inverting, dictionary string formatting, variable-length argument lists, summary.
Debugging:
introduction, what's wrong with print statements, symbolic debuggers, debugger features, kinds of debuggers, integrated development environments, command-line debuggers, inspecting values, controlling execution, how debuggers work, implementing breakpoints, advanced operations, logging, logging levels, logging example, Agans' Rules, get it right the first time, what is it supposed to do?, is it plugged in?, make it fail, divide and conquer, change one thing at a time, write it down, be humble, summary.
Object-Oriented Programming:
introduction, abstract data types, terminology, a simple class, methods, members, encapsulation, constructors, constructor style, special methods, inheritance, inheritance example, overriding methods, polymorphism, duck typing, Liskov substitution principle, ecosystem example, CRC cards, summary.
More on Objects:
introduction, overriding built-in functions, operator overloading, right-hand and left-hand operators, other special methods, sparse vectors, semantics of vector length, vector behavior, dot product, addition, testing, static data members, static methods, design patterns, singleton pattern, visitor pattern, abstract factory pattern, command pattern, other patterns, summary.
Unit Testing:
introduction, unit testing frameworks, big picture, implementing checks, simple example, running sum example, cost effectiveness, setup and teardown, testing exceptions, testing I/O, stubs and mock objects, test performance, choosing tests, rectangle overlap example, what to test first, summary.
Regular Expressions:
matching constant strings, matching alternatives, precedence, escaping special characters, raw strings, sequences, optional elements, character sets, common abbreviations, special cases, anchors, extracting matches, match objects, match groups, compiled REs, finding all matches, other patterns, summary.
Binary Data:
introduction, why use binary, representing numbers, two's complement, bitwise operators, shifting, setting and clearing bits, bit flags, floating point numbers, floating point spacing, floating point roundoff, binary I/O, binary I/O mode, packing data structures, packing, unpacking data, struct module, hexadecimal characters, format specifiers, calculating sizes, endianness, variable-length data, dynamic formats, metadata, metadata file structure, summary.
XML:
introduction, history, formatting rules, text, XHTML, critique, attributes, when to use attributes, more XHTML tags, lists and tables, images, links, DOM, basic features of DOM, DOM tree example, creating a DOM tree, converting to text, other ways to create documents, details, finding nodes, walking a DOM tree, modifying a DOM tree.
Relational Databases:
history, when to use a database, experimental data example, Using SQL, creating tables, inserting data, simple selection, sorting, selection, joins, ID translation example, keys and constraints, eliminating duplicates, aggregation, grouping, self joins, null, database design, nested queries, nested query example, further examples, application programming, Python database example, concurrency, transactions, transaction example, using transactions, testing, advanced topics, summary.
Spreadsheets:
introduction, getting started, entering data, formatting data, formulas, replicating formulas, built-in functions, commonly-used functions, dependencies, conditionals, lookup tables, lookup table example, absolute references, larger data set, creating charts, customizing charts, creating a log-log chart, analysis, programming spreadsheets, summary.
Integration:
introduction, running external programs, subprocess module, running in place, running with arguments, capturing output, providing input, deadlock, pros and cons, integrating C into Python, Python object structure, calling conventions, boilerplate, loading and calling, wrapping C++, SWIG, Embedding Python in C, loading modules, plugin frameworks, manual loading, application, namespaces, summary.
Web Client Programming:
introduction, component object models, concurrency, partial failure, underlying protocols, sockets, client/server vs. peer-to-peer, socket client example, socket server example, HTTP, HTTP request line, HTTP headers, HTTP body, HTTP response, HTTP example, urllib, urllib example, building a spider, parameterizing requests, URL encoding, encoding example, screen scraping, web services, Amazon example.
Web Server Programming:
introduction, motivation for CGI, CGI, passing information to the CGI, passing information back, MIME types, basic CGI, invoking the basic CGI, generating dynamic content, forms, creating forms, example form, parameter names, handling form data, form handling example, development tips, server-side state, maintaining state in files, HTML templating, concurrency, file locking, the problem of state, cookies, creating cookies, cookie example, cookie tips.
Security:
introduction, goals, limitations of technical solution, terminology, risk assessment, cataloguing attacks, example attack, leaking information, SQL injection, default settings and denial of service, phishing, attacking data entry, timed attacks, secure HTTP, basic cryptography, public-key cryptography, public-key cryptography in action, digital signatures, securing login, other areas of insecurity.
The Development Process:
introduction, design vs. agility, project lifecycle, vision statement, gathering requirements, from requirements to features, waterfall model, spiral model, Extreme Programming, analysis & estimation, time estimates, A&E format, reviews, prioritization, scheduling, science fiction scheduling, development, tracking progress, burn rate, winding up, post mortem, summary.
Teamware:
motivation, DrProject architecture, event log, weblogs, repository browser, mailing lists, whitelisting mail addresses, issue tracker, creating and viewing tickets, ticket guidelines, writing useful tickets, updating tickets, roadmap and milestones, priorities and triage, more complex workflows, wiki, wiki syntax, editing wiki pages, wiki links, rules of the road.
Backward, Forward, and Sideways:
introduction, classic mistakes, branching, merging, and tagging, creating patches, SCons, SCons example, persistence, pickling example, object-relational mapping, web development frameworks, refactoring, refactoring examples, refactoring tools, code reviews, reading code, code review checklist, UI design, paper prototyping, more reading, rules of programming.
Copyright © 2005-06 Python Software Foundation.