Invasion Percolation

Opening

Ethan Ecosystem is studying the way pollutants spread through fractured rock (Figure 1). To simulate this, he wants to use a model called invasion percolation.

Invasion Percolation
Figure 1: Invasion Percolation

In its simplest form, invasion percolation represents the rock that the pollutant is spreading through as a two-dimensional grid of square cells filled with random values. The algorithm starts by marking the center cell as being polluted, then looks at that cell's four neighbors (Figure 2). The one with the lowest value is the one that has the least resistance to the spread of the pollutant, so the algorithm marks that as being filled as well. It then looks at the six neighbors of the entire filled region, and once again finds and marks the one with the lowest value. This process continues until a certain percentage of cells have been filled (i.e., there's no more pollutant), or until the pollution reaches the boundary of the grid.

Invasion Percolation Algorithm
Figure 2: Invasion Percolation Algorithm

If two or more cells on the boundary are tied equal for the lowest value, the algorithm can either fill them all in simultaneously, or pick one at random and fill that in. Either way, the fractal this algorithm produces will tell Ethan how quickly the pollutant will spread, and how much of the rock will be contaminated.

But if Ethan wants to look at the statistical properties of these fractals, he will need to run many simulations on large grids. That means his program has to be fast, so this lesson will look at:

  1. how we build a program like this in the first place,
  2. how we tell if it's working correctly, and
  3. how we speed it up.

The order of the second and third steps is important. There's no point speeding something up if it isn't working correctly, or if we don't know whether it's working correctly or not. Once we know how to tell, on the other hand, we can focus on performance improvement, secure in the knowledge that if one of our bright ideas breaks things, our program will let us know.

Instructors

I have never gotten as far as this lesson in a two-day bootcamp aimed at novices, but I have frequently used it in week-long classes, or in classes aimed at more experienced learners. It has everything we want in an extended example: the problem being solved is easy to understand, many of the ideas we've discussed in earlier lessons come up naturally during implementation, testing is challenging but possible, and it motivates two new ideas (performance profiling and algorithmic improvement). It's also a great lead-in to a discussion of object-oriented programming: once the grid has been refactored a couple of times, it's natural to rewrite it as a class, and then swap in another version that has the same interface but uses NumPy for storage.

How this lesson is presented is just as important as its content. We teach top-down refinement, but most programmers actually start by working out a few key details, then growing toward those fixed points. For example, a programmer might build a clustering program by implementing the core algorithm and the data structures it uses, then growing outward to include I/O routines, some post-analysis, a user interface, and so on. This lesson uses that development cycle, which is why the main routine of the program doesn't appear until the middle.

The Grid

Objectives

  • Write code that uses a list of lists to represent a two-dimensional grid.
  • Explain why programmers put FIXME comments in code.

Lesson

Let's start by looking at how to represent a two-dimensional grid in Python. By "two-dimensional", we mean something that's indexed by X and Y coordinates. Our grid is discrete: we need to store values for locations (1, 1), (1, 2), and so on, but not locations like (1.512, 7.243).

Why Not Use a NumPy Array?

Because (a) we want to get some more practice manipulating lists and dealing with aliasing, and (b) we need to leave something for learners to do as challenges.

Each cell in the grid stores a single random value representing the permeability of the rock. We also need a way to mark cells that have been filled with pollutant. Once a cell has been filled, we don't care about its value any longer, so we can use any number that isn't going to appear in the grid otherwise as a marker to show which cells have been filled. For now, we'll use -1.

Overloading Meaning

Note that this means we're using integers in two ways. The first is as actual data values; the second is as flags to represent the state of a cell. This is simple to do, but if we ever get data values that happen to contain the numbers that we're using to mark filled cells, our program will misinterpret them. Bugs like this can be very hard to track down.

Before we go any further, we also have to make some decisions about the shapes of our grids. First, do grids always have to be square, i.e., N×N, or can we have rectangular grids whose X and Y sizes are different? Second, do grids always have to be odd-sized, so that there's a unique center square for us to fill at the start, or can we have a grid that is even in size along one or both axes?

The real question is, how general should we make the first version of this program—indeed, of any program? Some people say, "Don't build it until you need it," while others say, "A week of hard work can sometimes save you an hour of thought." Both sayings are true, but as in any other intellectually demanding field, knowing what rules to apply when comes with experience, and the only way to get experience is to work through many examples (and make a few mistakes). Again, let's do the simple thing for now, and assume that grids are always square and odd-sized.

Now, how are we going to store the grid? One way is to use a nested list. This lets us use double subscripts to refer to elements, which is really what we mean by "two dimensional". Here's a piece of code that builds a grid of 1's as a list of lists; we'll come back later and show how to fill those cells with random values instead:

# Create an NxN grid of random integers in 1..Z.
assert N > 0, "Grid size must be positive"
assert N%2 == 1, "Grid size must be odd"
grid = []
for x in range(N):
    grid.append([])
    for y in range(N):
        grid[-1].append(1) # FIXME: need a random value

The first thing we do is check that the grid size N is a sensible value. We then assign an empty list to the variable grid. The first time through the outer loop, we append a new empty list to the outer list. Each pass through the inner loop, we append the value 1 to that inner list (Figure 3). We go back through the outer loop to append another sub-list, which we grow by adding more 1's, and so on until we get the grid that we wanted.

Building the Grid
Figure 3: Building the Grid
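
As an aside, the same structure can be built in a single statement with a nested list comprehension. This is only a sketch for comparison; we'll stick with explicit loops in this lesson, since practicing with them is part of the point:

# The same NxN grid of 1's, built with a nested list comprehension instead.
grid = [[1 for y in range(N)] for x in range(N)]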

FIXME

At any point when writing a program, there may be half a dozen things that need to be done next. The problem is, we can only write one at a time. It's therefore a common practice to add comments to the code to remind ourselves of things we need to come back and fill in (or tidy up) later. It's equally common to start such comments with a word like "FIXME" or "TODO" to make them easier to find with tools like grep.

Key Points

  • Get something simple working, then start to add features, rather than putting everything in the program at the start.
  • Leave FIXME markers in programs as you are developing them to remind yourself what still needs to be done.

Challenges

  1. Draw a diagram showing how the XY coordinates used to index the list of lists above map onto a Cartesian XY grid, then write a function show_grid that prints the grid's values in normal (Cartesian) orientation.

Aliasing

Objectives

  • Correctly diagnose bugs caused by aliasing.

Lesson

Before we go further with our list-of-lists implementation, we need to revisit the issue of aliasing and look at some bugs that can arise when your programs use it. Take another look at the list-of-lists in Figure 3. A single list serves as the structure's spine, while the sublists store the actual cell values.

Here's some code that tries to create such a structure but gets it wrong:

# Incorrect code
assert N > 0, "Grid size must be positive"
assert N%2 == 1, "Grid size must be odd"
grid = []
EMPTY = []
for x in range(N):
  grid.append(EMPTY)
  for y in range(N):
    grid[-1].append(1)

The only change we've made is to introduce a variable called EMPTY so that we can say, "Append EMPTY to the grid" in our loop. How can this be a bug? Aren't meaningful variable names supposed to be a good thing?

To see what's wrong, let's trace the execution of this program. We start by assigning an empty list to the variable grid. We then assign another empty list to the variable EMPTY. In the first pass through our loop, we append the empty list pointed to by EMPTY to the list pointed to by grid to get the structure shown in Figure 4. We then go into our inner loop and append a 1 to that sublist. Going around the inner loop again, we append another 1, and another. We then go back to the outer loop and append EMPTY again.

Aliasing Bug
Figure 4: Aliasing Bug

The structure shown on the left is now broken: both cells of the list pointed to by grid point to the same sublist, because EMPTY is still pointing to that list, even though we've changed it. When we go into the inner loop the second time, we're appending 1's to the same list that we used last time.
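
One way to fix the bug while keeping the meaningful name is to append a copy of EMPTY each time, so that every row is a distinct list. Here's a minimal corrected sketch:

# Corrected version: each row is a fresh copy, so no two rows are aliased.
grid = []
EMPTY = []
for x in range(N):
    grid.append(list(EMPTY))   # list() builds a brand-new list each time through
    for y in range(N):
        grid[-1].append(1)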

Debugging With Pictures

Aliasing bugs are notoriously difficult to track down because the program isn't doing anything illegal: it's just not doing what we want in this particular case. Many debugging tools have been built over the last thirty years that draw pictures of data structures to show programmers what they're actually creating, but none has really caught on yet, primarily because pictures of the objects and references in real programs are too large and too cluttered to be comprehensible. As a result, many programmers wind up drawing diagrams like Figure 4 by hand while they're debugging.

Key Points

  • Aliased data can be modified at several different times or places in a program, which is frequently a bug.
  • Draw pictures of data structures to aid debugging.

Challenges

FIXME

Randomness

Objectives

  • Explain why computer-generated "random" numbers aren't actually random.
  • Create and re-create pseudo-random numbers in a program using library functions.

Lesson

Now that we have a grid, let's fill it with random numbers chosen uniformly from some range 1 to Z. We should check the science on this, as there was nothing in our original specification that said the values should be uniformly distributed, but once again we'll do something simple, make sure it's working, and then change it later. Our code looks like this:

from random import seed, randint
assert N > 0, "Grid size must be positive"
assert N%2 == 1, "Grid size must be odd"
assert Z > 0, "Range must be positive"
seed(S)
grid = []
for x in range(N):
    grid.append([])
    for y in range(N):
        grid[-1].append(randint(1, Z))

The changes are pretty small: we import a couple of functions from a library, check that the upper bound on our random number range makes sense, initialize the random number generator, and then call randint to generate a random number each time we need one.

To understand these changes, let's step back and look at a small program that does nothing but generate a few seemingly random numbers:

from random import seed, randint
seed(4713983)
for i in range(5):
    print randint(1, 10),
7 2 6 6 5

The first step is to import functions from the standard Python random number library called (unsurprisingly) random. We then initialize the sequence of "random" numbers we're going to generate—you'll see in a moment why there are quotes around the word "random". We can then call randint to produce the next random number in the sequence as many times as we want.
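
The reason we bother seeding at all is reproducibility: re-seeding with the same value replays exactly the same sequence. Here's a quick check (the particular numbers produced depend on your version of Python):

from random import seed, randint

seed(4713983)
first = [randint(1, 10) for i in range(5)]
seed(4713983)                    # same seed...
second = [randint(1, 10) for i in range(5)]
print first == second            # ...same sequence: prints True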

Pseudo-random number generators like the one we're using have some important limitations, and it's important that you understand them before you use them in your programs. Consider this simple "random" number generator:

base = 17  # a prime
value = 4  # anything in 0..base-1
for i in range(20):
    value = (3 * value + 5) % base
    print value,
0 5 3 14 13 10 1 8 12 7 9 15 16 2 11 4 0 5 3 14

It depends on two values:

  1. The base, which is a prime number, determines how many integers we'll get before the sequence starts to repeat itself. Computers can only represent a finite range of numbers, so sooner or later, any supposedly random sequence will start to repeat. Once it does, values will appear in exactly the same order they did before.
  2. The seed controls where the sequence starts. With a seed of 4, the sequence starts at the value 0. Changing the seed to 9 shifts the sequence over: we get the same numbers in the same order, but starting from a different place. We'll use this fact later on when it comes time to test our invasion percolation program.

These numbers aren't actually very random at all. For example, did you notice that the number 6 never appeared anywhere in the sequence? Its absence would probably bias our statistics in ways that would be very hard to detect. But look at what happens if 6 does appear:

base = 17
value = 6
for i in range(20):
    value = (3 * value + 5) % base
    print value,
6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6

3 times 6 plus 5 mod 17 is 6 again, and so our sequence gets stuck. How can we prove that this won't ever happen for an arbitrary seed in a random number generator? And how can we prove that something subtler won't go wrong?

In fact, computers can't generate real random numbers. But if we're clever, they can generate numbers with many of the same statistical properties as the real thing. This is very hard to get right, so you should never try to build your own random number generator. Instead, you should always use a function from a good, well-tested library (like Python's).

Any one who considers arithmetical methods of producing random digits is, of course, in a state of sin. For, as has been pointed out several times, there is no such thing as a random number. There are only methods to produce random numbers, and a strict arithmetic procedure of course is not such a method.
— John von Neumann

Key Points

  • Use a well-tested random number generation library to generate pseudorandom values.
  • Those values are pseudo-random: they look random, but are generated by a deterministic algorithm.
  • If a random number generation library is given the same seed, it will produce the same sequence of values.

Challenges

FIXME

Neighbors

Objectives

  • Explain what short-circuit evaluation is, when it occurs, and how to take advantage of it.

Lesson

Now that we have filled our grid, let's find cells' neighbors. (We need to do this because pollutant can only spread to cells that are adjacent to ones that have already been filled.) The blue cell shown in Figure 5 is a neighbor of the filled region if any of the green cells already have that special marker value. Note that we're not looking at the cells that are on the diagonals: we should check the science on this, but again, we'll do the simple thing for now and change it later if we need to.

How to Put Things Off

This is the second time we've said, "We'll change it later if we need to." Each time we say this, we should design our software so that making the change is as easy as possible—ideally, a matter of changing one function call or one parameter value. We haven't done that yet, but we will by the end of this chapter.

Filling Neighbors
Figure 5: Filling Neighbors

Here's a piece of code that tests whether a cell is a candidate for being filled in:

# Is a cell a candidate for filling?
# Version 1: has bugs!
for x in range(N):
    for y in range(N):
        if is_filled(grid, x-1, y) \
        or is_filled(grid, x+1, y) \
        or is_filled(grid, x, y-1) \
        or is_filled(grid, x, y+1):
            ...cell (x, y) is a candidate...

It seems simple: for each (x, y) coordinate in the grid, look at the cells that are left, right, up, and down. If any of them is filled, then this cell is a candidate for filling.

However, this code doesn't take into account what happens at the edges. If we subtract 1 when x is zero, we get an X coordinate of -1. In Python, that means the last cell in the row (since negative indices count backward from the end of a list). That will wrap around to the far edge, which is definitely not what we want.

The situation on the other border isn't quite as bad: if we add one to the X coordinate when we're at the right-hand edge, our neighbor index will be out of bounds and Python will raise an exception, so at least we'll know we did something wrong.
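
Here's a small demonstration of both edge behaviors, using a plain list on its own outside the grid code:

row = [10, 20, 30]
print row[-1]           # prints 30: a negative index silently wraps around to the end
try:
    print row[3]        # an index past the end raises IndexError instead
except IndexError as err:
    print 'IndexError:', err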

Here's another version of the code that tests for this:

# Is a cell a candidate for filling?
# Version 2: long-winded
for x in range(N):
    for y in range(N):
        if x > 0:
            if is_filled(grid, x-1, y):
                ...cell (x, y) is a candidate...
        ...repeat for the other three cases...

For each (x, y) coordinate, we test to make sure that x is greater than zero, i.e., that we're not on the left edge, before checking to see if the cell to our left is filled. We repeat a similar double test for each of the other directions.

We can make this a bit more readable by combining the two tests with and:

# Is a cell a candidate for filling?
# Version 3: good enough for production
for x in range(N):
    for y in range(N):
        if (x > 0) and is_filled(grid, x-1, y):
            ...cell (x, y) is a candidate...
        elif (x < N-1) and is_filled(grid, x+1, y):
            ...and so on...

This works because of short-circuit evaluation. In Python (and many other languages), and only tests as many conditions as it needs to in order to determine an answer. If the first part is false, the second one doesn't need to be tested, since and is only true if both parts are true. Similarly, if the first condition in an or is true, Python doesn't bother to evaluate the second condition. This allows people to write code like:

if (value != 0) and (1.0/value < 1.0):
    ...do something...

because the division in the second part will only happen if value is non-zero (i.e., if the first part is true). Similarly, it allows code like this:

next_list = current_list or ['default']

which assigns the value of current_list to next_list if current_list contains something, or the list ['default'] if current_list is empty. Some programmers find these tricks useful; most find them hard to read.
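
To see how the trick behaves, here are the two cases run on their own:

print [] or ['default']          # ['default'], because an empty list counts as false
print [3, 1, 4] or ['default']   # [3, 1, 4], because a non-empty list counts as true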

Options

There are several other good ways to structure even this short piece of code. For example, we could check that the X and Y indices are in range inside is_filled, and always return false if they're not. We could also use one big conditional instead of an if and four elifs in order to avoid duplicating the code that does something to a cell if it is indeed a candidate:

        if ((x > 0)   and is_filled(grid, x-1, y)) \
        or ((x < N-1) and is_filled(grid, x+1, y)) \
        or ((y > 0)   and is_filled(grid, x, y-1)) \
        or ((y < N-1) and is_filled(grid, x, y+1)):
            ...cell (x, y) is a candidate...

There's no clear reason to choose any of these approaches over any of the others, at least not yet, so we'll continue with the one we have.

Key Points

  • and and or stop evaluating arguments as soon as they have an answer.

Challenges

FIXME

Putting It All Together

Objectives

  • Translate complex tests into program statements step by step.

Lesson

The next thing on our to-do list is to resolve ties between cells that share the lowest value on the boundary. For example, our specification says that we should choose one of the three highlighted cells in Figure 6 at random. How do we keep track of the cells we're supposed to be choosing from?

Handling Ties
Figure 6: Handling Ties

We're going to do this using a set, which we will fill with (x,y) tuples holding the coordinates of boundary cells that have the lowest value we've seen so far, and use a separate variable to store that lowest value. Every time we look at a new cell, we will have to consider three cases:

  1. Its value is greater than the minimum we've seen so far, so we can ignore it, because we know there are better cells elsewhere.
  2. The value of the new cell is equal to the current minimum, so we must add the new cell's (x,y) coordinates to our set.
  3. The new value is less than the current minimum, so we must replace all the coordinates that are currently in the set with the coordinates of the new cell, and re-set our minimum to be this new value.

An example will make this clearer. Suppose the range of values cells can take on is 1 to 10. Before we start looking at cells, we assign 11 to min_val (because it is one greater than the maximum possible value that could be in the grid) and assign an empty set to min_set (because we haven't looked at any cells yet). We then take a look at our first cell (Figure 7). Its value is less than min_val, so we re-set min_val to 4 (the value of the cell), and we put the coordinates of this cell (in this case, X equals 12, Y equals 23) into the set.

Example of Handling Ties
Figure 7: Example of Handling Ties

When we look at the next cell, its value is greater than the currently known minimum value, so we ignore it. The third cell is tied equal for the minimum value, so we add its coordinates—in this case, (11,22)—to our set. The next cell is greater than the minimum value, so we ignore it.

The fifth cell we examine has a value less than the minimum value we've seen previously, so we throw out all of the coordinates we've saved in the set, and put the coordinates of this cell into the set in their place. Finally, the last boundary cell has a value equal to this new minimum, so we add its coordinates to the set.

Here's the code that implements all this:

# Keep track of cells tied for lowest value
min_val = Z+1
min_set = set()
for x in range(N):
    for y in range(N):
        if ...is a neighbor...:
            if grid[x][y] == min_val:
                min_set.add((x, y))
            elif grid[x][y] < min_val:
                min_val = grid[x][y]
                min_set = set([(x, y)])

Seeing What Isn't There

Notice that since we don't need to do anything when a cell's value is greater than the minimum we've seen so far, there isn't an else to handle that case. Some people would add a comment to make that explicit, so that the logic is complete:

            if grid[x][y] == min_val:
                min_set.add((x, y))
            elif grid[x][y] < min_val:
                min_val = grid[x][y]
                min_set = set([(x, y)])
            else:
                pass # do nothing if cell value > min_val

but other people would find this more confusing than helpful. As always, the most important thing is to be consistent.

Once we have the set of candidate cells, we can use the random library's choice function to pick one:

# Choose a cell
from random import ..., choice

min_val = Z+1
min_set = set()
...loop...
assert min_set, "No cells found"
candidates = list(min_set)
x, y = choice(candidates)

Before we call choice, we check that the set actually has something in it (because if there are no cells in the set when we finish our loop, then something's gone wrong with our program). We then convert the set to a list, because the random choice function requires an argument that can be indexed, and then use that function to choose one element from the list.
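
Here is the whole idea on a tiny hand-made grid. This is a standalone sketch for illustration only: it considers every cell rather than just the boundary cells, to keep it short.

from random import seed, choice

grid = [[5, 3, 9],
        [3, 7, 6],
        [8, 3, 4]]
Z = 10

min_val = Z + 1
min_set = set()
for x in range(3):
    for y in range(3):
        if grid[x][y] == min_val:
            min_set.add((x, y))
        elif grid[x][y] < min_val:
            min_val = grid[x][y]
            min_set = set([(x, y)])

print min_val, sorted(min_set)   # 3 [(0, 1), (1, 0), (2, 1)]
seed(1234)
x, y = choice(list(min_set))     # pick one of the tied cells at random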

Key Points

  • Turn complex tests into conditional statements step by step.
  • Include do-nothing branches if it makes the code easier to understand.

Challenges

FIXME

Assembly

Objectives

  • Write a program that is over a hundred lines long using stepwise refinement.

Lesson

We now know how to create a grid, fill it with random numbers, mark cells that have been filled, find cells that might be filled next, and choose one of them at random. It's time to put all this together to create a complete program.

We will assemble the code in exactly the order we would write it (in fact, in the order in which it was written: everything so far has been a rational reconstruction, assembled after the code was working). We'll start at the top and work down, introducing functions and variables as we need them, and tidy up a little bit along the way. Here's what we write first:

'''Invasion percolation simulation.
usage: invperc.py grid_size value_range random_seed
'''
import sys, random

# Main driver.
if __name__ == '__main__':
    main(sys.argv[1:])

We start with a docstring to remind ourselves of what this program does. We then import the libraries we need and call a main function, passing in all of the command-line arguments except the first (which is the name of our script). That function starts like this:

def main(arguments):
    '''Run the simulation.'''

    # Parse parameters.
    try:
        grid_size = int(arguments[0])
        value_range = int(arguments[1])
        random_seed = int(arguments[2])
    except IndexError:
        fail('Expected 3 arguments, got %d' % len(arguments))
    except ValueError:
        fail('Expected int arguments, got %s' % str(arguments))

This code converts the first three values in arguments to integers and assigns them to grid_size, value_range, and random_seed. If we get an IndexError, it means that one of the indices 0, 1, or 2 wasn't valid, so we don't have enough arguments, and we print an error message. If we get a ValueError, it means that one of our attempts to convert a string to an integer failed, so again we print an error message.

We have used a function called fail to report errors. This doesn't exist yet, so we should go and write it:

def fail(msg):
    '''Print error message and halt program.'''
    print >> sys.stderr, msg
    sys.exit(1)

We give the function a docstring because every function should have one. Inside the function, we print the message to standard error so that it will appear on the user's console, then exit. Figure 8 shows the structure of the program so far: a documentation string, our fail function, and the main driver of our program.

Program Structure (A)
Figure 8: Program Structure (A)

The next step in main is to actually run the simulation. We do that by seeding the random number generator, creating a random grid, marking the center cell as filled, and then filling the rest of the grid:

    # Run simulation.
    random.seed(random_seed)
    grid = create_random_grid(grid_size, value_range)
    mark_filled(grid, grid_size/2, grid_size/2)
    fill_grid(grid)

This code uses three functions that don't exist yet, so we will have to go back and write them. Before doing that, though, let's finish off the main body of the program. The last task we have is to report results, but we haven't actually decided what to do about this: nothing in our specification told us whether we were supposed to draw the fractal, calculate some statistics, or do something else entirely. For now, we'll just print the number of cells we have filled in:

    # Run simulation.
    random.seed(random_seed)
    grid = create_random_grid(grid_size, value_range)
    mark_filled(grid, grid_size/2, grid_size/2)
    num_filled_cells = fill_grid(grid) + 1
    print '%d cells filled' % num_filled_cells

We have changed fill_grid so that it returns the number of cells it filled in, which we then print. Note that we have to add one to the value returned by fill_grid because we marked the center cell as being filled manually. This is a little bit clumsy: someone who hasn't read our code carefully might reasonably think that fill_grid returns the total number of cells that are filled, not one less than that. We should go back and tidy that up later.

Here's the function to create a random grid, reproduced from earlier:

def create_random_grid(N, Z):
    '''Return an NxN grid of random values in 1..Z.
    Assumes the RNG has already been seeded.'''

    assert N > 0, 'Grid size must be positive'
    assert N%2 == 1, 'Grid size must be odd'
    assert Z > 0, 'Random range must be positive'
    grid = []
    for x in range(N):
        grid.append([])
        for y in range(N):
            grid[-1].append(random.randint(1, Z))
    return grid

It checks that the parameters it's been passed make sense, then it builds a list of lists of random values. It assumes that the random number generator has already been seeded, i.e., it is not going to seed the random number generator itself. Figure 9 shows where we put this function in our program file.

Program Structure (B)
Figure 9: Program Structure (B)

Next is mark_filled, which, as its name suggests, marks a grid cell as being filled:

def mark_filled(grid, x, y):
    '''Mark a grid cell as filled.'''

    assert 0 <= x < len(grid), \
           'X coordinate out of range (%d vs %d)' % \
           (x, len(grid))
    assert 0 <= y < len(grid), \
           'Y coordinate out of range (%d vs %d)' % \
           (y, len(grid))

    grid[x][y] = -1

We use assertions to test that the X and Y coordinates we've been given are actually in bounds. You might think we don't need this code, because if the X or Y coordinate is out of bounds, Python will fail and print its own error message, but there are three reasons to put these assertions in:

  1. The assertions tell readers what we expect of X and Y.
  2. These error messages are more meaningful than Python's generic "IndexError: list index out of range" message.
  3. Negative values of X and Y won't actually cause exceptions: they'll index backward from the ends of the lists.

The last line in this function assigns -1 to grid[x][y]. We're using -1 to indicate filled cells, but we don't know if people are going to remember that when they're reading our code: if you say "grid at X, Y assigned -1", it's not immediately clear what you're doing. So let's make a small change right now: near the top of our program we'll create a variable called FILLED, and give it the value -1, so that in our function we can say "grid at X, Y is assigned FILLED":

FILLED = -1

...other functions...

def mark_filled(grid, x, y):
    ...body of function...
    grid[x][y] = FILLED

FILLED is written in capital letters because we think of it as a constant, and as mentioned in the discussion of Python, constants are normally written in all caps. Putting constants at the top of the file is also a (strong) convention.

The next function in our to-do list is fill_grid. The docstring says that it fills an N×N grid until the filled region hits the boundary, and that it assumes that the center cell has been filled before it is called:

def fill_grid(grid):
    '''Fill an NxN grid until the filled region hits the boundary.
    Assumes the center cell has already been filled.'''

    N = len(grid)
    num_filled = 0
    while True:
        candidates = find_candidates(grid)
        assert candidates, 'No fillable cells found!'
        x, y = random.choice(list(candidates))
        mark_filled(grid, x, y)
        num_filled += 1
        if x in (0, N-1) or y in (0, N-1):
            break

    return num_filled

We begin by setting up N and num_filled, which are the grid size and the number of cells that this function has filled so far. We then go into a seemingly-infinite loop, at the bottom of which we test to see if we're done, and if so, break out. We could equally well do something like this:

    filling = True
    while filling:
        ...
        if x in (0, N-1) or y in (0, N-1):
            filling = False

However we control filling, we use another function called find_candidates to find the set of cells that we might fill. This function hasn't been written yet, so we add it to our to-do list. We then check that the set of candidates it has found has something in it, because if we haven't found any candidates for filling, something has probably gone wrong with our program. And then, as discussed earlier, we use random.choice to pick the cell we're going to fill, mark it, and increment our count of filled cells. Figure 10 shows where this function fits in the file.

Program Structure (C)
Figure 10: Program Structure (C)

find_candidates is next on our to-do list:

def find_candidates(grid):
    '''Find low-valued neighbor cells.'''

    N = len(grid)
    min_val = sys.maxint
    min_set = set()
    for x in range(N):
        for y in range(N):
            if (x > 0) and (grid[x-1][y] == FILLED) \
            or (x < N-1) and (grid[x+1][y] == FILLED) \
            or (y > 0) and (grid[x][y+1] == FILLED) \
            or (y < N-1) and (grid[x][y+1] == FILLED):
                ...let's stop right there...

We're going to stop right there because this code is already hard to read and we haven't even finished it. In fact, it contains a bug—one of those y+1's should be a y-1—but you probably didn't notice that because there was too much code to read at once.

A good rule of thumb is, "Listen to your code as you write it." If the code is difficult to understand when read aloud, then it's probably going to be difficult to understand when you're debugging, so you should try to simplify it. This version of find_candidates introduces a helper function called is_candidate:

def find_candidates(grid):
    '''Find low-valued neighbor cells.'''
    N = len(grid)
    min_val = sys.maxint
    min_set = set()
    for x in range(N):
        for y in range(N):
            if is_candidate(grid, x, y):
                ...now we're talking...

Let's finish the function by adding the code we figured out earlier:

            if is_candidate(grid, x, y):
                # Has current lowest value.
                if grid[x][y] == min_val:
                    min_set.add((x, y))
                # New lowest value.
                elif grid[x][y] < min_val:
                    min_val = grid[x][y]
                    min_set = set([(x, y)])
    return min_set

Program Structure (D)
Figure 11: Program Structure (D)

As Figure 11 shows, the find_candidates function fits right above fill_grid in our file. We can then write the is_candidate function, using the test we worked out in the previous section, and insert it right above find_candidates:

def is_candidate(grid, x, y):
    '''Is a cell a candidate for filling?'''

    N = len(grid)
    return (x > 0) and (grid[x-1][y] == FILLED) \
        or (x < N-1) and (grid[x+1][y] == FILLED) \
        or (y > 0) and (grid[x][y-1] == FILLED) \
        or (y < N-1) and (grid[x][y+1] == FILLED)

There are no functions left on our to-do list, so it's time to run our program—except it's not. It's actually time to test our program, because there's a bug lurking in the code that we just put together. Take a moment, read over the final code, and try to find it before moving on to the next section.

Key Points

  • Put programs together piece by piece, starting at the top and working down.
  • Write one complete function at a time as if you already had the functions it needs.

Challenges

FIXME

Bugs

Objectives

  • Interrupt a running program.
  • Explain why programmers should test simple cases first.

Lesson

Let's run the program that we created in the previous section:

$ python invperc.py 3 10 17983
2 cells filled

The program tells us that 2 cells have been filled, which is what we'd expect: in a 3×3 grid, we always fill the center cell and one other, then we hit the boundary. Let's try a larger grid:

$ python invperc.py 5 10 27187
...a minute passes...
Ctrl-C

After a minute, we use Control-C to halt the program. It's time to fire up the debugger…

Debugging the Grid
Figure 12: Debugging the Grid

The initial grid looks OK: it is a 5-element list, each of whose entries is another 5-element list with values in the range 1 to 10. There's even a -1 in the right place (remember, we're using -1 to mark filled cells).

The next cell gets chosen and filled correctly, and then the program goes into an infinite loop in find_candidates. Inside this loop, min_set contains the points (2,2) and (1,2), and min_val is -1. That's our problem: our marker value, -1, is less than any of the actual values in the grid. Once we have marked two cells, each of those marked cells is adjacent to another marked cell, so we are repeatedly re-marking cells that have already been marked without ever growing the marked region.

This is easy to fix. The code that caused the problem is:

def find_candidates(grid):
    N = len(grid)
    min_val = sys.maxint
    min_set = set()
    for x in range(N):
        for y in range(N):
            if is_candidate(grid, x, y):
                ...handle == min_val and < min_val cases...

All we have to do is insert a check to say, "If this cell is already filled, continue to the next iteration of the loop." The code is:

def find_candidates(grid):
    N = len(grid)
    min_val = sys.maxint
    min_set = set()
    for x in range(N):
        for y in range(N):
            if grid[x][y] == FILLED:
                pass
            elif is_candidate(grid, x, y):
                ...handle == min_val and < min_val cases...

With this change, our program now runs for several different values of N. But that doesn't prove that it's correct; in order to convince ourselves of that, we're going to have to do a bit more work.

Key Points

  • Test.
  • Test the simplest case first.

Challenges

FIXME

Refactoring

Objectives

  • Identify opportunities to make code more comprehensible by refactoring.
  • Explain why randomness of any kind makes programs hard to test.
  • Replace random behavior in programs with something predictable for testing purposes.

Lesson

We have found and fixed one bug in our program, but how many others haven't we found? More generally, how do we validate and verify a program like this? Those two terms sound similar, but mean different things. Verification means, "Is our program free of bugs?" or, "Did we build the program right?" Validation means, "Are we implementing the right model?" i.e., "Did we build the right thing?" The second question depends on the science we're doing, so we'll concentrate on the first.

To begin addressing it, we need to make our program more testable. And since testing anything that involves randomness is difficult, we need to come up with examples that aren't random. One is a grid that has the value 2 everywhere, except in three cells that we have filled with 1's (Figure 13). If our program is working correctly, it should fill exactly those three cells and nothing else. If it doesn't, we should be able to figure out pretty quickly why not.

Test Case
Figure 13: Test Case

How do we get there from here? The overall structure of our program is shown in Figure 14:

Program Before Refactoring
Figure 14: Program Before Refactoring

The function we want to test is fill_grid, so let's reorganize our code to make it easier to create specific grids. Grids are created by the function create_random_grid, which takes the grid size and random value range as arguments:

def create_random_grid(N, Z):
    ...

def main(arguments):
    ...
    grid = create_random_grid(grid_size, value_range)
    ...

Let's split that into two pieces. The first will create an N×N grid containing the value 0, and the second will overwrite those values with random values in the range 1 to Z:

...
def create_grid(N):
    ...

def fill_grid_random(grid, Z):
    ...

def main(arguments):
    ...
    grid = create_grid(grid_size)
    fill_grid_random(grid, value_range)
    ...

We can now use some other function to fill the grid with non-random values when we want to test specific cases, without duplicating the work of creating the grid structure.
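
The bodies of these two functions might look like the following. This is only a sketch, modeled on the create_random_grid we already have, and it assumes the module-level import of random from earlier:

def create_grid(N):
    '''Return an NxN grid of zeros.'''
    assert N > 0, 'Grid size must be positive'
    assert N%2 == 1, 'Grid size must be odd'
    grid = []
    for x in range(N):
        grid.append([])
        for y in range(N):
            grid[-1].append(0)
    return grid

def fill_grid_random(grid, Z):
    '''Overwrite a grid's cells with random values in 1..Z.
    Assumes the RNG has already been seeded.'''
    assert Z > 0, 'Random range must be positive'
    N = len(grid)
    for x in range(N):
        for y in range(N):
            grid[x][y] = random.randint(1, Z)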

Another part of the program we will need to change is the main function that takes command-line arguments and converts them into a grid size, a range of random values, and a random number seed:

def main(arguments):
    '''Run the simulation.'''

    # Parse parameters.
    try:
        grid_size = int(arguments[0])
        value_range = int(arguments[1])
        random_seed = int(arguments[2])
    except IndexError:
        fail('Expected 3 arguments, got %d' % len(arguments))
    except ValueError:
        fail('Expected integer arguments, got %s' % str(arguments))

    # Run simulation.
    ...

Let's introduce a new argument in the first position called scenario:

def main(arguments):
    '''Run the simulation.'''

    # Parse parameters.
    try:
        scenario = arguments[0]
        grid_size = int(arguments[1])
        value_range = int(arguments[2])
        random_seed = int(arguments[3])
    except IndexError:
        fail('Expected 4 arguments, got %d' % len(arguments))
    except ValueError:
        fail('Expected integer arguments, got %s' % str(arguments))

    # Run simulation.
    ...

scenario doesn't need to be converted to an integer: it's just a string value specifying what we want to do. If the user gives us the word "random", we'll do exactly what we've been doing all along. For the moment, we will fail if the user gives us anything else, but later on we will use scenario to determine which of our test cases we want to run.

But wait a moment: we're not going to use random numbers when we fill the grid manually for testing. We're also not going to need the value range, or even the grid size, so let's move argument handling and the seeding of the random number generator into the branch that handles the random scenario. Once we make this change, we determine the scenario by looking at the first command-line argument, and then, if that value is the word "random", we look at the remaining arguments to determine the grid size, the value range, and the random seed. If the first argument isn't the word "random", then we fail:

    # Parse parameters.
    scenario = arguments[0]
    try:
        if scenario == 'random':

            # Parse arguments.
            grid_size = int(arguments[1])
            value_range = int(arguments[2])
            random_seed = int(arguments[3])

            # Run simulation.
            random.seed(random_seed)
            grid = create_random_grid(grid_size, value_range)
            mark_filled(grid, grid_size/2, grid_size/2)
            num_filled_cells = fill_grid(grid) + 1
            print '%d cells filled' % num_filled_cells

        else:
            fail('Unknown scenario "%s"' % scenario)
    except IndexError:
        fail('Expected 4 arguments, got %d' % len(arguments))
    except ValueError:
        fail('Expected integer arguments, got %s' % str(arguments))

The block of code inside the if is large enough that it's hard to see what the else and the two excepts line up with. Let's factor some more:

def do_random(arguments):
    # Parse arguments.
    grid_size = int(arguments[1])
    value_range = int(arguments[2])
    random_seed = int(arguments[3])

    # Run simulation.
    random.seed(random_seed)
    grid = create_random_grid(grid_size, value_range)
    mark_filled(grid, grid_size/2, grid_size/2)
    num_filled_cells = fill_grid(grid) + 1
    print '%d cells filled' % num_filled_cells

def main(arguments):
    '''Run the simulation.'''

    scenario = arguments[0]
    try:
        if scenario == 'random':
            do_random(arguments)
        else:
            fail('Unknown scenario "%s"' % scenario)
    except IndexError:
        fail('Expected 4 arguments, got %d' % len(arguments))
    except ValueError:
        fail('Expected integer arguments, got %s' % str(arguments))

# Main driver.
if __name__ == '__main__':
    main(sys.argv[1:])

That's easier to follow, but selecting everything but the first command-line argument in the if at the bottom, then selecting everything but the first of those values at the start of main, is a bit odd. Let's clean that up, and move the try/except into do_random at the same time (since the functions that handle other scenarios might have different error cases).

def do_random(arguments):
    '''Run a random simulation.'''

    # Parse arguments.
    try:
        grid_size = int(arguments[0])
        value_range = int(arguments[1])
        random_seed = int(arguments[2])
    except IndexError:
        fail('Expected 3 arguments, got %d' % len(arguments))
    except ValueError:
        fail('Expected integer arguments, got %s' % str(arguments))

    # Run simulation.
    random.seed(random_seed)
    grid = create_random_grid(grid_size, value_range)
    mark_filled(grid, grid_size/2, grid_size/2)
    num_filled_cells = fill_grid(grid) + 1
    print '%d cells filled' % num_filled_cells

def main(scenario, arguments):
    '''Run the simulation.'''

    if scenario == 'random':
        do_random(arguments)
    else:
        fail('Unknown scenario "%s"' % scenario)

# Main driver.
if __name__ == '__main__':
    assert len(sys.argv) > 1, 'Must have at least a scenario name'
    main(sys.argv[1], sys.argv[2:])

Result of Refactoring
Figure 15: Result of Refactoring

Figure 15 shows the structure of our program after refactoring. We have the documentation string, which we've updated to remind people that the first argument is the name of the scenario. Our fail function hasn't changed. We've split grid creation into two functions. Our fill_grid function now fills the middle cell and returns the count of all filled cells. And we have a function to parse command-line arguments. This argument-parsing function is actually specific to the random case. We should probably rename it, to make that clear.

Now let's step back. We were supposed to be testing our program, but in order to make it more testable, we had to refactor it first. Now that we've done this refactoring, we can write tests more easily. More importantly, now that we've thought this through, we are more likely to write the next program of this kind in a testable way right from the start.

Key Points

  • Putting each task in a function, and having one function for each task, makes programs easier to test.
  • Programs can often be refactored to make them more testable.
  • Replacing randomness with predictability makes testing easier.

Challenges

FIXME

Test

Objectives

  • Build scaffolding to aid testing.

Lesson

We can finally start testing our program systematically. To begin, let's add another clause to the main body of the program so that if the scenario is "5x5_line" we will create a 5×5 grid, fill a line of cells from the center to the edge with lower values, and then check that fill_grid does the right thing (Figure 16):

if __name__ == '__main__':
    scenario = sys.argv[1]
    arguments = sys.argv[2:]
    if scenario == 'random':
        do_random(arguments)
    elif scenario == '5x5_line':
        do_5x5_line()
    else:
        fail('Unknown scenario "%s"' % scenario)

Racing for the Border
Figure 16: Racing for the Border

The function do_5x5_line is pretty simple:

def do_5x5_line():
    '''Run a test on a 5x5 grid with a run to the border.'''

    grid = create_grid(5)
    init_grid_NxN_line(grid)
    num_filled_cells = fill_grid(grid)
    check_grid_NxN_line(grid, num_filled_cells)

create_grid and fill_grid already exist: in fact, the whole point of our earlier refactoring was to let us re-use fill_grid for testing purposes. The new test-specific functions are init_grid_NxN_line and check_grid_NxN_line. (We will need a similar pair of functions for each of our tests, so we will write these, then use our experience to guide some further refactoring.) Here's the first function:

def init_grid_NxN_line(grid):
    '''Fill NxN grid with straight line to edge for testing purposes.'''

    N = len(grid)
    for x in range(N):
        for y in range(N):
            grid[x][y] = 2

    for i in range(N/2 + 1):
        grid[N/2][i] = 1

It's just as easy to write this function for the N×N case as for the 5×5 case, so we generalize early. The first part of the function is easy to understand: find the value of N by looking at the grid, then fill all of the cells with the integer 2. The second part, which fills the cells from the center to the edge in a straight line with the lower value 1, isn't as easy to understand: it's not immediately obvious that i should go in the range from 0 to N/2+1, or that the X coordinate should be N/2 and the Y coordinate should be i for the cells that we want to fill.

When we say "it's not obvious," what we mean is, "There's the possibility that it will contain bugs." If there are bugs in our test cases, then we're just making more work for ourselves. We'll refactor this code later so that it's easier for us to see that it's doing the right thing.

Here's the code that checks that an N×N grid with a line of cells from the center to the edge has been filled correctly:

def check_grid_NxN_line(grid, num_filled):
    '''Check NxN grid straight line grid.'''

    N = len(grid)
    assert num_filled == N/2 + 1, 'Wrong number filled'

    for x in range(N):
        for y in range(N):
            if (x == N/2) and (y <= N/2):
                assert grid[x][y] == FILLED, 'Not filled!'
            else:
                assert grid[x][y] != FILLED, 'Wrongly filled!'

Again, it's as easy to check for the N×N case as the 5×5 case, so we've generalized the function. But take a look at that if: are we sure that the only cells that should be filled are the ones with X coordinate equal to N/2 and Y coordinate from 0 to N/2? Shouldn't that be N/2+1? Or 1 to N/2? Or maybe the X coordinate should be N/2+1?

In fact, these two functions are correct, and when they're run, they report that fill_grid behaves properly. But writing and checking two functions like this for each test won't actually increase our confidence in our program, because the tests themselves might contain bugs. We need a simpler way to create and check tests, so that our testing is actually helping us create a correct program rather than giving us more things to worry about. How do we do that?

Let's go back to the example in Figure 16. Why don't we just draw our test cases exactly as shown? Today's programming languages don't actually let us include sketches in programs, but we can get close using strings:

fixture = '''2 2 2 2 2
             2 2 2 2 2
             1 1 1 2 2
             2 2 2 2 2
             2 2 2 2 2'''

We can write the result as a similar string:

result = '''. . . . .
            . . . . .
            * * * . .
            . . . . .
            . . . . .'''

As you can probably guess, the '*' character means "this cell should be filled", while the '.' means "this cell should hold whatever value it had at the start". Here's how we want to use them:

def do_5x5_line():
    '''Run a test on a 5x5 grid with a run to the border.'''

    fixture  = '''2 2 2 2 2
                  2 2 2 2 2
                  1 1 1 2 2
                  2 2 2 2 2
                  2 2 2 2 2'''

    expected = '''. . . . .
                  . . . . .
                  * * * . .
                  . . . . .
                  . . . . .'''

    fixture = parse_grid(fixture)
    num_filled_cells = fill_grid(fixture)
    check_result(expected, fixture, num_filled_cells)

Parsing a grid of numbers is pretty easy:

def parse_grid(fixture):
    '''Turn a string representation of a grid into a grid of numbers.'''

    result = []
    for row in fixture.split('\n'):
        values = row.split()
        assert (result == []) or (len(values) == len(result[0])), \
               'Mis-match in row sizes'
        temp = []
        for v in values:
            temp.append(int(v))
        result.append(temp)
    return result
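
A quick check on a small string shows what we get back:

print parse_grid('1 2 3\n4 5 6')   # [[1, 2, 3], [4, 5, 6]]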

Checking cells is pretty easy too:

def check_result(expected, grid, num_filled):
    '''Check the results of filling.'''
    expected, count = convert_grid(expected)

    if len(expected) != len(grid):
        fail('Mis-match between size of expected result and size of grid')
    if count != num_filled:
        fail('Wrong number of cells filled')

    for i in range(len(expected)):
        g = grid[i]
        e = expected[i]
        if len(g) != len(e):
            fail('Rows are not the same length')
        for j in range(len(g)):
            if e[j] and (g[j] != FILLED):
                fail('Cell %d,%d should be filled but is not' % (i, j))
            elif (not e[j]) and (g[j] == FILLED):
                fail('Cell %d,%d should not be filled but is' % (i, j))

We still have one function to write, though—the one that parses a string of '*' and '.' characters and produces a grid of trues and falses. But this is almost exactly the same as what we do to parse a fixture. The only difference is how we convert individual items. Let's refactor:

def should_be_filled(x):
    '''Is this cell supposed to be filled?'''
    return x == '*'

def parse_general(fixture, converter):
    '''Turn a string representation of a grid into a grid of values.'''

    result = []
    for row in fixture.split('\n'):
        values = row.split()
        assert (result == []) or (len(values) == len(result[0])), \
               'Mis-match in row sizes'
        temp = []
        for v in values:
            temp.append(converter(v))
        result.append(temp)
    return result

def do_5x5_line():
    '''Run a test on a 5x5 grid with a run to the border.'''

    ...define fixture and expected strings...

    fixture = parse_general(fixture, int)
    num_filled_cells = fill_grid(fixture)
    expected = parse_general(expected, should_be_filled)
    check_result(expected, fixture, num_filled_cells)

Writing the functions to parse fixture strings might seem like a lot of work, but what are you comparing it to? The time it would take to inspect printouts of real grids, or the time to step through the program over and over again in the debugger? And did you think to include the time it would take to re-do this after every change to your program? Or are you comparing it to the time it would take to retract a published paper after you find a bug in your code?

In real applications, it's not unusual for test code to be anywhere from 20% to 200% of the size of the actual application code (and yes, 200% does mean more test code than application code). But that's no different from physical experiments: if you look at the size and cost of the machines used to create a space probe, it's many times greater than the size and cost of the space probe itself.

The good news is, we're now in a position to replace our fill_grid function with one that is harder to get right, but which will run many times faster. If our tests have been designed well, they shouldn't have to be rewritten because they'll all continue to work the same way. This is a common pattern in scientific programming: create a simple version first, check it, then replace the parts one by one with more sophisticated parts that are harder to check but give better performance.

Key Points

  • Write support code to make testing easier.

Challenges

FIXME

Performance

Objectives

  • Do a rough calculation to figure out whether it is cost-effective to speed a program up or not.
  • Measure a program's execution time.
  • Estimate how a program's execution time will grow with problem size.

Lesson

Machine-independent code has machine-independent performance.
— anonymous

Now that it's easy to write tests, we can start worrying about our program's performance. When people use that phrase, they almost always mean the program's speed. In fact, speed is why computers were invented: until networks and fancy graphics came along, the whole reason computers existed was to do in minutes or hours what would take human beings weeks or years.

Scientists usually want programs to go faster for three reasons. First, they want a solution to a single large problem, such as, "What's the lift of this wing?" Second, they have many problems to solve, and need answers to all of them—a typical example is, "Compare this DNA sequence to every one in the database and tell me what the closest matches are." Finally, scientists may have a deadline and a fixed set of resources and want to solve as big a problem as possible within those constraints. Weather prediction falls into this category: given more computing power, scientists use more accurate (and more computationally demanding) models, rather than solving the old models faster.

Before trying to make a program go faster, there are two questions we should always ask ourselves. First, does our program actually need to go faster? If we only use it once a day, and it only takes a minute to run, speeding it up by a factor of 10 is probably not worth a week of our time.

Second, is our program correct? There's no point making a buggy program faster: more wrong answers per unit time doesn't move science forward (although it may help us track down a bug). Just as importantly, almost everything we do to make programs faster also makes them more complicated, and therefore harder to debug. If our starting point is correct, we can use its output to check the output of our optimized version. If it isn't, we've probably made our life more difficult.

Let's go back to invasion percolation. To find out how fast our program is, let's add a few lines to the program's main body:

if __name__ == '__main__':

    ...get simulation parameters from command-line arguments...

    # Run simulation.
    start_time = time.time()
    random.seed(random_seed)
    grid = create_random_grid(grid_size, value_range)
    mark_filled(grid, grid_size/2, grid_size/2)
    num_filled = fill_grid(grid) + 1
    elapsed_time = time.time() - start_time
    print 'program=%s size=%d range=%d seed=%d filled=%d time=%f' % \
          (sys.argv[0], grid_size, value_range, random_seed, num_filled, elapsed_time)
    if graphics:
        show_grid(grid)

The first new line records the time when the program starts running. The other new lines use that to calculate how long the simulation took, and then display the program's parameters and running time.

We need to make one more change before we start running lots of simulations. We were seeding the random number generator using the computer's clock time:

    start_time = time.time()
    ...
    random_seed = int(start_time)

But what if a simulation runs very quickly? time.time() returns a floating point number; int truncates this by throwing away the fractional part, so if our simulations run in less than a second, two (or more) runs might wind up with the same seed, which in turn will mean they have the same "random" values in their grids. (This isn't a theoretical problem—we actually tripped over it while writing this lesson.)
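A quick way to see the problem for yourself:

import time

# Two calls made back to back almost always fall within the same second,
# so truncating to an integer gives the same value both times.
a = int(time.time())
b = int(time.time())
print a == b    # almost always prints True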

One way to fix this is to shift those numbers up. For now, let's guess that every simulation will take at least a tenth of a millisecond to run, so we'll multiply the start time by ten thousand, truncate the result to an integer, and then take the remainder modulo a million so the seed doesn't get too large:

RAND_SCALE = 10000    # Try to make sure random seeds are distinct.
RAND_RANGE = 1000000  # Range of random seeds.
...
    random_seed = int(start_time * RAND_SCALE) % RAND_RANGE

The final step is to write a shell script that runs the program multiple times for various grid sizes:

for size in {11..81..10}
do
  for counter in {1..20}
  do
    python invperc.py -g -n $size -v 100
  done
done

(We could equally well have added a few more lines to the program itself to run a specified number of simulations instead of just one.) If we average the 20 values for each grid size, we get the following:

grid size       11         21         31         41         51         61
cells filled    16.60      45.75      95.85      157.90     270.50     305.75
time taken (s)  0.003971   0.035381   0.155885   0.444160   1.157350   1.909516
time/cell (s)   0.000239   0.000773   0.001626   0.002813   0.004279   0.006245
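(How we compute those averages doesn't much matter; a spreadsheet works fine. A small post-processing script is easy to write, though. The sketch below is ours, not part of invperc.py: it assumes its standard input is the lines printed by the program's main body, and averages the time= field for each size= it sees.)

# average_times.py: average the reported running time for each grid size.
import sys
from collections import defaultdict

totals = defaultdict(float)
counts = defaultdict(int)
for line in sys.stdin:
    if not line.strip():
        continue
    # Each line looks like 'program=invperc.py size=11 ... time=0.003971'.
    fields = dict(pair.split('=') for pair in line.split())
    size = int(fields['size'])
    totals[size] += float(fields['time'])
    counts[size] += 1

for size in sorted(totals):
    print size, totals[size] / counts[size]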

Is that good enough? Let's fit a couple of fourth-order polynomials to our data:

             x^4            x^3            x^2            x^1            x^0
time taken   2.678×10^-7    -2.692×10^-5   1.760×10^-3    -3.983×10^-2   2.681×10^-1
time/cell    -1.112×10^-10  1.996×10^-8    4.796×10^-7    2.566×10^-5    -1.295×10^-4
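(If you're curious how to produce a fit like this one, NumPy will do the heavy lifting. The sketch below assumes NumPy is installed and uses the "time taken" row from the table of averages; the coefficients it prints may differ slightly from those above, depending on the fitting routine and the exact data.)

import numpy as np

sizes = np.array([11, 21, 31, 41, 51, 61])
times = np.array([0.003971, 0.035381, 0.155885, 0.444160, 1.157350, 1.909516])

# Fit a fourth-order polynomial; np.polyfit returns the highest power first.
coeffs = np.polyfit(sizes, times, 4)
print coeffs

# Use the fit to extrapolate to a much larger grid.
predicted = np.polyval(coeffs, 1001)
print 'predicted time for a 1001x1001 grid: %.1f hours' % (predicted / 3600.0)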

According to the first polynomial, a single run on a 1001×1001 grid will take almost 68 hours. What can we do to make it faster? The wrong answer is, "Guess why it's slow, start tweaking the code, and hope for the best." The right answer is to ask the computer where the time is going.

Before we do that, though, we really ought to justify our decision to model the program's performance using a fourth-order polynomial. Suppose our grid is N×N. Each time it wants to find the next cell to fill, our program examines each of the N^2 cells. In the best case, it has to fill about N/2 cells to reach the boundary (basically, by racing straight for the edge of the grid). In the worst case, it has to fill all of the interior cells before "breaking out" to the boundary, which means it has to fill (N-2)×(N-2) cells. That worst case therefore has a runtime of N^2(N-2)^2 steps, which, for large N, is approximately N^4. (For example, when N is 71, the difference between the two values is only about 5%.)
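Here's a quick check of that approximation (plain arithmetic, nothing specific to our program):

# How close is N^4 to the exact worst-case count N^2 * (N-2)^2 when N is 71?
N = 71
exact = N**2 * (N - 2)**2     # examine N^2 cells for each of the (N-2)^2 fills
approx = N**4
print 1.0 - float(exact) / approx    # prints about 0.056, i.e. the ~5% difference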

This kind of analysis is computing's equivalent of engineers' back-of-the-envelope calculations. In technical terms, we would say that our algorithm is O(N^4). In reality, because we're creating a fractal, we're actually going to fill about N^1.5 cells on average, so our running time is actually O(N^3.5). That's still too big for practical simulations, though, so it's time to figure out what we can do about it.

Key Points

  • Scientists want faster programs both to handle bigger problems and to handle more problems with available resources.
  • Before speeding a program up, ask, "Does it need to be faster?" and, "Is it correct?"
  • Recording start and end times is a simple way to measure performance.
  • A program's execution time varies depending on the size of its inputs in predictable ways.

Challenges

FIXME

Profiling

Objectives

  • Explain what an execution profiler is, and the difference between deterministic and statistical profilers.
  • Interpret a profiler's output correctly to identify hot spots in code.

Lesson

Timing an entire program is a good way to find out if we're making things better or not, but some way to know where the time is going would be even better. The tool that will do that for us is called a profiler because it creates a profile of a program's execution time, i.e., it reports how much time is spent in each function in the program, or even on each line.

There are two kinds of profilers. A deterministic profiler inserts instructions in a program to record the clock time at the start and end of every function. It doesn't actually modify the source code: instead, it adds those instructions behind the scenes after the code has been translated into something the computer can actually run. For example, suppose our program looks like this:

def swap(values):
    for i in range(len(values)/2):
        values[i], values[-1-i] = values[-1-i], values[i]

def upto(N):
    for i in xrange(1, N):
        temp = [0] * i
        swap(temp)

upto(100)

A deterministic profiler would insert timing calls that worked like this:

def swap(values):
    _num_calls['swap'] += 1
    _start_time = time.time()
    for i in range(len(values)/2):
        values[i], values[-1-i] = values[-1-i], values[i]
    _total_time['swap'] += (time.time() - _start_time)

def upto(N):
    _num_calls['upto'] += 1
    _start_time = time.time()
    for i in xrange(1, N):
        temp = [0] * i
        swap(temp)
    _total_time['upto'] += (time.time() - _start_time)

_num_calls['swap'] = 0
_total_time['swap'] = 0
_num_calls['upto'] = 0
_total_time['upto'] = 0
upto(100)

(Note that the profiler wouldn't actually change the source of our program; these extra operations are inserted after the program has been loaded into memory.)

Once the program has been run, the profiler can use the two dictionaries _num_calls and _total_time to report the average running time per function call. Going further, the profiler can also keep track of which functions are calling which, so that (for example) it can report times for calls to swap from upto separately from calls to swap from some other function downfrom.

The problem with deterministic profiling is that adding those timing calls changes the runtime of the functions being measured, since reading the computer's clock and recording the result both take time. The smaller the function's runtime, the larger the distortion. This can be avoided by using a statistical profiler. Instead of adding timing calls to the code, it freezes the program every millisecond or so and makes a note of what function is running. Like any sampling procedure, its results become more accurate as more data is collected, so statistical profilers work well on long-running programs, but can produce misleading results for short ones.
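To make the sampling idea concrete, here is a toy sketch of a statistical profiler. It is ours, not a real tool: it only works on Unix, needs Python 2.6 or later for signal.setitimer, and records just the name of the innermost function rather than the whole call stack the way real profilers do.

import signal, collections

samples = collections.Counter()

def record_sample(signum, frame):
    # Note which function was executing when the timer fired.
    samples[frame.f_code.co_name] += 1

def busy():
    total = 0
    for i in xrange(5000000):
        total += i * i
    return total

# Deliver SIGPROF roughly every millisecond of CPU time, then sample.
signal.signal(signal.SIGPROF, record_sample)
signal.setitimer(signal.ITIMER_PROF, 0.001, 0.001)
busy()
signal.setitimer(signal.ITIMER_PROF, 0, 0)    # turn sampling off

print samples.most_common()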

Python's cProfile module is a deterministic profiler. It records times and call counts and saves data in a file for later analysis. We can use it to see where time goes in our initial list-of-lists invasion percolation program:

import cProfile, pstats
from invperc import main

cProfile.run('main(["51", "100", "127391"])', 'list.prof')
p = pstats.Stats('list.prof')
p.strip_dirs().sort_stats('time').print_stats()

We start by importing cProfile, the actual profiling tool, on line 1. We also import pstats, a helper module for analyzing the data files cProfile produces.

The second line imports the main function from our program. We give that starting point to cProfile.run on line 3, along with the name of the file we want the profiling results stored in. Notice that the call is passed as a string: cProfile uses Python's exec statement to run the command in this string, just as if we had typed it into the interpreter. Notice also that the arguments are passed as strings, since that's what main is expecting.

Line 4 reads the profiling data back into our program and wraps it up in a pstats.Stats object. It may seem silly to write the data out only to read it back in, but the two activities are often completely separate: we can accumulate profiling data across many different runs of a program, then analyze it all at once.

Finally, line 5 strips directory names off the accumulated data, sorts them according to run time, and prints the result. We strip directory names because all of the code we're profiling is in a single file; in larger programs, we'll keep the directory information (even though it makes the output a bit harder to read) so that we can separate fred.calculate's running time from jane.calculate's.

Here's what the output looks like:

135 cells filled
Thu Jun 28 13:54:55 2012    list.prof

         697631 function calls in 0.526 CPU seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   339489    0.355    0.000    0.376    0.000 invperc.py:64(is_candidate)
      134    0.136    0.001    0.515    0.004 invperc.py:73(find_candidates)
   340029    0.021    0.000    0.021    0.000 {len}
     2601    0.005    0.000    0.005    0.000 random.py:160(randrange)
     7072    0.004    0.000    0.004    0.000 {range}
     2601    0.002    0.000    0.007    0.000 random.py:224(randint)
        1    0.001    0.001    0.008    0.008 invperc.py:39(fill_random_grid)
        1    0.001    0.001    0.001    0.001 invperc.py:27(create_grid)
        1    0.001    0.001    0.516    0.516 invperc.py:92(fill_grid)
     2735    0.000    0.000    0.000    0.000 {method 'random' of '_random.Random' objects}
     2652    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
      134    0.000    0.000    0.000    0.000 random.py:259(choice)
      135    0.000    0.000    0.000    0.000 invperc.py:52(mark_filled)
        1    0.000    0.000    0.526    0.526 invperc.py:108(do_random)
        1    0.000    0.000    0.526    0.526 invperc.py:186(main)
        1    0.000    0.000    0.000    0.000 {function seed at 0x0221C2B0}
       40    0.000    0.000    0.000    0.000 {method 'add' of 'set' objects}
        1    0.000    0.000    0.000    0.000 random.py:99(seed)
        1    0.000    0.000    0.526    0.526 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

The columns are the number of calls, the total time spent in the function itself (excluding the functions it calls), the time per call, the cumulative time (i.e., the total time for that function and everything it calls), the cumulative time per call, and finally the location and name of the function. As we can see, is_candidate accounts for two thirds of our runtime: if we want to make this program faster, that's what we should speed up.
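pstats can dig deeper as well. For example, once we know that is_candidate is the hot spot, we can ask which functions call it and how much of that time each caller accounts for:

import pstats

# Report the callers of is_candidate recorded in the profiling data.
p = pstats.Stats('list.prof')
p.strip_dirs().sort_stats('cumulative').print_callers('is_candidate')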

Wall Clock Time vs. CPU Time

When profiling programs, particularly on machines that are running other applications at the same time, it's important to remember the distinction between CPU time and wall-clock time. The first is how much time the computer's processor actually spent running the program; the second is how long the program actually took to run. The two are different because the CPU has a lot of things to do besides running our program, even on a machine that's supposedly otherwise idle. The operating system itself needs time, for example, as do disk and network I/O.
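A simple way to see the difference for yourself is to compare Python 2's time.time(), which reports wall-clock time, with time.clock(), which reports CPU time on Unix systems (on Windows it reports wall-clock time instead, so this sketch is Unix-only):

import time

wall_start = time.time()     # wall-clock time
cpu_start = time.clock()     # CPU time (on Unix)

total = 0
for i in xrange(5000000):    # this loop burns CPU
    total += i
time.sleep(2.0)              # this burns wall-clock time, but almost no CPU

print 'wall clock: %.2f seconds' % (time.time() - wall_start)    # includes the sleep
print 'CPU:        %.2f seconds' % (time.clock() - cpu_start)    # roughly just the loop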

Key Points

  • Use a profiler to determine which parts of a program are responsible for most of its running time.

Challenges

FIXME

A New Beginning

Objectives

  • Explain why using a more efficient algorithm is usually more effective than tuning an inefficient one.
  • Show examples of programs trading space (extra memory) for running time and vice versa.

Lesson

If checking whether cells are candidates is the slowest step, let's try to reduce the number of times we have to do that. Instead of re-examining every cell in the grid each time we want to fill one, let's keep track of which cells are currently on the boundary of the filled region in some kind of auxiliary data structure, then choose randomly from the cells in that set that share the current lowest value. When we fill in a cell, we add its unfilled neighbors to the "pool" of candidates (unless they're already there).

Here's the modified fill_grid function:

def fill_grid(grid):
    '''Fill an NxN grid until filled region hits boundary.'''

    x, y = grid.size/2, grid.size/2
    pool = set()
    pool.add((grid[x][y], x, y))
    num_filled = 0
    on_edge = False

    while not on_edge:
        x, y = get_next(pool)
        grid.mark_filled(x, y)
        num_filled += 1
        if (x == 0) or (x == grid.size-1) or (y == 0) or (y == grid.size-1):
            on_edge = True
        else:
            if x > 0:
                make_candidate(grid, pool, x-1, y)
            if x < grid.size-1:
                make_candidate(grid, pool, x+1, y)
            if y > 0:
                make_candidate(grid, pool, x,   y-1)
            if y < grid.size-1:
                make_candidate(grid, pool, x,   y+1)

    return num_filled

This function creates a set called pool that keeps track of the cells currently on the edge of the filled region. Each loop iteration gets the next cell out of this pool, fills it in, and (potentially) adds its neighbors to the set.

This function is 21 lines long, compared to 15 for our original fill_grid function, but four of the six extra lines are calls to make_candidate, which adds a neighbor to the pool if it isn't already there. Let's have a look at get_next and make_candidate:

def get_next(pool):
    '''Take a cell randomly from the equal-valued front section.'''

    temp = list(pool)
    temp.sort()
    v = temp[0][0]
    i = 1
    while (i < len(temp)) and (temp[i][0] == v):
        i += 1
    i = random.randint(0, i-1)
    v, x, y = temp[i]
    pool.discard((v, x, y))
    return x, y

def make_candidate(grid, pool, x, y):
    '''Ensure that (v, x, y) is a candidate.'''

    v = grid_get(grid, x, y)
    if v == FILLED:
        return
    pool.add((v, x, y))

This is definitely more complicated than what we started with: we now have to keep a second data structure (the pool) up to date, and in sync with the grid. But look at the payoff:

grid size           11         21         31         41         51         61
list of lists (s)   0.000333   0.025667   0.088833   0.227167   0.455000   1.362667
set pool (s)        0.000050   0.003000   0.009100   0.016130   0.012000   0.027050
ratio               6.66       8.56       9.76       14.1       37.9       50.4

Now that is a speedup. If we assume that we fill about N^1.5 cells in an N×N grid, the running time of our algorithm is about N^1.5 instead of N^3.5, because we only need to inspect four new cells in every iteration instead of checking all N^2 each time. As a result, the bigger our grids get, the bigger our savings are.

Key Points

  • The best way to improve a program's performance is to find a better algorithm.

Challenges

FIXME

The Bigger Picture

Objectives

  • Using a diagram, explain the feedback cycles in the agile development model.
  • Using a diagram, describe an idealized sturdy development lifecycle.
  • Explain the differences between a typical scientist's workflow and the two described above.

Lesson

The term "software engineering" was first used at a NATO conference in 1968, and was chosen to provoke discussion about how we ought to build large, complex software systems. People have argued ever since about what the term actually means, and whether "engineering" is the right model to aspire to. For example, should we be able to sue programmers for making flaky software the same way that we can sue engineers for designing faulty dams and bridges? On the one hand, quality would probably improve. On the other, mathematical techniques for analyzing programs are still quite limited despite fifty years of hard work, and it's hardly fair to require someone to meet standards that aren't yet achievable.

Either way, most people's definition of "software engineering" includes project management, which is what this final section will focus on. For our purposes, the difference between software carpentry and software engineering is the difference between putting up some drywall and building a bridge. A reasonably skilled amateur can do the first in a few hours or a couple of days with a little bit of planning, a couple of sketches, and a few hand tools. The second is more involved: it requires a team of people (some of them specialists), bigger tools, and a lot more planning and monitoring.

The good news is, studies show that most scientists and engineers work at the carpentry end of the scale. The bad news is, many people don't recognize when the techniques that have served them well for years are no longer adequate, which is one of the reasons many large software projects (in industry as well as in research) come in late or over budget, or wind up failing completely.

As a quick rule of thumb, if a dozen people are involved in your project, or if the work is going to take more than a dozen weeks, then you're doing engineering. And like it or not, you're going to have to think about how you plan, scope, and monitor your work. The only question is whether you're going to do this at the start of the project, or after everything has come crashing down around you.

Agile Development

The term "agile" was coined in 2001 to describe a "bottom-up" style of software project management based on short iterations and frequent feedback from both developers and customers. Agile development practices are actually as old as programming itself, but they came into their own with the rise of the World Wide Web. First, the web made it possible to "release" software weekly, daily, or even hourly since updating a server is a lot faster (and a lot less expensive) than shipping CDs to thousands of people.

Second, during the 1990s and early 2000s it seemed as if the things the web could do and how best to make it do those things were changing almost daily. Multi-year development plans didn't make a lot of sense when everything they depended on would be obsolete by the time work started, much less by the time it finished. Third, the growth of the web coincided with the growth of open source software. People couldn't help noticing that most open source projects didn't have long-range plans, but nevertheless produced high-quality software faster than many closed-source commercial projects. While open source isn't necessarily agile, and agile development projects usually aren't open source, there are a lot of similarities.

So what is agile software development? At its heart, it is any software development method that relies on continuous or nearly continuous feedback on short timescales (Figure 17). Agile developers work in short steps that are typically no more than two weeks long, and often as short as a single day. In each iteration, the team decides what to build next, designs it, builds it, tests it, and delivers it. Feedback loops at several scales help them get this right and do it better next time.

Agile Feedback Loops
Figure 17: Agile Feedback Loops

The iteration itself is the primary feedback loop. Users often don't know what they want until they see it, so short cycles are a way to avoid spending too much time building what turns out to be the wrong thing. This "exploratory" philosophy is why many people think that agile is a good way to develop research software. Most researchers are their own users, and often can't know what they should write next until they've seen the output of the current version of the program, so long-term planning often isn't possible.

Short iterations help improve efficiency in two other ways as well. First, most people can keep track of what they're doing for a few days at a time without elaborate time charts, so short cycles allow them to spend proportionally less time coordinating with one another. Second, finding bugs becomes easier: instead of looking through weeks' or months' worth of software to find out where the problem is, developers usually only have to look at what's been written in the last few days.

A typical working day starts with a stand-up meeting where everyone in the team reports what they did the day before, what they're planning to do that day, and what's blocking them (if anything). (It's called a "stand-up" meeting because it is held standing up, which encourages people to be brief.) For example, suppose Wolfman, Dracula, Frankenstein, and the Mummy are working on some new mind control software. Wolfman's report might be:

  • Yesterday: fixed the bug that was making the message file reader crash on accented characters, and added code to the web page generator to display accented characters properly.
  • Today: will get message file reader to recognize links to images and load those images.
  • Blockers: what should the message file reader do if the image is on the web instead of local? Should it try to read it, or is that a security hole?

Stand-up meetings are a feedback loop. Each day, the team gets feedback on the progress they're making, whether they're still on track to meet the iteration's goals, whether the technical decisions they're making are paying off, and so on. The key to making this work is that each task is either done or not done: "80% done" and "almost finished" aren't allowed. This rule encourages people to break large tasks into pieces that can be completed in a day or less so that they'll have something substantial to report every morning. This in turn encourages them to think ahead a little more, and helps make their estimation more accurate.

Once the stand-up meeting is over, everyone gets back to work. In many agile teams, this means sitting with a partner and doing pair programming. Two people work together at one machine: the "driver" does the typing, the "navigator" watches and comments, and every hour or so, the two switch roles.

Pair programming is beneficial for several reasons. First, the navigator will often notice mistakes in the driver's code, or remember design decisions that the driver is too busy typing to recall. This is the tightest of the feedback loops that make agile work, since feedback is nearly continuous.

Second, pair programming spreads knowledge around: every piece of code has been seen by at least two people, which increases the bus factor of the project. Pairing also helps people pick up new skills: if you have just seen someone do something with two clicks, you will probably do it that way when it's your turn to drive, rather than spending two minutes doing it the way you always have. And finally, most people are less likely to check Facebook every five minutes if someone else is working with them...

On the Other Hand...

Pairing improves overall productivity, but in many research settings everyone has a problem of their own to work on, so it may not be practical. People also often find it disruptive or awkward: having several conversations going on at once makes for a noisy lab, and desks often aren't built for two people to share comfortably.

For these reasons, many groups reserve pairing for:

  1. bringing a new person up to speed (which happens a lot faster if they're paired with a more experienced team member for a few days);
  2. tackling hard bugs (where "hard" means "anything one person can't figure out in an hour"); and
  3. once or twice a week, just to stay in practice.

As well as pair programming, most agile teams rely on test-driven development and continuous integration, both of which we discussed in the lesson on software quality. Together, the two practices give developers more feedback: TDD lets them see how their design decisions are going to play out before they're committed to a particular implementation, while continuous integration tells them (and everyone else) when they've forgotten something, or broken something that was working a moment ago.

Is agile development right for you? As a rough guide, it works best when:

  1. Requirements are constantly changing, i.e., long-range planning isn't possible anyway. This is often the case for scientific research, particularly at the small scale.
  2. Developers and users can communicate continuously, or at worst daily or weekly. Again, this is normal for small-scale research, where developers and users are the same people.
  3. The team is small, so that everyone can take part in a single stand-up meeting. This is usually also true, though getting everyone to show up for a morning meeting is a challenge in many labs.
  4. Team members are disciplined enough not to use "agile" as an excuse for cowboy coding.
  5. People actually like being empowered.

The last two points are the most important. Even professional developers don't like writing plans before they code, or documentation when they're done. Coincidentally, agile doesn't require them to do much of either. It's therefore all too common for developers to say "we're agile" when what they mean is "we're not going to bother doing anything we don't want to". Agile development actually requires more discipline, not less, just as improvising well requires even more musical talent than playing a score exactly.

As for the last point, many people don't like making decisions: after two decades of schooling, they want to be told what the assignment is and exactly what they have to do to get an 'A'. Many become quite defensive when told that figuring out what to do is now part of their job, but that's as essential to agile development as it is to scientific research.

Sturdy Development

Agile isn't the only way to develop software. Before the term was invented, the major alternative didn't have a name: it was just how big software engineering projects were run. Today, it's usually called "traditional" or "waterfall" development, though fans of agile often call it "big design up front" or something less kind.

We prefer the label "sturdy" because it puts a more positive spin on things. While agile is all about reacting quickly and taking advantage of opportunities as they arise, sturdy emphasizes predictability, particularly on large, long-lived projects.

Let's start by looking at what sturdy development isn't. A waterfall model of software development divides development into distinct stages, with information flowing from one stage to the next like water falling down a hill (Figure 18).

Waterfall Model
Figure 18: Waterfall Model

Even in 1970, when this model was first given a name, people knew it didn't work. Nobody has 20/20 foresight: requirements are always discovered or changed as software is designed, while designs are re-done based on what's learned during implementation, implementations are modified as testing finds problems, and so on.

But that doesn't mean that up-front planning and design are pointless. Thirty-five years ago, Barry Boehm and others discovered that the later a bug is found, the more expensive fixing it is. What's more, the cost curve is exponential: as we move from requirements to design to implementation to testing to deployment, the cost of fixing a problem increases by a factor of 3 to 5 at each stage, and those increases multiply (Figure 19).

Boehm Curve
Figure 19: Boehm Curve

The obvious implication is that time invested in up-front design can pay off many-fold if it prevents mistakes being made in the first place. It isn't always possible to do—people may not know what they want until they see something running, or tools may change so quickly that anything we design today will be obsolete by the time it's implemented—but very few programmers have ever said, "I wish I'd spent less time thinking about this before I started coding."

Figure 20 shows what a year-long sturdy development lifecycle looks like in practice. The first step in each cycle is to gather requirements, i.e., to figure out what the software is supposed to do. This is the product manager's job; while the developers are working on version 4, she talks to the customers about what they want version 5 to do.

Sturdy Lifecycle
Figure 20: Sturdy Lifecycle

Crucially, she should never ask them what features they want in the software—that's up to her to figure out. Instead, she should ask, "What does it do now that you don't like?" and, "What can't you do that you'd like to be able to?" She collates these needs and figures out how the software should be changed to satisfy them.

Good requirements are as unambiguous as a legal contract. "The system will reformat data files as they are submitted" isn't enough; instead, the requirements should read:

  1. Only users who have logged in by providing a valid user name and password can upload files.
  2. The system allows users to upload files via a secure web form.
  3. The system accepts files up to 16MB long.
  4. The system accepts files in PDB and RJCS-1 format.
  5. The system converts files to RJCS-2 format before storing them.
  6. The system displays an error message page if an uploaded file cannot be parsed, but takes no other action.

and so on.

The next step is analysis and estimation. Each team member is responsible for analyzing and estimating one or more features. She has to come up with a plausible rough design, and estimate how long it will take to implement. In fact, wherever possible she should come up with two such plans: one for doing the whole feature, and one that obeys the 80/20 rule by providing part of what's been asked for with much less effort.

Analysis and estimation presupposes some sort of overall design for the system as a whole. For example, it doesn't make sense to say, "We'll create a plugin to handle communication with the new orbiting mind control laser," unless the application has some sort of plugin system. If it doesn't, creating one is a task in its own right (probably a large one).

Once everything has been estimated, it's time to prioritize, because there's always more to do than there is time to do it. The easiest way to do this for medium-sized projects and teams is to draw a 3×3 grid on a whiteboard. One axis is "effort", broken down into "small", "medium", and "large". The other is "importance", broken down into "low", "medium", and "high". Each feature's name is put on a sticky note, and each sticky note goes into one of the nine boxes on the grid (Figure 21).

Schedule Grid
Figure 21: Schedule Grid

Once this has been done, it's easy to draw a diagonal line on the grid and throw away everything below it—after all, anything that's rated "high effort" but "low importance" isn't worth doing. Conversely, anything that's high importance and low effort definitely belongs in the plan.

But then there are the diagonal boxes. Should the team try to do one important, high-effort feature, or tackle a handful of things that are less important but easier? Whatever they choose, it's critical that they don't shave their time estimates to make things fit, and that the project manager doesn't either. The project manager is responsible for making sure things get built on time and to spec; if the product manager owns the feature list, the project manager owns the schedule (and yes, the similarity in those job titles is confusing). If the project manager starts shaving or squeezing estimates, developers will start padding them in self-defense. In response, the project manager will cut them back even more, until all the numbers are just so much science fiction. Lazy or timid developers can betray this trust by over-estimating, but even minimal time tracking will catch that sooner rather than later.

Once the features have been picked, it's the project manager's job to assemble them into a schedule showing who's going to do what when. The real purpose of this schedule is to help the team figure how far behind they are so that they can start cutting corners early on. A common way to do this is to keep a burn-down chart, which compares the plan with reality on a day-by-day or week-by-week basis. If and when a gap opens up, the team can either figure out when they're actually going to be done, and move the delivery date back, or go back to the 3×3 grid to figure out what they can drop or scale back to meet the current deadline.

This is why it's useful to have people estimate both the full feature and an 80/20 version. If and when time is running short, it may be possible to switch tracks and deliver most of the value with much less effort. The right time to think about this is the start of the project, when people are rested and relatively relaxed, not six weeks before a major conference deadline when tempers are already frayed.

In order for any of this to work, developers have to be fairly good at estimating how long things will take. That comes from experience, but also from careful record keeping. Just as runners and swimmers keep track of their times for doing their favorite distances, good developers keep track of how long it takes to build X or fix Y so that they'll be able to do a better job of estimating how long the next one will take. (In fact, whether or not a developer keeps track of their stats is a good way to tell how serious they are about their craft in an interview—something that would-be developers should keep in mind.)

Going back to Figure 20, you may have noticed that design overlaps estimation, and coding overlaps design. This is deliberate: it's usually impossible to figure out how to do something without writing some throwaway code. You may also have noticed that testing starts as soon as development does. This is critical to the project's success: if testing is left until development is mostly done, it will inevitably be shortchanged. The consequences are usually disastrous, not least for the team's morale.

In fact, the members of the team responsible for testing should be involved in the analysis and estimation phase as well. They should review every part of the plan to ensure that what the developer is planning to build is testable. Anything that isn't should be sent back to the drawing board.

Delivery starts at the same time as development. Daily or weekly, the team should "deliver" the software by creating and testing installers, deploying it on a handful of servers, burning ROMs and putting them into test rigs in the lab, or whatever it is they're going to do when the iteration is finished. Packaging and delivery are often just as complex as the software itself, and leaving it until the last moment will once again usually have disastrous consequences.

Finally, notice that development stops well before the target delivery date. Ideally, developers should stop adding new code to the project about two thirds of the way through the development cycle so that they can spend the remaining third of the time fixing the problems that testing turns up.

The two-thirds point isn't chosen at random. In all too many projects, that's when the high hopes that the team started with bump into the reality of over-optimistic scheduling, poor progress monitoring, and design decisions that weren't reviewed carefully (or at all). The common reaction is to ask the team to buckle down and put in extra hours. This almost always makes things worse: as study after study has shown, human beings are only capable of about 40 hours of productive intellectual work per week. Anything more, and the mistakes they make because of fatigue outweigh the extra time they're putting in, so that the project actually slows down.

Science Says...

Evan Robinson's excellent article "Why Crunch Mode Doesn't Work" summarizes the science behind this. Continuous work reduces cognitive function by 25% for every 24 hours, which means that after two all-nighters, a person's IQ is that of someone legally incompetent to care for themselves. The kicker is that, as with other forms of impairment, people don't realize they're affected: they believe they're still functioning properly.

So, is sturdy development right for you? As a rough guide, you should use it when:

  1. Requirements are relatively fixed, or the team has enough experience in the problem domain to make estimation and planning possible.
  2. The team is large. If some team members aren't ever going to meet others, frequent course changes probably aren't going to be possible.
  3. Your customer insists on knowing well in advance exactly what you're going to deliver and when. Avionics software and control systems for nuclear power plants are extreme examples, but there are many other situations in which stakeholders genuinely do need to know exactly what's going to be ready to use this time next year.

More Alike Than Different

In practice, agile and sturdy development are more alike than their most passionate advocates like to pretend. Sturdy teams often use continuous integration systems and test-driven development, while agile projects often have an "iteration zero" in which they analyze architectural options and do large-scale design.

One key practice the two methods share is the use of ticketing to keep track of work. Ticketing tools are often called bug-tracking systems, since they are often used to keep track of bugs that need to be fixed, but well-organized teams use them as a shared to-do list to manage all kinds of tasks. Every task is recorded as a separate ticket, each of which has a unique number, a one-line summary, and a longer description that may include screenshots, error messages, and so on. It also has some state, such as "open", "in progress", "closed", or "rejected", and keeps track of who created it and when, and who it's assigned to. A typical ticket might look like this:

ID: 1278
Created-By: mummy
Owned-By: wolfman
State: assigned
Summary: Message file reader crashes on accented characters
Description:
1. Create a text file called 'accent.msg' containing the message
   "You vill dream of pümpernickel" (with an umlaut over the 'u').

2. Run the program with 'python mindcontrol.py --all --message accent.msg'.

Program crashes with the message "No encoding for [] on line 1 of 'accent.msg'".
([] shows where a solid black box appears in the output instead of a printable
character.)

When Wolfman checks in the code that fixes this bug, and the tests for that fix, he changes the ticket's state from "in progress" to "closed". If someone later discovers that his fix doesn't actually work, they can change the state from "closed" to "open" (meaning "we need to decide who's going to work on this") or back to "in progress" (meaning "a particular person is now responsible for working on this"). More sophisticated ticketing systems allow people to record dependencies between tickets (such as "work can't start on this one until #917 is closed"), to estimate how long work will take, to record how long work actually took, and so on. They also limit who can change the states of tickets or assign them to particular people, which is one way to implement particular workflows.

Another key practice in sturdy development is code review. As we discuss later, empirical studies have found that this is the single most cost-effective way to find bugs. It also helps spread understanding in teams that don't use pair programming (which most don't).

Code can be reviewed before it's committed to version control or after. Most teams prefer pre-commit reviews for two reasons. First, they prevent mistakes getting into the repository in the first place, which is better than putting them there and then taking them out. Second, if the team agrees that nothing gets committed until it has been reviewed, it's much more likely that reviews will actually get done. If changes can be committed, then reviewed later, that "later" may slip and slip and never come at all.

How can the Mummy review Frankenstein's code before Frankenstein checks it in? Some teams solve this problem by creating one branch per developer, or per feature. The people working in that branch can check in any time, but review has to happen before code can be merged with the main line.

Another common approach is for developers to create a patch before committing their work to version control. A patch is just a list of the differences between two sets of files, such as two different versions of the source code for a program. Developers can store their patches in the version control system, attach them to tickets, or submit them to a code review management tool like ReviewBoard. Someone else can then comment on the patch, which the author can then revise. In large open source projects like Python and Firefox, it's common for patches to be reviewed and updated a dozen times or more before finally being committed to version control. Newcomers often find this frustrating, but experience shows that as projects become larger, "measure twice, cut once" pays off.
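To give a sense of what a patch actually looks like, here is a small example in the widely used "unified" format (the file name, line numbers, and fix are invented for illustration): lines starting with '-' are removed, lines starting with '+' are added, and the unmarked lines are unchanged context.

--- mindcontrol.py    (before)
+++ mindcontrol.py    (after)
@@ -42,3 +42,3 @@
 def read_message(filename):
-    text = open(filename).read()
+    text = open(filename).read().decode('utf-8')
     return text.strip()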

The Evidence

We've made a lot of claims in this section about how development should be done. The time has come to look at the evidence that backs them up.

Before we begin, though, it's important to realize most programmers know as little about research in their field as doctors knew about science a hundred years ago. Someone doing a degree in geology or biochemistry probably does at least one laboratory experiment a week, but if that same person does a computer science degree, they will probably only collect and analyze data once or twice during four years. Most programmers therefore don't have a hands-on appreciation for what evidence looks like and how to assess it; as a result, many think that a couple of pints, a loud voice, and a "just so" story about a startup in Silicon Valley constitute proof of whatever they want to prove.

The good news is, things are finally changing. While data on real programmers and real software were a rarity well into the 1990s, papers published today that describe new tools or new working practices routinely report results from some kind of empirical study. Many of these studies are still smaller in scale than we'd like, or methodologically naive, but standards are improving rapidly.

Here's one classic result. In the mid-1970s, Michael Fagan showed that reading code carefully is the most cost-effective way to find bugs—it can find 60-90% of all the bugs in a piece of software before it's run for the first time. Thirty years later, Cohen and others refined this result by looking at data collected by a web-based code review tool at Cisco. They found that almost all of the value of code reviews came from the first reviewer and the first hour they spent reviewing code. Basically, extra reviewers beyond the first don't find enough bugs to make their reviews cost-effective, and if someone spends more than an hour reading code, they become fatigued and stop finding anything except trivial formatting errors.

In light of this, it's not surprising that code review has become a common practice in most open source projects: given the freedom to work any way they want, most top-notch developers have discovered for themselves that having someone else look over a piece of code before it's committed to version control produces working code in less time.

Getting scientists to adopt code reviews has proven surprisingly difficult, especially considering that peer review of publications is a normal part of scientific life. The main reason seems to be that most pieces of small-scale scientific software are written by one person for their own use, and it's hard to get people to review code that they're not going to use themselves.

Here's another interesting result. After Windows Vista shipped, several researchers threw some machine learning algorithms at the data Microsoft had collected during its construction. They had a record of every change made to the software, every meeting ever scheduled, where developers sat, what tests they had run, what they'd said via instant messaging, and so on.

One of the things they looked at was whether developers who sat together produced better code (i.e., fewer bugs) than developers who were geographically distributed. Much to their surprise, they found that physical geography was a weak predictor of software quality. What correlated much more strongly was how far apart developers were in the company's organization chart. Basically, the higher you had to go in the org chart to find a common boss who could set direction and resolve disputes, the more bugs there would be in the software.

Here are a few other interesting results:

Is there anything we can measure in software that's better at predicting either the number of bugs or the effort required to build it than simply counting the number of lines of code?
No: every sophisticated measure we have correlates closely with the simplest.
Do more frequent releases improve software quality?
Yes, but they also change the nature of the bugs that do occur.
Are some programmers really 40 times better than others?
Probably not: studies that seem to show this were often comparing the best to the worst, which exaggerates the effect. What does seem true is the claim that any given programmer produces roughly the same number of lines of code per day, regardless of what language she is working in.
Does knowledge and use of design patterns improve the quality of code?
Yes, and there is a causative relationship: it isn't simply the case that good developers are more likely to know about patterns.
Are women intrinsically less likely to be interested in programming, or are social forces largely responsible for the seven-to-one gender disparity among programmers?
Social forces (some subtle, some not) are overwhelmingly responsible.
Are code modules with clear "owners" less likely to contain errors than code modules authored by many people?
Yes, but keep the bus factor in mind: a module with only one or a few owners also has a higher risk of being orphaned in the long run.
Are some programming languages harder to learn than others?
Almost certainly: in fact, it appears as though Perl is as hard to learn as a language whose syntax has been "designed" by a random number generator.

Even when people do know about research, they are often reluctant to change what they do to reflect what we know (just as they are with smoking). One example is test-driven development. Many programmers believe quite strongly that this is the "right" way to program, and that it leads to better code in less time. However, a meta-analysis of over thirty studies found no consistent effect: some of the studies reported benefits, some found that it made things worse, and some were inconclusive. One clear finding, though, was that the better the study, the weaker the signal. This result may be disappointing to some people (it certainly was to me), but progress sometimes is. And even if these studies are wrong, figuring out why, and doing better studies, will advance our understanding.

Books like Glass's Facts and Fallacies of Software Engineering and the collection Making Software are good summaries of what we actually know and why we believe it's true. If you're going to spend any significant time programming (or arguing with programmers), it's easier than ever to find out what the evidence actually is. And if we want people to pay attention to our science, we have an obligation to pay attention to theirs.

Key Points

  • Agile development relies on feedback loops at several timescales to keep projects on track.
  • Sturdy development places more emphasis on up-front design.
  • In practice, the two are more alike than different, but what researchers actually do is different from both.
  • Arguments about programming should be based on empirical studies of software development.

Challenges

FIXME

Summary

The key idea of this chapter is that good software comes from iteration with feedback. The way to create a complex program that is fast and correct is to make something simple and slow, get tests in place to make sure that it keeps producing the right answer as changes are made, and then to add features and make performance improvements one at a time. Choosing the right algorithms and data structures can yield enormous speedups, so we should always look there first for performance gains. This is where a broad knowledge of computer science comes in handy: any good book on data structures and algorithms describes dozens or hundreds of things that are exactly what's needed to solve some obscure but vital performance problem.

And improving the quality of our code improves both our performance and the machine's: it is the opposite of an either/or tradeoff. Well-structured programs are easier to optimize than poorly-structured ones, so if we build our program as a collection of more-or-less independent functions, we can improve those functions more or less independently.

The same ideas apply at a larger scale. If we want the next project to take less time (and hurt less) than this one, we need to improve our development process rather than putting more hours into doing things the wrong way. In the long run, what distinguishes successful teams and projects from unsuccessful ones isn't any specific set of practices, but humility.