Instructor's Guide for Web Programming

Opening

Carla Climate is studying climate change in the Northern and Southern hemispheres. As part of her work, she wants to see whether the gap between annual temperatures in Canada and Australia increased during the Twentieth Century. The raw data she needs is available online; her goal is to get it, do her calculations, and then post her results so that other scientists can use them.

This chapter is about how she can do that. More specifically, it's about how to fetch data from the web, and how to create web pages that are useful to both human beings and computers. What we will not cover is how to build interactive web applications; making those secure is more work than we can cover in the time we have. However, everything in this chapter is a prerequisite for interactive apps, and there are other good tutorials available if you decide that's what you really need. Carla's goal is to share with everyone, and that's the easiest kind of site to create.

Instructors

FIXME

How We Got Here

Objectives

  • Distinguish between human-readable and machine-readable data.
  • Explain the relationship between HTML and XML.

Lesson

To start, let's have another look at the hearing tests from our chapter on Python programming. Most people would probably store these results in a plain text file with one row for each test:

Date         Experimenter        Subject          Test       Score
----------   ------------        -------          -----      -----
2011-05-02   A. Binet            H. Ebbinghaus    DL-11      88%
2011-05-07   A. Binet            H. Ebbinghaus    DL-12      71%
2011-05-02   A. Binet            W. Wundt         DL-11      29%
2011-05-02   C. S. Peirce        W. Wundt         DL-11      45%

This is pretty much what a conscientious researcher would write in a lab notebook, and is easy for a human being to read. It's a lot harder for a computer to understand, though. Any program that wanted to load this data would have to know that the first line of the file contains column titles, that the second can be ignored, that the first field of each row thereafter should be translated from text into a date, that the fields after that start in particular columns (since the number of spaces between them is variable, and the number of spaces inside names can also vary—compare "A. Binet" with "C. S. Peirce"), and so on. Such a program would not be hard to write, but having to write, debug, and maintain a separate program for each data set would be tedious.
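To make the tedium concrete, here is a sketch of what such a loader might look like. The column boundaries are inferred by eye from the sample above, so treat them as an assumption rather than a specification:

```python
def parse_record(line):
    # Field boundaries are guessed from the sample file shown above;
    # a real loader would have to verify them against the actual data.
    return {
        'date':         line[0:10],
        'experimenter': line[13:33].strip(),
        'subject':      line[33:50].strip(),
        'test':         line[50:61].strip(),
        'score':        float(line[61:].strip().rstrip('%')) / 100,
    }

record = parse_record('2011-05-02   A. Binet            H. Ebbinghaus    DL-11      88%')
```

Every new layout means new boundaries, new stripping rules, and new debugging—which is exactly the cost this chapter's structured formats are designed to avoid.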

Now consider something like this quotation from Richard Feynman's 1965 Nobel Prize acceptance speech:

As a by-product of this same view, I received a telephone call one day at the graduate college at Princeton from Professor Wheeler, in which he said, "Feynman, I know why all electrons have the same charge and the same mass." "Why?" "Because, they are all the same electron!"

A lot of information is implicit in these four sentences, like the fact that "Wheeler" and "Feynman" are particular people, that "Princeton" is a place, that the speakers are alternating (with Wheeler speaking first), and so on. None of that is "visible" to a computer program, so if we had a database containing millions of documents and wanted to see which ones mentioned both John Wheeler (the physicist, not the geologist) and Princeton (the university, not the glacier), we might have to wade through a lot of false matches. What we need is some way to explicitly tell a computer all the things that human beings are able to infer.

An early effort to tackle this problem dates back to 1969, when Charles Goldfarb and others at IBM created the Standard Generalized Markup Language, or SGML. It was designed as a way of adding extra data to medical and legal documents so that programs could search them more accurately. SGML was very complex (the specification is over 500 pages long), and unless you were a specialist, you probably didn't even know it existed: all you saw were the programs that used it.

But in 1989 Tim Berners-Lee borrowed the syntax of SGML to create the HyperText Markup Language, or HTML, for his new "World Wide Web". HTML looked superficially the same as SGML, but it was much (much) simpler: almost anyone could write it, so almost everyone did.

However, HTML only had a small vocabulary, which users could not change or extend. They could say, "This is a paragraph," or, "This is a table," but not, "This is a chemical formula," or, "This is a person's name." Instead of adding thousands of new terms for different application domains, a new standard for defining terms was created in 1998. This standard was called the Extensible Markup Language (XML); it was much more complex than HTML, but hundreds of specialized vocabularies have now been defined in terms of it, such as the Chemical Markup Language for describing chemical compounds and related concepts.

More recently, a new version of HTML called HTML5 has been created. Web programmers are very excited about it, primarily because its new features allow them to create sophisticated user interfaces that run on smart phones and tablets as well as conventional computers. In what follows, though, we'll focus on some basics that haven't changed (much) in 20 years.

Key Points

  • Structured data is much easier for machines to process than unstructured data.
  • Markup languages like HTML and XML can be used to add semantic information to text.

Challenges

FIXME

Formatting Rules

Objectives

  • Explain the difference between text, elements, and tags.
  • Explain the difference between a model and a view, and correctly identify instances of each.
  • Write correctly-formatted HTML (using escape sequences for special characters).
  • Identify and fix improperly-nested HTML.

Lesson

A basic HTML document contains text and elements. (The full specification allows for many other things with names like "external entity references" and "processing instructions", but we'll ignore them.) The text in a document is just characters, and as far as HTML is concerned, it has no intrinsic meaning: "Feynman" is just seven characters, not a person.

Elements are metadata that describe the meaning of the document's content. For example, one element might signal a heading, while another might indicate that something is a cross-reference.

Elements are written using tags, which must be enclosed in angle brackets <…>. For example, <cite> is used to mark the start of a citation, and </cite> is used to mark its end. Elements must be properly nested: if an element called inner begins inside an element called outer, inner must end before outer ends. This means that <outer>…<inner>…</inner></outer> is legal HTML, but <outer>…<inner>…</outer></inner> is not.
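The nesting rule is mechanical enough that a program can check it. Here is a minimal sketch using Python's standard html.parser module; it ignores the empty elements we will meet later, so it is an illustration of the rule, not a real validator:

```python
from html.parser import HTMLParser

class NestingChecker(HTMLParser):
    """Keep a stack of open tags; any mismatched close means bad nesting."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.ok = True

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # A closing tag must match the most recently opened element.
        if not self.stack or self.stack.pop() != tag:
            self.ok = False

def is_properly_nested(fragment):
    checker = NestingChecker()
    checker.feed(fragment)
    checker.close()
    return checker.ok and not checker.stack

print(is_properly_nested('<outer>a<inner>b</inner></outer>'))  # True
print(is_properly_nested('<outer>a<inner>b</outer></inner>'))  # False
```

The stack is the key idea: properly nested tags open and close in last-in, first-out order, which is exactly what makes a document a tree.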

Here are some commonly-used HTML tags:

Tag Usage
html Root element of entire HTML document.
body Body of page (i.e., visible content).
h1 Top-level heading. Use h2, h3, etc. for second- and third-level headings.
p Paragraph.
em Emphasized text.

Finally, every well-formed document starts with a DOCTYPE declaration, which looks like:

<!DOCTYPE html>

This tells programs what kind of elements are allowed to appear in the document: 'html' (by far the most common case), 'math' for MathML, and so on. Here is a simple HTML document that uses everything we've seen so far:

<!DOCTYPE html><html><body><h1>Dimorphism</h1><p>Occurring or existing in two different <em>forms</em>.</p></body></html>

A web browser like Firefox might present this document as shown in Figure XXX. Other devices will display it differently. A phone, for example, might use a different background color for the heading, while a screen reader for people with visual disabilities would read the text aloud.

A Very Simple Web Page
Figure XXX: A Very Simple Web Page

These different presentations are possible because HTML separates content from presentation, or in computer science jargon, separates models from views. The model is the data itself; the view is how that data is displayed, such as a particular pattern of pixels on our screen or a particular sequence of sounds in our headphones. A given model may be viewed in many different ways, just as the files on your hard drive can be displayed as a list, as snapshots, or as a hierarchical tree (Figure XXX).

Different Views of a File System
Figure XXX: Different Views of a File System

People can construct models from views almost effortlessly—if you are able to read, it's almost impossible not to see the letters "HTML" in the following block of text:

*   *  *****  *   *  *
*   *    *    ** **  *
*****    *    * * *  *
*   *    *    *   *  *
*   *    *    *   *  ****

Computers, on the other hand, are very bad at reconstructing models from views. In fact, many of the things we do without apparent effort, like understanding sentences, are still open research problems in computer science. That's why markup languages were invented: they let us state explicitly, for the computer's benefit, the "what" that we human beings infer so easily.

There are a couple of other formatting rules we need to know in order to create and understand documents. If we are writing HTML by hand instead of using a WYSIWYG editor like LibreOffice or Microsoft Word, we might lay it out like this to make it easier to read:

<!DOCTYPE html>
<html>
  <body>
    <h1>Dimorphism</h1>
    <p>Occurring or existing in two different <em>forms</em>.</p>
  </body>
</html>

Doing this doesn't change how most browsers render the document, since they usually ignore "extra" whitespace. As we'll see when we start writing programs of our own, though, that whitespace doesn't magically disappear when a program reads the document.

Second, we must use escape sequences to represent the special characters < and > for the same reason that we have to use \" inside a double-quoted string in a program. In HTML and XML, an escape sequence is an ampersand '&' followed by the abbreviated name of the character (such as 'amp' for "ampersand") and a semi-colon. The four most common escape sequences are:

Sequence Character
&lt; <
&gt; >
&quot; "
&amp; &
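Applying these replacements in a program is straightforward, with one subtlety: the ampersand must be escaped first, or we would corrupt the escape sequences we had just produced. A sketch (the function name is our own):

```python
def escape_html(text):
    # '&' must be handled first; otherwise the '&' inside '&lt;' (etc.)
    # would itself be escaped a second time.
    text = text.replace('&', '&amp;')
    text = text.replace('<', '&lt;')
    text = text.replace('>', '&gt;')
    text = text.replace('"', '&quot;')
    return text

print(escape_html('x < 3 & y > "zero"'))  # x &lt; 3 &amp; y &gt; &quot;zero&quot;
```

Real programs should use a library routine (Python provides one in xml.sax.saxutils, for example) rather than rolling their own, but the ordering rule is worth knowing either way.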

One final formatting rule is that every document must have a single root element, i.e., a single element must enclose everything else. When combined with the rule that elements must be properly nested, this means that every document can be thought of as a tree. For example, we could draw the logical structure of our little document as shown in Figure XXX.

Tree View of a Very Simple Web Page
Figure XXX: Tree View of a Very Simple Web Page

A document like this, on the other hand, is not strictly legal:

<h1>Dimorphism</h1>
<p>Occurring or existing in two different <em>forms</em>.</p>

because it has two top-level elements (the h1 and the p). Most browsers will render it correctly, since they're designed to accommodate improperly-formatted HTML, but most programs won't, because they're not.

Beautiful Soup

There are a lot of incorrectly-formatted HTML pages out there. To deal with them, people have written libraries like Beautiful Soup, which does its best to turn real-world HTML into something that a run-of-the-mill program can handle. It almost always gets things right, but sticking to the standard makes life a lot easier for everyone.

Key Points

  • HTML documents contain elements and text.
  • Elements are represented using tags.
  • Different devices may display HTML differently.
  • Every document must have a single root element.
  • Tags must be properly nested to form a tree.
  • Special characters must be written using escape sequences beginning with &.

Challenges

FIXME

Attributes

Objectives

  • Explain what element attributes are, and what they are for.
  • Write HTML that uses attributes to alter a document's appearance.
  • Explain when to use attributes rather than nested elements.

Lesson

Elements can be customized by giving them attributes. These are name/value pairs enclosed in the opening tag like this:

<h1 align="center">A Centered Heading</h1>

or:

<p class="disclaimer">This planet provided as-is.</p>

Any particular attribute name may appear at most once in any element, just as a key may appear at most once in a dictionary, so <p align="left" align="right">…</p> is illegal. Attributes' values must be in quotes in XML and older dialects of HTML; HTML5 allows single-word values to be unquoted, but quoting is still recommended.

Another similarity between attributes and dictionaries is that attributes are unordered. They have to be written in some order, just as the keys and values in a dictionary have to be displayed in some order when they are printed, but as far as the rules of HTML are concerned, the elements:

<p align="center" class="disclaimer">This web page is made from 100% recycled pixels.</p>

and:

<p class="disclaimer" align="center">This web page is made from 100% recycled pixels.</p>

mean the same thing.
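We can see this directly by parsing both elements with Python's standard html.parser module and collecting their attributes into dictionaries; this is a sketch, and the class name is our own:

```python
from html.parser import HTMLParser

class AttributeCollector(HTMLParser):
    """Record the attributes of the first start tag seen."""

    def __init__(self):
        super().__init__()
        self.attributes = None

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs; turning it
        # into a dictionary throws away the (irrelevant) ordering.
        if self.attributes is None:
            self.attributes = dict(attrs)

first = AttributeCollector()
first.feed('<p align="center" class="disclaimer">')

second = AttributeCollector()
second.feed('<p class="disclaimer" align="center">')

print(first.attributes == second.attributes)  # True
```

The two tags differ as text but produce identical dictionaries, which is exactly what "attributes are unordered" means in practice.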

HTML and Version Control

explain

When should we use attributes, and when should we nest elements? As a general rule, we should use attributes when:

  • each value can occur at most once for any element;
  • the order of the values doesn't matter; and
  • those values have no internal structure, i.e., we will never need to parse an attribute's value in order to understand it.

In all other cases, we should use nested elements. However, many widely-used XML formats break these rules in order to make it easier for people to write XML by hand. For example, in the Scalable Vector Graphics (SVG) format used to describe images as XML, we would define a rectangle as follows:

<rect width="300" height="100" style="fill:rgb(0,0,255); stroke-width:1; stroke:rgb(0,0,0)"/>

In order to understand the style attribute, a program has to somehow know to split it on semicolons, and then to split each piece on colons. This means that a generic program for reading XML can't extract all the information that's in SVG, which partly defeats the purpose of using XML in the first place.
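Here is a sketch of the extra, SVG-specific parsing such a program has to do (the function name is our own invention):

```python
def parse_style(style):
    # Split on ';' to get individual properties, then on ':' to
    # separate each property's name from its value.
    properties = {}
    for piece in style.split(';'):
        if piece.strip():
            name, _, value = piece.partition(':')
            properties[name.strip()] = value.strip()
    return properties

print(parse_style('fill:rgb(0,0,255); stroke-width:1; stroke:rgb(0,0,0)'))
```

None of this splitting is described by the XML itself: it is an extra convention that every SVG-aware program must build in, which is exactly the point of the criticism above.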

Key Points

  • Elements can be customized by adding key-value pairs called attributes.
  • An element's attributes must be unique, and are unordered.
  • Attribute values should not have any internal structure.

Challenges

FIXME

More HTML

Objectives

  • Write correctly-formatted HTML pages containing lists, tables, images, and links.
  • Add correctly-formatted metadata to the head of an HTML page.

Lesson

As anyone who has surfed the web has seen, web pages can contain a lot more than just headings and paragraphs. To start with, HTML provides two kinds of lists: ul to mark an unordered (bulleted) list, and ol for an ordered (numbered) one (Figure XXX). Items inside either kind of list must be wrapped in li elements:

<!DOCTYPE html>
<html>
  <body>
    <ul>
      <li>A. Binet
        <ol>
          <li>H. Ebbinghaus</li>
          <li>W. Wundt</li>
        </ol>
      </li>
      <li>C. S. Peirce
        <ol>
          <li>W. Wundt</li>
        </ol>
      </li>
    </ul>
  </body>
</html>
Nested Lists
Figure XXX: Nested Lists

Note how elements are nested: since the ordered lists "belong" to the unordered list items above them, they are inside those items' <li>…</li> tags. And remember, the indentation used to make this list easier for people to read means nothing to the computer: we could put the whole thing on one line, or write it as:

<!DOCTYPE html>
<html>
<body>
  <ul>
    <li>A. Binet
  <ol>
    <li>H. Ebbinghaus</li>
    <li>W. Wundt</li>
  </ol>
    </li>
    <li>C. S. Peirce
  <ol>
    <li>W. Wundt</li>
  </ol>
    </li>
  </ul>
</body>
</html>

and the computer would interpret and display it the same way. A human being, on the other hand, would find the inconsistent indentation of the second layout much harder to follow.

HTML also provides tables, but they are awkward to use: tables are naturally two-dimensional, but text is one-dimensional. This is exactly like the problem of representing a two-dimensional array in memory, which we saw in the NumPy and development lessons. We solve it in the same way: by writing down the rows, and the columns within each row, in a fixed order. The table element marks the table itself; within that, each row is wrapped in tr (for "table row"), and within those, column items are wrapped in th (for "table heading") or td (for "table data"):

<!DOCTYPE html>
<html>
  <body>
    <table>
      <tr>
        <th></th>
        <th>A. Binet</th>
        <th>C. S. Peirce</th>
      </tr>
      <tr>
        <th>H. Ebbinghaus</th>
        <td>88%</td>
        <td>NA</td>
      </tr>
      <tr>
        <th>W. Wundt</th>
        <td>29%</td>
        <td>45%</td>
      </tr>
    </table>
  </body>
</html>
A Simple Table
Figure XXX: A Simple Table
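The row-by-row, cell-by-cell ordering also tells us how to generate a table from a program: loop over the rows, and over the values within each row. A small sketch (the function name is our own, and heading cells are omitted for brevity):

```python
def make_table(rows):
    # Each inner list becomes one <tr>; each value becomes one <td>.
    lines = ['<table>']
    for row in rows:
        cells = ''.join('<td>{}</td>'.format(value) for value in row)
        lines.append('<tr>{}</tr>'.format(cells))
    lines.append('</table>')
    return '\n'.join(lines)

print(make_table([['88%', 'NA'], ['29%', '45%']]))
```

Flattening two dimensions into a fixed order in memory, and reading them back out the same way, is the same trick used for arrays in the NumPy lesson.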

Tables, Layout, and CSS

Tables are sometimes used to do multi-column layout, as well as for tabular data, but this is a bad idea. To understand why, consider two other HTML tags: i, meaning "italics", and em, meaning "emphasis". The former directly controls how text is displayed, but by doing so, it breaks the separation between model and view that is the heart of markup's usefulness. Without understanding the text that has been italicized, a program cannot understand whether it is meant to indicate someone shouting, the definition of a new term, or the title of a book. The em tag, on the other hand, has exactly one meaning, and that meaning is different from the meaning of dfn (a definition) or cite (a citation).

Conscientious authors use Cascading Style Sheets (or CSS) to describe how they want pages to appear, and only use table elements for actual tables. CSS is beyond the scope of this lesson, but is described briefly in the appendix.

HTML pages can also contain images. (In fact, the World Wide Web didn't really take off until the Mosaic browser allowed people to mix images with text.) The word "contain" is misleading, though: HTML documents can only contain text, so we cannot store an image "in" a page. Instead, we must put it in some other file, and insert a reference to that file in the HTML using the img tag. Its src attribute specifies where to find the image file; this can be a path to a file on the same host as the web page, or a URL for something stored elsewhere. For example, when a browser displays this:

<!DOCTYPE html>
<html>
  <body>
    <p>My daughter's first online chat:</p>
    <img src="madeleine.jpg"/>
    <p>but probably not her last.</p>
  </body>
</html>

it looks for the file madeleine.jpg in the same directory as the HTML file:

Simple Images
Figure XXX: Simple Images

Notice, by the way, that the img element is written as <img…/>, i.e., with a trailing slash inside the <> rather than with a separate closing tag. This makes sense because the element doesn't contain any text: the content is referred to by its src attribute. Any element that doesn't contain anything can be written using this short form.

Images don't have to be in the same directory as the pages that refer to them. When the browser displays this:

<!DOCTYPE html>
<html>
  <body>
    <p>Yes, she knows she's cute:</p>
    <img src="img/cute-smile.jpg"/>
  </body>
</html>

it looks in the directory containing the page for a sub-directory called img, and loads the image file from there, while if it's given:

<!DOCTYPE html>
<html>
  <body>
    <img src="http://software-carpentry.org/img/software-carpentry-logo.png"/>
  </body>
</html>

it downloads the image from the URL http://software-carpentry.org/img/software-carpentry-logo.png and displays that.

It's Always Interpreted

The path is always interpreted (web browser config)

Whenever we refer to an image, we should use the img tag's alt attribute to provide a title or description of the image. This is what screen readers for people with visual handicaps will say aloud to "display" the image; it's also what search engines rely on, since they can't "see" the image either. Adding this to our previous example gives:

<!DOCTYPE html>
<html>
  <body>
    <p>My daughter's first online chat:</p>
    <img src="madeleine.jpg" alt="Madeleine's first online chat"/>
    <p>but probably not her last.</p>
  </body>
</html>

We can use URLs for images, but their most important use is to create the links within and between pages that make HTML "hypertext". This is done using the a element. Whatever is inside the element is displayed and highlighted for clicking; this is usually a few words of text, but it can be an entire paragraph or an image.

The a element's href attribute specifies what the link is pointing at; as with images, this can be either a local filename or a URL. For example, we can create a listing of the examples we've written so far like this (Figure XXX):

<!DOCTYPE html>
<html>
  <body>
    <p>
      Simple HTML examples for
      <a href="http://software-carpentry.org">Software Carpentry</a>.
    </p>
    <ol>
      <li><a href="very-simple.html">a very simple page</a></li>
      <li><a href="hide-paragraph.html">hiding paragraphs</a></li>
      <li><a href="nested-lists.html">nested lists</a></li>
      <li><a href="simple-table.html">a simple table</a></li>
      <li><a href="simple-image.html">a simple image</a></li>
    </ol>
  </body>
</html>
Using Hyperlinks
Figure XXX: Using Hyperlinks

The hyperlink element is called a because it can also be used to create anchors in documents by giving them a name attribute instead of an href. An anchor is simply a location in a document that can be linked to. For example, suppose we formatted the Feynman quotation given earlier like this:

<blockquote>
  As a by-product of this same view, I received a telephone call one day
  at the graduate college at <a name="pu">Princeton</a>
  from Professor Wheeler, in which he said,
  "Feynman, I know why all electrons have the same charge and the same mass."
  "Why?"
  "Because, they are all the same electron!"
</blockquote>

If this quotation was in a file called quote.html, we could then create a hyperlink directly to the mention of Princeton using <a href="quote.html#pu">. The # in the href's value separates the path to the document from the anchor we're linking to. Inside quote.html itself, we could link to that same location simply using <a href="#pu">.
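Python's standard library understands this convention: urllib.parse.urldefrag splits a link into the document part and the fragment (the anchor name). For example:

```python
from urllib.parse import urldefrag

# urldefrag returns a (url, fragment) pair; the '#' itself is dropped.
url, fragment = urldefrag('quote.html#pu')
print(url)       # quote.html
print(fragment)  # pu
```

This is handy when a program needs to fetch the document (which requires only the part before the #) separately from locating the anchor within it.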

Using the a element for both links and targets was poor design—programs are simpler to write if each element has one purpose, and one alone—but we're stuck with it now. A better way to create anchors is to add an id attribute to some other element. For example, if we wanted to be able to link to the quotation itself, we could write:

<blockquote id="wheeler-electron-quote">
  As a by-product of this same view, I received a telephone call one day
  at the graduate college at <a name="pu">Princeton</a>
  from Professor Wheeler, in which he said,
  "Feynman, I know why all electrons have the same charge and the same mass."
  "Why?"
  "Because, they are all the same electron!"
</blockquote>

and then refer to quote.html#wheeler-electron-quote.

Finally, well-written HTML pages have a head element as well as a body. The head isn't displayed; instead, it's used to store metadata about the page as a whole. The most common element inside head is title, which, as its name suggests, gives the page's title. (This is usually displayed in the browser's title bar.) Another common item in the head is meta, whose two attributes name and content let authors add arbitrary information to their pages. If we add these to the web page we wrote earlier, we might have:

<!DOCTYPE html>
<html>
  <head>
    <title>Dimorphism Defined</title>
    <meta name="author" content="Alan Turing"/>
    <meta name="institution" content="Euphoric State University"/>
  </head>
  <body>
    <h1>Dimorphism</h1>
    <p>Occurring or existing in two different <em>forms</em>.</p>
  </body>
</html>

Well-written pages also use comments (just like code), which start with <!-- and end with -->.

Hiding Content

Commenting out part of a page does not hide the content from people who really want to see it: while a browser won't display what's inside a comment, it's still in the page, and anyone who uses "View Source" can read it. For example, if you are looking at this page in a web browser right now, try viewing the source and searching for the word "Surprise".

If you really don't want people to be able to read something, the only safe thing to do is to keep it off the web.

Key Points

  • Put metadata in meta elements in a page's head element.
  • Use ul for unordered lists and ol for ordered lists.
  • Add comments to pages using <!-- and -->.
  • Use table for tables, with tr for rows and td for values.
  • Use img for images.
  • Use a to create hyperlinks.
  • Give an element a unique id attribute so that other pages can link to it.

Challenges

FIXME

Creating Documents

Objectives

  • Explain how page templating works.
  • Use Jinja2 to create and compile a templated page that uses conditionals and loops.

Lesson

Turning a Python list into an HTML ol or ul list seems like a natural thing to do, so you might expect that programmers would have created libraries to do it. In fact, they have gone one step further and created systems that allow people to put bits of code directly into HTML files. Such a file is usually called a template, since it is the general pattern for any number of potential pages.
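The naive version is worth seeing once, if only to appreciate what templates save us from. Here is a sketch of a hand-rolled function (our own, not from any library) that turns a Python list into an HTML list:

```python
def list_to_html(items, ordered=False):
    # ol for ordered (numbered) lists, ul for unordered (bulleted) ones.
    tag = 'ol' if ordered else 'ul'
    body = ''.join('<li>{}</li>'.format(item) for item in items)
    return '<{0}>{1}</{0}>'.format(tag, body)

print(list_to_html(['Born 1941', 'Died 1981']))
```

This works, but a page built entirely out of such calls quickly becomes unreadable; templates flip the arrangement around, putting small bits of code in the HTML instead of big blobs of HTML in the code.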

Here's a simple example. Suppose we want to create a set of web pages to display point-form biographies of famous scientists. We want each page to look like this:

<html>
  <head>
    <title>Biography of Beatrice Tinsley</title>
  </head>
  <body>
    <h1>Beatrice Tinsley</h1>
    <ol>
      <li>Born 1941</li>
      <li>Died 1981</li>
      <li>Studied stellar aging</li>
    </ol>
  </body>
</html>

but since we expect to have hundreds of such pages, we don't want to write each one by hand. (We certainly don't want to have to revise each one by hand when the university decides it wants them in a slightly different format...) To make things easier on ourselves, let's create a single template page called biography.html that contains:

<html>
  <head>
    <title>Biography of {{name}}</title>
  </head>
  <body>
    <h1>{{name}}</h1>
    <ol>
      {% for f in facts %}
      <li>{{f}}</li>
      {% endfor %}
    </ol>
  </body>
</html>

This has the same general structure as the page above, but there are a few changes: it uses {{name}} instead of the scientist's name, and rather than listing each biographical detail, it has something that looks a lot like a for loop that iterates over something called facts.

What we need next is a program that can expand this template using particular values for name and facts. We will use a Python template library called Jinja2 to do this; there are many others but they all work in more or less the same way (which means, "They each have their own slightly different rules for what can go in a page and how it's expanded.").

First, let's put all the values we want to customize the page with into variables:

who = 'Beatrice Tinsley'
what = ['Born 1941', 'Died 1981', 'Studied stellar aging']

Next, we have to import the Jinja2 library and do a bit of magic to load the template for our page:

import jinja2

loader = jinja2.FileSystemLoader(['.'])
environment = jinja2.Environment(loader=loader)
template = environment.get_template('biography.html')

We start by importing the jinja2 library, and then create an object called a "loader". Its job is to find template files and load them into memory; its argument is a list of the directories we want it to search (in order). For now, we are only looking in the current directory, so the list is just ['.'] (i.e., the current directory).

Once we have that loader, we use it to create a Jinja2 "environment", which—well, honestly, we don't need two separate objects for what we're doing, but more complicated applications might need several loaders, or might be expanding different sets of templates in different ways, and the Environment object is where all that is handled.

What we really want is the last line, which asks the environment to load the template file 'biography.html' and give us an object that knows how to expand itself. We're now ready to do the actual expansion:

result = template.render(name=who, facts=what)
print(result)

When we call template.render, we pass it any number of name-value pairs. (Remember, the odd-looking expression name=who in the function call means, "Assign the value of the variable who in the calling code to the parameter called name inside the function.") Those names are turned into variables, and can be used inside the template, so that {{name}} is given the string 'Beatrice Tinsley' and facts is given our list of facts about her.

The method call template.render "runs" the template as if it were a program, and returns the string that's created. When we print it out, we get:

<html>
  <head>
    <title>Biography of Beatrice Tinsley</title>
  </head>
  <body>
    <h1>Beatrice Tinsley</h1>
    <ol>
      
      <li>Born 1941</li>
      
      <li>Died 1981</li>
      
      <li>Studied stellar aging</li>
      
    </ol>
  </body>
</html>

Why go to all of this trouble? Because if we want to create another page with exactly the same format, all we have to do is call:

result = template.render(name='Helen Sawyer Hogg',
                         facts=['Born 1905',
                                'Died 1993',
                                'Studied globular clusters',
                                'Wrote a popular astronomy column for 30 years'])

and we will get:

<html>
  <head>
    <title>Biography of Helen Sawyer Hogg</title>
  </head>
  <body>
    <h1>Helen Sawyer Hogg</h1>
    <ol>
      
      <li>Born 1905</li>
      
      <li>Died 1993</li>
      
      <li>Studied globular clusters</li>
      
      <li>Wrote a popular astronomy column for 30 years</li>
      
    </ol>
  </body>
</html>

Pros and Cons of Templating

Putting code in HTML templates and then expanding that to create actual pages has advantages and disadvantages. The main advantage is that simple things are simple to do: the biography template shown above is a lot easier to understand than either a bunch of print statements, or a set of functions that construct a document in memory and then turn the result into a string.

The other big advantage of templating is that all of the generated pages are guaranteed to have the same format. If subsections are marked with an h2 heading in one, they'll be marked with an h2 in all the others. This makes it easier for programs to read and process those pages.

The biggest drawback of templating is the lack of support for debugging. It's very common for template expansion to do what you said, rather than what you meant, and working backward from a page that has the wrong content to the bits of template that weren't quite right can be complicated. One way to keep it manageable is to keep the templates as simple as possible. Any calculations more complicated than simple addition should be done in the program, and the result passed in as a variable. Similarly, while deeply-nested conditional statements in programs are hard to understand, their equivalents in templates are even harder, and so should be avoided.
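For example, rather than asking the template to compute a scientist's lifespan, do the arithmetic in Python and pass only the result in. Here is a sketch using an inline template (jinja2.Template builds a template directly from a string, which is convenient for small examples):

```python
import jinja2

born, died = 1941, 1981
lifespan = died - born  # do the arithmetic in the program...

# ...and let the template do nothing but display the result.
template = jinja2.Template('{{name}} lived {{lifespan}} years.')
print(template.render(name='Beatrice Tinsley', lifespan=lifespan))
```

If the calculation is ever wrong, we can test and debug it with the ordinary tools we use for Python code, instead of staring at expanded HTML.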

Jinja2 templates support all the basic features of Python. For example, we can modify our template file to say:

<html>
  <head>
    <title>Biography of {{name}}</title>
  </head>
  <body>
    <h1>{{name}}</h1>
    {% if facts %}
      <ol>
        {% for f in facts %}
        <li>{{f}}</li>
        {% endfor %}
      </ol>
    {% else %}
      <p>No facts available.</p>
    {% endif %}
  </body>
</html>

so that if the list facts is empty, the page displays a paragraph saying that, rather than an empty ordered list. We can also tell Jinja2 to include one template in another, so that if we want every page to have the same logo and license statement, we can use:

{% include "logo.html" %}

at the top, and:

{% include "license.html" %}

at the bottom.

Key Points

  • Use a page templating system like Jinja2 to generate web pages from data.

Challenges

FIXME

How the Web Works

Objectives

  • Explain what IP addresses, host names, and sockets are.
  • Draw a diagram of HTTP's request-response cycle and explain the major steps.
  • Draw a diagram showing what information HTTP requests and responses contain.
  • Explain the difference between client-server and peer-to-peer architectures, and give an example of each.

Lesson

Now that we know how to read and write the web's most common data format, it's time to look at how data is moved around on the web. Broadly speaking, web applications are built in one of two ways. In a client-server architecture, many clients communicate with a central server (Figure XXX). This model is asymmetric: clients ask for things, and servers provide them. Web browsers and web servers like Firefox and Apache are the best-known examples of this model, but many database management systems also use a client-server architecture.

Client-Server Architecture
Figure XXX: Client-Server Architecture

In contrast, a peer-to-peer architecture is one in which all processes exchange information equally (Figure XXX). This is symmetric: every participant both provides and receives data. The most widely used example today is probably BitTorrent, but again, there are many others. Peer-to-peer systems are generally harder to design than client-server systems, but they are also more resilient: if a centralized web server fails, the whole system goes down, while if one node in a filesharing network goes down, the rest can (usually) carry on.

Peer-to-Peer Architecture
Figure XXX: Peer-to-Peer Architecture

Under the hood, both kinds of systems (and pretty much every other program that uses the network) run on a family of communication standards called Internet Protocol (IP). IP breaks messages down into small packets, each of which is forwarded from one machine to another along any available route to its destination, where the whole message is reassembled (Figure XXX).

Packet-Based Communication
Figure XXX: Packet-Based Communication

The only part of the IP family that concerns us is the Transmission Control Protocol (TCP). It guarantees that every packet we send is received, and that packets are received in the right order. Putting it another way, it turns an unreliable stream of disordered packets into a reliable, ordered stream of data, so that communication between computers looks as much as possible like reading and writing files (Figure XXX).

Building Streams Out of Packets
Figure XXX: Building Streams Out of Packets

Programs using IP communicate through sockets. Each socket is one end of a point-to-point communication channel, just like a phone is one end of a phone call. A socket is identified by two numbers. The first is its host address or IP address, which identifies a particular machine on the network. An IPv4 address consists of four 8-bit numbers, such as 208.113.154.118. The Domain Name System (DNS) matches these numbers to symbolic names like software-carpentry.org that are easier for human beings to remember. We can use tools like nslookup to query DNS directly:

$ nslookup software-carpentry.org
Server:  admin1.private.tor1.mozilla.com
Address:  10.242.75.5

Non-authoritative answer:
Name:    software-carpentry.org
Address:  173.236.199.157
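The same kind of lookup can be done from inside a program with Python's standard socket library. As a sketch (we resolve localhost here because its answer doesn't depend on which network we happen to be connected to):

```python
import socket

# The same kind of lookup nslookup performs, done from inside a program.
# 'localhost' conventionally resolves to the loopback address.
address = socket.gethostbyname('localhost')
print(address)
```

Resolving a real site name like software-carpentry.org works the same way, but requires a network connection and may return different addresses at different times.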

A socket's port number is just a number in the range 0-65535 that uniquely identifies the socket on the host machine. (If the IP address is like a university's phone number, then the port number is the extension.) Ports 0-1023 are reserved for well-known services, and normally only programs with administrator privileges may use them; anyone can use the remaining ports (Figure XXX).

Ports
Figure XXX: Ports
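We can see both halves of a socket's identity from Python's standard socket library. In this sketch, binding to port 0 asks the operating system to pick any free port for us:

```python
import socket

# A socket's identity is the pair (host address, port number).
# Binding to port 0 asks the OS to choose any free port.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(('127.0.0.1', 0))
host, port = s.getsockname()
print(host, port)
s.close()
```

Servers normally bind to a fixed, well-known port instead, so that clients know where to find them.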

The Hypertext Transfer Protocol (HTTP) sits on top of TCP/IP. It describes one way that programs can exchange web pages and other data, such as image files. The communicating parties were originally web browsers and web servers, but HTTP is now used by many other kinds of applications as well.

In principle, HTTP is simple: the client sends a request specifying what it wants over a socket connection, and the server sends some data in response. The data may be HTML copied from a file on disk, a similar page generated dynamically by a program, an image, or just about anything else (Figure XXX).

HTTP Request Cycle
Figure XXX: HTTP Request Cycle

The Internet vs. the Web

A lot of people use the terms "Internet" and "World Wide Web" synonymously, but they're actually very different things. The Internet is what lets (almost) any computer communicate with (almost) any other. That communication can be email, File Transfer Protocol (FTP), streaming video, or any of a hundred other things. The World Wide Web, on the other hand, is just one particular way to share data on top of the network that the Internet provides.

An HTTP request has three parts (Figure XXX). The HTTP method is almost always either "GET" (to fetch information) or "POST" (to submit form data or upload files). The URL specifies what the client wants; it may be a path to a file on disk, such as /research/experiments.html, but it's entirely up to the server to decide what to send back. The HTTP version is usually "HTTP/1.0" or "HTTP/1.1"; the differences between the two don't matter to us.

HTTP Request
Figure XXX: HTTP Request

An HTTP header is a key/value pair, such as the three shown below:

Accept: text/html
Accept-Language: en, fr
If-Modified-Since: 16-May-2005

A key may appear any number of times, so that (for example) a request can specify that it's willing to accept several types of content.

The body is any extra data associated with the request. This is used when submitting data via web forms, when uploading files, and so on. There must be a blank line between the last header and the start of the body to signal the end of the headers; forgetting it is a common mistake.

One header, called Content-Length, tells the server how many bytes to expect to read in the body of the request. There's no magic in any of this: an HTTP request is just text, and any program that wants to can create one or parse one.
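To make that point concrete, here is a sketch that assembles the request described above as plain text. (HTTP separates lines with a carriage return plus a newline.)

```python
# An HTTP request is just text: method, URL, and version on the first
# line; then headers; then a blank line; then the (here empty) body.
lines = [
    'GET /research/experiments.html HTTP/1.0',
    'Accept: text/html',
    'Accept-Language: en, fr',
    '',  # blank line marking the end of the headers
    '',  # empty body
]
request = '\r\n'.join(lines)
print(request)
```

Any program that can write this text to a socket can speak HTTP; libraries just save us the bookkeeping.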

HTTP Response
Figure XXX: HTTP Response

HTTP responses are formatted like HTTP requests (Figure XXX). The version, headers, and body have the same form and mean the same thing. The status code is a number indicating what happened when the request was processed by the server. 200 means "everything worked", 404 means "not found", and other codes have other meanings (Figure XXX). The status phrase repeats that information in human-readable form, such as "OK" or "Not Found".

Code   Name                    Meaning
----   ----                    -------
100    Continue                Client should continue sending data
200    OK                      The request has succeeded
204    No Content              The server has completed the request, but doesn't need to return any data
301    Moved Permanently       The requested resource has moved to a new permanent location
307    Temporary Redirect      The requested resource is temporarily at a different location
400    Bad Request             The request is badly formatted
401    Unauthorized            The request requires authentication
404    Not Found               The requested resource could not be found
408    Timeout                 The server gave up waiting for the client
418    I'm a teapot            No, really
500    Internal Server Error   An error occurred in the server that prevented it from fulfilling the request
601    Connection Timed Out    The server did not respond before the connection timed out
Figure XXX: HTTP Codes

The one other thing that we need to know about HTTP is that it is stateless: each request is handled on its own, and the server doesn't remember anything between one request and the next. If an application wants to keep track of something like a user's identity, it must do so itself. The usual way to do this is with a cookie, which is just a short character string that the server sends to the client, and the client later returns to the server (Figure XXX). When a user signs in, the server creates a new cookie, stores it in a database, and sends it to their browser. Each time the browser sends the cookie back, the server uses it to look up information about what the user is doing (e.g., what wiki page they are editing).

Cookies
Figure XXX: Cookies
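The bookkeeping behind cookies can be sketched in a few lines of Python. This is a toy illustration only: the sign_in and handle_request functions are hypothetical stand-ins, and real servers keep sessions in a database and cryptographically sign their cookies.

```python
import uuid

# Toy session store mapping cookie values to user names.
sessions = {}

def sign_in(user):
    '''Issue a fresh cookie for a user and remember who it belongs to.'''
    cookie = uuid.uuid4().hex
    sessions[cookie] = user
    return cookie

def handle_request(cookie):
    '''Figure out who sent a request from the cookie they returned.'''
    return sessions.get(cookie, None)

c = sign_in('carla')
print(handle_request(c))        # the server knows this request is Carla's
print(handle_request('bogus'))  # an unknown cookie means "not signed in"
```

Because HTTP itself is stateless, all of this memory lives on the server side; the cookie is just the key that lets the server find it again.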

Key Points

  • Most communication on the web uses TCP/IP and sockets.
  • A socket endpoint is identified by a host address and a port number.
  • The Domain Name System (DNS) translates between human-readable names and host addresses.
  • An HTTP request contains a method, headers, and a body.
  • An HTTP response also contains a response code.
  • HTTP is a stateless request-response protocol.
  • Many web sites use cookies to keep track of state.

Challenges

FIXME

Getting Data

Objectives

  • Write a program that downloads a data file given its URL.
  • Format values as URL query parameters.

Lesson

Opening sockets, constructing HTTP requests, and parsing responses is tedious, so most people use libraries to do most of the work. Python comes with such a library called urllib2 (because it's a replacement for an earlier library called urllib), but it exposes a lot of plumbing that most people never need to care about. Instead, we recommend using the Requests library. Here's an example that uses it to download a page from our web site:

import requests
response = requests.get("http://guide.software-carpentry.org/web/testpage.html")
print 'status code:', response.status_code
print 'content length:', response.headers['content-length']
print response.text
status code: 200
content length: 126
<!DOCTYPE html>
<html>
  <head>
    <title>Software Carpentry Test Page</title>
  </head>
  <body>
    <p>Use this page to test requests.</p>
  </body>
</html>

requests.get does an HTTP GET on a URL and returns an object containing the response. That object's status_code member is the response's status code; the 'content-length' entry in its headers tells us how many bytes of response data there are; and its text member is the actual data (in this case, an HTML page).

One at a Time

Unlike a browser, our program fetches only the page itself: any images, stylesheets, or scripts that the page refers to are not downloaded automatically. If we want them, we have to parse the HTML and request each one separately.

Sometimes a URL isn't enough on its own: for example, we have to specify what our search terms are if we are using a search engine. We could add these to the path in the URL, but that would be misleading (since most people think of paths as identifying files and directories), and we'd have to decide whether /software/carpentry and /carpentry/software were the same search or not.

What we should do instead is add parameters to the URL by appending a '?' followed by 'key=value' pairs separated by '&'. For example, the URL http://www.google.ca?q=Python asks Google to search for pages related to Python—the key is the letter 'q', and the value is 'Python'—while the longer query http://www.google.ca/search?q=Python&client=Firefox tells Google that we're using Firefox. We can pass whatever parameters we want, but it's up to the application running on the web site to decide which ones to pay attention to, and how to interpret them.

You Are Who You Say You Are

Yes, this means that we could write a program that tells websites it is Firefox, Internet Explorer, or pretty much anything else. We'll return to this and other security issues later.

Of course, if '?' and '&' are special characters, there must be a way to escape them. The URL encoding standard represents special characters using "%" followed by a 2-digit code, and replaces spaces with the '+' character (Figure XXX). Thus, to search Google for "grade = A+" (with the spaces), we would use the URL http://www.google.ca/search?q=grade+%3D+A%2B.

Character Encoding
"#" %23
"$" %24
"%" %25
"&" %26
"+" %2B
"," %2C
"/" %2F
":" %3A
";" %3B
"=" %3D
"?" %3F
"@" %40
Figure XXX: URL Encoding
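These encodings don't have to be produced by hand: Python's standard library has helpers for them. (This sketch uses Python 3's urllib.parse module; Python 2 kept the same functions in urllib.)

```python
from urllib.parse import quote_plus, urlencode

# quote_plus escapes special characters and turns spaces into '+'.
print(quote_plus('grade = A+'))        # grade+%3D+A%2B

# urlencode builds a whole query string from a dictionary.
print(urlencode({'q': 'grade = A+'}))  # q=grade+%3D+A%2B
```

The second call produces exactly the query string used in the Google search URL above.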

Encoding things by hand is very error-prone, so the Requests library lets us use a dictionary of key-value pairs instead via the keyword argument params:

import requests
parameters = {'q' : 'Python', 'client' : 'Firefox'}
response = requests.get('http://www.google.com/search', params=parameters)
print 'actual URL:', response.url
actual URL: http://www.google.com/search?q=Python&client=Firefox

You should always let the library build the URL for you, rather than doing it yourself: there are subtleties we haven't covered, and even if there weren't, there's no point duplicating code that's already been written and tested.

Suppose we want to write a script that actually does search Google. Constructing a URL is easy. Sending it and reading the response is easy too, but parsing the response is hard, since there's a lot of stuff in the page that Google sends back. Many first-generation web applications relied on screen scraping to get data, i.e., they would pick substrings out of the HTML, often with the help of a forgiving parser like Beautiful Soup. They had to do this because a lot of hand-written HTML was improperly formatted: for example, it was quite common to use <br> on its own to break a line.

Screen scraping is always hard to get right if the page layout is complex. It is also fragile: whenever the layout of the pages changes, the application will most likely break because data is no longer where it was.

Most modern web applications try to sidestep this problem by providing some sort of web services interface, which is a lot simpler than it sounds. When a client sends a request, it indicates that it wants machine-oriented data rather than human-readable HTML by using a slightly different URL (Figure XXX). When asked for data, the server sends back JSON, XML, or something else that is easy for a program to handle. If the client asks for HTML, on the other hand, the application turns that data into HTML pages with italics and colored highlights and the like to make it easy for human beings to read.

Web Services
Figure XXX: Web Services

Using "live" data from a web service is a powerful way to get a lot of science done in a hurry, but only when it works. As a case in point, we wanted to use bird-watching data from ebird.org in this example, but their server was locked down for security reasons when it came time for us to write our examples. (This is another way in which software is like other experimental apparatus: odds are that when you need it most, it will be broken or someone will have borrowed it.) We therefore chose to use climate data from the World Bank instead. According to the documentation, data for a particular country can be found at:

http://climatedataapi.worldbank.org/climateweb/rest/v1/country/cru/VARIABLE/year/ISO.FORMAT

where:

  • VARIABLE is either "pr" (for precipitation) or "tas" (for temperature at surface);
  • ISO is the International Standards Organization's 3-letter country code for the country of interest, and
  • FORMAT is "JSON" for JSON, and other strings for other formats.

Let's try getting temperature data for France:

>>> import requests
>>> url = 'http://climatedataapi.worldbank.org/climateweb/rest/v1/country/cru/tas/year/FRA.JSON'
>>> response = requests.get(url)
>>> print response.text
[{"year":1901, "data":9.748865},
 {"year":1902, "data":9.864603},
 {"year":1903, "data":10.130159},
 ...
 {"year":2009,"data":11.709985}]

This is straightforward to interpret: the outer list contains one dictionary per year, and each dictionary has "year" and "data" entries. Let's use this to write a program that compares the data for two countries (which is the problem Carla wanted to solve at the start of this chapter). We need to know which countries to compare:

import sys

def main(args):
    first_country = 'AUS'
    second_country = 'CAN'
    if len(args) > 0:
        first_country = args[0]
    if len(args) > 1:
        second_country = args[1]

    result = ratios(first_country, second_country)
    display(result)

def ratios(first, second):
    '''Calculate ratio of average temperatures for two countries over time.'''
    return {} # FIXME: fill in

def display(values):
    '''Show dictionary entries in sorted order.'''
    keys = values.keys()
    keys.sort()
    for k in keys:
        print k, values[k]

if __name__ == '__main__':
    main(sys.argv[1:])

The pattern here should be familiar: we solve the top-level problem as if we already have the functions we need, then come back and fill them in. In this case, the function to be filled in is ratios, which fetches data and calculates our result:

def ratios(first, second):
    '''Calculate ratio of average temperatures for two countries over time.'''
    first = get_temps(first)
    second = get_temps(second)
    assert len(first) == len(second), 'Length mis-match in results'
    result = {}
    for (i, first_entry) in enumerate(first):
        year = first_entry['year']
        second_entry = second[i]
        assert second_entry['year'] == year, 'Year mis-match'
        result[year] = first_entry['data'] / second_entry['data']
    return result
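The pairing logic in ratios can be exercised without touching the network. The sketch below repeats the function and feeds it made-up, in-memory lists shaped like the API's response:

```python
def ratios(first, second):
    '''Year-by-year ratios of two equal-length lists of {'year', 'data'} entries.'''
    assert len(first) == len(second), 'Length mis-match in results'
    result = {}
    for (i, first_entry) in enumerate(first):
        year = first_entry['year']
        assert second[i]['year'] == year, 'Year mis-match'
        result[year] = first_entry['data'] / second[i]['data']
    return result

# Made-up temperatures (in Kelvin) for two imaginary countries.
first = [{'year': 1901, 'data': 300.0}, {'year': 1902, 'data': 310.0}]
second = [{'year': 1901, 'data': 150.0}, {'year': 1902, 'data': 155.0}]
print(ratios(first, second))  # {1901: 2.0, 1902: 2.0}
```

Testing the calculation on small fake data like this is much faster than repeatedly downloading real data while debugging.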

It depends in turn on get_temps:

URL = 'http://climatedataapi.worldbank.org/climateweb/rest/v1/country/cru/tas/year/%s.JSON'

...all the code written so far...

def get_temps(country_code):
    '''Get annual temperatures for a country.'''
    response = requests.get(URL % country_code)
    assert response.status_code == 200, \
           'Failed to get data for %s' % country_code
    return json.loads(response.text)

But wait a second: judging from the sample response shown earlier, temperatures are being reported in Celsius. We should probably convert them to Kelvin to make the ratios more meaningful (and to avoid the risk of dividing by zero). Let's modify get_temps:

def get_temps(country_code):
    '''Get annual temperatures for a country.'''
    response = requests.get(URL % country_code)
    assert response.status_code == 200, \
           'Failed to get data for %s' % country_code
    result = json.loads(response.text)
    for entry in result:
        entry['data'] = kelvin(entry['data'])
    return result

and add the required conversion function:

def kelvin(celsius):
    '''Convert degrees C to degrees K.'''
    return celsius + 273.15

Let's try running this program with no arguments to compare Australia to Canada:

$ python temperatures.py
1901 1.10934799048
1902 1.11023963325
1903 1.10876094164
...  ...
2007 1.10725265753
2008 1.10793365185
2009 1.10865537105

and then with arguments to compare Malaysia to Norway:

$ python temperatures.py MYS NOR
1901 1.08900632708
1902 1.09536126502
1903 1.08935268463
...  ...
2007 1.08564675748
2008 1.08481881663
2009 1.08720464013

Only six lines in this program do anything webbish (i.e., format the actual URL and get the data). The remaining 47 lines are the user interface (handling command-line arguments and printing output), data manipulation (converting temperatures and calculating ratios), import statements, and docstrings. It really is that simple.

Key Points

  • Use Python's Requests library to make HTTP requests.
  • Let the library format URL parameters.
  • Ask web sites for data instead of scraping it from HTML pages.
  • The URLs and query parameters needed to fetch data are specified by the web site.

Challenges

FIXME

Providing Data

Objectives

  • Explain how server applications provide data to clients, and why doing this securely is hard.
  • Recognize when dynamically-generated static pages are a usable alternative, and explain how this differs from truly dynamic service.
  • Generate pages containing data in human-readable form.

Lesson

The next logical step is to provide data to others by writing some kind of server application. The basic idea is simple (Figure XXX):

  1. wait for someone to connect to your server and send you an HTTP request;
  2. parse that request;
  3. figure out what it's asking for;
  4. fetch that data (or run a program to generate some data dynamically);
  5. format the data as HTML or XML; and
  6. send it back.
Web Application Lifecycle
Figure XXX: Web Application Lifecycle

As simple as this is, we're not going to show you how to do it, because experience has shown that all we can actually do in a short lecture is show you how to create security problems. Here's just one example. Suppose you want to write a web application that accepts URLs of the form http://my.site/data?species=homo.sapiens and fetches a database record containing information about that species. One way to do it in Python might look like this:

def get_species(url):
    '''Get data for a species given a URL with the species name as a query parameter.'''
    params = url.split('?')[1]                                # Get everything after the '?'.
    pairs = params.split('&')                                 # Get the name1=value1&name2=value2 pairs.
    pairs = [p.split('=') for p in pairs]                     # Split each name=value pair.
    pairs = dict(pairs)                                       # Convert to a {name : value} dictionary.
    species = pairs['species']                                # Get the species we want to look up.
    sql = '''SELECT * FROM Species WHERE Name = "%s";'''      # Template for SQL query.
    sql = sql % species                                       # Insert the species name.
    cursor.execute(sql)                                       # Send query to database.
    results = cursor.fetchall()                               # Get all the results.
    return results[0]

We've taken out all the error-checking—for example, this code will fail if there aren't actually any query parameters, or if the species' name isn't in the database—but that's not the problem. The problem is what happens if someone sends us this URL:

http://my.site/data?species=homo.sapiens";DROP TABLE Species;--

Why? Because the dictionary of query parameters produced by the first five lines of the function will be:

{'species' : 'homo.sapiens";DROP TABLE Species;--'}

which means that the SQL query will be:

SELECT * FROM Species WHERE Name = "homo.sapiens";DROP TABLE Species;--";

which is the same as:

SELECT * FROM Species WHERE Name = "homo.sapiens";
DROP TABLE Species;

In other words, this query selects something from the database, then deletes the entire Species table.

This is called an SQL injection attack, because the user is injecting SQL into our database query. It's just one of hundreds of different ways that evil-doers can try to compromise a web application. Built properly, web sites can withstand such attacks, but learning what "properly" is and how to implement it takes more time than we have.
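For reference, the standard defense against SQL injection is a parameterized query, in which the database library itself keeps data separate from the SQL. Here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for whatever database a real site would use:

```python
import sqlite3

# Build a throwaway in-memory database with a toy Species table.
connection = sqlite3.connect(':memory:')
cursor = connection.cursor()
cursor.execute('CREATE TABLE Species (Name TEXT)')
cursor.execute("INSERT INTO Species VALUES ('homo.sapiens')")

# The attack string from above, passed as a parameter via '?' instead of
# being pasted into the SQL text: the database treats the whole thing as
# an ordinary string to compare against Name, so nothing is dropped.
attack = 'homo.sapiens";DROP TABLE Species;--'
cursor.execute('SELECT * FROM Species WHERE Name = ?', (attack,))
matches = cursor.fetchall()       # no rows match the attack string

cursor.execute('SELECT COUNT(*) FROM Species')
count = cursor.fetchone()[0]      # the table is still intact
print(matches, count)
```

Parameterization is necessary but nowhere near sufficient for a secure site, which is why we still aren't going to build one here.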

Instead, we will look at how to write programs that create static HTML pages that can then be given to clients by a standard web server. Using the ratios of average annual temperatures as our example, we'll create pages whose names look like http://my.site/tempratio/AUS-CAN.html, and which contain data formatted like this:

<html>
  <head>
    <meta name="revised" content="2013-09-15" />
  </head>
  <body>
    <h1>Ratio of Average Annual Temperatures for AUS and CAN</h1>
    <table class="data">
      <tr>
        <td class="year">1901</td>
        <td class="data">1.10934799048</td>
      </tr>
      <tr>
        <td class="year">1902</td>
        <td class="data">1.11023963325</td>
      </tr>
      <tr>
        <td class="year">1903</td>
        <td class="data">1.10876094164</td>
      </tr>
      ...
      <tr>
        <td class="year">2007</td>
        <td class="data">1.10725265753</td>
      </tr>
      <tr>
        <td class="year">2008</td>
        <td class="data">1.10793365185</td>
      </tr>
      <tr>
        <td class="year">2009</td>
        <td class="data">1.10865537105</td>
      </tr>
    </table>
  </body>
</html>

The first step is to calculate ratios, which we did in the previous section. The main function of our program is:

def main(args):
    '''Create web page showing temperature ratios for two countries.'''

    assert len(args) == 4, \
           'Usage: make_data_page template_filename output_filename country_1 country_2'
    template_filename = args[0]
    output_filename = args[1]
    country_1 = args[2]
    country_2 = args[3]

    page = make_page(template_filename, country_1, country_2)

    writer = open(output_filename, 'w')
    writer.write(page)
    writer.close()

if __name__ == '__main__':
    main(sys.argv[1:])

Most of the work is done by make_page, which gets temperature data for two countries and fills in a Jinja2 template with it. Using the get_temps function we wrote earlier, it is:

def make_page(template_filename, country_1, country_2):
    '''Create page showing temperature ratios.'''

    data_1 = get_temps(country_1)
    data_2 = get_temps(country_2)
    # get_temps returns a list of {'year': ..., 'data': ...} entries;
    # re-key the data by year so the template can look values up directly.
    data_1 = dict((entry['year'], entry['data']) for entry in data_1)
    data_2 = dict((entry['year'], entry['data']) for entry in data_2)
    years = data_1.keys()
    years.sort()
    the_date = date.isoformat(date.today())  # Format today's date

    loader = jinja2.FileSystemLoader(['.'])
    environment = jinja2.Environment(loader=loader)
    template = environment.get_template(template_filename)

    result = template.render(country_1=country_1, data_1=data_1,
                             country_2=country_2, data_2=data_2,
                             years=years, the_date=the_date)
    return result

The only new thing here is the use of date.isoformat and date.today to format today's date as something like "2013-09-15".

To finish, we need a Jinja2 template for the pages we want to create:

<!DOCTYPE html>
<html>
  <head>
    <title>Temperature Ratios of {{country_1}} and {{country_2}} as of {{the_date}}</title>
  </head>
  <body>
    <h1>Temperature Ratios of {{country_1}} and {{country_2}}</h1>
    <h2>Calculated {{the_date}}</h2>
    <table>
      <tr>
        <td>Year</td>
        <td>{{country_1}}</td>
        <td>{{country_2}}</td>
        <td>Ratio</td>
      </tr>
      {% for year in years %}
      <tr>
        <td>{{year}}</td>
        <td>{{data_1[year]}}</td>
        <td>{{data_2[year]}}</td>
        <td>{{data_1[year] / data_2[year]}}</td>
      </tr>
      {% endfor %}
    </table>
  </body>
</html>

Let's run it for Australia and Canada:

$ python make_data_page.py temp_ratio.html /tmp/aus-can.html AUS CAN

Sure enough, the file /tmp/aus-can.html contains:

<!DOCTYPE html>
<html>
  <head>
    <title>Temperature Ratios of AUS and CAN as of 2013-02-10</title>
  </head>
  <body>
    <h1>Temperature Ratios of AUS and CAN</h1>
    <h2>Calculated 2013-02-10</h2>
    <table>
      <tr>
        <td>Year</td>
        <td>AUS</td>
        <td>CAN</td>
        <td>Ratio</td>
      </tr>
      
      <tr>
        <td>1901</td>
        <td>294.507021</td>
        <td>265.477581</td>
        <td>1.10934799048</td>
      </tr>
      
      <tr>
        <td>1902</td>
        <td>294.532462</td>
        <td>265.2872886</td>
        <td>1.11023963325</td>
      </tr>

      ...
      
      <tr>
        <td>2009</td>
        <td>295.07194</td>
        <td>266.1529883</td>
        <td>1.10865537105</td>
      </tr>
      
    </table>
  </body>
</html>

This looks right, but most experienced programmers would ask us to make one improvement. Our program doesn't actually calculate temperature ratios; that's done by this line in the template:

        <td>{{data_1[year] / data_2[year]}}</td>

Experience shows that the more calculations we do in our views (i.e., our information displays), the harder they are to maintain. What we should do is:

  1. create another dictionary called ratios in the Python program and pass it into the template, and
  2. have the template display those values rather than calculating ratios itself.

Splitting things this way is extra work in this small case, but it's the best way to manage information as our displays become more complex.

Running a Local Server

The HTTP servers that come with the standard Python library are useful for practicing these things in class. To start serving files, we go into the directory that contains them and run:

$ python -m SimpleHTTPServer 8080

-m SimpleHTTPServer tells Python to find the SimpleHTTPServer library and run it as a program; the parameter 8080 tells it which port to use. (It's normal to run HTTP servers on port 80, but your system may forbid you from doing that if you don't have administrator privileges. In Python 3, the equivalent command is python -m http.server 8080.) To get files, we use localhost as the site and include the port number, so the URL is http://localhost:8080/index.html, or more simply, http://localhost:8080/.

Key Points

  • Creating static files is a safe, simple alternative to providing content dynamically.
  • Use a program to get and manipulate data, and a template to generate the page.
  • Views should display values, not calculate them.

Challenges

FIXME

Creating an Index

Objectives

  • Create and update an index for a set of pages.
  • Explain why having an index is important.

Lesson

If Carla is calculating temperature ratios for many different countries, how will other scientists know which ones she has done? In other words, how can she make her data findable?

The standard answer for hundreds of years has been, "Create an index." On the web, we can do this by creating a file called index.html and putting it in the directory that holds our data files.

Indexing Conventions

We don't have to call our index file index.html, but it's best to do so. By default, most web servers will give clients that file when they're asked for the directory itself. In other words, if someone points a browser (or any other program) at http://my.site/tempratio/, the web server will look for /tempratio. When it realizes that path is a directory rather than a file, it will look inside that directory for a file called index.html and return that. This is not guaranteed—system administrators can and do set up other default behaviors—but it is a common convention, and we can always tell our colleagues to fetch http://my.site/tempratio/ if they want the current index anyway.

What should be in index.html? The answer is simple: a table of some kind showing what files are available, when they were created, and where they are. The first piece of information is the most important; the second allows users to determine what has been added since they last looked at our site without having to download actual data files, while the third tells them how to get what they want. Our index.html will therefore be something like this:

<html>
  <head>
    <title>Index of Average Annual Temperature Ratios</title>
    <meta name="revised" content="2013-09-15" />
  </head>
  <body>
    <h1>Index of Average Annual Temperature Ratios</h1>
    <table class="data">
      <tr>
        <td class="country">AUS</td>
        <td class="country">CAN</td>
        <td class="revised">2013-09-12</td>
        <td class="revised"><a href="http://my.site/tempratio/AUS-CAN.html">download</a></td>
      </tr>
      ...
      <tr>
        <td class="country">MYS</td>
        <td class="country">NOR</td>
        <td class="revised">2013-09-15</td>
        <td class="download"><a href="http://my.site/tempratio/MYS-NOR.html">download</a></td>
      </tr>
    </table>
  </body>
</html>

Why Explicit URLs?

Strictly speaking, we don't need to store the URLs in the index file: we could instead tell people that if they got the index from http://my.site/tempratio/index.html, then the data for AUS and CAN is in http://my.site/tempratio/AUS-CAN.html, and let them construct the URL themselves. However, that puts more of a burden on the user both in the short term (since more coding is required) and in the long term (since the rule for constructing the URL for a particular data set could well change). It also effectively hides our data from search engines, since there's no way for them to know what our URL construction rule is.

Now, unlike our actual data files, this index file is added to incrementally: each time we generate a new version, we have to include all the data that was in the old version as well. We therefore need to remember what we've done. The usual way to do this in a real application is to use a database, but for our purposes, a plain old text file will suffice.

We could make up a format to store the information we need, such as:

Updated 2013-05-09
AUS CAN 2013-03-07
AUS NOR 2013-03-09
CAN NOR 2013-04-22
CAN MDG 2013-05-09

but it's much simpler just to use JSON:

{
    "updated" : "2013-05-09",
    "entries" : [
        ["AUS", "CAN", "2013-03-07"],
        ["AUS", "NOR", "2013-03-09"],
        ["CAN", "NOR", "2013-04-22"],
        ["CAN", "MDG", "2013-05-09"]
    ]
}

Loading this data is as simple as:

import json
reader = open('index.json', 'r')
check = json.load(reader)
reader.close()
print check
{u'updated': u'2013-05-09', u'entries': [[u'AUS', u'CAN', u'2013-03-07'], [u'AUS', u'NOR', u'2013-03-09'], [u'CAN', u'NOR', u'2013-04-22'], [u'CAN', u'MDG', u'2013-05-09']]}

(Remember, the 'u' in front of each string signals that these strings are actually stored as Unicode, but we can safely ignore that for now.) Let's rewrite the main function of our temperature ratio program so that it creates the index as well as the individual page:

import sys
import os
from datetime import date
import jinja2
import json
from temperatures import get_temps

INDIVIDUAL_PAGE = 'temp_ratio.html'
INDEX_PAGE = 'index.html'
INDEX_FILE = 'index.json'

def main(args):
    '''
    Create web page showing temperature ratios for two countries,
    and update the index.html page with the new entry.
    '''

    assert len(args) == 5, \
           'Usage: make_indexed_page url_base template_dir output_dir country_1 country_2'
    url_base, template_dir, output_dir, country_1, country_2 = args
    the_date = date.isoformat(date.today())

    loader = jinja2.FileSystemLoader([template_dir])
    environment = jinja2.Environment(loader=loader)

    page = make_page(environment, country_1, country_2, the_date)
    save_page(output_dir, '%s-%s.html' % (country_1, country_2), page)

    index_data = load_index(output_dir, INDEX_FILE)
    index_data['updated'] = the_date
    index_data['entries'].append([country_1, country_2, the_date])
    save_page(output_dir, INDEX_FILE, json.dumps(index_data))

    page = make_index(environment, url_base, index_data)
    save_page(output_dir, INDEX_PAGE, page)

Since we will be expanding templates in a couple of different functions, we move the creation of the Jinja2 environment to the main program. We then pass the environment into both make_page and a new function called make_index, and use another new function, save_page, to save generated pages where they need to go. (Note that we update the index data before rewriting the index HTML page, so that the updates to the index appear in the HTML. We did these two steps in the wrong order in the first version of this program that we wrote, and it was several hours before we noticed the error...)

save_page is the simplest function to write, so let's do that:

def save_page(output_dir, page_name, content):
    '''Save text in a file output_dir/page_name.'''
    path = os.path.join(output_dir, page_name)
    writer = open(path, 'w')
    writer.write(content)
    writer.close()
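A quick way to convince ourselves save_page works is to round-trip a page through a scratch directory (the definition is repeated here so the snippet is self-contained, and the directory is created with the standard library's tempfile module):

```python
import os
import tempfile

def save_page(output_dir, page_name, content):
    '''Save text in a file output_dir/page_name.'''
    path = os.path.join(output_dir, page_name)
    writer = open(path, 'w')
    writer.write(content)
    writer.close()

# Write a tiny page into a temporary directory and read it back.
temp_dir = tempfile.mkdtemp()
save_page(temp_dir, 'test.html', '<html></html>')
print(open(os.path.join(temp_dir, 'test.html')).read())
```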

Our revised make_page function is shorter than our original, since the environment is now being created in main. It is also now being passed the date (since that is used to update the index as well), and uses a fixed template specified by the global variable INDIVIDUAL_PAGE. The result is:

def make_page(environment, country_1, country_2, the_date):
    '''Create page showing temperature ratios.'''

    data_1 = get_temps(country_1)
    data_2 = get_temps(country_2)
    years = sorted(data_1.keys())

    template = environment.get_template(INDIVIDUAL_PAGE)
    result = template.render(country_1=country_1, data_1=data_1,
                             country_2=country_2, data_2=data_2,
                             years=years, the_date=the_date)

    return result

The function that loads existing index data is also pretty simple:

def load_index(output_dir, filename):
    '''Load index data from output_dir/filename.'''

    path = os.path.join(output_dir, filename)
    reader = open(path, 'r')
    result = json.load(reader)
    reader.close()
    return result
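One wrinkle this function doesn't handle: the very first time the program runs, there is no index.json to read, so open will fail. Here is a sketch of one way to cope; the empty-index structure mirrors the JSON shown earlier, but the exact default is our own choice:

```python
import json
import os

def load_index(output_dir, filename):
    '''Load index data, or start a fresh index if none exists yet.'''
    path = os.path.join(output_dir, filename)
    if not os.path.exists(path):
        # First run: no index file, so return an empty structure
        # with the same shape as index.json.
        return {'updated': None, 'entries': []}
    reader = open(path, 'r')
    result = json.load(reader)
    reader.close()
    return result

print(load_index('/no/such/directory', 'index.json'))
```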

All that's left is the function that regenerates the HTML version of the index:

def make_index(environment, url_base, index_data):
    '''Refresh the HTML index page.'''

    template = environment.get_template(INDEX_PAGE)
    return template.render(url_base=url_base,
                           updated=index_data['updated'],
                           entries=index_data['entries'])

and the HTML template it relies on:

<!DOCTYPE html>
<html>
  <head>
    <title>Index of Average Annual Temperature Ratios</title>
    <meta name="revised" content="{{updated}}" />
  </head>
  <body>
    <h1>Index of Average Annual Temperature Ratios</h1>
    <table class="data">
      {% for entry in entries %}
      <tr>
        <td class="country">{{entry[0]}}</td>
        <td class="country">{{entry[1]}}</td>
        <td class="revised">{{entry[2]}}</td>
        <td class="download"><a href="{{url_base}}/{{entry[0]}}-{{entry[1]}}.html">download</a></td>
      </tr>
      {% endfor %}
    </table>
  </body>
</html>
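To see the template's loop in action without setting up a template directory, we can render similar markup from a string. This is just a demonstration, assuming the jinja2 package used earlier in the chapter is installed; the markup is trimmed down from the real template:

```python
import jinja2

# A cut-down version of the loop in the index template.
TEMPLATE = '{% for entry in entries %}' \
           '<td>{{entry[0]}}-{{entry[1]}} revised {{entry[2]}}</td>' \
           '{% endfor %}'

template = jinja2.Environment().from_string(TEMPLATE)
html = template.render(entries=[['AUS', 'CAN', '2013-03-07'],
                                ['MYS', 'NOR', '2013-09-15']])
print(html)
```

Each element of entries fills in one row, just as in the full template.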

Key Points

  • Every collection of data should have an index.
  • The index should specify when things were updated, as well as what they are.
  • The URLs linking to data files should be absolute, so that client programs do not have to modify them in order to use them.

Challenges

FIXME

Syndicating Data

Objectives

FIXME

Lesson

We'll now use what we have learned to build a simple tool to download new temperature comparisons from a web site. In broad strokes, our program will keep a list of URLs to download data from, along with a timestamp showing when data was last downloaded. When we run the program, it will poll each site to see if any new data sets have been added since the last check. If any have, the program will display their URLs.

In order for this to work, each of the sites that's providing data needs to be able to tell us what data sets it has calculated, and when they were created. This information is in the site's index.html file in human-readable form, but it's also in the index.json file each site is maintaining. Client programs can load this file directly without having to do any parsing, so we'll rely on that.

Making Life Simpler

An earlier version of this tutorial loaded the HTML version of the index and extracted dates and URLs from it. Doing so only required twelve extra lines of code—but an extra 1200 words to explain how to read HTML into a program and find things in it. Storing information in machine-friendly formats for machines to use makes life a lot simpler...

The next step is to decide how to keep track of what we have downloaded and when. The simplest thing is to create another JSON file containing the timestamp and the list of URLs. We'll call this sources.json:

{
    "timestamp" : "2013-05-02T07:04:03",
    "sources" : [
        "http://software-carpentry.org/temperatures/index.json",
        "http://some.other.site/some/path/index.json"
    ]
}

(Again, a larger application would use a database of some kind, but that's more than we need right now.) Each time we run our program, it will read this file, then download each index.json file. If any of those files contain links to data sets that are newer than the timestamp, it will print the data set's URL. (A real data analysis program would download the data and do something with it.) We will then save a fresh copy of sources.json with an updated timestamp (Figure XXX). Our main program looks like this:

from datetime import datetime

def main(sources_path):
    '''Check all data sites in list, then update timestamp of sources.json.'''
    old_timestamp, all_sources = read_sources(sources_path)
    # Store the timestamp as an ISO 8601 string so that it can be
    # saved as JSON and compared with the dates in the index files.
    new_timestamp = datetime.now().isoformat()
    for source in all_sources:
        for url in get_new_datasets(old_timestamp, source):
            process(url)
    write_sources(sources_path, new_timestamp, all_sources)

Figure XXX: Syndication Lifecycle

That seems pretty simple; the only subtlety is that we calculate the new timestamp before we start checking for new datasets. The reason is that this check might take anything from a few seconds to a few hours, depending on how busy the Internet is and how much data we actually download. If we wait until we're done and then record that moment as the new timestamp, then the next time we run our program, we won't download any datasets that were created between the time we started the first run of our program and the time it finished (Figure XXX).

Figure XXX: When to Create Timestamps

We now have four functions to write: read_sources, write_sources, get_new_datasets, and process. Reading and writing the sources.json file is pretty simple:

import json

def read_sources(path):
    '''Read timestamp and data sources from JSON file.'''
    reader = open(path, 'r')
    data = json.load(reader)
    reader.close()
    timestamp = data['timestamp']
    sources = data['sources']
    return timestamp, sources

def write_sources(sources_path, timestamp, sources):
    '''Write timestamp and data sources to JSON file.'''
    data = {'timestamp' : timestamp,
            'sources'   : sources}
    writer = open(sources_path, 'w')
    json.dump(data, writer)
    writer.close()
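A round trip through a temporary file shows that these two functions are inverses of each other. The definitions are repeated below so the snippet stands alone, and the timestamp value is just an example:

```python
import json
import os
import tempfile

def write_sources(sources_path, timestamp, sources):
    '''Write timestamp and data sources to JSON file.'''
    data = {'timestamp' : timestamp,
            'sources'   : sources}
    writer = open(sources_path, 'w')
    json.dump(data, writer)
    writer.close()

def read_sources(path):
    '''Read timestamp and data sources from JSON file.'''
    reader = open(path, 'r')
    data = json.load(reader)
    reader.close()
    return data['timestamp'], data['sources']

# Write a sources file into a scratch directory, then read it back.
path = os.path.join(tempfile.mkdtemp(), 'sources.json')
write_sources(path, '2013-05-02T07:04:03',
              ['http://software-carpentry.org/temperatures/index.json'])
print(read_sources(path))
```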

What about processing a URL? Right now, we're just going to print it, though in a real application we would probably download the data and do some further calculations with it:

def process(url):
    '''Placeholder for processing a data set given its URL.'''
    print url

Finally, we need to construct a list of dataset URLs given the URL of an index.json file:

import json
import requests

def get_new_datasets(last_checked, index_url):
    '''Return a list of URLs of datasets that are newer than the timestamp.'''
    response = requests.get(index_url)
    index_data = json.loads(response.text)
    result = []
    for (country_a, country_b, updated) in index_data['entries']:
        # ISO 8601 timestamps compare correctly as strings.
        if updated >= last_checked:
            dataset_url = make_dataset_url(index_url, country_a, country_b)
            result.append(dataset_url)
    return result

The logic here is straightforward: grab the index.json file, check each dataset to see if it's newer than the last time we checked, and if it is—hm. This code uses a not-yet-written function called make_dataset_url to construct the URL for the specific dataset from the URL of the index file and the two country codes, but as we discussed earlier, asking client programs to construct links themselves is a bad idea. Instead, we should modify the index.json files so that they include the URLs. Doing this is left as an exercise for the reader.

But hang on: what exactly are we downloading when we download data sets? Right now, our temperature ratio files are all HTML pages; if we want to use that information in programs, it would be a lot easier if producers generated JSON files that consumers could use directly. It's almost trivial to extend our original program to produce such a file each time it produces a new HTML file, and to include the URLs for both files in both versions of the index (Figure XXX). Once we've done that, we have a first-class data syndication system: human-friendly and machine-friendly formats live side by side, so scientists and programs all over the world can make use of our results as soon as they appear.
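As a sketch of that extension, each run could emit a machine-readable twin of the HTML page. The helper name and the JSON layout below are our own invention, not part of the program shown earlier; the string it returns would be handed to save_page alongside the HTML:

```python
import json

def make_dataset_json(country_1, country_2, ratios_by_year):
    '''Return the machine-readable twin of a temperature ratio page.
    (make_dataset_json is a hypothetical helper; ratios_by_year maps
    year strings to ratio values.)'''
    return json.dumps({'country_1' : country_1,
                       'country_2' : country_2,
                       'ratios'    : ratios_by_year})

content = make_dataset_json('AUS', 'CAN', {'1901': 1.10, '1902': 1.08})
print(content)
```

A consumer can then call json.loads on the downloaded file instead of parsing HTML.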

Figure XXX: Final System

Key Points

  • Provide human-readable and machine-readable versions of everything.

Challenges

FIXME

Summary

The web has changed in many ways over the last 20 years, not all of them for the better. An HTML page on a modern commercial site is likely to include dozens or hundreds of lines of JavaScript that depend on several large, complicated libraries, and which generate the page's content on the fly inside the browser. Such a "page" is really a small (or not-so-small) program rather than a document in the classical sense of the word, and while that may produce a better experience for human users, it makes life more difficult for programs (and for people with disabilities, whose assistive aids are all too easy to confuse). And while XML is widely used for representing data, many people believe that younger alternatives like JSON do a better job of balancing the needs of human and computer readers.

Regardless of the technology used, though, the web's basic design principles are both simple and stable: tell people where data is, rather than giving them a copy; make the data itself and your names for it easy for both human beings and computers to understand; remix other people's data, and allow them to remix yours.