Database-backed web applications: checking the list

The list checker is similar to the hotlist application proper. Upon loading the page for checking the list, the user is presented with a list of links, sorted in some order. The difference is that each item is displayed with a set of links for doing maintenance on the item as well as the link to the page described. While this format has some problems, they are acceptable for a private application like this.

After experimenting with the layout a few times, I decided that the list should be sorted by decreasing HTTP return code from the page being fetched. This means the pages that fetched properly are at the bottom, and those that had problems are at the top. That code is also the first bit of information displayed for an item, and is a link that will remove that item from the database. Then comes the description, with a link to the page just as in the hotlist application. After that, in parentheses, come any problem reports. If the link was redirected, the English version of the HTTP status is presented, along with the URL the redirection provided. That text is a link to a maintenance function that updates the database by replacing the URL for the item with the one it was redirected to. If the title of the page fetched did not match the description, the title of the page is presented as well, and is a link that updates the database by replacing the description with the actual page title.
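To make the layout of one line concrete, here is a minimal sketch in present-day Python. It is not the checklist.py code: the function name render_item, the dictionary keys, and the action, id and new query parameters are all assumptions made for illustration.

```python
from html import escape
from http.client import responses
from urllib.parse import urlencode


def render_item(item, script="checklist.py"):
    """Build one line of the check list from a dict describing a checked item."""

    def action(name, **extra):
        # Hypothetical query-string layout: ?action=...&id=...[&new=...]
        query = urlencode({"action": name, "id": item["id"], **extra})
        return escape(f"{script}?{query}", quote=True)

    parts = [
        # The status code comes first and links to the delete function.
        f'<a href="{action("delete")}">{item["status"]}</a>',
        # Then the description, linking to the page itself.
        f'<a href="{escape(item["url"], quote=True)}">'
        f'{escape(item["description"])}</a>',
    ]

    problems = []
    if item.get("redirect"):
        # English phrase for the status, plus the URL the redirect gave.
        reason = responses.get(item["status"], "Unknown status")
        text = escape(f'{reason}: {item["redirect"]}')
        link = action("url", new=item["redirect"])
        problems.append(f'<a href="{link}">{text}</a>')
    if item.get("title") and item["title"] != item["description"]:
        # The fetched title differs from the stored description.
        link = action("title", new=item["title"])
        problems.append(f'<a href="{link}">{escape(item["title"])}</a>')

    if problems:
        parts.append("(" + ", ".join(problems) + ")")
    return " ".join(parts)
```

Every link in the line except the description performs a maintenance action, which is why the format only suits a private application.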

The code for checklist is similar to the hotlist application. It starts with a set of format strings used to print the pages it produces. The only ones that aren't variants of hotlist application code are deleted_format and changed_format, which are used to format the page returned after an item is deleted or changed.

Following that is the document class, derived from the Python library class SGMLParser. It provides an easy way to find the title in an HTML document. It includes methods invoked when the title tag starts and when it ends, which set an instance variable to note that we are processing the title, and a method invoked as data is processed that saves the title if it's being processed.
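The same idea can be sketched in current Python. SGMLParser is no longer part of the standard library, so this sketch swaps in html.parser.HTMLParser, which uses generic handle_starttag and handle_endtag methods rather than per-tag ones. The class and function names here are illustrative, not the original code.

```python
from html.parser import HTMLParser


class Document(HTMLParser):
    """Collect the text found inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False  # True while between <title> and </title>
        self.title = None      # filled in when the title element closes
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; only the first title is recorded.
        if tag == "title" and self.title is None:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            # Collapse runs of whitespace so the title is a single line.
            self.title = " ".join("".join(self._chunks).split())

    def handle_data(self, data):
        # Save data only while the title is being processed.
        if self.in_title:
            self._chunks.append(data)


def find_title(html_text):
    """Return the document title, or None if there is no title element."""
    parser = Document()
    parser.feed(html_text)
    parser.close()
    return parser.title
```

The instance variable in_title plays the role described above: it is set when the title starts, cleared when it ends, and checked whenever data arrives.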

After that is the checkedurl class. This class inherits from the Thread class in the Python library module threading, which provides a high-level interface to the system's threading facilities. The run method describes the actions that should be taken to check a URL, and is invoked in a new thread when the Thread class's start method is invoked. It uses the good_document and bad_document methods to build a dictionary of values for the URL. Good documents get a title from the find_title method, which uses a document object.
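A rough modern equivalent of such a class is sketched below, under several assumptions: the fetch goes through urllib.request.urlopen (passed in as opener so it can be replaced), failures that never produce an HTTP status get a made-up pseudo-status of 999 so they sort to the top, and find_title is a regular-expression stand-in for a parser-based title finder. None of these details come from the original checklist.py.

```python
import re
import threading
import urllib.error
import urllib.request

NO_RESPONSE = 999  # hypothetical pseudo-status for failures without an HTTP code


class CheckedURL(threading.Thread):
    """Check one URL in its own thread and leave a dict in self.result."""

    def __init__(self, url, description,
                 opener=urllib.request.urlopen, timeout=30):
        super().__init__()
        self.url = url
        self.description = description
        self.opener = opener
        self.timeout = timeout
        self.result = None

    def run(self):
        # Runs in the new thread once start() has been called.
        try:
            with self.opener(self.url, timeout=self.timeout) as response:
                self.result = self.good_document(response)
        except urllib.error.HTTPError as err:
            self.result = self.bad_document(err.code, str(err.reason))
        except OSError as err:  # includes URLError and socket timeouts
            self.result = self.bad_document(NO_RESPONSE, str(err))

    def good_document(self, response):
        body = response.read().decode("latin-1")
        return {"url": self.url, "description": self.description,
                "status": response.status, "final_url": response.geturl(),
                "title": self.find_title(body), "error": None}

    def bad_document(self, status, message):
        return {"url": self.url, "description": self.description,
                "status": status, "final_url": None,
                "title": None, "error": message}

    @staticmethod
    def find_title(text):
        # Stand-in only: a real version would use an HTML parser.
        match = re.search(r"<title[^>]*>(.*?)</title>", text, re.I | re.S)
        return " ".join(match.group(1).split()) if match else None
```

One caveat: urlopen follows redirects by itself, so this sketch never reports a 3xx status. Reporting redirects the way the list checker does would need a lower-level client such as http.client.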

Then there's the checker class, which inherits from the hotlist handler class. It extends the initialization method to use the checker formats, and simplifies the display_page method as it no longer depends on the type variable from the query string. The rest of the display methods are unchanged. The get_list method is changed to create a checkedurl object for each entry in the list, then start it with the Thread class's start method. It then walks the list of checkedurl objects, invoking the Thread class's join method for each to wait until that thread is finished, and then checks the status and updates that entry's checked database entry. That list is then sorted to make the higher status entries show up first, and those with changed titles above those without. Finally, there's a trio of do_ methods to handle requests to maintain the database. These all fetch the item with get_item, make the change, fetch the new version of that item - except for do_delete - then display the items (or item, for delete) with the appropriate format string.
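The sort order just described can be expressed with a key function. This is a small sketch rather than the original code, and it assumes each checked entry is a dictionary with status, description and title keys, where title is None when no title was fetched.

```python
def sort_checked(entries):
    """Return entries with the highest status first; within one status,
    entries whose fetched title differs from the description come first."""

    def key(entry):
        title = entry.get("title")
        title_changed = title is not None and title != entry["description"]
        # Negating the status sorts it in decreasing order. False sorts
        # before True, so "not title_changed" puts changed titles on top.
        return (-entry["status"], not title_changed)

    return sorted(entries, key=key)
```

Since sorted is stable, entries that tie on both parts of the key keep the order they had coming out of the database.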

checklist.py is installed the same way that hotlist.py was - by copying it to /usr/local/www/cgi-bin. Be warned that it can take quite a while to run, as it checks every item in the hotlist, and some of those may have to time out.

Like the code described for hotlist, most of the work is done by C code in other applications. The only new items of work are fetching the document and sorting the list. Fetching the document involves quite a bit of Python code, to parse the URL and the result - but is still dominated by the time taken to fetch the document over the network. The list sorting all happens in C. So once again, the fact that we're using an interpreted language won't make much difference in the application's speed.

This ability to modify data on the server brings with it a need for some form of security for that data.



Mike Meyer,
June, 1999