Skimpy Forum: An Application of Perl and XML
by Erik T. Ray, coauthor of Perl and XML
06/20/2002
Every day, thousands, if not millions, of people view and post messages on Web forums. These forums are public, like billboards, but organized like email into threads, and sorted. My friends and I have been using them for years to keep in touch.
Over time, forums have become quite ornate, with polished designs like those you find on Slashdot and elsewhere. But I told my friends that at its heart, a forum is really very simple. In fact, I could probably write one over the weekend. "OK," they said, "let's see you do that."
Oops. I had just committed myself to another project -- and I have more than enough work to do already. With my respectability at stake, I started to work on Skimpy Forum, a Perl- and XML-based CGI application. Here are the things I wanted Skimpy Forum to do:
What's missing is user authentication. Arguably, this is very important for a forum, because you want to be able to keep out anyone who abuses the rules. However, this being a project between me and a couple of friends, I didn't think it was strictly necessary. I can always add a cookie-based user authentication scheme later. For now, I'll trust that the user is who he says he is.
Now on to the Common Gateway Interface (CGI) design. I used a
parameter called action to tell the program which action to
perform: display a list of threads, add a post, and so on.
The actions supported are:
| Action | Feature |
|---|---|
| (none) | display list of all threads |
| start | create a new thread |
| show | show a thread |
| post | post a new message to a thread |
| quote | quote a previous post |
Other parameters supply action-specific information. For example, the following CGI query string would add a new post to thread number 4 and attribute it to "Bubba":
action=post&name=Bubba&thread=4&content=hello
So that's how the forum would look from the outside. Now to design the innards. The heart of the program is its data structure. For storage, I decided on XML rather than a database because it's simpler and faster to set up. True, databases are more scalable and offer faster performance, but they are also more complex and would take too much time to get going. Besides, I have a soft spot for XML and thought it would be fun for this project. One cool thing about XML is that you can view its guts in any text editor, whereas a database usually has to be inspected through its own interface.
I devised an XML markup language to hold the threads and post information. These were the elements I came up with:
| Element | Purpose |
|---|---|
| forum | root element |
| thread | contain all the posts in a thread |
| thread/title | title of a thread |
| post | contain data for a posted message |
| post/from | name of the contributor |
| post/date | when the post was submitted |
| post/content | contents of the message |
Example 1 shows a data file with one thread containing one post.
<?xml version="1.0"?>
<forum>
  <thread num="1">
    <title>Excitin' Stuff</title>
    <post id="1">
      <from>Fat Albert</from>
      <date>Sun Jun 2 20:49:49 2002</date>
      <content>hey hey hey</content>
    </post>
  </thread>
</forum>
Note the addition of id attributes for posts and
num attributes for threads. These serve as unique identifiers
that I use to select particular threads and posts. Why not use the
id attribute in both threads and posts? In XML, it's
traditional to use id as a unique identifier across
all elements, regardless of type. That means having a thread
with id="1" and a post with id="1" is forbidden. I
want to keep the two separate, each with its own counting scheme, so I
used different attributes.
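To make the role of these attributes concrete, here is a minimal sketch of how a thread and a post might be selected by num and id with XML::LibXML. It is an illustration only, not part of the skimpyforum listing: the data is the Example 1 document held in a string, and the variable names are my own assumptions.

```perl
use strict;
use warnings;
use XML::LibXML;

# The Example 1 data, inlined so the sketch is self-contained.
my $xml = <<'END';
<?xml version="1.0"?>
<forum>
  <thread num="1">
    <title>Excitin' Stuff</title>
    <post id="1">
      <from>Fat Albert</from>
      <date>Sun Jun 2 20:49:49 2002</date>
      <content>hey hey hey</content>
    </post>
  </thread>
</forum>
END

my $doc = XML::LibXML->new->parse_string( $xml );

# select thread number 1 by its num attribute
my ($thread) = $doc->findnodes( '//thread[@num="1"]' );
my $title = $thread->findvalue( 'title' );

# select post 1 inside that thread by its id attribute
my ($post) = $thread->findnodes( 'post[@id="1"]' );
my $from = $post->findvalue( 'from' );

print "$title, posted by $from\n";
```

Because threads are matched on num and posts on id, a thread numbered 1 and a post numbered 1 can coexist without the two lookups interfering.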
I decided not to make a Document Type Definition (DTD), a formal description of the language that allows you to do high-level testing of grammar called "validation." Validation wouldn't be necessary because I trusted the source of the file: namely, the program itself. Once it was debugged, I could trust that it wouldn't mess up the structure.
With the planning complete, I rolled up my sleeves for coding. I needed a script to handle CGI queries and update the data file. Perl is what I know best and it happens to be good at generating HTML, so that was my language of choice. In fact, I use Perl and XML together all the time.
First up, I needed to choose a parser to turn the XML into an
object tree for manipulation by the program. There are many parsers in
the Perl module pantheon, but there is one that I like the most --
XML::LibXML. Written by Matt Sergeant and extended and maintained by
Christian Glahn, it's an interface to the C library libxml2 (a project under the auspices of the GNOME group). I have been using it for a few months now and I have to admit it is an amazingly useful module. Here are a few of its features:
XML::LibXML is also fast. It is, after all, an interface to a parser written in C. In recent months, it has become very stable too. So I like to think of XML::LibXML as my handy Swiss Army knife of XML processing.
The script, called skimpyforum, is listed here in Example 2.
Now, let's walk through the code. The script starts by creating a
parser object and using it to read the forum's data file. Then
we call handle_query() to get the process rolling. The global
variable $doc is a reference to an object tree
created by the XML::LibXML parser. It uses a
standard Document Object Model (DOM) interface for manipulating
pieces, plus the XPath interface to find information inside of the
document. Specifically, the object is an
instance of XML::LibXML::Document, which has a reference to the root
element of the data file object, which contains all of the other parts of
the document.
use XML::LibXML;

# Read data file and store in DOM object whose ref is $doc
my $datafile = "/Library/WebServer/Documents/mun/forum.data";
my $parser = XML::LibXML->new;
my $doc = $parser->parse_file( $datafile );

# munch on the CGI query and format results in an HTML page
handle_query();
And here's handle_query(). It delegates tasks based on the value of
the CGI parameter action.
sub handle_query {
    #
    # Decide what to do based on the action parameter
    # from the query string we received.
    #
    my %params = get_params();

    # show a thread
    if( $params{action} eq 'show' ) {
        gen_page_thread( $params{thread} );

    # start a new thread
    } elsif( $params{action} eq 'start' ) {
        my $tid = add_thread( $params{title} );
        update_data();
        gen_page_thread( $tid );

    # post a message
    } elsif( $params{action} eq 'post' ) {
        add_post( $params{name}, $params{content}, $params{thread} );
        update_data();
        gen_page_thread( $params{thread} );

    # quote a post
    } elsif( $params{action} eq 'quote' ) {
        gen_page_quote( $params{thread}, $params{id} );

    # no action specified; show list of threads
    } else {
        gen_page_toc();
    }
}
The subroutines get_params() and unescape() are in the code listing. Standard CGI coding stuff.
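For readers without the listing at hand, the following is a rough sketch of what query-string parsing of this kind typically involves. It is not the code from the listing: parse_query() is a hypothetical name, it works on a string passed in rather than on the CGI environment, and the real get_params() may differ (for example, by reading POST data from standard input).

```perl
use strict;
use warnings;

# Decode URL escapes: "+" means space, %XX is a hex-encoded byte.
sub unescape {
    my $text = shift;
    $text =~ tr/+/ /;
    $text =~ s/%([0-9A-Fa-f]{2})/chr( hex( $1 ))/ge;
    return $text;
}

# Hypothetical helper: split a query string into a hash of parameters.
sub parse_query {
    my $query = shift;
    my %params;
    foreach my $pair ( split /[&;]/, $query ) {
        my( $key, $value ) = split /=/, $pair, 2;
        next unless defined $key and length $key;
        $value = '' unless defined $value;
        $params{ unescape( $key ) } = unescape( $value );
    }
    return %params;
}

my %params = parse_query( 'action=post&name=Bubba&thread=4&content=hello+there%21' );
print "$params{name} says: $params{content}\n";   # prints "Bubba says: hello there!"
```

In production code the standard CGI module handles these details, including multi-valued parameters and POST bodies, which this sketch ignores.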
The routine update_data() is important because we need to save new information, such as posts and threads, back into the data file. It first generates a datestamp in the DOM object, then uses the object's
toString() method to convert the tree to text, which is written to the data file. (Note that the permissions on the file have to be set
so that the Web server can read from and write to it. I set the owner of
the file to "www," which is the user the Web server runs as.)
I thought it would be useful to have a datestamp in the file. So just before saving to a file, I have the script create a comment and slip it into the root element just before the first child.
sub update_data {
    #
    # Write data back to file.
    #
    # add a datestamp just after root element start tag
    my $date = localtime;
    my $root = $doc->getDocumentElement;

    # remove previous datestamp
    foreach my $c ( $root->findnodes( 'comment()' )) {
        $c->getParentNode->removeChild( $c );
    }
    my $comment = $doc->createComment( " updated: $date " );
    $root->insertBefore( $comment, $root->getFirstChild );

    # output document object as string into our data file
    my $text = $doc->toString;

    # three-argument open with a lexical filehandle
    if( open( my $fh, '>', $datafile )) {
        $text =~ s/></>\n</g;      # start a new line between adjacent tags
        $text =~ s/\n\s+/\n/g;     # drop leading whitespace and blank lines
        print $fh $text;
        close $fh;
    } else {
        gen_page_error( "Could not update data file." );
    }
}
Note the different ways in which you can locate things in an XML
document. getDocumentElement() and getFirstChild() are DOM methods: the first returns the document's root element, and the second returns a node's first child. findnodes() uses an XPath expression to gather references to elements, comments, text, and other things. It's great to have this kind of flexibility. You'll see more examples throughout the script.
Here's the first of the delegate routines, in charge of adding a new
thread to the forum. It creates new elements and text "nodes" with
calls to $doc's createElement() and other methods. Only the document
object can create new parts. It also adds a unique ID attribute so that we can identify it later. This unique ID is generated by finding the highest thread ID and incrementing it.
sub add_thread {
    #
    # Add a new thread.
    #
    my $title = shift;    # title of the thread from query
    gen_page_error( "Need a valid title." ) unless( $title );

    # create new thread element and sub elements
    my $newthread = $doc->createElement( 'thread' );
    my $newtitle = $doc->createElement( 'title' );
    my $newtitletext = $doc->createTextNode( $title );

    # put the elements where they need to go
    $doc->getDocumentElement->appendChild( $newthread );
    $newthread->appendChild( $newtitle );
    $newtitle->appendChild( $newtitletext );

    # give thread a unique ID
    my $tid = highest_id_thread() + 1;
    $newthread->setAttribute( 'num', $tid );
    return $tid;
}
add_post() works pretty much the same way.
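Since add_post() itself is not reproduced here, the following is a sketch of how such a routine could look, assuming it takes the same three arguments that handle_query() passes (name, content, thread number) and builds the post, from, date, and content elements described earlier. The name add_post_sketch(), the inline test document, and the ID computation are my assumptions, not the code from the listing.

```perl
use strict;
use warnings;
use XML::LibXML;

# A tiny stand-in for the forum data file.
my $doc = XML::LibXML->new->parse_string(
    '<forum><thread num="4"><title>Test</title></thread></forum>' );

sub add_post_sketch {
    my( $name, $content, $tid ) = @_;

    # only accept a numeric thread number before using it in XPath
    return unless defined $tid and $tid =~ /^\d+$/;
    my ($thread) = $doc->findnodes( "//thread[\@num='$tid']" );
    return unless $thread;

    # next post ID: one more than the highest id attribute found
    my $pid = 0;
    foreach my $id ( $doc->findnodes( '//post/@id' )) {
        my $val = $id->findvalue( '.' );
        $pid = $val if $val > $pid;
    }
    $pid++;

    # build the post element and attach it to the thread
    my $post = $doc->createElement( 'post' );
    $post->setAttribute( 'id', $pid );
    $thread->appendChild( $post );

    # add the from, date, and content children in order
    foreach my $pair ( [ from    => $name ],
                       [ date    => scalar( localtime ) ],
                       [ content => $content ] ) {
        my $el = $doc->createElement( $pair->[0] );
        $el->appendChild( $doc->createTextNode( $pair->[1] ));
        $post->appendChild( $el );
    }
    return $pid;
}

my $pid = add_post_sketch( 'Bubba', 'hello', 4 );
print $doc->toString;
```

As in add_thread(), every new node is created through the document object and then attached with appendChild().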
Recall that we have to find the highest ID among threads to generate a unique ID. Here's the routine that does it. highest_id_thread() uses the findnodes() method of the DOM document object, which takes an XPath expression and returns a list of nodes that match. XPath expressions use a special syntax, similar to filesystem paths, to locate any part of a document. In this case, we're looking for the attribute num, which appears only in threads.
sub highest_id_thread {
    #
    # Return the value of the highest thread ID.
    #
    my @ids = ();
    foreach my $id ( $doc->findnodes( '//thread/@num' )) {
        my $idval = $id->findvalue( '.' );
        $ids[ $idval ] = 1;
    }
    return $#ids;
}
highest_id_post() works like this also, but looks for id attributes instead of num attributes.
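The array trick in highest_id_thread() may not be obvious at first glance, so here it is in isolation, using a hard-coded list of made-up ID values in place of the XPath results. Assigning to $ids[$idval] grows the array so that $idval is a valid index, and $#ids is always the last index, so after the loop it equals the largest ID seen.

```perl
use strict;
use warnings;

# Stand-in for the num values found in the data file (made up).
my @ids = ();
foreach my $idval ( 3, 1, 7 ) {
    $ids[ $idval ] = 1;    # array grows to hold index $idval
}
my $highest = $#ids;       # last index of the array
print "highest: ", $highest, "\n";

# With no IDs at all, the last index of an empty array is -1.
my @none = ();
my $empty = $#none;
print "empty: ", $empty, "\n";
```

This prints 7 and then -1. One consequence worth knowing: with an empty data file, the routine as shown would return -1, so the first thread created would get the number 0.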
Now for a routine that generates some HTML. gen_page_toc() (TOC = table of contents) calls gen_page_start() and gen_page_end() to create the top and bottom of an HTML page and uses the rest of its time to generate the middle. It calls gen_entry_thread() for each thread in the data file, which will output the title as a link. After the table is output, the routine creates a Web form used for adding new threads.
sub gen_page_toc {
    #
    # Generate a table of contents for threads.
    #
    # output beginning of page
    gen_page_start();
    print <<END;
<div class='ss'>
<h1>All the threads</h1>
END

    # generate an entry for each thread
    foreach my $thread ( $doc->findnodes( '//thread' )) {
        gen_entry_thread( $thread );
    }

    # create a form for adding a new thread
    print <<END;
</div>
<div class="s">
<h3>Add a thread</h3>
<form action="/cgi-bin/skimpyforum" method="POST">
<input name="action" type="hidden" value="start" />
Thread title: <input name="title" size="64" value="Excitin' Stuff"/>
<input type="submit" value="Create"/>
</form>
</div>
END
    gen_page_end();
}
And here's the routine gen_entry_thread():
sub gen_entry_thread {
    #
    # Output a piece of HTML representing an entry in the thread TOC.
    #
    my $thread = shift;    # thread element ref
    my $title = $thread->findvalue( 'title' );
    my $num = $thread->findvalue( '@num' );
    # the ampersand in the link is written as &amp; to keep the HTML valid
    print <<END;
<h3><a href="/cgi-bin/skimpyforum?action=show&amp;thread=$num">$title</a></h3>
END
}
Generating a page for a thread is very similar to
this. gen_page_thread() is like gen_page_toc(), except that it calls gen_entry_post() to output a representation of each post in the thread. We won't bother listing those routines here.
gen_page_quote(), gen_page_start(), gen_page_end(), and
gen_page_error() are all pretty straightforward, so we'll leave you to
read them in the code listing.
Here is a screen shot of the forum's top page showing three threads:
And here is a screen shot showing a sample thread view:
An unintended consequence of this program is that HTML tags in post content will be formatted in the thread display. You can make text bold simply by using <b> tags, and it will get passed along to your Web browser along with the generated HTML. Take a look at this monstrosity my friend created:
This is because I forgot to filter out ampersands and angle brackets in post text. They should be converted into entities (&amp;amp; and &amp;lt;) or some other characters. At first, I didn't really think it was a problem. After all, it's cool to be able to make text bold or italic or whatever without adding any code to the forum program. But this same loophole led to the first crash of my forum. Badly formed HTML in a post's content caused the whole XML file to become badly formed XML! Throw in something like "<foo>" with no corresponding end tag and the XML parser conks out. I quickly fixed this bug, but I wanted to let you know, so you can see that quick projects like this are often pretty fragile.
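A filter of the kind described above can be very small. The sketch below is one possible version, not the fix actually applied to skimpyforum; escape_markup() is a hypothetical name. The ampersand must be replaced first, otherwise the ampersands introduced by the other two substitutions would themselves be escaped.

```perl
use strict;
use warnings;

# Hypothetical helper: neutralize markup characters in user-supplied text.
sub escape_markup {
    my $text = shift;
    $text =~ s/&/&amp;/g;    # must come first
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

my $safe = escape_markup( 'I <b>love</b> R&B <foo>' );
print "$safe\n";    # prints: I &lt;b&gt;love&lt;/b&gt; R&amp;B &lt;foo&gt;
```

With this applied to post text, a stray "<foo>" shows up in the browser as literal characters instead of being treated as a tag, at the cost of losing the accidental bold and italic feature.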
My weekend project was a success. My friends were impressed. Then the feedback started rolling in. "I want user authentication." "How about personalized stylesheets?" "Can we edit our own posts?" Oh man. I've created a monster. And did I learn my lesson? Nope. I promised them I'd have user authentication with personalized configurations done by next week.
Erik T. Ray is a software wrangler and XML guru for O'Reilly Media.
Copyright © 2009 O'Reilly Media, Inc.