Skimpy Forum: An Application of Perl and XML

With the planning complete, I rolled up my sleeves for coding. I needed a script to handle CGI queries and update the data file. Perl is what I know best and it happens to be good at generating HTML, so that was my language of choice. In fact, I use Perl and XML together all the time.



First up, I needed to choose a parser to turn the XML into an object tree for manipulation by the program. There are many parsers in the Perl module pantheon, but there is one that I like the most -- XML::LibXML. Written by Matt Sergeant and extended and maintained by Christian Glahn, it's an interface to the C library libxml2 (a project under the auspices of the GNOME group). I have been using it for a few months now and I have to admit it is an amazingly useful module. Here are a few of its features:

  • It's an XML parser that checks that the XML is well-formed, and can parse DTDs to report validity errors.
  • It implements DOM level 2, giving you a tree representation of a document with a standard interface to traverse and manipulate elements.
  • It adds XPath support for very easy access to data located deep inside a document.
  • It implements SAX level 2, providing an event stream interface for fast data-driven processing.

XML::LibXML is also fast. It is, after all, an interface to a parser written in C. In recent months, it has become very stable too. So I like to think of XML::LibXML as my handy Swiss Army knife of XML processing.
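To make a couple of those features concrete, here is a minimal sketch of the well-formedness check and the XPath lookup. The sample document and the helper name is_well_formed() are invented for illustration; they are not part of skimpyforum.

```perl
use strict;
use warnings;
use XML::LibXML;

# A tiny document invented for this example.
my $xml = '<forum><thread num="1"><title>Hello</title></thread></forum>';

my $parser = XML::LibXML->new;

sub is_well_formed {
    # parse_string() dies on malformed XML, so trap the exception.
    my $text = shift;
    return eval { $parser->parse_string( $text ); 1 } ? 1 : 0;
}

print "good document parses\n"  if is_well_formed( $xml );
print "bad document rejected\n" unless is_well_formed( '<forum><thread></forum>' );

# Once parsed, a single XPath expression pulls data out of the tree.
my $doc = $parser->parse_string( $xml );
print "first title: ", $doc->findvalue( '//thread[1]/title' ), "\n";
```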

The script, called skimpyforum, is listed here in example 2.

Now, let's walk through the code. The script starts by creating a parser object and using it to read the forum's data file. Then we call handle_query() to get the process rolling. The global variable $doc is a reference to an object tree created by the XML::LibXML parser. It uses a standard Document Object Model (DOM) interface for manipulating pieces, plus the XPath interface to find information inside of the document. Specifically, the object is an instance of XML::LibXML::Document, which has a reference to the root element of the data file object, which contains all of the other parts of the document.

# Read data file and store in DOM object whose ref is $doc
my $datafile = "/Library/WebServer/Documents/mun/forum.data";
my $parser = XML::LibXML->new;
my $doc = $parser->parse_file( $datafile );

# munch on the CGI query and format results in an HTML page
handle_query();

And here's handle_query(). It delegates tasks based on the value of the CGI parameter action.

sub handle_query {
#
# Decide what to do based on the action parameter
# from the query string we received.
#
    my %params = get_params();

    # show a thread
    if( $params{action} eq 'show' ) {
        gen_page_thread( $params{thread} );

    # start a new thread
    } elsif( $params{action} eq 'start' ) {
        my $tid = add_thread( $params{title} );
        update_data();
        gen_page_thread( $tid );

    # post a message
    } elsif( $params{action} eq 'post' ) {
        add_post( $params{name}, $params{content}, $params{thread} );
        update_data();
        gen_page_thread( $params{thread} );

    # quote a post
    } elsif( $params{action} eq 'quote' ) {
        gen_page_quote( $params{thread}, $params{id} );

    # no action specified; show list of threads
    } else {
        gen_page_toc();
    }
}

The subroutines get_params() and unescape() are in the code listing. Standard CGI coding stuff.
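If you don't have the listing handy, here is roughly what such helpers tend to look like. This is a hypothetical sketch written for this walkthrough, not the code from the listing, and the real get_params() and unescape() may differ in detail.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the script's helpers; the real versions
# are in the code listing and may differ.

sub unescape {
    # Decode URL encoding: '+' means space, %XX is a hex-encoded byte.
    my $text = shift;
    $text = '' unless defined $text;
    $text =~ tr/+/ /;
    $text =~ s/%([0-9A-Fa-f]{2})/chr( hex( $1 ))/ge;
    return $text;
}

sub get_params {
    # Read the raw query from STDIN (POST) or the environment (GET)
    # and split it into a hash of decoded name/value pairs.
    my $query = '';
    if( ( $ENV{REQUEST_METHOD} || '' ) eq 'POST' ) {
        read( STDIN, $query, $ENV{CONTENT_LENGTH} || 0 );
    } else {
        $query = defined $ENV{QUERY_STRING} ? $ENV{QUERY_STRING} : '';
    }

    my %params;
    foreach my $pair ( split /[&;]/, $query ) {
        my( $name, $value ) = split /=/, $pair, 2;
        next unless defined $name && length $name;
        $params{ unescape( $name ) } = unescape( $value );
    }
    return %params;
}
```

With QUERY_STRING set to action=show&thread=3, get_params() returns a hash in which action is 'show' and thread is '3', which is exactly what handle_query() tests.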

The routine update_data() is important because we need to save new information, such as posts and threads, back into the data file. It first generates a datestamp in the DOM object, then uses the object's toString() method to convert the tree to text, which is written to the data file. (Note that the permissions on the file have to be set so that the Web server can read from and write to it. I set the owner of the file to be "www," which is the user the Web server runs as.)

I thought it would be useful to have a datestamp in the file. So just before saving to a file, I have the script create a comment and slip it into the root element just before the first child.

sub update_data {
#
# Write data back to file.
#
    # add a datestamp just after root element start tag
    my $date = localtime;
    my $root = $doc->getDocumentElement;
    # remove previous datestamp
    foreach my $c ( $root->findnodes( 'comment()' )) {
        $c->getParentNode->removeChild( $c );
    }
    my $comment = $doc->createComment( " updated: $date " );
    $root->insertBefore( $comment, $root->getFirstChild );

    # output document object as string into our data file
    my $text = $doc->toString;
    if( open( my $fh, '>', $datafile )) {
        $text =~ s/></>\n</g;
        $text =~ s/\n\s+/\n/g;
        print $fh $text;
        close $fh;
    } else {
        gen_page_error( "Could not update data file." );
    }
}

Note the different ways in which you can locate things in an XML document. getDocumentElement() and getFirstChild() are DOM methods: the first returns the document's root element, and the second returns a node's first child. findnodes() uses an XPath statement to gather references to elements, comments, text, and other things. It's great to have this kind of flexibility. You'll see more examples throughout the script.
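Here is a small side-by-side sketch of the two styles, run against an invented document shaped like the forum's data file after a save. It uses the shorter method names documentElement() and firstChild(), which XML::LibXML also provides.

```perl
use strict;
use warnings;
use XML::LibXML;

# Invented sample data in the shape update_data() produces.
my $doc = XML::LibXML->new->parse_string(
    '<forum><!-- updated: today --><thread num="1"><title>Hi</title></thread></forum>' );

# DOM style: step from node to node.
my $root  = $doc->documentElement;    # the <forum> element
my $first = $root->firstChild;        # here, the datestamp comment

# XPath style: describe what you want and let the library find it.
my @comments = $root->findnodes( 'comment()' );    # comment children of the root
my $title    = $doc->findvalue( '//thread[@num="1"]/title' );

print "root is <", $root->nodeName, ">\n";
print "first child is a comment\n" if $first->isa( 'XML::LibXML::Comment' );
print "comments under root: ", scalar( @comments ), "\n";
print "title of thread 1: $title\n";
```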

Here's the first of the delegate routines, in charge of adding a new thread to the forum. It creates new elements and text "nodes" with calls to $doc's createElement() and other methods. Only the document object can create new parts. The routine also gives the new thread a unique ID attribute, num, so that we can identify it later. This unique ID is generated by finding the highest thread ID and incrementing it.

sub add_thread {
#
# Add a new thread.
#
    my $title = shift;    # title of the thread from query
    gen_page_error( "Need a valid title." ) unless( $title );

    # create new thread element and sub elements
    my $newthread = $doc->createElement( 'thread' );
    my $newtitle = $doc->createElement( 'title' );
    my $newtitletext = $doc->createTextNode( $title );

    # put the elements where they need to go
    $doc->getDocumentElement->appendChild( $newthread );
    $newthread->appendChild( $newtitle );
    $newtitle->appendChild( $newtitletext );

    # give thread a unique ID
    my $tid = highest_id_thread() +1;
    $newthread->setAttribute( 'num', $tid );
    return $tid;
}

add_post() works pretty much the same way.
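For the curious, here is a sketch of what an add_post() along those lines might look like. It is not the routine from the listing: the element names post, name, and content are assumptions made for this example (only thread, title, and the num and id attributes appear in this article), and the highest_id_post() shown is a simplified stand-in.

```perl
use strict;
use warnings;
use XML::LibXML;

# Stand-in for the forum data file. Element names other than
# thread and title are assumptions made for this sketch.
my $doc = XML::LibXML->new->parse_string(
    '<forum><thread num="1"><title>Hi</title></thread></forum>' );

sub highest_id_post {
    # Simplified stand-in: highest post ID in use, or 0 if none.
    my $max = 0;
    foreach my $attr ( $doc->findnodes( '//post/@id' )) {
        my $val = $attr->value;
        $max = $val if $val > $max;
    }
    return $max;
}

sub add_post {
    my( $name, $content, $tid ) = @_;

    # only accept numeric thread IDs, since $tid goes into an XPath
    return unless defined $tid && $tid =~ /^\d+$/;

    # find the thread element this post belongs to
    my( $thread ) = $doc->findnodes( "//thread[\@num='$tid']" );
    return unless $thread;

    # create the post and its sub elements
    my $newpost    = $doc->createElement( 'post' );
    my $newname    = $doc->createElement( 'name' );
    my $newcontent = $doc->createElement( 'content' );
    $newname->appendChild( $doc->createTextNode( $name ));
    $newcontent->appendChild( $doc->createTextNode( $content ));

    # put the elements where they need to go
    $thread->appendChild( $newpost );
    $newpost->appendChild( $newname );
    $newpost->appendChild( $newcontent );

    # give post a unique ID
    my $pid = highest_id_post() + 1;
    $newpost->setAttribute( 'id', $pid );
    return $pid;
}
```

The shape is the same as add_thread(): create the nodes through $doc, attach them to the tree, then stamp the new element with the next free ID.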

Recall that we have to find the highest ID among threads to generate a unique ID. Here's the routine that does it. highest_id_thread() uses the findnodes() method of the DOM document object, which takes an XPath expression and returns a list of nodes that match. XPath expressions use a special syntax, similar to filesystem paths, to locate any part of a document. In this case, we're looking for the attribute num, which appears only in threads. Each ID found is used as an index into the array @ids, so the array's last index, $#ids, is the highest ID in use.

sub highest_id_thread {
#
# Return the value of the highest thread ID.
#
    my @ids = ();
    foreach my $id( $doc->findnodes( '//thread/@num' )) {
        my $idval = $id->findvalue( '.' );
        $ids[ $idval ] = 1;
    }
    return $#ids;
}

highest_id_post() works the same way, but looks for id attributes instead of num attributes.

Now for a routine that generates some HTML. gen_page_toc() (TOC = table of contents) calls gen_page_start() and gen_page_end() to create the top and bottom of an HTML page and uses the rest of its time to generate the middle. It calls gen_entry_thread() for each thread in the data file, which will output the title as a link. After the table is output, the routine creates a Web form used for adding new threads.

sub gen_page_toc {
#
# Generate a table of contents for threads.
#

    # output beginning of page
    gen_page_start();
    print <<END;
<div class='ss'>
<h1>All the threads</h1>
END

    # generate an entry for each thread
    foreach my $thread ( $doc->findnodes( '//thread' )) {
        gen_entry_thread( $thread );
    }

    # create a form for adding a new thread
    print <<END;
</div>
<div class="s">
<h3>Add a thread</h3>
<form action="/cgi-bin/skimpyforum" method="POST">
<input name="action" type="hidden" value="start" />
Thread title: <input name="title" size="64" value="Excitin' Stuff"/>
<input type="submit" value="Create"/>
</form>
</div>
END
    gen_page_end();
}

And here's the routine gen_entry_thread():

sub gen_entry_thread {
#
# Output a piece of HTML representing an entry in the thread TOC.
#
   my $thread = shift;    # thread element ref
   my $title = $thread->findvalue( 'title' );
   my $num = $thread->findvalue( '@num' );
   print <<END;
<h3><a href="/cgi-bin/skimpyforum?action=show&amp;thread=$num">$title</a></h3>
END
}

Generating a page for a thread is very similar to this. gen_page_thread() is like gen_page_toc(), except that it calls gen_entry_post() to output a representation of each post in the thread. We won't bother listing those routines here.

gen_page_quote(), gen_page_start(), gen_page_end(), and gen_page_error() are all pretty straightforward, so we'll leave you to read them in the code listing.

Here is a screen shot of the forum's top page showing three threads:

Screen shot.

And here is a screen shot showing a sample thread view:

Screen shot.

An unintended consequence of this program is that HTML tags in post content will be formatted in the thread display. You can make text bold simply by using <b> tags, and it will get passed along to your Web browser along with the generated HTML. Take a look at this monstrosity my friend created:

Screen shot.

This is because I forgot to filter out ampersands and angle brackets in post text. They should be converted into entities (&amp; and &lt;) or some other characters. At first, I didn't really think it was a problem. After all, it's cool to be able to make text bold or italic or whatever without adding any code to the forum program. But this same loophole led to the first crash of my forum. Badly-formed HTML in a post's content caused the whole XML file to become badly-formed XML! Throw in something like "<foo>" with no corresponding end tag and the XML parser conks out. I quickly fixed this bug, but I wanted to let you know, so you can see that quick projects like this are often pretty fragile.
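One way to do that conversion is a small filter applied to post text before it goes into the page. The helper below is a hypothetical sketch (the name escape_markup() is made up for this example), not necessarily the fix that went into the forum.

```perl
use strict;
use warnings;

sub escape_markup {
    # Hypothetical helper: turn markup characters into entities so
    # that post text is displayed literally instead of interpreted.
    my $text = shift;
    $text = '' unless defined $text;
    $text =~ s/&/&amp;/g;    # must come first, or we would re-escape our own entities
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# prints: I &lt;b&gt;love&lt;/b&gt; Perl &amp; XML
print escape_markup( 'I <b>love</b> Perl & XML' ), "\n";
```

The order of the substitutions matters: ampersands have to be handled before the other characters, because every entity the routine emits contains one.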

My weekend project was a success. My friends were impressed. Then the feedback started rolling in. "I want user authentication." "How about personalized stylesheets?" "Can we edit our own posts?" Oh man. I've created a monster. And did I learn my lesson? Nope. I promised them I'd have user authentication with personalized configurations done by next week.

Erik T. Ray is a software wrangler and XML guru for O'Reilly Media.


Return to the ONLamp.com.



