O'Reilly Databases

oreilly.comSafari Books Online.Conferences.

We've expanded our coverage and improved our search! Search for all things Database across O'Reilly!

Search Search Tips

advertisement
AddThis Social Bookmark Button

Listen Print Discuss Subscribe to Databases Subscribe to Newsletters

Data Mining Email
Pages: 1, 2

Part 3. Developing the Email Parser

Install the following modules from the CPAN, along with their dependencies:
  • Mail::MboxParser, a simple interface that provides read-only access to Unix-mailboxes
  • DBD::Pg, the database interface between Perl and the PostgreSQL database.



The mbox parser script first defines some global variables and then establishes a database connection with the PostgreSQL server:

my $dbh = DBI->connect($dsn,$user,$pass);
die $DBI::errstr unless defined $dbh;
$dbh->{PrintError} = 0;

The script then reads the mailbox and breaks down the component parts of each individual message:

my $mb = Mail::MboxParser->new(\@myarray,
	decode     => 'ALL',
	parseropts => $parseropts);

A while loop processes messages from the mailbox and identifies the interesting parts of the message:

my $id      = $msg->id;
my $from    = $msg->from->{email};
my $subject = $msg->header->{subject};

The MboxParser module identifies the main body of the message:

my $body = $msg->body($msg->find_body);

The main body and attachments of a message are referenced in the main and attachment tables with an oid. Prior to version 7.1, PostgreSQL cannot insert data into a row that exceeds the default data page size of 8192 bytes. To get around this, we'll instead insert this information as large objects. Upload a large object by using psql:

my $myoid=`psql -U postgres --command "\\lo_import \'$body_file\' \'$mymailbox\' " $db`;
my @oidarray = split(/ /,$myoid);

This method requires less coding and is simpler to understand than is using libpq's more efficient large object call.

We are now able to insert the email message into the main table:

$sql = "INSERT INTO main VALUES( ?, ?, ?, @oidarray[1])";
$sth = $dbh->prepare($sql) or die $DBI::errstr;
$sth->execute( $id, $from, $subject ) or die $DBI::errstr;

This snippet of code generates the list of email recipients:

#POPULATE TABLE 'mailto'

for my $msg_TO ($msg->to) {
	$sql = "INSERT INTO mailto VALUES( ?, ? )";
	$sth = $dbh->prepare($sql) or die $DBI::errstr;
	$sth->execute( $id, $msg_TO->{email} ) or die $DBI::errstr;
}

The most interesting part of the program identifies MS Word attachments:

# DETERMINE ATTACHMENT MIME-TYPES

my $decodedattachment = `echo "$attachment" | ./decode.pl |
	tee $attachment_file | file -i -`;
my @test0     = split(/ /, $decodedattachment);
my $mime_type = $test0[1];
chomp $mime_type;

# VALIDATING MSWORD DOCS

if ($mime_type eq "application/MS Word") {

	# CONVERT INTO READABLE TEXT
	`antiword $attachment_file > $MS Word_file;
	mv -f $MS Word_file $attachment_file`;

} else {

	# NOT MSWORD DOCS,
	# USE THE ENCODED MIME VERSION IN THE FILE
	`echo "$attachment" > $attachment_file`;
}

decode.pl decodes the MIME Base64 message back into its original binary encoding. There is also a PostgreSQL function for this. The program then pipes the encoded output into the file utility to determine if it really is a MS Word document. At the same time, it saves the decoded attachment as the temporary file /tmp/attachment.

Any attachment that is identified as a MS Word document with a MIME type of application/MS Word runs through the antiword utility to turn it into an ASCII text document. The program then uploads this as a large object.

All other attachments are saved in the database as large objects in their original MIME encodings.

Part 4. Querying the Database

Now that the information is in the database, the next trick is to retrieve it in a way that makes sense. I've created the following queries, based on my own database that I used to develop this article.

First, let's summarize all the email messages that have both a main body and an attachment. (Refer to the table creation create tables script for table details).

SELECT m.from_email AS From,
       m.subject    AS Subject,
       m.mailbody   AS Message,
       a.attachment AS Attachment
FROM   main       m,
       attachment a
WHERE  m.messageid=a.message

This will produce output similar to:

From Subject Message Attachment
robert.bernier5@sympatico.ca message 1 60227 60230
robert.bernier5@sympatico.ca message 2 60233 60236
casestudy@postgresql.org message 3 60239 60242
casestudy@postgresql.org message 4 60245 60248

The Message and Attachments are oids, or reference numbers, to the large objects now stored in the pg_largeobject catalog.

Now, let's list all messages with attachments and identify those attachments that were MS Word documents:

SELECT m.messageid  AS "Message ID",
       a.attachment AS "Attachment oid",
	CASE
		WHEN a.mime_type='application/MS Word'
		THEN 'true'
 		ELSE 'false'
	END
                    AS "MS Word attachment"
FROM   main m, attachment a
WHERE  m.messageid=a.messageid;

This will produce output similar to:

Message ID Attachment oid MS Word attachment
3FF79F08.6060208@sympatico.ca 60230 true
3FF79F29.2090101@sympatico.ca 60236 false
3FF7A511.2030104@postgresql.org 60242 false
3FF7A52A.8040203@postgresql.org 60248 true

You can return the ASCII text conversion of any MS Word attachment from a query similar to:

SELECT encode(lo.data,'escape') AS "My Document"
FROM   pg_largeobject lo,
       attachme1t     a
WHERE  a.messageid  = '3FF7A52A.8040203@postgresql.org'
AND    a.attachment = lo.loid;

The encode function casts My Document from a bytea to a text type by removing all escape sequences that the lo_import function originally inserted.

Searching for Patterns

To fetch results with more detail, you'll have to search specific columns for patterns. There are two ways to search text, with the SQL LIKE expression or with PostgreSQL's POSIX-style regular expressions.

Using LIKE

This query will search for the phrase "These functions read and write" in a particular Word document.


SELECT m.from_email  AS "From",
       a.description AS "File name",
	CASE
		WHEN encode(lo.data,'escape') LIKE '%These functions read and write%'
		THEN 'true'
  		ELSE 'false'
	END AS "Search results"
FROM pg_largeobject lo,
     attachment      a,
     main            m
WHERE a.messageid  = '3FF7A52A.8040203@postgresql.org'
AND   a.attachment = lo.loid
AND   a.messageid  = m.messageid;

Here are the results from my sample tables:

From File name Search results
casestudy@postgresql.org doc2.doc true

Regular Expressions

This query will count the number of emails with attachments and whose from address begins with the letters ca and end in org.

SELECT count(m.from_email)
FROM   main       m,
       attachment a
WHERE  m.messageid  = a.messageid
AND    m.from_email ~ '^ca'
AND    m.from_email ~ 'org$';

Caveats and Suggestions

In the interests of keeping the Perl script as simple as possible, I've assumed that all email attachments arrive encoded in Base64. As well, since I wrote the script to follow the flow of this article, the code is a bit clumsy.

The database definitions here only demonstrate where you can make a start. Making a database that flies like a jet is an article in itself.

Save yourself time and grief by saving SQL queries and definitions in a file and then invoking it through the psql client. If you're paranoid, you can put these files under version control

Invoke the create tables script to clean out the database as you rerun the Perl scripts.

Antiword is a bit quirky in that it will not parse files that it considers too small.

Viewing the Mozilla mbox through its client can be deceiving. It never really removes email from the box even if you've deleted it, so you may well see email in your database that you thought you destroyed months before.

You can export large objects from the database by using the lo_export function in the psql client. For example \lo_export 123 temp.txt will save the large object with the oid number of 132 to a file named temp.txt.

Conclusion

There are two parts to a successful data mine; the mechanism that prepares the document and the algorithm that looks for patterns. We've looked at one way getting email into a database but frankly, I haven't given that much coverage on the power of regular expressions in a database environment, although some of you Perl fiends have a pretty good idea of the possibilities.

References

  • Perl
  • CPAN
  • PostgreSQL
  • Antiword
  • Create table script
  • Mail parsing script
  • MIME decoding script

Robert Bernier is the PostgreSQL business intelligence analyst for SRA America, a subsidiary of Software Research America (SRA).


Return to ONLamp.com.


Have a question about Robert's technique or a suggestion for further improvements? Let us know here.
You must be logged in to the O'Reilly Network to post a talkback.
Post Comment
Full Threads Oldest First

Showing messages 1 through 4 of 4.

  • E-mail Vault for Mozilla
    2004-04-11 07:11:47  kevinneely [Reply | View]

    It looks like a group of people are attempting to implement this in Mozilla.
    E-mail Vault for Mozilla is supposed to store e-mail in a database for easier cateegorizing and searching. Don't know if it'll search through .doc files for ya yet.
  • 8K Limit, OpenFTS
    2004-04-10 07:50:33  agliodbs [Reply | View]

    Robert,


    Two things for your readers:


  • The 8K limit on text fields has been fixed for 3 years, so with any reasonably current version of PostgreSQL it's no longer necessary to use large objects for the message body or translated Word doc.

  • Rather than using Regexes, the current waw to so this would be to use OpenFTS (openfts.sourceforge.net) to do ranked word searches on the message subject, body, and attachment.


  • All in all, thanks for the article and I look forward to tinkering with the tools you mention for my own personal store of 30,000 messages!


    -Josh Berkus

  • Inverted Index
    2004-04-08 23:27:27  Joe [Reply | View]

    Hey Robert,

    The idea is great - but you don't take it far enough in the article. You should think about how to move further into that world with more search concepts (ala the Inverted Index) like Zoë, or digging into some of the more esoteric "search" worlds like Peer-to-Peer Semantic Search Engines: Building a Memex by Maciej


Tagged Articles

Post to del.icio.us

This article has been tagged:

email

Articles that share the tag email:

Using Qpsmtpd (20 tags)

Data Mining Email (19 tags)

Tweaking Tiger Mail (14 tags)

Email Protocols: Where Do We Go Now? (14 tags)

The Fight Against Spam, Part 2 (12 tags)

View All

perl

Articles that share the tag perl:

Programming is Hard, Let's Go Scripting... (177 tags)

Using Ajax from Perl (101 tags)

Ten Essential Development Practices (97 tags)

Everyday Perl 6 (92 tags)

Catalyst (91 tags)

View All

datamining

Articles that share the tag datamining:

Data Mining Email (10 tags)

Massive Data Aggregation with Perl (9 tags)

Top Ten Data Crunching Tips and Tricks (8 tags)

Calculating Entropy for Data Mining (5 tags)

Calculating Entropy for Data Miners (3 tags)

View All

programming

Articles that share the tag programming:

Rolling with Ruby on Rails (1374 tags)

Very Dynamic Web Interfaces (279 tags)

Ajax on Rails (231 tags)

Understanding MVC in PHP (202 tags)

A Simpler Ajax Path (186 tags)

View All

database

Articles that share the tag database:

MySQL FULLTEXT Searching (54 tags)

Live Backups of MySQL Using Replication (53 tags)

Advanced MySQL Replication Techniques (53 tags)

Dreaming of an Atom Store: A Database for the Web (49 tags)

How to Misuse SQL's FROM Clause (38 tags)

View All

Sponsored Resources

  • Inside Lightroom

Related to this Article

Understanding Oracle Clinical Understanding Oracle Clinical
by Joan M. Johnson
May 2007
$9.99 USD

Inside SQLite Inside SQLite
by Sibsankar Haldar
April 2007
$9.99 USD

Advertisement
O'Reilly Media

©2009, O'Reilly Media, Inc.
(707) 827-7000 / (800) 998-9938
All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.
About O'Reilly
Academic Solutions
Authors
Contacts
Customer Service
Jobs
Newsletters
O'Reilly Labs
Press Room
Privacy Policy
RSS Feeds
Terms of Service
User Groups
Writing for O'Reilly
Content Archive
Business Technology
Computer Technology
Google
Microsoft
Mobile
Network
Operating System
Digital Photography
Programming
Software
Web
Web Design
More O'Reilly Sites
O'Reilly Radar
Ignite
Tools of Change for Publishing
Digital Media
Inside iPhone
O'Reilly FYI
makezine.com
craftzine.com
hackszine.com
perl.com
xml.com

Partner Sites
InsideRIA
java.net
O'Reilly Insights on Forbes.com