Data Mining Email
Pages: 1, 2
Part 3. Developing the Email Parser
Install the following modules from the CPAN, along with their dependencies:- Mail::MboxParser, a simple interface that provides read-only access to Unix-mailboxes
- DBD::Pg, the database interface between Perl and the PostgreSQL database.
The mbox parser script first defines some global variables and then establishes a database connection with the PostgreSQL server:
my $dbh = DBI->connect($dsn,$user,$pass);
die $DBI::errstr unless defined $dbh;
$dbh->{PrintError} = 0;
The script then reads the mailbox and breaks down the component parts of each individual message:
my $mb = Mail::MboxParser->new(\@myarray,
decode => 'ALL',
parseropts => $parseropts);
A while loop processes messages from the mailbox and identifies the interesting parts of the message:
my $id = $msg->id;
my $from = $msg->from->{email};
my $subject = $msg->header->{subject};
The MboxParser module identifies the main body of the
message:
my $body = $msg->body($msg->find_body);
The main body and attachments of a message are referenced in the
main and attachment tables with an oid. Prior to
version 7.1, PostgreSQL cannot insert data into a row that exceeds the default
data page size of 8192 bytes. To get around this, we'll instead insert this
information as large objects. Upload a large object by using
psql:
my $myoid=`psql -U postgres --command "\\lo_import \'$body_file\' \'$mymailbox\' " $db`;
my @oidarray = split(/ /,$myoid);
This method requires less coding and is simpler to understand than is using
libpq's more efficient large object call.
We are now able to insert the email message into the main
table:
$sql = "INSERT INTO main VALUES( ?, ?, ?, @oidarray[1])";
$sth = $dbh->prepare($sql) or die $DBI::errstr;
$sth->execute( $id, $from, $subject ) or die $DBI::errstr;
This snippet of code generates the list of email recipients:
#POPULATE TABLE 'mailto'
for my $msg_TO ($msg->to) {
$sql = "INSERT INTO mailto VALUES( ?, ? )";
$sth = $dbh->prepare($sql) or die $DBI::errstr;
$sth->execute( $id, $msg_TO->{email} ) or die $DBI::errstr;
}
The most interesting part of the program identifies MS Word attachments:
# DETERMINE ATTACHMENT MIME-TYPES
my $decodedattachment = `echo "$attachment" | ./decode.pl |
tee $attachment_file | file -i -`;
my @test0 = split(/ /, $decodedattachment);
my $mime_type = $test0[1];
chomp $mime_type;
# VALIDATING MSWORD DOCS
if ($mime_type eq "application/MS Word") {
# CONVERT INTO READABLE TEXT
`antiword $attachment_file > $MS Word_file;
mv -f $MS Word_file $attachment_file`;
} else {
# NOT MSWORD DOCS,
# USE THE ENCODED MIME VERSION IN THE FILE
`echo "$attachment" > $attachment_file`;
}
decode.pl decodes the MIME Base64 message back into its
original binary encoding. There is also a PostgreSQL function for this. The
program then pipes the encoded output into the file utility to
determine if it really is a MS Word document. At the same time, it saves the
decoded attachment as the temporary file /tmp/attachment.
Any attachment that is identified as a MS Word document with a MIME type of
application/MS Word runs through the antiword utility
to turn it into an ASCII text document. The program then uploads this as a
large object.
All other attachments are saved in the database as large objects in their original MIME encodings.
Part 4. Querying the Database
Now that the information is in the database, the next trick is to retrieve it in a way that makes sense. I've created the following queries, based on my own database that I used to develop this article.
First, let's summarize all the email messages that have both a main body and an attachment. (Refer to the table creation create tables script for table details).
SELECT m.from_email AS From,
m.subject AS Subject,
m.mailbody AS Message,
a.attachment AS Attachment
FROM main m,
attachment a
WHERE m.messageid=a.message
This will produce output similar to:
| From | Subject | Message | Attachment |
|---|---|---|---|
robert.bernier5@sympatico.ca |
message 1 |
60227 |
60230 |
robert.bernier5@sympatico.ca |
message 2 |
60233 |
60236 |
casestudy@postgresql.org |
message 3 |
60239 |
60242 |
casestudy@postgresql.org |
message 4 |
60245 |
60248 |
The Message and Attachments are oids, or reference numbers, to the large
objects now stored in the pg_largeobject catalog.
Now, let's list all messages with attachments and identify those attachments that were MS Word documents:
SELECT m.messageid AS "Message ID",
a.attachment AS "Attachment oid",
CASE
WHEN a.mime_type='application/MS Word'
THEN 'true'
ELSE 'false'
END
AS "MS Word attachment"
FROM main m, attachment a
WHERE m.messageid=a.messageid;
This will produce output similar to:
| Message ID | Attachment oid | MS Word attachment |
|---|---|---|
3FF79F08.6060208@sympatico.ca |
60230 |
true |
3FF79F29.2090101@sympatico.ca |
60236 |
false |
3FF7A511.2030104@postgresql.org |
60242 |
false |
3FF7A52A.8040203@postgresql.org |
60248 |
true |
You can return the ASCII text conversion of any MS Word attachment from a query similar to:
SELECT encode(lo.data,'escape') AS "My Document"
FROM pg_largeobject lo,
attachme1t a
WHERE a.messageid = '3FF7A52A.8040203@postgresql.org'
AND a.attachment = lo.loid;
The encode function casts My Document from a
bytea to a text type by removing all escape sequences
that the lo_import function originally inserted.
Searching for Patterns
To fetch results with more detail, you'll have to search specific columns for patterns. There are two ways to search text, with the SQL LIKE expression or with PostgreSQL's POSIX-style regular expressions.
Using LIKE
This query will search for the phrase "These functions read and write" in a particular Word document.
SELECT m.from_email AS "From",
a.description AS "File name",
CASE
WHEN encode(lo.data,'escape') LIKE '%These functions read and write%'
THEN 'true'
ELSE 'false'
END AS "Search results"
FROM pg_largeobject lo,
attachment a,
main m
WHERE a.messageid = '3FF7A52A.8040203@postgresql.org'
AND a.attachment = lo.loid
AND a.messageid = m.messageid;
Here are the results from my sample tables:
| From | File name | Search results |
|---|---|---|
casestudy@postgresql.org |
doc2.doc |
true |
Regular Expressions
This query will count the number of emails with attachments and whose from
address begins with the letters ca and end in
org.
SELECT count(m.from_email)
FROM main m,
attachment a
WHERE m.messageid = a.messageid
AND m.from_email ~ '^ca'
AND m.from_email ~ 'org$';
Caveats and Suggestions
In the interests of keeping the Perl script as simple as possible, I've assumed that all email attachments arrive encoded in Base64. As well, since I wrote the script to follow the flow of this article, the code is a bit clumsy.
The database definitions here only demonstrate where you can make a start. Making a database that flies like a jet is an article in itself.
Save yourself time and grief by saving SQL queries and definitions in a file
and then invoking it through the psql client. If you're paranoid, you can put these files under version control
Invoke the create tables script to clean out the database as you rerun the Perl scripts.
Antiword is a bit quirky in that it will not parse files that
it considers too small.
Viewing the Mozilla mbox through its client can be deceiving. It never really removes email from the box even if you've deleted it, so you may well see email in your database that you thought you destroyed months before.
You can export large objects from the database by using the
lo_export function in the psql client. For example
\lo_export 123 temp.txt will save the large object with the oid
number of 132 to a file named temp.txt.
Conclusion
There are two parts to a successful data mine; the mechanism that prepares the document and the algorithm that looks for patterns. We've looked at one way getting email into a database but frankly, I haven't given that much coverage on the power of regular expressions in a database environment, although some of you Perl fiends have a pretty good idea of the possibilities.
References
Robert Bernier is the PostgreSQL business intelligence analyst for SRA America, a subsidiary of Software Research America (SRA).
Return to ONLamp.com.
You must be logged in to the O'Reilly Network to post a talkback.
Showing messages 1 through 4 of 4.
-
8K Limit, OpenFTS
2004-04-10 07:50:33 agliodbs [Reply | View]
Robert,
Two things for your readers:
- The 8K limit on text fields has been fixed for 3 years, so with any reasonably current version of PostgreSQL it's no longer necessary to use large objects for the message body or translated Word doc.
- Rather than using Regexes, the current waw to so this would be to use OpenFTS (openfts.sourceforge.net) to do ranked word searches on the message subject, body, and attachment.
All in all, thanks for the article and I look forward to tinkering with the tools you mention for my own personal store of 30,000 messages!
-Josh Berkus
-
Inverted Index
2004-04-08 23:27:27 Joe [Reply | View]
Hey Robert,
The idea is great - but you don't take it far enough in the article. You should think about how to move further into that world with more search concepts (ala the Inverted Index) like Zoë, or digging into some of the more esoteric "search" worlds like Peer-to-Peer Semantic Search Engines: Building a Memex by Maciej






E-mail Vault for Mozilla is supposed to store e-mail in a database for easier cateegorizing and searching. Don't know if it'll search through .doc files for ya yet.