PHP DevCenter

oreilly.comSafari Books Online.Conferences.

We've expanded our LAMP news coverage and improved our search! Search for all things LAMP across O'Reilly!

Search
Search Tips

advertisement

Listen Print Discuss Subscribe to PHP Subscribe to Newsletters

Building a Simple Search Engine with PHP
Pages: 1, 2, 3

Populating The Database



Okay, the database is created and we're ready to feed it some content. For this, we'll create a PHP script that takes a user-specified URL, reads the document representing the URL, extracts words from the page, and creates records in the database. Here's the program:

<?
/*
* populate.php
*
* Script for populating the search database with words,
* pages and word-occurences.
*/

/* Connect to the database: */
mysql_pconnect("localhost","root","secret")
    or die("ERROR: Could not connect to database!");

mysql_select_db("test");

/* Define the URL that should be processed: */

$url = addslashes( $_GET['url'] );

if( !$url )
{
   die( "You need to define a URL to process." );
}
else if( substr($url,0,7) != "http://" )
{
   $url = "http://$url";
}

/* Does this URL already have a record in the page-table? */
$result = mysql_query("SELECT page_id FROM page WHERE page_url = \"$url\"");
$row = mysql_fetch_array($result);

if( $row['page_id'] )
{
   /* If yes, use the old page_id: */
   $page_id = $row['page_id'];
}
else
{
   /* If not, create one: */
   mysql_query("INSERT INTO page (page_url) VALUES (\"$url\")");
   $page_id = mysql_insert_id();
}

/* Start parsing through the text, and build an index in the database: */
if( !($fd = fopen($url,"r")) )
   die( "Could not open URL!" );

while( $buf = fgets($fd,1024) )
{
   /* Remove whitespace from beginning and end of string: */
   $buf = trim($buf);

   /* Try to remove all HTML-tags: */
   $buf = strip_tags($buf);
   $buf = ereg_replace('/&\w;/', '', $buf);

   /* Extract all words matching the regexp from the current line: */
   preg_match_all("/(\b[\w+]+\b)/",$buf,$words);

   /* Loop through all words/occurrences and insert them into the database: */
   for( $i = 0; $words[$i]; $i++ )
   {
      for( $j = 0; $words[$i][$j]; $j++ )
      {
         /* Does the current word already have a record in the word-table? */
         $cur_word = addslashes( strtolower($words[$i][$j]) );

         $result = mysql_query("SELECT word_id FROM word 
                                WHERE word_word = '$cur_word'");
         $row = mysql_fetch_array($result);
         if( $row['word_id'] )
         {
            /* If yes, use the old word_id: */
            $word_id = $row['word_id'];
         }
         else
         {
            /* If not, create one: */
            mysql_query("INSERT INTO word (word_word) VALUES (\"$cur_word\")");
            $word_id = mysql_insert_id();
         }

         /* And finally, register the occurrence of the word: */
         mysql_query("INSERT INTO occurrence (word_id,page_id) 
                      VALUES ($word_id,$page_id)");
         print "Indexing: $cur_word<br>";
      }
   }
}

fclose($fd);

?>

Note the use of addslashes() when assigning values to $url and $cur_word. Since these variables are used within SQL queries, it's very important to make sure they're free from harmful data. Imagine if someone called this script with a URL of "; DROP TABLE FOO;. As addslashes() quotes database special characters, someone of malicious bent would have to go to much more work to exploit the database.

This script connects to the database, registering the page in the database if it does not exist, and starts to retrieve data. The preg_match_all() call extracts all words (matching the regular expression) from the page. Each word must be recorded in the occurrence and word table.

When building the index, only three SQL INSERT statements actually matter. When a page is first indexed, it must be recorded:

INSERT INTO page (page_url) VALUES ("http://www.onlamp.com/");

The first occurrence of a word within the entire dataset must be recorded:

INSERT INTO word (word_word) VALUES ("linux");

Each occurrence of a word within a page must be recorded:

INSERT INTO occurrence (word_id,page_id) VALUES ($word_id,$page_id);

Let's now index a few pages. The seven sites that makes up the O'Reilly Network is probably a good idea. We will call populate.php with the site URLs as the only argument, one at a time:

http://localhost/populate.php?url=http://www.macdevcenter.com/
http://localhost/populate.php?url=http://www.onjava.com/
http://localhost/populate.php?url=http://www.onlamp.com/
http://localhost/populate.php?url=http://www.openp2p.com/
http://localhost/populate.php?url=http://www.osdir.com/
http://localhost/populate.php?url=http://www.perl.com/
http://localhost/populate.php?url=http://www.xml.com/

A quick investigation of the tables now should result in something like this:


mysql> SELECT * FROM page;
+---------+------------------------------+
| page_id | page_url                     |
+---------+------------------------------+
| 1       | http://www.macdevcenter.com/ |
| 2       | http://www.onjava.com/       |
| 3       | http://www.onlamp.com/       |
| 4       | http://www.openp2p.com/      |
| 5       | http://www.osdir.com/        |
| 6       | http://www.perl.com/         |
| 7       | http://www.xml.com/          |
+---------+------------------------------+
7 rows in set (0.00 sec)

mysql> SELECT COUNT(*) FROM word;
+----------+
| COUNT(*) |
+----------+
| 2423     |
+----------+
1 row in set (0.00 sec)

mysql> SELECT COUNT(*) FROM occurrence;
+----------+
| COUNT(*) |
+----------+
| 20628    |
+----------+
1 row in set (0.00 sec)

The database now has seven indexed pages, resulting in 2,423 unique words and 20,628 occurrences. We're ready for the next level, the search interface.

Pages: 1, 2, 3

Next Pagearrow




Tagged Articles

Post to del.icio.us

This article has been tagged:

php

Articles that share the tag php:

Understanding MVC in PHP (477 tags)

The PHP Scalability Myth (123 tags)

The Dynamic Duo of PEAR::DB and Smarty (53 tags)

PHP Form Handling (43 tags)

Very Dynamic Web Interfaces (39 tags)

View All

search

Articles that share the tag search:

MySQL FULLTEXT Searching (93 tags)

Find What You Want with Plucene (22 tags)

Building a Vector Space Search Engine in Perl (18 tags)

Google Your Desktop (14 tags)

Dreaming of an Atom Store: A Database for the Web (14 tags)

View All

mysql

Articles that share the tag mysql:

MySQL FULLTEXT Searching (155 tags)

Live Backups of MySQL Using Replication (152 tags)

Advanced MySQL Replication Techniques (125 tags)

Ten MySQL Best Practices (59 tags)

Rolling with Ruby on Rails (56 tags)

View All

Sponsored Resources

  • Inside Lightroom
Advertisement

Sponsored by:

O'Reilly Media

©2009, O'Reilly Media, Inc.
(707) 827-7000 / (800) 998-9938
All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.
About O'Reilly
Academic Solutions
Authors
Contacts
Customer Service
Jobs
Newsletters
O'Reilly Labs
Press Room
Privacy Policy
RSS Feeds
Terms of Service
User Groups
Writing for O'Reilly
Content Archive
Business Technology
Computer Technology
Google
Microsoft
Mobile
Network
Operating System
Digital Photography
Programming
Software
Web
Web Design
More O'Reilly Sites
O'Reilly Radar
Ignite
Tools of Change for Publishing
Digital Media
Inside iPhone
O'Reilly FYI
makezine.com
craftzine.com
hackszine.com
perl.com
xml.com

Partner Sites
InsideRIA
java.net
O'Reilly Insights on Forbes.com