PHP DevCenter

PHP Search Engine Showdown

PHP Engines

If you're going to install a local search engine and are using PHP, you have several great PHP engines to consider. We took the leaders in the field, summarized their features (Table 1), tested them all, and found:



iSearch has an excellent range of options for the needs of nearly any site, yet the core functions are encrypted and cannot practically be modified. Also, in testing, the spider trapped itself in a loop or on an unreachable page every 20 minutes or so, making cron-based updates unreliable.
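
For context on the loop problem, here is a minimal sketch of how a crawler can avoid revisiting pages. It does not reflect iSearch's encrypted internals; the in-memory link graph, the crawl() function name, and the page cap are all assumptions made for illustration.

```php
<?php
// Hypothetical sketch of a loop-safe crawl over an in-memory link graph.
// Not taken from iSearch: the graph, function name, and page cap are assumptions.

/**
 * Breadth-first crawl that visits each URL at most once and stops
 * after $maxPages pages. $links maps a URL to the URLs it links to.
 */
function crawl(array $links, string $start, int $maxPages = 1000): array
{
    $visited = [];
    $queue = [$start];
    while ($queue !== [] && count($visited) < $maxPages) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue; // already seen: this check is what breaks link cycles
        }
        $visited[$url] = true;
        foreach ($links[$url] ?? [] as $next) {
            if (!isset($visited[$next])) {
                $queue[] = $next;
            }
        }
    }
    return array_keys($visited);
}

// /a and /b link to each other, which would trap a naive spider.
$links = [
    '/a' => ['/b', '/c'],
    '/b' => ['/a'],
    '/c' => [],
];
print_r(crawl($links, '/a'));
```

The visited set handles cycles, and the page cap is a second safety net for sites that generate unlimited URLs.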

MnogoSearch is quite powerful and versatile, but unlike most of its PHP-minded competitors, it must be compiled before use and has the steepest learning curve. It works out of the box with every major database, including SQLite, and comes with front ends for PHP, C, and Perl. A command-line interface handles all maintenance and indexing; once you have configured it correctly, it is also useful for automation. It has a wide variety of features, including searches of your site, FTP archive searches, news article and newspaper searches, and more.

PHPDig uses a MySQL database, building a glossary with words from the pages you index. The search results display the pages ranked by keyword density. It's overflowing with features and plugins for any format of data and has built-in index scheduling routines. Though PHPDig's fame and clean code would suggest otherwise, this search engine is far from being one of the best available: indexing is quite slow, especially in comparison with MnogoSearch or RiSearch.
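
To illustrate what ranking by keyword density means, here is a minimal sketch. It is not PHPDig's actual code, and it keeps pages in an array rather than MySQL. The function names and sample pages are assumptions, and density is taken to be the share of a page's words that match the keyword.

```php
<?php
// Hypothetical sketch: rank pages by keyword density.
// Not PHPDig's implementation; names and sample data are assumptions.

/** Split text into lowercase alphanumeric words. */
function tokenize(string $text): array
{
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return $words === false ? [] : $words;
}

/** Fraction of the words in $text that equal $keyword (0.0 for empty text). */
function keywordDensity(string $text, string $keyword): float
{
    $words = tokenize($text);
    if (count($words) === 0) {
        return 0.0;
    }
    $hits = count(array_keys($words, strtolower($keyword), true));
    return $hits / count($words);
}

/** Return the URLs of pages containing the keyword, highest density first. */
function rankByDensity(array $pages, string $keyword): array
{
    $scores = [];
    foreach ($pages as $url => $text) {
        $density = keywordDensity($text, $keyword);
        if ($density > 0) {
            $scores[$url] = $density;
        }
    }
    arsort($scores); // sort by density, descending, keeping URL keys
    return array_keys($scores);
}

$pages = [
    '/a.html' => 'PHP search engines compared',
    '/b.html' => 'Search search and more search',
    '/c.html' => 'Nothing relevant here',
];
print_r(rankByDensity($pages, 'search'));
```

A density score rewards pages that repeat the term, so a short page that mentions the keyword often outranks a long page that mentions it once.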

RiSearch is powerful and has a very fast search script, designed to work with hundreds of megabytes of text data. It does not use libraries or databases but is Perl code with PHP front ends. RiSearch searches surprisingly quickly for an engine with a file-based storage back end. However, this design affects search result relevancy, which is poorer than that of the other options. It is therefore better for finding unique phrases, like names of species, than for searching concepts.

Sphider is PHP code that uses MySQL for indexing pages. It works for sites up to 20,000 pages. It also works great as a tool for site analysis, such as finding broken links and gathering statistics about the site. It has an efficient back end and search algorithm, but its crawling methods function poorly.

Sphinx is a fast and capable full text search engine, particularly suited for database content. It runs its own daemon (which you compile) and does not have any web crawlers bundled. Features include high performance, good scalability and search quality, advanced sorting, filtering, and grouping.

TSEP's crawler takes a long time to run when there is a lot of data to index. This caused a problem on one server with time-out/keep-alive settings of 8/15, though adding ignore_user_abort() to the top of indexer.php works around it.
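
As a rough illustration of the ignore_user_abort() workaround, the sketch below shows the call at the top of a long-running script. Only ignore_user_abort() comes from the text above. The indexBatch() function and the sample URLs are hypothetical stand-ins, not TSEP's actual indexer.php code.

```php
<?php
// Sketch of the top of a long-running indexing script.
// Only ignore_user_abort() comes from the article; indexBatch() and the
// sample URLs are hypothetical placeholders, not TSEP code.

ignore_user_abort(true); // keep running even if the client connection drops

/** Hypothetical stand-in for the real indexing work; returns pages processed. */
function indexBatch(array $urls): int
{
    $done = 0;
    foreach ($urls as $url) {
        // ... fetch and index $url here ...
        $done++;
    }
    return $done;
}

$processed = indexBatch(['/index.html', '/about.html']);
echo "Indexed {$processed} pages\n";
```

With the flag set, PHP does not stop the script when the browser or proxy gives up waiting, so the crawl can finish even though nobody sees the final output.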

Table 1. Summary of leading PHP engines

| | Sphider | MnogoSearch | TSEP | PHPDig | iSearch | RiSearch | Sphinx |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Overall ranking | *** | *** | ** | ** | * | * | ** |
| Database | MySQL, SQLite | Several | MySQL, SQLite | MySQL | MySQL | Flat files (text) | MySQL, PostgreSQL, Flat files |
| Multilanguage support | No | Yes | Yes | No | Yes | Yes | Yes |
| Support | Medium (forum) | Very good (discussion list, forum, and paid email support) | Medium (forum) | Poor (forum) | Medium (FAQ and forum) | Medium (forum) | Good |
| User interface | Easy | Easy | Easy | Medium | Difficult | Easy | Easy |
| Customizability | High | High | High | High | Medium | Medium | High |
| PHP 5 compatible | Yes | Yes (for the interface) | Yes | Yes | Yes | Yes (requires PHP 5) | Yes |
| SQLite compatible | No | Yes | Yes | Yes | No | No | No |
| URL-free crawling | Yes | Yes | Yes | No | No | Yes | Yes |
| Install package download | 44K | 2MB | 1.5MB | 273K | 150K | 128K | ~300K |
| Installation | Medium | Very easy | Easy | Easy-Medium | Easy-Medium | Easy-Medium | Easy |
| Access needed to install | Root | Root (need to compile) | Root | Root | Root | FTP | Shell (non root) |
| Recommended file limit | High | Very high | High | High | High | Very high | Very high |
| Index speed | Very slow | ~500 in 10 seconds | ~500 in 14 seconds | Slow | Medium | ~500 in 18 seconds | 4-10 MB/sec |

Overall ranking represents the author's overall ranking of the engine, based on ease of use, power, spidering speed, and ranking relevancy.

Database lists the kind of database used for creating and storing the index.

Support refers to the customer support available for each engine and the channels through which you can ask questions about installing or using the tool.

Access needed to install indicates the access you need to have on the server in order to fully install your application and index your site.

Recommended file limit indicates how many documents the search engine can handle while still running at full capacity.

Other PHP search engines, not included in the table but listed below, are available. We do not recommend these engines as highly.

SiteSearch is a PHP engine that uses a text file database to index the information on the site. It includes several useful features, such as indexing by meta tags and multiple word search. It has several add-ons, including multilanguage support and text database support.

Simple Web Search is a script that searches a SWISH-E index. It requires SWISH-E 1.x or 2.x and PHP 3.0.8 or newer on the system, and a web server supporting PHP 3.

IndexServer is a useful plugin package that lets you perform a variety of tasks. Once it has indexed a web site, you can run further queries against the resulting index.

Xapian is only an indexing tool, but the project also offers a web site search engine package that includes its Omega solution, which looks promising and has several interesting features. Xapian uses SWIG for PHP, so the indexer is not PHP 5 compatible. This is where BeebleX comes in: it is a search engine that uses a PHP 5-compatible Xapian extension. For more information, see Marco Tabini's thoughts on BeebleX.

Recommendations

There is no ideal PHP search engine, but our overall impression was that Sphider and MnogoSearch are the best contenders. In general, Sphider returns more accurate hits, and MnogoSearch is easier to set up.

Sphinx is a relatively new contender and shows good promise. Although Sphinx is little known and has few real-world installations so far, it is worth revisiting in the future, particularly if you don't need a web crawler. Xapian is a strong engine with support for many programming languages and an active community, but we found it difficult to set up in PHP.

Conclusion

If you want to know more about search engines, the following sites have plenty of descriptions, reviews, news, guides, how-tos, and technologies:

  • www.searchtools.com
  • www.searchengines.com
  • www.searchengineguide.com
  • www.searchengineshowdown.com
  • searchenginewatch.com

Michael Douma is an expert in user interface design and web-based interactive education.




Did he overlook anything? Recommend your favorite here.

  • Need some information related to search engines
    2007-05-23 06:32:19  SEO_Hawk

    Is there any way to find out the technology behind search engines like Google and Yahoo?

    We have been tracking their technology for the last several days, but could not figure out the technology working behind these search engine giants.

    regards
    http://www.seohawk.com


  • Another search engine
    2007-05-16 08:46:41  GreyWyvern

    I found this article through a link on a forum where my own search engine was recommended. My little engine isn't really an industry leader, but perhaps you would see fit to include it in your tests :) People seem to like it, I guess.

    The search package can be found here: www.greywyvern.com/orca#search
  • Solr
    2007-04-27 03:14:46  jerj

    It's well worth looking at the Apache Solr project too which is based on Java Lucene but will talk to many languages including PHP

    http://lucene.apache.org/solr/
  • you're mistaken to dismiss Xapian
    2007-03-30 03:41:05  heathd

    Hi,

    in your article you describe Xapian as "not recommended". I think this is mistaken. Xapian is one of the most powerful, high performance, flexible, open source search+indexing systems around. Benefits include:


    1. the ability to do real-time (re)-indexing on added/modified/deleted documents
    2. bindings for many programming languages
    3. a sound theoretical basis
    4. stemmers for many human languages:
      http://www.xapian.org/docs/stemming.html



    Xapian is in successful use on many websites. One I have personal experience of is http://www.theyworkforyou.com/. Try out the search of this large database and see how fast it is.

    You also state that Xapian "is not PHP5 compatible" but then 2 sentences later mention a "PHP 5-compatible Xapian extension". Although I have not used it in PHP5 myself, there appear to be several people who have used Xapian successfully in PHP5:


    1. http://blog.dixo.net/2006/04/04/xapian-php5-wrapper/
    2. http://article.gmane.org/gmane.comp.search.xapian.general/2673
      (link broken at time of writing, so google cache version:)
      http://66.102.9.104/search?q=cache:-lKeWzlSLZAJ:article.gmane.org/gmane.comp.search.xapian.general/2673+http://article.gmane.org/gmane.comp.search.xapian.general/2673&hl=en&ct=clnk&cd=1&gl=fr&client=firefox-a



    for an intro read "Homo Xapian - The Search For a Better Search… Engine" in June 2005 PHP Architect:
    https://www.phparch.com/issue.php?mid=59

    David Heath
  • Sphinx Missing
    2007-03-26 20:48:50  helphand


    A blazingly fast engine that hooks to php is the Sphinx search engine, http://www.sphinxsearch.com/

    If you are thinking about adding search to a php site you host, be sure to take a look at sphinx.

    Helphand
  • what about the swish-e pecl module?
    2007-03-23 15:50:04  gggeek

    even though not strictly php-based, swish-e might be considered in the review, at least because of the existence of the swish extension in pecl, by two of the core php developers



©2009, O'Reilly Media, Inc.