PHP Search Engine Showdown
PHP Engines
If you're going to install a local search engine and are using PHP, you have several great PHP engines to consider. We took the leaders in the field, summarized their features (Table 1), tested them all, and found:
iSearch has an excellent range of options that suit the needs of nearly any site, yet its core functions are encrypted and very hard to change. Also, in testing, the spider trapped itself in a loop or on an unreachable page every 20 minutes or so, making cron-based updates unreliable.
MnogoSearch is quite powerful and versatile, but unlike most of its PHP-minded competitors, it must be compiled before use and has the steepest learning curve. It works out of the box with every major database, including SQLite, and comes with front ends for PHP, C, and Perl. A command-line interface handles all maintenance and indexing; once you have configured it correctly, it is also useful for automation. It has a wide variety of features, including site search, FTP archive search, news article and newspaper search, and more.
PHPDig uses a MySQL database and builds a glossary of words from the pages you index. Search results are ranked by keyword density. It is overflowing with features, has plugins for any format of data, and includes built-in index scheduling routines. Yet despite PHPDig's fame and clean code, this search engine is far from being one of the best available: indexing is quite slow, especially in comparison with MnogoSearch or RiSearch.
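To make "ranked by keyword density" concrete, here is a minimal sketch of the idea in PHP. It is not PHPDig's actual code: the function names `keywordDensity` and `rankByDensity` and the in-memory page array are hypothetical, and it assumes a current PHP version (7 or later). Density here simply means the share of a page's words that match the search term.

```php
<?php
// Hypothetical illustration only; PHPDig's real indexer stores its glossary in MySQL.

// Fraction of the words in $text that equal $keyword (case-insensitive).
function keywordDensity(string $text, string $keyword): float
{
    // Split on anything that is not a letter or digit, dropping empty pieces.
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    if (count($words) === 0) {
        return 0.0;
    }
    // array_keys() with a search value returns the positions of every match.
    $hits = count(array_keys($words, strtolower($keyword), true));
    return $hits / count($words);
}

// Score every page (URL => text) and return URL => density, best match first.
function rankByDensity(array $pages, string $keyword): array
{
    $scores = [];
    foreach ($pages as $url => $text) {
        $scores[$url] = keywordDensity($text, $keyword);
    }
    arsort($scores); // sort descending by score, keeping the URL keys
    return $scores;
}
```

A ranking like this rewards pages that repeat the term often relative to their length, which is cheap to compute but says little about whether the page is actually about the topic.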
RiSearch is powerful and has a very fast search script, designed to work with hundreds of megabytes of text data. It uses no libraries or databases; it is Perl code with PHP front ends. For a file-based storage back end, RiSearch searches surprisingly fast. The speed comes at the cost of search result relevancy, which is poorer than that of the other options. It is therefore better for finding unique phrases, like names of species, than for searching concepts.
Sphider is PHP code that uses MySQL for indexing pages. It works for sites up to 20,000 pages. It also works great as a tool for site analysis, such as finding broken links and gathering statistics about the site. It has an efficient back end and search algorithm, but its crawling methods function poorly.
Sphinx is a fast and capable full text search engine, particularly suited for database content. It runs its own daemon (which you compile) and does not have any web crawlers bundled. Features include high performance, good scalability and search quality, advanced sorting, filtering, and grouping.
TSEP's crawler runs for a long time when the data to index is extensive. This was a problem on one server with time-out/keep-alive settings of 8/15, though adding ignore_user_abort() to the top of indexer.php works around it.
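The TSEP workaround might look like the fragment below. Only the `ignore_user_abort()` call comes from our testing; the `set_time_limit(0)` line is an additional assumption that may help on servers where PHP's own execution limit, rather than a dropped connection, stops the crawl.

```php
<?php
// Top of indexer.php, before any crawling starts.

// Keep the script running even if the browser connection is dropped.
ignore_user_abort(true);

// Assumption, not part of the original fix: remove PHP's execution time limit.
set_time_limit(0);
```

Treat this as a sketch to try on a test server first, since a crawler that no longer stops on disconnect will run until it finishes or is killed.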
Table 1. Summary of leading PHP engines
| | Sphider | MnogoSearch | TSEP | PHPDig | iSearch | RiSearch | Sphinx |
|---|---|---|---|---|---|---|---|
| Overall ranking | *** | *** | ** | ** | * | * | ** |
| Database | MySQL, SQLite | Several | MySQL, SQLite | MySQL | MySQL | Flat files (text) | MySQL, PostgreSQL, Flat files |
| Multilanguage support | No | Yes | Yes | No | Yes | Yes | Yes |
| Support | Medium (forum) | Very good (discussion list, forum, and paid email support) | Medium (forum) | Poor (forum) | Medium (FAQ and forum) | Medium (forum) | Good |
| User interface | Easy | Easy | Easy | Medium | Difficult | Easy | Easy |
| Customizability | High | High | High | High | Medium | Medium | High |
| PHP 5 compatible | Yes | Yes (for the interface) | Yes | Yes | Yes | Yes (requires PHP 5) | Yes |
| SQLite compatible | No | Yes | Yes | Yes | No | No | No |
| URL-free crawling | Yes | Yes | Yes | No | No | Yes | Yes |
| Install package download | 44K | 2MB | 1.5MB | 273K | 150K | 128K | ~300K |
| Installation | Medium | Very easy | Easy | Easy-Medium | Easy-Medium | Easy-Medium | Easy |
| Access needed to install | Root | Root (need to compile) | Root | Root | Root | FTP | Shell (non root) |
| Recommended file limit | High | Very high | High | High | High | Very high | Very high |
| Index speed | Very slow | ~500 in 10 seconds | ~500 in 14 seconds | Slow | Medium | ~500 in 18 seconds | 4-10 MB/sec |
Overall ranking represents the author's overall ranking of the engine, based on ease of use, power, spidering speed, and ranking relevancy.
Database lists the kind of database used for creating and storing the index.
Support refers to the customer support available for each engine: the channels through which you can ask questions about installing or using the tool.
Access needed to install indicates the level of access you need on the server to fully install the application and index your site.
Recommended file limit indicates how many documents the search engine can handle while still running at full capacity.
Other PHP search engines, not included in the table but listed below, are available. We do not recommend these engines as highly.
SiteSearch is a PHP engine that uses a text file database to index the information on the site. It includes several useful features, such as indexing by meta tags and multiple word search. It has several add-ons, including multilanguage support and text database support.
Simple Web Search is a script that searches a SWISH-E index. It requires SWISH-E 1.x or 2.x and PHP 3.0.8 or newer on the system, and a web server supporting PHP 3.
IndexServer is a useful plugin package that lets you perform a variety of tasks. Once it has indexed a web site, you can run further queries against the resulting index.
Xapian is only an indexing tool, but the company also offers a web site search engine package that includes its Omega solution, which looks promising and has several interesting features. Xapian uses SWIG for PHP, so the indexer is not PHP5 compatible. This is where BeebleX comes in. BeebleX is a search engine that uses a PHP 5-compatible Xapian extension. For more information, visit Marco Tabini's thoughts on BeebleX.
Recommendations
There is no ideal PHP search engine, but our overall impression was that Sphider and MnogoSearch are the best contenders. In general, Sphider returns more accurate hits, and MnogoSearch is easier to set up.
Sphinx is a relatively new contender and shows good promise. Although Sphinx is little known and has few real-world installations so far, it is worth revisiting in the future, particularly if you don't need a web crawler. Xapian is a strong engine with support for many programming languages and an active community, but we found it difficult to set up in PHP.
Conclusion
If you want to know more about search engines, the following sites have plenty of descriptions, reviews, news, guides, how-tos, and technologies:
- www.searchtools.com
- www.searchengines.com
- www.searchengineguide.com
- www.searchengineshowdown.com
- searchenginewatch.com
Michael Douma is an expert in user interface design and web-based interactive education.
-
Another search engine
2007-05-16 08:46:41 GreyWyvern
I found this article through a link on a forum where my own search engine was recommended. My little engine isn't really an industry leader, but perhaps you would see fit to include it in your tests :) People seem to like it, I guess.
The search package can be found here: www.greywyvern.com/orca#search
-
Solr
2007-04-27 03:14:46 jerj
It's well worth looking at the Apache Solr project too, which is based on Java Lucene but will talk to many languages, including PHP.
http://lucene.apache.org/solr/
-
you're mistaken to dismiss Xapian
2007-03-30 03:41:05 heathd
Hi,
in your article you describe Xapian as "not recommended". I think this is mistaken. Xapian is one of the most powerful, high performance, flexible, open source search+indexing systems around. Benefits include:
- the ability to do real-time (re)-indexing on added/modified/deleted documents.
- bindings for many programming languages
- based on sound theoretical basis
- stemmers for many human languages:
http://www.xapian.org/docs/stemming.html
Xapian is in successful use on many websites. One I have personal experience of is http://www.theyworkforyou.com/. Try out the search of this large database and see how fast it is.
You also state that Xapian "is not PHP5 compatible" but then 2 sentences later mention a "PHP 5-compatible Xapian extension". Although I have not used it in PHP5 myself, there appear to be several people who have used Xapian successfully in PHP5:
- http://blog.dixo.net/2006/04/04/xapian-php5-wrapper/
- http://article.gmane.org/gmane.comp.search.xapian.general/2673
(link broken at time of writing, so google cache version:) - http://66.102.9.104/search?q=cache:-lKeWzlSLZAJ:article.gmane.org/gmane.comp.search.xapian.general/2673+http://article.gmane.org/gmane.comp.search.xapian.general/2673&hl=en&ct=clnk&cd=1&gl=fr&client=firefox-a
for an intro read "Homo Xapian - The Search For a Better Search… Engine" in June 2005 PHP Architect:
https://www.phparch.com/issue.php?mid=59
David Heath


