I'm currently looking at other search methods rather than having a huge SQL query. I saw elasticsearch recently and played with whoosh (a Python implementation of a search engine).
Can you give reasons for your choice(s)?
This question is related to
solr
lucene
elasticsearch
sphinx
xapian
We use Lucene regularly to index and search tens of millions of documents. Searches are quick enough, and we use incremental updates that do not take a long time. It did take us some time to get here. The strong points of Lucene are its scalability, a large range of features and an active community of developers. Using bare Lucene requires programming in Java.
If you are starting afresh, the tool for you in the Lucene family is Solr, which is much easier to set up than bare Lucene, and has almost all of Lucene's power. It can import database documents easily. Solr are written in Java, so any modification of Solr requires Java knowledge, but you can do a lot just by tweaking configuration files.
I have also heard good things about Sphinx, especially in conjunction with a MySQL database. Have not used it, though.
IMO, you should choose according to:
My sphinx.conf
source post_source
{
type = mysql
sql_host = localhost
sql_user = ***
sql_pass = ***
sql_db = ***
sql_port = 3306
sql_query_pre = SET NAMES utf8
# query before fetching rows to index
sql_query = SELECT *, id AS pid, CRC32(safetag) as safetag_crc32 FROM hb_posts
sql_attr_uint = pid
# pid (as 'sql_attr_uint') is necessary for sphinx
# this field must be unique
# that is why I like sphinx
# you can store custom string fields into indexes (memory) as well
sql_field_string = title
sql_field_string = slug
sql_field_string = content
sql_field_string = tags
sql_attr_uint = category
# integer fields must be defined as sql_attr_uint
sql_attr_timestamp = date
# timestamp fields must be defined as sql_attr_timestamp
sql_query_info_pre = SET NAMES utf8
# if you need unicode support for sql_field_string, you need to patch the source
# this param. is not supported natively
sql_query_info = SELECT * FROM my_posts WHERE id = $id
}
index posts
{
source = post_source
# source above
path = /var/data/posts
# index location
charset_type = utf-8
}
Test script:
<?php
require "sphinxapi.php";
$safetag = $_GET["my_post_slug"];
// $safetag = preg_replace("/[^a-z0-9\-_]/i", "", $safetag);
$conf = getMyConf();
$cl = New SphinxClient();
$cl->SetServer($conf["server"], $conf["port"]);
$cl->SetConnectTimeout($conf["timeout"]);
$cl->setMaxQueryTime($conf["max"]);
# set search params
$cl->SetMatchMode(SPH_MATCH_FULLSCAN);
$cl->SetArrayResult(TRUE);
$cl->setLimits(0, 1, 1);
# looking for the post (not searching a keyword)
$cl->SetFilter("safetag_crc32", array(crc32($safetag)));
# fetch results
$post = $cl->Query(null, "post_1");
echo "<pre>";
var_dump($post);
echo "</pre>";
exit("done");
?>
Sample result:
[array] =>
"id" => 123,
"title" => "My post title.",
"content" => "My <p>post</p> content.",
...
[ and other fields ]
Sphinx query time:
0.001 sec.
Sphinx query time (1k concurrent):
=> 0.346 sec. (average)
=> 0.340 sec. (average of last 10 query)
MySQL query time:
"SELECT * FROM hb_posts WHERE id = 123;"
=> 0.001 sec.
MySQL query time (1k concurrent):
"SELECT * FROM my_posts WHERE id = 123;"
=> 1.612 sec. (average)
=> 1.920 sec. (average of last 10 query)
An experiment to compare ElasticSearch and Solr
I have used Sphinx, Solr and Elasticsearch. Solr/Elasticsearch are built on top of Lucene. It adds many common functionality: web server api, faceting, caching, etc.
If you want to just have a simple full text search setup, Sphinx is a better choice.
If you want to customize your search at all, Elasticsearch and Solr are the better choices. They are very extensible: you can write your own plugins to adjust result scoring.
Some example usages:
The only elasticsearch vs solr performance comparison I've been able to find so far is here:
Try indextank.
As the case of elastic search, it was conceived to be much easier to use than lucene/solr. It also includes very flexible scoring system that can be tweaked without reindexing.
We use Sphinx in a Vertical Search project with 10.000.000 + of MySql records and 10+ different database . It has got very excellent support for MySQL and high performance on indexing , research is fast but maybe a little less than Lucene. However it's the right choice if you need quickly indexing every day and use a MySQL db.
Lucene is nice and all, but their stop word set is awful. I had to manually add a ton of stop words to StopAnalyzer.ENGLISH_STOP_WORDS_SET just to get it anywhere near usable.
I haven't used Sphinx but I know people swear by its speed and near-magical "ease of setup to awesomeness" ratio.
Source: Stackoverflow.com