Comparison of full text search engine - Lucene Sphinx Postgresql MySQL

Question

I m building a Django site and I am looking for a search engine   A few candidates    Lucene Lucene with Compass Solr Sphinx Postgresql built-in full text search MySQl built-in full text search   Selection criteria    result relevance and ranking searching and indexing speed ease of use and ease of integration with Django resource requirements - site will be hosted on a VPS  so ideally the search engine wouldn t require a lot of RAM and CPU scalability extra features such as  did you mean    related searches  etc   Anyone who has had experience with the search engines above  or other engines not in the list -- I would love to hear your opinions   EDIT  As for indexing needs  as users keep entering data into the site  those data would need to be indexed continuously  It doesn t have to be real time  but ideally new data would show up in index with no more than 15 - 30 minutes delay

User · Answer

I am surprised that there isn t more information posted about Solr  Solr is quite similar to Sphinx but has more advanced features  AFAIK as I haven t used Sphinx -- only read about it    The answer at the link below details a few things about Sphinx which also applies to Solr  Comparison of full text search engine - Lucene  Sphinx  Postgresql  MySQL   Solr also provides the following additional features    Supports replication Multiple cores  think of these as separate databases with their own configuration and own indexes  Boolean searches Highlighting of keywords  fairly easy to do in application code if you have regex-fu  however  why not let a specialized tool do a better job for you  Update index via XML or delimited file Communicate with the search server via HTTP  it can even return Json  Native PHP Ruby Python  PDF  Word document indexing Dynamic fields Facets Aggregate fields Stop words  synonyms  etc  More Like this    Index directly from the database with custom queries Auto-suggest Cache Autowarming Fast indexing  compare to MySQL full-text search indexing times  -- Lucene uses a binary inverted index format  Boosting  custom rules for increasing relevance of a particular keyword or phrase  etc   Fielded searches  if a search user knows the field he she wants to search  they narrow down their search by typing the field  then the value  and ONLY that field is searched rather than everything -- much better user experience    BTW  there are tons more features  however  I ve listed just the features that I have actually used in production  BTW  out of the box  MySQL supports  1   3  and  11  limited  on the list above  For the features you are looking for  a relational database isn t going to cut it  I d eliminate those straight away   Also  another benefit is that Solr  well  Lucene actually  is a document database  e g  NoSQL  so many of the benefits of any other document database can be realized with Solr  In other words  you can use it for more than just search  i e  Performance   Get creative with it

User · Answer

Just my two cents to this very old question  I would highly recommend taking a look at ElasticSearch       Elasticsearch is a search server based on Lucene  It provides a distributed  multitenant-capable full-text search engine with a RESTful web interface and schema-free JSON documents  Elasticsearch is developed in Java and is released as open source under the terms of the Apache License    The advantages over other FTS  full text search  Engines are    RESTful interface  Better scalability  Large community  Built by Lucene developers   Extensive documentation  There are many open source libraries available  including Django    We are using this search engine at our project and very happy with it

User · Answer

I have used Solr   in my project  and it is the best so far

User · Answer

Apache Solr    Apart from answering OP s queries  Let me throw some insights on Apache Solr from simple introduction to detailed installation and implementation      Simple Introduction        Anyone who has had experience with the search engines above  or other   engines not in the list -- I would love to hear your opinions    Solr shouldn t be used to solve real-time problems  For search engines  Solr is pretty much game and works flawlessly   Solr works fine on High Traffic web-applications  I read somewhere that it is not suited for this  but I am backing up that statement   It utilizes the RAM  not the CPU         result relevance and ranking      The boost helps you rank your results show up on top  Say  you re trying to search for a name john in the fields firstname and lastname  and you want to give relevancy to the firstname field  then you need to boost up the firstname field as shown    http   localhost 8983 solr collection1 select q firstname john 2 amp lastname john   As you can see  firstname field is boosted up with a score of 2   More on SolrRelevancy        searching and indexing speed      The speed is unbelievably fast and no compromise on that  The reason I moved to Solr   Regarding the indexing speed  Solr can also handle JOINS from your database tables  A higher and complex JOIN do affect the indexing speed  However  an enormous RAM config can easily tackle this situation   The higher the RAM  The faster the indexing speed of Solr is         ease of use and ease of integration with Django      Never attempted to integrate Solr and Django  however you can achieve to do that with Haystack  I found some interesting article on the same and here s the github for it         resource requirements - site will be hosted on a VPS  so ideally the  search engine wouldn t require a lot of RAM and CPU      Solr breeds on RAM  so if the RAM is high  you don t to have to worry about Solr   Solr s RAM usage shoots up on full-indexing if you have some billion records  you could smartly make use of Delta imports to tackle this situation  As explained  Solr is only a near real-time solution         scalability      Solr is highly scalable  Have a look on SolrCloud  Some key features of it    Shards  or sharding is the concept of distributing the index among multiple machines  say if your index has grown too large  Load Balancing  if Solrj is used with Solr cloud it automatically takes care of load-balancing using it s Round-Robin mechanism  Distributed Search High Availability         extra features such as  did you mean    related searches  etc      For the above scenario  you could use the SpellCheckComponent that is packed up with Solr  There are a lot other features  The SnowballPorterFilterFactory helps to retrieve records say if you typed  books instead of book  you will be presented with results related to book     This answer broadly focuses on Apache Solr  amp  MySQL   Django is out of scope     Assuming that you are under LINUX environment  you could proceed to this article further   mine was an Ubuntu 14 04 version      Detailed Installation   Getting Started  Download Apache Solr from here  That would be version is 4 8 1  You could download new versions  I found this stable   After downloading the archive   extract it to a folder of your choice   Say     Downloads or whatever   So it will look like Downloads solr-4 8 1   On your prompt    Navigate inside the directory  shankar shankar-lenovo  cd Downloads solr-4 8 1  So now you are here      shankar shankar-lenovo    Downloads solr-4 8 1   Start the Jetty Application Server  Jetty is available inside the examples folder of the solr-4 8 1 directory   so navigate inside that and start the Jetty Application Server   shankar shankar-lenovo   Downloads solr-4 8 1 example  java -jar start jar  Now   do not close the terminal   minimize it and let it stay aside         TIP   Use  amp  after start jar to make the Jetty Server run in the   background     To check if Apache Solr runs successfully  visit this URL on the browser  http   localhost 8983 solr  Running Jetty on custom Port  It runs on the port 8983 as default  You could change the port either here or directly inside the jetty xml file   java -Djetty port 9091 -jar start jar  Download the JConnector  This JAR file acts as a bridge between MySQL and JDBC   Download the Platform Independent Version here  After downloading it  extract the folder and copy themysql-connector-java-5 1 31-bin jar and paste it to the lib directory   shankar shankar-lenovo   Downloads solr-4 8 1 contrib dataimporthandler lib  Creating the MySQL table to be linked to Apache Solr  To put Solr to use  You need to have some tables and data to search for  For that  we will use MySQL for creating a table and pushing some random names and then we could use Solr to connect to MySQL and index that table and it s entries   1 Table Structure  CREATE TABLE test solr mysql      id INT UNSIGNED NOT NULL AUTO INCREMENT    name VARCHAR 45  NULL    created TIMESTAMP NULL DEFAULT CURRENT TIMESTAMP    PRIMARY KEY  id        2 Populate the above table  INSERT INTO  test solr mysql    name   VALUES   Jean    INSERT INTO  test solr mysql    name   VALUES   Jack    INSERT INTO  test solr mysql    name   VALUES   Jason    INSERT INTO  test solr mysql    name   VALUES   Vego    INSERT INTO  test solr mysql    name   VALUES   Grunt    INSERT INTO  test solr mysql    name   VALUES   Jasper    INSERT INTO  test solr mysql    name   VALUES   Fred    INSERT INTO  test solr mysql    name   VALUES   Jenna    INSERT INTO  test solr mysql    name   VALUES   Rebecca    INSERT INTO  test solr mysql    name   VALUES   Roland      Getting inside the core and adding the lib directives  1 Navigate to  shankar shankar-lenovo    Downloads solr-4 8 1 example solr collection1 conf   2 Modifying the solrconfig xml  Add these two directives to this file       lt lib dir           contrib dataimporthandler lib   regex      jar    gt     lt lib dir           dist   regex  solr-dataimporthandler- d    jar    gt    Now add the DIH  Data Import Handler    lt requestHandler name   dataimport     class  org apache solr handler dataimport DataImportHandler   gt       lt lst name  defaults  gt         lt str name  config  gt db-data-config xml lt  str gt       lt  lst gt   lt  requestHandler gt    3 Create the db-data-config xml file  If the file exists then ignore  add these lines to that file  As you can see the first line  you need to provide the credentials of your MySQL database  The Database name  username and password    lt dataConfig gt       lt dataSource type  JdbcDataSource  driver  com mysql jdbc Driver  url  jdbc mysql   localhost yourdbname  user  dbuser  password  dbpass   gt       lt document gt      lt entity name  test solr  query  select CONCAT  test solr-  id  as rid name from test solr mysql WHERE    dataimporter request clean       false        OR  created   gt     dataimporter last index time     gt       lt field name  id  column  rid    gt       lt field name  solr name  column  name    gt       lt  entity gt      lt  document gt   lt  dataConfig gt         TIP   You can have any number of entities but watch out for id field    if they are same then indexing will skipped      4 Modify the schema xml file  Add this to your schema xml as shown      lt uniqueKey gt id lt  uniqueKey gt   lt field name  solr name  type  string  indexed  true  stored  true    gt       Implementation   Indexing  This is where the real deal is  You need to do the indexing of data from MySQL to Solr inorder to make use of Solr Queries   Step 1  Go to Solr Admin Panel  Hit the URL http   localhost 8983 solr on your browser  The screen opens like this      As the marker indicates  go to Logging inorder to check if any of the above configuration has led to errors   Step 2  Check your Logs  Ok so now you are here  As you can there are a lot of yellow messages  WARNINGS   Make sure you don t have error messages marked in red  Earlier  on our configuration we had added a select query on our db-data-config xml  say if there were any errors on that query  it would have shown up here     Fine  no errors  We are good to go  Let s choose collection1 from the list as depicted and select Dataimport  Step 3  DIH  Data Import Handler   Using the DIH  you will be connecting to MySQL from Solr through the configuration file db-data-config xml from the Solr interface and retrieve the 10 records from the database which gets indexed onto Solr   To do that  Choose full-import   and check the options Clean and Commit  Now click Execute as shown   Alternatively  you could use a direct full-import query like this too    http   localhost 8983 solr collection1 dataimport command full-import amp commit true     After you clicked Execute  Solr begins to index the records  if there were any errors  it would say Indexing Failed and you have to go back to the Logging section to see what has gone wrong   Assuming there are no errors with this configuration and if the indexing is successfully complete   you would get this notification     Step 4  Running Solr Queries  Seems like everything went well  now you could use Solr Queries to query the data that was indexed  Click the Query on the left and then press Execute button on the bottom   You will see the indexed records as shown   The corresponding Solr query for listing all the records is  http   localhost 8983 solr collection1 select q     amp wt json amp indent true     Well  there goes all 10 indexed records  Say  we need only names starting with Ja   in this case  you need to target the column name solr name  Hence your query goes like this   http   localhost 8983 solr collection1 select q solr name Ja  amp wt json amp indent true     That s how you write Solr Queries  To read more about it  Check this beautiful article

User · Answer

Good to see someone s chimed in about Lucene - because I ve no idea about that   Sphinx  on the other hand  I know quite well  so let s see if I can be of some help    Result relevance ranking is the default  You can set up your own sorting should you wish  and give specific fields higher weightings  Indexing speed is super-fast  because it talks directly to the database  Any slowness will come from complex SQL queries and un-indexed foreign keys and other such problems  I ve never noticed any slowness in searching either  I m a Rails guy  so I ve no idea how easy it is to implement with Django  There is a Python API that comes with the Sphinx source though  The search service daemon  searchd  is pretty low on memory usage - and you can set limits on how much memory the indexer process uses too  Scalability is where my knowledge is more sketchy - but it s easy enough to copy index files to multiple machines and run several searchd daemons  The general impression I get from others though is that it s pretty damn good under high load  so scaling it out across multiple machines isn t something that needs to be dealt with  There s no support for  did-you-mean   etc - although these can be done with other tools easily enough  Sphinx does stem words though using dictionaries  so  driving  and  drive   for example  would be considered the same in searches  Sphinx doesn t allow partial index updates for field data though  The common approach to this is to maintain a delta index with all the recent changes  and re-index this after every change  and those new results appear within a second or two   Because of the small amount of data  this can take a matter of seconds  You will still need to re-index the main dataset regularly though  although how regularly depends on the volatility of your data - every day  every hour    The fast indexing speeds keep this all pretty painless though    I ve no idea how applicable to your situation this is  but Evan Weaver compared a few of the common Rails search options  Sphinx  Ferret  a port of Lucene for Ruby  and Solr   running some benchmarks  Could be useful  I guess   I ve not plumbed the depths of MySQL s full-text search  but I know it doesn t compete speed-wise nor feature-wise with Sphinx  Lucene or Solr

User · Answer

I don t know Sphinx  but as for Lucene vs a database full-text search  I think that Lucene performance is unmatched  You should be able to do almost any search in less than 10 ms  no matter how many records you have to search  provided that you have set up your Lucene index correctly   Here comes the biggest hurdle though  personally  I think integrating Lucene in your project is not easy  Sure  it is not too hard to set it up so you can do some basic search  but if you want to get the most out of it  with optimal performance  then you definitely need a good book about Lucene   As for CPU  amp  RAM requirements  performing a search in Lucene doesn t task your CPU too much  though indexing your data is  although you don t do that too often  maybe once or twice a day   so that isn t much of a hurdle   It doesn t answer all of your questions but in short  if you have a lot of data to search  and you want great performance  then I think Lucene is definitely the way to go  If you re not going to have that much data to search  then you might as well go for a database full-text search  Setting up a MySQL full-text search is definitely easier in my book

User · Answer

SearchTools-Avi said  MySQL text search  which doesn t even index words of three letters or fewer    FYIs  The MySQL fulltext min word length is adjustable since at least MySQL 5 0  Google  mysql fulltext min length  for simple instructions   That said  MySQL fulltext has limitations  for one  it gets slow to update once you reach a million records or so

User · Answer

I m looking at PostgreSQL full-text search right now  and it has all the right features of a modern search engine  really good extended character and multilingual support  nice tight integration with text fields in the database     But it doesn t have user-friendly search operators like   or AND  uses  amp       and I m not thrilled with how it works on their documentation site   While it has bolding  of match terms in the results snippets  the default algorithm for which match terms is not great  Also  if you want to index rtf  PDF  MS Office  you have to find and integrate a file format converter   OTOH  it s way better than the MySQL text search  which doesn t even index words of three letters or fewer   It s the default for the MediaWiki search  and I really think it s no good for end-users  http   www searchtools com analysis mediawiki-search   In all cases I ve seen  Lucene Solr and Sphinx are really great   They re solid code and have evolved with significant improvements in usability  so the tools are all there to make search that satisfies almost everyone   for SHAILI - SOLR includes the Lucene search code library and has the components to be a nice stand-alone search engine

User · Answer

I would add mnoGoSearch to the list  Extremely performant and flexible solution  which works as Google   indexer fetches data from multiple sites  You could use basic criterias  or invent Your own hooks to have maximal search quality  Also it could fetch the data directly from the database   The solution is not so known today  but it feets maximum needs  You could compile and install it or on standalone server  or even on Your principal server  it doesn t need so much ressources as Solr  as it s written in C and runs perfectly even on small servers   In the beginning You need to compile it Yourself  so it requires some knowledge  I made a tiny script for Debian  which could help  Any adjustments are welcome   As You are using Django framework  You could use or PHP client in the middle  or find a solution in Python  I saw some articles   And  of course mnoGoSearch is open source  GNU GPL

[mysql] Comparison of full text search engine - Lucene, Sphinx, Postgresql, MySQL?

Examples related to mysql

Examples related to postgresql

Examples related to full-text-search

Examples related to lucene

Examples related to sphinx