On Lucene and its decency
From a reply of mine in a thread on gnome-devel-list. I'm quoting it here since it documents part of my work in 2002 on the RiRa Persian digital library project:
On Wed, 6 Apr 2005, Jamie McCracken wrote:
> 2) Use of an SQL database is a far superior, faster and flexible
> solution to using a dedicated indexer like the lucerne engine (all other
> competing engines like spotlight use sql databases). This is one area
> search services has got right.
Lucene is a decent search engine. You cannot compare it with SQL databases; you can compare it to another search engine, which may or may not use an SQL database as its backend, but as soon as you are talking about search engines, their implementation details don't matter at all. So SearchServices, by using an SQL database, is really losing here, since it has to do a lot to catch up with Lucene, which I doubt it can. SQL databases are good things if you want atomicity, transactions, scalability, and support for (really) complicated queries: joins, subqueries, etc. None of that is needed in a desktop search service, where you have one single server per user that does the indexing too. What SQL databases provide for a search engine is at best the "like" operator, and, well, they can use indexes only when you are matching the beginning of the string. And all the RDBMS hype comes from decent products like PostgreSQL, not a toy-sized one like SQLite.
Lucene, on the other hand, comes from an experienced ex-employee of Excite, and from the Apache Foundation. It's specialized for search services. It allows for localization of search technology: you have an English normalizer, a German one, a Persian one, ... Yes, you have text normalizers there.
> Cause its not just about indexing - We have metadata too and
> that really needs a DB. If all you want is a google on your
> hard drive then yes a dedicated indexer would be best but an
> RDBMS will give you expanidbility and flebility in handling
> structured metadata with more powerful search options.
Very good point. Yes, Lucene accepts metadata too. You can have an unlimited number of fields. In fact, Lucene is quite like a relational database: you have different tables, and each table has a number of fields. It's just that you are not forced to have a primary key. At search time, you can search a table, or any field of it, with exact or fuzzy matching. Queries can be built in a tree-like fashion, using AND, OR, and NOT operations. And it already has parsers for parsing Google-like queries. It even accepts wildcards in query words. It also accepts quotation marks for searching phrases exactly, something that's a nightmare to do with RDBMS-based systems.
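To make the "tree-like fashion" concrete, here is a minimal sketch of a query tree evaluated over a toy in-memory inverted index. This is not Lucene's API: the class names (Term, And, Or, Not), the sample documents, and the field names are all made up for illustration, and it handles only exact word matches.

```python
from dataclasses import dataclass

# Hypothetical sample data: document id -> {field name: text}.
DOCS = {
    0: {"poet": "hafez", "text": "the rose and the nightingale"},
    1: {"poet": "saadi", "text": "the rose garden"},
    2: {"poet": "hafez", "text": "wine and the tavern"},
}

def build_index(docs):
    """Map (field, word) -> set of ids of the documents containing that word."""
    index = {}
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            for word in text.split():
                index.setdefault((field, word), set()).add(doc_id)
    return index

@dataclass
class Term:
    """Leaf node: exact match of one word in one field."""
    field: str
    word: str
    def evaluate(self, index, all_ids):
        return index.get((self.field, self.word), set())

@dataclass
class And:
    left: object
    right: object
    def evaluate(self, index, all_ids):
        return self.left.evaluate(index, all_ids) & self.right.evaluate(index, all_ids)

@dataclass
class Or:
    left: object
    right: object
    def evaluate(self, index, all_ids):
        return self.left.evaluate(index, all_ids) | self.right.evaluate(index, all_ids)

@dataclass
class Not:
    child: object
    def evaluate(self, index, all_ids):
        # NOT is taken relative to the set of all known documents.
        return all_ids - self.child.evaluate(index, all_ids)

INDEX = build_index(DOCS)
ALL_IDS = set(DOCS)

# poet:hafez AND ((text:rose OR text:wine) AND NOT text:tavern)
query = And(
    Term("poet", "hafez"),
    And(Or(Term("text", "rose"), Term("text", "wine")), Not(Term("text", "tavern"))),
)
matches = query.evaluate(INDEX, ALL_IDS)
print(matches)
```

Each node only needs to return a set of document ids, so a query parser can produce arbitrarily nested trees without the index knowing anything about them.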
I had an experience with Lucene a couple of years ago (http://rira.ir/). I was working on a small-sized database of Persian poetry, some 700'000 verses in 17'000 poems. I had it imported into PostgreSQL, in some ten tables. I wanted to add a search service. Using a table for word-item matchings was out of the question. I got Lucene, and it was a matter of a couple of hours to write an indexer to fetch data out of PostgreSQL and import it into Lucene. Some of my observations were really stunning:
- Data was coming out of PostgreSQL views, which were simply natural joins of some six tables (poet, book, part, poem, block, verse), all indexed, etc. The database was tuned to the best of my knowledge (shared memory size, vacuumed, etc). Lucene and the indexer were running on another machine. The indexing took just under one minute, with the PostgreSQL server making its machine just unusable in this period, perhaps writing join tables to the hard disk and fetching them back later, etc, while the Lucene machine was as happy as a machine can be.
- The raw SQL dump of the data was 45MiB, which Bzip2 would reduce to 17MiB. The PostgreSQL database holding this data takes more than 70MiB, not counting an indexing system on top of that. In Lucene, for each field you can select at index time whether you want the field to be stored in the database (to be returned at search time) or not. I could simply have stored primary keys into my RDBMS database, but decided to store the whole text in Lucene, since after all, stored AND indexed, the database was a small 30MiB file!! And my search page didn't need to contact the RDBMS for search excerpts anymore.
- For a small project like mine, which needed almost none of an RDBMS's glories --or to be honest it does need them, but the performance of the joins I like is not satisfying at all--, I may decide to move completely to Lucene. It provides all I want, and at least fetching the number of rows is far cheaper than in PostgreSQL, for example. (Don't argue about MySQL and others, they barely have things like views, schemas, etc.)
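The stored-versus-indexed choice mentioned in the list above can be sketched with a toy index. This only illustrates the idea and is not how Lucene is implemented or called: ToyIndex, its methods, and the sample verse ids are all hypothetical.

```python
class ToyIndex:
    """Toy index where each field is independently indexed and/or stored."""

    def __init__(self):
        self.postings = {}  # (field, word) -> set of document ids (searchable)
        self.stored = {}    # document id -> {field: original text} (retrievable)

    def add(self, doc_id, fields):
        """fields maps a field name to a (text, indexed, stored) tuple."""
        for name, (text, indexed, stored) in fields.items():
            if indexed:
                for word in text.split():
                    self.postings.setdefault((name, word), set()).add(doc_id)
            if stored:
                self.stored.setdefault(doc_id, {})[name] = text

    def search(self, field, word):
        """Return (doc_id, stored_fields) pairs for an exact word match."""
        ids = sorted(self.postings.get((field, word), ()))
        return [(doc_id, self.stored.get(doc_id, {})) for doc_id in ids]

index = ToyIndex()
# Option 1: keep only a key back into the RDBMS; the text is searchable but not returned.
index.add(1, {"verse_id": ("v-1", False, True),
              "text": ("the rose garden", True, False)})
# Option 2: store the text as well, so excerpts come straight from the index.
index.add(2, {"verse_id": ("v-2", False, True),
              "text": ("the rose and the nightingale", True, True)})

hits = index.search("text", "rose")
print(hits)
```

With the first option the search page must go back to the RDBMS with the key to build an excerpt. With the second it can show the text directly, at the cost of a larger index.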
Update: I forgot to mention where IMHO the speed of Lucene comes from. From looking at the code, Lucene (and probably many other small-scale (and large-scale too?) search engines) works with bit vectors over all documents. So a complex query can be performed by bitwise operations over the long bit vectors of the basic queries, each on one phrase. Now you will probably say a bit vector over all documents is HUGE, but no: at one bit per document, eight million documents take only one megabyte, which is negligible these days. And eight million documents is pretty much more than what you find on any website.
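As a rough sketch of the bit-vector idea (the general technique, not Lucene's actual code), a Python integer can stand in for a bit vector with one bit per document. The helper names and the document ids below are made up for illustration.

```python
def bit_vector(doc_ids):
    """Build a bit vector (as a Python int) with bit i set for each document id i."""
    bits = 0
    for doc_id in doc_ids:
        bits |= 1 << doc_id
    return bits

def doc_ids(bits):
    """List the document ids whose bits are set, in increasing order."""
    ids = []
    position = 0
    while bits:
        if bits & 1:
            ids.append(position)
        bits >>= 1
        position += 1
    return ids

# Hypothetical results of three basic queries, each over the same set of documents.
rose = bit_vector([0, 1, 5])
wine = bit_vector([2, 5, 7])
tavern = bit_vector([2])

both = rose & wine                # rose AND wine
either = rose | wine              # rose OR wine
wine_not_tavern = wine & ~tavern  # wine AND NOT tavern

print(doc_ids(both), doc_ids(either), doc_ids(wine_not_tavern))

# Size check: one bit per document, eight bits per byte.
documents = 8_000_000
bytes_needed = (documents + 7) // 8
print(bytes_needed)
```

Combining two result sets costs one pass over the vectors regardless of how many documents match, which is why nested AND/OR/NOT queries stay cheap.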