Behdad Esfahbod's daily notes on GNOME, Pango, Fedora, Persian Computing, Bob Dylan, and Dan Bern!

McEs, A Hacker Life
Friday, April 08, 2005
On Lucene and its decency

This is from a reply of mine in a thread on gnome-devel-list. I'm quoting it here since it documents part of my work in 2002 on the RiRa Persian digital library project:

On Wed, 6 Apr 2005, Jamie McCracken wrote:

> 2) Use of an SQL database is a far superior, faster and flexible
> solution to using a dedicated indexer like the lucerne engine (all other
> competing engines like spotlight use sql databases). This is one area
> search services has got right.

Lucene is a decent search engine. You cannot compare it with SQL databases. You can compare it with another search engine, which may or may not use an SQL database as its backend, but as soon as you are talking about search engines, their implementation details don't matter at all. So SearchServices is really losing here by using SQL databases, since it has a lot to do to catch up with Lucene, and I doubt it can. SQL databases are good if you want atomicity, transactions, scalability, and support for (really) complicated queries: joins, subqueries, etc. None of that is needed in a desktop search service, where you have one single server per user that does the indexing too. What SQL databases provide for a search engine is at best the "like" operator, and, well, they can use indexes only when you are matching the beginning of the string. And all the RDBMS hype comes from decent products like PostgreSQL, not a toy-sized one like SQLite.

Lucene, on the other hand, comes from an experienced ex-employee of Excite, and from the Apache Foundation. It's specialized for search services. It allows for localization of search technology: you have an English normalizer, a German one, a Persian one, and so on. Yes, you have text normalizers there.
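To make the idea of per-language normalizers concrete, here is a minimal Python sketch. It is not Lucene's analyzer code: the function names, the `NORMALIZERS` table, and the choice of rules (case folding for English and German, mapping Arabic Yeh and Kaf to their Persian forms and dropping the zero-width non-joiner for Persian) are all assumptions picked for illustration.

```python
import unicodedata

# Hypothetical character rules for Persian text (an assumption, not Lucene's rules).
PERSIAN_MAP = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
    "\u200C": None,      # drop ZERO WIDTH NON-JOINER
})


def normalize_english(text):
    """Case-fold so that 'Rose' and 'rose' index to the same term."""
    return text.casefold()


def normalize_german(text):
    """Case-fold; str.casefold() also maps 'ß' to 'ss'."""
    return text.casefold()


def normalize_persian(text):
    """Compose the text (NFC), then unify Arabic and Persian letter variants."""
    return unicodedata.normalize("NFC", text).translate(PERSIAN_MAP)


# One normalizer per language, applied both at index time and at query time.
NORMALIZERS = {
    "en": normalize_english,
    "de": normalize_german,
    "fa": normalize_persian,
}


def normalize(text, lang):
    return NORMALIZERS[lang](text)
```

The point of the sketch is only that the same normalizer has to run over both the indexed text and the query, so that variant spellings meet at the same term.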

> Cause its not just about indexing - We have metadata too and
> that really needs a DB. If all you want is a google on your
> hard drive then yes a dedicated indexer would be best but an
> RDBMS will give you expanidbility and flebility in handling
> structured metadata with more powerful search options.

Very good point. Yes, Lucene accepts metadata too. You can have an unlimited number of fields. In fact, Lucene is quite like a relational database: you have different tables, and each table has a number of fields. It's just that you are not forced to have a primary key. At search time, you can search a table, or any field of it, with exact or fuzzy matching. Queries can be built in a tree-like fashion using AND, OR, and NOT operations. And it already has parsers for Google-like queries. It even accepts wildcards in query words. It also accepts quotation marks for searching phrases exactly, something that's a nightmare to do with RDBMS-based systems.
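As a rough illustration of fielded documents and tree-shaped AND/OR/NOT queries, here is a small Python sketch of an inverted index. It is a toy under stated assumptions, not Lucene's API: the `TinyIndex` class, the tuple-based query format, and the sample documents are all made up for this example.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: maps (field, term) to the set of matching doc ids."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.doc_ids = set()

    def add(self, doc_id, fields):
        """Index a document given as a dict of field name -> text."""
        self.doc_ids.add(doc_id)
        for field, text in fields.items():
            for term in text.lower().split():
                self.postings[(field, term)].add(doc_id)

    def search(self, query):
        """Evaluate a query tree made of nested tuples and return doc ids."""
        op = query[0]
        if op == "term":
            _, field, term = query
            return set(self.postings.get((field, term.lower()), set()))
        if op == "and":
            return set.intersection(*(self.search(q) for q in query[1:]))
        if op == "or":
            return set.union(*(self.search(q) for q in query[1:]))
        if op == "not":
            return self.doc_ids - self.search(query[1])
        raise ValueError("unknown operator: %r" % (op,))


index = TinyIndex()
index.add(1, {"poet": "hafez", "text": "the rose and the nightingale"})
index.add(2, {"poet": "saadi", "text": "the rose garden"})
index.add(3, {"poet": "hafez", "text": "wine and the tavern"})

# text:rose AND NOT poet:saadi
query = ("and", ("term", "text", "rose"), ("not", ("term", "poet", "saadi")))
matches = index.search(query)  # only document 1 matches
```

Each field behaves a bit like a column, and no field has to be a key; the query is just a tree whose leaves are single-term lookups.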


I had an experience with Lucene a couple of years ago (http://rira.ir/). I was working on a small-sized database of Persian poetry, some 700'000 verses in 17'000 poems. I had it imported into PostgreSQL, in some ten tables. I wanted to add a search service. Using a table for word-item matchings was out of the question. I got Lucene, and it was a matter of a couple of hours to write an indexer to fetch data out of PostgreSQL and import it into Lucene. Now some of my observations were really stunning:

Update: I forgot to mention where IMHO the speed of Lucene comes from. From looking at the code, Lucene (and probably many other small-scale (and large-scale too?) search engines) works with bit vectors over all documents. So a complex query can be performed by bitwise operations over the long bit vectors of the basic one-phrase queries. Now you will probably say a bit vector over all documents is HUGE, but no: eight million documents take only one megabyte, which is negligible these days. And eight million documents is far more than what you find on any website.
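The bit-vector idea can be sketched in a few lines of Python, using arbitrary-precision integers as bit vectors. This only illustrates the technique described above and is not taken from Lucene's source; the helper names and the sample document ids are invented for the example.

```python
def bitvector(doc_ids):
    """Build a bit vector (as a Python int) with one bit set per matching doc."""
    vector = 0
    for doc_id in doc_ids:
        vector |= 1 << doc_id
    return vector


def to_doc_ids(vector):
    """List the positions of the set bits, lowest first."""
    ids = []
    position = 0
    while vector:
        if vector & 1:
            ids.append(position)
        vector >>= 1
        position += 1
    return ids


NUM_DOCS = 16
ALL_DOCS = (1 << NUM_DOCS) - 1  # mask with one bit per document

# Results of three basic queries (hypothetical data).
rose = bitvector([1, 2, 5, 9])
garden = bitvector([2, 9, 12])
wine = bitvector([3, 5, 9])

# rose AND garden AND NOT wine, done purely with bitwise operations.
result = rose & garden & (ALL_DOCS & ~wine)
matching = to_doc_ids(result)  # only document 2 survives

# One bit per document: eight million documents need one million bytes.
bytes_needed = 8_000_000 // 8
```

The size arithmetic is the same as in the text: 8,000,000 bits divided by 8 bits per byte gives 1,000,000 bytes, about one megabyte per vector.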

Comments:
Comparing Lucene and an RDBMS is like comparing apples and strudel (ok, oranges). I'd be careful calling Lucene a small-scale search engine (some major, big sites use it). Doug Cutting is an ex-Excite employee, not an ex-Altavista employee. :)
 
Hey, thanks for the comment. Fixed Altavista->Excite. Indeed, comparing them is meaningless; that's what I was trying to say too. By small-scale, I was comparing it to Google, Yahoo!, Excite, etc.
 
Hi,

It provides all I want, and at least fetching number of rows is far cheaper than in PostgreSQL for example.

Do you mean the following query?

select count(*) from table;

Will it be possible in Lucene to get the total number of events even though they are not all selected? For example, I have 1000 occurrences of a string in my data. I am just getting the latest 10 and I want to show the total count as 1000.

select * from table order by time limit 10;
select count(*) from table;

Will it be possible to avoid the second query in Lucene?
 
Yes, in Lucene, a search query always returns the number of matches. But note that Lucene is not a relational database. It's not a drop-in replacement for Postgres for sure!
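To show what "one search gives you both the page and the total" looks like, here is a small Python sketch. It does not use Lucene at all: the `search_latest` function, the event dictionaries, and the sample data are hypothetical, and a real engine would walk an index instead of scanning every event.

```python
import heapq


def search_latest(events, needle, limit=10):
    """Return (total number of matches, the `limit` most recent matches)."""
    matches = [event for event in events if needle in event["text"]]
    latest = heapq.nlargest(limit, matches, key=lambda event: event["time"])
    return len(matches), latest


# 2000 made-up events, every second one containing the word "error".
events = [
    {"time": t, "text": "disk error" if t % 2 == 0 else "ok"}
    for t in range(2000)
]

total, latest = search_latest(events, "error")
# total is 1000 even though only 10 events are returned in `latest`.
```

The count falls out of the same pass that finds the matches, so there is no second query to run.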
 