Xapian vs Lucene
Yannick Warnier
2007-01-27 09:41:07 UTC
Hello,

It's probably quite troll-risky to put a title like this, but did anyone
take the trouble to compare Lucene to Xapian and make a list of
differences?

As I told the list at the end of last year, I'm going to have to
integrate an indexing/search engine in the coming weeks or months. It
will be integrated into Dokeos, an open-source e-learning application
written in PHP. At the moment we are using MnoGoSearch, which is
alright, but the problem lies in its indexing engine: we cannot really
ship it with our application, as only the Linux version is GPL and it
is a C program that has to be run via cron. Also, the free/collaborative
support and mailing-list activity are a bit too loose/slow.

So far, my understanding is that I can use Xapian PHP bindings to index
"on the fly" when inserting new content in my e-learning application. It
is also my understanding that Lucene is a piece of code in Java (which
is wrong for me as long as it involves more languages than just PHP for
the Dokeos administrators to deal with) that is quite popular and that
does things alright.

One problem I know of (from a Perl programmer) about Lucene is that the
Perl bindings do not actually handle unicode characters, and so the
*universality* of Lucene is lost when using it via the Perl bindings.

Of course, Dokeos-wise, it is important to have UTF-8 handling as we
plan to move to full-UTF-8 just before we start integrating the new
indexing...*stuff*.

As far as I am aware, my search application (as a finished/integrated
product) should deal with:
- indexing of webpages
- indexing of documents (all office documents)
- indexing/parsing of XML metadata
- awareness of user permissions (a result should only display if the
searching user is authorized to see it)

So, my question is: which is the best for my case? Lucene or Xapian? Any
benchmarks or comparisons available?

Of course, this is specialised advice and I should really post the same
mail to the Lucene list, but I'm not subscribed there yet, so for now I
will analyse the feedback I get from here only (which will obviously
distort it just a little bit).

Thanks a lot,

Yannick
Reini Urban
2007-02-01 07:10:48 UTC
Post by Yannick Warnier
It's probably quite troll-risky to put a title like this, but did anyone
take the trouble to compare Lucene to Xapian and make a list of
differences?
I compared Lucene C# against Xapian in a rather non-technical way.
Lucene C# CAN do UTF-8 and has much better MS Office and PDF parsing and
searching (native Windows techniques), but is rather awkward to
customize. The C# they used was very compiler-dependent.
Lucene C# has a file-change notification hook, which is cool.

We use both in our company: I'm doing the Xapian search engine, and
a colleague built the Lucene search, on Windows only.

I think I won in the long term because it was easier for me to customize
it. I developed and tested the Xapian piece on Cygwin, and then moved
the production engine to Linux, which was 10 times faster.
Post by Yannick Warnier
As far as I am aware, my search application (as a finished/integrated
product) should deal with:
- indexing of webpages
both very good.
Post by Yannick Warnier
- indexing of documents (all office documents)
Lucene C# is the best of all; the Java Lucene is not that good.
Post by Yannick Warnier
- indexing/parsing of XML metadata
both good.
Post by Yannick Warnier
- awareness of user permissions (a result should only display if the
searching user is authorized to see it)
This was complicated to achieve with Xapian. Native Windows Lucene C# is
better here.
For now we - xapian-omega - stick with HTTP auth via a mod_ntlm backend,
which handles the Windows auth tokens automatically. No login is
required on MSIE, only on Firefox.

I still have to implement the LDAP backend within Omega or PHP for users
and groups to check against the ACLs.

Or simply do it via suexec and map the user into samba for the cgi call
only. But I still have to persuade our IT to let me use samba.
Security-wise linux with a working suexec and samba is better than
native windows.
--
Reini Urban
http://phpwiki.org/ http://murbreak.at/
http://helsinki.at/ http://spacemovie.mur.at/
Jamie D
2007-02-01 07:42:48 UTC
Post by Reini Urban
Post by Yannick Warnier
- awareness of user permissions (a result should only display if the
searching user is authorized to see it)
This was complicated to achieve with xapian. Native Windows Lucene C# is
better here.
For now we - xapian-omega - stick with http auth via a mod_ntlm backend,
which handles the windows auth tokens automatically. No login required
on MSIE, just firefox.
In your case it would be very simple to use Xapian. Xapian has to know
nothing about user permissions; why would it, anyway? It's a search
engine, right? Handle all of your user authentication/ACLs in PHP:
if all of your search queries are made through PHP, you have no
problem.
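
A minimal sketch of that idea, in Python rather than PHP: the search
engine returns a ranked list, and the application drops whatever the
current user may not see. The Hit class, the group names and the sample
documents are all made up for illustration, and no real Xapian call is
shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical search hit; allowed_groups comes from the application's own ACL data."""
    doc_id: int
    title: str
    allowed_groups: frozenset


def visible_hits(hits, user_groups):
    """Keep only the hits the user may see, preserving rank order."""
    user_groups = set(user_groups)
    # A hit is visible if the user belongs to at least one allowed group.
    return [h for h in hits if h.allowed_groups & user_groups]


# Stand-in for the ranked list a search engine would return.
ranked = [
    Hit(1, "Course outline", frozenset({"students", "teachers"})),
    Hit(2, "Exam answers", frozenset({"teachers"})),
    Hit(3, "Forum rules", frozenset({"students"})),
]

student_view = visible_hits(ranked, {"students"})
print([h.title for h in student_view])  # ['Course outline', 'Forum rules']
```

One caveat with filtering after the search: a page of ten hits can
shrink once the hidden ones are dropped. Indexing a group term with
each document and restricting the query to the user's groups avoids
that, at the cost of re-indexing when permissions change.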

I too was looking at Xapian and Lucene. Before trying Xapian I first
used the PHP implementation of Lucene in the Zend Framework. It was
very slow for both indexing and searching, and was constantly running
into memory issues. I switched to Xapian and it's been far better: what
took several days to index using Zend Search Lucene can be indexed
using Xapian in just 5 minutes, and searching is much faster too.

I have not used the Java version of Lucene so cannot comment on it;
I wanted to stay away from Java as I know very little about
it.

Jamie
Jeff Breidenbach
2007-02-02 07:06:49 UTC
Post by Yannick Warnier
did anyone take the trouble to compare Lucene to Xapian and
make a list of differences?
This was my bottom line analysis in September 2006. Since then
Xapian has made significant progress on UTF-8.

http://spreadsheets.google.com/pub?key=pKHp5ItRZ0SUL0PKhN_ssfA
James Aylett
2007-02-02 12:20:33 UTC
Post by Jeff Breidenbach
This was my bottom line analysis in September 2006. Since then
Xapian has made significant progress on UTF-8.
http://spreadsheets.google.com/pub?key=pKHp5ItRZ0SUL0PKhN_ssfA
Were you using Quartz, or Flint, as your backend for the Xapian
analysis?

J
--
/--------------------------------------------------------------------------\
James Aylett xapian.org
***@tartarus.org uncertaintydivision.org
Jeff Breidenbach
2007-02-02 17:15:29 UTC
Post by James Aylett
Were you using Quartz, or Flint, as your backend for the Xapian
analysis?
Flint. I was extremely impressed with it.

-Jeff
James Aylett
2007-02-02 17:41:05 UTC
Post by Jeff Breidenbach
Post by James Aylett
Were you using Quartz, or Flint, as your backend for the Xapian
analysis?
Flint. I was extremely impressed with it.
We should start collecting quotes like this... if only to help Olly's
ego :-)

J
Yannick Warnier
2007-02-02 23:28:19 UTC
Post by James Aylett
Post by Jeff Breidenbach
Post by James Aylett
Were you using Quartz, or Flint, as your backend for the Xapian
analysis?
Flint. I was extremely impressed with it.
We should start collecting quotes like this... if only to help Olly's
ego :-)
It should still be balanced with a quote I had the other day:
"I looked at the Xapian website, and it looked like it was a page
written by a 14 years old boy, whereas the Lucene website looks very
professional".

It may seem a poor argument, given that the two have the same
capabilities, but it still counts. The other day, I was pitching an
integrated system (all GPL) to a public administration in Belgium
(around 30,000 users) and a search engine had to be included. Being part
of a small company, to make the pitch we had to be endorsed by a
bigger company. They wouldn't let us go even one step with Xapian,
just because of the website's appearance... Lucene was not a problem
though, and this is all because of a website.

Another quote I had was:
"From what I have read, Xapian people seem to consider their way of
treating the indexing process/algorithms as the biblical truth, that
doesn't have to be discussed, while Lucene explains a lot more what they
are doing and why".

I haven't checked the truth of either of these quotes, so I can't comment
on them, but I think they must be taken into account if Xapian wants to
get a better public image.

Yannick
Jason White
2007-02-03 10:29:03 UTC
Post by Yannick Warnier
"I looked at the Xapian website, and it looked like it was a page
written by a 14 years old boy, whereas the Lucene website looks very
professional".
I *strongly* disagree.

The Xapian Web site was clearly written by an expert. It is well organized,
and informative.
Post by Yannick Warnier
"From what I have read, Xapian people seem to consider their way of
treating the indexing process/algorithms as the biblical truth, that
doesn't have to be discussed, while Lucene explains a lot more what they
are doing and why".
Have you actually read the discussion of algorithms and the introduction to
information retrieval on the Xapian Web site?

It's better than what most free software projects offer by way of explanation.
It also shows that the developers have expertise in research related to
information retrieval systems. I haven't searched the Lucene site for
corresponding information, so this isn't a comparative comment; I'm just
making the point that the Xapian approach is thoroughly documented.
Post by Yannick Warnier
I haven't checked the truth of either of these quotes, so I can't comment
on them, but I think they must be taken into account if Xapian wants to
get a better public image.
These quotes are both uninformed regarding Xapian, so I suggest that they
shouldn't be taken seriously.
Yannick Warnier
2007-02-03 11:10:37 UTC
Post by Jason White
Post by Yannick Warnier
"I looked at the Xapian website, and it looked like it was a page
written by a 14 years old boy, whereas the Lucene website looks very
professional".
I *strongly* disagree.
The Xapian Web site was clearly written by an expert. It is well organized,
and informative.
Written by an expert doesn't mean written for a decision-taker.
Post by Jason White
Post by Yannick Warnier
"From what I have read, Xapian people seem to consider their way of
treating the indexing process/algorithms as the biblical truth, that
doesn't have to be discussed, while Lucene explains a lot more what they
are doing and why".
Have you actually read the discussion of algorithms and the introduction to
information retrieval on the Xapian Web site?
I didn't (yet) but I sincerely believe the person quoting did.
Post by Jason White
It's better than what most free software projects offer by way of explanation.
It also shows that the developers have expertise in research related to
information retrieval systems. I haven't searched the Lucene site for
corresponding information, so this isn't a comparative comment; I'm just
making the point that the Xapian approach is thoroughly documented.
I think making the point for making the point isn't really what it's
about here. As always, it's about trying to improve things where
possible. A quote is a quote, and if the people behind them didn't want
to bring the matter to the mailing-list themselves, it's probably
because they don't have the time to spend discussing it. Most of the
decision takers today don't have the time to spend discussing anything;
that's a problem, but in my view it needs to be taken into account.
Post by Jason White
Post by Yannick Warnier
I haven't checked the truth of either of these quotes, so I can't comment
on them, but I think they must be taken into account if Xapian wants to
get a better public image.
These quotes are both uninformed regarding Xapian, so I suggest that they
shouldn't be taken seriously.
They are probably uninformed, but that doesn't make them necessarily
wrong.

Anyway, my intention is not to discuss all this, unless it is to find
a solution. My suggestion would be to have a part of the website, or a
subdomain or something, particularly addressed at decision takers and
commercial people. No project is ever adopted at very large scale
without *some* kind of lobbying from these people, and that's where I
think Xapian might reveal itself as the best solution.

And, of course, my quotes here were not intended to discourage the
team. Actually I think it's quite positive and should be taken as a
sign that people at high levels of authority are getting interested in
the project (at least they know the name now, so if they are
dissatisfied with Lucene...)

Yannick
Yannick Warnier
2007-02-03 11:29:40 UTC
On Saturday 3 February 2007 at 10:29 +0000, Jason White wrote:
[...]
Post by Jason White
Have you actually read the discussion of algorithms and the introduction to
information retrieval on the Xapian Web site?
Is this the one you're talking about?
http://www.xapian.org/docs/intro_ir.html

Thanks,

Yannick
James Aylett
2007-02-03 15:19:45 UTC
Post by Jason White
Post by Yannick Warnier
"I looked at the Xapian website, and it looked like it was a page
written by a 14 years old boy, whereas the Lucene website looks very
professional".
I *strongly* disagree.
The Xapian Web site was clearly written by an expert. It is well organized,
and informative.
I think the distinction needs to be drawn between content and
presentation (which includes things like executive
summaries). Xapian's website is very much for developers as it stands,
and also doesn't benefit from being hosted on the Apache Forrest
system, which gives it a whole load of CMS features straight away. (On
the other hand, I as a developer personally find the Apache sites
quite awkward, because it can be difficult to find the documentation
and download links.)

This is something we're aware of, however. We discussed back in the
summer the idea of having a more problem-and-solution orientated front
page, with an introductory section that didn't pre-suppose so much
information up front (the first paragraph requires a fair amount of
thinking if you don't know what IR is, and cites the GPL without
explanation or a link, both of which could be improved upon).

The main problem is for someone to have the time to do something about
it. Getting web presence right such that it can be sold into a CIO/CKO
based solely on that is a very tricky problem, especially without the
money for stock photography.
Post by Jason White
Post by Yannick Warnier
"From what I have read, Xapian people seem to consider their way of
treating the indexing process/algorithms as the biblical truth, that
doesn't have to be discussed, while Lucene explains a lot more what they
are doing and why".
Have you actually read the discussion of algorithms and the introduction to
information retrieval on the Xapian Web site?
Yeah, I don't entirely get that either. I spent ages at one point
trying to find out how the Lucene ranking algorithm works (which is
there, somewhere), and why it was chosen/developed in that way (which
I couldn't find). Xapian has, I think, all of that information
available. (Although again, it isn't presented as obviously as I'd
like; however going into the Docs page is a good bet, and there's a
paragraph pointing you in exactly the right direction.)
Post by Jason White
These quotes are both uninformed regarding Xapian, so I suggest that they
shouldn't be taken seriously.
As has been pointed out, the fact that someone *does* think in this
way needs to be taken seriously. I don't, however, think they point to
a content problem with the website.

J
Jim
2007-02-03 18:16:53 UTC
Post by James Aylett
Post by Jason White
Post by Yannick Warnier
"I looked at the Xapian website, and it looked like it was a page
written by a 14 years old boy, whereas the Lucene website looks very
professional".
I *strongly* disagree.
The Xapian Web site was clearly written by an expert. It is well organized,
and informative.
I think the distinction needs to be drawn between content and
presentation (which includes things like executive
summaries). Xapian's website is very much for developers as it stands,
and also doesn't benefit from being hosted on the Apache Forrest
system, which gives it a whole load of CMS features straight away. (On
the other hand, I as a developer personally find the Apache sites
quite awkward, because it can be difficult to find the documentation
and download links.)
I agree: ignoring content, the Lucene site just looks better. It is
definitely better eye candy, and that appeals to non-technical folks.
I've implemented both Lucene and Xapian search engines, but you see
which list I'm active on.

Jim.
Jason White
2007-02-04 00:35:22 UTC
Post by James Aylett
This is something we're aware of, however. We discussed back in the
summer the idea of having a more problem-and-solution orientated front
page, with an introductory section that didn't pre-suppose so much
information up front (the first paragraph requires a fair amount of
thinking if you don't know what IR is, and cites the GPL without
explanation or a link, both of which could be improved upon).
If someone does decide to add introductory material, could you make sure that
the technical details are easy to find directly from the main page, as they
are now? As you note, the Apache site is a particularly bad example: I have
had to resort to a search engine more than once to find information and
documentation there.
James Aylett
2007-02-04 02:21:33 UTC
Post by Jason White
If someone does decide to add introductory material, could you make sure that
the technical details are easy to find directly from the main page, as they
are now? As you note, the Apache site is a particularly bad example: I have
had to resort to a search engine more than once to find information and
documentation there.
That's very important, yes - thanks for pointing that out. I was
actually thinking of something a /little/ in the style of Drupal
<http://drupal.org/>, so you have direct access at the top of the page
to all the significant sections of the site, plus a very simple
overview with links to the most useful things for first-time users
(feature list, fuller description, quick start guide... some
screenshots of basic Omega themes would be nice, but would require
more time :-).

I certainly wouldn't want anything directly accessible from the front
page to disappear, or for any item on the menu to go away.

(While I'd love to get stuck into this myself, including a theme
design for the site as a whole, there's no way I'm going to find time
in at least the next six months, more likely the next
eighteen. Accordingly I don't want to suggest streams of
hard-to-implement ideas in case I frighten away someone with more time
:-)

J
Olly Betts
2007-02-09 04:28:19 UTC
Post by Yannick Warnier
I haven't checked the truth of either of these quotes, so I can't comment
on them, but I think they must be taken into account if Xapian wants to
get a better public image.
I don't find these quotes helpful. Sure the Xapian website isn't
perfect, but "looks like it was written by a 14 year old boy" isn't a
useful bug report.

Similarly, we do discuss the indexing and ranking algorithms, and if you
don't like them, you can even implement your own!

If you really want to help make the website better, constructive
comments are certainly welcome. But less of the trolling please!

Cheers,
Olly
Bill Crawford
2007-02-02 12:33:08 UTC
Post by Jeff Breidenbach
This was my bottom line analysis in September 2006. Since then
Xapian has made significant progress on UTF-8.
http://spreadsheets.google.com/pub?key=pKHp5ItRZ0SUL0PKhN_ssfA
Just out of curiosity, has anyone had success using UTF-8 via the perl API
wrapper? I've been able to add stuff TO the index using UTF-8 data, via perl,
but when I try to search, I get no results on any query using accented chars
(but "term_exists" returns true, and I can successfully query from a C++
program, so I know the data is there).
--
http://www.lost.eu/175db
Peter Karman
2007-02-02 14:40:18 UTC
Post by Bill Crawford
Just out of curiosity, has anyone had success using UTF-8 via the perl API
wrapper? I've been able to add stuff TO the index using UTF-8 data, via perl,
but when I try to search, I get no results on any query using accented chars
(but "term_exists" returns true, and I can successfully query from a C++
program, so I know the data is there).
can you post a small example?
--
Peter Karman . http://peknet.com/ . ***@peknet.com
Bill Crawford
2007-02-02 15:10:43 UTC
Post by Peter Karman
can you post a small example?
I've updated to the "trunk" snapshot and all seems to be well, so I'm going to
assume whatever the problem was is fixed. I'm still trying to get my head
around perl's utf8 handling ...
--
http://www.lost.eu/175db
Olly Betts
2007-02-09 04:30:52 UTC
Post by Bill Crawford
I've updated to the "trunk" snapshot and all seems to be well, so I'm going to
assume whatever the problem was is fixed.
In 0.9.9, the QueryParser assumes the character set is iso-8859-1, which
would explain your problem. In SVN trunk, QueryParser assumes the
character set is utf-8.
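
To illustrate why a charset mismatch like this makes accented queries
miss: the same UTF-8 bytes, read as iso-8859-1, become a different
string, so a term built from them no longer matches what was indexed.
This is only a Python sketch of the general effect; the sample word is
made up, and it does not reproduce the QueryParser's actual internals.

```python
# Sample query word with one accented character (e-acute, U+00E9).
term = "caf\u00e9"

# The bytes a UTF-8 application would pass to the search library.
utf8_bytes = term.encode("utf-8")

# The text a component sees if it assumes those bytes are iso-8859-1:
# the two-byte sequence for e-acute turns into two separate characters.
as_latin1 = utf8_bytes.decode("iso-8859-1")

print(len(term), len(as_latin1))  # 4 5
print(as_latin1 == term)          # False

# Decoding with the charset the data was written in round-trips cleanly.
print(utf8_bytes.decode("utf-8") == term)  # True
```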

Cheers,
Olly
