Monday, April 30, 2012

Lucene Generic Highlighter

I have pushed a project to GitHub that shows how to create a generic highlighter with Apache Lucene.

The original Lucene Highlighter is too tightly coupled to snippet highlighting, and it:
  • Does not make it easy to highlight a whole text
  • Handles only text, with a formatter that is strongly coupled to text
I have modified the original Lucene Highlighter to allow highlighting of "anything". The highlighter is now a callback instead of a formatter, and its purpose is to find terms in a whole text, each with a score. I have used this code to highlight XML, PDF, HTML... with or without Solr.
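
To illustrate the callback idea, here is a minimal sketch in plain Java. It is not the code from the GitHub project and does not use the Lucene API: the names `HighlightCallback`, `onTerm`, `highlight` and `collectOffsets` are hypothetical, the tokenization is a naive split on letters and digits, and the score is a constant placeholder. It only shows how a callback that receives term offsets leaves the caller free to decide what "highlighting" means for XML, PDF or HTML.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class CallbackHighlighterSketch {

    /** Hypothetical callback: receives each matched term instead of formatting text. */
    interface HighlightCallback {
        void onTerm(String term, int startOffset, int endOffset, float score);
    }

    /**
     * Scans the whole text (no snippet extraction) and reports every
     * occurrence of a query term to the callback.
     * Assumption: queryTerms are already lowercased.
     */
    static void highlight(String text, Set<String> queryTerms, HighlightCallback callback) {
        int i = 0;
        while (i < text.length()) {
            // Skip anything that is not part of a word.
            if (!Character.isLetterOrDigit(text.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            String token = text.substring(start, i).toLowerCase(Locale.ROOT);
            if (queryTerms.contains(token)) {
                // Constant placeholder score; a real highlighter would compute one.
                callback.onTerm(token, start, i, 1.0f);
            }
        }
    }

    /** Example callback use: collect [start, end) offsets instead of producing markup. */
    static List<int[]> collectOffsets(String text, Set<String> queryTerms) {
        List<int[]> offsets = new ArrayList<>();
        highlight(text, queryTerms, (term, start, end, score) -> offsets.add(new int[] {start, end}));
        return offsets;
    }
}
```

With such a callback, the same scan could feed an HTML writer, a PDF annotation layer or an XML transformer, since the highlighter itself never builds the output string.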

Note: This project is extracted from a larger project with submodules.