giger.dev

Creating a Simple Search With Lucene

Gradle Config

We will be using the following libraries:

implementation group: 'org.apache.lucene', name: 'lucene-core', version: '8.2.0'
implementation group: 'org.apache.lucene', name: 'lucene-queryparser', version: '8.2.0'
implementation group: 'org.apache.lucene', name: 'lucene-suggest', version: '8.2.0'

Create Index

Directory indexDirectory = FSDirectory.open(Paths.get("./index/"));

The index is stored on disk in this folder.

Analyzer analyzer = new StandardAnalyzer();

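The analyzer defines how text is split into tokens before it is indexed. The StandardAnalyzer uses the Lucene StandardTokenizer followed by a LowerCaseFilter and a StopFilter.
This means all tokens are normalized to lowercase and any configured stop-words are removed. Note that in Lucene 8.x the no-argument constructor uses an empty stop-word set, so stop-words are only removed if you pass a set to the constructor.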
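To make the effect of such an analysis chain concrete, here is a rough plain-Java sketch of the three steps (tokenize, lowercase, drop stop-words). It does not use Lucene at all: the class `AnalysisSketch`, the regex-based splitting and the tiny stop-word list are assumptions made for illustration, and the real StandardTokenizer follows Unicode text segmentation rules that this sketch only approximates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class AnalysisSketch {
    // Hypothetical stop-word list for illustration only; not Lucene's.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "a", "of"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Crude stand-in for a tokenizer: split on anything that is not a letter or digit.
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue; // a leading delimiter produces an empty first element
            }
            String token = raw.toLowerCase(Locale.ROOT); // lowercase step
            if (!STOP_WORDS.contains(token)) {           // stop-word step
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Quick Brown Fox")); // prints [quick, brown, fox]
    }
}
```

Because the same steps are applied to the query at search time, a search for "QUICK" and a search for "quick" end up looking for the same token.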

The basics of analysis are also described in the Lucene docs: Analysis overview

Add Documents

IndexWriterConfig config = new IndexWriterConfig(analyzer);
IndexWriter indexWriter = new IndexWriter(indexDirectory, config);
Document document = new Document();
document.add(new TextField("field", "value", Field.Store.YES));
indexWriter.addDocument(document);
// Once all documents have been added, close the writer to commit the changes
indexWriter.close();

Make sure to create the index writer only once, not for every individual document.

In this example I used a TextField, which tokenizes and indexes the field. However, there are other field types, as described here: Lucene Field Documentation

Search the Index

Directory indexDirectory = [...]; // Directory with index
IndexReader indexReader = DirectoryReader.open(indexDirectory);
Analyzer analyzer = [...]; // Same Analyzer used to build the index
String[] fields = [...]; // Array of all the fields to search
int numberOfResults = 10;

IndexSearcher searcher = new IndexSearcher(indexReader);
Query query = new MultiFieldQueryParser(fields, analyzer).parse(queryString);
TopDocs topDocs = searcher.search(query, numberOfResults);

// Extracting the results
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    Document doc = searcher.doc(scoreDoc.doc);
    System.out.println(doc.toString());
}

This setup searches multiple fields of all documents.

Weighted Search (Score Boosting)

Sometimes you might want a match in one field to count more towards the ranking than a match in another field.
Lucene calls this concept ‘Score Boosting’. The boost is multiplied into the score that the field contributes to the result.
Boosting can be done directly when creating the query:

String[] fields = {"importantField", "normalField"};
Map<String, Float> boosts = new HashMap<>();
boosts.put("importantField", 1.5f);
boosts.put("normalField", 1.0f);

Query query = new MultiFieldQueryParser(fields, analyzer, boosts)
    .parse(queryString);
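The arithmetic behind the boost can be sketched in plain Java. This is a simplified model, not Lucene's scoring code: it assumes each matching field contributes a score, that the contribution is multiplied by the field's boost, and that the contributions are summed. The class `BoostSketch` and the per-field scores are made up for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

class BoostSketch {
    // Sums the per-field scores, each multiplied by its boost (1.0 if no boost is set).
    static float combinedScore(Map<String, Float> fieldScores, Map<String, Float> boosts) {
        float total = 0f;
        for (Map.Entry<String, Float> entry : fieldScores.entrySet()) {
            float boost = boosts.getOrDefault(entry.getKey(), 1.0f);
            total += boost * entry.getValue();
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Float> boosts = new HashMap<>();
        boosts.put("importantField", 1.5f);
        boosts.put("normalField", 1.0f);

        // Two hypothetical documents with the same raw score, matched in different fields.
        float docA = combinedScore(Collections.singletonMap("importantField", 2.0f), boosts);
        float docB = combinedScore(Collections.singletonMap("normalField", 2.0f), boosts);

        System.out.println(docA); // prints 3.0
        System.out.println(docB); // prints 2.0
    }
}
```

With equal raw scores, the document that matched in importantField ranks first because its contribution is scaled by 1.5.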

Unfortunately, boosts set this way do not work for prefix and wildcard queries. However, there is a workaround: changing the
rewrite method makes the parser take the boost into account, at the cost of a less efficient algorithm.

MultiFieldQueryParser queryParser = new MultiFieldQueryParser(fields, analyzer, boosts);
queryParser.setMultiTermRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_REWRITE);

Autosuggest

Lucene can provide suggestions to the user. For example, if the query contains a spelling mistake, Lucene can
suggest the correct spelling.

To do this, the lucene-suggest module is needed. Using the already existing index of documents, we can build a
LuceneDictionary and use this as an input for our suggester.

IndexReader indexReader = [...]; // Reader to the existing index
Analyzer analyzer = [...]; // Analyzer for the suggester
String filePrefix = [...]; // Prefix for the temporary files the suggester writes while building
String field = [...]; // Name of the field that should be used for suggestions

Lookup suggester = new FuzzySuggester(indexDirectory, filePrefix, analyzer);
LuceneDictionary dictionary = new LuceneDictionary(indexReader, field);
suggester.build(dictionary);

As a suggester we are using FuzzySuggester. An alternative is AnalyzingInfixSuggester, which also matches terms that appear in the middle of the suggestion text, not just at its start.

Then the suggestion lookup can be done like this:

int numberOfResults = 10;
List<Lookup.LookupResult> lookup = suggester.lookup(queryString, false, numberOfResults);
for (Lookup.LookupResult lookupResult : lookup) {
    System.out.println(lookupResult.key);
}
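The idea behind "fuzzy" matching can be illustrated with a small plain-Java sketch based on edit distance. This is not how FuzzySuggester is implemented (it works on prefixes and uses automata rather than comparing whole words one by one); the class `FuzzySketch`, the dictionary and the threshold are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FuzzySketch {
    // Classic Levenshtein distance: minimum number of insertions, deletions and substitutions.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns every dictionary term within maxEdits of the query, in dictionary order.
    static List<String> suggest(String query, List<String> dictionary, int maxEdits) {
        List<String> result = new ArrayList<>();
        for (String term : dictionary) {
            if (editDistance(query, term) <= maxEdits) {
                result.add(term);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> dictionary = Arrays.asList("lucene", "lucerne", "solr");
        System.out.println(suggest("lucine", dictionary, 1)); // prints [lucene]
    }
}
```

Here the misspelled query "lucine" is one substitution away from "lucene", so that term is suggested, while "lucerne" needs two edits and is left out.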

— Oct 28, 2019
