Nov 16, 2024 - #Lucene, #OpenAI, #RAG
This is a guide on how to implement retrieval-augmented generation (RAG) with Lucene as the vector search engine. The goal is to find relevant data in Lucene and pass it to the AI as additional context when answering a query.
This is especially useful if you have more data than fits into the context window of the AI model. Instead, you can search for the most relevant documents via Lucene and pass only those in the context.
The basic algorithm is:
1. Split the input documents into smaller chunks.
2. Compute an embedding for each chunk and store both in a Lucene index.
3. At query time, compute an embedding for the query and search the index for the most similar chunks.
4. Pass the retrieved chunks as additional context to the chat model.
We will be using Lucene for indexing and searching, and the OpenAI API for generating embeddings and for chat completions.
Add the following dependencies to your build.gradle.kts file:
implementation("org.apache.lucene:lucene-core:10.0.0")
implementation("org.apache.httpcomponents.client5:httpclient5-fluent:5.4.1")
implementation("com.google.code.gson:gson:2.11.0")
There are three steps to index documents:
1. Split each document into smaller chunks.
2. Compute an embedding for each chunk.
3. Store each chunk together with its embedding in the Lucene index.
Before indexing the documents directly, we split them into smaller parts. This has two advantages:
1. Each chunk stays within the input limit of the embedding model.
2. Search results are more focused, so only the truly relevant passages end up in the AI context.
How to split depends on the type of input data. For example, when using Markdown files, each document is split at its headers and subheaders to create smaller chunks. To not completely lose the context of each chunk, we prepend all enclosing headers to it. The goal is that each chunk is <= 8192 characters; since a token is always at least one character, this stays within the OpenAI token limit for embedding inputs (roughly 8,000 tokens).
This is a very simple variation of such a splitting algorithm:
List<String> splitMarkdown(String document) {
    var headerTokens = List.of("# ", "## ", "### ", "#### ");
    var lines = document.split("\n");
    var headers = new ArrayList<String>(); // enclosing headers of the current chunk
    var currentChunk = new StringBuilder();
    var result = new ArrayList<String>();
    for (var line : lines) {
        // Does this line start a new (sub)section?
        var header = headerTokens.stream()
                .filter(line::startsWith)
                .findFirst();
        if (header.isPresent()) {
            // Close the previous chunk, prefixed with all of its enclosing headers
            var chunk = String.join("\n", headers) + currentChunk;
            if (!chunk.isBlank()) {
                result.add(chunk);
            }
            currentChunk.setLength(0);
            var level = headerTokens.indexOf(header.get());
            // Drop headers that are at the same or a deeper level
            while (level < headers.size()) {
                headers.removeLast();
            }
            headers.add(line + "\n");
        } else {
            currentChunk.append(line).append("\n");
        }
    }
    var lastChunk = String.join("\n", headers) + currentChunk;
    if (!lastChunk.isBlank()) {
        result.add(lastChunk);
    }
    return result;
}
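With the small guard against blank chunks in place, here is what the splitter does with a tiny (hypothetical) document:
// Hypothetical input, to illustrate the chunking behavior
var chunks = splitMarkdown("""
        # Getting started
        Install the tool.
        ## Configuration
        Set the API key.
        """);
// chunks.get(0) -> "# Getting started\nInstall the tool.\n"
// chunks.get(1) -> "# Getting started\n\n## Configuration\nSet the API key.\n"
Note how the second chunk carries its parent header along, so the embedding still "knows" which section the text belongs to.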
To create embeddings, we will use the OpenAI Embeddings API (Docs)
The function takes any text and returns a vector embedding of the text. The following code shows how to make a simple HTTP call to the OpenAI API.
public static float[] computeEmbedding(String input) throws IOException {
var requestBody = GSON.toJson(new EmbeddingRequest(input, "text-embedding-3-small"));
var response = Request.post("https://api.openai.com/v1/embeddings")
.addHeader("Content-Type", "application/json")
.addHeader("Authorization", "Bearer " + API_KEY)
.bodyString(requestBody, ContentType.APPLICATION_JSON)
.execute();
var embeddingResponse = GSON.fromJson(response.returnContent().asString(),
EmbeddingResponse.class);
return embeddingResponse.data().getFirst().embedding();
}
record EmbeddingRequest(String input, String model) {}
record EmbeddingResponse(String model, List<EmbeddingData> data) {}
record EmbeddingData(int index, float[] embedding) {}
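A quick usage sketch; text-embedding-3-small returns 1536-dimensional vectors:
float[] vector = computeEmbedding("What is a vector database?");
System.out.println(vector.length); // 1536 for text-embedding-3-small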
Now we can combine these functions to build our index. The inputDocuments are assumed to be a list of Markdown documents.
Our Lucene index will have two fields per document:
- contents for the original text
- contents-vector for the embedding

void createIndex(List<String> inputDocuments) throws IOException {
    // Prepare documents by splitting them into smaller chunks
    var chunks = inputDocuments.stream()
            .flatMap(doc -> splitMarkdown(doc).stream())
            .toList();
    // Where the index will be stored
    var indexDirectory = FSDirectory.open(Paths.get("./index/"));
    var analyzer = new StandardAnalyzer();
    var config = new IndexWriterConfig(analyzer);
    // Setup config with custom codec to allow for vectors of length 2048.
    // Necessary if the used embeddings are >1024 dimensions, since the
    // Lucene default maximum is 1024.
    // See below for the implementation of CustomCodec
    config.setCodec(new CustomCodec());
    try (var indexWriter = new IndexWriter(indexDirectory, config)) {
        for (String input : chunks) {
            // Index document with both the original text and its embedding
            var document = new Document();
            document.add(new TextField("contents", input, Field.Store.YES));
            var vector = OpenAI.computeEmbedding(input);
            // DOT_PRODUCT assumes unit-length vectors; OpenAI embeddings
            // are already normalized, so this is safe here.
            document.add(new KnnFloatVectorField("contents-vector", vector,
                    VectorSimilarityFunction.DOT_PRODUCT));
            indexWriter.addDocument(document);
        }
    }
}
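For example, if the Markdown files sit in a local docs/ folder (the folder name and .md extension are assumptions for illustration), building the index could look like this:
try (var paths = Files.walk(Paths.get("./docs/"))) {
    var documents = paths
            .filter(p -> p.toString().endsWith(".md"))
            .map(p -> {
                try {
                    return Files.readString(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            })
            .toList();
    createIndex(documents);
}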
By default, Lucene allows a maximum of 1024 dimensions for a vector field. However, OpenAI embeddings are larger: text-embedding-3-small returns 1536 dimensions. (Though it is possible to truncate the OpenAI embeddings to 1024 values.)
But it is also possible to extend Lucene with a custom codec that allows more than 1024 dimensions. We just have to wrap the Lucene default codec with a custom vector format. In that vector format we delegate to the Lucene default format, but override the getMaxDimensions method.
/**
 * Custom codec that wraps the default Lucene100Codec and allows for
 * vectors of length 2048. Otherwise, the codec delegates to the default
 * Lucene99HnswVectorsFormat.
 */
public class CustomCodec extends FilterCodec {

    public CustomCodec() {
        super("CustomCodec", new Lucene100Codec());
    }

    @Override
    public KnnVectorsFormat knnVectorsFormat() {
        return new KnnVectorsFormat("CustomVectorsFormat") {
            private final KnnVectorsFormat delegate = new Lucene99HnswVectorsFormat();

            @Override
            public int getMaxDimensions(String fieldName) {
                // Raise the limit from the default 1024 dimensions
                return 2048;
            }

            @Override
            public KnnVectorsWriter fieldsWriter(SegmentWriteState state) throws IOException {
                return delegate.fieldsWriter(state);
            }

            @Override
            public KnnVectorsReader fieldsReader(SegmentReadState state) throws IOException {
                return delegate.fieldsReader(state);
            }
        };
    }
}
This codec must also be made available when searching, because Lucene resolves codecs by name via Java's ServiceLoader mechanism. This can be done by creating a file named META-INF/services/org.apache.lucene.codecs.Codec on the classpath, containing the fully qualified name of the codec class:
dev.giger.CustomCodec
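With a standard Gradle project layout, that file would live at src/main/resources/META-INF/services/org.apache.lucene.codecs.Codec.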
The index can now be searched like any other Lucene index. We use KnnFloatVectorQuery to build a query for an embedding vector. The given text query is also first transformed into an embedding via the OpenAI API.
The number of results depends on the amount of context that can be given to the final AI call. In this example, we will use the top 10 results.
List<SearchResult> searchIndex(String queryString) throws IOException {
    var indexDirectory = FSDirectory.open(Paths.get("./index/"));
    try (var indexReader = DirectoryReader.open(indexDirectory)) {
        var searcher = new IndexSearcher(indexReader);
        int numResults = 10;
        // Embed the query text and search for the nearest indexed vectors
        var query = new KnnFloatVectorQuery(
                "contents-vector",
                computeEmbedding(queryString),
                numResults
        );
        var topDocs = searcher.search(query, numResults);
        var storedFields = searcher.storedFields();
        var result = new ArrayList<SearchResult>();
        for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
            var doc = storedFields.document(scoreDoc.doc);
            result.add(new SearchResult(
                    scoreDoc.score,
                    // Return the stored original text, not Document#toString()
                    doc.get("contents")
            ));
        }
        return result;
    }
}
record SearchResult(float score, String content) {}
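A quick look at what comes back (the query is hypothetical; scores depend on your data):
for (var hit : searchIndex("How do I configure the API key?")) {
    System.out.printf("%.3f  %.60s%n", hit.score(), hit.content().replace('\n', ' '));
}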
We have now queried our Lucene index and received a list of the 10 most relevant chunks. Using this context is as simple as passing the additional information when querying the AI. Here is the code to send such a request to the OpenAI API:
String completion(String query, List<String> context) throws IOException {
    // System prompt that contains the retrieved chunks as knowledge
    var prompt = "You are an assistant that answers questions.\n\n"
            + "Answer the question based on the following knowledge:\n\n"
            + String.join("\n\n", context);
    var messages = List.of(
            new CompletionMessage("system", prompt),
            new CompletionMessage("user", query)
    );
    var requestBody = GSON.toJson(new CompletionRequest("gpt-4o-mini", messages));
    var response = Request.post("https://api.openai.com/v1/chat/completions")
            .addHeader("Content-Type", "application/json")
            .addHeader("Authorization", "Bearer " + API_KEY)
            .bodyString(requestBody, ContentType.APPLICATION_JSON)
            .execute();
    var completionResponse = GSON.fromJson(response.returnContent().asString(),
            CompletionResponse.class);
    // Return the text of the first (and only) choice
    return completionResponse.choices().getFirst().message().content();
}
record CompletionRequest(String model, List<CompletionMessage> messages) {}
record CompletionMessage(String role, String content) {}
record CompletionResponse(String id, String model, List<CompletionChoice> choices) {}
// OpenAI uses snake_case in its JSON, so finish_reason needs an explicit mapping
record CompletionChoice(int index, CompletionMessage message,
        @SerializedName("finish_reason") String finishReason) {}
All the previous code can be combined for a query like this:
String queryAI(String query) throws IOException {
    var context = searchIndex(query);
    return completion(query, context.stream().map(SearchResult::content).toList());
}
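Putting it to use, with the exact question from the example below:
System.out.println(queryAI("What are the characteristics of the GIDO123XYZ microcontroller?"));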
Since the OpenAI model already has quite extensive knowledge, it will be interesting to see how much the additional context from Lucene will improve the answers.
So as an example, I will ask it about a made-up microcontroller that it can’t know about:
> What are the characteristics of the GIDO123XYZ microcontroller?
As of my last update in October 2023, there is no widely recognized microcontroller specifically named "GIDO123XYZ."
It's possible that it may be a new or niche product that was released after my training data was compiled, or it could
be a fictional or hypothetical example.
Now we can add some information about this specific microcontroller to our database. We just add another document with the following content:
# GIDO123XYZ Microcontroller
Advanced AI capabilities, 123 teraflops of processing power, 5G connectivity, 10-year battery life, supports quantum
computing, built-in security features. Perfectly suited for AI-powered IoT devices.
Let’s ask the AI again with this added context:
> What are the characteristics of the GIDO123XYZ microcontroller?
The GIDO123XYZ microcontroller has the following characteristics:
- Advanced AI capabilities
- 123 teraflops of processing power
- 5G connectivity
- 10-year battery life
- Support for quantum computing
- Built-in security features
It is perfectly suited for AI-powered IoT devices.
Of course this is a very simple example, but it shows that the AI can easily use the added context to generate a more accurate answer.