How to build your own AI search engine with NucliaDB


First things first…

What is NucliaDB?

Let me answer in a few words:

NucliaDB is an open-source vector database that stores, indexes, & retrieves vectors for AI search.

So… How about playing a bit with NLP models and using the vector search abilities of NucliaDB to implement an AI search feature?

You probably know StackExchange, the Q&A platform where you can ask questions and get answers from experts. If you are a developer, you probably know StackOverflow, the StackExchange site dedicated to developers.

Usually, developers can query StackOverflow with the right keywords and get the answer they are looking for, because the technical vocabulary used in the questions is very precise. Typically, if you are wondering how to vertically align text in a display: flex container, there are not many ways to phrase your question.

And that’s where full-text search is just fine. You can just type “align vertically flex” and you will get the answer you are looking for.

But when it comes to more abstract questions, it is not that easy. For example, if you go to Philosophy StackExchange and search for “what is the meaning of life”, you will get a lot of results, but they will be restricted to the posts mentioning “meaning of life” in the title or in the body, whereas you might be interested in a post named “Is our existence pointless?”.

It highlights two important things here:

Full-text search is sometimes not good enough.
Finding the meaning of life is more difficult than aligning text vertically in CSS (and yet it’s already really hard).

If you want to be able to search for “meaning of life” and get the post “Is our existence pointless?” as a result, you need AI search. You need the search engine to understand the meaning of your query, and not just look for the keywords.

So, let’s build an AI search engine with NucliaDB!

Principle

NucliaDB is the core component of the Nuclia platform. It is open-source and can be used standalone.

NucliaDB is a vector database. It means that it stores and indexes vectors, and it accepts a vector as a query parameter to retrieve matching results.
It also offers full-text search, meaning it can index and search for keywords in the text too.

To implement an AI search engine, you need to associate each text with its meaning. The meaning is represented by a vector produced by a model (a Natural Language Processing model).
A good way to picture it is to think of a map, a huge map where all the sentences are located according to their meaning. When two sentences have a close meaning, they are close on the map. When two sentences have a distant meaning, they are far apart on the map. The vector is simply the coordinates of the sentence on the map.

When indexing text content in NucliaDB, you need to encode each sentence into a vector and store these vectors in NucliaDB.
When you want to search NucliaDB for a question, you first encode the query using the same model, then search for the closest sentences on the map, and you get the answer you are looking for.
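
To make the map idea concrete, here is a minimal sketch in plain Python. It uses cosine similarity, which is one common way to measure how close two vectors are; it is not necessarily what NucliaDB does internally. The 2-D coordinates are invented for illustration (a real model produces hundreds of dimensions), and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 2-D "coordinates" standing in for real sentence vectors.
indexed = {
    "Is our existence pointless?": [0.9, 0.1],
    "How do I center text in a flex container?": [0.1, 0.9],
}


def closest(query_vector, sentences):
    """Return the indexed sentence whose vector is most similar to the query."""
    return max(sentences, key=lambda s: cosine_similarity(query_vector, sentences[s]))


# Made-up vector standing in for the encoded query "what is the meaning of life".
query = [0.8, 0.2]
print(closest(query, indexed))
```

The query never shares a keyword with the sentence it matches; only the coordinates decide which one is closest.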

In this example, you will use the paraphrase-MiniLM-L6-v2 model that can be found on Hugging Face.

Implementation

Step 1: Get the data

StackExchange is great because it offers a nice query tool able to filter and export the data you want.

For example, if you want to get all the posts from the Philosophy StackExchange site, you can run the following query:

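SELECT q.Title, q.Body,
       -- STRING_AGG (SQL Server 2017+) merges all the answers of a question into a single value
       (SELECT STRING_AGG(CAST(a.Body AS NVARCHAR(MAX)), ' ') FROM Posts a WHERE a.ParentId = q.Id) AS Answers
FROM Posts q
WHERE q.PostTypeId = 1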

And you will get a huge CSV with all the philosophy questions and their answers.
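
If you want to load that export in Python, a sketch with the standard csv module could look like the following. It assumes the column names are the ones produced by the query above (Title, Body, Answers); the sample data and the read_questions helper are invented for illustration.

```python
import csv
import io

# Tiny made-up sample standing in for the exported file.
sample_csv = (
    "Title,Body,Answers\r\n"
    '"Is our existence pointless?","<p>Why are we here?</p>","<p>It depends.</p>"\r\n'
)


def read_questions(csv_file):
    """Yield one dict per question row, keyed by lowercase field names."""
    for row in csv.DictReader(csv_file):
        yield {
            "title": row["Title"],
            "body": row["Body"],
            # An unanswered question may come with an empty Answers cell.
            "answers": row["Answers"] or "",
        }


questions = list(read_questions(io.StringIO(sample_csv)))
print(questions[0]["title"])
```

With the real file, you would pass open("your-export.csv", newline="", encoding="utf-8") instead of the in-memory sample. If some cells are very long, you may have to raise the parser limit with csv.field_size_limit.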

At the time I did it myself, there were 16,000 questions and 40,000 answers. That’s a lot of data! But NucliaDB can manage that easily. Let’s see how.

Step 2: Create a knowledge box

First, run a NucliaDB local instance with Docker:

docker run -it \
       -e LOG=INFO \
       -p 8080:8080 \
       -p 8060:8060 \
       -p 8040:8040 \
       -v nucliadb-standalone:/data \
       nuclia/nucliadb:latest

Now that you have NucliaDB running, you need to create a knowledge box. A knowledge box is a container for your data. It is a place where you can index your data and search in it.

It can be created from the REST API like this:

curl 'http://localhost:8080/api/v1/kbs' \
    -X POST \
    -H "X-NUCLIADB-ROLES: MANAGER" \
    -H "Content-Type: application/json" \
    --data-raw '{
  "slug": "philosophy",
  "title": "Philosophy StackExchange"
}'

This call returns the Knowledge Box ID that you will need for further API calls.

You can achieve the same thing in Python with nucliadb_client:

from nucliadb_client.client import NucliaDBClient

client = NucliaDBClient(host="localhost", grpc=8060, http=8080, train=8031)
kb = client.create_kb(slug="philosophy", title="Philosophy StackExchange Questions&Answers")

Step 3: Extract and encode sentences

When indexing a question and its answers, you need to pass 2 things to NucliaDB:

the text content itself (the question, its title, and the answers)
the vector representation of each sentence

It means you have to extract sentences from the texts.

There are smarter ways to do it, but for this demo, you will just go with a super simple approach: get the first-level HTML tags from your content. These will mostly be <p> tags, and sometimes <h1>, <h2>, or <ul> tags. And you will just assume these are “sentences”.

It can be done with BeautifulSoup:

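from bs4 import BeautifulSoup
# Parse the HTML and take the text of each first-level node as one "sentence"
tree = BeautifulSoup(text, features="html.parser")
sentences = [child.get_text(" ", strip=True) for child in tree.contents]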
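
If you prefer not to install anything, the same idea can be sketched with the standard html.parser module instead of BeautifulSoup. This is a swapped-in alternative, not the code used in the demo: the TopLevelText class and extract_sentences function are hypothetical names, the sketch assumes reasonably well-formed HTML (every non-void tag is closed), and unlike the snippet above it skips empty "sentences".

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TopLevelText(HTMLParser):
    """Collect the text of each first-level element as one "sentence"."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.parts = []      # text fragments of the current first-level element
        self.sentences = []

    def _flush(self):
        if self.parts:
            self.sentences.append(" ".join(self.parts))
            self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        self.depth = max(0, self.depth - 1)
        if self.depth == 0:
            self._flush()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.parts.append(text)
            if self.depth == 0:
                # Bare text outside any tag counts as its own sentence.
                self._flush()


def extract_sentences(html):
    parser = TopLevelText()
    parser.feed(html)
    parser.close()
    return parser.sentences


print(extract_sentences("<p>Hello <b>world</b></p>\n<ul><li>one</li><li>two</li></ul>"))
```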

Then, you need to encode the sentences into vectors. You can do it with the sentence-transformers library. First, you instantiate a model:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("paraphrase-MiniLM-L6-v2")

Then, you will use it to encode each sentence:

model.encode([sentence])

(you will use that in the next step)

Step 4: Index the data in NucliaDB

Now that you have extracted the sentences and encoded them into vectors, you can index everything in NucliaDB.

First, for each question, you create a resource:

payload = CreateResourcePayload()

payload.title = title
payload.icon = "text/html"
payload.metadata = InputMetadata()
payload.metadata.language = "en"

field = TextField(body=text)
field.format = TextFormat.HTML
payload.texts["body"] = field

resource = kb.create_resource(payload)

Then you index the text for full-text search:

pure_text = " ".join(sentences)
resource.add_text("body", FieldType.TEXT, pure_text)

And now, you index the vectors:

vectors = []
index = 0
for sentence in sentences:
    vector = Vector(
        start=index,
        end=index + len(sentence),
        start_paragraph=0,
        end_paragraph=len(pure_text),
    )
    index += len(sentence) + 1
    embeddings = model.encode([sentence])
    vector.vector.extend(embeddings[0])
    vectors.append(vector)

resource.add_vectors(
    "body",
    FieldType.TEXT,
    vectors,
)
resource.sync_commit()
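
The start and end values in the loop above are character offsets into pure_text. Because the sentences are joined with a single space, each offset moves forward by the sentence length plus one. Here is a small standalone sketch of that arithmetic, with no NucliaDB dependency; sentence_offsets is a hypothetical helper written for this illustration.

```python
def sentence_offsets(sentences, separator=" "):
    """Return (pure_text, offsets) where offsets[i] is the (start, end)
    position of sentences[i] inside the joined text."""
    offsets = []
    index = 0
    for sentence in sentences:
        offsets.append((index, index + len(sentence)))
        # Skip the sentence itself plus the separator that follows it.
        index += len(sentence) + len(separator)
    return separator.join(sentences), offsets


sentences = ["Is our existence pointless?", "I wonder.", "Any thoughts?"]
pure_text, offsets = sentence_offsets(sentences)

# Each (start, end) pair slices the original sentence back out of the text.
for sentence, (start, end) in zip(sentences, offsets):
    assert pure_text[start:end] == sentence
print(offsets)
```

If the offsets drift (for instance because the separator changes), the vector matches would point at the wrong piece of text, so this check is cheap insurance before indexing thousands of rows.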

Step 5: Search!

Once your 16,000 rows are indexed (yeah, it might take time depending on your laptop 🙂 ), you can start playing with your knowledge box!

NucliaDB comes with a default web page where you can search in your local knowledge boxes. You can access it at http://localhost:8080/widget/

Okay, it works: if you enter a query, you will get results. But unfortunately, it only retrieves full-text results.

If you try to search for “What is the meaning of life?”, you will get full-text results:

curl http://localhost:8080/api/v1/kb/f615c8ff-6ad3-42e0-9b24-ba6ae3b73cf3/search\?query\=what+is+the+meaning+of+life \
  -H "X-NUCLIADB-ROLES: READER" | jq ".fulltext.results"

You also get paragraph results (as they are produced by the full-text index):

curl http://localhost:8080/api/v1/kb/f615c8ff-6ad3-42e0-9b24-ba6ae3b73cf3/search\?query\=what+is+the+meaning+of+life \
  -H "X-NUCLIADB-ROLES: READER" | jq ".paragraphs.results"

But no sentence results (these correspond to the vector matches):

curl http://localhost:8080/api/v1/kb/f615c8ff-6ad3-42e0-9b24-ba6ae3b73cf3/search\?query\=what+is+the+meaning+of+life \
  -H "X-NUCLIADB-ROLES: READER" | jq ".sentences.results"

will return a sad []…

Why is that?

Because the text “What is the meaning of life?” cannot match any vector by itself. To search for vectors, you need a vector query!

So you first need to encode the query, and then pass the corresponding vector to the search endpoint:

# Quote the URL so the shell leaves & and [ ] alone; -g stops curl from reading [ ] as a range
curl -g 'http://localhost:8080/api/v1/kb/f615c8ff-6ad3-42e0-9b24-ba6ae3b73cf3/search?query=what+is+the+meaning+of+life&vector=[0.12242747843265533,-0.1455705165863037,-0.05579487234354019,…]' \
  -H "X-NUCLIADB-ROLES: READER" | jq ".sentences.results"

And now you get semantically-close results!
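
Since nobody wants to paste hundreds of floats by hand, here is a hedged sketch of how the search URL could be built in Python with the standard library. The build_search_url helper is hypothetical, the knowledge box ID is a placeholder, and the short vector is made up; the only thing taken from the curl call is the shape of the query and vector parameters. It also assumes the server accepts the percent-encoded form of the brackets and commas, as HTTP servers normally do.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(base_url, kb_id, query, vector):
    """Build the search URL carrying both the text query and the vector query."""
    params = {
        "query": query,
        # Serialise the floats as "[v1,v2,...]", the shape used in the curl call.
        "vector": "[" + ",".join(repr(float(v)) for v in vector) + "]",
    }
    return f"{base_url}/api/v1/kb/{kb_id}/search?{urlencode(params)}"


# Placeholder ID and a made-up 3-float vector; a real one comes from model.encode.
url = build_search_url(
    "http://localhost:8080",
    "my-kb-id",
    "what is the meaning of life",
    [0.5, -0.25, 0.125],
)
print(url)

# Decoding the query string gives back the original values.
decoded = parse_qs(urlsplit(url).query)
print(decoded["vector"][0])
```

The resulting URL can then be requested with the X-NUCLIADB-ROLES: READER header, exactly like the curl calls in this section.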

I ran a few tests, and I got the following results:

To the question “Are we responsible for what we desire?”, I got “Our necessity is our intentional to do something in line with at least our emotions and possibly our thinking”.
To the question “Can we think without language?”, I got “Can any one experience the World without a language? if yes to what extent, if no, why not?”.

Pretty relevant, right? For sure, no full-text search would ever retrieve that!

And if you are too lazy to manually type the vector query in your curl command (Come on! That’s only 384 float numbers! Show a little grit!), you can set up a proxy with FastAPI that will do it for you. You will find the implementation in the demo code.

And why not do it the easy way?

NucliaDB is a very nice piece of software, and if you are looking for a good way to store and query vectors, it is definitely a good choice. Not to mention that, if you have different models, you can store several vectors on your resources by just marking them with a specific vectorset name, and then you can query according to the vectorset you want.

But if you are just looking for a way to provide AI search to your users, there is a simpler way to do it: just use nuclia.cloud!

Nuclia Cloud is a SaaS service implemented on top of NucliaDB. It has the very same API, but it takes care of all the painful processing for you. You just need to upload your data (typically a PDF, or an MP4, or any kind of file), and it will extract the text, encode it into vectors, and index everything in NucliaDB.

Then, you can search in your data using a simple text query. Here again, Nuclia Cloud will encode the query into a vector, and then query NucliaDB to retrieve the most relevant results.

Moreover, the Nuclia Cloud online dashboard offers a way to ingest data from a CSV file directly by uploading it from your browser.

You can check the result on this demo page!

UPDATE 2023-07-03: here is the corresponding code, but be aware it is outdated as it is based on NucliaDBClient. The recommended approach now is to use the Nuclia Python SDK.