How to turn your NucliaDB into a code search engine!
In our previous post we introduced you to NucliaDB and our SDK. Today we’ll dive deeper into how and why NucliaDB will make your life easier.
Making your life easier sounds like a lot, but if you are a data scientist or an NLP person and have found yourself in any of these situations, this is an article for you:
- You’ve struggled to decide which sentence embeddings to use for your corpora
- You’ve gotten fed up with exploring sentence similarity by comparing vectors one to one
- You haven’t known where or how to store all the different sets of embeddings you need or want to try for your data in an efficient and accessible manner
If this does not sound like you at all, but you’d still like to learn how to build a search engine that queries code with natural language, then stick around, because that’s exactly what we’ll achieve with very little effort.
For any newcomers to NucliaDB, let’s recall that it is an open source database ideal for indexing files, text, vectors, labels and annotations. On top of that, it lets you search your database efficiently, filtering by label and combining full text and semantic search.
How all this fits together, we’ll see in a bit 🙂
Reading the docs is cool, but guess what is cooler
Having a magic 8-ball that tells you which function you need to achieve what you want is every developer’s dream. We can’t get you one of those, but we’ll help you get as close as we can.
What we will do is build a dataset with the Python functions we are interested in and perform semantic search on them with natural language. With this, you’ll be able to input the functionality you want, and your search results will show the specific function that best matches that functionality.
The process is similar to standard semantic search, but instead of generating embeddings with models trained only with natural language, we’ll use models that have been trained both with code and English to encode our data. As you may have guessed, the role of NucliaDB in this will be as a database for our dataset of code and embeddings, and also as a tool to perform the semantic search.
Let’s make it a bit more fun!
Since the idea is also for you to get to know the SDK better, we’ll go a bit meta: we will index the functions from our own nucliadb-sdk library, so you won’t even have to read our docs (they are quite nice, though); just search for what you want to do and NucliaDB will give you the functions.
So, by the end of this article we’ll have a NucliaDB that points us to the right nucliadb-sdk function when we describe it. For example, if we searched for "create a new knowledge box", it would point us to functions such as create_knowledge_box and get_or_create, the ones we use to create a data container in NucliaDB.
Step by step
Now that our goal is clear, let’s recap the steps we will take:
- Prepare our data: collect functions from a library/codebase we are interested in, in our case, nucliadb-sdk
- Dive into embeddings: select a couple of embeddings models for sentence similarity that have been trained both in code and natural language
- Load our NucliaDB: Calculate embeddings and store them in our NucliaDB together with the code they represent
- Search, explore and enjoy: Use nucliadb-sdk to perform semantic searches on them and compare results between models
Get your NucliaDB ready
As always, first we need to make sure you have your NucliaDB up and running. If not, just start your local NucliaDB with Docker like this:
docker run -it \
-e LOG=INFO \
-p 8080:8080 \
-p 8060:8060 \
-p 8040:8040 \
-v nucliadb-standalone:/data \
nucliadb/nucliadb
Prepare your data
Once this is done, we need to get our data. In the notebook you have a simple function, get_all_code, that extracts all the functions from nucliadb-sdk, but you could also use whatever library or codebase you want.
This is how we get ours:
my_functions = [i.strip() for i in get_all_code(nucliadb_sdk)]
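The post doesn’t show get_all_code itself, but a minimal sketch of how such a helper could work with Python’s standard inspect module might look like this (the function name and filtering logic are assumptions; the notebook’s version may differ). Here we demonstrate it on the standard json module instead of nucliadb_sdk:

```python
import inspect
import json


def get_all_code(module):
    """Yield the source of every function defined in `module`.

    Hypothetical sketch of the notebook's helper; the real one may differ.
    """
    for _, obj in inspect.getmembers(module, inspect.isfunction):
        # Keep only functions actually defined in this package,
        # not ones re-exported from elsewhere
        if obj.__module__ and obj.__module__.startswith(module.__name__):
            yield inspect.getsource(obj)


# Example with a standard library module instead of nucliadb_sdk
my_functions = [f.strip() for f in get_all_code(json)]
print(len(my_functions))
```

Swapping json for nucliadb_sdk (or any other imported library) gives you the list of function sources to index.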
Dive into embedding models
We’ve got the data; now we need to select the models we’ll use to calculate the embeddings. In my case, after getting lost in Hugging Face tabs for a bit, I selected these two, both trained on the well-known CodeSearchNet dataset:
(For those who would like to explore a bit more, you can find a couple more models in the notebook)
Deciding which model to use is often tricky, because even though it’s quite useful, trying out sentence similarity with examples in the HF inference widget sometimes just doesn’t cut it. Trying several models locally is an option, but calculating the vectors** for all the functions we want to explore, keeping them in memory and computing similarity over all of them can get complicated and memory intensive.
** Note: we’ll use the terms vectors and embeddings interchangeably throughout this article.
Time to use our NucliaDB!
With NucliaDB we solve this in a heartbeat:
- We can easily store data and the embeddings from the two models locally
- Run as many semantic searches as we want; each one returns the closest matches for a given query across our whole dataset. This way we can compare the results from the models quickly and efficiently, and say goodbye to comparing sentences one to one
Before we start coding, let’s get familiar with some of the NucliaDB lingo:
- KnowledgeBox, our concept of a data container.
- Vectorset, set of vectors from the same model. We can define as many Vectorsets as we want for each KB.
- Search, we can perform searches over our text fields, but also over any of our defined Vectorsets. The text search looks for exact matches, while the vector search returns the resources whose vectors have the highest cosine similarity to the query.
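For intuition, the "highest cosine similarity" ranking used by the vector search boils down to one standard formula. Here is a quick NumPy illustration of it (not NucliaDB’s actual implementation):

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Vectors pointing the same way score 1.0, orthogonal vectors score 0.0
same = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
print(same, orthogonal)
```

The score only depends on direction, not magnitude, which is why embeddings of semantically similar sentences rank high even if their raw values differ.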
Once this is clear, let’s create a new KnowledgeBox where we will store our data:
my_kb = create_knowledge_box("my_code_search_kb")
We will need the models to calculate the vectors, so let’s load them:
from sentence_transformers import SentenceTransformer

model_t5 = SentenceTransformer("krlvi/sentence-t5-base-nlpl-code_search_net")
model_distilroberta = SentenceTransformer("flax-sentence-embeddings/st-codesearch-distilroberta-base")
Now we are ready to calculate the vectors for each function and store them in our KB, uploading the text of each function together with a label and one vector per model:
for i in range(len(my_functions)):
    label = "nucliadb_sdk"
    my_kb.upload(
        text=my_functions[i],
        labels=[label],
        vectors={"t5": model_t5.encode([my_functions[i]])[0],
                 "distilroberta": model_distilroberta.encode([my_functions[i]])[0]},
    )
*We added the label with the module name in case we wanted to add functions from another library later on and differentiate them.
Now the magic starts
Once our data is saved, our Vectorsets are created automatically and we can perform all the searches we’d like without crashing our computer! We just need to calculate the vector for our query, select the Vectorset and start searching; the code needed is the following:
import re

query = ["create a new knowledge box"]
query_vector = model_t5.encode(query)[0]
results_t5 = my_kb.search(vector=query_vector, vectorset="t5")
for result in results_t5:
    print("Function name:", re.findall(r"def ([^\(]+)", result.text), end=" -- ")
    print("Similarity score:", result.score)
Function name: create_knowledge_box -- Similarity score: 0.6352006196975708
Function name: get_kb -- Similarity score: 0.4774329662322998
Function name: get_labels -- Similarity score: 0.4565504193305969
Function name: get_entities -- Similarity score: 0.4362731873989105
Function name: async_length -- Similarity score: 0.4227059781551361
Function name: get_or_create -- Similarity score: 0.35420358180999756
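The re.findall pattern used in the loop above simply pulls the function name out of each returned source snippet; for example:

```python
import re

# A snippet like the ones stored in our KB (hypothetical example)
snippet = "def create_knowledge_box(slug: str):\n    ..."
names = re.findall(r"def ([^\(]+)", snippet)
print(names)  # → ['create_knowledge_box']
```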
Here we can easily explore the search results and the data associated with them, like labels or the similarity score.
We will start by searching for "create a new knowledge box" because, even though it looks like an easy query, if we tried to look for the same thing with full text search, we wouldn’t get any results.
But with semantic search, just like we predicted at the beginning of the post, the DistilRoberta model points us to these functions:
- create_knowledge_box, the function that creates a KnowledgeBox
- get_or_create, the function that, given a KB name, creates it if it does not exist or points to it if it does.
In this case, the T5 model misses get_or_create and points to some less related functions, like get_kb and get_labels.
Let’s run some more searches to see how the models perform!
When we search for "upload vectors", both models return the functions we use to upload resources and vectors as the top two matches.
If we look up "create labels", the best match in both models is set_labels, which is indeed the right function to update the labels of a resource.
In this particular case we only aim to have a nice code search, but if we wanted to deploy our semantic search somewhere, we’d probably go for the DistilRoberta model, since it seems to give more consistent results and it’s lighter.
Anyhow, I encourage you to open the notebook and play with it: try different models, index other libraries, or whatever takes your fancy.
Besides having a code search system, another possible use case would be labelling this data to train a downstream task, like a classifier for pythonic/non-pythonic code. That would also be super easy to do with nucliadb-sdk and nucliadb-dataset, but we’ll get into more detail in the next tutorial!