How to build your own code search engine with NucliaDB

Code Search with NucliaDB
  • You’ve struggled to decide which sentence embeddings to use for your corpora
  • You’ve gotten fed up with exploring sentence similarity by comparing vectors one to one
  • You didn’t know where or how to store all the different sets of embeddings you needed or wanted to try for your data in an efficient and accessible way

Reading the docs is cool, but guess what is cooler

Let’s make it a bit more fun!

Step by step

  • Prepare our data: collect functions from a library/codebase we are interested in; in our case, nucliadb-sdk
  • Dive into embeddings: select a couple of sentence-similarity embedding models that have been trained on both code and natural language
  • Load our NucliaDB: calculate the embeddings and store them in our NucliaDB together with the code they represent
  • Search, explore and enjoy: use nucliadb-sdk to run semantic searches over them and compare the results from the two models

Get your NucliaDB ready

docker run -it \
-e LOG=INFO \
-p 8080:8080 \
-p 8060:8060 \
-p 8040:8040 \
-v nucliadb-standalone:/data \
nuclia/nucliadb:latest

Prepare your data

my_functions = [i.strip() for i in get_all_code(nucliadb_sdk)]
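
The get_all_code helper is not part of nucliadb-sdk and the original snippet doesn’t show it, so here is one possible sketch of it, using Python’s inspect module to pull the source of every function and method defined in a module (the name and exact behaviour are assumptions):

import inspect
import nucliadb_sdk

def get_all_code(module):
    """Hypothetical helper: yield the source of every function and method defined in `module`."""
    # Top-level functions exposed by the module
    for _, obj in inspect.getmembers(module, inspect.isfunction):
        if obj.__module__.startswith(module.__name__):
            yield inspect.getsource(obj)
    # Methods of the classes exposed by the module
    for _, cls in inspect.getmembers(module, inspect.isclass):
        for _, method in inspect.getmembers(cls, inspect.isfunction):
            if method.__module__.startswith(module.__name__):
                yield inspect.getsource(method)

With something like this in place, my_functions ends up holding one string of source code per function in nucliadb-sdk.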

Dive into embedding models

For this example we pick two sentence-similarity models trained on both code and natural language: krlvi/sentence-t5-base-nlpl-code_search_net and flax-sentence-embeddings/st-codesearch-distilroberta-base, both loaded below with sentence-transformers.

Time to use our NucliaDB!

With it we can:
  • Easily store our data and the embeddings from the two models locally
  • Run as many semantic searches as we want, each returning the closest matches for a given query across the whole dataset, so we can compare the two models quickly and efficiently and say goodbye to comparing sentences one to one

Three concepts we will use:
  • KnowledgeBox: our concept of a data container
  • Vectorset: a set of vectors produced by the same model; we can define as many Vectorsets as we want for each KnowledgeBox
  • Search: we can search over our text fields, but also over any of our defined Vectorsets. Text search looks for exact matches, while vector search returns the results whose vectors have the highest cosine similarity to the query
from sentence_transformers import SentenceTransformer
from nucliadb_sdk import create_knowledge_box

# Create a KnowledgeBox to hold our functions and their embeddings
my_kb = create_knowledge_box("my_code_search_kb")

# Two sentence-similarity models trained on both code and natural language
model_t5 = SentenceTransformer("krlvi/sentence-t5-base-nlpl-code_search_net")
model_distilroberta = SentenceTransformer("flax-sentence-embeddings/st-codesearch-distilroberta-base")

label = "nucliadb_sdk"
for func in my_functions:
    my_kb.upload(
        text=func,
        labels=[f"code/{label}"],
        vectors={
            "distilroberta": model_distilroberta.encode([func])[0].tolist(),
            "t5": model_t5.encode([func])[0].tolist(),
        },
    )

Now the magic starts ✨

query = ["create a new knowledge box"]
query_vectors = model_t5.encode(query)[0]

results_t5 = my_kb.search(
    vector=query_vectors,
    vectorset="t5",
    min_score=0.4)

import re

for result in results_t5:
    print("Function name:", re.findall(r"def ([^\(]+)", result.text)[0], end=" -- ")
    print("Similarity score:", result.score)
Function name: create_knowledge_box -- Similarity score: 0.6352006196975708
Function name: get_kb -- Similarity score: 0.4774329662322998
Function name: get_labels -- Similarity score: 0.4565504193305969
Function name: get_entities -- Similarity score: 0.4362731873989105
Function name: async_length -- Similarity score: 0.4227059781551361
Function name: get_or_create -- Similarity score: 0.35420358180999756
  • create_knowledge_box, the function that creates a KnowledgeBox
  • get_or_create, the function that, given a KB name, creates it if it does not exist or points to it if it does
  • async_upload
  • upload
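
To compare the two models, we can run the same query against the distilroberta Vectorset and see how its ranking differs; a minimal sketch, reusing the KB, models and query defined above:

query_vectors_distilroberta = model_distilroberta.encode(query)[0]

results_distilroberta = my_kb.search(
    vector=query_vectors_distilroberta,
    vectorset="distilroberta",
    min_score=0.4)

for result in results_distilroberta:
    print("Function name:", re.findall(r"def ([^\(]+)", result.text)[0], end=" -- ")
    print("Similarity score:", result.score)

Functions that rank high in both Vectorsets are ones the models agree on; where the rankings diverge is exactly where storing both sets of embeddings side by side pays off.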

Next steps

From here you can point the same pipeline at your own codebase, add more embedding models as extra Vectorsets, and compare how they behave on the queries you care about.


Want to know more?

If you want to learn more about how we can help you implement this, please use this form or join our community on Discord for technical support.

See you soon!