
The vector database for data scientists working with HuggingFace

Why a DB for data scientists and NLP professionals?
As a data scientist or NLP practitioner, your hard drive is probably full of datasets and corpora. If you have ever crashed a notebook trying to load something too big with Pandas, run far too many commands in your shell just to explore your data a bit, or simply not known how to search your dataset more effectively, this is the tool for you.
What you can do with NucliaDB
Compare the vectors from different models in a super easy way.
Store text, files, vectors, labels and annotations.
Access and modify your resources efficiently.
Annotate your resources.
Perform text searches: given a word or set of words, return the resources in the database that contain them.
Perform semantic searches with vectors: given a set of vectors, return the closest matches in the database. In an NLP use case, this lets you look for similar sentences without being constrained by exact keywords.
Export your data in a format compatible with most NLP pipelines (HuggingFace datasets, PyTorch, etc.).
Get started!
Getting started with NucliaDB is easy. You can install it locally using Docker or pip. Once it is up and running, install the nucliadb-dataset and nucliadb-sdk libraries to start using it.
1. Install NucliaDB and run it locally
pip install nucliadb
nucliadb
2. Create your first KnowledgeBox
A KnowledgeBox is a data container in NucliaDB. You can create one with just a few lines of code and start filling it with data.
from nucliadb_sdk.utils import create_knowledge_box
my_kb = create_knowledge_box("my_new_kb")
3. Upload data
from nucliadb_sdk.knowledgebox import KnowledgeBox
# File and Entity are used below; importing them from the package root is assumed here
from nucliadb_sdk import Entity, File
from sentence_transformers import SentenceTransformer
encoder = SentenceTransformer("all-MiniLM-L6-v2")
resource_id = my_kb.upload(
    key="mykey1",
    binary=File(data=b"asd", filename="data"),
    text="I'm Sierra, a very happy dog",
    labels=["emotion/positive"],
    entities=[Entity(type="NAME", value="Sierra", positions=[(4, 9)])],
    vectors={"all-MiniLM-L6-v2": encoder.encode(["I'm Sierra, a very happy dog"])[0]},
)
# The same resource can be fetched by its id or by its key
my_kb[resource_id] == my_kb["mykey1"]
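The positions=[(4, 9)] pair in the upload above marks where "Sierra" sits in the text. A minimal sketch of how such offsets could be computed follows. The helper name find_entity_positions is hypothetical and not part of nucliadb_sdk, and it assumes the end offset is inclusive, which is what reproduces the (4, 9) pair from the example.

```python
def find_entity_positions(text: str, value: str) -> list[tuple[int, int]]:
    """Return (start, end) character offsets for every occurrence of value.

    Assumption: the end offset is inclusive, matching the (4, 9) pair
    used for "Sierra" in the upload example.
    """
    if not value:
        return []
    positions = []
    start = text.find(value)
    while start != -1:
        positions.append((start, start + len(value) - 1))
        # Continue searching just after the current match start
        start = text.find(value, start + 1)
    return positions


text = "I'm Sierra, a very happy dog"
print(find_entity_positions(text, "Sierra"))  # [(4, 9)]
```

The resulting list can be passed as the positions argument of an Entity, provided the inclusive-end assumption holds for your NucliaDB version.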
4. Search
4.1. Semantic search
from nucliadb_sdk.knowledgebox import KnowledgeBox
from sentence_transformers import SentenceTransformer
encoder = SentenceTransformer("all-MiniLM-L6-v2")
query_vectors = encoder.encode(["To be in love"])[0]
results = my_kb.search(vector=query_vectors, vectorset="all-MiniLM-L6-v2", min_score=0.25)
Iterate over the results:
for result in results:
    print(f"Text: {result.text}")
    print(f"Labels: {result.labels}")
    print(f"Score: {result.score}")
    print(f"Key: {result.key}")
    print(f"Score Type: {result.score_type}")
    print("------")
The results:
Text: love is tough
Labels: ['negative']
Score: 0.4688602387905121
Key: a027ee34f3a7489d9a264b9f3d08d3a5
Score Type: COSINE
------
Text: he is heartbroken
Labels: ['negative']
Score: 0.27540814876556396
Key: 25bc7b22b4fb4f64848a1b7394fb69b1
Score Type: COSINE
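To make the COSINE score type and the min_score threshold more concrete, here is a minimal brute-force sketch in plain Python. It is not how NucliaDB's vector index works internally. The function names and the toy 3-dimensional vectors are made up for illustration, and real sentence embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_search(query, corpus, min_score=0.25):
    """Score every stored vector against the query and keep those
    with a score of at least min_score, best match first.

    corpus maps a resource key to its vector.
    """
    scored = [(key, cosine_similarity(query, vec)) for key, vec in corpus.items()]
    hits = [(key, score) for key, score in scored if score >= min_score]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy vectors, invented for this example (not real embeddings)
corpus = {
    "a": [1.0, 0.0, 0.0],
    "b": [1.0, 1.0, 0.0],
    "c": [0.0, 0.0, 1.0],
}
for key, score in semantic_search([1.0, 0.0, 0.0], corpus):
    print(key, round(score, 3))
```

Resource "c" is orthogonal to the query, scores 0 and is dropped by the threshold, in the same way min_score=0.25 trims weak matches in the search above.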
4.2. Full text search
from nucliadb_sdk.knowledgebox import KnowledgeBox
results = my_kb.search(
    text="dog"
)
Iterate over the results:
for result in results:
    print(f"Resource key: {result.key}")
    print(f"Text: {result.text}")
    print(f"Score type: {result.score_type}")
    print(f"Score: {result.score}")
    print(f"Labels: {result.labels}")
The results:
Resource key: 4f1f570398c543e0b8c3b86e87ee2fbd
Text: Dog in catalan is gos
Score type: BM25
Score: 0.8871671557426453
Labels: ['neutral']
Resource key: 665e85f0fb2e4b2fbde8b4957b7462c1
Text: I'm Sierra, a very happy dog
Score type: BM25
Score: 0.7739118337631226
Labels: ['positive']
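The BM25 score type reported above refers to a standard keyword ranking function. The sketch below is a textbook Okapi BM25 in plain Python, using the non-negative IDF variant. It is an illustration only: the tokenizer, the parameters k1 and b, and the function names are assumptions, so it will not reproduce the exact scores NucliaDB returns.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on anything that is not a letter, digit or apostrophe."""
    return re.findall(r"[a-z0-9']+", text.lower())


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document for the given query."""
    if not docs:
        return []
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Number of documents that contain each term at least once
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    query_terms = tokenize(query)
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Longer-than-average documents are penalized through b
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


docs = [
    "Dog in catalan is gos",
    "I'm Sierra, a very happy dog",
    "love is tough",
]
for doc, score in zip(docs, bm25_scores("dog", docs)):
    print(f"{score:.3f}  {doc}")
```

With these toy documents, the shorter sentence containing "dog" scores higher than the longer one because of length normalization, and the sentence without the word scores 0.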
4.3. Search by label
results = my_kb.search(
    filter=["emotion/positive"]
)
Iterate over the results:
for result in results:
    print(f"Resource key: {result.key}")
    print(f"Text: {result.text}")
    print(f"Labels: {result.labels}")
The results:
Resource key: f1de1c1e3fac43aaa53dcdc54ffd07fc
Text: I'm Sierra, a very happy dog
Labels: ['positive']
Resource key: b445359d434b47dfb6a37ca45c14c2b3
Text: what a delighful day
Labels: ['positive']
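The filter above uses the "labelset/label" form ("emotion/positive"), while the results list only the label part ("positive"). The following sketch illustrates that naming scheme and a simple in-memory label filter. Both helpers are hypothetical, the "default" labelset name is an assumption, and requiring every filter to match (AND semantics) is also an assumption rather than documented NucliaDB behavior.

```python
def split_label(full_label):
    """Split "labelset/label" into its two parts.

    Assumption: a label without a slash belongs to a labelset
    called "default".
    """
    labelset, sep, label = full_label.partition("/")
    if not sep:
        return "default", full_label
    return labelset, label


def filter_by_labels(resources, filters):
    """Return the keys of resources carrying every label in filters.

    resources maps a resource key to a list of "labelset/label" strings.
    """
    return [
        key
        for key, labels in resources.items()
        if all(wanted in labels for wanted in filters)
    ]


# Toy data reusing texts from this page; the keys are invented
resources = {
    "r1": ["emotion/positive"],
    "r2": ["emotion/negative"],
    "r3": ["emotion/positive"],
}
print(filter_by_labels(resources, ["emotion/positive"]))  # ['r1', 'r3']
print(split_label("emotion/positive"))  # ('emotion', 'positive')
```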
Main features

It’s a cloud-native database
Install NucliaDB on multiple cloud storage providers, such as
Amazon S3, Google Cloud Storage, Azure File Storage, or
Alibaba file cloud storage.

Ultra-high read performance
NucliaDB offers ultra-high read performance to serve queries at scale.

Multimodel indexing
One database, multiple indexes.
- Vector indexing
- Paragraph indexing
- Full text indexing
- Relation indexing

Open Source
NucliaDB is an open source project open to external developers.