First, why should you care about NucliaDB?
Well, as a data scientist or NLP person your hard-drive is probably full of datasets and corpora. So, if you have found yourself crashing a notebook trying to load something too big with Pandas, doing way too many shuffles in your shell just to explore your data a bit, or just not really knowing how to perform a better search through your dataset, this is a tool for you.
If this rings a bell, keep reading. If this isn’t you yet, also continue reading, as this is just the tip of the iceberg with what you can do with NucliaDB.
NucliaDB is the perfect database for data scientists. Here’s a summary of what you can do with it:
- Store text, files and vectors, labels and annotations
- Access and modify your resources efficiently
- Annotate your resources
- Perform text searches: given a word or set of words, return the resources in our database that contain them
- Perform semantic searches with vectors: given a set of vectors, return the closest matches in our database. In an NLP use case, this allows us to look for similar sentences without being constrained by exact keywords
- Export your data in a format compatible with most NLP pipelines (HuggingFace datasets, pytorch, etc)
All of this can be done through our ridiculously easy-to-use Python libraries: nucliadb_sdk and nucliadb_dataset. And the best bit is still to come: everything is open source and easy to install and run. Sounds too good to be true? Just take a couple of minutes to install it and play with it.
But, how do I use it?
Do not worry, the idea behind this article is to help you get started with NucliaDB, so we are going to walk step by step through the installation and main functionalities. I’ll be sharing code snippets for everything, but you can also simply download and run this notebook.
By the end of the article we’ll have achieved the following things:
- Install NucliaDB locally
- Create the “box” where we will store our data in our database (Knowledge Box in Nuclia lingo)
- Index data from a HuggingFace dataset, and its vectors
- Inspect our Knowledge Box (KB)
- Perform searches on our KB: filtering by label, keyword search, and semantic search on the provided vectors
And we’ll be able to ask our KB things like:
- What is love?
- What do you think of tutorials?
- What do you think about developers?
- What do you think about life?
To which it will give us the semantic search results that are closest to our query, from which I selected my favorites:
"I believe that the heart does go on"
"The simple ones feel elegant."
"Does anybody ever tell you guys that you're doing a good job? Because they should. You're doing a good job."
"I personally think think life is beautiful but living is hell, kinda like a hurricane can be the most beautiful thing."
Now let’s start our installation 🙂
You have many options to install NucliaDB locally; the fastest and simplest one is using Docker:
docker run -it \
  -e LOG=INFO \
  -p 8080:8080 \
  -p 8060:8060 \
  -p 8040:8040 \
  -v nucliadb-standalone:/data \
  nuclia/nucliadb:latest
If you are not familiar with Docker yet, you can take a look at their docs or at this tutorial.
But you can also install NucliaDB and run it locally:
pip install nucliadb
nucliadb
Time to get pythonic 
Once our NucliaDB is up and running we can start playing with it a bit. First things first, let’s install the libraries to access NucliaDB from the python side:
pip install nucliadb-dataset
pip install nucliadb-sdk
Once they are installed, open your notebook or favorite IDE.
Now, let’s import the modules we’ll need to perform some basic operations with our SDK. Don’t worry if they do not make a lot of sense now, we’ll get into detail in a bit!
from nucliadb_sdk import get_or_create, Label  # names used below; exact import path may vary between nucliadb_sdk versions
I like to triple check everything, so I always start by checking my NucliaDB is up and running:
import requests
response = requests.get("http://localhost:8080")
assert response.ok
Once we are sure everything is running smoothly, we can start with the fun stuff! Let’s create our first Knowledge Box (remember, KB is what we call our data containers in Nuclia):
my_kb = get_or_create("my_new_kb")
See, easy as pie!
Now we need to fill this KB with data. For this we‘ll use the biggest online window into humankind, Reddit. We’ll take the GoEmotions dataset from HuggingFace to create a KB to explore sentences from different subreddits. I chose this dataset because, even though that’s not its main purpose (it’s primarily an emotion detection dataset), the fact that it is labelled by subreddit gives us the option to use it to filter and explore the characteristics of each subreddit, the kind of language used, etc. And also, who doesn’t (at least secretly) just love Reddit?
First we get the data:
from datasets import load_dataset
dataset = load_dataset("go_emotions", "raw")
Since we’ll want to take a look at the vector search, we need to calculate vectors for these sentences. In this case we’ll be using HF’s all-MiniLM-L6-v2:
from sentence_transformers import SentenceTransformer
encoder = SentenceTransformer("all-MiniLM-L6-v2")
Then we’ll upload the whole dataset, including the calculated vectors for each sentence.
for row in dataset["train"]:
    label = row["subreddit"]
    my_kb.upload(
        text=row["text"],
        labels=[f"reddit/{label}"],
        vectors={"all-MiniLM-L6-v2": encoder.encode([row["text"]])[0]},
    )
In this case we are using only three fields:
- Text: where we upload our text resources, in this case our sentences.
- Labels: the labels for each of these text resources, for us, the subreddit for each sentence. Labels belong to a Labelset and are defined like `labelset/label`. This way we can define several sets of labels for each KB, potentially unrelated. For example, in our case we could have one that indicates the subreddit and another for offensive language (see the sketch after this list).
- Vectors: where we store all the different vectors for our resources. We call these groups of vectors Vectorsets, and in this case we only have one, calculated with the “all-MiniLM-L6-v2” model.
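To make the `labelset/label` format concrete, here is a minimal sketch of a single resource tagged under two unrelated labelsets; the “offensive” labelset and its label are hypothetical, purely for illustration:
# Hypothetical example: one resource tagged under two independent labelsets,
# "reddit" (its subreddit) and an imaginary "offensive" moderation labelset
sentence = "You're doing a good job."
my_kb.upload(
    text=sentence,
    labels=["reddit/wholesomememes", "offensive/not_offensive"],
    vectors={"all-MiniLM-L6-v2": encoder.encode([sentence])[0]},
)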
Our dataset is pretty big, so if you are in a hurry you can just do it over a smaller sample of it:
sample = dataset["train"].shuffle().select(range(1000))
for row in sample:
    label = row["subreddit"]
    my_kb.upload(
        text=row["text"],
        labels=[f"reddit/{label}"],
        vectors={"all-MiniLM-L6-v2": encoder.encode([row["text"]])[0]},
    )
Once we have our data in our KB, let’s do some quick checks, like getting all the labels and the sets of vectors uploaded:
labelset_info = my_kb.get_uploaded_labels()
my_labelsets = labelset_info.keys()
my_labels = labelset_info["reddit"].labels.keys()
tagged_resources = labelset_info["reddit"].count
vectorsets_info = my_kb.list_vectorset()
my_vectorsets = vectorsets_info.vectorsets.keys()
vectorset_dimension = vectorsets_info.vectorsets["all-MiniLM-L6-v2"].dimension
If we print these out, the results should look like this:
Labelsets info : {'reddit': LabelSet(count=10000, labels={'loveafterlockup': 48, 'TeenMomOGandTeenMom2': 47, 'self': 45, 'OkCupid': 44, 'holdmycosmo': 44, 'vaxxhappened': 44, 'DetroitPistons': 43, 'CollegeBasketball': 42, 'arrow': 42, 'raimimemes': 42, 'AnimalsBeingBros': 41, 'confessions': 41, 'danganronpa': 41, 'socialanxiety': 41, 'DunderMifflin': 40, 'chicagobulls': 40, 'rant': 40, 'wholesomememes': 40, 'My600lbLife': 39, 'VoteBlue': 39, 'awfuleverything': 39, 'devils': 39, 'unpopularopinion': 39, 'NYGiants': 38, 'PoliticalHumor': 38, 'The_Mueller': 38, 'Tinder': 38, 'TrollXChromosomes': 38, 'Whatcouldgowrong': 38, 'cringe': 38, 'datingoverthirty': 38, 'lewronggeneration': 38, 'tifu': 38, 'AnimalsBeingJerks': 37, 'Jokes': 37, 'NYYankees': 37, 'TheSimpsons': 37, 'TopMindsOfReddit': 37, 'rpdrcringe': 37, 'thatHappened': 37, 'SelfAwarewolves': 36, 'steelers': 36, '90dayfianceuncensored': 35, 'Divorce': 35, 'SuicideWatch': 35, 'relationship_advice': 35, 'Anarcho_Capitalism': 34, 'EdmontonOilers': 34, 'ForeverAlone': 34, 'JordanPeterson': 34})}
Labelset: reddit
Labels: loveafterlockup, TeenMomOGandTeenMom2, self, OkCupid, holdmycosmo, vaxxhappened, DetroitPistons, CollegeBasketball, arrow, raimimemes, AnimalsBeingBros, confessions, danganronpa, socialanxiety, DunderMifflin, chicagobulls, rant, wholesomememes, My600lbLife, VoteBlue, awfuleverything, devils, unpopularopinion, NYGiants, PoliticalHumor, The_Mueller, Tinder, TrollXChromosomes, Whatcouldgowrong, cringe, datingoverthirty, lewronggeneration, tifu, AnimalsBeingJerks, Jokes, NYYankees, TheSimpsons, TopMindsOfReddit, rpdrcringe, thatHappened, SelfAwarewolves, steelers, 90dayfianceuncensored, Divorce, SuicideWatch, relationship_advice, Anarcho_Capitalism, EdmontonOilers, ForeverAlone, JordanPeterson
Tagged resources: 10000
-----------------
Vectorsets info : vectorsets={'all-MiniLM-L6-v2': VectorSet(dimension=384)}
Vectorsets: all-MiniLM-L6-v2
Dimension: 384
Search by label
Now, we could explore our dataset filtering by label. This could be useful for getting results from some specific subreddits, to figure out if we want them in our final dataset or not:
results = my_kb.search(
    filter=[Label(labelset="reddit", label="socialanxiety")]
)
for result in results:
    print(result.text)
As expected, this subreddit is full of talk about mental health:
"I compleatly get it. It makes me not trust people over the littlest things."
"yep, i think you should just try to be as honest as you can about how you feel, how it affects your daily life, etc"
"I've massively improved but it never really gets cured. There are always situations that will give you anxiety."
"It's good to know that I'm not the only one lol"
"I can help."
"Maybe they are just horrible people, and because you aren't like them they don't like you. "
"Obviously. If we were able to make such logical risk/reward assessments it wouldn't be a problem."
"It's f*cking incredible how many people would like you if you would just make eye contact and smile at them."
Search by keyword
Now, if we wanted to get sentences that contain a word or a set of words, we could also do it with our full-text search. In this case we’ll look for a couple of technology-related keywords, just to get a glimpse of how much tech talk there is in our data. To do so, we would perform different searches like this one on our data:
results = my_kb.search(text="developers")
for result in results:
    print(result.text)
I used the code above to do three searches, for developers, code and tech, and got the following results:
"iT cAn'T bE sExISt bEcAuSE iT's a mErItocRaCy. -tech bro a******* justifying gender discrimination and systemic sexism"
"You are terrible developers."
"How does ones SO even go through your ph ? Mine is locked with a pass code and a finger print"
"Because a lot of people on this thread are ignoring the pirate code and I'm trying to explain that rules are there for a reason."
"IS THIS NOT THE GOOGLE???? GOOGLE CHICKEN POT PIE RECIPE. D*** TECHNOLOGY AND MELLINIALS."
We can agree that some of these results are funny, but if I wanted more insight into my data, it would be nice to also have a more powerful search. And this brings us to the main event: our vector search.
Semantic search
All we need to do is calculate the vectors of a sentence or group of words and input them into our search, specifying the Vectorset we are using. This will return the results in our database whose vectors are most similar to the ones we input (we measure this with cosine similarity). This way, we can perform much more thorough explorations of our data while also getting insight into how a specific model performs in terms of similarity for our use case.
In this case, since our previous tech-related search did not give us a lot of interesting results, let’s do it with vectors. We’ll use the sentence “Technology, developers and coding”:
query_vectors = encoder.encode(["Technology, developers and coding"])[0]
results = my_kb.search(vector=query_vectors, vectorset="all-MiniLM-L6-v2", min_score=0.1)
for result in results:
    print(result.text)
And here you have some of the results we get:
"You are terrible developers."
"Yeah, if companies really wanted to ameliorate the lack of programmers they could hire and train more juniors, but that doesn't benefit them much."
"Likely some more b*tching about easily fixable features being swept under the rug by employees"
"Dead Island 2 is in developent hell still. They just changed devs again."
"Oh no way is it miners, it's blockstream fault they bought every developer that had commit access to bitcoin's repository. End of story."
"I'm beginning to think that ArcSys might be my favorite videogame company. They make such good fighting games."
As expected, they seem more relevant and less tied to specific words or typos. You have some more searches in the notebook, and also the option to upload more than one set of vectors to play around with different models, as sketched below.
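Here is a rough sketch of that last idea. It assumes a second sentence-transformers model (paraphrase-MiniLM-L3-v2, chosen just as an example) and that the vectors dict accepts more than one Vectorset per resource:
# Hypothetical second model, used only to illustrate multiple vectorsets
encoder_2 = SentenceTransformer("paraphrase-MiniLM-L3-v2")
for row in sample:
    my_kb.upload(
        text=row["text"],
        labels=[f"reddit/{row['subreddit']}"],
        vectors={
            "all-MiniLM-L6-v2": encoder.encode([row["text"]])[0],
            "paraphrase-MiniLM-L3-v2": encoder_2.encode([row["text"]])[0],
        },
    )
# Query the second vectorset instead of the first one
results = my_kb.search(
    vector=encoder_2.encode(["Technology, developers and coding"])[0],
    vectorset="paraphrase-MiniLM-L3-v2",
    min_score=0.1,
)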
Now what?
Once you are done exploring your data, filtering, and maybe even re-tagging it, you can easily export your resources to an arrow file that can be subsequently loaded as a HuggingFace or Tensorflow Dataset. This can then be used to train downstream tasks such as a classifier. We’ll get into more details on how to export and label in following articles, in the meantime you can always check the docs!
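As a small teaser, once you have such an arrow file, loading it as a HuggingFace dataset is a one-liner (the export path below is hypothetical; the export itself is done with nucliadb_dataset, see the docs):
from datasets import Dataset
# Hypothetical path to an arrow file previously exported with nucliadb_dataset
hf_dataset = Dataset.from_file("my_kb_export/train.arrow")
print(hf_dataset)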
Also, stay tuned for the next post, in which we’ll show you how to perform semantic search over code with NucliaDB 🙂