First things first: what is Nuclia?
Nuclia is an advanced AI-powered search engine that enables swift and precise access to the information stored within your organization. Nuclia delivers lightning-fast and highly accurate results, making it an essential tool for your internal data searches.
Today, we will highlight two key features of our platform that can significantly improve how you manage your data: fast model training for accurate classification and powerful semantic search. You can train your classification models in a few minutes and get better search results from accurately classified data.
To simplify this process, we will be using Nuclia’s Web App, which provides a user-friendly interface to visualize and label resources, train multiple models and perform searches across a wide variety of data types. Behind the interface that we will see in a moment, we have developed a powerful machine learning pipeline that automates the entire training process, from data storage and preprocessing to vector transformation and model training, so you can run the whole process end to end without any advanced technical knowledge.
Imagine that you have large amounts of unclassified data and you need to move quickly to find what you are looking for, or maybe you just need extra help to make it easier for you to work with. Using Nuclia you can train text classifiers at resource, paragraph and sentence level by labeling just a few examples of each category.
We would like to propose a use case centered around cooking recipes. This theme was selected for several reasons. Firstly, recipes are familiar and relatable to everyone, making it easy to understand and imagine the application of our proposed use case. Besides, recipes are available in multiple languages and formats, providing a diverse range of data sources for analysis.
So, we will use an empty Knowledge Box (KB) to store information and label a few resources to train our machine learning models. We will then upload new documents to see the automatic classification in action, and use semantic search to find information by meaning rather than by exact keywords.
Cool, now let’s jump over to the Nuclia interface where we will take it step-by-step and train our classification model.
Uploading the resources.
So, let’s get started with the basics. The first thing we need to do is to get a dataset that we can label with some simple and clear examples. In our case, we decided to use some delicious recipes that we downloaded from various free domain websites.
The best part about using Nuclia for our labeling needs is that it’s not just limited to text-based data. Nuclia can work with any type of data, in any format, and in any language. So, in addition to our recipe texts, we also have videos and URLs from these free domain websites, and in different languages too.
Once we have our dataset, the first step will be to upload it to Nuclia. For this we will go to the resource list tab.
This is where we will be able to view all our resources once they have been uploaded and processed by Nuclia. Uploading them is as simple as clicking on the upload button at the top of this tab and following the instructions in each of the sections. We have PDFs, videos and URLs, so we can use any of the first three options: Upload files, Upload a folder or Add links. We could also upload resources in CSV format using the last option. For this example we will use the first and third options.
So, we click on Upload files, drag in the files that we want to upload, and that’s all.
For the URLs, we click on Add links and, as we have many, we will use the Multiple links option, which allows us to paste our URLs, one per line, and upload them all at once.
While the documents are being processed, Nuclia will extract all the relevant information for us, from summarizing each document to extracting text from images. In addition, by the end of this guide, Nuclia will also automatically classify the information that we upload to our KB.
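If the links live in a spreadsheet or a script, it can help to clean them up before pasting them into the Multiple links box. The sketch below is only a local helper, not part of Nuclia: the function name `build_links_block` and the example URLs are hypothetical. It trims whitespace, drops blank lines, duplicates and anything that is not an HTTP(S) URL, and joins the rest one per line.

```python
from urllib.parse import urlparse


def build_links_block(urls):
    """Return the valid URLs joined one per line, in their original order.

    Hypothetical helper: it only prepares text to paste into the
    Multiple links box and does not talk to Nuclia at all.
    """
    seen = set()
    lines = []
    for raw in urls:
        url = raw.strip()
        parsed = urlparse(url)
        # Keep only absolute http/https URLs that have a host part.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            continue
        # Skip URLs we have already kept.
        if url in seen:
            continue
        seen.add(url)
        lines.append(url)
    return "\n".join(lines)


# Example URLs are placeholders, not real recipe pages.
recipe_urls = [
    "https://example.com/recipes/carbonara",
    "  https://example.com/recipes/strawberry-cake ",
    "https://example.com/recipes/carbonara",  # duplicate
    "not a url",
    "",
]

links_block = build_links_block(recipe_urls)
print(links_block)
```

The printed text can be copied straight into the Multiple links field.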
Creating the label sets and tagging the documents.
The documents that have been processed correctly will be visible in the list of resources, where we can navigate through them in a simple way.
However, to start training the models, we need to label our data. Nuclia makes it easy to upload and tag information using the user interface. Let’s take it step-by-step.
The first thing we are going to do is create a Label Set. Label Set is the term we use to define a set of labels. For example, a Label Set could be countries, with each label being the name of a country. To create a Label Set we will go to the classification tab.
Here we can see an Add new button that will take us to the Label Set creation section.
Here we can define our Label Set: its name, the color we wish to use to represent it, the classification type (i.e., whether it applies to resources or to paragraphs), whether multiple labels can be assigned to a single resource (in the case of multilabel problems), and the names of each label within the Label Set.
We will call our Label Set “Flavors” and create it as a resource type with the color blue. Label Sets are multilabel by default, so we will select the checkbox that ensures each document is assigned only one label. Next, we will create our two labels, “sweet” and “savory”. Once done, we can save the changes and we are all set.
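To make the settings above concrete, here is a minimal sketch of the same Label Set as a plain Python data structure. It is only a mental model: the class `LabelSet`, its field names and the `check_assignment` method are hypothetical and do not mirror Nuclia's internal schema or API. The one rule it encodes is the checkbox we just selected, which limits each resource to a single label.

```python
from dataclasses import dataclass, field


@dataclass
class LabelSet:
    """Hypothetical local model of a Label Set (not Nuclia's schema)."""
    title: str
    color: str
    kind: str                 # "resource" or "paragraph"
    multiple: bool            # True if a resource may carry several labels
    labels: list = field(default_factory=list)

    def check_assignment(self, assigned):
        """Raise ValueError if the assigned labels break the set's rules."""
        unknown = [label for label in assigned if label not in self.labels]
        if unknown:
            raise ValueError(f"unknown labels: {unknown}")
        if not self.multiple and len(assigned) > 1:
            raise ValueError("this Label Set allows only one label per resource")
        return True


# The "Flavors" Label Set from this guide: resource-level, blue, single-label.
flavors = LabelSet(
    title="Flavors",
    color="blue",
    kind="resource",
    multiple=False,
    labels=["sweet", "savory"],
)

flavors.check_assignment(["sweet"])  # passes: one known label
```

Calling `flavors.check_assignment(["sweet", "savory"])` would raise a `ValueError`, which is the behavior we asked for by making the Label Set single-label.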
Training our own models.
With our Label Set configured and labels created, we are going to go back to the resource list. Nuclia makes it very easy for us to annotate these resources using the Label Set we have created.
We have about 30 files but we will only label three of each type to demonstrate the quality of our classifiers in situations with little annotated data. To label them, we must click on the resource and click on the Add labels dropdown. Here we can select to which category each of our files belongs. Those recipes related to desserts will be labeled as sweet and other meals as savory. Remember that all our learning models can work with data in different languages.
Ok, once we have labeled the resources we are going to train our model. This step is very simple. We just have to go to the Training tab.
Here we can see three types of training: label search intent training, automatic resource label training and automatic paragraphs label training.
The first trains a classifier that works at the phrase level and suggests labels to the user as they type in the search box. The second trains a classifier at the resource level that automatically classifies documents uploaded to Nuclia into the training categories. The last one does the same as the second, but for paragraphs.
In this example, we will use label search intent training and automatic resource label training to train two different models using our data.
To start the training we just have to select the Label set we want to train and click on Start Training. The label search intent training option allows us to train a classifier using different Label Sets, so the suggestions for our search are not restricted to a single Label Set. For this case we will only train one.
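To get an intuition for how a classifier can learn from only three labeled examples per label, here is a deliberately simple toy. It is not Nuclia's training pipeline, which this guide does not describe in detail: it swaps in a nearest-centroid classifier over plain word counts with cosine similarity, and every function name and example text below is made up for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn a text into a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def train(examples):
    """Sum the vectors of each label's examples.

    The sum is the centroid up to a scale factor, and cosine
    similarity ignores scale, so no division is needed.
    """
    centroids = {}
    for text, label in examples:
        centroids.setdefault(label, Counter()).update(vectorize(text))
    return centroids


def classify(centroids, text):
    """Return the label whose centroid is most similar to the text."""
    vec = vectorize(text)
    return max(centroids, key=lambda label: cosine(centroids[label], vec))


# Three made-up examples per label, mirroring the few-shot setup above.
labeled = [
    ("chocolate cake with sugar and cream", "sweet"),
    ("strawberry tart with sugar glaze", "sweet"),
    ("vanilla ice cream with caramel", "sweet"),
    ("roast chicken with garlic and potatoes", "savory"),
    ("spaghetti with bacon egg and cheese", "savory"),
    ("grilled fish with lemon and salt", "savory"),
]

model = train(labeled)
print(classify(model, "strawberry cake with cream"))
print(classify(model, "garlic chicken with potatoes"))
```

A word-count model like this only matches exact words. Working across languages and phrasings, as described in this guide, needs learned vector representations rather than raw counts.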
Checking the results.
With the training phase complete, the next step is to evaluate how the models perform. Our KB is now prepared to automatically classify the resources that we upload and to suggest labels in our searches.
To fully activate the label suggester function we must first go to the Widget tab.
We need to enable the ‘Suggest labels’ option. Also, for label suggestions to work properly, we need to make our KB public.
To make our KB public, we must go to the Home Dashboard tab where we started this guide. Here we will see a button on the right side that allows us to publish our KB.
Now we are ready to start using all our models. We’ll start by uploading three examples to the list of resources and check how the models classify them automatically.
After processing we can see that our documents have been labeled automatically.
- “spaghetti_carbonara” has been classified as savory.
- “strawberry cake” has been classified as sweet.
- “tortilla francesa” has been classified as savory.
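With only three resources we can check the results by eye, but the same spot check scales better as a small script. The sketch below compares the automatic labels listed above with the labels we would have assigned by hand. The hand labels are our own assumption, and the `accuracy` helper is hypothetical, not a Nuclia feature.

```python
def accuracy(predicted, expected):
    """Fraction of expected resources whose predicted label matches."""
    if not expected:
        raise ValueError("expected labels must not be empty")
    hits = sum(
        1 for name, label in expected.items() if predicted.get(name) == label
    )
    return hits / len(expected)


# Labels assigned automatically, as listed in this guide.
automatic = {
    "spaghetti_carbonara": "savory",
    "strawberry cake": "sweet",
    "tortilla francesa": "savory",
}

# Labels we would assign by hand (assumption for this spot check).
by_hand = {
    "spaghetti_carbonara": "savory",
    "strawberry cake": "sweet",
    "tortilla francesa": "savory",
}

score = accuracy(automatic, by_hand)
print(f"{score:.0%} of the spot-check resources match")
```

Three resources are far too few to measure quality; a larger hand-labeled sample would give a more meaningful number.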
Now we’ll switch to the search tab to try the suggester and the semantic search functionality. If we type “I want sugary food” in the search box, we will obtain interesting results related to the sweet label.
On the other hand, if we write “Can you give me a chicken recipe?”, we can see relevant search results related to savory and chicken recipes.
By following these steps, we have uploaded information, labeled it, trained multiple classifiers to enhance our information search, and validated these features.
I hope you enjoyed this little tutorial. Feel free to try Nuclia and all of its many tools. You can find us on GitHub, LinkedIn, YouTube and others. Thank you very much for your attention.
Start using Nuclia today! Sign up, free!