An approach to index and make any data findable.
In your email, when you look for information in your favorite search engine, when you are looking for a video, when you want to find a document in a file repository — you probably do hundreds of searches every day.
In this article we share our approach to building a next-generation search engine for any data, while making sure that data is findable. We’ll cover the different pieces involved, the reasoning behind them, and the architecture.
Would you like to know how to index and make unstructured data searchable? You’ve come to the right place!
First, some context…
Technologies based on Transformers and Language Models have completely changed the capability of computers to understand text and binary data.
Thanks to those technologies, and to pre-trained language models trained with large amounts of data and computational power, we can create extremely accurate models that classify data, generate text, create vector representations, translate between languages, and much more. On top of them, we can also run fine-tuning, adaptation and one-shot training processes that provide very good results on domain-specific data.
All this technology is amazing, but we are still just seeing the tip of the iceberg. Multi-task models or question & answer models, for example, will become small, domain-specific computational models able to process and predict information across multiple domains.
To make these amazing (but extremely complicated) technologies available and easy to use, we need multiple complementary elements:
- A data normalization process for any data (structured and unstructured) capable of creating output usable to train and search (and find data).
- A scalable database to host and store any kind of information, designed to support training and searchability not only using keywords but also vectors and relations.
- A way to annotate, train and fine-tune your own models to improve on multiple tasks and with domain-specific information.
- A way to run predictions with all these models and, at the same time, search information based on those predictions.
This is the first, and mandatory, step.
Basically, it means the process of turning data into something usable for searching and understanding. This is what the Nuclia Understanding API does. This API runs a process to “understand” files, links, text, conversations, etc. It turns them into something smashed and crafted (we could compare it to something like Tika, only with superpowers) ready to be used by the other components we’ll describe later in this post.
The output of this process is a text extraction with its paragraphs, predicted classifications, entities, summaries, previews and vectors representing all the valuable sentences.
This API is designed to be used standalone or used with NucliaDB (Managed or OSS OnPrem).
To use it standalone, you just need an API key and a webhook; alternatively, you can pull the result via our API. NUA API Documentation.
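To make the shape of that output more concrete, here is a minimal sketch of a container for it. This is not the actual NUA response schema: the class names, field names and the naive paragraph splitter below are all hypothetical, chosen only to mirror the pieces listed above (text extraction, paragraphs, classifications, entities, summary, preview and sentence vectors).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Paragraph:
    text: str
    # One vector per sentence in the paragraph (hypothetical layout).
    sentence_vectors: List[List[float]] = field(default_factory=list)


@dataclass
class UnderstoodResource:
    """Hypothetical container for the output of the understanding step."""

    extracted_text: str
    paragraphs: List[Paragraph] = field(default_factory=list)
    classifications: List[str] = field(default_factory=list)
    # Maps entity text to entity type, e.g. {"Nuclia": "ORG"}.
    entities: Dict[str, str] = field(default_factory=dict)
    summary: str = ""
    preview_uri: str = ""


def split_paragraphs(text: str) -> List[Paragraph]:
    """Naive split on blank lines; a stand-in for real text extraction."""
    return [Paragraph(text=chunk.strip()) for chunk in text.split("\n\n") if chunk.strip()]


doc_text = "Nuclia indexes data.\n\nIt makes it searchable."
resource = UnderstoodResource(
    extracted_text=doc_text,
    paragraphs=split_paragraphs(doc_text),
)
```

Whatever the real schema looks like, the useful property is the same: every downstream component works from one normalized structure instead of from raw files.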
Once we have all the data output from our “understanding” process, the next step is storing all the information somewhere providing searchability.
At Nuclia, we evaluated multiple databases and indexing systems, and we were unable to find a single one that could fulfill all our requirements: scalability, semantic and keyword searchability, an API designed for training and preview support, paragraphs and CRUD on top of a granular model. That’s the reason why we decided to build our own database: NucliaDB.
What exactly is NucliaDB?
Storing large amounts of information in multiple formats (vectors, text, entities, classifications, previews, conversations, files, etc.) requires an object mapping able to represent all this information. To make this possible we decided to split data in a hierarchical format:
We designed NucliaDB to be run on a local machine without any external dependencies, but also to be scalable with multiple pods where read/write transactions are distributed.
To build it, we split data into two layers: a key value storage and a blob storage. Both of them are abstracted so we can implement a driver for any existing system.
Right now we already support S3/GCS/filesystem for blob storage and Redis/TiKV/filesystem for key value storage, and we are planning to build drivers to support Postgres, AFS, MySQL and more.
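The two-layer split can be pictured as a pair of small interfaces that each backend implements. The sketch below is only an illustration of that idea, not NucliaDB's actual driver API: `KeyValueDriver`, `BlobDriver`, `Storage`, the key layout and the in-memory backends are all hypothetical names invented for this example.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class KeyValueDriver(ABC):
    """Hypothetical interface for the key value layer."""

    @abstractmethod
    def get(self, key: str) -> Optional[bytes]: ...

    @abstractmethod
    def set(self, key: str, value: bytes) -> None: ...


class BlobDriver(ABC):
    """Hypothetical interface for the blob layer."""

    @abstractmethod
    def upload(self, path: str, data: bytes) -> None: ...

    @abstractmethod
    def download(self, path: str) -> bytes: ...


class InMemoryKeyValue(KeyValueDriver):
    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)

    def set(self, key: str, value: bytes) -> None:
        self._data[key] = value


class InMemoryBlob(BlobDriver):
    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def upload(self, path: str, data: bytes) -> None:
        self._blobs[path] = data

    def download(self, path: str) -> bytes:
        return self._blobs[path]


class Storage:
    """Keeps small metadata in the key value layer and large payloads in blobs."""

    def __init__(self, kv: KeyValueDriver, blobs: BlobDriver) -> None:
        self.kv = kv
        self.blobs = blobs

    def save_file(self, resource_id: str, filename: str, content: bytes) -> None:
        path = f"{resource_id}/{filename}"
        self.blobs.upload(path, content)
        # The key value entry only stores a pointer to the blob.
        self.kv.set(f"/resources/{resource_id}/files/{filename}", path.encode())

    def load_file(self, resource_id: str, filename: str) -> bytes:
        pointer = self.kv.get(f"/resources/{resource_id}/files/{filename}")
        if pointer is None:
            raise KeyError(filename)
        return self.blobs.download(pointer.decode())


storage = Storage(InMemoryKeyValue(), InMemoryBlob())
storage.save_file("r1", "notes.txt", b"hello")
```

Because `Storage` only depends on the two interfaces, swapping the in-memory backends for, say, a Redis-backed and an S3-backed driver would not change the calling code.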
To provide scalability, we’ve designed transactional partitions using JetStream, so each resource is sequentially indexed in the same nucliadb_ingest component.
Once the transaction is processed and stored in the object/blob storage, index brains are created and serialized to a second stream that distributes the indexing information to the nucliadb_node component.
Thanks to this write pipeline and the scalability of the underlying layers (the key value and blob storages), storage capacity is effectively limited only by those layers.
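One common way to get this "same resource, same consumer, in order" property is to derive the partition from a stable hash of the resource identifier. The sketch below illustrates that general technique under our own assumptions; the function name `partition_for`, the choice of SHA-256 and the partition count are hypothetical and are not taken from the NucliaDB code base.

```python
import hashlib
from collections import defaultdict
from typing import DefaultDict, List, Tuple


def partition_for(resource_id: str, num_partitions: int) -> int:
    """Map a resource id to a partition with a stable (process-independent) hash."""
    digest = hashlib.sha256(resource_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


# Simulate a stream of write transactions being routed to 3 partitions.
transactions = [("doc-1", "create"), ("doc-2", "create"), ("doc-1", "update")]
queues: DefaultDict[int, List[Tuple[str, str]]] = defaultdict(list)
for resource_id, operation in transactions:
    queues[partition_for(resource_id, 3)].append((resource_id, operation))
```

Every transaction for `doc-1` lands in the same queue, in arrival order, so a single consumer per partition can apply them sequentially without coordinating with the others.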
- Text field: Plain text, HTML, Markdown…
- File field: PDFs, Audios, Videos, Docs, Spreadsheets, Presentations…
- Link field: Any URL with optional headers, cookies…
- Conversation field: Any incremental conversation with attachments and multiple actors.
- Layout field (early stage): Any incremental layout format that supports adding blocks of information
- Date and keyword field: Specific dates and keywords
All these fields are normalized to paragraphs, sentence vectors and previews. We also detect entities, classifications, summaries, etc.
Once all data is stored and persisted, it’s time to… search (and find)!
But… wait! What do we mean when we are talking about “search”?
On a basic level, we mean the way we can make sure that we are delivering the right sentence, paragraph, document, answer (or the closest answer possible) to the user.
To make this happen, NucliaDB provides four indexes:
– BM25 Keyword search thanks to tantivy
– Paragraph fuzzy search thanks to tantivy
– Vector search with our own nucliadb_vectors
– Relations search with our own nucliadb_relations
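To give an intuition for what a vector index answers, here is a deliberately naive, brute-force sketch of sentence-vector search using cosine similarity. It is not how nucliadb_vectors is implemented (that index is written in Rust and is not shown here); the function names and the toy two-dimensional vectors are assumptions made for the example.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def vector_search(
    query: Sequence[float],
    index: Dict[str, List[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every stored sentence vector against the query and keep the best."""
    scored = [(sentence_id, cosine_similarity(query, vec)) for sentence_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy index: sentence id -> embedding (real embeddings have hundreds of dimensions).
index = {"s1": [1.0, 0.0], "s2": [0.0, 1.0], "s3": [0.7, 0.7]}
results = vector_search([1.0, 0.1], index, top_k=2)
```

Here `results` ranks `s1` first and `s3` second, since those vectors point in the direction closest to the query. Keyword indexes such as BM25 answer a different question (which documents share terms with the query), which is why combining both kinds of index is useful.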
Each one of these indexes is implemented in nucliadb_node using Rust and distributed using shards and Scuttlebutt as the cluster management protocol.
Each Knowledge Box splits its indexing information into multiple shards depending on its size, and at the same time replicates the shards on multiple nodes to provide high availability.
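As a rough mental model of sharding plus replication, the sketch below computes how many shards a Knowledge Box would need for a given size and spreads each shard's replicas over distinct nodes in round-robin fashion. The per-shard limit, the function names and the placement policy are all hypothetical; the real system's sizing rules and its cluster membership handling are not modelled here.

```python
from typing import Dict, List


def shards_needed(num_paragraphs: int, max_per_shard: int) -> int:
    """Ceiling division, with at least one shard even for an empty Knowledge Box."""
    return max(1, -(-num_paragraphs // max_per_shard))


def place_replicas(num_shards: int, nodes: List[str], replicas: int) -> Dict[int, List[str]]:
    """Assign each shard to `replicas` distinct nodes, rotating the starting node."""
    if replicas > len(nodes):
        raise ValueError("cannot place more replicas than there are nodes")
    placement: Dict[int, List[str]] = {}
    for shard in range(num_shards):
        placement[shard] = [nodes[(shard + i) % len(nodes)] for i in range(replicas)]
    return placement


num_shards = shards_needed(num_paragraphs=2500, max_per_shard=1000)
placement = place_replicas(num_shards, ["node-1", "node-2", "node-3"], replicas=2)
```

With two replicas per shard on three nodes, losing any single node still leaves one copy of every shard available for reads.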
Besides the horizontal scalability of the sharding technique, each node has three isolated components to provide linear performance: reader, sidecar and writer.
The Reader and the Writer are the components responsible for querying and serializing indexed information, respectively. The sidecar is responsible for receiving the indexed information from the transactions on a commit log stream to make sure transactionality is kept.
Of course, NucliaDB can also run as a standalone process, where everything runs as a monolith inside the same API process using PyO3.
Once data is normalized we let account managers fine-tune similarity, create their own classifications, create their own entities, and their own specific question/answer models.
In order to do so, Nuclia provides a labeling feature together with ways to provide feedback, annotate and supervise already provided predictions that will be used during the training time to make sure it delivers better results.
Besides using the generated models to predict new labels on newly ingested data, users can call the Nuclia API to predict answers to questions, download our TF.js models to run intent detection in the browser, and also request label predictions for a transient sentence.
Open source is in our DNA. We believe that open source brings the opportunity to build something great together with the community. It breaks down barriers to using Nuclia and also encourages us to deliver high quality software.
That’s why NucliaDB and all libraries/js code that connects to our API are open. We welcome anyone who wants to be involved.
Nuclia’s mission is to make the unsearchable findable, using nothing more than an extremely easy-to-use API.
We provide developers and no-coders with different entry levels depending on the use case:
- Nuclia Understanding API
- Nuclia Learning API
- NucliaDB / Nuclia.cloud
Try it out!
You can use our managed service on the cloud with all the features included today with a nice free tier at: https://nuclia.cloud
… or you can use our Open Source Database without our understanding API and without persistence (for testing purposes):
docker run -it nuclia/nucliadb
… or you can use our Open Source Database with a Nuclia Understanding API Key (you can obtain one on https://nuclia.cloud) and run:
docker run -it nuclia/nucliadb -e
Once you have any of these options up and running, go to our Documentation and share any feedback at Support!
In the coming weeks, we will publish recipes about how to use NucliaDB and the Nuclia API with multiple use cases.
Do you have a use case that you would like to talk about? Please feel free to send an email to firstname.lastname@example.org
We hope this project brings you the value we designed it for and allows developers to focus on what brings value to their product instead of providing searchability on their data!
Photo by Lea L on Unsplash