Hi! This is Ramon, CTO and co-founder at Nuclia. Together with Eudald, my co-founder and companion in battles for many years, we’ve been building something for the last two years. Our vision is to deliver an engine that lets engineers and builders search over any domain-specific set of data, focusing on unstructured data like video, text, PDFs, links, conversations, layouts and many other sources.
For many years we fought with general-purpose indexing tools to answer questions on top of a set of files/text/web/etc., but kept failing: general-purpose indexing tools and AI models do not share the same interface, goals, speed or scaling story. That’s why we decided to build an end-to-end solution that delivers this service as an API.
How do we do it?
First, we need to define what it means to answer a question on top of a set of data (Information Retrieval, or IR, for those in the know).
It’s a four step process:
- The first stage is to crunch the data: tokenize, vectorize, pre-process, etc. From this we generate previews to visualize a representation of the data, and extract all possible text from images, frames and audio along with the text of the resource itself. The data is then cleaned and split into paragraphs and sentences.
- Next, we use state-of-the-art multilingual NLP techniques to extract all possible information from the data: summarizing, extracting entities (both generic and custom domain-defined), computing embedded vector representations, labeling with user-defined labels and anonymizing if required. We end up with a normalized model that can represent all of this information.
- Then we store all this information, in our case in our scalable open source database, NucliaDB. Back when we started it, the concept of vector databases did not exist, so we implemented our own HNSW Rust component along with our own knowledge graph Rust component. These two components plus Tantivy (a Lucene-like implementation in Rust) form a NucliaDB Node, an indexing gRPC service written in Rust and designed to store text, paragraphs, vectors and relations and provide a powerful search interface. On top of this indexing engine we layer a Python-based REST API as an interface for CRUD, a Search API and a Dataset API.
- Finally, once the data is in NucliaDB, we generate the models used in what we call “the understanding layer” to generate new data or handle queries to our API to predict intent, classify text, or extract entities and embeddings. Anyone can generate this data from their own domain information by training a custom model or extracting from their own database. (A sketch of the end-to-end flow follows this list.)
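To make the flow concrete, here is a minimal sketch of pushing a resource and letting the pipeline do the rest. It uses Python’s requests; the endpoint path, auth header and payload shape are illustrative assumptions, not the exact Nuclia API.

```python
import requests

# Illustrative only: the endpoint path, auth header and payload shape
# below are assumptions, not the exact Nuclia API. Check the API docs.
NUCLIA_API = "https://nuclia.cloud/api/v1"
KB_ID = "my-knowledgebox"      # hypothetical KnowledgeBox id
API_KEY = "..."                # your Nuclia API key

# Steps 1-2: upload a resource; Nuclia crunches it (previews, text from
# images/frames/audio) and enriches it (summaries, entities, vectors).
resp = requests.post(
    f"{NUCLIA_API}/kb/{KB_ID}/resources",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"title": "Q3 report", "texts": {"body": {"body": "Full text..."}}},
)
resource_id = resp.json()["uuid"]

# Steps 3-4: the normalized output (paragraphs, vectors, entities,
# relations) is indexed in NucliaDB and becomes searchable.
print("indexed as", resource_id)
```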
Cloud & OSS
We want to make this technology available to everybody, so we made a big effort to create an easy-to-use workflow, whether you self-host NucliaDB or use our cloud.
- For our cloud, you just need to go to nuclia.cloud and sign up, and you will get a KnowledgeBox (our concept of a container of data) where you can upload any kind of data via the Nuclia Desktop application, the Nuclia Dashboard, our Nuclia SDKs (TypeScript or Python) or our HTTP REST API.
- For self-hosting, you can download NucliaDB from pip or Docker. To search the unstructured data (files, links, conversations, layouts, ...) you need a Nuclia API key from nuclia.cloud. We analyze your data and return it normalized without storing anything on our servers. Once you have the service running, you can upload data using any of the same methods as for the cloud (see the quickstart sketch after this list). NucliaDB OSS can be installed on your laptop or deployed fully distributed on a Kubernetes cluster with our Helm packages, thanks to JetStream streaming and TiKV.
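As a quickstart, here is a minimal self-hosted sketch. The helper names follow the nucliadb-sdk Python package at the time of writing; treat them as assumptions and check the package docs for your version.

```python
# Prerequisites: `pip install nucliadb nucliadb-sdk` and a running NucliaDB
# (`nucliadb` locally, or the Docker image). The helper names below are
# assumptions based on the nucliadb-sdk package; verify for your version.
from nucliadb_sdk import create_knowledge_box

kb = create_knowledge_box("my-kb")  # a local KnowledgeBox

# Upload a small text resource into the local index.
kb.upload(text="NucliaDB stores text, paragraphs, vectors and relations.")
```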
Once the data has been processed, you can search via our custom widget generator, the Nuclia SDKs or the HTTP REST API. To do semantic search or relation search you will need a Nuclia API key, so that the Predict API can generate the embeddings at query time. The Nuclia Predict API is designed to be fast (5 to 10 ms of computation time) to add as little overhead as possible to the search query, with no need to deploy GPUs or a dedicated prediction architecture.
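For example, a search against a local NucliaDB might look like the sketch below. The port, path and response fields follow NucliaDB’s REST conventions as we understand them; treat the exact names as assumptions.

```python
import requests

# Query a local NucliaDB over HTTP. The port, path and response fields
# are assumptions; check your NucliaDB version's API docs.
resp = requests.get(
    "http://localhost:8080/api/v1/kb/<kbid>/search",
    params={"query": "what do we store?"},
)
for p in resp.json().get("paragraphs", {}).get("results", []):
    print(p["score"], p["text"])
```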
If you want to generate new models, you can play with our Nuclia Learning API:
- On nuclia.cloud you can trigger a training of intent detection, text classification or entity recognition with your own annotated data (via the Nuclia Dashboard or the HTTP REST API).
- If you’re self-hosting, you can create a training of intent detection, text classification or entity recognition on your cloud account, and then use the nucliadb_dataset Python package to export annotated data (via the API) and push it to train the new model.
Once the model is computed, you can predict using the cloud Predict API, or download it in TensorFlow.js format (only intent detection for now) and use it with the TypeScript SDK.
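In HTTP terms, kicking off a training might look like the sketch below. The /train route and payload are placeholders we invented for illustration, not a documented Nuclia endpoint.

```python
import requests

# Hypothetical illustration only: this /train route and payload are
# placeholders, not a documented Nuclia endpoint.
resp = requests.post(
    "https://nuclia.cloud/api/v1/kb/<kbid>/train",
    headers={"Authorization": "Bearer <api-key>"},
    json={"task": "text-classification"},
)
print(resp.status_code)
```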
Create your own NLP model
With the nucliadb_dataset Python package, you can create your own stream of data for any of the supported tasks and produce a partitioned set of Apache Arrow files that can be used with any of the common NLP dataset libraries to train, evaluate and compute new models.
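Because the export is plain Arrow, downstream tooling is straightforward. Here is a minimal sketch of loading one partition with pyarrow; the file name is an assumption, and if your partitions are in Arrow’s random-access file format rather than the streaming format, use pa.ipc.open_file instead.

```python
import pyarrow as pa

# Load one exported partition (the file name is an assumption).
with open("partition-0.arrow", "rb") as f:
    table = pa.ipc.open_stream(f).read_all()

print(table.column_names)   # e.g. text and label columns for classification
df = table.to_pandas()      # hand off to your favorite NLP/dataset library
```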
Roadmap
Ranking is our main goal for the next few weeks. You can already use classical ranking for full-text search (boosting and BM25) and a good generic multilingual semantic search. We know we can do better, so we will add the option to train your own bi-encoder with your domain data/languages and annotated data/queries.
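To illustrate what a bi-encoder does at ranking time, here is a minimal sketch using the open source sentence-transformers library (not Nuclia’s internal model): query and documents are embedded independently, then ranked by cosine similarity. A domain-trained bi-encoder simply replaces the generic model.

```python
from sentence_transformers import SentenceTransformer, util

# Generic multilingual bi-encoder; a domain-trained one would replace it.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

docs = ["invoice for Q3 cloud spend", "team offsite photos", "GPU budget 2023"]
query = "how much did we spend on infrastructure?"

# Embed query and documents independently, then rank by cosine similarity.
scores = util.cos_sim(model.encode(query), model.encode(docs))[0].tolist()
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(f"{score:.3f}  {doc}")
```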
We are working to deliver the best experience in our Cloud and self-hosted OSS environments and to provide tooling for moving from one to the other easily with a dump and a restore of a KnowledgeBox.
Building “ChatGPT” for internal data
We believe Nuclia & NucliaDB are the perfect match to train a model that can act like ChatGPT, but with real, up-to-date data from your own environment. The bi-encoder lets us use your custom data to provide context for our approach to generative AI, so we can provide a well-qualified answer. No bias, proper information.
Our goal is to make it feasible to create a model small enough to run on multiple edge devices, pushing question answering to the client side.
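In outline, the pattern is retrieval-augmented generation. Below is a hedged sketch: kb.search follows the earlier quickstart and generate stands in for any generative model; both are assumptions, not Nuclia’s shipped API.

```python
# Retrieval-augmented answering sketch. `kb` follows the earlier quickstart
# and `generate` stands in for any generative model; both are assumptions.
def answer(kb, question: str, generate) -> str:
    # Retrieve fresh, domain-specific paragraphs from NucliaDB.
    paragraphs = [r.text for r in kb.search(text=question)]
    context = "\n".join(paragraphs[:5])
    # Ground the generative model in retrieved context, not stale memory.
    prompt = (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return generate(prompt)
```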
We need your help!
Integrations via the Nuclia Desktop application (an open source Electron app) are easy to develop, and we invite everybody to help by adding integrations that would be useful to you!
We would like to invite all data scientists to test our nucliadb-sdk and nucliadb-dataset packages to create crazy models, give feedback and test the first NLP-focused open source DB (PyPI / Docker)!
Any use case you have in your mind, please come share on our Discord server!
We will be publishing articles explaining how to use all the power of Nuclia & NucliaDB on our Cloud and OSS.