NucliaDB is the open source distributed database engine designed for Nuclia’s AI Search platform to efficiently scale to diverse workloads and datasets.
It is more than your typical vector database. It is a robust, hybrid search database designed for diverse needs around unstructured data storage, indexing, retrieval and search.
It is an important part of Nuclia’s full stack cloud offering for AI Search.
In this post, I will give you an overview of our architecture and how we scale NucliaDB at Nuclia.
Data storage architecture
Because NucliaDB is more than a vector database, it needs to store several different kinds of data while also supporting multiple types of indexes.
In order to accomplish this, our data storage architecture is built around a 3-tier data layer approach.
Tier 3: Blob data
This is where we store files and extracted data. NucliaDB is compatible with both Amazon S3 and Google Cloud Storage (GCS).
Tier 2: Key-Value storage
Our key-value storage layer is the primary store for resources, fields and their metadata.
For this, we utilize TiKV, a highly scalable, fault-tolerant, ACID-compliant key-value database.
Tier 1: Index data
Our sharded index storage layer is our most complex storage layer.
Each shard packages the indexes that power our hybrid search.
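To make the three tiers concrete, here is a minimal sketch of where each piece of a resource could land. It is an illustration only, not NucliaDB's code: the in-memory dictionaries stand in for S3/GCS, TiKV and a shard index, the key layout (`blobs/...`, `resources/...`) is hypothetical, and the toy keyword index is far simpler than the real hybrid indexes.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TieredStore:
    # Tier 3 stand-in: blob storage (S3 or GCS in a real deployment).
    blobs: Dict[str, bytes] = field(default_factory=dict)
    # Tier 2 stand-in: key-value metadata storage (TiKV in a real deployment).
    kv: Dict[str, dict] = field(default_factory=dict)
    # Tier 1 stand-in: a toy keyword index mapping term -> resource ids.
    index: Dict[str, List[str]] = field(default_factory=dict)

    def ingest(self, resource_id: str, filename: str, payload: bytes, text: str) -> None:
        # The raw file goes to blob storage under a hypothetical key layout.
        blob_key = f"blobs/{resource_id}/{filename}"
        self.blobs[blob_key] = payload
        # Resource metadata, including a pointer to the blob, goes to the KV tier.
        self.kv[f"resources/{resource_id}"] = {
            "filename": filename,
            "blob_key": blob_key,
            "size": len(payload),
        }
        # Extracted text is indexed so that searches never touch the blob tier.
        for term in set(text.lower().split()):
            self.index.setdefault(term, []).append(resource_id)

    def search(self, term: str) -> List[dict]:
        # Look up matching ids in the index, then resolve metadata from the KV tier.
        ids = self.index.get(term.lower(), [])
        return [self.kv[f"resources/{rid}"] for rid in ids]


store = TieredStore()
store.ingest("r1", "report.txt", b"Quarterly report", "Quarterly report")
hits = store.search("report")
```

The point of the split is that a search only reads the index and key-value tiers; the blob tier is touched only when the original file is needed.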
Knowledge boxes, Index Data and Sharding
With Nuclia, the top level data organization container is called a “Knowledge Box.”
A Knowledge Box is a collection of files, videos, audio and any other type of unstructured data that you want to be able to perform semantic searches on or get generative answers from.
Each Knowledge Box can then have many “shards” where its index data is managed.
Dynamic sharding
From a data architecture perspective, we’ve built our architecture around scaling vector searches inside a Knowledge Box. Vector indexes can utilize a lot of disk and memory so it is important to have an effective sharding strategy.
Additionally, to scale our database dynamically, we wanted a sharding strategy that doesn’t require you to know the size of your data ahead of time.
For example, with the sharding strategy used by databases like Elasticsearch, you need to choose the number of shards for an index up front. Strategies like this can work fine for time series data, but they are a poor fit when you cannot predict how large a dataset will grow.
Dynamic sharding allows us to add index data shards to a Knowledge Box as it grows. We manage the size of our shards mostly through controlling the number of vectors that are stored in the index. When we hit our threshold, we create a new shard.
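The threshold rule can be sketched in a few lines. This is a simplified illustration under stated assumptions, not our production logic: the `MAX_VECTORS_PER_SHARD` value and the class and method names are hypothetical, and all vectors of a resource are assumed to go to a single shard, so a shard can slightly overshoot the threshold.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical threshold; the real value is not stated in this post.
MAX_VECTORS_PER_SHARD = 1000


@dataclass
class Shard:
    shard_id: int
    vector_count: int = 0


@dataclass
class KnowledgeBox:
    name: str
    shards: List[Shard] = field(default_factory=list)

    def add_resource(self, vector_count: int) -> Shard:
        # Open a new shard when there is none yet or the current one hit the threshold.
        if not self.shards or self.shards[-1].vector_count >= MAX_VECTORS_PER_SHARD:
            self.shards.append(Shard(shard_id=len(self.shards)))
        shard = self.shards[-1]
        # Assumption: a resource's vectors are never split across shards.
        shard.vector_count += vector_count
        return shard


kb = KnowledgeBox("docs")
for _ in range(5):
    kb.add_resource(400)  # five resources with 400 vectors each
```

Nothing about the final size of the Knowledge Box has to be known when it is created; shards simply appear as the data arrives.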
Shard Storage and Index Node Architecture
The core part of our database is our Index Node component. An Index Node manages a disk, and the reads and writes against it, for a set of Shard Replicas. A Shard Replica is an instance of a shard.
Since we want to be fault tolerant and dynamically scale reads, our shards can have many replicas. Each Shard Replica integrates 4 separate indexes into a unified, transportable replica storage object. Our Replicas are then spread across many Index Nodes.
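As a rough picture of what spreading replicas across Index Nodes can look like, the sketch below places each shard's replicas on distinct nodes, preferring the least loaded ones. The placement policy and the `place_replicas` function are assumptions made for illustration; this post does not describe the actual scheduling algorithm.

```python
from typing import Dict, List


def place_replicas(
    shard_ids: List[str], node_ids: List[str], replicas_per_shard: int
) -> Dict[str, List[str]]:
    """Assign each shard's replicas to distinct nodes, least loaded first."""
    if replicas_per_shard > len(node_ids):
        raise ValueError("not enough nodes to keep replicas on distinct nodes")
    load = {node: 0 for node in node_ids}  # replicas currently held per node
    placement: Dict[str, List[str]] = {}
    for shard in shard_ids:
        # Sort by current load, breaking ties by name so the result is deterministic.
        chosen = sorted(node_ids, key=lambda n: (load[n], n))[:replicas_per_shard]
        for node in chosen:
            load[node] += 1
        placement[shard] = chosen
    return placement


placement = place_replicas(
    ["shard-1", "shard-2", "shard-3"],
    ["node-a", "node-b", "node-c"],
    replicas_per_shard=2,
)
```

Keeping the replicas of one shard on different nodes means that losing a single Index Node never takes a shard offline, and reads for that shard can be served by any node holding a replica.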
Cloud-Native + Standalone
We designed NucliaDB to be Cloud-Native and dynamically scalable.
NucliaDB is deployed as a set of services. Search, reads, writes, indexing, processing and ingestion can each be scaled dynamically.
We utilize NATS as our message bus and gRPC for service-to-service communication.
Standalone NucliaDB
NucliaDB Standalone is a packaging of NucliaDB with all batteries included and a single process to run. It allows customers to utilize the power of Nuclia on-premise while still having complete ownership of their own data.
What is unique about NucliaDB is that while it is Cloud-Native, utilizing NATS and gRPC to communicate between services in a robust, scalable cloud system, it can also be run in this simple Standalone mode.
Python + Rust
We are able to deliver a Standalone user experience partly because of how we leverage PyO3, the Rust bindings for Python. By using Python and Rust together, we are able to develop more quickly while keeping the performance-critical parts of our database extremely fast.
PyO3 is an excellent way to glue Python and Rust together, and it has allowed Nuclia to integrate service layers directly into Python. Components that normally communicate over gRPC can instead be integrated in-process through PyO3 bindings.
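One way to picture this is a single Python interface with two interchangeable implementations. The sketch below is purely illustrative: `IndexSearcher`, `InProcessSearcher` and `RemoteSearcher` are hypothetical names, and plain Python callables stand in for both the Rust engine exposed through PyO3 and the gRPC stub.

```python
from typing import Callable, Dict, List, Protocol


class IndexSearcher(Protocol):
    def search(self, shard_id: str, query: str) -> List[str]: ...


class InProcessSearcher:
    """Standalone mode: call the engine directly inside the same process."""

    def __init__(self, engine: Callable[[str, str], List[str]]) -> None:
        # In a real build this callable would be a function exposed via PyO3.
        self._engine = engine

    def search(self, shard_id: str, query: str) -> List[str]:
        return self._engine(shard_id, query)


class RemoteSearcher:
    """Cloud mode: forward the request through a transport (a gRPC stub in practice)."""

    def __init__(self, send: Callable[[Dict[str, str]], Dict[str, List[str]]]) -> None:
        self._send = send

    def search(self, shard_id: str, query: str) -> List[str]:
        response = self._send({"shard_id": shard_id, "query": query})
        return response["results"]


def fake_engine(shard_id: str, query: str) -> List[str]:
    # Stand-in for the Rust search engine.
    return [f"{shard_id}:{query}"]


def fake_transport(request: Dict[str, str]) -> Dict[str, List[str]]:
    # Stand-in for a network round trip to an Index Node.
    return {"results": fake_engine(request["shard_id"], request["query"])}


def run_query(searcher: IndexSearcher, shard_id: str, query: str) -> List[str]:
    # Calling code is identical in both deployment modes.
    return searcher.search(shard_id, query)


local = run_query(InProcessSearcher(fake_engine), "shard-1", "cats")
remote = run_query(RemoteSearcher(fake_transport), "shard-1", "cats")
```

Because the calling code depends only on the interface, the same service logic can run as one process in Standalone mode or as separate services in the cloud.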
Putting it all together
By utilizing a tiered data architecture and Cloud-Native service design, we are able to scale Nuclia’s Cloud service.
Nuclia is more than just a database. It is a full stack solution that combines text, document, video and audio processing, inference, semantic search, generative answers, a REST API, a Desktop App and a UI.