LangChain is a Python framework for building NLP applications powered by language models.
In this post, we will see how LangChain can be used to collect data from various sources and make it searchable using Nuclia.
Install
First, install LangChain and Nuclia:
pip install langchain
pip install nuclia
You also need to create a free Nuclia account on nuclia.cloud.
Once your account is created, go to your Knowledge Box dashboard, open the API keys entry in the left-hand menu, and generate an API key. Make sure to select the Contributor role.
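Rather than pasting the key into every script, you can read it from an environment variable. This is just a common pattern, not something Nuclia requires, and the variable name NUCLIA_API_KEY is my own choice:

```python
import os

# Fall back to a placeholder so the scripts below still run as written;
# replace it or export NUCLIA_API_KEY before indexing anything for real.
API_KEY = os.environ.get("NUCLIA_API_KEY", "YOUR-API-KEY")
```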
Simple Wikipedia AI search
In this example, we will use LangChain to collect data from Wikipedia and make it searchable using Nuclia.
Let’s first install the required dependency.
pip install wikipedia
Then, we can create a simple script that collects information about Hedy Lamarr from Wikipedia using LangChain's Wikipedia document loader (named WikipediaLoader) and pushes the extracted texts to Nuclia.
(If you do not know who Hedy Lamarr is, this is the perfect opportunity to learn about her: she is my favorite inventor, and her life is fascinating!)
from langchain.document_loaders import WikipediaLoader
from langchain.vectorstores.nucliadb import NucliaDB
API_KEY = "YOUR-API-KEY"
documents = WikipediaLoader(query="Hedy Lamarr").load()
ndb = NucliaDB(knowledge_box="YOUR-KNOWLEDGE-BOX-ID", local=False, api_key=API_KEY)
ndb.add_texts([doc.page_content for doc in documents])
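If you run the loader more than once, the same texts would be pushed again. Here is a small stdlib-only helper you could apply before calling add_texts; whether Nuclia deduplicates on its side is not something I have checked:

```python
def dedupe(texts):
    # Keep the first occurrence of each text, preserving order.
    seen = set()
    unique = []
    for text in texts:
        if text not in seen:
            seen.add(text)
            unique.append(text)
    return unique

# Usage with the documents loaded above:
# ndb.add_texts(dedupe([doc.page_content for doc in documents]))
```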
Once the script is executed, you can go to your Knowledge Box dashboard and see the documents that have been added in the Resources section.
It may take a few minutes for the documents to be indexed and searchable.
Once they are indexed, you can use LangChain to search among all the indexed paragraphs.
from langchain.vectorstores.nucliadb import NucliaDB
API_KEY = "YOUR-API-KEY"
ndb = NucliaDB(knowledge_box="YOUR-KNOWLEDGE-BOX-ID", local=False, api_key=API_KEY)
results = ndb.similarity_search("What did Hedy Lamarr invent?", k=10)
for result in results:
    print(result.page_content)
Indexing a website from its sitemap
In this example, we will use LangChain to index a full website in Nuclia.
LangChain has a SitemapLoader that can extract all the URLs from a website's sitemap and load the corresponding pages. It requires lxml to parse the sitemap, so let's install it:
pip install lxml
This script will load all the pages mentioned in the sitemap and push them to Nuclia.
from langchain.document_loaders.sitemap import SitemapLoader
from langchain.vectorstores.nucliadb import NucliaDB
API_KEY = "YOUR-API-KEY"
sitemap_loader = SitemapLoader(web_path="https://nuclia.com/post-sitemap.xml")
documents = sitemap_loader.load()
ndb = NucliaDB(knowledge_box="YOUR-KNOWLEDGE-BOX-ID", local=False, api_key=API_KEY)
ndb.add_texts([doc.page_content for doc in documents])
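A large sitemap can produce hundreds of pages, and pushing them all in a single add_texts call may be slow or run into request limits. Here is a minimal batching sketch; the batch size of 50 is arbitrary, and I have not checked what limits Nuclia actually enforces:

```python
def batched(items, size=50):
    # Yield consecutive slices of at most `size` items.
    for i in range(0, len(items), size):
        yield items[i : i + size]

# Usage with the documents loaded above:
# for batch in batched([doc.page_content for doc in documents]):
#     ndb.add_texts(batch)
```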