How to use JSON files in vector stores with Langchain
Retrieval-Augmented Generation (RAG) is a technique for including contextual information from external sources in a large language model’s (LLM) prompt. In other words, RAG supplements an LLM’s answer by providing more information in the prompt, which removes the need to retrain or fine-tune the model.
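Conceptually, that just means the final prompt combines retrieved context with the user’s question. A minimal illustration (the context string is hard-coded here purely for the example; in the rest of this post it will come from a vector store):

# Hard-coded stand-in for context that would normally be retrieved from a vector store
context = "Albert Einstein was awarded the Nobel Prize in Physics in 1921."
question = "What year did Albert Einstein win the Nobel Prize?"

# The model is asked to answer from the supplied context rather than from memory
prompt = f"""Answer the question based only on the following context:
{context}

Question: {question}
"""
print(prompt)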
For the purposes of this post, we will implement RAG by using Chroma DB as a vector store with the Nobel Prize data set.
Langchain with JSON data in a vector store
Chroma DB will be the vector storage system for this post. It’s easy to use, open-source, and provides additional filtering options for associated metadata.
Getting started
To begin, install langchain, langchain-community, chromadb and jq.
pip install langchain langchain-community chromadb jq
jq is required for the JSONLoader class; its purpose is to parse the JSON file and its contents.
Fetching relevant documents using document loaders
For reference, the prize.json file has the following schema:
{
  "prizes": [
    {
      "year": "string",
      "category": "string",
      "laureates": [
        {
          "id": "string",
          "firstname": "string",
          "surname": "string",
          "motivation": "string",
          "share": "string"
        }
      ]
    }
  ]
}
with laureates being optional, since some years do not have any winners.
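If you want to confirm that, a quick check with the standard json module (assuming prize.json sits next to your script, as in the loader examples below) counts the entries without winners:

import json

# Count prize entries that have no "laureates" array (years without winners)
with open("./prize.json") as f:
    prizes = json.load(f)["prizes"]

without_laureates = [p for p in prizes if "laureates" not in p]
print(f"{len(without_laureates)} of {len(prizes)} prize entries have no laureates")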
The following code simply fetches relevant documents from Chroma.
from langchain_community.document_loaders import JSONLoader
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_community.vectorstores import Chroma

embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

loader = JSONLoader(file_path="./prize.json", jq_schema=".prizes[]", text_content=False)
documents = loader.load()

db = Chroma.from_documents(documents, embedding_function)
query = "What year did albert einstein win the nobel prize?"
docs = db.similarity_search(query)
print(docs[0].page_content)
Steps:
- Use SentenceTransformerEmbeddings to create an embedding function using the open source all-MiniLM-L6-v2 model from Hugging Face.
- Instantiate the loader for the JSON file using the ./prize.json path.
- Load the files.
- Instantiate a Chroma DB instance from the documents and the embedding model.
- Perform a similarity search.
- Print out the contents of the first retrieved document.
Output:
1{"year": "1921", "category": "physics", "laureates": [{"id": "26", "firstname": "Albert", "surname": "Einstein", "motivation": "\"for his services to Theoretical Physics, and especially for his discovery of the law of the photoelectric effect\"", "share": "1"}]}
The text_content=False flag allows each entry of the prizes array to be stored as a document and vectorized as raw JSON. This works for this use case, but you may want something different.
For example, let’s vectorize the contents of the motivation text attribute and add metadata about each laureate:
from langchain_community.document_loaders import JSONLoader
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_community.vectorstores import Chroma


def metadata_func(record: dict, metadata: dict) -> dict:
    metadata["id"] = record.get("id")
    metadata["firstname"] = record.get("firstname")
    metadata["surname"] = record.get("surname")
    metadata["share"] = record.get("share")
    return metadata


embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

loader = JSONLoader(
    file_path="./prize.json",
    jq_schema=".prizes[].laureates[]?",
    content_key="motivation",
    metadata_func=metadata_func,
)
documents = loader.load()
print(documents)
Steps:
- Use SentenceTransformerEmbeddings to create an embedding function using the open source all-MiniLM-L6-v2 model from Hugging Face.
- Instantiate the loader for the JSON file using the ./prize.json path.
- Use the ? jq operator so entries without a laureates array are skipped instead of raising an error.
- Use a metadata_func to grab the fields of the JSON to put in the document's metadata.
- Use the content_key to specify which field is used for the vector text.
- Load the files.
- Print out the loaded documents.
Output:
[
    Document(
        page_content='"for the discovery and synthesis of quantum dots"',
        metadata={
            'source': 'path/to/prize.json',
            'seq_num': 1,
            'id': '1029',
            'firstname': 'Moungi',
            'surname': 'Bawendi',
            'share': '3'
        }
    ),
    ...
]
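Because each laureate's fields now live in the document metadata, you can combine the similarity search with Chroma's metadata filtering, which was one of the reasons for choosing Chroma in the first place. A minimal sketch reusing the documents and embedding_function from above; the query string and the filter value are only illustrative:

db = Chroma.from_documents(documents, embedding_function)

# Only documents whose metadata matches the filter are considered
docs = db.similarity_search(
    "services to theoretical physics",
    filter={"surname": "Einstein"},
)
print(docs[0].page_content)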
Note: The metadata_func is only required if you want to add your own arbitrary metadata. If that's not a concern, you can omit it and rely on the jq_schema alone.
from langchain_community.document_loaders import JSONLoader
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_community.vectorstores import Chroma

embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

loader = JSONLoader(
    file_path="./prize.json",
    jq_schema=".prizes[].laureates[]?.motivation",
)
documents = loader.load()
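With this schema, each document's page_content is just the motivation string, and the metadata only carries the default source and seq_num fields that JSONLoader adds on its own. You can confirm this by printing the first document:

# Only the motivation text plus the loader's default metadata remain
print(documents[0].page_content)
print(documents[0].metadata)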
Langchain Expression Language with Chroma DB JSON (RAG)
After exploring how to use JSON files in a vector store, let’s integrate Chroma DB using JSON data in a chain.
For the purposes of this code, I used an OpenAI chat model and OpenAI embeddings.
Install the following dependencies:
pip install langchain langchain-community langchain-openai chromadb jq
Then run the following code:
from langchain.prompts import ChatPromptTemplate

from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import JSONLoader

from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

from langchain_openai import ChatOpenAI, OpenAIEmbeddings

embedding_function = OpenAIEmbeddings()

loader = JSONLoader(file_path="./prize.json", jq_schema=".prizes[]", text_content=False)
documents = loader.load()

db = Chroma.from_documents(documents, embedding_function)
retriever = db.as_retriever()

template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

model = ChatOpenAI()

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

query = "What year did albert einstein win the nobel prize?"
print(chain.invoke(query))
Steps:
- Use OpenAIEmbeddings to create an embedding function.
- Load the JSON file.
- Instantiate the Chroma DB instance from the documents and the embedding model.
- Create a prompt template with context and question variables.
- Create a chain using the ChatOpenAI model with a retriever.
- Invoke the chain with the question "What year did albert einstein win the nobel prize?"
Output:
Albert Einstein won the Nobel Prize in the year 1921.
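By default, the list of retrieved Document objects is converted to a string and inserted into the context variable as-is, including the Document(...) wrappers. If you prefer to control that formatting, one common pattern is to pipe the retriever through a small formatting function. Here is a sketch that reuses the retriever, prompt, model and query defined above:

from langchain_core.runnables import RunnableLambda, RunnablePassthrough


def format_docs(docs):
    # Keep only the page_content of each retrieved document, separated by blank lines
    return "\n\n".join(doc.page_content for doc in docs)


chain = (
    {"context": retriever | RunnableLambda(format_docs), "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)
print(chain.invoke(query))

The answer should be the same; the only difference is that the prompt now contains clean text rather than the raw document representations.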
This tutorial only covers the basic functionality of Chroma DB. For a more in-depth tutorial, please visit my Chroma DB guide, where I walk through how to use it step by step.