
How to use JSON files in vector stores with Langchain

Thomas Taylor


Retrieval-Augmented Generation (RAG) is a technique for including contextual information from external sources in a large language model’s (LLM) prompt. In other words, RAG supplements an LLM’s answer by providing more information in the prompt, removing the need to retrain or fine-tune the model.

For the purposes of this post, we will implement RAG by using Chroma DB as a vector store with the Nobel Prize data set.

Langchain with JSON data in a vector store

Chroma DB will be the vector storage system for this post. It’s easy to use, open-source, and provides additional filtering options for associated metadata.

Getting started

To begin, install langchain, langchain-community, chromadb, jq, and sentence-transformers (the last is needed for the SentenceTransformerEmbeddings model used below).

pip install langchain langchain-community chromadb jq sentence-transformers

jq is required by the JSONLoader class, which uses it to parse the JSON file and extract its contents.
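
If you want to see what a jq expression selects before wiring it into a loader, you can run jq directly from Python. Here is a minimal sketch, assuming the Nobel Prize prize.json file used below sits in the working directory (the printed keys are illustrative):

import json

import jq

# Load the raw JSON and apply the same jq expression that JSONLoader will use
with open("./prize.json") as f:
    data = json.load(f)

prizes = jq.compile(".prizes[]").input(data).all()
print(len(prizes))        # number of prize entries
print(prizes[0]["year"])  # year of the first entry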

Fetching relevant documents using document loaders

For reference, the prize.json file has the following schema:

{
    "prizes": [
        {
            "year": "string",
            "category": "string",
            "laureates": [
                {
                    "id": "string",
                    "firstname": "string",
                    "surname": "string",
                    "motivation": "string",
                    "share": "string"
                }
            ]
        }
    ]
}

The laureates field is optional since some years do not have any winners.

The following code simply fetches relevant documents from Chroma.

from langchain_community.document_loaders import JSONLoader
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_community.vectorstores import Chroma

embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

loader = JSONLoader(file_path="./prize.json", jq_schema=".prizes[]", text_content=False)
documents = loader.load()

db = Chroma.from_documents(documents, embedding_function)
query = "What year did albert einstein win the nobel prize?"
docs = db.similarity_search(query)
print(docs[0].page_content)

Steps:

  1. Use SentenceTransformerEmbeddings to create an embedding function with the open-source all-MiniLM-L6-v2 model from Hugging Face.
  2. Instantiate the loader for the JSON file using the ./prize.json path.
  3. Load the documents
  4. Instantiate a Chroma DB instance from the documents & the embedding model
  5. Perform a similarity search against the vector store
  6. Print out the contents of the first retrieved document

Output:

1{"year": "1921", "category": "physics", "laureates": [{"id": "26", "firstname": "Albert", "surname": "Einstein", "motivation": "\"for his services to Theoretical Physics, and especially for his discovery of the law of the photoelectric effect\"", "share": "1"}]}

The text_content=False flag tells the loader to keep each entry of the prizes array as a raw JSON string, so the whole object is vectorized. This works for this use case, but you may want something different.

For example, let’s vectorize the contents of the motivation text attribute and add metadata about each motivation:

from langchain_community.document_loaders import JSONLoader
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_community.vectorstores import Chroma


def metadata_func(record: dict, metadata: dict) -> dict:
    metadata["id"] = record.get("id")
    metadata["firstname"] = record.get("firstname")
    metadata["surname"] = record.get("surname")
    metadata["share"] = record.get("share")
    return metadata


embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

loader = JSONLoader(
    file_path="./prize.json",
    jq_schema=".prizes[].laureates[]?",
    content_key="motivation",
    metadata_func=metadata_func,
)
documents = loader.load()
print(documents)

Steps:

  1. Use SentenceTransformerEmbeddings to create an embedding function with the open-source all-MiniLM-L6-v2 model from Hugging Face.
  2. Instantiate the loader for the JSON file using the ./prize.json path.
  3. Use the ? jq syntax to skip entries where laureates does not exist
  4. Use a metadata_func to copy fields from each JSON record into the document’s metadata
  5. Use the content_key to specify which field is used for the vector text
  6. Load the documents
  7. Instantiate a Chroma DB instance from the documents & the embedding model
  8. Print out the loaded documents

Output:

[
    Document(
        page_content='"for the discovery and synthesis of quantum dots"',
        metadata={
            'source': 'path/to/prize.json',
            'seq_num': 1,
            'id': '1029',
            'firstname': 'Moungi',
            'surname': 'Bawendi',
            'share': '3'
        }
    ),
    ...
]
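
Since the laureate fields now live in each document’s metadata, Chroma’s metadata filtering can be combined with the similarity search. Here is a minimal sketch that continues from the loader above; the query and filter values are purely illustrative:

db = Chroma.from_documents(documents, embedding_function)

# Restrict the search to documents whose metadata matches the filter
docs = db.similarity_search(
    "discovery and synthesis of quantum dots",
    filter={"surname": "Bawendi"},
)
print(docs[0].page_content)
print(docs[0].metadata)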

Note: The metadata_func is only required if you want to add your own arbitrary metadata. If that’s not a concern, you can omit it and rely on the jq_schema alone:

from langchain_community.document_loaders import JSONLoader
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_community.vectorstores import Chroma

embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

loader = JSONLoader(
    file_path="./prize.json",
    jq_schema=".prizes[].laureates[]?.motivation",
)
documents = loader.load()

Langchain Expression Language with Chroma DB JSON (RAG)

After exploring how to use JSON files in a vector store, let’s integrate Chroma DB using JSON data in a chain.

For the purposes of this code, I used an OpenAI model and embeddings.

Install the following dependencies:

pip install langchain langchain-community langchain-openai chromadb jq

Then run the following code:

from langchain.prompts import ChatPromptTemplate

from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import JSONLoader

from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda, RunnablePassthrough

from langchain_openai import ChatOpenAI, OpenAIEmbeddings

embedding_function = OpenAIEmbeddings()

loader = JSONLoader(file_path="./prize.json", jq_schema=".prizes[]", text_content=False)
documents = loader.load()

db = Chroma.from_documents(documents, embedding_function)
retriever = db.as_retriever()

template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

model = ChatOpenAI()

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

query = "What year did albert einstein win the nobel prize?"
print(chain.invoke(query))

Steps:

  1. Use OpenAIEmbeddings to create an embedding function
  2. Load the JSON file
  3. Instantiate the Chroma DB instance from the documents & embedding model
  4. Create a prompt template with context and question variables
  5. Create a chain using the ChatOpenAI model with a retriever
  6. Invoke the chain with the question "What year did albert einstein win the nobel prize?"

Output:

Albert Einstein won the Nobel Prize in the year 1921.
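
Note that the retriever passes the retrieved Document objects straight into the {context} variable. If you prefer to inject only the page content, you can join the documents with a small helper before the prompt. A minimal sketch (the format_docs helper is my own illustration, not part of the original chain):

from langchain_core.runnables import RunnableLambda, RunnablePassthrough

def format_docs(docs):
    # Join only the page content of each retrieved document
    return "\n\n".join(doc.page_content for doc in docs)

chain = (
    {"context": retriever | RunnableLambda(format_docs), "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)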

This tutorial only covers the basic functionality of Chroma DB. For a more in-depth tutorial, please visit my Chroma DB guide, where I walk through how to use it step by step.

#chroma-db   #generative-ai   #python  
