How to use CSV files in vector stores with LangChain
Retrieval-Augmented Generation (RAG) is a technique for improving an LLM’s response by including contextual information from external sources. In other words, it helps a large language model answer a question by supplying relevant facts and context in the prompt.
For the purposes of this tutorial, we will implement RAG using Chroma DB as a vector store with the FDIC Failed Bank List dataset.
LangChain with CSV data in a vector store
A vector store is backed by a vector database, such as Chroma DB, and fetches relevant documents by running similarity searches (typically cosine similarity) over embeddings.
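To make the idea concrete, here is a minimal sketch of a cosine similarity comparison using the sentence-transformers library directly (it is installed in the next step); the two example sentences are purely illustrative:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Embed a query and a candidate document (illustrative sentences)
embeddings = model.encode([
    "Did a bank fail in North Carolina?",
    "Blue Ridge Savings Bank, Inc. failed in Asheville, NC.",
])

# Cosine similarity: closer to 1.0 means more semantically similar
print(util.cos_sim(embeddings[0], embeddings[1]))
```

A vector store automates exactly this comparison across every stored document and returns the closest matches.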
Install the dependencies:
```bash
pip install langchain chromadb sentence-transformers
```
Use the following code:
```python
from langchain.document_loaders import CSVLoader
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.vectorstores import Chroma

# Open-source embedding model from Hugging Face
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# Load each row of the CSV as a separate document
loader = CSVLoader("./banklist.csv", encoding="windows-1252")
documents = loader.load()

# Build the Chroma vector store from the documents and embeddings
db = Chroma.from_documents(documents, embedding_function)

query = "Did a bank fail in North Carolina?"
docs = db.similarity_search(query)
print(docs[0].page_content)
```
Steps:
- Use `SentenceTransformerEmbeddings` to create an embedding function using the open-source `all-MiniLM-L6-v2` model from Hugging Face.
- Instantiate the loader for the `banklist.csv` file. I had to use `windows-1252` for the encoding of `banklist.csv`.
- Load the files.
- Instantiate a Chroma DB instance from the documents and the embedding model.
- Perform a cosine similarity search.
- Print out the contents of the first retrieved document.
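If you want to inspect how close each match actually is, the Chroma wrapper also exposes a scored variant of the search. A minimal sketch, reusing the `db` instance from the code above (note that Chroma returns a distance, so lower values mean closer matches):

```python
# Reuses `db` from the example above; k limits the number of results
results = db.similarity_search_with_score("Did a bank fail in North Carolina?", k=3)

for doc, score in results:
    # score is a distance: lower means more similar
    print(f"{score:.4f}  {doc.page_content[:80]!r}")
```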
LangChain Expression Language with Chroma DB CSV (RAG)
After exploring how to use CSV files in a vector store, let’s now look at a more advanced application: integrating a Chroma DB vector store built from CSV data into a chain.
This section will demonstrate how to enhance the capabilities of our language model by incorporating RAG.
For the following code, I opted for OpenAI’s chat model and embeddings.
Install the dependencies:
```bash
pip install langchain chromadb openai tiktoken
```
Use the following code:
```python
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import CSVLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

embedding_function = OpenAIEmbeddings()

# Load each row of the CSV as a separate document
loader = CSVLoader("./banklist.csv", encoding="windows-1252")
documents = loader.load()

# Build the vector store and expose it as a retriever
db = Chroma.from_documents(documents, embedding_function)
retriever = db.as_retriever()

template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

model = ChatOpenAI()

# The retriever fills {context}; the question passes through unchanged
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

print(chain.invoke("What bank failed in North Carolina?"))
```
Borrowing from the prior example, we:
- Created a prompt template with `context` and `question` variables
- Created a chain using the `ChatOpenAI` model with a retriever (a small refinement is sketched after this list)
- Invoked the chain with the question `What bank failed in North Carolina?`
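One refinement worth mentioning: as written, `retriever` hands the prompt a raw list of `Document` objects, which gets stringified along with their metadata. A minimal sketch, assuming the same `db`, `prompt`, and `model` as above, that uses `RunnableLambda` to join just the page contents into a clean context string and caps the number of retrieved rows:

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda, RunnablePassthrough

def format_docs(docs):
    # Keep only the text of each retrieved row, separated by blank lines
    return "\n\n".join(doc.page_content for doc in docs)

# Limit the retriever to the 3 closest rows
retriever = db.as_retriever(search_kwargs={"k": 3})

chain = (
    {"context": retriever | RunnableLambda(format_docs), "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

print(chain.invoke("What bank failed in North Carolina?"))
```

Keeping the context to plain page content tends to make the prompt shorter and easier for the model to follow.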
Output:

```text
The bank that failed in North Carolina is Blue Ridge Savings Bank, Inc.
```
This tutorial only covers the basic functionality of Chroma DB. For a more in-depth tutorial, please visit my Chroma DB guide, where I walk step by step through how to use it.