
How to use CSV files in vector stores with Langchain

· Thomas Taylor


Retrieval-Augmented Generation (RAG) is a technique for improving an LLM’s responses by including contextual information from external sources. In other words, it helps a large language model answer a question by supplying relevant facts in the prompt.

For the purposes of this tutorial, we will implement RAG using Chroma DB as a vector store with the FDIC Failed Bank List dataset.

Langchain with CSV data in a vector store

A vector store leverages a vector database, like Chroma DB, to fetch relevant documents using cosine similarity searches.
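To make the cosine similarity idea concrete, here is a minimal, dependency-free Python sketch of how two embedding vectors are compared. In practice the embedding model produces the vectors and Chroma DB performs this search at scale:

import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings"; real models like all-MiniLM-L6-v2 output 384 dimensions
query_vec = [0.9, 0.1, 0.0]
doc_vec = [0.8, 0.2, 0.1]
print(cosine_similarity(query_vec, doc_vec))  # near 1.0, i.e. very similar

Documents whose embeddings score highest against the query embedding are the ones returned by the search.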

Install the dependencies:

pip install langchain chromadb sentence-transformers

Use the following code:

from langchain.document_loaders import CSVLoader
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.vectorstores import Chroma

# Open-source embedding model from Hugging Face
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# Load the CSV; each row becomes one Document
loader = CSVLoader("./banklist.csv", encoding="windows-1252")
documents = loader.load()

# Embed the documents and index them in Chroma
db = Chroma.from_documents(documents, embedding_function)

# Retrieve the rows most similar to the query
query = "Did a bank fail in North Carolina?"
docs = db.similarity_search(query)
print(docs[0].page_content)

Steps:

  1. Use SentenceTransformerEmbeddings to create an embedding function backed by the open-source all-MiniLM-L6-v2 model from Hugging Face.
  2. Instantiate the CSVLoader for the banklist.csv file. I had to use windows-1252 as the encoding for banklist.csv.
  3. Load the file into documents.
  4. Instantiate a Chroma DB instance from the documents and the embedding model.
  5. Perform a cosine similarity search.
  6. Print the contents of the first retrieved document (a sketch for inspecting the scores as well follows this list).
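Each CSV row becomes its own Document, with the column names and values joined into the page content. If you also want to see how close each match is, the Chroma vector store exposes similarity_search_with_score, which returns (document, score) pairs; for Chroma the score is a distance, so lower means more similar. A minimal sketch, reusing the db built above:

# Retrieve the top 3 matches together with their distance scores
results = db.similarity_search_with_score("Did a bank fail in North Carolina?", k=3)
for doc, score in results:
    print(f"distance={score:.4f}")
    print(doc.page_content)  # "Bank Name: ...\nCity: ..." style text from the CSV row
    print(doc.metadata)      # e.g. {'source': './banklist.csv', 'row': 12}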

LangChain Expression Language with Chroma DB CSV (RAG)

Having covered how to use CSV files in a vector store, let’s move on to a more advanced application: integrating Chroma DB with CSV data in a chain.

This section will demonstrate how to enhance the capabilities of our language model by incorporating RAG.

For the purposes of the following code, I opted for the OpenAI model and embeddings.

Install the dependencies:

pip install langchain chromadb openai tiktoken

Use the following code:

from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import CSVLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

embedding_function = OpenAIEmbeddings()

loader = CSVLoader("./banklist.csv", encoding="windows-1252")
documents = loader.load()

# Index the CSV rows and expose the store as a retriever
db = Chroma.from_documents(documents, embedding_function)
retriever = db.as_retriever()

template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

model = ChatOpenAI()

# The retriever fills {context}; RunnablePassthrough forwards the question as-is
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

print(chain.invoke("What bank failed in North Carolina?"))

Borrowing from the prior example, we:

  1. Created a prompt template with context and question variables.
  2. Created a chain using the ChatOpenAI model with a retriever (one common refinement is sketched after this list).
  3. Invoked the chain with the question What bank failed in North Carolina?
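One refinement worth knowing about: passing the retriever directly means the prompt receives the retrieved Document objects stringified as a Python list. A common pattern is to join only the page contents into plain text before they reach the prompt. Here is a minimal sketch, assuming the retriever, prompt, model, and parser from the code above (format_docs is my own helper name, not a LangChain API):

from langchain_core.runnables import RunnableLambda

def format_docs(docs):
    # Keep just the text of each retrieved CSV row, separated by blank lines
    return "\n\n".join(doc.page_content for doc in docs)

chain = (
    {"context": retriever | RunnableLambda(format_docs), "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

print(chain.invoke("What bank failed in North Carolina?"))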

Output:

The bank that failed in North Carolina is Blue Ridge Savings Bank, Inc.

This tutorial covers only the basic functionality of Chroma DB. For a more in-depth, step-by-step walkthrough, please visit my Chroma DB guide.

#chroma-db   #generative-ai   #python  
