RAG
Retrieval-augmented generation, or RAG, is a technique that enhances an LLM’s responses by incorporating external information.
RAG leverages a database to fetch the most contextually relevant results via semantic search at generation time. This technique is particularly useful for prompting off-the-shelf LLMs that were fine-tuned for general chat, enabling them to handle queries beyond their original training data. It also allows LLMs to access up-to-date information without retraining and helps reduce hallucinations.
Workflow steps
A typical RAG workflow consists of two main phases: ingestion and online processing. Ingestion, or preprocessing (steps 1-4), involves preparing the data for the RAG system and usually occurs offline. Online processing (steps 5-7) handles the real-time aspects of the RAG system, focusing on data retrieval and response generation.
Document parsing
Data is loaded and formatted into digital text using document loaders tailored to specific file types, such as CSV, Markdown, and PDF. This step includes extracting content and associated metadata like source and page numbers. Depending on the quality and format of the files, data cleaning and customization may be necessary to meet different use case requirements.
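As an illustration, the sketch below parses a PDF and a CSV file with LangChain's community document loaders. The file paths, the loader choices, and the langchain-community dependency are assumptions for this example, not a prescribed setup.

```python
# Minimal parsing sketch using LangChain community loaders
# (assumed installed: `pip install langchain-community pypdf`). Paths are placeholders.
from langchain_community.document_loaders import CSVLoader, PyPDFLoader

pdf_docs = PyPDFLoader("reports/annual_report.pdf").load()  # one Document per page
csv_docs = CSVLoader("data/faq.csv").load()                  # one Document per row

documents = pdf_docs + csv_docs
# Each Document carries the extracted text plus metadata such as source and page number.
print(documents[0].metadata)  # e.g. {'source': 'reports/annual_report.pdf', 'page': 0}
```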
Splitting
Documents are divided into smaller chunks to accommodate the model's context window limitations. There are various chunking strategies, such as fixed-size or sliding-window splitting, with chunk size and overlap size being key hyperparameters that can be tuned. A minimal example follows.
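The sketch below uses LangChain's recursive character splitter; the chunk size and overlap values are illustrative, not recommendations.

```python
# Chunking sketch (assumed: `pip install langchain-text-splitters`).
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # maximum characters per chunk
    chunk_overlap=200,  # overlap between consecutive chunks to preserve context
)
chunks = splitter.split_documents(documents)  # `documents` from the parsing step
```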
Vector embeddings
Each text chunk is converted into a vector representation using an embedding model. These embeddings facilitate content retrieval based on semantic similarity.
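For example, an open sentence-embedding model can be wrapped through LangChain as sketched below; the specific model name and the langchain-huggingface dependency are assumptions.

```python
# Embedding sketch (assumed: `pip install langchain-huggingface sentence-transformers`).
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

vectors = embeddings.embed_documents([c.page_content for c in chunks])
print(len(vectors), len(vectors[0]))  # number of chunks x embedding dimension (384 here)
```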
Vector store
The embeddings, along with their corresponding content and metadata, are stored in a vector database, using the embedding as the index. Several vector database options are available, including FAISS, ChromaDB, Qdrant, and Milvus.
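As one possibility, the chunks and embeddings can be indexed with FAISS through LangChain, as sketched below; the index path is a placeholder.

```python
# Vector store sketch using FAISS through LangChain (assumed: `pip install faiss-cpu`).
from langchain_community.vectorstores import FAISS

vector_store = FAISS.from_documents(chunks, embeddings)  # embeds and indexes the chunks
vector_store.save_local("rag_index")                      # persist the index for reuse (placeholder path)
```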
Retrieval
The query is also embedded, and a retriever function is used to identify the closest chunk vectors in the vector database to the query vector, based on a specific similarity metric. This step leverages the embeddings stored during the ingestion workflow.
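A minimal retrieval sketch against the FAISS store from the previous step is shown below; the query text and the choice of k are illustrative.

```python
# Retrieval sketch: embed the query and fetch the nearest chunks from the store.
query = "What does the annual report say about revenue growth?"  # example query

# similarity_search embeds the query with the same embedding model and returns
# the k closest chunks under the store's similarity metric.
retrieved_chunks = vector_store.similarity_search(query, k=4)
for doc in retrieved_chunks:
    print(doc.metadata.get("source"), doc.page_content[:80])
```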
Reranking
This is an optional step in the online processing phase. A reranker model can be used to reorder the retrieved chunks by relevance and filter out chunks that are not useful for answering the query.
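One common approach is a cross-encoder reranker, sketched below; the model name and the number of chunks kept are assumptions for illustration.

```python
# Optional reranking sketch with a cross-encoder (assumed: `pip install sentence-transformers`).
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example ranking model
scores = reranker.predict([(query, doc.page_content) for doc in retrieved_chunks])

# Keep the highest-scoring chunks and drop the rest.
ranked = sorted(zip(scores, retrieved_chunks), key=lambda pair: pair[0], reverse=True)
final_chunks = [doc for _, doc in ranked[:2]]
```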
Q&A generation
The LLM receives the user query along with the final retrieved chunks as context and generates a response grounded in that content.
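The sketch below assembles the retrieved chunks into a prompt and calls a chat model through an OpenAI-compatible endpoint; the base URL, API key, and model name are placeholders, not a specific provider's values.

```python
# Generation sketch assuming an OpenAI-compatible chat endpoint
# (assumed: `pip install langchain-openai`).
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="https://api.example.com/v1",  # placeholder endpoint
    api_key="YOUR_API_KEY",                 # placeholder credential
    model="llama-3.1-8b-instruct",          # placeholder model name
)

context = "\n\n".join(doc.page_content for doc in final_chunks)
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
)
response = llm.invoke(prompt)
print(response.content)
```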
Try implementing a RAG application end-to-end using LangChain and SambaNova by checking out the enterprise knowledge retriever (EKR) kit.