What RAG is and when you need it
Retrieval-Augmented Generation (RAG) retrieves relevant content from your knowledge base and provides it as context to the LLM before it generates a response. The result: the AI answers based on your actual documentation rather than its training data, with dramatically lower hallucination rates.
Use RAG when: the automation must answer questions about proprietary knowledge (company policies, specific product features, client-specific info); the knowledge base is too large to include entirely in every prompt; content changes frequently enough that training data cannot be relied on; or hallucination is a serious risk (customer-facing decisions, compliance, financial matters).
RAG is overkill when: the knowledge base fits in a single prompt (under ~10,000 tokens); the task requires only general knowledge; or for classification/extraction where LLM training data is sufficient.
The two-phase RAG architecture
Phase 1: Indexing (run once, or when content changes)
Documents (PDFs, Word, web pages, Notion)
↓ Load & extract text
↓ Split into chunks (500-800 tokens, 100-token overlap)
↓ Embed each chunk (text-embedding-3-small)
↓ Store in vector database (FAISS / Pinecone / Supabase)

Phase 2: Query (runs for every query)
User question
↓ Embed question with same model
↓ Similarity search in vector database
↓ Retrieve top 3–5 most relevant chunks
↓ Prompt: [System] + [Retrieved chunks] + [Question]
↓ LLM generates grounded response

Key component choices
Embedding model: OpenAI text-embedding-3-small ($0.02/million tokens) for most use cases. For zero API cost, use sentence-transformers/all-MiniLM-L6-v2 locally via the sentence-transformers library.
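If you want to try the zero-cost local route, here is a minimal sketch (assuming sentence-transformers is installed; the helper names are mine, not from a library):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def embed_local(texts):
    """Embed a list of strings locally; returns an array of shape (n, 384)."""
    # pip install sentence-transformers
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    return model.encode(texts, normalize_embeddings=True)
```

Swapping this in for the OpenAI `embed()` function in the FAISS example below only requires one other change: the index dimension becomes 384 instead of 1536.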
Chunk size: 500–800 tokens with 100-token overlap is the practical starting point. Test with 20 representative queries and adjust based on retrieval quality before committing.
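One way to run that test: score each candidate chunk size by hit rate, i.e. how often the retrieved chunks come from the document that should answer the query. A sketch (the harness is illustrative; `search_fn` is any function returning ranked chunk dicts with a "title" key, like the `search` method below):

```python
def hit_rate(search_fn, test_queries, k=3):
    """Fraction of test queries whose expected source appears in the top-k results.

    test_queries: list of (question, expected_title) pairs.
    search_fn(query, k): returns a ranked list of chunk dicts with a 'title' key.
    """
    hits = 0
    for question, expected_title in test_queries:
        results = search_fn(question, k)
        if any(c["title"] == expected_title for c in results):
            hits += 1
    return hits / len(test_queries)
```

Build one index per candidate chunk size (e.g. 400, 600, 800 tokens), run the same 20 queries through each, and keep the size with the best hit rate.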
Vector database: FAISS (in-memory, no server) for development and small deployments; Pinecone or Supabase pgvector (both have free tiers) for production.
Minimal working RAG in Python (FAISS)
pip install openai faiss-cpu numpy
import faiss
import numpy as np
import openai

client = openai.OpenAI()

def embed(text):
    r = client.embeddings.create(model="text-embedding-3-small", input=text)
    return r.data[0].embedding

def chunk(text, size=600, overlap=100):
    # size/overlap are in words, a rough proxy for tokens
    words = text.split()
    return [" ".join(words[i:i + size])
            for i in range(0, len(words), size - overlap)
            if words[i:i + size]]

class RAGSystem:
    def __init__(self):
        self.chunks = []
        self.index = None

    def build_index(self, docs):
        all_chunks = []
        for doc in docs:
            for c in chunk(doc["content"]):
                all_chunks.append({"text": c, "title": doc.get("title", "")})
        vecs = np.array([embed(c["text"]) for c in all_chunks], dtype=np.float32)
        faiss.normalize_L2(vecs)  # normalize so inner product = cosine similarity
        self.index = faiss.IndexFlatIP(vecs.shape[1])
        self.index.add(vecs)
        self.chunks = all_chunks
        print(f"Indexed {len(all_chunks)} chunks")

    def search(self, query, k=3):
        q = np.array([embed(query)], dtype=np.float32)
        faiss.normalize_L2(q)
        _, idxs = self.index.search(q, k)
        return [self.chunks[i] for i in idxs[0] if i != -1]

    def answer(self, question):
        retrieved = self.search(question)
        context = "\n\n".join(f"[{c['title']}]\n{c['text']}" for c in retrieved)
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": (
                    "Answer questions ONLY using the context provided below. "
                    "If the answer is not in the context, state clearly that "
                    "you do not have that information. Do not use outside "
                    "knowledge.\n\n"
                    f"KNOWLEDGE BASE:\n{context}"
                )},
                {"role": "user", "content": question}
            ],
            temperature=0.1
        )
        return response.choices[0].message.content
# Example usage
rag = RAGSystem()
rag.build_index([
    {"title": "Refund Policy",
     "content": "Full refunds available within 30 days of purchase. After 30 days, we offer store credit only. Contact billing@company.com with your order number. Processing takes 5-7 business days."},
    {"title": "Pro Plan Features",
     "content": "Pro plan includes unlimited workflows, priority support, and API access. Starter plan is limited to 10 active workflows and email support only."},
])
print(rag.answer("Can I get a refund after 30 days?"))

Production vector databases
Pinecone (managed, scales to billions of vectors)
Most popular managed vector database. Free tier: 1 index, 100K vectors — adequate for most small business RAG applications. Simple SDK, fully managed infrastructure, no server administration.
pip install pinecone  # the package was renamed from pinecone-client
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="your-key")
pc.create_index("kb", dimension=1536, metric="cosine",
                spec=ServerlessSpec(cloud="aws", region="us-east-1"))
index = pc.Index("kb")

# Upsert
index.upsert([{"id": "chunk_001", "values": vector,
               "metadata": {"text": text, "source": "policy"}}])

# Query
results = index.query(vector=query_vec, top_k=3, include_metadata=True)
chunks = [r.metadata["text"] for r in results.matches]

Supabase pgvector (free tier, SQL-friendly)
PostgreSQL with the pgvector extension. Free tier sufficient for small deployments. Advantage: combine vector search with SQL filters (e.g., "retrieve relevant chunks that are also tagged active and belong to product version 2.x").
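A sketch of that combined query (table and column names are illustrative; pgvector's `<=>` operator is cosine distance, so ascending order means most similar first):

```python
# Schema: a chunks table with a tags array and a 1536-dim embedding column.
SETUP_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS kb_chunks (
    id bigserial PRIMARY KEY,
    text text NOT NULL,
    tags text[] DEFAULT '{}',
    embedding vector(1536)
);
"""

# Combine vector search with a SQL filter: nearest chunks tagged 'active'.
QUERY_SQL = """
SELECT text
FROM kb_chunks
WHERE 'active' = ANY(tags)
ORDER BY embedding <=> %s::vector
LIMIT 3;
"""

def to_pgvector(vec):
    """Format a Python list as a pgvector literal, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(f"{x:g}" for x in vec) + "]"
```

Run `QUERY_SQL` with any PostgreSQL driver (e.g. psycopg), passing `to_pgvector(query_embedding)` as the parameter.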
Common RAG quality issues and fixes
Incomplete answers (context split across chunks)
Fix: Increase chunk overlap from 100 to 200 tokens; increase top_k from 3 to 5–7; or implement parent document retrieval (match on small chunks, expand to the parent paragraph for context delivery).
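Parent document retrieval can be sketched in a few lines: embed and index small child chunks for precise matching, but keep a pointer from each child to its parent text and deliver the parent to the LLM (names here are mine):

```python
def build_parent_map(docs, child_size=200, overlap=50):
    """Split each doc into small child chunks, each pointing back to its parent.

    Returns (child_texts, parents) where parents[i] is the full parent text
    for child_texts[i]. Embed/index child_texts; serve parents as context.
    """
    child_texts, parents = [], []
    for doc in docs:
        words = doc["content"].split()
        for i in range(0, len(words), child_size - overlap):
            piece = " ".join(words[i:i + child_size])
            if piece:
                child_texts.append(piece)      # match on the small chunk
                parents.append(doc["content"]) # deliver the large context
    return child_texts, parents
```

At query time, search over the child embeddings as usual, then substitute `parents[i]` for each matched child before building the prompt.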
AI ignores retrieved context
Fix: Strengthen the constraint in your system prompt: "You MUST answer ONLY from the context sections below. If the answer is not explicitly present, say so. Generating any information not in the context is a critical error."
Poor retrieval for short queries
Fix: Query expansion before retrieval. Have an LLM rewrite the question to be more specific: "What is the price?" becomes "What are the subscription pricing tiers, monthly costs, and annual billing options for this software product?" The expanded query produces significantly better semantic matches.
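A minimal sketch of that expansion step (the prompt wording and function names are mine; the call mirrors the chat completion pattern used in the FAISS example above):

```python
EXPANSION_PROMPT = (
    "Rewrite the user's question to be more specific and explicit so it "
    "matches documentation passages better. Return only the rewritten "
    "question.\n\nQuestion: {question}"
)

def expand_query(client, question, model="gpt-4o-mini"):
    """Rewrite a short query into a more retrievable one before embedding it."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": EXPANSION_PROMPT.format(question=question)}],
        temperature=0.0,
    )
    return response.choices[0].message.content.strip()
```

Call `expand_query(client, user_question)` and embed the result instead of the raw question; the rest of the pipeline is unchanged.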
Knowledge base going stale
Fix: For most small business knowledge bases changing weekly or less, a scheduled nightly full re-index is the simplest reliable approach. For larger bases, implement incremental updates: identify documents by a source metadata field, delete and re-index only changed documents when their source files are updated.
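Change detection for incremental updates can be as simple as hashing each source document and re-indexing only those whose hash changed (a sketch; the stored-hash dict stands in for whatever metadata store you use):

```python
import hashlib

def changed_docs(docs, stored_hashes):
    """Return docs whose content hash differs from the stored hash.

    docs: list of {"source": ..., "content": ...} dicts. stored_hashes is
    updated in place, so the next run only reports new changes.
    """
    changed = []
    for doc in docs:
        h = hashlib.sha256(doc["content"].encode("utf-8")).hexdigest()
        if stored_hashes.get(doc["source"]) != h:
            changed.append(doc)
            stored_hashes[doc["source"]] = h
    return changed
```

On each run, delete the vector DB entries for each changed source (keyed by the source metadata field), then re-chunk and re-embed only those documents.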
Related: Vector databases for AI automation — deep coverage of vector DB selection, indexing strategies, and performance tuning.
Frequently asked questions
Can I build RAG without writing code?
For small knowledge bases (under ~10,000 tokens), paste the content directly into the system prompt — this is effectively RAG without the retrieval step. For proper RAG with automatic document ingestion and semantic retrieval, Python or a dedicated platform is required. Relevance AI provides a no-code RAG interface worth evaluating for teams without Python skills. Make.com can call vector database APIs via HTTP modules but cannot handle chunking and indexing natively.
How much does a RAG system cost to run?
For a small business customer support RAG system handling 500 queries per day: OpenAI embedding for initial indexing (~$0.10 for a typical FAQ), vector DB (free tier for Pinecone or Supabase), and LLM costs for query responses ($0.15–$0.50/day depending on response length and model). Total: approximately $5–$20/month for most small business RAG deployments.
What chunk size should I start with?
Start with 600 tokens (approximately 450 words) with 100-token overlap — this works well for FAQ pages, policy documents, and product documentation. Increase to 800–1000 tokens if answers regularly require context from multiple consecutive sections. Decrease to 400–500 tokens if irrelevant content is being retrieved alongside relevant content. Test with 20 representative queries before committing to a chunk size.
Continue building expertise
The complete guide covers every tool and architecture.
Complete AI Automation Guide →

ThinkForAI Editorial Team
All code verified in production. Updated November 2024.
