At CleverFlow, I built a RAG (Retrieval Augmented Generation) system that processes over 10,000 documents daily with multimodal analysis capabilities. This system serves multiple enterprise clients with 99.5% uptime and sub-second query response times.
In this guide, I’ll share the real-world architecture, lessons learned, and optimization techniques that can help you build production-grade RAG systems.
What is RAG and Why Does It Matter?
RAG (Retrieval Augmented Generation) combines the power of large language models with external knowledge retrieval. Instead of relying solely on the LLM’s training data, RAG systems:
- Retrieve relevant documents from a knowledge base
- Augment the LLM prompt with retrieved context
- Generate accurate responses based on your specific data
This approach solves key LLM limitations:
- Hallucinations - Reduced by grounding answers in real documents
- Outdated knowledge - Always use latest data without retraining
- Domain specificity - Tailor responses to your business context
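To make that retrieve-augment-generate loop concrete before we get into architecture, here's a deliberately minimal sketch (the retriever and llm callables are stand-ins for the real components covered below):

def answer(question: str, retriever, llm) -> str:
    # 1. Retrieve: pull the most relevant chunks from the knowledge base
    chunks = retriever(question, top_k=5)

    # 2. Augment: ground the prompt in the retrieved context
    context = "\n\n".join(chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # 3. Generate: the LLM answers from the grounded prompt
    return llm(prompt)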
Production RAG Architecture Overview
Here’s the high-level architecture we use at CleverFlow:
Ingestion pipeline:
Documents (10K+) → Chunking & Processing → Embedding Model → Vector Database (Qdrant)

Query pipeline:
User Query → Query Processing → Semantic Search (Qdrant) → Re-ranking → LLM (GPT/LLaMA), fronted by a Redis cache → FastAPI Response
Step 1: Vector Database Selection
The vector database is the heart of your RAG system. Based on our production experience, here’s what matters:
Qdrant (Our Choice)
- Performance: HNSW indexing provides sub-100ms search on millions of vectors
- Filtering: Powerful metadata filtering narrows search space
- Scalability: Horizontal scaling with collection sharding
- Production-ready: Built-in monitoring and health checks
Alternatives
- Faiss: Great for single-machine deployments, excellent speed
- Pinecone: Managed service, good for quick POCs
- Weaviate: Strong multi-modal capabilities
Key Decision Factors:
# Example: Qdrant setup with production settings
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, HnswConfigDiff, VectorParams

client = QdrantClient(url="http://localhost:6333")

# Create collection with an HNSW index
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(
        size=768,                  # Must match your embedding model (bge-base: 768, bge-large: 1024)
        distance=Distance.COSINE,
        on_disk=False              # Keep vectors in memory for speed
    ),
    hnsw_config=HnswConfigDiff(
        m=16,                      # Number of edges per node
        ef_construct=100           # Index quality vs build-speed tradeoff
    )
)
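Creating the collection is only half the story; each chunk then gets upserted as a point carrying its embedding and a metadata payload. A minimal sketch, assuming chunks shaped like the dicts produced in Step 3 and the embed_documents helper from Step 2:

import uuid
from qdrant_client.models import PointStruct

def index_chunks(chunks: list[dict]):
    """Upsert chunk embeddings plus metadata payloads into Qdrant."""
    vectors = embed_documents([c["text"] for c in chunks])  # helper defined in Step 2
    points = [
        PointStruct(
            id=str(uuid.uuid4()),  # stable IDs from your document store also work
            vector=vector,
            payload={**chunk["metadata"], "text": chunk["text"]},
        )
        for chunk, vector in zip(chunks, vectors)
    ]
    client.upsert(collection_name="documents", points=points)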
Step 2: Embedding Model Selection
We tested multiple embedding models. Here’s what we found:
BGE (BAAI General Embedding) - Our Winner
- Model: BAAI/bge-large-en-v1.5
- Dimension: 1024 (we use the 768-dimension bge-base variant for speed)
- Performance: Best retrieval quality in our domain
- Speed: 50ms latency for a batch of 10 queries
Other Strong Candidates
- E5-large: Excellent multilingual support
- OpenAI text-embedding-3: Good but costly at scale
- Sentence-Transformers: Great for custom fine-tuning
Production Tip:
from sentence_transformers import SentenceTransformer

# Load the model once at startup (use BAAI/bge-base-en-v1.5 if your collection stores 768-dim vectors)
model = SentenceTransformer('BAAI/bge-large-en-v1.5')

# Batch processing for efficiency
def embed_documents(texts: list[str]) -> list[list[float]]:
    return model.encode(
        texts,
        batch_size=32,
        show_progress_bar=False,
        normalize_embeddings=True  # Normalized vectors for cosine similarity
    ).tolist()
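One BGE-specific detail: the bge-en-v1.5 model cards recommend prefixing short retrieval queries with an instruction, while passages are embedded as-is. A small query-side helper to pair with embed_documents (the naming here is ours):

BGE_QUERY_INSTRUCTION = "Represent this sentence for searching relevant passages: "

def embed_query(query: str) -> list[float]:
    # Only queries get the instruction prefix; document chunks are embedded unchanged
    return model.encode(
        BGE_QUERY_INSTRUCTION + query,
        normalize_embeddings=True
    ).tolist()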
Step 3: Chunking Strategy
This is where most RAG systems fail. Poor chunking = poor retrieval = poor answers.
Our Production Chunking Approach
We use semantic chunking with sliding windows:
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,      # Target size per chunk (characters with len; see the token-based variant below)
    chunk_overlap=50,    # Maintain context across chunk boundaries
    separators=["\n\n", "\n", ". ", " ", ""],
    length_function=len
)

chunks = splitter.split_text(document_text)
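As noted, with length_function=len the chunk size is measured in characters. If you want it measured in actual tokens (as the rules below assume), one option is LangChain's tokenizer-aware constructor; here's a sketch using the BGE tokenizer, which may differ from your exact setup:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer

# Count length in the same tokens the embedding model will see
tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-en-v1.5')

token_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer,
    chunk_size=512,   # now genuinely 512 tokens
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""]
)

chunks = token_splitter.split_text(document_text)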
Chunking Rules We Follow
- Size: 512-1024 tokens - stay within your embedding model's context window (the BGE v1.5 models truncate input at 512 tokens)
- Overlap: 10-20% overlap preserves context across boundaries
- Semantic boundaries: Split on paragraphs, then sentences
- Metadata preservation: Track source document, page number, section
Advanced: Chunk Optimization
We achieved 30% better retrieval with these techniques:
def create_enriched_chunks(document, base_chunks):
    """Prepend document context to each chunk for better retrieval."""
    chunks = []
    for i, chunk in enumerate(base_chunks):
        # Document-level context gets embedded together with the chunk text
        enriched = (
            f"Document: {document.title}\n"
            f"Section: {document.sections[i]}\n"
            f"{chunk}"
        )
        chunks.append({
            'text': enriched,
            'metadata': {
                'doc_id': document.id,
                'page': document.pages[i],
                'chunk_index': i,
                'total_chunks': len(base_chunks)
            }
        })
    return chunks
Step 4: Hybrid Search Implementation
Pure vector search misses exact matches. We use hybrid search:
def hybrid_search(query: str, top_k: int = 10):
    # 1. Vector search (semantic)
    query_vector = embed_model.encode(query)
    vector_results = qdrant_client.search(
        collection_name="documents",
        query_vector=query_vector,
        limit=top_k
    )

    # 2. Keyword search (BM25)
    keyword_results = bm25_index.search(query, top_k)

    # 3. Fusion (RRF - Reciprocal Rank Fusion)
    combined = reciprocal_rank_fusion(
        [vector_results, keyword_results],
        weights=[0.7, 0.3]  # Favor the semantic results
    )
    return combined[:top_k]
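The reciprocal_rank_fusion helper isn't shown above, so here's a minimal weighted-RRF sketch. It assumes each result object exposes a stable id attribute, and that bm25_index is a thin keyword-search wrapper (e.g. around rank_bm25) returning objects of the same shape:

def reciprocal_rank_fusion(result_lists, weights=None, k=60):
    """Merge ranked lists: score(doc) = sum over lists of weight / (k + rank)."""
    weights = weights or [1.0] * len(result_lists)
    scores, docs = {}, {}
    for results, weight in zip(result_lists, weights):
        for rank, doc in enumerate(results, start=1):
            doc_id = doc.id  # assumes a stable identifier across both result lists
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
            docs[doc_id] = doc
    ranked_ids = sorted(scores, key=scores.get, reverse=True)
    return [docs[doc_id] for doc_id in ranked_ids]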
Result: 20-30% improvement in retrieval accuracy over vector-only search.
Step 5: Re-ranking Layer
Retrieved documents aren’t always in optimal order. Re-ranking fixes this:
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank_results(query: str, results: list):
    # Score each (query, document) pair with the cross-encoder
    pairs = [[query, doc.content] for doc in results]
    scores = reranker.predict(pairs)

    # Sort results by relevance score, highest first
    reranked = sorted(
        zip(results, scores),
        key=lambda x: x[1],
        reverse=True
    )
    return [doc for doc, score in reranked]
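Putting Steps 4 and 5 together, retrieval becomes: over-fetch candidates with hybrid search, then keep only the best few after re-ranking (the candidate counts here are illustrative, not tuned values from our system):

def retrieve(query: str, final_k: int = 5):
    # Over-fetch so the cross-encoder has enough candidates to choose from
    candidates = hybrid_search(query, top_k=20)
    return rerank_results(query, candidates)[:final_k]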
Step 6: Caching Layer (40% Cost Reduction)
Redis caching was our biggest performance win:
import redis
import hashlib
import json

redis_client = redis.Redis(host='localhost', port=6379)

def cached_rag_query(query: str):
    # Create a deterministic cache key from the query text
    cache_key = f"rag:{hashlib.md5(query.encode()).hexdigest()}"

    # Check cache
    cached = redis_client.get(cache_key)
    if cached:
        return json.loads(cached)

    # Process the query
    result = rag_pipeline(query)

    # Cache for 1 hour
    redis_client.setex(
        cache_key,
        3600,
        json.dumps(result)
    )
    return result
Impact:
- 60% faster response times
- 40% lower inference costs
- Better user experience
Step 7: FastAPI Microservice
Production-ready API with proper error handling:
import asyncio
import logging

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI(title="RAG API")
logger = logging.getLogger(__name__)

class Query(BaseModel):
    text: str
    top_k: int = 5
    use_cache: bool = True

@app.post("/query")
async def query_documents(query: Query):
    try:
        # Input validation
        if len(query.text) < 10:
            raise HTTPException(400, "Query too short")

        # Process with a timeout so slow queries don't pile up
        result = await asyncio.wait_for(
            rag_pipeline(query.text, query.top_k),
            timeout=30.0
        )
        return {
            "answer": result.answer,
            "sources": result.sources,
            "confidence": result.confidence
        }
    except HTTPException:
        # Let explicit HTTP errors (like the 400 above) pass through unchanged
        raise
    except asyncio.TimeoutError:
        logger.error(f"Timeout for query: {query.text}")
        raise HTTPException(504, "Query timeout")
    except Exception as e:
        logger.exception("RAG error")
        raise HTTPException(500, str(e))

@app.get("/health")
async def health_check():
    """Health check for monitoring"""
    return {
        "status": "healthy",
        "vector_db": check_qdrant(),
        "llm": check_llm()
    }
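The check_qdrant() and check_llm() helpers above are placeholders whose details depend on your stack. A rough sketch, assuming the Qdrant client from Step 1 and a hypothetical OpenAI-compatible llm_client:

def check_qdrant() -> bool:
    # A cheap read that fails fast if the vector DB is unreachable
    try:
        client.get_collection("documents")
        return True
    except Exception:
        return False

def check_llm() -> bool:
    # llm_client is hypothetical here - substitute your provider's lightest "are you up" call
    try:
        llm_client.models.list()
        return True
    except Exception:
        return False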
Step 8: Docker Deployment
Production containerization:
FROM python:3.11-slim
WORKDIR /app
# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application
COPY . .
# Download models at build time
RUN python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('BAAI/bge-large-en-v1.5')"
# Run
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]
Docker Compose for full stack:
version: '3.8'

services:
  qdrant:
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"
    volumes:
      - ./qdrant_data:/qdrant/storage

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

  rag-api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - QDRANT_URL=http://qdrant:6333
      - REDIS_URL=redis://redis:6379
    depends_on:
      - qdrant
      - redis
Production Metrics We Track
Essential monitoring:
from prometheus_client import Counter, Histogram

# Metrics
query_counter = Counter('rag_queries_total', 'Total queries')
query_duration = Histogram('rag_query_duration_seconds', 'Query latency')
cache_hits = Counter('rag_cache_hits_total', 'Cache hits')

@query_duration.time()
def process_query(query: str):
    query_counter.inc()

    # Check cache first
    if cached_result := get_from_cache(query):
        cache_hits.inc()
        return cached_result

    # Fall through to the full pipeline
    return rag_pipeline(query)
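These metrics only help if Prometheus can scrape them. One straightforward option (not shown in our original snippet) is to mount prometheus_client's ASGI app inside the FastAPI service from Step 7:

from prometheus_client import make_asgi_app

# Expose the default metrics registry at GET /metrics for Prometheus to scrape
app.mount("/metrics", make_asgi_app())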
Our Production Stats:
- Uptime: 99.5%
- P95 Latency: 850ms
- Cache Hit Rate: 42%
- Daily Queries: 50,000+
- Documents Indexed: 10,000+
Optimization Techniques That Worked
1. Batch Processing
Process documents in batches of 100 for 3x faster indexing.
2. Async Processing
Use asyncio for concurrent LLM calls - 2x throughput improvement (see the sketch after this list).
3. Model Quantization
4-bit quantization reduced memory by 75% with minimal accuracy loss.
4. Connection Pooling
Reuse database connections - 40% latency reduction.
5. Prompt Optimization
Shorter, focused prompts reduced tokens by 30%, cutting costs.
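For the async processing point above, the pattern is to fan independent LLM calls out with asyncio.gather rather than awaiting them one by one. A minimal sketch, with a stubbed-out llm_complete standing in for your real async LLM call:

import asyncio

async def llm_complete(prompt: str) -> str:
    # Stand-in for your async LLM client call (e.g. an OpenAI-compatible SDK)
    await asyncio.sleep(0.1)  # simulates network latency
    return f"answer to: {prompt}"

async def answer_batch(prompts: list[str]) -> list[str]:
    # All calls run concurrently; total time is roughly the slowest call, not the sum
    return await asyncio.gather(*(llm_complete(p) for p in prompts))

# answers = asyncio.run(answer_batch(["q1", "q2", "q3"]))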
Common Pitfalls to Avoid
- Chunks that are too large - LLMs lose focus beyond ~1,000 tokens per chunk
- No metadata filtering - Wastes time retrieving irrelevant docs
- Single vector search - Misses exact keyword matches
- No caching - Paying for same query repeatedly
- Synchronous processing - Can’t scale under load
- Poor error handling - System crashes on edge cases
- No monitoring - Can’t optimize what you don’t measure
Cost Optimization
Our monthly costs for 50K queries:
- Qdrant Cloud: $150 (2GB RAM, 10M vectors)
- OpenAI API: $200 (with caching: was $500)
- Redis: $20 (managed instance)
- Compute: $100 (2x CPU instances)
Total: ~$470/month for production RAG
Next Steps and Advanced Topics
Once you have a basic RAG system, consider:
- Multi-modal RAG - Images, tables, charts
- Agentic RAG - LLM decides which tools to use
- Fine-tuned embeddings - Domain-specific embedding models
- Query rewriting - Improve retrieval with query expansion
- Feedback loops - User ratings improve retrieval over time
Conclusion
Building production RAG systems is an iterative process. Our system at CleverFlow took 3 months to reach production quality, processing 10,000+ documents with 99.5% uptime.
Key Takeaways:
- Vector database choice matters - Qdrant scales well
- Chunking is critical - Semantic chunking beats fixed-size
- Hybrid search wins - Combine vector + keyword
- Cache aggressively - 40% cost savings
- Monitor everything - Can’t optimize blindly
- Iterate based on data - User feedback drives improvements
The architecture I shared here is battle-tested and serves enterprise clients reliably. Start with the basics, measure performance, and optimize bottlenecks.
Tools and Technologies Used
- LangChain - RAG orchestration
- Qdrant - Vector database
- FastAPI - API framework
- Redis - Caching layer
- Docker - Containerization
- PyTorch - Model inference
- Sentence Transformers - Embeddings
Connect With Me
Building RAG systems? Have questions? Let’s connect:
- GitHub: github.com/huzaifa525
- LinkedIn: linkedin.com/in/huzefanalkheda
- HuggingFace: huggingface.co/huzaifa525
Have you built RAG systems? What challenges did you face? Share your experience in the comments or reach out - I’d love to hear your story!