When we launched our RAG system at CleverFlow, average query latency was 2.1 seconds. Users complained it felt slow. After implementing the optimizations I’ll share in this post, we brought it down to 850ms - a 60% reduction.
Here’s exactly how we did it, with code examples and real metrics.
The Performance Problem
Our initial RAG system architecture:
User Query (50ms)
↓
Embedding Generation (200ms)
↓
Vector Search (1500ms) ⬅️ BOTTLENECK
↓
LLM Generation (350ms)
───────────────────────
Total: 2.1 seconds
Vector search was taking 71% of total time.
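A breakdown like this comes from timing each pipeline stage rather than guessing. Here is a minimal sketch of such a per-stage timer (the helper and stage names are illustrative, not our production code):

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage: str):
    """Record wall-clock milliseconds for one pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = (time.perf_counter() - start) * 1000

# Wrap each stage of the pipeline:
# with timed("embedding"):
#     query_vector = model.encode(query)
# with timed("vector_search"):
#     hits = client.search(...)
# print(timings)  # e.g. {'embedding': 201.3, 'vector_search': 1497.8}
```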
Why Vector Search Was Slow
After profiling, we found 4 major issues:
- No indexing - Brute force search across 2M vectors
- No caching - Repeated queries hitting database
- Large embeddings - 1536 dimensions (OpenAI embeddings)
- No connection pooling - New connection per query
Let’s fix each one.
Optimization 1: HNSW Indexing (40% Faster)
The Problem
Our initial Qdrant setup used flat indexing (brute force):
# Slow: Brute force search
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE)
    # No index specified = brute force
)
Search time: 1500ms across 2M vectors
The Solution: HNSW
HNSW (Hierarchical Navigable Small World) creates a multi-layer graph for logarithmic search time.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, HnswConfigDiff

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="documents_optimized",
    vectors_config=VectorParams(
        size=1024,        # Reduced from 1536 (explained below)
        distance=Distance.COSINE,
        on_disk=False     # Keep in RAM for speed
    ),
    hnsw_config=HnswConfigDiff(
        m=16,                       # Edges per node (balance speed/accuracy)
        ef_construct=100,           # Construction time quality
        full_scan_threshold=10000   # Use HNSW above this size
    )
)
HNSW Parameters Explained
m (number of edges):
- Higher = more accuracy, more memory
- Sweet spot: 16-32
- We use 16 for good balance
ef_construct (build quality):
- Higher = better index quality, slower build
- We kept the default of 100
ef (search quality):
from qdrant_client.models import SearchParams

# Search with custom ef
results = client.search(
    collection_name="documents_optimized",
    query_vector=query_embedding,
    limit=10,
    search_params=SearchParams(hnsw_ef=128)  # Higher = more accurate
)
Results
| Configuration | Search Time | Accuracy |
|---|---|---|
| Flat indexing | 1500ms | 100% |
| HNSW (m=8, ef=64) | 420ms | 94% |
| HNSW (m=16, ef=128) | 900ms | 98% |
| HNSW (m=32, ef=256) | 1200ms | 99.5% |
Winner: m=16, ef=128 → 900ms (40% faster, 98% accuracy)
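An accuracy number like the column above is essentially recall against brute-force results. Here is a sketch of how such a check can be run, assuming the collection and client from above; Qdrant's SearchParams(exact=True) bypasses the HNSW index and gives exact ground truth (the helper name and sampling are mine):

```python
from qdrant_client.models import SearchParams

def recall_at_10(client, collection: str, query_vectors, hnsw_ef: int = 128) -> float:
    """Fraction of exact top-10 IDs that the HNSW search also returns."""
    overlap, total = 0, 0
    for vec in query_vectors:
        exact = client.search(
            collection_name=collection, query_vector=vec, limit=10,
            search_params=SearchParams(exact=True),  # brute-force ground truth
        )
        approx = client.search(
            collection_name=collection, query_vector=vec, limit=10,
            search_params=SearchParams(hnsw_ef=hnsw_ef),
        )
        exact_ids = {p.id for p in exact}
        overlap += len(exact_ids & {p.id for p in approx})
        total += len(exact_ids)
    return overlap / total if total else 0.0
```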
Optimization 2: Dimension Reduction (30% Faster)
The Problem
We were using OpenAI’s text-embedding-3-large (1536 dimensions):
# Expensive embeddings (OpenAI v1 client)
from openai import OpenAI

openai_client = OpenAI()
response = openai_client.embeddings.create(
    model="text-embedding-3-large",
    input=texts,
    dimensions=1536  # request 1536 dims (3-large defaults to 3072)
)
embeddings = [item.embedding for item in response.data]  # 1536-dim vectors
Issues:
- More compute for similarity calculation
- More memory usage
- Slower search
The Solution: Switch to BGE-large (1024 dims)
from sentence_transformers import SentenceTransformer

# Smaller, faster, free
model = SentenceTransformer('BAAI/bge-large-en-v1.5')
embeddings = model.encode(texts)  # Returns 1024-dim vectors
Dimension Comparison
| Model | Dimensions | Search Time | Quality | Cost/1M |
|---|---|---|---|---|
| OpenAI 3-large | 1536 | 900ms | Excellent | $130 |
| OpenAI 3-small | 1536 | 900ms | Very Good | $20 |
| BGE-large | 1024 | 630ms | Very Good | $0 |
| BGE-base | 768 | 630ms | Good | $0 |
BGE-large wins: 30% faster, ~98% of the retrieval quality, $0 API cost.
Migration Strategy
from qdrant_client.models import PointStruct

def migrate_to_bge():
    """Migrate from OpenAI embeddings to BGE"""
    # Load new embedding model
    bge_model = SentenceTransformer('BAAI/bge-large-en-v1.5')

    # Create new collection (BGE-large produces 1024-dim vectors)
    client.create_collection(
        collection_name="documents_bge",
        vectors_config=VectorParams(size=1024, distance=Distance.COSINE)
    )

    # Re-embed and upload
    batch_size = 100
    for i in range(0, len(documents), batch_size):
        batch = documents[i:i+batch_size]
        texts = [doc['content'] for doc in batch]

        # Generate new embeddings
        embeddings = bge_model.encode(texts, show_progress_bar=True)

        # Upload to new collection
        client.upsert(
            collection_name="documents_bge",
            points=[
                PointStruct(
                    id=doc['id'],
                    vector=embedding.tolist(),
                    payload=doc['metadata']
                )
                for doc, embedding in zip(batch, embeddings)
            ]
        )

    print("Migration complete!")
Optimization 3: Redis Caching (50% Hit Rate)
The Problem
Many users ask similar questions:
- “What is RAG?”
- “How to fine-tune LLaMA?”
- “Best practices for vector databases?”
We were hitting the database every time, even for identical queries.
The Solution: Semantic Caching
import json
import hashlib
from typing import Optional

import numpy as np
import redis

redis_client = redis.Redis(host='localhost', port=6379, decode_responses=False)

def semantic_cache_key(query_vector: np.ndarray) -> str:
    """Create cache key from embedding"""
    # Round to reduce cache misses from tiny differences
    rounded = np.round(query_vector, decimals=4)
    vector_hash = hashlib.md5(rounded.tobytes()).hexdigest()
    return f"rag:vector:{vector_hash}"

def get_cached_results(query_vector: np.ndarray) -> Optional[list]:
    """Check cache for similar query"""
    cache_key = semantic_cache_key(query_vector)
    cached = redis_client.get(cache_key)
    if cached:
        return json.loads(cached)
    return None

def cache_results(query_vector: np.ndarray, results: list, ttl: int = 3600):
    """Cache search results for 1 hour"""
    cache_key = semantic_cache_key(query_vector)
    redis_client.setex(
        cache_key,
        ttl,
        json.dumps(results)
    )
import logging

logger = logging.getLogger(__name__)

# Usage
def search_with_cache(query: str, top_k: int = 10):
    # Generate embedding
    query_vector = bge_model.encode(query)

    # Check cache
    cached = get_cached_results(query_vector)
    if cached:
        logger.info("Cache hit!")
        return cached

    # Cache miss - search database
    hits = client.search(
        collection_name="documents_bge",
        query_vector=query_vector.tolist(),
        limit=top_k
    )

    # Convert ScoredPoint objects to plain dicts so they can be JSON-cached
    results = [{"id": h.id, "score": h.score, "payload": h.payload} for h in hits]

    # Cache results
    cache_results(query_vector, results)
    return results
Cache Performance
After 1 week of production traffic:
Total queries: 50,234
Cache hits: 24,891 (49.5%)
Cache misses: 25,343 (50.5%)
Average latency:
- Cache hit: 12ms
- Cache miss: 630ms
Effective latency: (0.495 × 12ms) + (0.505 × 630ms) = 324ms
A ~50% cache hit rate cut effective search latency from 630ms to 324ms, roughly a 49% reduction!
Advanced: Approximate Matching
For even better hit rates, use approximate matching:
from scipy.spatial.distance import cosine

def find_similar_cached(query_vector: np.ndarray, threshold: float = 0.95):
    """Find a cached entry whose query vector is close enough to this one"""
    # Scan recent cache keys; assumes each entry's vector is stored under "<key>:vec"
    # (see the snippet below). KEYS is O(N) -- prefer SCAN in production.
    recent_keys = redis_client.keys("rag:vector:*")[:1000]
    for key in recent_keys:
        if key.endswith(b":vec"):
            continue
        raw = redis_client.get(key.decode() + ":vec")
        if raw is None:
            continue
        cached_vector = np.frombuffer(raw, dtype=np.float32)
        similarity = 1 - cosine(query_vector, cached_vector)
        if similarity >= threshold:
            return json.loads(redis_client.get(key))
    return None
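For that lookup to work, the raw query vector has to be written alongside the cached results. A small companion to cache_results that stores it under the ":vec" suffix the function above reads (the helper name is mine):

```python
def cache_results_with_vector(query_vector: np.ndarray, results: list, ttl: int = 3600):
    """Cache results plus the raw vector so approximate matching can scan it later."""
    cache_key = semantic_cache_key(query_vector)
    pipe = redis_client.pipeline()
    pipe.setex(cache_key, ttl, json.dumps(results))
    # Store as float32 bytes; find_similar_cached reads it back with np.frombuffer
    pipe.setex(f"{cache_key}:vec", ttl, query_vector.astype(np.float32).tobytes())
    pipe.execute()
```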
Optimization 4: Connection Pooling
The Problem
Creating a new Qdrant connection per query:
def search(query):
    client = QdrantClient(url="http://qdrant:6333")  # ❌ New connection
    results = client.search(...)
    return results
Overhead: 50-100ms per connection.
The Solution: Reuse Connections
from qdrant_client import QdrantClient
from functools import lru_cache

@lru_cache(maxsize=1)
def get_qdrant_client() -> QdrantClient:
    """Singleton Qdrant client"""
    return QdrantClient(
        url="http://qdrant:6333",
        timeout=30,
        prefer_grpc=True  # Faster than HTTP
    )

# Usage
client = get_qdrant_client()  # Reused across requests
results = client.search(...)
Benefit: -50ms per query
Optimization 5: Parallel Queries
The Problem
Sequential search for multiple queries:
def search_multiple(queries: list[str]):
    results = []
    for query in queries:
        result = search(query)  # ❌ Sequential
        results.append(result)
    return results

# Takes: len(queries) × 630ms
The Solution: Async Parallel
import asyncio
from qdrant_client import AsyncQdrantClient

async_client = AsyncQdrantClient(url="http://qdrant:6333")

async def search_async(query: str):
    query_vector = bge_model.encode(query)
    return await async_client.search(
        collection_name="documents_bge",
        query_vector=query_vector.tolist(),
        limit=10
    )

async def search_multiple_parallel(queries: list[str]):
    tasks = [search_async(q) for q in queries]
    results = await asyncio.gather(*tasks)
    return results

# Usage
results = asyncio.run(search_multiple_parallel(queries))
# Takes: max(query times) ≈ 630ms regardless of count!
Benefit: 5 queries in 630ms instead of 3150ms
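One caveat: an unbounded gather opens one concurrent search per query, so a large batch can overwhelm the vector database. A semaphore caps the fan-out while keeping most of the parallel win (the limit of 10 is arbitrary; tune it to your Qdrant capacity):

```python
semaphore = asyncio.Semaphore(10)  # max concurrent searches

async def search_async_bounded(query: str):
    async with semaphore:
        return await search_async(query)

async def search_multiple_bounded(queries: list[str]):
    return await asyncio.gather(*(search_async_bounded(q) for q in queries))
```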
Optimization 6: Metadata Filtering
The Problem
Searching all 2M vectors when you only need recent documents:
# Slow: Search everything
results = client.search(
    collection_name="documents_bge",
    query_vector=query_vector,
    limit=10
)
The Solution: Pre-filter by Metadata
from qdrant_client.models import Filter, FieldCondition, MatchValue

# Fast: Filter before vector search
results = client.search(
    collection_name="documents_bge",
    query_vector=query_vector,
    query_filter=Filter(
        must=[
            FieldCondition(
                key="year",
                match=MatchValue(value=2024)
            ),
            FieldCondition(
                key="category",
                match=MatchValue(value="medical")
            )
        ]
    ),
    limit=10
)
Impact:
- Search space: 2M → 150K vectors
- Latency: 630ms → 280ms
- Accuracy: Unchanged (the filter only removes documents that would be irrelevant to the query anyway)
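One prerequisite: Qdrant filters far more efficiently on payload fields that have a payload index, and creating those indexes is an explicit step. A sketch for the two fields used above:

```python
from qdrant_client.models import PayloadSchemaType

# Index the fields used in filters so Qdrant can prune candidates cheaply
client.create_payload_index(
    collection_name="documents_bge",
    field_name="year",
    field_schema=PayloadSchemaType.INTEGER,
)
client.create_payload_index(
    collection_name="documents_bge",
    field_name="category",
    field_schema=PayloadSchemaType.KEYWORD,
)
```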
Optimization 7: Batch Indexing
The Problem
Inserting documents one at a time:
for doc in documents:
    client.upsert(
        collection_name="documents_bge",
        points=[{
            "id": doc['id'],
            "vector": embed(doc['text']),
            "payload": doc['metadata']
        }]
    )

# Takes: 20 minutes for 10K docs
The Solution: Batch Upload
from tqdm import tqdm
from qdrant_client.models import PointStruct

def batch_index(documents: list, batch_size: int = 100):
    """Efficient batch indexing"""
    for i in tqdm(range(0, len(documents), batch_size)):
        batch = documents[i:i+batch_size]

        # Batch embed
        texts = [doc['text'] for doc in batch]
        embeddings = bge_model.encode(texts, batch_size=32)

        # Batch upsert
        points = [
            PointStruct(
                id=doc['id'],
                vector=embedding.tolist(),
                payload=doc['metadata']
            )
            for doc, embedding in zip(batch, embeddings)
        ]
        client.upsert(
            collection_name="documents_bge",
            points=points,
            wait=False  # Don't wait for confirmation
        )

# Takes: 3 minutes for 10K docs (6.7x faster!)
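A complementary trick from Qdrant's bulk-upload guidance: pause index building while the data loads, then re-enable it so the HNSW graph is built once at the end. A sketch, assuming a recent qdrant-client where update_collection accepts optimizer_config:

```python
from qdrant_client.models import OptimizersConfigDiff

# Pause indexing during the bulk load
client.update_collection(
    collection_name="documents_bge",
    optimizer_config=OptimizersConfigDiff(indexing_threshold=0),
)

batch_index(documents)  # the batch upsert from above

# Restore the threshold; Qdrant then builds the index once, in the background
client.update_collection(
    collection_name="documents_bge",
    optimizer_config=OptimizersConfigDiff(indexing_threshold=20000),
)
```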
Optimization 8: Monitoring & Profiling
What to Monitor
from prometheus_client import Histogram, Counter
import time

# Metrics
search_latency = Histogram(
    'vector_search_latency_seconds',
    'Vector search latency',
    buckets=[0.1, 0.25, 0.5, 1.0, 2.5, 5.0]
)
cache_hits = Counter('cache_hits_total', 'Cache hits')
cache_misses = Counter('cache_misses_total', 'Cache misses')

def monitored_search(query: str):
    start = time.time()
    query_vector = bge_model.encode(query)

    # Check cache (keyed by embedding, as above)
    if cached := get_cached_results(query_vector):
        cache_hits.inc()
        search_latency.observe(time.time() - start)
        return cached

    cache_misses.inc()

    # Search
    results = search(query)
    search_latency.observe(time.time() - start)
    return results
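For Prometheus to scrape these counters and histograms, the process has to expose a metrics endpoint; prometheus_client can serve one in-process (the port is arbitrary):

```python
from prometheus_client import start_http_server

# Expose /metrics for Prometheus to scrape; call once at service startup
start_http_server(8001)
```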
Grafana Dashboard
Key metrics to track:
- P50, P95, P99 latency - Understand tail latency
- Cache hit rate - Optimize cache strategy
- Query volume - Plan capacity
- Error rate - Catch issues early
- Vector DB memory - Know when to scale
Final Architecture
After all optimizations:
User Query (50ms)
↓
Embedding Generation (80ms) ⬅️ BGE faster than OpenAI
↓
Redis Cache Check (12ms) ⬅️ 50% hit rate
↓ (50% miss)
Metadata Filtering (20ms) ⬅️ Reduce search space
↓
HNSW Vector Search (280ms) ⬅️ Optimized index + smaller dims
↓
LLM Generation (350ms)
───────────────────────────────
Cache Hit: 442ms (79% faster)
Cache Miss: 850ms (60% faster)
Effective: 646ms (69% faster)
Performance Summary
| Optimization | Latency Reduction | Cost Impact |
|---|---|---|
| HNSW Indexing | -600ms (40%) | +10% memory |
| BGE Embeddings | -270ms (30%) | $0 (was $130/M) |
| Redis Caching | -306ms (at ~50% hit rate) | +$20/month |
| Connection Pooling | -50ms | $0 |
| Metadata Filtering | -350ms | $0 |
| Total | -1250ms (60%) | Saves $110/month |

The per-optimization savings overlap and compound on an already-reduced baseline, so they don't sum to the end-to-end 1250ms figure.
Code: Complete Optimized System
import hashlib
import json
from functools import lru_cache

import redis
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue, SearchParams
from sentence_transformers import SentenceTransformer

# Initialize once
@lru_cache(maxsize=1)
def get_clients():
    qdrant = QdrantClient(url="http://qdrant:6333", prefer_grpc=True)
    redis_client = redis.Redis(host='localhost', port=6379)
    model = SentenceTransformer('BAAI/bge-large-en-v1.5')
    return qdrant, redis_client, model

def optimized_search(
    query: str,
    top_k: int = 10,
    filters: dict = None,
    use_cache: bool = True
):
    """Production-optimized vector search"""
    qdrant, redis_client, model = get_clients()

    # 1. Generate embedding (80ms)
    query_vector = model.encode(query)

    # 2. Check cache (12ms)
    if use_cache:
        cache_key = f"rag:{hashlib.md5(query.encode()).hexdigest()}"
        if cached := redis_client.get(cache_key):
            return json.loads(cached)

    # 3. Build filters
    query_filter = None
    if filters:
        query_filter = Filter(must=[
            FieldCondition(key=k, match=MatchValue(value=v))
            for k, v in filters.items()
        ])

    # 4. Vector search with HNSW (280ms)
    hits = qdrant.search(
        collection_name="documents_bge",
        query_vector=query_vector.tolist(),
        query_filter=query_filter,
        limit=top_k,
        search_params=SearchParams(hnsw_ef=128)
    )

    # Convert ScoredPoint objects to plain dicts so they can be JSON-cached
    results = [{"id": h.id, "score": h.score, "payload": h.payload} for h in hits]

    # 5. Cache results
    if use_cache:
        redis_client.setex(cache_key, 3600, json.dumps(results))

    return results
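A quick usage example (the query text and the payload field are illustrative):

```python
results = optimized_search(
    "treatment guidelines for hypertension",
    top_k=5,
    filters={"year": 2024, "category": "medical"},
)
for r in results:
    print(r["score"], r["payload"].get("title"))  # 'title' is an example payload field
```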
Lessons Learned
What Worked
- HNSW indexing - Single biggest win (40% faster)
- Caching - 50% hit rate = massive savings
- Smaller embeddings - BGE's 1024 dims vs OpenAI's 1536
- Monitoring - Can’t optimize what you don’t measure
What Didn’t Work
- Quantization - 8-bit vectors hurt accuracy too much (see the config sketch after this list)
- Aggressive caching - TTL > 1 hour caused stale results
- Very large m (64) - Minimal accuracy gain, 2x memory
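For reference, the int8 experiment looked roughly like this: scalar quantization configured on a test collection at creation time (the collection name is hypothetical, and the quantile and always_ram values shown are typical choices rather than our exact settings). It cut vector memory substantially, but the accuracy drop wasn't worth it for our workload.

```python
from qdrant_client.models import (
    Distance, VectorParams, ScalarQuantization, ScalarQuantizationConfig, ScalarType
)

# Test collection with int8 scalar quantization enabled
client.create_collection(
    collection_name="documents_int8_test",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(
            type=ScalarType.INT8,
            quantile=0.99,    # clip outliers before quantizing
            always_ram=True,  # keep quantized vectors in RAM
        )
    ),
)
```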
Mistakes to Avoid
- ❌ Optimizing without profiling - Wasted time
- ❌ Over-caching - Stale data issues
- ❌ Ignoring tail latency - P99 matters
- ❌ No monitoring - Flying blind
Tools & Technologies
- Qdrant - Vector database
- Redis - Caching layer
- BGE-large - Embedding model
- Prometheus - Metrics
- Grafana - Dashboards
Conclusion
Reducing end-to-end query latency from 2.1s to 850ms (a 60% improvement) required systematic optimization:
- HNSW indexing for logarithmic search
- Smaller embeddings (BGE 1024 dims vs OpenAI 1536)
- Redis caching (50% hit rate)
- Connection pooling and metadata filtering
- Continuous monitoring and profiling
These techniques are battle-tested in production at CleverFlow, serving 50K+ queries daily with 99.5% uptime.
Start with profiling, optimize the bottleneck, measure impact, repeat.
Resources
- Qdrant Docs: qdrant.tech/documentation
- BGE Model: HuggingFace
- Redis: redis.io
Connect
Optimizing RAG systems? Let’s chat:
- GitHub: github.com/huzaifa525
- LinkedIn: linkedin.com/in/huzefanalkheda
What’s your biggest RAG performance challenge? Share in the comments!