Vector Database Optimization: Reducing RAG Query Latency by 60%

Practical techniques for optimizing vector database performance in production RAG systems, covering indexing strategies, caching, and monitoring that reduced our query latency from 2.1s to 850ms.

When we launched our RAG system at CleverFlow, average query latency was 2.1 seconds. Users complained it felt slow. After implementing the optimizations I’ll share in this post, we brought it down to 850ms, a 60% reduction.

Here’s exactly how we did it, with code examples and real metrics.

The Performance Problem

Our initial RAG system architecture:

User Query (50ms)
    ↓
Embedding Generation (200ms)
    ↓
Vector Search (1500ms)  ⬅️ BOTTLENECK
    ↓
LLM Generation (350ms)
───────────────────────
Total: 2.1 seconds

Vector search was taking 71% of total time.

Why Vector Search Was Slow

After profiling, we found 4 major issues:

  1. No indexing - Brute force search across 2M vectors
  2. No caching - Repeated queries hitting database
  3. Large embeddings - 1536 dimensions (OpenAI embeddings)
  4. No connection pooling - New connection per query
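
These numbers came from a simple timing pass rather than a fancy profiler: a perf_counter context manager around each pipeline stage. A minimal sketch (the stage names are illustrative):

import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def timed(stage: str):
    """Record wall-clock milliseconds for one pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = (time.perf_counter() - start) * 1000

# Usage around each stage of the request handler:
# with timed("vector_search"):
#     results = client.search(...)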

Let’s fix each one.

Optimization 1: HNSW Indexing (40% Faster)

The Problem

Our initial Qdrant setup used flat indexing (brute force):

# Slow: brute-force search (HNSW disabled)
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    hnsw_config=HnswConfigDiff(m=0)  # m=0 turns off HNSW graph building
)

Search time: 1500ms across 2M vectors

The Solution: HNSW

HNSW (Hierarchical Navigable Small World) creates a multi-layer graph for logarithmic search time.

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, HnswConfigDiff

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="documents_optimized",
    vectors_config=VectorParams(
        size=1024,  # Reduced from 1536 (explained below)
        distance=Distance.COSINE,
        on_disk=False  # Keep in RAM for speed
    ),
    hnsw_config=HnswConfigDiff(
        m=16,              # Edges per node (balance speed/accuracy)
        ef_construct=100,  # Construction time quality
        full_scan_threshold=10000  # Below this many KB, a full scan is used instead
    )
)

HNSW Parameters Explained

m (number of edges):

  • Higher = more accuracy, more memory
  • Sweet spot: 16-32
  • We use 16 for good balance

ef_construct (build quality):

  • Higher = better index quality, slower build
  • We kept the default of 100

ef (search quality):

from qdrant_client.models import SearchParams

# Search with a custom ef
results = client.search(
    collection_name="documents_optimized",
    query_vector=query_embedding,
    limit=10,
    search_params=SearchParams(hnsw_ef=128)  # Higher = more accurate, slower
)

Results

Configuration         Search Time   Accuracy
Flat indexing         1500ms        100%
HNSW (m=8, ef=64)     420ms         94%
HNSW (m=16, ef=128)   900ms         98%
HNSW (m=32, ef=256)   1200ms        99.5%

Winner: m=16, ef=128 → 900ms (40% faster, 98% accuracy)
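
How we measured: latency is wall-clock time per query, and accuracy is recall@10 against exact (flat) search on a held-out query set. A minimal sketch of the harness (test_vectors and the ground-truth exact_ids are assumptions):

import time
import numpy as np

def benchmark(collection: str, test_vectors: list, exact_ids: list, ef: int, k: int = 10):
    """Mean latency (ms) and recall@k; exact_ids[i] holds the
    ground-truth top-k point IDs from a brute-force search."""
    latencies, recalls = [], []
    for vec, truth in zip(test_vectors, exact_ids):
        start = time.perf_counter()
        hits = client.search(
            collection_name=collection,
            query_vector=vec,
            limit=k,
            search_params=SearchParams(hnsw_ef=ef)
        )
        latencies.append((time.perf_counter() - start) * 1000)
        recalls.append(len({h.id for h in hits} & set(truth)) / k)
    return np.mean(latencies), np.mean(recalls)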

Optimization 2: Dimension Reduction (30% Faster)

The Problem

We were using OpenAI embeddings at 1536 dimensions (text-embedding-3-large, whose native output is 3072 dims, truncated via the dimensions parameter):

# Expensive embeddings
from openai import OpenAI

openai_client = OpenAI()
response = openai_client.embeddings.create(
    model="text-embedding-3-large",
    input=texts,
    dimensions=1536  # 3-large natively returns 3072-dim vectors
)
embeddings = [item.embedding for item in response.data]  # 1536-dim vectors

Issues:

  • More compute for similarity calculation
  • More memory usage
  • Slower search

The Solution: Switch to BGE-large (1024 dims)

from sentence_transformers import SentenceTransformer

# Smaller, faster, free
model = SentenceTransformer('BAAI/bge-large-en-v1.5')
embeddings = model.encode(texts)  # Returns 1024-dim vectors

Dimension Comparison

Model            Dimensions   Search Time   Quality     Cost (per ~1M docs)
OpenAI 3-large   1536         900ms         Excellent   $130
OpenAI 3-small   1536         900ms         Very Good   $20
BGE-large        1024         630ms         Very Good   $0
BGE-base         768          630ms         Good        $0

BGE-large wins: 30% faster search, roughly 98% of the retrieval quality on our internal evals, at $0.

Migration Strategy

def migrate_to_bge():
    """Migrate from OpenAI embeddings to BGE"""
    # Load new embedding model
    bge_model = SentenceTransformer('BAAI/bge-large-en-v1.5')

    # Create new collection
    client.create_collection(
        collection_name="documents_bge",
        vectors_config=VectorParams(size=1024, distance=Distance.COSINE)
    )

    # Re-embed and upload
    batch_size = 100
    for i in range(0, len(documents), batch_size):
        batch = documents[i:i+batch_size]
        texts = [doc['content'] for doc in batch]

        # Generate new embeddings
        embeddings = bge_model.encode(texts, show_progress_bar=True)

        # Upload to new collection
        client.upsert(
            collection_name="documents_bge",
            points=[
                {
                    "id": doc['id'],
                    "vector": embedding.tolist(),
                    "payload": doc['metadata']
                }
                for doc, embedding in zip(batch, embeddings)
            ]
        )

    print("Migration complete!")

Optimization 3: Redis Caching (50% Hit Rate)

The Problem

Many users ask similar questions:

  • “What is RAG?”
  • “How to fine-tune LLaMA?”
  • “Best practices for vector databases?”

We were hitting the database every time, even for identical queries.

The Solution: Semantic Caching

import redis
import hashlib
import json
import logging
import numpy as np
from typing import Optional

logger = logging.getLogger(__name__)

redis_client = redis.Redis(host='localhost', port=6379, decode_responses=False)

def semantic_cache_key(query_vector: np.ndarray) -> str:
    """Create cache key from embedding"""
    # Round to reduce cache misses from tiny differences
    rounded = np.round(query_vector, decimals=4)
    vector_hash = hashlib.md5(rounded.tobytes()).hexdigest()
    return f"rag:vector:{vector_hash}"

def get_cached_results(query_vector: np.ndarray) -> Optional[list]:
    """Check cache for similar query"""
    cache_key = semantic_cache_key(query_vector)
    cached = redis_client.get(cache_key)

    if cached:
        return json.loads(cached)
    return None

def cache_results(query_vector: np.ndarray, results: list, ttl: int = 3600):
    """Cache search results for 1 hour"""
    cache_key = semantic_cache_key(query_vector)
    redis_client.setex(
        cache_key,
        ttl,
        json.dumps(results)
    )

# Usage
def search_with_cache(query: str, top_k: int = 10):
    # Generate embedding
    query_vector = bge_model.encode(query)

    # Check cache
    cached = get_cached_results(query_vector)
    if cached:
        logger.info("Cache hit!")
        return cached

    # Cache miss - search database
    results = client.search(
        collection_name="documents_bge",
        query_vector=query_vector.tolist(),
        limit=top_k
    )

    # ScoredPoint objects aren't JSON-serializable; convert to dicts first
    # (pydantic v2; use .dict() on older qdrant-client versions)
    results = [r.model_dump() for r in results]

    # Cache results
    cache_results(query_vector, results)

    return results

Cache Performance

After 1 week of production traffic:

Total queries: 50,234
Cache hits: 24,891 (49.5%)
Cache misses: 25,343 (50.5%)

Average latency:
- Cache hit: 12ms
- Cache miss: 630ms

Effective latency: (0.495 × 12ms) + (0.505 × 630ms) = 324ms

A ~50% cache hit rate cut effective search latency roughly in half (630ms → 324ms, a 49% reduction).

Advanced: Approximate Matching

For even better hit rates, use approximate matching:

from scipy.spatial.distance import cosine

def find_similar_cached(query_vector: np.ndarray, threshold: float = 0.95):
    """Find a cached result for a semantically similar query.

    Assumes each cached entry also stores its query vector as float32
    bytes under '{key}:vec' (see the writer sketch below).
    """
    # SCAN instead of KEYS so we don't block Redis
    for key in redis_client.scan_iter("rag:vector:*", count=1000):
        key = key.decode()
        if key.endswith(":vec"):
            continue
        raw = redis_client.get(f"{key}:vec")
        if raw is None:
            continue
        cached_vector = np.frombuffer(raw, dtype=np.float32)  # match encode() dtype
        similarity = 1 - cosine(query_vector, cached_vector)
        if similarity >= threshold:
            return json.loads(redis_client.get(key))

    return None
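
For that lookup to find anything, the cache writer also has to store each query vector next to its results. A sketch extending cache_results from above:

def cache_results_with_vector(query_vector: np.ndarray, results: list, ttl: int = 3600):
    """Cache results plus the raw query vector for approximate matching."""
    cache_key = semantic_cache_key(query_vector)
    pipe = redis_client.pipeline()
    pipe.setex(cache_key, ttl, json.dumps(results))
    # float32 bytes so np.frombuffer(..., dtype=np.float32) round-trips cleanly
    pipe.setex(f"{cache_key}:vec", ttl, query_vector.astype(np.float32).tobytes())
    pipe.execute()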

Optimization 4: Connection Pooling

The Problem

Creating a new Qdrant connection per query:

def search(query):
    client = QdrantClient(url="http://qdrant:6333")  # ❌ New connection
    results = client.search(...)
    return results

Overhead: 50-100ms per connection.

The Solution: Reuse Connections

from qdrant_client import QdrantClient
from functools import lru_cache

@lru_cache(maxsize=1)
def get_qdrant_client() -> QdrantClient:
    """Singleton Qdrant client"""
    return QdrantClient(
        url="http://qdrant:6333",
        timeout=30,
        prefer_grpc=True  # Faster than HTTP
    )

# Usage
client = get_qdrant_client()  # Reused across requests
results = client.search(...)

Benefit: -50ms per query

Optimization 5: Parallel Queries

The Problem

Sequential search for multiple queries:

def search_multiple(queries: list[str]):
    results = []
    for query in queries:
        result = search(query)  # ❌ Sequential
        results.append(result)
    return results

# Takes: len(queries) × 630ms

The Solution: Async Parallel

import asyncio
from qdrant_client import AsyncQdrantClient

async_client = AsyncQdrantClient(url="http://qdrant:6333")

async def search_async(query: str):
    # encode() is CPU-bound; run it off the event loop so searches overlap
    query_vector = await asyncio.to_thread(bge_model.encode, query)
    return await async_client.search(
        collection_name="documents_bge",
        query_vector=query_vector.tolist(),
        limit=10
    )

async def search_multiple_parallel(queries: list[str]):
    tasks = [search_async(q) for q in queries]
    results = await asyncio.gather(*tasks)
    return results

# Usage
results = asyncio.run(search_multiple_parallel(queries))

# Takes: max(query times) ≈ 630ms regardless of count!

Benefit: 5 queries in 630ms instead of 3150ms
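
One caveat: an unbounded gather can swamp the database during traffic bursts, so in practice we cap the fan-out with a semaphore. A sketch (the limit of 16 is arbitrary; tune it for your cluster):

semaphore = asyncio.Semaphore(16)  # max concurrent searches

async def search_bounded(query: str):
    async with semaphore:
        return await search_async(query)

async def search_many(queries: list[str]):
    return await asyncio.gather(*(search_bounded(q) for q in queries))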

Optimization 6: Metadata Filtering

The Problem

Searching all 2M vectors when you only need recent documents:

# Slow: Search everything
results = client.search(
    collection_name="documents_bge",
    query_vector=query_vector,
    limit=10
)

The Solution: Pre-filter by Metadata

from qdrant_client.models import Filter, FieldCondition, MatchValue

# Fast: Filter before vector search
results = client.search(
    collection_name="documents_bge",
    query_vector=query_vector,
    query_filter=Filter(
        must=[
            FieldCondition(
                key="year",
                match=MatchValue(value=2024)
            ),
            FieldCondition(
                key="category",
                match=MatchValue(value="medical")
            )
        ]
    ),
    limit=10
)

Impact:

  • Search space: 2M → 150K vectors
  • Latency: 630ms → 280ms
  • Accuracy: unchanged (the filtered-out documents were irrelevant to these queries anyway)
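
One prerequisite that is easy to miss: filtered search is only fast if the payload fields are indexed, which in Qdrant is an explicit step per field:

# Index each filterable payload field once (schema must match the data type)
client.create_payload_index(
    collection_name="documents_bge",
    field_name="year",
    field_schema="integer"
)
client.create_payload_index(
    collection_name="documents_bge",
    field_name="category",
    field_schema="keyword"
)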

Optimization 7: Batch Indexing

The Problem

Inserting documents one at a time:

for doc in documents:
    client.upsert(
        collection_name="documents_bge",
        points=[{
            "id": doc['id'],
            "vector": embed(doc['text']),
            "payload": doc['metadata']
        }]
    )
# Takes: 20 minutes for 10K docs

The Solution: Batch Upload

from tqdm import tqdm

def batch_index(documents: list, batch_size: int = 100):
    """Efficient batch indexing"""
    for i in tqdm(range(0, len(documents), batch_size)):
        batch = documents[i:i+batch_size]

        # Batch embed
        texts = [doc['text'] for doc in batch]
        embeddings = bge_model.encode(texts, batch_size=32)

        # Batch upsert
        points = [
            {
                "id": doc['id'],
                "vector": embedding.tolist(),
                "payload": doc['metadata']
            }
            for doc, embedding in zip(batch, embeddings)
        ]

        client.upsert(
            collection_name="documents_bge",
            points=points,
            wait=False  # Don't wait for confirmation
        )

# Takes: 3 minutes for 10K docs (6.7x faster!)
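
A related trick for large backfills: pause HNSW building during the upload and re-enable it afterwards, so the graph is built once instead of incrementally. A sketch using Qdrant's optimizer config (20000 matches Qdrant's default threshold):

from qdrant_client.models import OptimizersConfigDiff

# Pause index building during the bulk load...
client.update_collection(
    collection_name="documents_bge",
    optimizer_config=OptimizersConfigDiff(indexing_threshold=0)
)

batch_index(documents)

# ...then restore the threshold so HNSW is built in one pass
client.update_collection(
    collection_name="documents_bge",
    optimizer_config=OptimizersConfigDiff(indexing_threshold=20000)
)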

Optimization 8: Monitoring & Profiling

What to Monitor

from prometheus_client import Histogram, Counter
import time

# Metrics
search_latency = Histogram(
    'vector_search_latency_seconds',
    'Vector search latency',
    buckets=[0.1, 0.25, 0.5, 1.0, 2.5, 5.0]
)
cache_hits = Counter('cache_hits_total', 'Cache hits')
cache_misses = Counter('cache_misses_total', 'Cache misses')

def monitored_search(query: str):
    start = time.time()

    # The cache is keyed on the query vector, so embed first
    query_vector = bge_model.encode(query)

    # Check cache
    if (cached := get_cached_results(query_vector)) is not None:
        cache_hits.inc()
        search_latency.observe(time.time() - start)
        return cached

    cache_misses.inc()

    # Cache miss - search and cache, as in search_with_cache above
    results = client.search(
        collection_name="documents_bge",
        query_vector=query_vector.tolist(),
        limit=10
    )
    results = [r.model_dump() for r in results]
    cache_results(query_vector, results)

    search_latency.observe(time.time() - start)
    return results

Grafana Dashboard

Key metrics to track:

  1. P50, P95, P99 latency - Understand tail latency
  2. Cache hit rate - Optimize cache strategy
  3. Query volume - Plan capacity
  4. Error rate - Catch issues early
  5. Vector DB memory - Know when to scale
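
For Prometheus to scrape these, the service also has to expose the metrics above; with prometheus_client that is a single call at startup (port 8000 is an assumption):

from prometheus_client import start_http_server

# Serve /metrics; P50/P95/P99 then come from histogram_quantile() over
# vector_search_latency_seconds_bucket in Grafana
start_http_server(8000)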

Final Architecture

After all optimizations:

User Query (50ms)
    ↓
Embedding Generation (80ms)  ⬅️ BGE faster than OpenAI
    ↓
Redis Cache Check (12ms)     ⬅️ 50% hit rate
    ↓ (50% miss)
Metadata Filtering (20ms)    ⬅️ Reduce search space
    ↓
HNSW Vector Search (280ms)   ⬅️ Optimized index + smaller dims
    ↓
LLM Generation (350ms)
───────────────────────────────
Cache Hit: 442ms (79% faster)
Cache Miss: 850ms (60% faster)
Effective: 646ms (69% faster)

Performance Summary

Optimization         Latency Reduction              Cost Impact
HNSW Indexing        -600ms (40%)                   +10% memory
BGE Embeddings       -270ms (30%)                   $0 (was ~$130/month)
Redis Caching        -306ms effective (~50% hits)   +$20/month
Connection Pooling   -50ms                          $0
Metadata Filtering   -350ms                         $0
Total                -1250ms (60%)                  Saves ~$110/month

Code: Complete Optimized System

from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue, SearchParams
from sentence_transformers import SentenceTransformer
import redis
import hashlib
import json
from functools import lru_cache
from typing import Optional

# Initialize once
@lru_cache(maxsize=1)
def get_clients():
    qdrant = QdrantClient(url="http://qdrant:6333", prefer_grpc=True)
    redis_client = redis.Redis(host='localhost', port=6379)
    model = SentenceTransformer('BAAI/bge-large-en-v1.5')
    return qdrant, redis_client, model

def optimized_search(
    query: str,
    top_k: int = 10,
    filters: Optional[dict] = None,
    use_cache: bool = True
):
    """Production-optimized vector search"""
    qdrant, redis_client, model = get_clients()

    # 1. Generate embedding (80ms)
    query_vector = model.encode(query)

    # 2. Check cache (12ms)
    if use_cache:
        cache_key = f"rag:{hashlib.md5(query.encode()).hexdigest()}"
        if cached := redis_client.get(cache_key):
            return json.loads(cached)

    # 3. Build filters
    query_filter = None
    if filters:
        query_filter = Filter(must=[
            FieldCondition(key=k, match=MatchValue(value=v))
            for k, v in filters.items()
        ])

    # 4. Vector search with HNSW (280ms)
    results = qdrant.search(
        collection_name="documents_bge",
        query_vector=query_vector.tolist(),
        query_filter=query_filter,
        limit=top_k,
        search_params=SearchParams(hnsw_ef=128)
    )

    # ScoredPoint -> dict so results survive json.dumps
    results = [r.model_dump() for r in results]

    # 5. Cache results
    if use_cache:
        redis_client.setex(cache_key, 3600, json.dumps(results))

    return results

Lessons Learned

What Worked

  1. HNSW indexing - Single biggest win (40% faster)
  2. Caching - 50% hit rate = massive savings
  3. Smaller embeddings - BGE-large's 1024 dims vs OpenAI's 1536
  4. Monitoring - Can’t optimize what you don’t measure

What Didn’t Work

  1. Quantization - 8-bit vectors hurt accuracy too much (config sketch after this list)
  2. Aggressive caching - TTL > 1 hour caused stale results
  3. Very large m (64) - Minimal accuracy gain, 2x memory
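
For completeness, the quantization setup we tested (and rolled back) looked roughly like this: Qdrant scalar quantization on a copy of the collection.

from qdrant_client.models import (
    ScalarQuantization, ScalarQuantizationConfig, ScalarType
)

# int8 scalar quantization: ~4x less vector memory, but recall dropped too far
client.create_collection(
    collection_name="documents_bge_int8",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(
            type=ScalarType.INT8,
            quantile=0.99,    # clip outliers before quantizing
            always_ram=True   # keep quantized vectors in RAM
        )
    )
)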

Mistakes to Avoid

  1. ❌ Optimizing without profiling - Wasted time
  2. ❌ Over-caching - Stale data issues
  3. ❌ Ignoring tail latency - P99 matters
  4. ❌ No monitoring - Flying blind

Tools & Technologies

  • Qdrant - Vector database
  • Redis - Caching layer
  • BGE-large - Embedding model
  • Prometheus - Metrics
  • Grafana - Dashboards

Conclusion

Reducing end-to-end query latency from 2.1s to 850ms (a 60% improvement) required systematic optimization:

  1. HNSW indexing for logarithmic search
  2. Smaller embeddings (BGE-large 1024 vs OpenAI 1536)
  3. Redis caching (50% hit rate)
  4. Connection pooling and metadata filtering
  5. Continuous monitoring and profiling

These techniques are battle-tested in production at CleverFlow, serving 50K+ queries daily with 99.5% uptime.

Start with profiling, optimize the bottleneck, measure impact, repeat.

What’s your biggest RAG performance challenge? Share in the comments!


About the Author

Huzefa Nalkheda Wala

AI Product Engineer at CleverFlow specializing in Medical AI, Large Language Models, and RAG systems. Patent holder (Design No. 375474-001) and IIT Ropar graduate. Creator of MedGenius LLaMA medical AI model.