HNSW Parameter Tuning: What I Learned Benchmarking My Vector Search Engine
Why I benchmarked my own engine
I built VectorVault — a vector search engine from scratch in C++20 — because I wanted to understand how vector databases actually work under the hood. Not just the API, but the algorithms, the memory access patterns, the SIMD optimizations, the persistence layer.
Once the HNSW implementation was working, I needed to answer the practical question every vector search deployment faces: what parameters do you actually need?
The HNSW paper (Malkov & Yashunin, 2018) provides theoretical guidance, but the parameter space is large and the interaction between M, efConstruction, efSearch, dimensionality, and dataset size is hard to reason about without concrete numbers.
So I measured everything.
Setup
Engine: VectorVault — HNSW with AVX2 SIMD-accelerated distance computation (L2), memory-mapped persistence, thread-safe concurrent reads.
Datasets: Synthetic random vectors (uniform distribution) at various scales. While not as representative of real-world data distributions as SIFT or GloVe, uniform random vectors provide a controlled baseline where results depend purely on the algorithm's graph structure rather than dataset-specific locality.
Baseline: Exact brute-force k-NN search (linear scan, same SIMD distance function). All recall numbers are measured against this ground truth.
Metric: Recall@10 — the fraction of true top-10 nearest neighbors found by the approximate search.
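The measurement loop is simple enough to sketch end to end. This is a hypothetical standalone harness, not VectorVault's internals, assuming squared-L2 distance (same ranking as L2):

```cpp
#include <algorithm>
#include <cstddef>
#include <unordered_set>
#include <utility>
#include <vector>

// Exact top-k by linear scan: the ground truth every recall number is measured against.
std::vector<size_t> brute_force_topk(const std::vector<std::vector<float>>& db,
                                     const std::vector<float>& q, size_t k) {
    std::vector<std::pair<float, size_t>> scored;
    scored.reserve(db.size());
    for (size_t i = 0; i < db.size(); ++i) {
        float d = 0.f;
        for (size_t j = 0; j < q.size(); ++j) {
            float t = db[i][j] - q[j];
            d += t * t;  // squared L2: monotonic in L2, so ranking is identical
        }
        scored.emplace_back(d, i);
    }
    std::partial_sort(scored.begin(), scored.begin() + k, scored.end());
    std::vector<size_t> ids(k);
    for (size_t i = 0; i < k; ++i) ids[i] = scored[i].second;
    return ids;
}

// Recall@k: fraction of the true top-k that the approximate search returned.
double recall_at_k(const std::vector<size_t>& truth,
                   const std::vector<size_t>& approx) {
    std::unordered_set<size_t> t(truth.begin(), truth.end());
    size_t hits = 0;
    for (size_t id : approx) hits += t.count(id);
    return static_cast<double>(hits) / truth.size();
}
```

Averaging `recall_at_k` over a few hundred queries against `brute_force_topk` output gives the Recall@10 figures in the tables below.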
The three knobs
HNSW has three parameters that matter. Everything else is secondary.
| Parameter | What it does | When it applies |
|---|---|---|
| M | Max bidirectional connections per node in the graph | Build time (fixed after index construction) |
| efConstruction | Search width during graph construction | Build time (higher = better graph, slower build) |
| efSearch | Search width during queries | Query time (the only knob you tune at runtime) |
VectorVault’s defaults: M=16, efConstruction=200, efSearch=50. Here’s how I arrived at those.
Benchmark 1: efSearch — the runtime recall/speed tradeoff
This is the parameter you tune in production. Everything else is baked into the index at build time.
10K vectors, 384 dimensions:
| efSearch | P50 Latency | P95 Latency | QPS | Recall@10 | vs Brute Force |
|---|---|---|---|---|---|
| 10 | 0.18ms | 0.32ms | ~5,500 | 87.3% | 45x faster |
| 50 | 0.42ms | 0.78ms | ~2,400 | 96.8% | 19x faster |
| 100 | 0.71ms | 1.24ms | ~1,400 | 99.2% | 11x faster |
| Brute force | 8.2ms | 9.1ms | ~122 | 100% | baseline |
100K vectors, 768 dimensions:
| efSearch | P50 Latency | P95 Latency | P99 Latency | QPS | Recall@10 |
|---|---|---|---|---|---|
| 10 | 0.21ms | 0.45ms | 0.68ms | 4,761 | 87% |
| 50 | 0.54ms | 1.12ms | 1.58ms | 1,852 | 97% |
| 100 | 0.89ms | 1.76ms | 2.34ms | 1,124 | 99% |
| 200 | 1.45ms | 2.89ms | 3.67ms | 690 | 99.7% |
Key observations:
- ef=50 is the production sweet spot. 97% recall with sub-millisecond P50 latency, even at 100K vectors in 768 dimensions. You sacrifice 3% recall to get 19x the throughput of brute force.
- Diminishing returns past ef=100. Going from ef=50 to ef=100 buys you 2% more recall but cuts QPS by 40%. Going from ef=100 to ef=200 buys 0.7% more recall and cuts QPS by another 40%. The cost per percentage point of recall climbs steeply with each step.
- P95 and P99 latency grow roughly 1.6x with each doubling of ef (P99 goes from 1.58ms at ef=50 to 2.34ms at ef=100 to 3.67ms at ef=200). This is important for SLA-driven systems: with a 2ms P99 budget at this scale, ef=50 is the highest setting that fits, since ef=100 already overshoots at 2.34ms.
- Higher dimensions barely affect latency at the same ef. At ef=50, the P50 jumps from 0.42ms (384d) to 0.54ms (768d) despite 2x the dimension. This is because HNSW's search cost is dominated by graph traversal, not distance computation — especially with AVX2 processing 8 floats per instruction.
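One way to operationalize the latency observation: given a P99 budget, pick the largest efSearch whose measured P99 fits. A sketch using the 100K-vector, 768d numbers from the table above (these points are specific to my hardware; re-profile on yours):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Measured (efSearch, P99 ms) pairs from the 100K x 768d benchmark run.
const std::vector<std::pair<size_t, double>> kP99Profile = {
    {10, 0.68}, {50, 1.58}, {100, 2.34}, {200, 3.67}};

// Largest efSearch whose measured P99 stays within the budget; 0 if none fits.
size_t max_ef_within_p99(double budget_ms) {
    size_t best = 0;
    for (const auto& [ef, p99] : kP99Profile)
        if (p99 <= budget_ms) best = ef;
    return best;
}
```

With a 2ms P99 budget this selects ef=50, matching the observation above.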
Benchmark 2: build performance
Index construction is where M and efConstruction interact. Higher values produce a better graph but take longer to build.
Fixed parameters: M=16, efConstruction=200
| Vectors | Dimensions | Build Time | Throughput | Peak Memory |
|---|---|---|---|---|
| 100K | 384 | 8.2s | 12,200 vec/s | ~850 MB |
| 100K | 768 | 14.1s | 7,100 vec/s | ~1.5 GB |
| 1M | 384 | 98s | 10,200 vec/s | ~8.2 GB |
Key observations:
- Insert throughput declines as the dataset grows. 100K inserts at 12,200/s, but 1M inserts drop to 10,200/s. This is expected — later insertions traverse deeper graphs.
- Memory scales linearly with vectors x dimensions. At 768d, each vector's raw payload is 3KB (768 floats x 4 bytes). Add the graph structure (neighbor lists, ~M x 2 x 4 bytes per node) plus build-time overhead, and peak memory lands at roughly 20 bytes per dimension per vector at M=16.
- Doubling dimensions roughly doubles build time and memory. The relationship is nearly linear because distance computation (the inner loop of HNSW construction) is O(d), and memory per vector is O(d + M).
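The memory claim can be turned into a back-of-envelope model. This counts only the steady-state payload: raw floats plus layer-0 neighbor ids, assuming 4-byte ids and HNSW's usual 2M links at the bottom layer (upper layers hold only a small fraction of nodes, roughly 1/M, so they add little). The peak numbers in the table run several times higher because of build-time structures:

```cpp
#include <cstddef>
#include <cstdint>

// Steady-state bytes per vector: d float components + ~2*M layer-0 neighbor ids.
// This deliberately ignores upper layers and build-time scratch memory.
size_t steady_bytes_per_vector(size_t d, size_t M) {
    const size_t vector_bytes   = d * sizeof(float);        // 4 bytes per component
    const size_t neighbor_bytes = 2 * M * sizeof(uint32_t); // layer-0 links
    return vector_bytes + neighbor_bytes;
}
```

At 768d and M=16 this gives 3,200 bytes per vector, which is why the graph links themselves are cheap relative to the vectors: the gap up to the measured peak is construction overhead, not the index payload.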
Benchmark 3: the M parameter
M controls graph connectivity. Higher M means each node connects to more neighbors, which improves recall but costs memory and build time.
I tested M at 8, 16, and 32 with efConstruction=200 on 100K vectors, 384 dimensions, querying at ef=100:
| M | Recall@10 | Peak Memory | Build Time | Notes |
|---|---|---|---|---|
| 8 | 95.1% | ~520 MB | 5.4s | Lean, adequate for exploratory search |
| 16 | 99.2% | ~850 MB | 8.2s | Default. Best recall-per-byte |
| 32 | 99.8% | ~1.5 GB | 15.7s | Diminishing returns |
Why M=16 is the default:
- Going from M=8 to M=16 gains 4.1% recall for 1.6x the memory. Worth it.
- Going from M=16 to M=32 gains 0.6% recall for 1.8x the memory. Rarely worth it.
- The HNSW paper recommends M in the range [12, 48], with M=16 as a good general default. My benchmarks confirm this.
The exception: if your recall requirement is literally 99.9%+, you need M=32 or higher. But at that point, question whether approximate search is the right tool — exact search with smart pruning (like quantization + reranking) may serve you better.
What I chose for VectorVault defaults
Based on these benchmarks, VectorVault ships with:
M = 16 -- best recall-per-byte
efConstruction = 200 -- good graph quality without excessive build time
efSearch = 50 -- 97% recall, sub-millisecond latency
These are intentionally conservative. A user who needs higher recall can bump efSearch at query time without rebuilding the index. A user who’s memory-constrained can drop to M=8 and accept 95% recall.
The key design principle: make the cheap knob (efSearch) the one users tune at runtime, and set the expensive knobs (M, efConstruction) to values that rarely need changing.
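That principle falls naturally out of the type signatures. A hypothetical interface sketch (illustrative names, not VectorVault's actual API): the expensive knobs live in the index constructor, and efSearch rides along with each query:

```cpp
#include <cstddef>

// Fixed at construction: changing these means rebuilding the index.
struct BuildParams {
    size_t M = 16;                // graph connectivity
    size_t ef_construction = 200; // search width while inserting
};

// Free to vary per query: the runtime recall/latency knob.
struct SearchParams {
    size_t k = 10;
    size_t ef_search = 50;        // should be >= k
};

class Index {
public:
    explicit Index(BuildParams p) : build_(p) {}
    // A real engine would expose: search(query, SearchParams) -> top-k ids.
    const BuildParams& build_params() const { return build_; }
private:
    BuildParams build_;
};
```

Separating the structs makes the cost model visible in the API itself: anything in BuildParams implies a rebuild, anything in SearchParams is free to change per request.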
Practical recommendations
For anyone deploying HNSW in production, here’s what these numbers translate to:
RAG pipelines (semantic search over documents):
- M=16, efC=200, efSearch=50-100
- 97-99% recall is fine — you’re retrieving context, not exact matches
- Optimize for latency (lower ef) if search is in the hot path of user requests
Recommendation systems:
- M=16, efC=200, efSearch=100-200
- Higher recall matters when surfacing content to users
- Latency budget is usually more generous (100-500ms acceptable)
Duplicate detection / deduplication:
- M=32, efC=400, efSearch=200+
- Missing a near-duplicate is worse than being slow
- Consider exact search if dataset fits in memory
Exploratory / research workloads:
- M=8, efC=100, efSearch=10-20
- Speed over precision
- Good for rapid prototyping, dataset exploration
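The four profiles above condense into a small preset table. A sketch (the enum and function names are mine; the parameter values are the recommendations listed above, intended as starting points rather than final answers):

```cpp
#include <cstddef>

struct Preset {
    size_t M;
    size_t ef_construction;
    size_t ef_search;
};

enum class Workload { RAG, Recommendations, Deduplication, Exploratory };

// Starting points per workload; tune ef_search against your latency budget.
Preset preset_for(Workload w) {
    switch (w) {
        case Workload::RAG:             return {16, 200, 50};
        case Workload::Recommendations: return {16, 200, 100};
        case Workload::Deduplication:   return {32, 400, 200};
        case Workload::Exploratory:     return {8, 100, 20};
    }
    return {16, 200, 50};  // unreachable; default matches RAG
}
```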
What I’d do next
These benchmarks use synthetic data. The natural extension is running against standard benchmarks (SIFT-1M, GloVe-100, Deep-1M from ann-benchmarks) to see how data distribution affects the recall/latency curve. Real data clusters differently than uniform random vectors, which changes the graph structure and navigation patterns.
I’d also like to benchmark the impact of AVX2 vs scalar distance computation in isolation, controlling for graph traversal cost. My current numbers include both — separating them would show exactly how much SIMD contributes at different scales.
The full benchmark suite and VectorVault source code are at github.com/oneKn8/VectorVault.