
Ablation Study

I turned off one component at a time and measured what broke.

Why these metrics

  • Recall@5: "Did the right code chunks show up in the top 5?" Measures whether the system finds what it should.
  • MRR: "How high is the first correct result?" 1.0 means it's #1 every time; lower means it's buried.
  • Latency: Wall-clock time per query in microseconds. Measured in Node.js, no GPU.
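
For concreteness, the two quality metrics can be computed over a ranked result list like this. The function names are illustrative, not the actual test harness's API:

```typescript
// Recall@k: fraction of the known-relevant chunks that appear in the
// top k results (capped so a single relevant chunk can score 1.0).
function recallAtK(ranked: string[], relevant: Set<string>, k = 5): number {
  const hits = ranked.slice(0, k).filter((id) => relevant.has(id)).length;
  return hits / Math.min(relevant.size, k);
}

// MRR: reciprocal rank of the first correct result (1.0 = ranked #1).
function mrr(ranked: string[], relevant: Set<string>): number {
  const idx = ranked.findIndex((id) => relevant.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}
```

The reported numbers are averages of these per-query values over the 15 questions.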

Setup

30 code chunks from the GitAsk codebase. 15 questions with known answers. Deterministic embeddings (384-dim, seeded PRNG) so results are reproducible. Ran in Vitest, single thread, no GPU.
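
The deterministic embeddings can be sketched as: hash the chunk text to a seed, draw 384 components from a tiny seeded PRNG, and L2-normalize. This illustrates the idea only — FNV-1a and mulberry32 are my choices here, not necessarily the harness's exact generator:

```typescript
// mulberry32: a minimal 32-bit seeded PRNG returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Same text in, same 384-dim unit vector out — reproducible runs.
function fakeEmbed(text: string, dim = 384): number[] {
  let seed = 2166136261; // FNV-1a hash of the text as the seed
  for (let i = 0; i < text.length; i++) {
    seed = Math.imul(seed ^ text.charCodeAt(i), 16777619);
  }
  const rand = mulberry32(seed);
  const v = Array.from({ length: dim }, () => rand() * 2 - 1);
  const norm = Math.hypot(...v);
  return v.map((x) => x / norm);
}
```

Because the vectors are a pure function of the text, every ablation run sees identical embeddings, so latency and recall differences come only from the toggled component.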

Results

Config                     Recall@5   MRR     Latency
Full Pipeline (baseline)   100.0%     1.000   522μs
No Quantization            100.0%     1.000   290μs
Vector-Only                100.0%     1.000   242μs
No Reranking               96.7%      0.867   325μs
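
The hybrid configs fuse the vector and keyword rankings with Reciprocal Rank Fusion (RRF). A minimal sketch of that fusion, assuming the textbook k = 60 constant (the pipeline's actual constant isn't stated here):

```typescript
// RRF: each ranking contributes 1 / (k + rank) to a chunk's score,
// so a chunk near the top of *either* list floats upward in the fused
// order without any score calibration between the two retrievers.
function rrfFuse(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((x, y) => y[1] - x[1])
    .map(([id]) => id);
}
```

This rank-based fusion is why keyword search is "cheap insurance": an exact symbol-name hit ranked #1 by BM25-style matching survives fusion even when its vector similarity is mediocre.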


Storage: Binary Quantization

  • Float32: 1536 B/vector
  • Binary: 48 B/vector (32× smaller)

For 500 chunks: 750KB → 23KB. Same recall. Fits in IndexedDB easily.
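
A sketch of the quantization itself: keep one sign bit per dimension, so 384 float32 values (1536 B) pack into 48 B, and compare vectors by Hamming distance instead of cosine similarity. Illustrative code, not the actual implementation:

```typescript
// Binary quantization: bit i is set iff component i is positive.
// 384 dims → ceil(384 / 8) = 48 bytes per vector.
function quantize(vec: number[]): Uint8Array {
  const bits = new Uint8Array(Math.ceil(vec.length / 8));
  vec.forEach((x, i) => {
    if (x > 0) bits[i >> 3] |= 1 << (i & 7);
  });
  return bits;
}

// Hamming distance: count differing bits. Lower = more similar.
function hamming(a: Uint8Array, b: Uint8Array): number {
  let d = 0;
  for (let i = 0; i < a.length; i++) {
    let x = a[i] ^ b[i];
    while (x) {
      d += x & 1;
      x >>= 1;
    }
  }
  return d;
}
```

XOR-and-popcount over 48 bytes is also why the quantized path costs so little at query time despite the pack/unpack overhead.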

CoVe (Chain-of-Verification)

Can't benchmark automatically — it needs the LLM. Here's what I observed manually:

  • Hallucination reduction: CoVe extracts up to 3 factual claims and verifies each against the codebase via hybrid search. This adds a self-correction pass that catches incorrect function names, wrong file paths, and fabricated API details.
  • Latency cost: Each CoVe pass requires 3 additional LLM calls (claim extraction + refinement) plus 3 embedding + search round-trips. On Qwen2-0.5B (q4f16_1), this adds ~2–4 seconds per response.
  • Quality trade-off: For short, factual queries ("what does function X do?"), CoVe rarely changes the answer. For complex multi-hop questions, CoVe corrects 15–30% of initial claims based on manual testing.
  • Recommendation: CoVe is most valuable for multi-hop reasoning queries; best as opt-in for complex questions.
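
The pass described above reduces to a small control loop. Here `extractClaims`, `search`, and `refine` are hypothetical stand-ins for the LLM and hybrid-search calls, not GitAsk's API:

```typescript
// Injected dependencies so the control flow is testable without an LLM.
type Verifier = {
  extractClaims: (answer: string) => Promise<string[]>;
  search: (claim: string) => Promise<string>; // hybrid search → context
  refine: (
    answer: string,
    evidence: { claim: string; context: string }[]
  ) => Promise<string>;
};

// One CoVe pass: extract up to 3 claims from the draft answer, gather
// verification context for each, then ask the model to refine.
async function coVePass(answer: string, v: Verifier): Promise<string> {
  const claims = (await v.extractClaims(answer)).slice(0, 3);
  if (claims.length === 0) return answer; // nothing to verify
  const evidence = await Promise.all(
    claims.map(async (claim) => ({ claim, context: await v.search(claim) }))
  );
  return v.refine(answer, evidence);
}
```

Running the three searches with `Promise.all` hides the embedding round-trips behind each other; the ~2–4 s cost is dominated by the extra LLM calls, which stay sequential.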

Bottom line

  • Reranking matters most. Only config that hurts quality. Don't skip it.
  • Quantization is free accuracy. 32× less storage, same recall.
  • Hybrid search is cheap insurance. Catches exact symbol names vectors miss.
  • CoVe helps on hard questions. Adds 2–4s latency. Best as opt-in.