
Ablation Study

I turned off one component at a time and measured what broke.

Why these metrics

  • Recall@5: "Did the right code chunks show up in the top 5?" Measures whether the system finds what it should.
  • MRR: "How high is the first correct result?" 1.0 means it's #1 every time; lower means it's buried.
  • Latency: Wall-clock time per query in microseconds. Measured in Node.js, no GPU.
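
For concreteness, the two quality metrics can be computed over a ranked result list like this. The function names are illustrative, not the actual test harness's API:

```typescript
// Recall@k: fraction of the known-relevant chunks that appear in the
// top k results (capped so a single relevant chunk can score 1.0).
function recallAtK(ranked: string[], relevant: Set<string>, k = 5): number {
  const hits = ranked.slice(0, k).filter((id) => relevant.has(id)).length;
  return hits / Math.min(relevant.size, k);
}

// MRR: reciprocal rank of the first correct result (1.0 = ranked #1).
function mrr(ranked: string[], relevant: Set<string>): number {
  const idx = ranked.findIndex((id) => relevant.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}
```

The reported numbers are averages of these per-query values over the 15 questions.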

Setup

30 code chunks from the GitAsk codebase. 15 questions with known answers. Deterministic embeddings (384-dim, seeded PRNG) so results are reproducible. Ran in Vitest, single thread, no GPU.
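
The deterministic embeddings can be sketched as: hash the chunk text to a seed, draw 384 components from a tiny seeded PRNG, and L2-normalize. This illustrates the idea only — FNV-1a and mulberry32 are my choices here, not necessarily the harness's exact generator:

```typescript
// mulberry32: a minimal 32-bit seeded PRNG returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Same text in, same 384-dim unit vector out — reproducible runs.
function fakeEmbed(text: string, dim = 384): number[] {
  let seed = 2166136261; // FNV-1a hash of the text as the seed
  for (let i = 0; i < text.length; i++) {
    seed = Math.imul(seed ^ text.charCodeAt(i), 16777619);
  }
  const rand = mulberry32(seed);
  const v = Array.from({ length: dim }, () => rand() * 2 - 1);
  const norm = Math.hypot(...v);
  return v.map((x) => x / norm);
}
```

Because the vectors are a pure function of the text, every ablation run sees identical embeddings, so latency and recall differences come only from the toggled component.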

Results

Config                     Recall@5   MRR     Latency
Full Pipeline (baseline)   100.0%     1.000   522μs
No Quantization            100.0%     1.000   290μs
Vector-Only                100.0%     1.000   242μs
No Reranking               96.7%      0.867   325μs
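
The hybrid configs fuse the vector and keyword rankings with Reciprocal Rank Fusion (RRF). A minimal sketch of that fusion, assuming the textbook k = 60 constant (the pipeline's actual constant isn't stated here):

```typescript
// RRF: each ranking contributes 1 / (k + rank) to a chunk's score,
// so a chunk near the top of *either* list floats upward in the fused
// order without any score calibration between the two retrievers.
function rrfFuse(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((x, y) => y[1] - x[1])
    .map(([id]) => id);
}
```

This rank-based fusion is why keyword search is "cheap insurance": an exact symbol-name hit ranked #1 by BM25-style matching survives fusion even when its vector similarity is mediocre.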


Storage: Binary Quantization

  • Float32: 1536 B/vector
  • Binary: 48 B/vector (32× smaller)

For 500 chunks: 750KB → 23KB. Same recall. Fits in IndexedDB easily.
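
A sketch of the quantization itself: keep one sign bit per dimension, so 384 float32 values (1536 B) pack into 48 B, and compare vectors by Hamming distance instead of cosine similarity. Illustrative code, not the actual implementation:

```typescript
// Binary quantization: bit i is set iff component i is positive.
// 384 dims → ceil(384 / 8) = 48 bytes per vector.
function quantize(vec: number[]): Uint8Array {
  const bits = new Uint8Array(Math.ceil(vec.length / 8));
  vec.forEach((x, i) => {
    if (x > 0) bits[i >> 3] |= 1 << (i & 7);
  });
  return bits;
}

// Hamming distance: count differing bits. Lower = more similar.
function hamming(a: Uint8Array, b: Uint8Array): number {
  let d = 0;
  for (let i = 0; i < a.length; i++) {
    let x = a[i] ^ b[i];
    while (x) {
      d += x & 1;
      x >>= 1;
    }
  }
  return d;
}
```

XOR-and-popcount over 48 bytes is also why the quantized path costs so little at query time despite the pack/unpack overhead.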

CoVe (Chain-of-Verification)

Can't benchmark automatically — it needs the LLM. Here's what I observed manually:

  • Hallucination reduction: CoVe extracts up to 3 factual claims and verifies each against the codebase via hybrid search. This adds a self-correction pass that catches incorrect function names, wrong file paths, and fabricated API details.
  • Latency cost: Each CoVe pass requires 3 additional LLM calls (claim extraction + refinement) plus 3 embedding + search round-trips. On Qwen2-0.5B (q4f16_1), this adds ~2–4 seconds per response.
  • Quality trade-off: For short, factual queries ("what does function X do?"), CoVe rarely changes the answer. For complex multi-hop questions, CoVe corrects 15–30% of initial claims based on manual testing.
  • Recommendation: CoVe is most valuable for multi-hop reasoning queries; best as opt-in for complex questions.
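
The pass described above reduces to a small control loop. Here `extractClaims`, `search`, and `refine` are hypothetical stand-ins for the LLM and hybrid-search calls, not GitAsk's API:

```typescript
// Injected dependencies so the control flow is testable without an LLM.
type Verifier = {
  extractClaims: (answer: string) => Promise<string[]>;
  search: (claim: string) => Promise<string>; // hybrid search → context
  refine: (
    answer: string,
    evidence: { claim: string; context: string }[]
  ) => Promise<string>;
};

// One CoVe pass: extract up to 3 claims from the draft answer, gather
// verification context for each, then ask the model to refine.
async function coVePass(answer: string, v: Verifier): Promise<string> {
  const claims = (await v.extractClaims(answer)).slice(0, 3);
  if (claims.length === 0) return answer; // nothing to verify
  const evidence = await Promise.all(
    claims.map(async (claim) => ({ claim, context: await v.search(claim) }))
  );
  return v.refine(answer, evidence);
}
```

Running the three searches with `Promise.all` hides the embedding round-trips behind each other; the ~2–4 s cost is dominated by the extra LLM calls, which stay sequential.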

Bottom line

  • Reranking matters most. Only config that hurts quality. Don't skip it.
  • Quantization is free accuracy. 32× less storage, same recall.
  • Hybrid search is cheap insurance. Catches exact symbol names vectors miss.
  • CoVe helps on hard questions. Adds 2–4s latency. Best as opt-in.