Entropy metric predicts when LLM rewriting boosts code retrieval
New study shows full natural-language rewriting lifts code search by +0.51 NDCG@10, but corpus-only rewrites degrade retrieval in 62% of configurations; token entropy Delta H predicts gain with rho = +0.593.

A preprint from Andrea Gurioli, Federico Pennino, and Maurizio Gabbrielli tests three LLM rewriting strategies for embedding-based code retrieval (stylistic rephrasing, natural-language-enriched pseudocode, and full natural-language transcription) across six CoIR benchmarks, five encoders, and three rewriter families (Qwen, DeepSeek, Mistral). The team evaluated two augmentation modes: joint query-corpus (QC) and corpus-only (C). Full NL rewriting under QC delivered the largest gain, +0.51 absolute NDCG@10 on the CT-Contest benchmark with the MoSE-18 encoder. Corpus-only rewriting, by contrast, degraded retrieval in 56 of the 90 configurations tested, roughly 62 percent.
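For readers unfamiliar with the headline metric, NDCG@10 scores the top ten retrieved results, rewarding relevant hits that appear higher in the ranking. A minimal sketch (relevance values and function names here are illustrative, not taken from the paper):

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain: each relevance is discounted by log2 of its rank."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """NDCG@k: DCG normalized by the ideal DCG (results sorted by relevance)."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Toy ranking for one query: relevant documents (rel=1) at ranks 2 and 4.
print(round(ndcg_at_k([0, 1, 0, 1, 0, 0, 0, 0, 0, 0]), 3))  # -> 0.651
```

A +0.51 absolute gain on this metric is large: scores range from 0 (no relevant results in the top ten) to 1 (a perfect ranking).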
The authors introduce two diagnostics: Delta H (token entropy shift) and Delta s (embedding cosine shift). Delta H emerged as a cheap, rewriter-agnostic proxy for predicting retrieval gain under QC augmentation: pooled over the DeepSeek and Codestral rewriters, it correlated with NDCG@10 lift at Spearman rho = +0.436 (p < 0.001), climbing to rho = +0.593 on Codestral alone, with rho = +0.356 on Qwen. The findings reframe LLM rewriting as a cost-benefit decision: it works best as a remediation layer for lightweight encoders on code-dominant queries, with diminishing returns for strong encoders or natural-language-heavy queries. The paper is the first to evaluate NL-enriched pseudocode and snippet-level natural language as direct retrieval representations rather than as transient intermediates.
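The appeal of Delta H is that it needs no extra model calls: it compares the token distributions of a snippet before and after rewriting. The paper's exact tokenizer and entropy definition are not reproduced here; the sketch below assumes whitespace tokenization and standard Shannon entropy, with Delta H positive when the rewrite's token distribution is more diverse:

```python
import math
from collections import Counter

def token_entropy(tokens):
    """Shannon entropy (bits) of the empirical token distribution."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def delta_h(original_tokens, rewritten_tokens):
    """Delta H sketch: rewrite entropy minus original entropy.
    Assumed sign convention: positive means the rewrite is more token-diverse."""
    return token_entropy(rewritten_tokens) - token_entropy(original_tokens)

# Toy example: a repetitive code snippet vs. a natural-language transcription.
code = "s = s + a [ i ] ; i = i + 1".split()
nl = "iterate over the list and accumulate each element into a running sum".split()
print(delta_h(code, nl))  # positive here: the NL rewrite repeats fewer tokens
```

In the paper's setup, a practitioner would compute such a Delta H per snippet and use its correlation with NDCG@10 lift to decide whether rewriting is worth the inference cost.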