Active learning cuts LLM reranking calls while lifting NDCG@10
A new preprint reframes pairwise ranking prompting as active learning from noisy comparisons, showing that active rankers outperform classical sorting on top-K retrieval when call budgets are tight.

Pairwise Ranking Prompting (PRP) asks an LLM to judge which of two candidates is better, then aggregates those judgments into a ranked list. The standard approach runs a classical sorting algorithm such as quicksort or mergesort until it recovers a full permutation, then truncates to the desired top-K. But LLM judgments are noisy, order-sensitive, and sometimes intransitive, so the assumptions behind sorting algorithms don't hold. More importantly, a sort that is cut short when the call budget runs out does not reliably surface the best items.
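
As a rough illustration of the sort-then-truncate baseline described above, the sketch below ranks candidates with Python's built-in sort driven by a pairwise comparator and counts the comparisons it spends. The `prefers` callback stands in for one LLM call and is hypothetical, as are the function name and the toy relevance scores; none of this is code from the preprint.

```python
from functools import cmp_to_key
from typing import Callable, List, Sequence, Tuple


def sort_then_truncate(
    candidates: Sequence[str],
    prefers: Callable[[str, str], bool],
    k: int,
) -> Tuple[List[str], int]:
    """Rank all candidates with a comparison sort, then keep the top k.

    `prefers(a, b)` is a hypothetical stand-in for one pairwise LLM call
    that returns True when `a` is judged better than `b`.
    Returns the top-k list and the number of comparisons used.
    """
    calls = 0

    def compare(a: str, b: str) -> int:
        nonlocal calls
        calls += 1
        # A negative value sorts `a` before `b`, i.e. ranks `a` higher.
        return -1 if prefers(a, b) else 1

    ranked = sorted(candidates, key=cmp_to_key(compare))
    return ranked[:k], calls


# Toy judge: a noiseless oracle over made-up relevance scores.
scores = {"d1": 0.2, "d2": 0.9, "d3": 0.5, "d4": 0.7}
top2, n_calls = sort_then_truncate(
    list(scores), lambda a, b: scores[a] > scores[b], k=2
)
print(top2, n_calls)
```

With a consistent judge the sort recovers the right top two, but every comparison is paid for whether or not it affects the head of the list. With a noisy or intransitive judge the sort still returns some ordering, with no guarantee about its quality.
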
A preprint by Jeremías Figueiredo Paschmann, Juan Kaplan, Francisco Nattero, Santiago Barron, Juan Wisznia, and Luciano del Corro reframes PRP reranking as an active-learning problem. Instead of running a full sort, their framework selects which pairs to compare based on the comparisons it has already observed, focusing calls on the pairs that most reduce uncertainty about the top-K. In the call-constrained regime, where you can afford only a fixed number of LLM queries, active rankers deliver higher NDCG@10 per call than classical sorting. The authors also introduce a randomized-direction oracle that uses a single LLM call per pair instead of the usual two (A vs B, then B vs A). Because the order in which each pair is presented is chosen at random, systematic position bias (the tendency for LLMs to favor the first or second item) turns into zero-mean noise, so the second call that would otherwise be needed to cancel the bias can be dropped.
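
The idea behind the randomized-direction oracle can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `judge_first_wins` is a hypothetical stand-in for a single LLM call that returns True when it picks the first-presented item, and `make_biased_judge` is a toy simulator invented here to mimic position bias.

```python
import random
from typing import Callable, Dict


def randomized_direction_compare(
    a: str,
    b: str,
    judge_first_wins: Callable[[str, str], bool],
    rng: random.Random,
) -> bool:
    """Return True if `a` is judged better than `b`, using one judge call.

    `judge_first_wins(x, y)` is a hypothetical stand-in for a single LLM
    call that shows x first and y second, and returns True if it picks x.
    """
    if rng.random() < 0.5:
        # Present a first; the verdict already refers to a.
        return judge_first_wins(a, b)
    # Present b first; a wins exactly when the first item (b) loses.
    return not judge_first_wins(b, a)


def make_biased_judge(
    scores: Dict[str, float], first_bonus: float, rng: random.Random
) -> Callable[[str, str], bool]:
    """Toy judge (an assumption for illustration): noisy, favors slot one."""
    def judge(x: str, y: str) -> bool:
        return scores[x] + first_bonus + rng.gauss(0, 0.1) > scores[y]
    return judge


rng = random.Random(0)
# Two equally relevant items, so any preference is pure position bias.
judge = make_biased_judge({"a": 0.5, "b": 0.5}, first_bonus=0.3, rng=rng)

trials = 10_000
fixed = sum(judge("a", "b") for _ in range(trials)) / trials
randomized = sum(
    randomized_direction_compare("a", "b", judge, rng) for _ in range(trials)
) / trials
print(f"fixed order: a wins {fixed:.2f}; randomized order: a wins {randomized:.2f}")
```

In this toy setup the fixed-order win rate for `a` sits near 1 even though the two items are tied, while the randomized win rate sits near one half. The bias is still present in every individual call, but it no longer points consistently toward either item, so an aggregator sees it as extra noise rather than as a systematic preference.
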
The randomized-direction oracle is the most immediately practical contribution: halving the number of calls needed to neutralize position bias is a straightforward win for anyone running PRP rerankers in production. The noise-robust active-learning approach is model-agnostic and works with off-the-shelf LLMs. What remains to be seen is how the active-selection heuristics scale to larger candidate pools and whether the NDCG gains hold up when the underlying LLM is fine-tuned specifically for pairwise judgments rather than general instruction-following.