F-GRPO trains LLMs to generate and rank search results in one pass
Researchers propose factorized group-relative policy optimization to train LLMs to generate and rank candidates end-to-end, tackling the credit-assignment problem that plagues two-stage retrieval pipelines.

F-GRPO is a training framework that unifies candidate generation and ranking inside a single LLM forward pass, addressing a fundamental problem in retrieval systems. Traditional pipelines split the work—first retrieve a candidate pool, then rerank it—but that decoupling means the ranker is constrained by whatever the retriever provides. An LLM can instead generate a candidate subset and order it autoregressively in one pass, but the utility signal arrives only after the full sequence completes. This creates a credit-assignment gap: when results are poor, the model cannot tell whether the generation phase picked the wrong candidates or the ranking phase ordered them incorrectly.
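The credit-assignment gap can be made concrete with a toy example (hypothetical data and metrics, not from the preprint): two rollouts can earn the same end-of-sequence utility for entirely different reasons, so a single sequence-level reward cannot tell the phases apart.

```python
# Toy illustration of the credit-assignment gap. "relevant" is the
# ground-truth set; each rollout generates a candidate subset, then
# orders it. Utility here is hits in the top-2 positions (hypothetical).

relevant = {"a", "b", "c"}

def top_k_hits(ranking, relevant, k=2):
    """Position-aware utility: relevant items in the top-k slots."""
    return sum(1 for item in ranking[:k] if item in relevant)

def coverage(generated, relevant):
    """Order-invariant signal: fraction of relevant items generated."""
    return len(set(generated) & relevant) / len(relevant)

# Rollout 1: generation missed most relevant items; the ranking of the
# bad pool is actually as good as it can be.
gen_1, ranked_1 = ["a", "x", "y", "z"], ["a", "x", "y", "z"]

# Rollout 2: generation found all relevant items, but ranking buried them.
gen_2, ranked_2 = ["a", "b", "c", "x"], ["x", "a", "b", "c"]

print(top_k_hits(ranked_1, relevant), coverage(gen_1, relevant))  # 1 0.33…
print(top_k_hits(ranked_2, relevant), coverage(gen_2, relevant))  # 1 1.0
```

Both rollouts score the same utility (one top-2 hit), so a sequence-level reward sees them as equally bad; only the order-invariant coverage signal reveals that the first failed at generation while the second failed at ranking.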
Researchers including Rohan Surana, Gagan Mundada, and Junda Wu posted the preprint to arXiv on May 14. Their framework factorizes the policy into two phases while keeping a shared LLM backbone. The model trains both phases end-to-end with two reward signals: an order-invariant coverage reward for generation and a position-aware utility reward for ranking. F-GRPO computes separate group-relative advantages for each phase within a two-phase sequence-level objective, letting the model learn which phase caused a utility drop. Across sequential recommendation and multi-hop question answering benchmarks, F-GRPO outperforms standard GRPO and decoupled baselines on metrics at top ranks, beats supervised alternatives, and stays competitive with strong zero-shot rerankers—all without changing the architecture at inference time.
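The phase-wise advantage computation can be sketched in a few lines. This is a minimal illustration of GRPO-style group normalization applied separately per phase, under the assumption that each phase's reward is normalized against the same group of rollouts; the reward values below are made up, and the exact objective in the preprint may differ.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization: each rollout's advantage is its reward
    minus the group mean, scaled by the group standard deviation."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for a group of 4 rollouts from the same prompt:
# coverage scores the generation phase (order-invariant), ranking utility
# (e.g. an NDCG-like score) scores the ranking phase (position-aware).
coverage_rewards = [0.33, 1.0, 0.66, 1.0]
ranking_rewards  = [0.90, 0.40, 0.70, 0.80]

adv_gen  = group_relative_advantages(coverage_rewards)
adv_rank = group_relative_advantages(ranking_rewards)

# In a factorized objective, adv_gen weights the log-probabilities of the
# generation-phase tokens and adv_rank weights the ranking-phase tokens,
# so each phase receives its own credit signal instead of one shared one.
print(adv_gen)   # rollout 0 penalized: worst coverage in the group
print(adv_rank)  # rollout 1 penalized: worst ranking in the group
```

Note how rollout 0 (poor coverage, strong ranking) and rollout 1 (full coverage, poor ranking) receive opposite-signed advantages in the two phases; a single sequence-level advantage would have blurred that distinction.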
The evaluation is limited to recommendation and QA tasks; retrieval-augmented generation for long-form text, code search, and multimodal retrieval remain untested. The preprint does not report wall-clock training costs or sample efficiency curves relative to decoupled pipelines. If F-GRPO scales to production retrieval workloads without prohibitive compute overhead, it could replace the two-model pattern that dominates commercial search and recommendation systems.