GCSL treats graded feedback as explicit goals, outperforming DPO without reward models
New arXiv preprint proposes goal-conditioned supervised learning, an offline fine-tuning method that treats feedback as explicit goals, outperforming SFT and DPO on toxicity, code, and recommendation tasks.
Goal-conditioned supervised learning (GCSL) is an offline fine-tuning framework for large language models that treats feedback signals (ratings, scores, or quality grades) as explicit goals and trains models through supervised learning to generate responses that meet those goals. Unlike reinforcement learning approaches that require external reward models and iterative rollouts, GCSL runs entirely offline. Unlike supervised fine-tuning (SFT), which collapses graded feedback into binary accept/reject labels, and direct preference optimization (DPO), which needs paired preference data, GCSL uses raw graded feedback directly. The authors introduce a goal formulation that trains models to produce outputs above a quality threshold rather than to imitate a filtered high-quality subset, a design intended to avoid the bounded-learning effect, in which models plateau at the ceiling of their training data. Goals are expressed in natural language, which lets the model apply its semantic reasoning to the quality target itself.
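To make the idea of a threshold goal concrete, here is a minimal Python sketch of how graded feedback could be turned into goal-conditioned supervised examples. It is an illustration under stated assumptions, not the preprint's implementation: the `GradedSample` type, the 1-to-5 score scale, the wording produced by `goal_text`, and the rule that a response scored s is paired with every goal "at least t" for t ≤ s are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class GradedSample:
    """One offline record: a prompt, a response, and its graded feedback."""
    prompt: str
    response: str
    score: int  # assumed scale: 1 (worst) to 5 (best)


def goal_text(threshold: int, max_score: int = 5) -> str:
    """Render a quality threshold as a natural-language goal (hypothetical wording)."""
    return f"Goal: write a response rated at least {threshold} out of {max_score}."


def build_examples(samples, max_score: int = 5):
    """Expand graded samples into supervised (input, target) pairs.

    Assumption for this sketch: a response with score s satisfies every
    goal "at least t" for t <= s, so it is paired with each such goal.
    No sample is discarded and no preference pairs are needed.
    """
    examples = []
    for sample in samples:
        for threshold in range(1, sample.score + 1):
            examples.append({
                "input": f"{goal_text(threshold, max_score)}\n{sample.prompt}",
                "target": sample.response,
            })
    return examples


# Toy data, invented for illustration only.
data = [
    GradedSample("Reverse a string in Python.", "return s[::-1]", score=4),
    GradedSample("Reverse a string in Python.", "return s.reverse()", score=2),
]
examples = build_examples(data)
```

The resulting pairs can be fed to any standard supervised fine-tuning loop. At inference time, the same template would be filled with the desired threshold, so that the quality target is part of the prompt rather than a filter applied to the training set.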
Evaluated on non-toxic generation, code generation, and LLM-based recommendation, GCSL outperformed the standard offline baselines, SFT and DPO, while keeping the data requirements and computational cost of supervised learning. The method does not require a separate reward model, preference annotations, or online sampling. The preprint (arXiv:2605.16345v1) was posted May 19, 2026.
