PBSD turns sparse RL rewards into turn-level credit signals with Bayesian distillation
A new preprint from Tian et al. introduces Privileged Bayesian Self-Distillation, a method that decomposes trajectory-level RL rewards into fine-grained turn-by-turn credit scores using Bayes' rule and a privileged teacher model.

PBSD (Privileged Bayesian Self-Distillation) is a reinforcement learning method that addresses the credit assignment problem in long-horizon agentic tasks. The preprint, posted to arXiv this week, describes a technique for converting sparse outcome-based rewards, where only the final result is scored, into turn-level signals that identify which intermediate reasoning steps or tool calls contributed to success or failure.
The core idea is to score a trajectory by the posterior-to-prior probability ratio of the verified answer, that is, by how much more likely the verified answer becomes once the trajectory has been observed. Bayes' rule rewrites that ratio as a tractable likelihood comparison: the probability of the trajectory under a privileged teacher model conditioned on the answer, divided by its probability under the standard student model. Decomposing this Bayesian evidence score autoregressively yields per-turn credit signals. According to the preprint, the method is fully compatible with standard policy optimization pipelines and does not require changes to the underlying RL algorithm.
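To make the decomposition concrete, here is a minimal sketch of the identity described above, not the authors' implementation. Writing a for the verified answer and y_1..y_T for the tokens of a trajectory (conditioning on the question is left implicit), Bayes' rule gives log p(a | y) - log p(a) = log p(y | a) - log p(y), and the autoregressive factorization turns the right-hand side into a sum over tokens of log p(y_t | y_<t, a) - log p(y_t | y_<t). Grouping tokens by turn gives one number per turn. The function names, the input layout (per-turn lists of token log-probabilities) and the toy values are all assumptions made for illustration; any normalization or clipping the preprint may apply, and the way the credits enter the policy loss, are not covered here.

```python
from typing import List, Sequence


def turn_credits(
    student_logprobs: Sequence[Sequence[float]],
    teacher_logprobs: Sequence[Sequence[float]],
) -> List[float]:
    """Per-turn log-likelihood ratio (hypothetical helper).

    Each argument holds one list of token log-probabilities per turn,
    scored on the SAME sampled tokens: once by the student (no answer in
    context) and once by the answer-conditioned teacher.
    """
    if len(student_logprobs) != len(teacher_logprobs):
        raise ValueError("student and teacher must cover the same turns")
    credits = []
    for s_turn, t_turn in zip(student_logprobs, teacher_logprobs):
        if len(s_turn) != len(t_turn):
            raise ValueError("each turn must have matching token counts")
        # Sum of (teacher - student) over the tokens of this turn.
        credits.append(sum(t - s for s, t in zip(s_turn, t_turn)))
    return credits


def evidence_score(credits: Sequence[float]) -> float:
    """Trajectory-level log posterior-to-prior ratio: the sum of turn credits."""
    return sum(credits)


# Toy values, invented for illustration: three turns with 2, 1 and 2 tokens.
student = [[-1.0, -2.0], [-0.5], [-3.0, -1.0]]
teacher = [[-0.5, -1.5], [-1.5], [-2.0, -1.0]]

credits = turn_credits(student, teacher)
print(credits)                  # [1.0, -1.0, 1.0]
print(evidence_score(credits))  # 1.0
```

In this toy trajectory the overall evidence score is positive, yet the second turn receives negative credit: knowing the answer makes that turn less likely, which is the kind of within-trajectory distinction a single outcome reward cannot express.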
On generalization
Experiments show consistent performance gains in both in-domain and out-of-domain settings. The authors report that PBSD transfers knowledge from short-context training to long-context inference, suggesting that the fine-grained credit assignment mechanism helps the policy generalize beyond its training distribution. The approach is especially relevant for multi-turn search agents, where successful trajectories may contain misleading actions and failed trajectories may contain valuable evidence-gathering steps that a trajectory-level reward would otherwise obscure.
The preprint is available on arXiv and the HuggingFace papers hub. Authors are Yang Tian, Rui Wang, Xumeng Wen, Junjie Li, Shizhao Sun, and Lei Song.






