ProRL corrects gradient bias in recommendation path planning
ProRL, a new reinforcement learning framework, corrects two policy gradient defects that plague proactive recommender systems, outperforming baselines on three real-world datasets.

ProRL is a reinforcement learning framework that fixes two fundamental gradient estimation failures in proactive recommender systems, which are designed to steer user preferences toward target items through chains of intermediate recommendations.
Standard policy gradient methods break down in this setting because the reward for a whole path decomposes into step-level rewards whose mean is positive. This creates a length-dependent bias: the optimizer favors longer paths over better ones, since extending any path raises its expected cumulative reward. A second defect compounds the problem: weighting every step by the full path reward ignores the decomposition structure, which inflates gradient variance and drowns out the signal.
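The length bias is easy to reproduce in a toy simulation. The sketch below is illustrative only and is not taken from the ProRL code: it assumes step rewards are independent Gaussian draws with a positive mean (the `step_mean` and `noise` parameters are hypothetical) and compares the average cumulative reward of short and long paths.

```python
import random


def mean_return(length, step_mean=0.2, noise=1.0, n_paths=20000, seed=0):
    """Average cumulative reward over simulated paths of a fixed length.

    Assumption: each step reward is an independent Gaussian draw with
    mean `step_mean` > 0. This is a toy model, not the paper's setup.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        # Path-level reward is the sum of its step-level rewards.
        total += sum(rng.gauss(step_mean, noise) for _ in range(length))
    return total / n_paths


short_mean = mean_return(length=3)
long_mean = mean_return(length=9)

# Every step comes from the same distribution, yet longer paths
# accumulate more reward on average.
print(f"length 3: {short_mean:.2f}, length 9: {long_mean:.2f}")
```

Under these assumptions the longer paths score higher on average even though no individual step is better, so a path-level reward alone cannot separate quality from length.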
Stepwise centering and position-specific baselines
ProRL introduces two mechanisms to rectify these failures. Stepwise Reward Centering subtracts the expected reward at each step, neutralizing the length bias so that extending a path yields zero expected gradient signal. The optimizer must then improve path quality rather than path length. Position-Specific Advantage Estimation computes step-dependent baselines that respect the reward decomposition, cutting variance without sacrificing gradient signal.
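The two ideas can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' estimator: it assumes the expected reward at each step is approximated by the batch mean at that position, and that a step's advantage is the centered reward accumulated from that step onward. The function names and the sample `paths` are hypothetical.

```python
def position_baselines(paths):
    """Mean reward at each step position, over the paths that reach it."""
    max_len = max(len(p) for p in paths)
    baselines = []
    for t in range(max_len):
        rewards_at_t = [p[t] for p in paths if len(p) > t]
        baselines.append(sum(rewards_at_t) / len(rewards_at_t))
    return baselines


def center_rewards(paths, baselines):
    """Subtract the position-specific baseline from every step reward."""
    return [[r - baselines[t] for t, r in enumerate(p)] for p in paths]


def reward_to_go(rewards):
    """Sum of rewards from each step to the end of the path."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running += r
        out.append(running)
    return out[::-1]


# Hypothetical step rewards for three paths of different lengths.
paths = [[0.5, 0.3], [0.1, 0.4, 0.6, 0.2], [0.3, 0.2, 0.4]]

baselines = position_baselines(paths)
centered = center_rewards(paths, baselines)
# Each step is weighted by its own centered reward-to-go rather than
# by the full path reward.
advantages = [reward_to_go(c) for c in centered]
```

After centering, the rewards at each position average to zero across the batch, so adding a step no longer raises a path's expected score. In this toy batch, the fourth step of the longest path earns a centered reward of exactly zero because it is the only path that reaches that position.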
Authors Hongru Hou, Tiehua Mei, Denghui Geng, Jinhui Huang, Ao Xu, and Hengrui Chen tested ProRL on three real-world datasets and report significant performance gains over existing proactive recommendation methods. Code is available at https://github.com/hongruhou89/ProRL.

