CPQL brings multi-step value estimation to offline reinforcement learning
Researchers propose Conservative Peng's Q(λ), the first offline RL algorithm to use a multi-step operator for conservative value estimation, outperforming single-step baselines on D4RL benchmarks.
A new offline reinforcement learning algorithm adapts a multi-step operator to avoid the over-pessimism that plagues existing conservative methods. Conservative Peng's Q(λ) (CPQL), introduced in a preprint posted to arXiv on May 15, replaces the standard Bellman operator with the Peng's Q(λ) (PQL) operator for value estimation, achieving near-optimal performance guarantees while maintaining implicit behavior regularization.
Offline RL trains agents on fixed datasets without environment interaction, but conservative approaches often underestimate values in order to avoid overestimation errors. CPQL is the first algorithm to demonstrate, both theoretically and empirically, that a multi-step operator can mitigate this over-pessimism while fully leveraging the offline trajectories. Because the fixed point of the PQL operator lies closer to the behavior policy's value function than that of the standard single-step operator, the multi-step backup induces behavior regularization without explicit constraints.
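To make the multi-step idea concrete, the sketch below computes a Peng's Q(λ) target with a backward pass over one logged trajectory, mixing the discounted rewards actually observed with a bootstrapped value at every step. This is a minimal NumPy illustration, not the paper's implementation: the function name, hyperparameter values, and the choice of bootstrap estimate are assumptions, and the conservative component of CPQL is not modeled here.

```python
import numpy as np

def peng_q_lambda_targets(rewards, bootstrap_values, terminals, gamma=0.99, lam=0.7):
    """Backward pass computing Peng's Q(lambda) targets for one logged trajectory.

    rewards[t]          : reward received at step t
    bootstrap_values[t] : bootstrapped value of the next state s_{t+1},
                          e.g. max_a Q(s_{t+1}, a) or the policy's expected Q
    terminals[t]        : 1.0 if the episode ends after step t, else 0.0
    """
    num_steps = len(rewards)
    targets = np.zeros(num_steps)
    # If the trajectory is truncated rather than terminal, bootstrap fully at the end.
    next_return = bootstrap_values[-1]
    for t in reversed(range(num_steps)):
        if terminals[t]:
            mixed = 0.0  # no bootstrapping across an episode boundary
        else:
            # Recursion: G_t = r_t + gamma * [(1 - lam) * V(s_{t+1}) + lam * G_{t+1}]
            mixed = (1.0 - lam) * bootstrap_values[t] + lam * next_return
        next_return = rewards[t] + gamma * mixed
        targets[t] = next_return
    return targets
```

With lam=0 this collapses to the usual one-step Bellman target; with lam=1 it becomes the Monte Carlo return of the logged trajectory, an unbiased estimate of the behavior policy's value, which is intuitively where the implicit pull toward the behavior policy comes from.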
What stands out:
- Multi-step advantage: CPQL is the first offline RL method to use a multi-step operator for conservative value estimation, a departure from the single-step Bellman backups used in prior work.
- Performance guarantees: The algorithm achieves performance at least as good as the behavior policy while providing near-optimal guarantees, a milestone previous conservative methods could not reach.
- D4RL benchmark wins: Extensive experiments on the D4RL benchmark suite show CPQL consistently and significantly outperforming existing single-step offline baselines across multiple domains.
- Offline-to-online transfer: Q-functions pre-trained with CPQL let online PQL agents avoid the performance drop typically seen at the start of fine-tuning, delivering robust improvements during online adaptation.
