POISE cuts reasoning-model training compute by using hidden states as value signals
A new RL method estimates baselines from the policy model's own hidden states and entropy signals, matching DAPO's performance on math benchmarks while cutting the compute overhead of separate critic networks and multi-rollout sampling.

Reinforcement learning for large reasoning models typically demands either a second model at policy scale (PPO's critic) or multiple rollouts per prompt (GRPO's empirical mean baseline). A preprint from researchers at Seoul National University and Kakao Brain introduces POISE—Policy Optimization with Internal State Value Estimation—which sidesteps both costs by training a lightweight probe on the policy model's own hidden states and token-entropy statistics.
The probe predicts expected verifiable reward from a single rollout's internal signals, reducing gradient variance without the compute overhead of a full-scale value network. On Qwen3-4B and DeepSeek-R1-Distill-Qwen-1.5B across math reasoning benchmarks, POISE matches DAPO (a recent GRPO variant) while requiring less compute. Because the value estimator consumes signals already computed during the policy forward pass, it adds negligible inference cost.
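The preprint's exact probe architecture isn't detailed in this summary; as a rough sketch, the PyTorch module below shows one plausible shape for such an estimator. The class name, the mean-pooling over hidden states, and the choice of four entropy statistics are illustrative assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn

class ValueProbe(nn.Module):
    """Hypothetical lightweight probe: predicts expected verifiable reward
    from a rollout's hidden states and token-entropy statistics."""

    def __init__(self, hidden_dim: int, n_entropy_feats: int = 4):
        super().__init__()
        # Small MLP head over pooled hidden states concatenated with
        # trajectory-level entropy features; far cheaper than a policy-scale critic.
        self.net = nn.Sequential(
            nn.Linear(hidden_dim + n_entropy_feats, 256),
            nn.GELU(),
            nn.Linear(256, 1),
        )

    def forward(self, hidden_states: torch.Tensor, token_entropy: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq, hidden_dim), already produced by the
        # policy's forward pass; token_entropy: (batch, seq), per-token entropy
        # of the policy's output distribution.
        pooled = hidden_states.mean(dim=1)  # (batch, hidden_dim)
        stats = torch.stack(
            [
                token_entropy.mean(dim=1),
                token_entropy.std(dim=1),
                token_entropy.min(dim=1).values,
                token_entropy.max(dim=1).values,
            ],
            dim=1,
        )  # (batch, 4) simple trajectory-level entropy summary
        return self.net(torch.cat([pooled, stats], dim=1)).squeeze(-1)
```

An MLP head of this size is orders of magnitude smaller than a policy-scale critic, which is the point of the design: the expensive inputs are free because the policy has already computed them.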
Cross-rollout construction preserves gradient unbiasedness. A baseline leaves the policy gradient unbiased only if it is independent of the trajectory it scores, so conditioning the value estimate on that same trajectory's features would bias the update. POISE therefore predicts each rollout's value from an independent rollout's internal states. This architectural choice lets the probe use rich trajectory-level signals without coupling the baseline to the outputs it is scoring.
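One way to realize this pairing is to rotate baselines within a batch so no rollout is scored against its own features. The rotation scheme below, and whether pairing is restricted to rollouts of the same prompt, are assumptions for illustration:

```python
import torch

def cross_rollout_baselines(probe, hidden_states: torch.Tensor,
                            token_entropy: torch.Tensor) -> torch.Tensor:
    # hidden_states: (batch, seq, dim); token_entropy: (batch, seq).
    # Rotate indices so rollout i's baseline is predicted from rollout
    # (i + 1) % batch's signals, never its own (assumes batch > 1).
    batch = hidden_states.shape[0]
    perm = torch.roll(torch.arange(batch), shifts=1)
    # no_grad keeps the baseline from feeding gradients back into the
    # policy features of the rollout it baselines.
    with torch.no_grad():
        return probe(hidden_states[perm], token_entropy[perm])
```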
Single-rollout baselines unlock higher prompt diversity. GRPO samples multiple completions per prompt to compute a stable empirical mean; POISE needs only one rollout per prompt for its value estimate. The compute saved on repeated sampling is redirected to a larger, more diverse prompt batch, which reduces gradient variance and improves learning stability.
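For intuition, here is a back-of-envelope comparison under a fixed generation budget; all numbers are illustrative, not from the paper:

```python
# Rollout accounting at a fixed per-step generation budget (illustrative).
rollout_budget = 2048                # total completions generated per step
grpo_group_size = 8                  # rollouts per prompt for GRPO's group mean
grpo_prompts = rollout_budget // grpo_group_size   # 256 distinct prompts
poise_prompts = rollout_budget // 1                # 2048 distinct prompts
print(f"GRPO: {grpo_prompts} prompts/step; POISE: {poise_prompts} prompts/step")
```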
The probe generalizes across verifiable tasks. The lightweight value estimator—trained online alongside the policy—performs comparably to a separate LLM-scale value model on math reasoning and transfers to other verifiable reward settings without retraining.
No separate critic, no multi-rollout overhead. PPO's critic is typically as large as the policy; GRPO's group-mean baseline requires multiple rollouts per prompt. POISE replaces both with a probe that reads hidden states and entropy statistics already in memory, reducing the compute footprint of baseline estimation to near zero.
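To make the contrast concrete, a minimal sketch of the two baseline styles follows. The function signatures are illustrative; GRPO's group normalization is written in its standard form, and the POISE baselines are assumed to come from a probe like the one sketched above:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: (G,) verifiable rewards for G rollouts of one prompt.
    # GRPO subtracts the empirical group mean and normalizes by the group std.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def poise_advantages(rewards: torch.Tensor, probe_baselines: torch.Tensor) -> torch.Tensor:
    # rewards: (B,) one rollout per prompt; probe_baselines: (B,) values
    # predicted from cross-rollout internal signals, no extra rollouts needed.
    return rewards - probe_baselines
```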
Verifiable-reward setting only. POISE is designed for RLVR, reinforcement learning with verifiable rewards: tasks with ground-truth correctness signals like math proofs or code execution. It does not address preference-based RL (RLHF) or settings where verifiable rewards are unavailable.