POW3R framework cuts rubric RL training steps by 2.5–4× on multimodal tasks
A new policy-aware rubric reward method adapts criterion weights during training, reaching the same performance plateau in a fraction of the steps required by standard rubric GRPO.

POW3R is a policy-aware rubric reward framework that speeds up reinforcement learning when models must satisfy multiple qualitative criteria at once. The method addresses a core inefficiency in rubric-based RL: standard approaches treat all criteria equally during training, even when some are already saturated or currently out of reach for the policy. POW3R instead uses rollout-level contrast to emphasize criteria that currently separate the policy's outputs, making the reward signal more informative without changing the underlying evaluation target.
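The idea of emphasizing criteria that separate a policy's rollouts can be illustrated with a toy sketch. The article does not give POW3R's actual weighting formula, so everything below is an assumption made for illustration: the function names `contrast_weights` and `group_advantages` are hypothetical, across-rollout variance stands in for "rollout-level contrast", and the fallback to human weights is a choice of this sketch. Only the group-normalized advantage follows the usual GRPO form.

```python
from statistics import mean, pstdev, pvariance


def contrast_weights(scores, human_weights, eps=1e-8):
    """Illustrative policy-aware weights (NOT the paper's formula).

    scores[i][k] is the score in [0, 1] of rollout i on criterion k.
    Each human weight is scaled by the variance of that criterion's
    scores across the rollout group, then the result is normalized.
    Saturated (all 1) and unreachable (all 0) criteria get zero weight.
    """
    n_criteria = len(human_weights)
    contrast = [pvariance([row[k] for row in scores]) for k in range(n_criteria)]
    raw = [w * c for w, c in zip(human_weights, contrast)]
    total = sum(raw)
    if total <= eps:
        # Assumed fallback: no criterion separates the rollouts,
        # so return the normalized human weights unchanged.
        base = sum(human_weights)
        return [w / base for w in human_weights]
    return [r / total for r in raw]


def group_advantages(scores, weights, eps=1e-8):
    """GRPO-style advantages: rewards standardized within the rollout group."""
    rewards = [sum(w * s for w, s in zip(weights, row)) for row in scores]
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts, three criteria: criterion 0 is saturated, criterion 1 is
# out of reach, and only criterion 2 distinguishes the rollouts.
scores = [
    [1.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
]
human_weights = [3.0, 2.0, 1.0]

weights = contrast_weights(scores, human_weights)
advantages = group_advantages(scores, weights)
```

In this toy group only the third criterion varies, so it receives all of the training weight even though its human-assigned weight is the smallest. The human weights are still the input to the calculation, which mirrors the article's point that the evaluation target itself is not changed.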
Across three base policies on two datasets spanning multimodal and text-only settings, POW3R won 24 of 30 base-policy/metric comparisons against vanilla GRPO with rubric rewards. It improved both mean rubric reward and strict completion—the fraction of prompts whose response satisfies every required rubric criterion—while reaching the same performance plateau in 2.5 to 4 times fewer training steps. The framework preserves human-assigned importance weights and category balance as the rubric objective, adapting only the criterion-level reward weights during training to reflect what the current policy can actually learn from.
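The two reported metrics can be made concrete with a short sketch. The strict completion rule follows the definition given above; the weighted-fraction form of the per-prompt rubric reward, the `Criterion` type, and the function names are assumptions for illustration, not the paper's evaluation code.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    weight: float    # human-assigned importance weight
    required: bool   # whether strict completion depends on this criterion
    satisfied: bool  # judge verdict for one response


def rubric_reward(criteria):
    """Assumed form: weighted fraction of satisfied criteria for one response."""
    total = sum(c.weight for c in criteria)
    return sum(c.weight for c in criteria if c.satisfied) / total


def strict_complete(criteria):
    """True only if every required criterion is satisfied."""
    return all(c.satisfied for c in criteria if c.required)


def evaluate(prompts):
    """Return (mean rubric reward, strict completion rate) over all prompts."""
    n = len(prompts)
    mean_reward = sum(rubric_reward(p) for p in prompts) / n
    completion = sum(strict_complete(p) for p in prompts) / n
    return mean_reward, completion


# Two prompts, one response each.
prompts = [
    # Required criterion met, optional one missed: strictly complete.
    [Criterion(2.0, True, True), Criterion(1.0, False, False)],
    # Optional criterion met, required one missed: not strictly complete.
    [Criterion(1.0, False, True), Criterion(1.0, True, False)],
]
mean_reward, completion = evaluate(prompts)
```

The second prompt shows why the two metrics can diverge: a response can collect partial rubric reward while still failing strict completion because a single required criterion is missed.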
The paper argues that rubric rewards should distinguish what should matter in the final answer from what can teach the current policy. Early in training, many important criteria are either already saturated or still unreachable, while the criteria that distinguish rollouts are not necessarily those with the largest human weights. POW3R's policy-aware weighting makes the GRPO reward more informative at each step, which is how it reaches the same performance plateau in 2.5 to 4 times fewer training steps.