ActGuide-RL matches SFT+RL without supervised fine-tuning, using human action data
A new reinforcement learning method uses everyday human action data as adaptive fallback guidance, matching SFT+RL pipeline performance on GAIA and XBench without the overhead of supervised fine-tuning.
ActGuide-RL is a reinforcement learning technique that addresses the cold-start problem in agentic RL for large language models. Rather than bootstrapping exploration with iterative supervised fine-tuning, the method injects action data from everyday human interactions as plan-style reference guidance. When the base policy cannot reach reward states on its own, ActGuide-RL invokes guidance as an adaptive fallback, then optimizes guided and unguided rollouts jointly to internalize exploration gains back into the unguided policy. The approach follows a minimal-intervention principle, activating guidance only when task difficulty demands it, which reduces off-policy risk while preserving the learning signal.
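The fallback mechanism can be pictured as a rollout loop that tries the unguided policy first and injects guidance only on failure. The following is a minimal sketch of that idea; the class names, methods, and stub behavior are illustrative assumptions, not the paper's actual API.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical stand-ins for the components described above; all names
# here are assumptions for illustration, not ActGuide-RL's interface.
@dataclass
class Trajectory:
    actions: list = field(default_factory=list)
    reward: float = 0.0

class Policy:
    def rollout(self, task: str, guidance: Optional[str] = None) -> Trajectory:
        # Placeholder: a real agent would interact with its environment here.
        return Trajectory(reward=1.0 if guidance else 0.0)

class HumanActionBank:
    def sample_plan(self, task: str) -> str:
        # Placeholder: retrieve a plan-style reference distilled from
        # everyday human action data.
        return f"plan for: {task}"

def collect_rollouts(policy: Policy, bank: HumanActionBank, task: str,
                     n_rollouts: int = 8) -> list:
    """Adaptive fallback: guide only when the unguided policy finds no reward."""
    rollouts = []
    for _ in range(n_rollouts):
        traj = policy.rollout(task)  # try the base policy on its own first
        if traj.reward > 0:
            rollouts.append(("unguided", traj))
            continue
        # Minimal intervention: inject plan-style guidance only on failure,
        # which keeps most rollouts on-policy.
        plan = bank.sample_plan(task)
        rollouts.append(("guided", policy.rollout(task, guidance=plan)))
    return rollouts

print(collect_rollouts(Policy(), HumanActionBank(), "find the cited benchmark"))
```

Gating guidance on failure, rather than applying it everywhere, is what keeps the bulk of the training batch on-policy while still supplying a learning signal on tasks the base model cannot solve alone.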
On search-agent benchmarks with Qwen3-4B as the base model, ActGuide-RL delivered a 10.7 percentage point improvement over zero-RL baselines on GAIA and a 19 point gain on XBench, matching the standard SFT+RL pipeline without any cold-start supervised data. The mixed-policy training regime, which jointly optimizes guided and unguided rollouts, is supported by both theoretical analysis and empirical ablation. That these gains come from a 4-billion-parameter model suggests the approach scales to mid-tier open-weight LLMs without requiring frontier-scale compute.
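The joint optimization can be pictured as a single policy-gradient loss computed over the mixed batch of guided and unguided rollouts. The sketch below is an assumption-laden illustration: the REINFORCE-style form, the function name, and the down-weighting of guided samples are plausible choices, not the paper's stated objective.

```python
import torch

def mixed_policy_loss(logps: torch.Tensor, advantages: torch.Tensor,
                      is_guided: torch.Tensor,
                      guided_weight: float = 0.5) -> torch.Tensor:
    """REINFORCE-style joint objective over guided and unguided rollouts.

    logps:      per-trajectory log-probs under the current (unguided) policy
    advantages: per-trajectory advantage estimates
    is_guided:  boolean mask for rollouts produced under fallback guidance
    """
    # Down-weighting guided samples is one plausible way to limit off-policy
    # risk; the paper's exact weighting scheme is not specified here.
    weights = torch.where(is_guided,
                          torch.full_like(advantages, guided_weight),
                          torch.ones_like(advantages))
    # Gradients flow through logps, so exploration gains discovered in guided
    # rollouts are internalized into the shared (unguided) policy parameters.
    return -(weights * advantages * logps).mean()
```

Because both rollout types update the same parameters, improvements found under guidance transfer to the unguided policy, which is what lets the method match SFT+RL without a separate cold-start stage.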
