Agent-BRACE splits LLM belief tracking from action selection to cut context bloat on long-horizon tasks
A new RL method represents belief states as natural language claims tagged with certainty labels, cutting context bloat and lifting long-horizon task success by up to 14.5 percentage points over history-based baselines.
Large language models struggle with long-horizon tasks in partially observable environments because they must track uncertain world state across dozens of steps while context windows fill with irrelevant history. Agent-BRACE, a reinforcement learning framework introduced this week, solves both problems by splitting an LLM agent into two jointly trained components: a belief state model that outputs structured claims about the environment—each tagged with a verbalized certainty label from certain to unknown—and a policy model that selects actions conditioned on those claims rather than the full interaction log.
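In control-flow terms, each step updates the belief first and then acts on it. The sketch below illustrates that split; the function bodies are stubs standing in for the two jointly trained LLMs, and all names are illustrative assumptions rather than the paper's released code.

```python
# Illustrative sketch of the belief/policy split described above.
# belief_model and policy_model stand in for the two jointly trained LLMs;
# they are stubbed here so the control flow runs end to end.
from typing import List, Tuple

Claim = Tuple[str, str]  # (atomic statement, verbalized certainty label)

def belief_model(prior: List[Claim], observation: str) -> List[Claim]:
    # Stub: the real belief model would revise the claim set with an LLM
    # call, adding, relabeling, or dropping claims given the observation.
    return prior + [(observation, "certain")]

def policy_model(belief: List[Claim], goal: str) -> str:
    # Stub: the real policy model conditions on the compact belief set,
    # not on the full interaction transcript.
    return "inspect drawer" if belief else "look around"

def agent_step(belief: List[Claim], observation: str, goal: str):
    belief = belief_model(belief, observation)  # track state first
    action = policy_model(belief, goal)         # then act under uncertainty
    return belief, action

belief, action = agent_step([], "You see a closed drawer.", "fetch the red key")
print(belief, action)
```

Because only the claim set is carried forward, the prompt the policy sees stays roughly the same size at step 5 and step 50.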
The belief representation is a set of atomic natural language statements. For example, "The red key is in the drawer" might carry a likely label, while "The drawer is locked" could be marked unknown. This structured approximate posterior compresses history into a compact, decision-relevant summary, keeping context size near-constant regardless of episode length. The policy model learns to act under explicit uncertainty, a departure from typical LLM agents that either hallucinate missing information or drown in concatenated transcripts.
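A minimal way to encode such a claim set, reusing the article's two example claims; the intermediate labels between certain and unknown are an assumption, since the article names only the endpoints.

```python
from dataclasses import dataclass

# Assumed label vocabulary: the article confirms "certain" and "unknown"
# as endpoints; the intermediate labels are illustrative.
CERTAINTY_LABELS = ("certain", "likely", "uncertain", "unlikely", "unknown")

@dataclass(frozen=True)
class BeliefClaim:
    statement: str  # one atomic natural-language claim
    certainty: str  # a verbalized label, not a numeric probability

    def __post_init__(self):
        if self.certainty not in CERTAINTY_LABELS:
            raise ValueError(f"unrecognized certainty label: {self.certainty!r}")

belief = {
    BeliefClaim("The red key is in the drawer", "likely"),
    BeliefClaim("The drawer is locked", "unknown"),
}
```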
Across embodied language benchmarks, Agent-BRACE lifted task success by an average of 14.5 percentage points on Qwen2.5-3B-Instruct and 5.3 points on Qwen3-4B-Instruct, outperforming strong RL baselines that condition on the full interaction history. The learned belief becomes better calibrated as an episode progresses and evidence accumulates, suggesting the model genuinely tracks uncertainty rather than simply labeling claims at random. The preprint is available on HuggingFace Papers; code and model checkpoints have not yet been released.
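The article does not spell out how calibration was measured. One standard check, shown below under the assumption that each verbalized label maps to a nominal probability, compares that nominal confidence with the empirical accuracy of claims carrying the label.

```python
# Hypothetical calibration check, not the paper's evaluation protocol.
# Assumes a nominal probability for each verbalized label.
NOMINAL = {"certain": 0.95, "likely": 0.75, "uncertain": 0.5,
           "unlikely": 0.25, "unknown": 0.5}

def calibration_gap(claims):
    """claims: iterable of (label, claim_was_true) pairs.
    Returns the mean |nominal confidence - empirical accuracy| over labels."""
    by_label = {}
    for label, correct in claims:
        by_label.setdefault(label, []).append(correct)
    gaps = [abs(NOMINAL[lab] - sum(hits) / len(hits))
            for lab, hits in by_label.items()]
    return sum(gaps) / len(gaps)

# Better-calibrated beliefs late in an episode should show a smaller gap.
early = [("likely", True), ("likely", False), ("certain", False)]
late = [("likely", True), ("likely", True), ("certain", True)]
print(calibration_gap(early), calibration_gap(late))  # 0.60 vs 0.15
```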
