FATE framework cuts harmful agent compliance 82.6% via trajectory repair
Researchers propose FATE, an on-policy self-evolution method that converts verifier-scored failure trajectories into repair supervision, reducing attack success rates and over-refusal without degrading task performance.
FATE is a self-evolution framework from Bo Yin, Qi Li, and Xinchao Wang that addresses tool-using LLM agent safety by learning from failure trajectories rather than final responses alone. The preprint, posted to HuggingFace on May 13, tackles a problem that has plagued open-weight agent deployments: unsafe tool calls, prompt injection compliance, harmful request fulfillment, and over-refusal of benign tasks—all failure modes that existing response-level alignment signals miss.
The core insight is that agents fail through trajectories, not just final outputs. A model might produce a seemingly safe answer while executing an unsafe tool call mid-chain, or refuse a harmless request because safety tuning was too broad. Existing alignment methods typically rely on off-policy signals or sparse response-level rewards, creating a safety-utility trade-off where improving one dimension degrades the other. FATE sidesteps this by using on-policy repair candidates scored by verifiers across four dimensions: security, utility, over-refusal control, and trajectory validity. The same policy that generated the failure proposes repair candidates, which are re-scored and filtered to produce dense trajectory-level supervision without expert demonstrations.
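The re-score-and-filter loop described above can be sketched as follows. This is an illustrative assumption, not the paper's implementation: the names (`Trajectory`, `DIMENSIONS`, `THRESHOLDS`) and the toy verifiers are invented here, since FATE's actual verifiers would be learned or rule-based scorers.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of FATE-style repair filtering. All names and
# thresholds are assumptions for illustration; the paper's verifiers
# are not these stubs.

@dataclass
class Trajectory:
    steps: list              # tool-call transcript from the agent
    scores: dict = field(default_factory=dict)

DIMENSIONS = ("security", "utility", "over_refusal", "validity")
THRESHOLDS = {"security": 0.9, "utility": 0.5, "over_refusal": 0.5, "validity": 1.0}

def rescore(traj, verifiers):
    # Score a repair candidate on all four dimensions.
    traj.scores = {dim: verifiers[dim](traj) for dim in DIMENSIONS}
    return traj

def filter_repairs(candidates, verifiers):
    # Keep only candidates that clear every per-dimension threshold;
    # survivors become dense trajectory-level supervision.
    kept = []
    for traj in candidates:
        rescore(traj, verifiers)
        if all(traj.scores[d] >= THRESHOLDS[d] for d in DIMENSIONS):
            kept.append(traj)
    return kept

# Toy verifiers for demonstration only.
verifiers = {
    "security": lambda t: 0.0 if "rm -rf" in " ".join(t.steps) else 1.0,
    "utility": lambda t: 1.0 if t.steps else 0.0,
    "over_refusal": lambda t: 0.0 if "I cannot help" in " ".join(t.steps) else 1.0,
    "validity": lambda t: 1.0,
}

unsafe = Trajectory(steps=["shell('rm -rf /tmp/data')"])
repaired = Trajectory(steps=["read_file('/tmp/data/report.txt')"])
kept = filter_repairs([unsafe, repaired], verifiers)  # only `repaired` survives
```

The key property the sketch captures is that supervision is gated per dimension rather than by a single aggregate score, so a candidate that is safe but useless (or useful but unsafe) never enters the training set.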
The framework introduces Pareto-Front Policy Optimization (PFPO), which combines supervised warmup with Pareto-aware policy optimization to preserve the safety-utility frontier during training. This matters for practitioners running open-weight agents in production: you want the model to refuse jailbreaks without also refusing legitimate edge-case requests. PFPO explicitly models that trade-off rather than collapsing it into a single scalar reward.
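PFPO's internals are not spelled out in this summary, but the underlying idea of preserving a frontier instead of a scalar can be shown with standard Pareto-dominance filtering over (safety, utility) scores. The scores below are made up for illustration.

```python
# Minimal Pareto-front sketch, assuming each candidate is scored on
# two objectives (safety, utility). Not PFPO itself, just the
# dominance relation it builds on.

def dominates(a, b):
    # a dominates b if a is at least as good everywhere and strictly
    # better in at least one objective.
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(points):
    # Keep points that no other point dominates.
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

# (safety, utility) scores for hypothetical candidates.
candidates = [(0.9, 0.4), (0.8, 0.8), (0.5, 0.9), (0.7, 0.7), (0.4, 0.3)]
front = pareto_front(candidates)
# (0.7, 0.7) is dominated by (0.8, 0.8), and (0.4, 0.3) by several points,
# so neither makes the front.
```

A scalar reward like `0.5 * safety + 0.5 * utility` would rank (0.9, 0.4) below (0.7, 0.7) even though the former is on the frontier; keeping the whole front is what lets training improve safety without silently trading away utility.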
Experiments on AgentDojo, AgentHarm, and ATBench show that FATE improves safety across different model scales while maintaining useful behavior. Compared to strong baselines, the method reduces attack success rate by 33.5% and harmful compliance by 82.6%, while improving external trajectory-safety diagnosis by 6.5%.
