GRAFT internalizes tool graphs as tokens to fix LLM planning errors
A new arXiv preprint introduces GRAFT, a framework that encodes tools and their dependencies as special tokens inside the language model, reducing invalid execution sequences in multi-step workflows.
Researchers have proposed GRAFT, a graph-tokenized language model framework that addresses a persistent failure mode in LLM tool planning: models generate semantically plausible tool sequences that violate execution dependencies. The preprint, posted to arXiv on May 13, argues that existing methods—retrieval, serialization, or prompt-level injection of tool graphs—treat the dependency graph as external context. When an early tool selection is incorrect, error accumulation pushes subsequent predictions off the valid execution path, and the model has no internal mechanism to recover.
GRAFT internalizes the tool graph by mapping each node to a dedicated special token and learning directed dependencies within the model's representation space. The framework also introduces on-policy tool context distillation, training the model on its own sampled trajectories while distilling stepwise planning signals. In experiments, GRAFT achieves state-of-the-art performance on exact sequence matching and dependency legality, suggesting the approach reduces the semantic-plausibility traps that plague prompt-level graph injection. By making constraint satisfaction part of the forward pass rather than a post-hoc filter, the model learns to avoid invalid states during generation.
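The constraint-satisfaction idea can be illustrated with a toy sketch. Note that the tool names, dependency graph, and greedy masking scheme below are invented for illustration; the preprint's actual mechanism operates on learned token representations inside the model, not an external mask like this one.

```python
# Toy illustration of dependency-legal decoding over tool tokens.
# The graph and scoring are hypothetical, not taken from the GRAFT paper.

# Directed dependency graph: a tool may run only after all prerequisites.
DEPS = {
    "fetch_data": set(),
    "clean_data": {"fetch_data"},
    "train_model": {"clean_data"},
    "report": {"train_model"},
}

def legal_next_tools(executed):
    """Tools not yet run whose prerequisites are all satisfied."""
    done = set(executed)
    return {t for t, pre in DEPS.items() if t not in done and pre <= done}

def decode_plan(scores):
    """Greedy plan: at each step, pick the highest-scoring *legal* tool.
    `scores` stands in for model logits over tool tokens."""
    plan = []
    while len(plan) < len(DEPS):
        candidates = legal_next_tools(plan)
        plan.append(max(candidates, key=scores.get))
    return plan

# Even if the model scores "report" highest, masking keeps the
# sequence on a valid execution path.
scores = {"fetch_data": 0.1, "clean_data": 0.2, "train_model": 0.3, "report": 0.9}
print(decode_plan(scores))  # ['fetch_data', 'clean_data', 'train_model', 'report']
```

The contrast the paper draws is that this kind of legality check is usually bolted on outside the model; GRAFT's claim is that learning the graph in the representation space lets the model avoid illegal states without an external filter.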
The core insight is that LLMs trained on natural-language corpora excel at generating plausible-sounding plans but lack native awareness of hard execution constraints. The on-policy distillation step is designed to close the gap between training-time oracle paths and inference-time sampling—a gap the authors identify as a major source of drift in multi-step workflows. What remains unclear is how the approach scales to tool graphs with hundreds of nodes, whether the special-token vocabulary can generalize to unseen tools without retraining, and how the method performs when tool descriptions are ambiguous or incomplete. The preprint does not yet include open weights or a public implementation, so practitioners will be watching for a code release and reproducibility details.
