SkillEvolBench reveals LLM agents struggle to convert experience into durable skills
A new benchmark with 180 tasks across six environments shows that current agents adapt locally but rarely form robust procedural skills from episodic trajectories.

SkillEvolBench is a diagnostic benchmark that evaluates whether large language model agents can distill episodic task experience into reusable procedural skills. The benchmark contains 180 tasks spanning six real-world agent environments, organized into role-conditioned task families that share underlying procedures.
Agents first learn from acquisition tasks, then update an external skill library using compacted trajectories and verifier feedback. They then face frozen deployment tasks that test context shift, adversarial shortcuts, and composition. The benchmark compares self-generated and curated-start skill evolution against no-skill and raw-trajectory controls, which separates procedural abstraction from base capability and from direct reuse of episodic traces.
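The acquire, update, then deploy protocol and its four conditions can be pictured with a minimal sketch. This is not the benchmark's code: the names `Memory`, `distill`, `run_protocol`, and `toy_agent` are hypothetical, the distillation step is a placeholder, and the sketch assumes that "frozen" means the library receives no further updates once deployment begins.

```python
from dataclasses import dataclass, field

# Condition names follow the article; everything else here is illustrative.
CONDITIONS = ("no_skill", "raw_trajectory", "self_generated", "curated_start")


def distill(trajectory, passed):
    """Placeholder abstraction: keep only the first step and the verifier outcome."""
    outcome = "do" if passed else "avoid"
    return f"{outcome}: {trajectory[0]}"


@dataclass
class Memory:
    condition: str
    entries: list = field(default_factory=list)
    frozen: bool = False

    def update(self, trajectory, passed):
        if self.frozen:
            raise RuntimeError("memory is frozen during deployment")
        if self.condition == "no_skill":
            return  # control: nothing is carried forward
        if self.condition == "raw_trajectory":
            self.entries.append(trajectory)  # control: store the episodic trace as-is
        else:
            self.entries.append(distill(trajectory, passed))  # skill conditions


def run_protocol(condition, acquisition, deployment, agent, curated=()):
    """Run acquisition with updates, then deployment with memory frozen."""
    seed = list(curated) if condition == "curated_start" else []
    memory = Memory(condition, seed)
    for task in acquisition:
        trajectory, passed = agent(task, memory.entries)
        memory.update(trajectory, passed)
    memory.frozen = True  # assumption: no updates once deployment starts
    results = [agent(task, memory.entries)[1] for task in deployment]
    return sum(results) / len(results), memory


def toy_agent(task, entries):
    """Stand-in agent: succeeds only when it has something in memory."""
    return [f"step for {task}"], len(entries) > 0


if __name__ == "__main__":
    for condition in CONDITIONS:
        rate, memory = run_protocol(condition, ["a", "b"], ["c"], toy_agent, curated=["seed"])
        print(condition, rate, len(memory.entries))
```

Running all four conditions on the same acquisition and deployment tasks is what lets the comparison attribute a difference to the stored representation (nothing, raw traces, or distilled skills) rather than to the underlying model.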
On deployment stability
Across ten model configurations and three agent harnesses, the researchers found that current agents often adapt locally but rarely form robust reusable skills. Skill-based conditions can improve acquisition or replay, and individual models sometimes gain on specific deployment axes, but these gains are unstable under frozen deployment. Raw-trajectory reuse frequently outperforms distilled skills, suggesting that current abstraction procedures discard contextual and procedural cues that remain useful for future tasks.
Capacity and cost analyses show that writing more skills or building larger resource libraries is not sufficient. Additional updates can improve coverage, but they also introduce episode-specific drift and procedural clutter. The benchmark positions itself as a testbed for measuring when one-off experience becomes durable procedural knowledge rather than task-local memory.

