STALE benchmark: LLM agents fail to update outdated memories even with new evidence in context
New 400-scenario benchmark shows frontier models reach only 55% accuracy when asked to detect and act on implicitly invalidated beliefs, even when updated evidence sits in context.
Large language model agents are supposed to remember your preferences and update them when circumstances change. A new benchmark reveals they're terrible at it.
STALE, a 400-scenario evaluation suite published this week, tests whether LLM agents can recognize when earlier memories have been invalidated by later context, without any explicit negation. The benchmark's 1,200 queries span everyday situations: a user moves cities, changes jobs, or adopts a new diet, and the agent must infer that prior stored beliefs are now stale. The best-performing model in the study achieved 55.2% overall accuracy. Most models failed to act on updated information even when it appeared verbatim in a 150K-token context window.
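The article summarizes the benchmark rather than reproducing it, so the schema below is only illustrative: a STALE-style scenario pairs stored beliefs with a later event that implicitly invalidates some of them, plus queries that probe whether the agent acts on the update. The field names (`memories`, `update_event`, `queries`) and the string-matching grader are hypothetical, not the benchmark's published format.

```python
from dataclasses import dataclass

# Hypothetical STALE-style scenario record; names are illustrative, not the
# benchmark's published schema.
@dataclass
class Scenario:
    memories: list[str]    # beliefs stored earlier in the interaction
    update_event: str      # later context that implicitly invalidates some of them
    queries: list[dict]    # probes for whether the agent acts on the update

example = Scenario(
    memories=[
        "User lives in Boston.",
        "User commutes on the Red Line.",
    ],
    # Note: no explicit negation ("I no longer live in Boston") appears anywhere.
    update_event="User: I just signed a lease on an apartment in Seattle.",
    queries=[
        {
            "prompt": "What's the weather like where I live?",
            "stale_mention": "Boston",   # answering with this means failure
            "fresh_mention": "Seattle",  # answering with this means success
        }
    ],
)

def used_stale_belief(response: str, query: dict) -> bool:
    """Crude grader: did the model answer from the invalidated belief?"""
    return query["stale_mention"].lower() in response.lower()
```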
What stands out
1. Implicit conflicts break retrieval-action coupling. Models often surface the correct updated fact during retrieval but still answer as if the old state were true. This gap between "finding" and "using" new evidence is the study's central result.
2. Premise resistance is the hardest dimension. When a user's query falsely presupposes a stale state ("What's the weather like in Boston?" after the user moved to Seattle), models accept the outdated framing rather than push back. This dimension saw the lowest accuracy across all tested systems.
3. State propagation rarely happens. Changing one aspect of a user's life (a new job) should invalidate related memories (the old commute route, the former office address). Models struggle to trace these dependencies without explicit prompting.
4. CUPMem prototype offers a path forward. The paper's baseline system uses structured state consolidation at write-time and propagation-aware search to adjudicate conflicts. It outperforms vanilla retrieval-augmented generation, suggesting that explicit state tracking, not just semantic search over raw text, may be necessary for robust long-term memory. A minimal sketch of the idea follows this list.
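
The article describes CUPMem only at this level of detail, so what follows is a minimal sketch of the two mechanisms named above, write-time consolidation and propagation-aware lookup, under assumed simplifications: incoming facts are already normalized to state slots, and the dependency map is hand-written (a real system would derive both with an LLM or a structured parser). Names like `StateStore` and `DEPENDENTS` are hypothetical, not the paper's API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class MemoryEntry:
    key: str       # normalized state slot, e.g. "home_city"
    value: str
    stale: bool = False
    updated_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

class StateStore:
    """Toy write-time consolidation: each incoming fact targets a state slot;
    overwriting a slot marks its dependents stale (propagation)."""

    # Hand-written dependency map (hypothetical): changing the key on the
    # left invalidates the slots on the right.
    DEPENDENTS = {
        "home_city": ["commute_route", "local_gym"],
        "employer": ["office_address", "commute_route"],
    }

    def __init__(self) -> None:
        self.slots: dict[str, MemoryEntry] = {}

    def write(self, key: str, value: str) -> list[str]:
        """Store a fact; return the slots invalidated by the update so the
        agent can re-verify them with the user."""
        invalidated: list[str] = []
        old = self.slots.get(key)
        if old is not None and old.value != value:
            invalidated.append(key)
            # Propagate: related beliefs can no longer be trusted either.
            for dep in self.DEPENDENTS.get(key, []):
                entry = self.slots.get(dep)
                if entry is not None and not entry.stale:
                    entry.stale = True
                    invalidated.append(dep)
        self.slots[key] = MemoryEntry(key=key, value=value)
        return invalidated

    def read(self, key: str) -> MemoryEntry | None:
        """Propagation-aware read: never serve a stale slot as current fact."""
        entry = self.slots.get(key)
        return None if entry is None or entry.stale else entry

store = StateStore()
store.write("home_city", "Boston")
store.write("commute_route", "Red Line")
print(store.write("home_city", "Seattle"))  # ['home_city', 'commute_route']
print(store.read("commute_route"))          # None: dependent belief now stale
```

The design choice worth noting is that invalidation happens at write time, so a later read can never serve the Red Line commute as current fact. Vanilla RAG, by contrast, keeps both the old and new text retrievable and defers the conflict to generation, which is exactly where the benchmark shows models falter.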
