MEME benchmark exposes 1% accuracy on multi-entity memory reasoning
New arXiv preprint reveals LLM-based agents fail catastrophically on dependency reasoning across persistent sessions, with tested memory systems averaging just 3% accuracy on cascade updates and 1% on absence reasoning.

MEME (Multi-entity & Evolving Memory Evaluation) is a benchmark from researchers at NAVER AI Lab and the University of Tübingen that tests how LLM-based agents handle information across multiple sessions when facts depend on each other or get deleted. The preprint, posted to arXiv on May 13, 2026, defines six tasks spanning multi-entity and evolving memory scenarios—three of which (Cascade, Absence, Deletion) have never been scored by prior work.
The team evaluated six memory systems across three paradigms on 100 controlled episodes. Every system collapsed on dependency reasoning: Cascade tasks (where updating one entity should propagate to related entities) averaged 3% accuracy, and Absence tasks (reasoning about missing dependencies) hit 1%. Static retrieval performance held up, but the moment facts had to be updated in tandem, or the agent had to reason about what was no longer there, the systems broke. Prompt optimization, deeper retrieval, reduced noise, and stronger base LLMs all failed to close the gap.
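To make the Cascade setting concrete, here is a minimal hypothetical sketch of the failure mode; the entity names, schema, and propagation logic are invented for illustration and are not MEME's actual data format or evaluation code:

```python
# Hypothetical Cascade scenario: facts about entities are linked, so updating
# one entity should propagate to its dependents. All names are illustrative.

memory = {
    "alice": {"employer": "Acme Corp", "office": "Berlin"},
    "acme_corp": {"headquarters": "Berlin"},
}

# Dependency: Alice's office location is derived from her employer's headquarters.
dependencies = {("alice", "office"): ("acme_corp", "headquarters")}

def update(entity: str, field: str, value: str) -> None:
    """Write a new fact and propagate it to any dependent entity fields."""
    memory[entity][field] = value
    for (dep_entity, dep_field), source in dependencies.items():
        if source == (entity, field):
            memory[dep_entity][dep_field] = value

# Later session: the company relocates. A memory system that handles cascades
# should now answer "Where does Alice work from?" with "Lisbon", not the
# stale "Berlin" stored earlier.
update("acme_corp", "headquarters", "Lisbon")
print(memory["alice"]["office"])  # -> Lisbon
```

The benchmark's point is that current memory systems behave like the version of this code without the propagation loop: the original fact about Alice is retrieved verbatim even after the fact it depends on has changed.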
Only one configuration partially recovered: a file-based agent running Claude Opus 4.7 as its internal LLM. That setup improved scores but cost roughly 70 times the baseline, making it impractical at scale. Code and data are live at https://seokwonjung-jay.github.io/meme-eval/. The paper lists Seokwon Jung, Alexander Rubinstein, Arnas Uselis, Sangdoo Yun, and Seong Joon Oh as authors.