MemEye benchmark exposes gaps in multimodal agent memory for visual reasoning
New framework measures whether AI agents preserve fine-grained visual evidence across time, revealing that current architectures fail at pixel-level detail retention and temporal state tracking.

MemEye is a visual-centric evaluation framework that tests whether multimodal agents retain the fine-grained visual details needed for later reasoning. Developed by Minghao Guo, Qingyue Jiao, Zeru Shi, Yihao Quan, Boxuan Zhang, and Danrui Li, the framework measures memory along two axes: the granularity of decisive visual evidence (scene-level down to pixel-level) and how that evidence must be used (single retrieval versus evolutionary synthesis across changing states).

Existing benchmarks let agents answer visually grounded questions using only captions or text traces, sidestepping the need to preserve actual visual data. MemEye counters this with eight life-scenario tasks and ablation-driven validation gates that check answerability, shortcut resistance, visual necessity, and reasoning structure.
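To make the gate logic concrete, the sketch below models the four checks as independent predicates applied to each candidate task item. This is an illustrative reading, not the paper's code: the `TaskItem` fields and the placeholder judges (`solvable_from_text_only`, `answer_survives_visual_ablation`) are hypothetical names standing in for whatever ablation procedure the authors actually run.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class TaskItem:
    """Hypothetical container for one MemEye-style task instance."""
    question: str
    answer: str
    visual_evidence: List[str]            # frame/crop identifiers the answer depends on
    text_trace: str                       # captions or logs the agent also sees
    reasoning_steps: List[str] = field(default_factory=list)

def solvable_from_text_only(question: str, text_trace: str) -> bool:
    """Placeholder for a text-only judge (e.g. a model queried without images).
    A real pipeline would substitute an actual model call here."""
    return False

def answer_survives_visual_ablation(question: str, text_trace: str) -> bool:
    """Placeholder for re-running the task with the visual evidence removed."""
    return False

# Each gate is a predicate; an item enters the benchmark only if all pass.
def answerability(item: TaskItem) -> bool:
    # The gold answer must actually be derivable from the stored evidence.
    return bool(item.visual_evidence) and bool(item.answer)

def shortcut_resistance(item: TaskItem) -> bool:
    # Drop items a caption/log-only solver can already answer.
    return not solvable_from_text_only(item.question, item.text_trace)

def visual_necessity(item: TaskItem) -> bool:
    # Ablate the images; if the answer still holds, pixels were never needed.
    return not answer_survives_visual_ablation(item.question, item.text_trace)

def reasoning_structure(item: TaskItem) -> bool:
    # Require an explicit multi-step chain linking evidence to the answer.
    return len(item.reasoning_steps) >= 2

GATES: List[Callable[[TaskItem], bool]] = [
    answerability, shortcut_resistance, visual_necessity, reasoning_structure,
]

def passes_all_gates(item: TaskItem) -> bool:
    return all(gate(item) for gate in GATES)
```

Treating each gate as an independent ablation mirrors the "ablation-driven" framing: an item survives only if removing a shortcut (text-only answers, dropped visuals, collapsed reasoning) actually breaks it.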
Evaluating 13 memory methods across four vision-language model backbones, the researchers found that current architectures struggle to preserve fine-grained visual details and reason about state changes over time. Three bottlenecks emerged: evidence routing (selecting which visual data to store), temporal tracking (maintaining state across sequences), and detail extraction (capturing pixel-level information rather than scene summaries). The preprint was posted to HuggingFace Papers on May 15, 2026.
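One way to read the three bottlenecks is as distinct stages of an agent's memory write path. The sketch below is an illustrative decomposition under that reading; the class, method names, and saliency-threshold routing rule are invented for clarity and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Observation:
    step: int
    frame_id: str
    caption: str
    saliency: float          # stand-in score for "is this frame decisive?"

class VisualMemory:
    """Illustrative memory write path split along the three reported bottlenecks."""

    def __init__(self, routing_threshold: float = 0.5):
        self.routing_threshold = routing_threshold
        self.stored_frames: List[str] = []        # which visual data was kept
        self.entity_states: Dict[str, str] = {}   # latest state per tracked entity
        self.details: Dict[str, str] = {}         # fine-grained notes per stored frame

    def route_evidence(self, obs: Observation) -> bool:
        # Bottleneck 1: evidence routing -- decide whether to store the frame at all.
        keep = obs.saliency >= self.routing_threshold
        if keep:
            self.stored_frames.append(obs.frame_id)
        return keep

    def track_state(self, entity: str, new_state: str) -> None:
        # Bottleneck 2: temporal tracking -- overwrite stale state so later
        # queries see the current value, not an early snapshot.
        self.entity_states[entity] = new_state

    def extract_detail(self, obs: Observation, detail: str) -> None:
        # Bottleneck 3: detail extraction -- keep pixel-level notes rather
        # than only a scene-level caption.
        self.details[obs.frame_id] = detail

    def recall_state(self, entity: str) -> Optional[str]:
        return self.entity_states.get(entity)
```

Under this framing, the reported failures correspond to storing the wrong frames, letting entity states go stale across a sequence, or keeping only scene summaries where pixel-level detail was the decisive evidence.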