MementoGUI equips GUI agents with learned memory control for long-horizon tasks
A new plug-in framework teaches multimodal language models to selectively store and retrieve visual interface events, replacing raw screenshot replay with learned memory control.

MementoGUI is a memory framework from researchers at the University of Rochester, IBM Research, and collaborators that gives GUI agents a learned controller for managing task state across extended interface sessions. The system addresses a common weakness in current GUI agents. They either replay every screenshot from a session, overwhelming the model with redundant frames, or collapse history into text summaries that discard the localized visual evidence needed to click the right button three steps later.
The framework introduces MementoCore, a controller that splits memory management into four specialized functions: step processing (deciding what to store from the current screen), memory compression (summarizing accumulated events), episodic writing (archiving reusable trajectories), and episodic selection (retrieving relevant past sessions). Working memory holds task-relevant interface events as text summaries paired with region-of-interest visual crops; episodic memory stores past trajectories and surfaces them through learned relevance scoring. The controller plugs into existing GUI agents built on multimodal large language models (MLLMs) without requiring backbone fine-tuning: it operates as a preprocessing layer that curates what the agent sees.
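To make the division of labor concrete, here is a minimal Python sketch of the two memory stores and the four controller functions. It is an illustration only, not the authors' implementation: every class, method, and field name is hypothetical, and simple rules (a relevance flag, a size cap, word overlap) stand in for the decisions that MementoCore learns.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# All names here are hypothetical; the article does not specify MementoGUI's data model.
Box = Tuple[int, int, int, int]  # (left, top, right, bottom) of a region-of-interest crop


@dataclass
class MemoryEvent:
    step: int
    summary: str                 # text summary of the interface event
    roi: Optional[Box] = None    # crop box paired with the summary, if any


@dataclass
class Episode:
    task: str
    events: List[MemoryEvent]


@dataclass
class MemoryStore:
    working: List[MemoryEvent] = field(default_factory=list)
    episodic: List[Episode] = field(default_factory=list)


class ToyController:
    """Rule-based stand-in for a learned memory controller (assumption, not MementoCore)."""

    def __init__(self, max_working: int = 4):
        self.max_working = max_working

    def process_step(self, store: MemoryStore, step: int, summary: str,
                     roi: Optional[Box], relevant: bool) -> None:
        # Step processing: decide what to store from the current screen.
        if relevant:
            store.working.append(MemoryEvent(step, summary, roi))
        if len(store.working) > self.max_working:
            self.compress(store)

    def compress(self, store: MemoryStore) -> None:
        # Memory compression: fold the oldest events into one text-only summary.
        keep = max(1, self.max_working // 2)
        old, recent = store.working[:-keep], store.working[-keep:]
        merged = MemoryEvent(old[0].step, " | ".join(e.summary for e in old))
        store.working = [merged] + recent

    def write_episode(self, store: MemoryStore, task: str) -> None:
        # Episodic writing: archive the finished trajectory and reset working memory.
        store.episodic.append(Episode(task, list(store.working)))
        store.working = []

    def select_episodes(self, store: MemoryStore, task: str, k: int = 1) -> List[Episode]:
        # Episodic selection: word overlap stands in for learned relevance scoring.
        query = set(task.lower().split())
        ranked = sorted(store.episodic,
                        key=lambda ep: len(query & set(ep.task.lower().split())),
                        reverse=True)
        return ranked[:k]
```

In this toy version the agent itself is never modified: it would simply receive whatever `store.working` and the selected episodes contain, which mirrors the plug-in design described above.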
Benchmark results
The team tested MementoGUI on GUI-Odyssey, MM-Mind2Web, and a new long-horizon benchmark called MementoGUI-Bench, which they built to evaluate decision-making over extended task sequences. On all three benchmarks, agents augmented with MementoGUI outperformed three baselines: no-history (the agent sees only the current screen), history-replay (the agent sees every prior screenshot), and text-only memory (summaries without visual crops). Larger MementoCore backbones (the controller is itself an MLLM) delivered stronger memory-augmented performance, suggesting that memory control benefits from scale in the same way action prediction does.
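The four conditions differ only in what the agent receives alongside the current screen. The following Python sketch spells that out under stated assumptions: the function name, the mode strings, and the representation of screenshots and crops as string identifiers are all hypothetical and chosen for illustration, not taken from the paper.

```python
from typing import List, Optional, Tuple

# Hypothetical record: (text summary, crop image id or None if no crop was kept).
MemoryItem = Tuple[str, Optional[str]]


def build_context(mode: str, current: str,
                  prior_screenshots: List[str],
                  memory: List[MemoryItem]) -> List[str]:
    """Return the ordered inputs the agent sees under each evaluation condition."""
    if mode == "no_history":
        # Only the current screen.
        return [current]
    if mode == "history_replay":
        # Every prior screenshot, then the current screen.
        return list(prior_screenshots) + [current]
    if mode == "text_only":
        # Stored summaries without visual crops.
        return [summary for summary, _ in memory] + [current]
    if mode == "memory_with_crops":
        # Stored summaries, each followed by its crop when one exists.
        items: List[str] = []
        for summary, crop in memory:
            items.append(summary)
            if crop is not None:
                items.append(crop)
        return items + [current]
    raise ValueError(f"unknown mode: {mode}")
```

The sketch makes the trade-off visible: history replay grows with every step of the session, while the two memory conditions grow only with the number of events the controller chose to keep.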
The authors developed a data curation pipeline that converts computer-use trajectories into training examples for the memory controller, pairing each interaction step with ground-truth memory operations. They also introduced MLLM-based evaluation metrics for semantic action matching (did the agent click the right element, even if pixel coordinates differ slightly?), task progress (how far did the agent advance toward the goal?), and memory consistency (does retrieved episodic memory actually match the current task context?). The preprint is available on Hugging Face Papers.