Foundation models learn to refine their own strategies mid-run without human feedback
A new research framework enables foundation models to refine their own prompts, skills, and memory online during a single run, achieving sustained progress in Pokemon without human intervention or environment resets.

Foundation models can iteratively improve their own decision-making scaffolding in real time, without the episode resets that current prompt-optimization methods require, according to a new paper from Google and collaborating researchers.
Continual Harness is a self-improving framework for embodied agents that formalizes what the team observed in their Gemini Plays Pokemon (GPP) experiments. GPP became the first AI system to complete Pokemon Blue, Yellow Legacy on hard mode, and Pokemon Crystal without losing a single battle. During the hardest stages, the agent began iterating on its own strategy through long-context memory, an emergent form of self-improvement that still depended on human-in-the-loop refinement. Continual Harness removes the human from that loop entirely.
The framework alternates between acting and refining its own prompt, sub-agents, skills, and memory, drawing on any past trajectory data. Starting from only a minimal environment interface—no curated knowledge, no hand-crafted tools, no domain scaffolding—Continual Harness substantially reduces button-press cost relative to a minimalist baseline on Pokemon Red and Emerald across frontier models. It recovers a majority of the gap to a hand-engineered expert harness, with gains that scale with model capability.
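The act/refine alternation is easiest to picture as a single unbroken loop. The sketch below is illustrative only: the `Harness` fields, `model.act`, `model.refine`, and the `env` interface are hypothetical stand-ins chosen to show the control flow, not the paper's actual API.

```python
# Minimal sketch of the act/refine loop described above. All names here
# (Harness, model.act, model.refine, the env interface) are hypothetical
# stand-ins, not the paper's actual implementation.
from dataclasses import dataclass, field

@dataclass
class Harness:
    """The scaffolding the agent is allowed to rewrite about itself."""
    prompt: str
    sub_agents: dict = field(default_factory=dict)
    skills: dict = field(default_factory=dict)
    memory: list = field(default_factory=list)

def continual_run(env, model, harness, n_cycles=100, act_steps=500):
    trajectories = []  # all past trajectory data stays available to refinement
    for _ in range(n_cycles):
        # Act phase: play the game with the current scaffolding.
        obs, traj = env.observe(), []
        for _ in range(act_steps):
            action = model.act(obs, harness)   # e.g. a button press
            obs = env.step(action)
            traj.append((obs, action))
        trajectories.append(traj)
        # Refine phase: the model rewrites its own prompt, sub-agents,
        # skills, and memory, conditioning on any past trajectory.
        harness = model.refine(harness, trajectories)
        # Note what is absent: no env.reset() anywhere in the loop.
    return harness, trajectories
```

The structure matters for what it omits: there is no reset between cycles, which is exactly what separates this setup from prompt-optimization methods that tune across independent episodes.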
The researchers then closed the loop with the model itself through an online process-reward co-learning mechanism. An open-source agent's rollouts through the refining harness are relabeled by a frontier teacher and used to update the model, driving sustained in-game milestone progress on Pokemon Red without resetting the environment between training iterations.
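That co-learning step can be sketched as a training layer on the same unbroken run. Again, this is a hypothetical illustration: `student`, `teacher.score`, and `student.update` are invented stand-ins for the open-source agent, the frontier relabeler, and the policy update, which the summary above does not specify at this level of detail.

```python
# Hypothetical sketch of online process-reward co-learning. `student` is
# the open-source agent, `teacher` the frontier relabeler; their methods
# are invented stand-ins, not the paper's actual interfaces.
def co_learning_run(env, student, teacher, harness, n_iters=50, act_steps=500):
    for _ in range(n_iters):
        # The student acts through the self-refining harness.
        obs, rollout = env.observe(), []
        for _ in range(act_steps):
            action = student.act(obs, harness)
            obs = env.step(action)
            rollout.append((obs, action))
        # The frontier teacher relabels each intermediate step with a
        # process reward, rather than scoring only the final outcome.
        rewards = [teacher.score(step, harness) for step in rollout]
        # The student is updated on its own relabeled rollout...
        student.update(rollout, rewards)       # e.g. a policy-gradient step
        # ...and the harness keeps refining, with no env.reset() between
        # training iterations: the next rollout resumes where this ended.
        harness = student.refine(harness, [rollout])
    return student, harness
```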