Vicarious conditioning lets RL agents learn from observation without demonstrator policies
A new arXiv preprint introduces vicarious conditioning as an intrinsic reward mechanism, enabling reinforcement learning agents to learn from demonstrators without needing their policies or reward functions.
Vicarious conditioning, a learning paradigm borrowed from psychology, now powers reinforcement learning agents that learn by watching others, with no access to a demonstrator's policy or reward function required. The approach, detailed in a preprint posted May 13, 2026, implements four memory-based steps drawn from cognitive science: attention, retention, reproduction, and reinforcement. Unlike off-policy methods that demand access to a demonstrator's internal mechanics, this technique lets agents extract useful behavior from observation alone, supporting low-shot scenarios where demonstration data is scarce.
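The summary does not spell out the authors' architecture, but the four steps map naturally onto a memory-based intrinsic reward module. The Python sketch below is one hypothetical reading under that assumption, not the paper's implementation: `attend` filters an observed demonstrator trajectory for salient states, `retain` stores their encodings, `reproduce` scores how closely the agent's current state matches memory, and `reinforce` folds that score into the reward. The class name, the `encoder` callable, and the saliency heuristic are all illustrative placeholders.

```python
import numpy as np

class VicariousConditioner:
    """Sketch of the four observational-learning steps as an intrinsic
    reward module. Hypothetical illustration; the preprint's encoder,
    saliency rule, and hyperparameters may differ."""

    def __init__(self, encoder, memory_size=10_000, bonus_scale=0.1):
        self.encoder = encoder        # callable: raw observation -> feature vector
        self.memory = []              # retention buffer of encoded demonstrator states
        self.memory_size = memory_size
        self.bonus_scale = bonus_scale

    def attend(self, demo_observations):
        """Attention: select salient observations; here, a crude heuristic
        that keeps the steps leading up to the episode's outcome."""
        return demo_observations[-50:]

    def retain(self, salient_observations):
        """Retention: encode and store observations in a bounded memory."""
        self.memory.extend(self.encoder(o) for o in salient_observations)
        self.memory = self.memory[-self.memory_size:]

    def reproduce(self, obs):
        """Reproduction: score how closely the agent's current state
        matches anything the demonstrator was seen doing."""
        if not self.memory:
            return 0.0
        z = self.encoder(obs)
        return -min(np.linalg.norm(z - m) for m in self.memory)

    def reinforce(self, extrinsic_reward, obs):
        """Reinforcement: fold the match score into the reward signal."""
        return extrinsic_reward + self.bonus_scale * self.reproduce(obs)
```

A fuller version would presumably tag retained states by the outcome of the trajectory they came from, rewarding proximity to states from successful episodes and penalizing proximity to states that preceded a fatal terminal; that is one way observation alone could steer an agent away from deaths the environment never signals.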
Traditional intrinsic reward mechanisms operate under direct conditioning: the agent must experience states and actions firsthand to learn from them. Off-policy and imitation learning methods can leverage external demonstrations, but they require either the demonstrator's policy parameters or access to its reward function. That dependency makes them impractical when observing agents in the wild or when the demonstrator uses a different architecture. Vicarious conditioning sidesteps both constraints by treating observation as a distinct learning channel, much as humans and animals learn socially without needing to reverse-engineer another's decision-making process.

The authors tested the method in MiniWorld Sidewalk, an environment with a non-descriptive terminal condition (the agent dies without receiving any reward signal), and extended trials to Box2D's CarRacing. Across both benchmarks, vicarious conditioning lengthened episode durations by steering agents away from fatal states and toward desirable outcomes, even when the environment provided no explicit feedback on failure.
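To see how treating observation as its own learning channel might plug into a training loop, the wrapper below sketches one plausible Gymnasium-style integration. It is an assumption-laden illustration, not the preprint's code: the wrapper class, the conditioner interface, and the MiniWorld environment id are placeholders, and nothing in the loop ever touches the demonstrator's policy or reward function.

```python
import gymnasium as gym

class VicariousRewardWrapper(gym.Wrapper):
    """Hypothetical integration sketch: shapes the extrinsic reward with
    the intrinsic bonus from the VicariousConditioner sketch above."""

    def __init__(self, env, conditioner):
        super().__init__(env)
        self.conditioner = conditioner

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        # Even when a terminal state carries no reward signal (the
        # non-descriptive death in MiniWorld Sidewalk), the intrinsic
        # term still pushes the agent toward states the demonstrator
        # was observed visiting safely.
        shaped = self.conditioner.reinforce(reward, obs)
        return obs, shaped, terminated, truncated, info

# Seeding uses only watched trajectories; no demonstrator internals.
# `demo_episodes` and the environment id are assumed for illustration:
# for episode in demo_episodes:
#     conditioner.retain(conditioner.attend(episode))
# env = VicariousRewardWrapper(gym.make("MiniWorld-Sidewalk-v0"), conditioner)
```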
