Agent-ValueBench: 4,335 tasks expose how harness design steers autonomous agent ethics
A new benchmark reveals that agent values diverge sharply from their underlying LLMs, and that harness design and embedded skills steer behavior far more than model alignment alone.
Researchers at Peking University and collaborators have released Agent-ValueBench, the first benchmark dedicated to measuring how autonomous AI agents navigate value conflicts in real executable tasks. The benchmark spans 394 environments across 16 domains, with 4,335 value-conflict tasks covering 28 value systems and 332 dimensions. Every task was co-synthesized through an end-to-end pipeline and curated per-instance by professional psychologists.
The team benchmarked 14 frontier models—both proprietary and open-weight—across four mainstream agent harnesses, including OpenClaw. Each task includes two "pole-aligned" golden trajectories that anchor a rubric-based judge for scoring agent behavior across multi-step action sequences, a structure that captures how values play out over time rather than in single outputs.
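The two-pole scoring setup can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual judge: the class names, the toy step-overlap similarity, and the [-1, 1] pole score are all assumptions standing in for the real rubric-based, multi-step evaluation.

```python
# Hypothetical sketch: score an agent trajectory on an axis anchored by two
# "pole-aligned" golden trajectories, one per side of the value conflict.
# All names and the similarity metric are illustrative, not from the paper.
from dataclasses import dataclass

@dataclass
class Trajectory:
    steps: list[str]  # textual record of the agent's multi-step actions

@dataclass
class Task:
    prompt: str
    pole_a: Trajectory  # golden trajectory aligned with value pole A
    pole_b: Trajectory  # golden trajectory aligned with value pole B

def step_overlap(traj: Trajectory, golden: Trajectory) -> float:
    """Toy similarity: fraction of golden steps the agent reproduced."""
    agent_steps = set(traj.steps)
    return sum(s in agent_steps for s in golden.steps) / len(golden.steps)

def judge(task: Task, agent_traj: Trajectory) -> float:
    """Score in [-1, 1]: +1 is fully pole-A aligned, -1 fully pole-B."""
    a = step_overlap(agent_traj, task.pole_a)
    b = step_overlap(agent_traj, task.pole_b)
    if a == b == 0.0:
        return 0.0  # trajectory matches neither pole
    return (a - b) / (a + b)

# Toy task: the value conflict is whether to disclose a conflict of interest.
task = Task(
    prompt="Deliver the report",
    pole_a=Trajectory(["read_file", "summarize", "disclose_conflict"]),
    pole_b=Trajectory(["read_file", "summarize", "omit_conflict"]),
)
run = Trajectory(["read_file", "summarize", "disclose_conflict"])
print(judge(task, run))  # positive: leans toward pole A
```

Anchoring the score between two concrete golden trajectories, rather than against a single reference answer, is what lets the judge place multi-step behavior on a value axis instead of grading it pass/fail.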
What stands out
1. Agent values diverge from LLM values. The agentic modality introduces dataset-, evaluation-, and system-level challenges absent from text-only benchmarks. An agent's behavior cannot be predicted from its underlying language model alone.
2. Harness design exerts more influence than model alignment. Agent values bend "non-additively" under harness pull: the framework wrapping the model (its tools, memory, planning loop) steers behavior more decisively than classical model fine-tuning or prompt engineering.
3. Skill steering outpaces model tuning. Embedding specific skills into an agent shifts its values more decisively than tuning the underlying model. The implication: agent alignment is migrating from model-level work toward harness alignment and skill-level steering.
4. Cross-model homogeneity beneath surface variation. The researchers observed a "Value Tide" of surprising uniformity in how agents resolve ethical conflicts, even when base LLMs differ. Interpretable counter-currents exist, but the overall pattern is convergence.
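One way to make the cross-model convergence finding concrete: if each agent's behavior is summarized as a vector of scores over value dimensions, a "Value Tide" shows up as high pairwise similarity between those vectors across different base models. The agents, scores, and cosine metric below are illustrative assumptions, not the paper's actual analysis.

```python
# Illustrative sketch: measure cross-model value convergence as the mean
# pairwise cosine similarity of per-dimension value profiles.
# The agent names and score vectors are made up for illustration.
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two equal-length score vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical value profiles: one score per value dimension, per agent.
profiles = {
    "agent_x": [0.8, 0.2, 0.6, 0.9],
    "agent_y": [0.7, 0.3, 0.5, 0.8],
    "agent_z": [0.9, 0.1, 0.7, 0.9],
}

names = list(profiles)
pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
mean_sim = sum(cosine(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)
print(f"mean pairwise similarity: {mean_sim:.3f}")  # near 1.0 = convergence
```

A high mean here would be the "tide"; the "interpretable counter-currents" would appear as individual pairs or dimensions that sit well below the overall average.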
