ChildAgentEval measures AI agents against child cognitive milestones
Researchers introduce ChildAgentEval, the first psychometric benchmark that measures how AI agents stack up against age-specific human cognitive milestones from the Wechsler Intelligence Scale for Children.

ChildAgentEval is a psychometrically grounded interactive benchmark that tests agents built on multimodal large language models (MLLMs) against the Wechsler Intelligence Scale for Children (WISC). The benchmark exposes where current AI agents fail at foundational tasks that children routinely solve, offering the first systematic comparison of agent reasoning performance to age-specific human developmental stages.
Despite advances in multimodal reasoning and tool integration, state-of-the-art AI agents still struggle with simple tasks that a child can handle with ease. ChildAgentEval frames the problem as cognitive age alignment: whether an agent's reasoning matches the performance expected of a human at a given developmental stage. By anchoring evaluation to WISC, the benchmark provides a concrete developmental yardstick rather than abstract capability claims.
The work, authored by Yifan Shen, Jiawen Zhang, Jian Xu, Junho Kim, Ismini Lourentzou, and Xu Cao, appeared on HuggingFace Papers on May 19, 2026. The results show that current MLLM-based agents cannot consistently simulate age-specific cognitive behavior. ChildAgentEval is presented as the first benchmark to apply psychometric grounding to interactive agent evaluation.