Elmes* automates LLM teaching evaluation across 330 educational scenarios
Researchers have released Elmes*, an automated system that constructs fine-grained rubrics to evaluate how large language models teach across 330 educational scenarios spanning 11 subjects and three grade bands.

Elmes is a framework that automates the construction of scenario-specific evaluation rubrics for large language models in education. Released on arXiv this week, the system addresses a fundamental gap: existing benchmarks measure what models know rather than how they teach. Elmes combines a multi-agent engine that simulates teacher-student-judge interactions with SceneGen, a self-evolving module that co-optimizes evaluation criteria and test data, starting from expert-defined pedagogical dimensions.
Using the framework, researchers built Edu-330, a benchmark covering 330 scenarios across 11 subjects, three grade bands, and 10 task types. The benchmark includes over 1,000 second-level indicators designed to capture fine-grained teaching behaviors, including the long-tail pedagogical situations that manual rubric design struggles to cover at scale.
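The scenario count is consistent with one scenario per combination of subject, grade band, and task type (11 × 3 × 10 = 330). The source does not say the benchmark is built as a full grid, so the following Python sketch treats that as an assumption; the labels are hypothetical placeholders, since only the counts are given.

```python
from itertools import product

# Hypothetical placeholder labels: the source gives only the counts,
# not the actual subject, grade-band, or task-type names.
subjects = [f"subject_{i}" for i in range(1, 12)]        # 11 subjects
grade_bands = [f"grade_band_{i}" for i in range(1, 4)]   # 3 grade bands
task_types = [f"task_type_{i}" for i in range(1, 11)]    # 10 task types

# Assumption: one scenario per (subject, grade band, task type) combination.
scenarios = list(product(subjects, grade_bands, task_types))

print(len(scenarios))  # 330
```

If the grid assumption holds, each scenario would be identified by a unique triple, which is one plausible way to index scenario-specific rubrics.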
Benchmark findings
Experiments revealed that educational capability is multidimensional. Top-tier LLMs differ primarily in creativity and values integration, while knowledge-strong models may fail at Socratic scaffolding, the practice of guiding students through questions rather than direct answers. InnoSpark, an education-specialized model, achieved the best human-evaluated average score across the benchmark.
LLM judges preserved human-comparable rankings with much lower scoring variance than human evaluators, but exhibited judge-specific biases, including self-preference. Few-shot anchoring with expert-scored examples improved alignment between human and LLM judgments, while the effects of enforced reasoning and greedy decoding varied by model. The researchers position Elmes* as scalable diagnostic infrastructure for pedagogically grounded LLM evaluation, particularly useful where manual rubric creation is impractical.
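As a rough illustration of what "human-comparable rankings with lower variance" can mean in practice, the Python sketch below compares model rankings from two sets of raters using Spearman rank correlation, then compares their average per-model score variance. The scores, the model names, and the helper functions are all made up for illustration; none of them are results or methods from the paper, which may measure agreement differently.

```python
from statistics import mean, pvariance

def ranks(values):
    """Rank values from highest (rank 1) to lowest; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(xs, ys):
    """Spearman rank correlation for tie-free data."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Made-up scores (three ratings per model); NOT data from the paper.
human_scores = {
    "model_a": [4.5, 3.5, 4.0],
    "model_b": [3.0, 4.0, 3.5],
    "model_c": [2.5, 3.5, 2.0],
    "model_d": [1.5, 2.5, 2.0],
}
llm_scores = {
    "model_a": [4.1, 4.0, 4.2],
    "model_b": [3.6, 3.5, 3.7],
    "model_c": [2.9, 3.0, 2.8],
    "model_d": [2.2, 2.1, 2.0],
}

models = sorted(human_scores)
human_means = [mean(human_scores[m]) for m in models]
llm_means = [mean(llm_scores[m]) for m in models]

# Ranking agreement between the two groups of raters.
rho = spearman(human_means, llm_means)

# Average within-model score variance for each group.
human_spread = mean(pvariance(human_scores[m]) for m in models)
llm_spread = mean(pvariance(llm_scores[m]) for m in models)

print(rho)                        # 1.0 (identical rankings in this toy data)
print(human_spread > llm_spread)  # True
```

A check for self-preference could follow the same pattern: compare the score a judge gives its own outputs against the score other judges give those same outputs.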
