EVA-Bench benchmarks voice agents: no system scores above 0.5 on both accuracy and experience
A new open-source framework shows voice agents failing robustness tests under accent and noise perturbations, with peak performance diverging sharply from reliable capability.

EVA-Bench, a new end-to-end evaluation framework for voice agents, tests both simulation realism and voice-specific failure modes in enterprise conversational AI systems. Researchers including Tara Bogavelli, Gabrielle Gauthier Melançon, and Hoang H. Nguyen orchestrated bot-to-bot audio conversations over 213 scenarios spanning three enterprise domains, automatically validating the user simulator's behavior and regenerating flawed dialogues before scoring.
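
The loop below is a minimal sketch of that orchestration pattern, not the framework's actual API: every function name (run_conversation, validate_simulator, score_dialogue) and the retry budget are illustrative assumptions, and the scoring is a placeholder.

```python
# Hypothetical sketch of an EVA-Bench-style pipeline: run a bot-to-bot
# conversation per scenario, validate the user simulator's behavior,
# regenerate flawed dialogues, then score only the valid ones.
from dataclasses import dataclass
import random

@dataclass
class Dialogue:
    scenario_id: str
    turns: list[str]
    simulator_ok: bool  # did the simulated user stay plausible/on-task?

def run_conversation(scenario_id: str) -> Dialogue:
    """Stand-in for the agent <-> user-simulator audio exchange."""
    turns = [f"user: request for {scenario_id}", "agent: response"]
    # Randomly flag some dialogues as flawed to exercise the retry path.
    return Dialogue(scenario_id, turns, simulator_ok=random.random() > 0.2)

def validate_simulator(dialogue: Dialogue) -> bool:
    """Placeholder check that the simulated user behaved as scripted."""
    return dialogue.simulator_ok

def score_dialogue(dialogue: Dialogue) -> dict:
    """Placeholder scorer returning accuracy- and experience-style scores."""
    return {"eva_a": random.random(), "eva_x": random.random()}

def evaluate(scenarios: list[str], max_regenerations: int = 3) -> list[dict]:
    results = []
    for scenario in scenarios:
        dialogue = run_conversation(scenario)
        attempts = 0
        # Regenerate flawed dialogues so scoring only sees valid simulations.
        while not validate_simulator(dialogue) and attempts < max_regenerations:
            dialogue = run_conversation(scenario)
            attempts += 1
        results.append({"scenario": scenario, **score_dialogue(dialogue)})
    return results

if __name__ == "__main__":
    print(evaluate(["billing_dispute", "flight_change", "account_reset"]))
```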
The framework introduces two composite metrics: EVA-A (Accuracy), which measures task completion, faithfulness, and speech fidelity; and EVA-X (Experience), which captures conversation flow, conciseness, and turn-taking timing.

Testing 12 systems spanning all three major voice agent architectures revealed substantial gaps. No system exceeded 0.5 on both EVA-A pass@1 and EVA-X pass@1 simultaneously, and peak performance diverged sharply from reliable performance, with a median gap of 0.44 between pass@k and pass^k scores on the accuracy metric. Accent and noise perturbations exposed further robustness issues, with mean performance drops of up to 0.314 depending on architecture, system, and metric. The full framework, evaluation suite, and benchmark data are available under an open-source license.
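
The pass@k versus pass^k gap is what separates peak from reliable performance. The sketch below assumes the common definitions (pass@k: at least one of k sampled attempts succeeds; pass^k: all k succeed) and standard combinatorial estimators; whether EVA-Bench computes them exactly this way is an assumption.

```python
# Illustration of "peak" (pass@k) vs. "reliable" (pass^k) performance,
# estimated from n attempts at a scenario of which c succeeded.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """P(at least one of k sampled attempts succeeds), estimated without replacement."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """P(all k sampled attempts succeed), estimated without replacement."""
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)

# Example: a scenario solved 6 times out of 10 attempts.
n, c, k = 10, 6, 4
print(f"pass@{k} = {pass_at_k(n, c, k):.3f}")   # peak: one success among k is likely
print(f"pass^{k} = {pass_hat_k(n, c, k):.3f}")  # reliability: every attempt must succeed
```

A system can look strong on pass@k while pass^k stays low, which is exactly the kind of divergence the reported 0.44 median gap describes.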