EVA-Bench finds no voice agent exceeds 0.5 on accuracy and experience combined
New open-source framework measures voice agents on task accuracy and conversational experience through bot-to-bot audio dialogues, revealing substantial gaps between peak and reliable performance.

EVA-Bench, a new end-to-end evaluation framework released May 14, tests voice agents—AI systems conducting spoken conversations to complete tasks—across 213 enterprise scenarios. The benchmark addresses two core evaluation challenges that existing frameworks have not jointly tackled: generating realistic simulated conversations and measuring quality across the full scope of voice-specific failure modes.
The framework orchestrates bot-to-bot audio conversations over dynamic multi-turn dialogues, with automatic simulation validation that detects user-simulator errors and regenerates flawed conversations before scoring (a minimal sketch of this loop appears below). It introduces two composite metrics: EVA-A (Accuracy), capturing task completion, faithfulness, and audio-level speech fidelity; and EVA-X (Experience), capturing conversation progression, spoken conciseness, and turn-taking timing. Both metrics apply uniformly across agent architectures, enabling direct cross-architecture comparison.
Results across 12 systems spanning all three voice agent architectures reveal substantial weaknesses: no system simultaneously exceeded 0.5 on both EVA-A pass@1 and EVA-X pass@1. Peak and reliable performance also diverged sharply, with a median gap of 0.44 on EVA-A between pass@k, which credits a scenario if at least one of k attempts succeeds, and pass^k, which requires all k attempts to succeed. Accent and noise perturbations exposed further robustness gaps, with mean degradation reaching 0.314 across metrics.
The full framework, evaluation suite, and benchmark data are available under an open-source license.
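To make the orchestration concrete, the following is a minimal Python sketch of a validate-then-regenerate-then-score loop of the kind described above. All names here (run_dialogue, validate_simulation, score_conversation, evaluate_scenario) and the retry budget are illustrative assumptions, not EVA-Bench's actual API.

```python
# Hypothetical sketch of the validate-then-regenerate-then-score loop for
# bot-to-bot simulated dialogues. Names, signatures, and the retry budget are
# illustrative assumptions, not EVA-Bench's actual API.
from dataclasses import dataclass, field

@dataclass
class Conversation:
    scenario_id: str
    turns: list = field(default_factory=list)             # alternating simulator/agent audio turns
    simulator_errors: list = field(default_factory=list)  # issues flagged by the validator

def run_dialogue(scenario_id: str) -> Conversation:
    """Placeholder: drive a multi-turn audio dialogue between the user simulator
    and the agent under test for one scenario."""
    return Conversation(scenario_id, turns=["<audio turn>", "<audio turn>"])

def validate_simulation(conv: Conversation) -> list:
    """Placeholder: return user-simulator errors (e.g., off-scenario requests);
    an empty list means the simulation is clean."""
    return conv.simulator_errors

def score_conversation(conv: Conversation) -> dict:
    """Placeholder: compute accuracy- and experience-style scores for a clean dialogue."""
    return {"scenario_id": conv.scenario_id, "eva_a": 0.0, "eva_x": 0.0}

def evaluate_scenario(scenario_id: str, max_regenerations: int = 3) -> dict:
    """Score only simulations that pass validation; regenerate flawed ones
    instead of penalizing the agent for simulator faults."""
    for _ in range(max_regenerations):
        conv = run_dialogue(scenario_id)
        if not validate_simulation(conv):
            return score_conversation(conv)
    # Every attempt produced simulator errors: mark the scenario unscorable.
    return {"scenario_id": scenario_id, "status": "simulation_invalid"}
```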
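The peak-versus-reliability gap is easiest to read off from how the two statistics are computed over repeated runs. The snippet below assumes the simplest empirical definitions over k runs per scenario; the helper names (pass_at_k, pass_hat_k) and the toy data are illustrative, not EVA-Bench's exact implementation.

```python
# Hedged sketch: empirical pass@k vs. pass^k over repeated runs per scenario.
# Data layout and pass/fail judgments are assumptions for illustration.
from statistics import mean

def pass_at_k(per_scenario_results: list[list[bool]]) -> float:
    """Fraction of scenarios where at least one of the k runs passed (peak)."""
    return mean(any(runs) for runs in per_scenario_results)

def pass_hat_k(per_scenario_results: list[list[bool]]) -> float:
    """Fraction of scenarios where every one of the k runs passed (reliability)."""
    return mean(all(runs) for runs in per_scenario_results)

# Toy example: 4 scenarios, k = 3 runs each (True = run passed).
results = [
    [True, True, True],     # reliably solved
    [True, False, True],    # solvable, but not reliably
    [False, True, False],   # solvable, but not reliably
    [False, False, False],  # never solved
]
gap = pass_at_k(results) - pass_hat_k(results)
print(pass_at_k(results), pass_hat_k(results), round(gap, 2))  # 0.75 0.25 0.5
```

An agent that solves a scenario occasionally but not consistently inflates pass@k without moving pass^k, which is the behavior the reported 0.44 median gap points to.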