EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents

Tara Bogavelli, Gabrielle Gauthier Melançon, Katrina Stankiewicz, Oluwanifemi Bamgbose, Fanny Riols, Hoang H. Nguyen, Raghav Mehndiratta, Lindsay Devon Brin, Joseph Marinier, Hari Subramani, Anil Madamala, Sridhar Krishna Nemala, Srinivas Sunkara

From arXiv. Work in progress.

Voice agents, artificial intelligence systems that conduct spoken conversations to complete tasks, are increasingly deployed across enterprise applications. However, no existing benchmark jointly addresses two core evaluation challenges: generating realistic simulated conversations, and measuring quality across the full scope of voice-specific failure modes. We present EVA-Bench, an end-to-end evaluation framework that addresses both. On the simulation side, EVA-Bench orchestrates bot-to-bot audio conversations over dynamic multi-turn dialogues, with automatic simulation validation that detects user-simulator errors and regenerates the affected conversations before scoring. On the measurement side, EVA-Bench introduces two composite metrics: EVA-A (Accuracy), capturing task completion, faithfulness, and audio-level speech fidelity; and EVA-X (Experience), capturing conversation progression, spoken conciseness, and turn-taking timing. Both metrics apply to all major agent architectures, enabling direct cross-architecture comparison. EVA-Bench includes 213 scenarios across three enterprise domains, a controlled perturbation suite for accent and noise robustness, and pass@1, pass@k, and pass^k measurements that distinguish peak from reliable capability. Across 12 systems spanning all three architectures, we find: (1) no system simultaneously exceeds 0.5 on both EVA-A pass@1 and EVA-X pass@1; (2) peak and reliable performance diverge substantially (median gap between pass@k and pass^k of 0.44 on EVA-A); and (3) accent and noise perturbations expose substantial robustness gaps, with effects varying across architectures, systems, and metrics (mean $\Delta$ up to 0.314). We release the full framework, evaluation suite, and benchmark data under an open-source license.
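
The abstract does not define pass@k and pass^k, so the following is only a minimal sketch under the commonly used definitions: pass@k is the probability that at least one of k sampled trials succeeds (peak capability), and pass^k is the probability that all k sampled trials succeed (reliable capability), both estimated from c successes observed in n trials. The function names, the trial counts, and the notion of a per-trial "success" are assumptions for illustration and may differ from how EVA-Bench computes its scores.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # Guard against inputs for which the estimators are undefined.
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k trials succeeds) from c successes in n trials."""
    _check(n, c, k)
    # comb(n - c, k) counts the k-subsets that contain only failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k trials succeed) from c successes in n trials."""
    _check(n, c, k)
    # comb(c, k) counts the k-subsets that contain only successes.
    return comb(c, k) / comb(n, k)


# Hypothetical scenario: 5 trials, 3 of them successful.
n_trials, n_success = 5, 3
peak = pass_at_k(n_trials, n_success, k=3)       # some draw of 3 succeeds
reliable = pass_hat_k(n_trials, n_success, k=3)  # every draw of 3 succeeds
gap = peak - reliable
```

In this hypothetical case the system succeeds in most trials, yet the two estimates sit far apart because only one of the ten possible 3-trial subsets is all successes. A gap of this kind is what finding (2) reports at the benchmark level.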
