Large-scale Speech Language Models (SpeechLMs) have enabled voice assistants that understand natural spoken queries and perform complex tasks. However, existing speech benchmarks largely focus on isolated capabilities such as transcription or question answering, and do not systematically evaluate agentic behavior or adversarial robustness. To address this, we introduce VoiceAgentBench, a comprehensive benchmark for evaluating SpeechLMs in realistic spoken agentic settings. It comprises 6,000+ synthetic spoken queries spanning single-tool invocations, multi-tool workflows, multi-turn dialogue, and safety evaluations across English and six Indic languages. We further simulate speaker variability with a novel sampling strategy that selects audio samples for TTS voice conversion based on speaker embeddings, with the goal of maximizing acoustic diversity. Our evaluation measures tool selection accuracy, structural consistency, and the correctness of tool invocations, as well as adversarial robustness. Across agentic tasks, ASR-LLM pipelines outperform end-to-end SpeechLMs, achieving up to 60.6% average parameter-filling accuracy on English, while SpeechLMs exhibit lower performance and sharper degradation on Indic languages. All models struggle in sequential workflows and safety evaluations, highlighting persistent limitations in tool orchestration, multilingual generalization, and safety robustness. VoiceAgentBench is publicly available on Hugging Face at https://huggingface.co/datasets/krutrim-ai-labs/VoiceAgentBench, and the codebase is released at https://github.com/ola-krutrim/VoiceAgentBench.
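The abstract does not spell out the sampling procedure, so the following is only an illustrative sketch of one common way to pick acoustically diverse speakers from embeddings: greedy farthest-point (max-min) selection under cosine distance. The function name `select_diverse_speakers`, its parameters, and the choice of cosine distance are assumptions made for illustration, not the authors' method.

```python
import numpy as np


def select_diverse_speakers(embeddings: np.ndarray, k: int, seed: int = 0) -> list[int]:
    """Greedy max-min selection of k row indices (hypothetical helper).

    Assumes `embeddings` has shape (n_speakers, dim) with non-zero rows.
    """
    n = embeddings.shape[0]
    if not 0 < k <= n:
        raise ValueError("k must satisfy 1 <= k <= number of speakers")

    # L2-normalise rows so that a dot product equals cosine similarity.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

    # Start from a random speaker (an arbitrary choice for this sketch).
    rng = np.random.default_rng(seed)
    first = int(rng.integers(n))
    chosen = [first]

    # min_dist[i] is the cosine distance from speaker i to its nearest chosen speaker.
    min_dist = 1.0 - unit @ unit[first]
    min_dist[first] = -np.inf  # never pick the same speaker twice

    while len(chosen) < k:
        # Pick the speaker that is farthest from everything chosen so far.
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, 1.0 - unit @ unit[nxt])
        min_dist[nxt] = -np.inf

    return chosen
```

With two tight clusters of embeddings and `k=2`, this heuristic picks one speaker from each cluster, which is the kind of spread a diversity-maximizing sampler is meant to produce.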