Voice assistants increasingly rely on Speech Language Models (SpeechLMs) to interpret spoken queries and execute complex tasks, yet existing benchmarks lack the domain breadth, acoustic diversity, and compositional reasoning complexity needed to evaluate tool-calling performance. We introduce Audio2Tool, a large-scale dataset of approximately 30,000 queries designed to assess the tool-calling capabilities of SpeechLMs across three primary domains: Smart Car, Smart Home, and Wearables. The benchmark features a multi-tier complexity hierarchy, ranging from simple direct commands to complex multi-intent queries and needle-in-a-haystack extraction, so that distinct failure modes can be isolated. To ensure realism, we use zero-shot voice-cloning text-to-speech synthesis and diverse noise profiles to simulate in-the-wild conditions. Evaluations of state-of-the-art SpeechLMs and ASR-LLM pipelines show strong performance on simple commands but significant degradation under compositional and acoustic challenges. Code and dataset are publicly available on the project page: https://audio2tool.github.io/.