Building on the success of large language models (LLMs), recent advancements such as GPT-4o have enabled real-time speech interactions through LLM-based voice assistants, offering a significantly improved user experience compared to traditional text-based interactions. However, the absence of benchmarks designed to evaluate these speech interaction capabilities has hindered the development of LLM-based voice assistants. Current evaluations focus primarily on automatic speech recognition (ASR) or general knowledge evaluation with clean speech, neglecting the more intricate, real-world scenarios that involve diverse speaker characteristics, environmental conditions, and content factors. To address this, we introduce VoiceBench, the first benchmark designed to provide a multi-faceted evaluation of LLM-based voice assistants. VoiceBench includes both real and synthetic spoken instructions that incorporate these three key real-world variations. Extensive experiments reveal the limitations of current LLM-based voice assistant models and offer valuable insights for future research and development in this field.