Speech Large Language Models (SLLMs) have expanded rapidly, supporting a wide range of tasks. These models are typically evaluated with text prompts, which may not reflect real-world scenarios where users interact through speech. To address this gap, we introduce DoWhatISay (DOWIS), a multilingual dataset of human-recorded spoken and written prompts designed to pair with any existing benchmark for realistic evaluation of SLLMs under spoken-instruction conditions. Spanning 9 tasks and 11 languages, it provides 10 prompt variants per task-language pair across 5 styles. Using DOWIS, we benchmark state-of-the-art SLLMs, analyzing the interplay between prompt modality, style, language, and task type. Results show that text prompts consistently outperform spoken prompts, particularly in low-resource and cross-lingual settings. Only for tasks with speech output do spoken prompts close the gap, highlighting the need for speech-based prompting in SLLM evaluation.