The rapid development of large language models (LLMs) has brought significant attention to speech models, particularly recent progress in speech2speech (S2S) protocols that support speech input and output. However, existing benchmarks adopt automatic text-based evaluators to assess the instruction-following ability of these models and lack consideration of paralinguistic information in both speech understanding and generation. To address these issues, we introduce S2S-Arena, a novel arena-style S2S benchmark that evaluates instruction-following capability with paralinguistic information, covering both speech input and speech output, across real-world tasks. We design 154 samples that fuse TTS-synthesized speech and live recordings across four domains and 21 tasks, and we manually evaluate popular existing speech models in an arena-style manner. The experimental results show that: (1) beyond the superior performance of GPT-4o, cascaded ASR-LLM-TTS speech models outperform jointly trained models with text-speech alignment in S2S protocols; (2) when paralinguistic information is taken into account, the knowledgeability of a speech model depends mainly on its LLM backbone, while its multilingual support is limited by its speech module; (3) excellent speech models can already understand paralinguistic information in speech input, but generating audio with appropriate paralinguistic information remains a challenge.
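The abstract states that model pairs are judged by humans in an arena-style manner but does not specify how pairwise votes are aggregated into a ranking; arena benchmarks commonly use an Elo-style rating for this step. The following is a minimal, hypothetical Python sketch of such an aggregation, assuming a simple win/loss match log; the model names and matches are illustrative and not taken from the paper:

```python
# Hypothetical Elo-style aggregation of arena-style pairwise judgments.
# The S2S-Arena abstract does not name its ranking method; Elo is simply
# a common choice in arena benchmarks and is used here as an assumption.

def update_elo(ratings, winner, loser, k=32.0):
    """Update two ratings in place after one human pairwise vote."""
    ra, rb = ratings[winner], ratings[loser]
    # Expected win probability of `winner` under the Elo model.
    expected_win = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
    ratings[winner] = ra + k * (1.0 - expected_win)
    ratings[loser] = rb - k * (1.0 - expected_win)

# Illustrative match log: (winning model, losing model) per vote.
matches = [("gpt-4o", "cascaded"), ("cascaded", "joint"), ("gpt-4o", "joint")]
ratings = {name: 1000.0 for pair in matches for name in pair}
for winner, loser in matches:
    update_elo(ratings, winner, loser)
print(sorted(ratings.items(), key=lambda kv: -kv[1]))  # highest-rated first
```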