S2S-Arena: Evaluating Speech2Speech Protocols on Instruction Following with Paralinguistic Information

The rapid development of large language models (LLMs) has brought significant attention to speech models, particularly recent progress in speech2speech protocols that support both speech input and speech output. However, existing benchmarks rely on automatic text-based evaluators to assess the instruction-following ability of these models, and they do not account for paralinguistic information in either speech understanding or speech generation. To address these issues, we introduce S2S-Arena, a novel arena-style S2S benchmark that evaluates instruction-following capabilities with paralinguistic information in both speech-in and speech-out across real-world tasks. We design 154 samples that combine TTS and live recordings, covering 21 tasks in four domains, and manually evaluate existing popular speech models in an arena-style manner. The experimental results show that: (1) apart from the superior performance of GPT-4o, a speech model built as a cascade of ASR, LLM, and TTS outperforms the jointly trained model after text-speech alignment in speech2speech protocols; (2) when paralinguistic information is considered, the knowledgeability of a speech model depends mainly on its LLM backbone, while its multilingual support is limited by the speech module; (3) excellent speech models can already understand the paralinguistic information in speech input, but generating appropriate audio with paralinguistic information remains a challenge.
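The abstract does not say how the manual arena-style judgments are aggregated into a ranking. As a minimal sketch, assuming each human judgment is a pairwise comparison between two models on the same sample, an Elo-style update is one common way to turn such comparisons into scores. The function name `elo_ratings`, the parameters `k` and `base`, and the model names below are hypothetical and not taken from the paper.

```python
from collections import defaultdict


def elo_ratings(comparisons, k=32.0, base=1000.0):
    """Aggregate pairwise judgments into Elo-style ratings.

    comparisons: iterable of (model_a, model_b, outcome) tuples, where
    outcome is "a" (model_a preferred), "b" (model_b preferred), or "tie".
    This is an illustrative aggregation scheme, not the paper's method.
    """
    ratings = defaultdict(lambda: base)
    outcome_to_score = {"a": 1.0, "b": 0.0, "tie": 0.5}
    for model_a, model_b, outcome in comparisons:
        ra, rb = ratings[model_a], ratings[model_b]
        # Expected score of model_a under the standard logistic Elo model.
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        score_a = outcome_to_score[outcome]
        # Zero-sum update: what model_a gains, model_b loses.
        ratings[model_a] = ra + k * (score_a - expected_a)
        ratings[model_b] = rb - k * (score_a - expected_a)
    return dict(ratings)


if __name__ == "__main__":
    # Hypothetical judgments on three samples.
    judgments = [
        ("model_x", "model_y", "a"),
        ("model_y", "model_z", "tie"),
        ("model_x", "model_z", "b"),
    ]
    for name, rating in sorted(elo_ratings(judgments).items(), key=lambda kv: -kv[1]):
        print(f"{name}: {rating:.1f}")
```

Elo ratings depend on the order in which comparisons are processed, so with a small sample set such as the 154 samples described here, plain per-model win rates may be an equally reasonable summary.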

