The recent wave of audio foundation models (FMs) could provide new capabilities for conversational modeling. However, there have been limited efforts to evaluate these audio FMs comprehensively on their ability to hold natural, interactive conversations. To engage in meaningful conversation with an end user, an FM must also manage a fluent succession of turns, without excessive overlapping speech or long stretches of silence. Motivated by this, we ask whether recently proposed audio FMs can understand, predict, and perform turn-taking events. To answer this question, we propose a novel evaluation protocol that assesses a spoken dialogue system's turn-taking capabilities using, as a judge, a supervised model trained to predict turn-taking events in human-human conversations. Using this protocol, we present the first comprehensive user study evaluating existing spoken dialogue systems on their ability to perform turn-taking events, revealing many interesting insights: for example, they sometimes do not understand when to speak up, can interrupt too aggressively, and rarely backchannel. We further evaluate multiple open-source and proprietary audio FMs accessible through APIs on carefully curated test benchmarks derived from Switchboard, measuring their ability to understand and predict turn-taking events, and we identify significant room for improvement. We will open-source our evaluation platform to promote the development of advanced conversational AI systems.