The recent wave of audio foundation models (FMs) could provide new capabilities for conversational modeling. However, there have been limited efforts to evaluate these audio FMs comprehensively on their ability to hold natural, interactive conversations. To engage in meaningful conversation with an end user, an FM must also manage a fluent succession of turns, without excessive overlapping speech or long stretches of silence. Motivated by this, we ask whether recently proposed audio FMs can understand, predict, and perform turn-taking events. To answer this question, we propose a novel evaluation protocol that assesses a spoken dialogue system's turn-taking capabilities using, as a judge, a supervised model trained to predict turn-taking events in human-human conversations. Using this protocol, we present the first comprehensive user study evaluating existing spoken dialogue systems on their ability to perform turn-taking events, revealing many interesting insights: for example, they sometimes do not understand when to speak up, can interrupt too aggressively, and rarely backchannel. We further evaluate multiple open-source and proprietary audio FMs accessible through APIs on carefully curated test benchmarks derived from Switchboard, measuring their ability to understand and predict turn-taking events, and we identify significant room for improvement. We will open-source our evaluation platform to promote the development of advanced conversational AI systems.