In human conversations, short backchannel utterances such as "yeah" and "oh" play a crucial role in facilitating smooth and engaging dialogue. These backchannels signal attentiveness and understanding without interrupting the speaker, making their accurate prediction essential for creating more natural conversational agents. This paper proposes a novel method for real-time, continuous backchannel prediction using a fine-tuned Voice Activity Projection (VAP) model. While existing approaches have relied on turn-based or artificially balanced datasets, our approach predicts both the timing and type of backchannels in a continuous and frame-wise manner on unbalanced, real-world datasets. We first pre-train the VAP model on a general dialogue corpus to capture conversational dynamics and then fine-tune it on a specialized dataset focused on backchannel behavior. Experimental results demonstrate that our model outperforms baseline methods in both timing and type prediction tasks, achieving robust performance in real-time environments. This research offers a promising step toward more responsive and human-like dialogue systems, with implications for interactive spoken dialogue applications such as virtual assistants and robots.
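The continuous, frame-wise decoding described above can be illustrated with a minimal sketch. Everything here is an assumption for illustration: `predict_frame` is a stub standing in for the fine-tuned VAP model, `BC_TYPES` is a hypothetical label set (no backchannel, a continuer like "yeah", an assessment like "oh"), and the threshold rule is one simple way to turn per-frame probabilities into timing decisions — not the paper's actual architecture or decoding procedure.

```python
# Hypothetical sketch of frame-wise backchannel timing/type prediction.
# predict_frame is a stand-in for a fine-tuned VAP model's output head.
import numpy as np

BC_TYPES = ["none", "continuer", "assessment"]  # assumed label set

def predict_frame(features: np.ndarray) -> np.ndarray:
    """Stub model: maps one frame of acoustic features to a probability
    distribution over BC_TYPES via a softmax over toy statistics."""
    logits = np.array([features.mean(), features.max(), features.min()])
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

def run_stream(frames: np.ndarray, threshold: float = 0.5):
    """Continuous decoding: at every frame, emit a backchannel event
    (frame index, type) whenever a non-'none' class clears the threshold."""
    events = []
    for t, frame in enumerate(frames):
        probs = predict_frame(frame)
        k = int(probs.argmax())
        if k != 0 and probs[k] > threshold:
            events.append((t, BC_TYPES[k]))
    return events

rng = np.random.default_rng(0)
frames = rng.normal(size=(10, 8))  # 10 frames of 8-dim dummy features
events = run_stream(frames)
```

Because decisions are made every frame rather than at turn boundaries, this decoding style matches the real-time, unbalanced-data setting the abstract describes: most frames yield no backchannel, and the model fires only when evidence for a specific type is strong.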