Predicting turn-taking in multiparty conversations has many practical applications in human-computer and human-robot interaction. However, the complexity of human communication makes it a challenging task. Recent work has shown that synchronous, multi-perspective egocentric data can significantly improve turn-taking prediction compared with asynchronous, single-perspective transcriptions. Building on this research, we propose a new multimodal transformer-based architecture for predicting turn-taking in embodied, synchronized multi-perspective data. Our experimental results on the recently introduced EgoCom dataset show a substantial performance improvement of up to 14.01% on average over existing baselines and alternative transformer-based approaches. The source code and pre-trained models of our 3M-Transformer will be made available upon acceptance.