Predicting turn-taking in multiparty conversations has many practical applications in human-computer and human-robot interaction. However, the complexity of human communication makes it a challenging task. Recent advances have shown that synchronous multi-perspective egocentric data can significantly improve turn-taking prediction compared to asynchronous, single-perspective transcriptions. Building on this research, we propose a new multimodal transformer-based architecture for predicting turn-taking in embodied, synchronized multi-perspective data. Our experimental results on the recently introduced EgoCom dataset show a substantial performance improvement of up to 14.01% on average compared to existing baselines and alternative transformer-based approaches. The source code and pre-trained models of our 3T-Transformer will be made available upon acceptance.