Current multiparty turn-taking models often rely on complex microphone arrays or multi-camera setups, limiting their applicability in human-robot interaction scenarios. We introduce MuVAP, a causal multimodal framework that extends Voice Activity Projection by grounding acoustic predictions in face tracks, enabling speaker-aware turn-taking predictions from a monaural audio stream and a single camera view. To address the combinatorial complexity of modeling multiple speakers, we propose Role-Relative Projection, which maps any N-speaker interaction onto a fixed current versus next floor-holder state. Because existing audiovisual datasets contain disruptive editing cuts that break causal tracking, we introduce the Audio-Visual Conversation Corpus, a 31-hour dataset of unedited, single-camera multiparty conversations. Evaluations demonstrate that MuVAP outperforms strong baselines on Shift-Hold and next-speaker prediction tasks across two- and three-speaker settings.
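To make the idea behind Role-Relative Projection concrete, the following is a minimal sketch of one way an N-speaker voice-activity matrix could be re-indexed into a fixed two-role (current versus next floor-holder) target. It is an illustration under stated assumptions, not the MuVAP implementation: the function name `role_relative_window`, the binary frame-level activity input, the "most recently active speaker" rule for the current floor-holder, and the "first other speaker to become active" rule for the next floor-holder are all hypothetical choices that the abstract does not specify.

```python
import numpy as np


def role_relative_window(va, t, horizon):
    """Re-index future voice activity by role instead of by speaker identity.

    Assumed (hypothetical) conventions, not taken from the paper:
      va      : (T, N) binary array, va[f, s] == 1 if speaker s is active at frame f
      t       : index of the current frame (only frames <= t count as past)
      horizon : number of future frames to project

    Returns (current, nxt, target), where target has two columns:
      column 0 = future activity of the current floor-holder
      column 1 = future activity of the next floor-holder
    """
    va = np.asarray(va, dtype=np.int8)
    past = va[: t + 1]
    future = va[t + 1 : t + 1 + horizon]
    target = np.zeros((future.shape[0], 2), dtype=np.int8)

    # Current floor-holder: speaker active in the most recent non-silent frame
    # (ties are broken toward the lowest speaker index).
    active_frames = np.where(past.any(axis=1))[0]
    if active_frames.size == 0:
        return None, None, target  # nobody has spoken yet
    current = int(np.argmax(past[active_frames[-1]]))
    target[:, 0] = future[:, current]

    # Next floor-holder: first *other* speaker to become active in the window.
    others = future.copy()
    others[:, current] = 0
    onsets = np.where(others.any(axis=1))[0]
    nxt = int(np.argmax(others[onsets[0]])) if onsets.size else None
    if nxt is not None:
        target[:, 1] = future[:, nxt]
    return current, nxt, target


# Three speakers: speaker 0 talks, a pause follows, then speaker 2, then speaker 1.
va3 = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [0, 0, 0],
    [0, 0, 1],
    [0, 0, 1],
    [0, 1, 0],
])
current3, next3, target3 = role_relative_window(va3, t=2, horizon=3)

# Same conversation with only two speakers (columns 0 and 2 of va3).
va2 = va3[:, [0, 2]]
current2, next2, target2 = role_relative_window(va2, t=2, horizon=3)
```

In this sketch the target keeps the same two-column shape whether the input has two or three speakers, which is the property the abstract attributes to Role-Relative Projection; how the actual model defines roles, discretizes the projection window, or handles overlap may differ.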