Speaker diarization is well studied for constrained audio but remains little explored for challenging in-the-wild videos, which have more speakers, shorter utterances, and speakers who appear on screen only inconsistently. We address this gap by proposing an audio-visual diarization model that combines audio-only and visual-centric sub-systems via late fusion. For audio, we show that an attractor-based end-to-end system (EEND-EDA) performs remarkably well when trained with our proposed recipe of a simulated proxy dataset. We also propose an improved version, EEND-EDA++, that uses attention in decoding and a speaker recognition loss during training to better handle the larger number of speakers. The visual-centric sub-system leverages facial attributes and lip-audio synchrony to estimate the identity and speech activity of on-screen speakers. Both sub-systems surpass the state of the art (SOTA) by a large margin, and the fused audio-visual system achieves a new SOTA on the AVA-AVD benchmark.
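As a rough illustration of what late fusion of the two sub-systems could look like, the sketch below combines per-frame speech activity probabilities from an audio-only model and a visual-centric model. It is not the fusion rule used in this work, which the abstract does not specify. The function name `late_fuse`, the weighted-average rule, the `visual_weight` and `threshold` parameters, and the assumption that both sub-systems already share the same speaker indexing are all hypothetical choices made for the example.

```python
from typing import List, Optional


def late_fuse(
    audio_probs: List[List[float]],
    visual_probs: List[List[Optional[float]]],
    visual_weight: float = 0.5,  # hypothetical fusion weight
    threshold: float = 0.5,      # hypothetical decision threshold
) -> List[List[bool]]:
    """Fuse per-frame, per-speaker activity probabilities.

    audio_probs[t][s]  : probability from the audio-only sub-system.
    visual_probs[t][s] : probability from the visual-centric sub-system,
                         or None when speaker s is off-screen at frame t.
    Assumes both inputs already use the same speaker indexing.
    """
    decisions = []
    for audio_frame, visual_frame in zip(audio_probs, visual_probs):
        frame = []
        for a, v in zip(audio_frame, visual_frame):
            if v is None:
                # No visual evidence: fall back to the audio estimate.
                fused = a
            else:
                # Weighted average of the two sub-system scores.
                fused = (1.0 - visual_weight) * a + visual_weight * v
            frame.append(fused >= threshold)
        decisions.append(frame)
    return decisions


# Two frames, two speakers; speaker 1 is off-screen in the first frame.
audio = [[0.9, 0.2], [0.4, 0.6]]
visual = [[0.8, None], [0.9, 0.1]]
result = late_fuse(audio, visual)
```

The fallback to audio when a face is missing reflects the point made in the abstract that speakers in in-the-wild videos are not always on screen, so the visual sub-system alone cannot cover every frame.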