In this paper, we propose a quality-aware end-to-end audio-visual neural speaker diarization framework comprising three key techniques. First, our audio-visual model takes both audio and visual features as inputs and uses a set of binary classification output layers to identify the activities of all speakers simultaneously. This end-to-end framework is designed to handle overlapping speech effectively, accurately discriminating speech from non-speech segments by exploiting multi-modal information. Second, we employ a quality-aware audio-visual fusion structure to address both audio degradations, such as noise, reverberation, and other distortions, and video degradations, such as occlusions, off-screen speakers, and unreliable face detection. Finally, a cross-attention mechanism applied to multi-speaker embeddings enables the network to handle scenarios with varying numbers of speakers. Experimental results on multiple datasets demonstrate the robustness of the proposed techniques in diverse acoustic environments. Even in scenarios with severely degraded video quality, our system attains performance comparable to the best available audio-visual systems.
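To make the three components concrete, the following is a minimal sketch in PyTorch, not the authors' actual implementation: the module names, layer sizes, and the specific gating form of the quality-aware fusion are illustrative assumptions. It combines per-frame quality-weighted fusion of the two modalities, cross-attention across speaker embeddings, and a shared binary speech-activity head applied to each speaker.

```python
import torch
import torch.nn as nn


class QualityAwareAVDiarization(nn.Module):
    """Hypothetical sketch of the described architecture: quality-aware
    audio-visual fusion, cross-attention over multi-speaker embeddings,
    and per-speaker binary speech-activity outputs. Dimensions are
    illustrative, not taken from the paper."""

    def __init__(self, audio_dim=256, visual_dim=256, d_model=256,
                 n_heads=4):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.visual_proj = nn.Linear(visual_dim, d_model)
        # Quality gate (assumed form): predicts per-frame reliability
        # weights for the two modalities from their joint embedding, so
        # degraded audio or video frames are down-weighted in the fusion.
        self.quality_gate = nn.Sequential(
            nn.Linear(2 * d_model, 2),
            nn.Softmax(dim=-1),
        )
        # Cross-attention lets each speaker's embedding attend to all
        # speakers' fused features, which is what allows a variable
        # number of speakers at inference time.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads,
                                                batch_first=True)
        # One shared binary classifier applied to every speaker stream.
        self.activity_head = nn.Linear(d_model, 1)

    def forward(self, audio_feats, visual_feats):
        # audio_feats:  (batch, time, audio_dim) frame-level features
        # visual_feats: (batch, n_spk, time, visual_dim) per-speaker
        #               lip/face features
        b, n_spk, t, _ = visual_feats.shape
        a = self.audio_proj(audio_feats)            # (b, t, d)
        a = a.unsqueeze(1).expand(-1, n_spk, -1, -1)  # (b, n_spk, t, d)
        v = self.visual_proj(visual_feats)            # (b, n_spk, t, d)
        # Quality-aware fusion: convex combination of the two modalities
        # with predicted per-frame reliability weights.
        w = self.quality_gate(torch.cat([a, v], dim=-1))  # (b, n_spk, t, 2)
        fused = w[..., :1] * a + w[..., 1:] * v           # (b, n_spk, t, d)
        # Cross-attention across the speaker axis at each frame.
        x = fused.permute(0, 2, 1, 3).reshape(b * t, n_spk, -1)
        x, _ = self.cross_attn(x, x, x)
        x = x.reshape(b, t, n_spk, -1).permute(0, 2, 1, 3)
        # Per-speaker, per-frame speech/non-speech logits.
        return self.activity_head(x).squeeze(-1)  # (b, n_spk, t)


# Usage with dummy inputs: 2 utterances, 4 tracked speakers, 100 frames.
model = QualityAwareAVDiarization()
audio = torch.randn(2, 100, 256)
video = torch.randn(2, 4, 100, 256)
logits = model(audio, video)
print(logits.shape)  # torch.Size([2, 4, 100])
```

Because the activity head is a set of independent binary classifiers rather than a single multi-class output, two or more speakers can be active in the same frame, which is how such end-to-end designs handle overlapping speech.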