In active speaker detection (ASD), the goal is to detect whether an on-screen person is speaking based on audio-visual cues. Previous studies have primarily focused on modeling the audio-visual synchronization cue, which depends on the video quality of the speaker's lip region. In real-world applications, reference speech of the on-screen speaker may also be available. To benefit from both the facial cue and the reference speech, we propose the Target Speaker TalkNet (TS-TalkNet), which leverages a pre-enrolled speaker embedding to complement the audio-visual synchronization cue in detecting whether the target speaker is speaking. Our framework outperforms the popular TalkNet model on two datasets, achieving absolute improvements of 1.6\% in mAP on the AVA-ActiveSpeaker validation set, and 0.8\%, 0.4\%, and 0.8\% in AP, AUC, and EER on the ASW test set, respectively. Code is available at \href{https://github.com/Jiang-Yidi/TS-TalkNet/}{\color{red}{https://github.com/Jiang-Yidi/TS-TalkNet/}}.