Active speaker detection (ASD) aims to determine whether an on-screen person is speaking based on audio-visual cues. Previous studies have primarily focused on modeling the audio-visual synchronization cue, which depends on the video quality of the speaker's lip region. In real-world applications, reference speech of the on-screen speaker may also be available. To benefit from both the facial cue and the reference speech, we propose Target Speaker TalkNet (TS-TalkNet), which leverages a pre-enrolled speaker embedding to complement the audio-visual synchronization cue when detecting whether the target speaker is speaking. Our framework outperforms the popular TalkNet model on two datasets, achieving an absolute improvement of 1.6\% in mAP on the AVA-ActiveSpeaker validation set, and absolute improvements of 0.8\%, 0.4\%, and 0.8\% in AP, AUC, and EER, respectively, on the ASW test set. Code is available at \href{https://github.com/Jiang-Yidi/TS-TalkNet/}{\color{red}{https://github.com/Jiang-Yidi/TS-TalkNet/}}.
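The idea of complementing a frame-level audio-visual cue with a pre-enrolled speaker embedding can be illustrated with a small sketch. The abstract does not specify how the two cues are combined, so everything below is an assumption made for illustration: the function name `fuse_and_score`, the choice of plain concatenation as the fusion step, the feature sizes, and the single logistic layer are all hypothetical and are not taken from the TS-TalkNet code.

```python
import math
import random


def fuse_and_score(av_frames, speaker_embedding, weights, bias=0.0):
    """Score each frame as speaking / not speaking.

    Hypothetical fusion: the utterance-level speaker embedding is
    concatenated to every frame-level audio-visual feature, and a single
    logistic layer maps the fused vector to a probability in (0, 1).
    """
    scores = []
    for frame in av_frames:
        # Repeat the same speaker embedding for every frame (assumption).
        fused = list(frame) + list(speaker_embedding)
        if len(fused) != len(weights):
            raise ValueError("weights must match the fused feature size")
        logit = sum(w * x for w, x in zip(weights, fused)) + bias
        scores.append(1.0 / (1.0 + math.exp(-logit)))
    return scores


# Toy inputs: 5 frames of 4-dim audio-visual features, 3-dim speaker embedding.
rng = random.Random(0)
av_frames = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(5)]
speaker_embedding = [rng.uniform(-1, 1) for _ in range(3)]
weights = [rng.uniform(-1, 1) for _ in range(7)]

scores = fuse_and_score(av_frames, speaker_embedding, weights)
print(len(scores))
```

In a real system the weights would be learned and the features would come from trained audio, visual, and speaker encoders; the sketch only shows where an enrolled embedding could enter the per-frame decision.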