Audiovisual active speaker detection (ASD) addresses the task of determining the speech activity of a candidate speaker given acoustic and visual data. Typically, systems model the temporal correspondence of audiovisual cues, such as the synchronisation between speech and lip movement. Recent work has explored extending this paradigm by additionally leveraging speaker embeddings extracted from reference speech of the candidate speaker. This paper proposes the speaker comparison auxiliary network (SCAN), which uses speaker-specific information from both reference speech and the candidate audio signal to disambiguate challenging scenes in which the visual signal is unresolvable. Furthermore, an improved method for enrolling face-speaker libraries is developed, which implements a self-supervised approach to video-based face recognition. In line with the recent proliferation of wearable devices, this work focuses on improving speaker-embedding-informed ASD in the context of egocentric recordings, which are characterised by acoustic noise and highly dynamic scenes. SCAN is implemented with two well-established baselines, namely TalkNet and Light-ASD, yielding relative improvements in mAP of 14.5% and 10.3%, respectively, on the Ego4D benchmark.
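For intuition, the sketch below illustrates the general idea of a speaker-comparison auxiliary branch: a speaker embedding extracted from the candidate audio is compared against an enrolled reference embedding, and the resulting similarity modulates the audiovisual ASD scores. This is a minimal illustration assuming a PyTorch implementation; the module name, dimensions, and the additive fusion rule are assumptions for exposition, not the published SCAN architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerComparisonAux(nn.Module):
    """Hypothetical speaker-comparison auxiliary branch (illustrative only).

    Compares per-frame speaker embeddings from the candidate audio against
    an enrolled reference-speech embedding, and fuses the similarity with
    the audiovisual ASD logits as an additive correction.
    """

    def __init__(self, hidden_dim: int = 64):
        super().__init__()
        # Small MLP mapping the scalar similarity to a per-frame logit bias.
        self.fusion = nn.Sequential(
            nn.Linear(1, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(
        self,
        av_logits: torch.Tensor,    # (T,) audiovisual ASD logits per frame
        audio_embeds: torch.Tensor, # (T, D) speaker embeddings from candidate audio
        ref_embed: torch.Tensor,    # (D,) enrolled reference speaker embedding
    ) -> torch.Tensor:
        # Cosine similarity between candidate-audio and reference embeddings.
        sim = F.cosine_similarity(audio_embeds, ref_embed.unsqueeze(0), dim=-1)
        # When lip movement is unresolvable, this audio-only evidence can
        # still raise or lower the speech-activity score for the candidate.
        bias = self.fusion(sim.unsqueeze(-1)).squeeze(-1)
        return av_logits + bias

# Example usage with dummy tensors (T frames, D-dimensional embeddings).
aux = SpeakerComparisonAux()
T, D = 25, 192
fused = aux(torch.randn(T), torch.randn(T, D), torch.randn(D))  # shape (T,)
```

The additive fusion here is one plausible design choice; the key property is that the branch supplies speaker-identity evidence that is independent of the visual stream, which is what allows disambiguation in noisy, dynamic egocentric footage.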