Current Active Speaker Detection (ASD) models achieve strong results on AVA-ActiveSpeaker (AVA) using only audio and facial features. Although this approach works in movie setups such as AVA, it is not suited to less constrained conditions. To demonstrate this limitation, we propose the Wilder Active Speaker Detection (WASD) dataset, which increases difficulty by targeting the two key components of current ASD: audio and face. Grouped into 5 categories, ranging from optimal conditions to surveillance settings, WASD poses incremental challenges for ASD through targeted impairment of audio and face data. We select state-of-the-art models and assess their performance on two groups of WASD: Easy (cooperative settings) and Hard (audio and/or face are specifically degraded). The results show that: 1) AVA-trained models maintain state-of-the-art performance on the WASD Easy group but underperform on the Hard group; 2) this indicates that AVA and Easy data are similar; and 3) training on WASD does not raise model performance to AVA levels, particularly under audio impairment and in surveillance settings. This shows that AVA does not prepare models for ASD in the wild and that current approaches are inadequate for such conditions. The proposed dataset also contains body data annotations to provide a new source of information for ASD, and is available at https://github.com/Tiago-Roxo/WASD.