State-of-the-art Active Speaker Detection (ASD) approaches rely heavily on audio and facial features, an approach that does not hold up in unconstrained (wild) scenarios. Although these methods achieve good results on the standard AVA-ActiveSpeaker dataset, a recent wilder ASD dataset (WASD) exposed the limitations of such models and highlighted the need for new approaches. As such, we propose BIAS, a model that, for the first time, combines audio, face, and body information to accurately predict active speakers in varying/challenging conditions. Additionally, we design BIAS to provide interpretability by proposing a novel use for Squeeze-and-Excitation blocks, namely for attention heatmap creation and feature-importance assessment. To complete the interpretability setup, we annotate an ASD-related actions dataset (ASD-Text) and use it to fine-tune a ViT-GPT2 model that generates textual scene descriptions complementing BIAS interpretability. The results show that BIAS is state-of-the-art in challenging conditions where body-based features are of utmost importance (Columbia, open settings, and WASD), and yields competitive results on AVA-ActiveSpeaker, where the face is more influential than the body for ASD. BIAS interpretability also reveals which features/aspects are most relevant to ASD prediction in varying settings, making it a strong baseline for further developments in interpretable ASD models. BIAS is available at https://github.com/Tiago-Roxo/BIAS.
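As a rough illustration only, and not the paper's implementation, the sketch below shows how a standard Squeeze-and-Excitation block can expose its channel excitation weights so they can be read as per-channel feature-importance scores or broadcast into attention heatmaps, in the spirit of the interpretability use described above; all class and variable names here are hypothetical.

```python
# Minimal sketch (assumption: not the BIAS implementation) of reading a
# Squeeze-and-Excitation block's excitation weights as importance scores.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Standard SE block that also returns its channel excitation weights."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)        # squeeze: global context per channel
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                          # excitation weights in [0, 1]
        )

    def forward(self, x: torch.Tensor):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c))       # (B, C) channel importance
        out = x * w.view(b, c, 1, 1)               # reweight the feature maps
        # w can be averaged over samples to rank features, or multiplied
        # back onto spatial maps to produce attention-style heatmaps.
        return out, w

if __name__ == "__main__":
    se = SEBlock(channels=64)
    feats = torch.randn(2, 64, 14, 14)             # e.g. body/face feature maps
    _, weights = se(feats)
    print(weights.mean(dim=0).topk(5).indices)     # 5 most excited channels
```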