Active speaker detection is a challenging task in audio-visual scene understanding that aims to detect who is speaking in scenes with one or more speakers. The task has received extensive attention because it is crucial in applications such as speaker diarization, speaker tracking, and automatic video editing. Existing studies try to improve performance by taking multiple candidates as input and designing complex models. Although these methods achieve outstanding performance, their high memory and computational cost makes them difficult to apply in resource-limited scenarios. Therefore, we construct a lightweight active speaker detection architecture by reducing the number of input candidates, splitting 2D and 3D convolutions for audio-visual feature extraction, and applying a gated recurrent unit (GRU) with low computational complexity for cross-modal modeling. Experimental results on the AVA-ActiveSpeaker dataset show that our framework achieves competitive mAP (94.1% vs. 94.2%) while its resource costs are significantly lower than those of the state-of-the-art method, especially in model parameters (1.0M vs. 22.5M, about 23x) and FLOPs (0.6G vs. 2.6G, about 4x). In addition, our framework also performs well on the Columbia dataset, which indicates good robustness. The code and model weights are available at https://github.com/Junhua-Liao/Light-ASD.
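To illustrate why splitting convolutions reduces model size, the following minimal sketch counts the weights of a full 3D convolution and of the same kernel split into a spatial convolution followed by a temporal one. It also recomputes the resource ratios quoted above. The channel widths, the kernel size, and the helper names `conv3d_params` and `split_conv_params` are assumptions chosen for illustration; they are not taken from the Light-ASD implementation, and the exact form of the split used there is not specified here.

```python
def conv3d_params(c_in, c_out, kt, kh, kw, bias=False):
    """Number of weights in a full 3D convolution with a kt x kh x kw kernel."""
    return c_in * c_out * kt * kh * kw + (c_out if bias else 0)


def split_conv_params(c_in, c_out, kt, kh, kw, bias=False):
    """Number of weights when the kernel is split into a spatial
    (1 x kh x kw) convolution followed by a temporal (kt x 1 x 1) one."""
    spatial = conv3d_params(c_in, c_out, 1, kh, kw, bias)
    temporal = conv3d_params(c_out, c_out, kt, 1, 1, bias)
    return spatial + temporal


# Hypothetical layer sizes, used only to make the comparison concrete.
C_IN, C_OUT, K = 32, 64, 3

full = conv3d_params(C_IN, C_OUT, K, K, K)
split = split_conv_params(C_IN, C_OUT, K, K, K)
print(f"full 3D conv weights: {full}")
print(f"split conv weights:   {split}")
print(f"reduction:            {full / split:.2f}x")

# Resource ratios recomputed from the figures reported in the abstract.
param_ratio = 22.5 / 1.0  # model parameters, in millions
flops_ratio = 2.6 / 0.6   # FLOPs, in billions
print(f"parameter ratio: {param_ratio:.1f}x")
print(f"FLOPs ratio:     {flops_ratio:.1f}x")
```

For these assumed sizes the split layer needs fewer weights than the full 3D layer. The saving depends on the channel widths and kernel size, so it should not be read as the reduction achieved by the actual model.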