Conventional audio-visual approaches to active speaker detection (ASD) typically rely on visually pre-extracted face tracks and the corresponding single-channel audio to find the speaker in a video. They therefore tend to fail whenever the speaker's face is not visible. We demonstrate that a simple audio convolutional recurrent neural network (CRNN) trained with spatial input features extracted from multichannel audio can perform simultaneous horizontal active speaker detection and localization (ASDL), independently of the visual modality. To address the time and cost of generating ground-truth labels to train such a system, we propose a new self-supervised training pipeline that adopts a ``student-teacher'' learning approach. A conventional pre-trained active speaker detector serves as the ``teacher'' network and provides the positions of the speakers as pseudo-labels. The multichannel audio ``student'' network is trained to reproduce these results. At inference, the student network can generalize and also locate occluded speakers that the teacher network cannot detect visually, yielding considerable improvements in recall rate. Experiments on the TragicTalkers dataset show that an audio network trained with the proposed self-supervised learning approach can exceed the performance of typical audio-visual methods and produce results competitive with costly conventional supervised training. We also show that further improvements can be achieved when minimal manual supervision is introduced into the learning pipeline. Additional gains may be sought with larger training sets and by integrating vision with the multichannel audio system.
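The student-teacher idea described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the `TeacherDetection` record, the 0.5 confidence threshold, the restriction to at most one active speaker per frame, the normalization of the horizontal position by frame width, and the loss (binary cross-entropy on speech activity plus a squared position error that is masked when the teacher reports no speaker). The actual label format and training objective may differ.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TeacherDetection:
    """Hypothetical per-frame output of an audio-visual teacher for one face track."""
    x_min: float  # left edge of the face box, in pixels
    x_max: float  # right edge of the face box, in pixels
    score: float  # teacher's speaking confidence in [0, 1]


def pseudo_label(detections: List[TeacherDetection],
                 frame_width: float,
                 threshold: float = 0.5) -> Tuple[float, float]:
    """Turn teacher detections for one frame into (activity, horizontal position).

    The position is the centre of the most confident speaking face,
    normalized to [0, 1] by the frame width. If no face passes the
    threshold, the frame is labelled as silent.
    """
    active = [d for d in detections if d.score >= threshold]
    if not active:
        return 0.0, 0.0
    best = max(active, key=lambda d: d.score)
    centre = 0.5 * (best.x_min + best.x_max)
    return 1.0, centre / frame_width


def student_loss(pred_activity: float, pred_position: float,
                 label_activity: float, label_position: float,
                 eps: float = 1e-7) -> float:
    """Assumed per-frame loss: BCE on activity plus masked squared position error."""
    p = min(max(pred_activity, eps), 1.0 - eps)  # clamp to avoid log(0)
    bce = -(label_activity * math.log(p)
            + (1.0 - label_activity) * math.log(1.0 - p))
    # The position term only counts on frames the teacher marked as active.
    position_error = label_activity * (pred_position - label_position) ** 2
    return bce + position_error


# Example frame: one confident speaker on the left, one silent face on the right.
frame = [TeacherDetection(100.0, 300.0, 0.9), TeacherDetection(500.0, 700.0, 0.2)]
activity, position = pseudo_label(frame, frame_width=1000.0)
loss = student_loss(0.5, 0.2, activity, position)
```

In this sketch the student never sees the video; it only receives the pseudo-labels as targets. That is why, once trained on multichannel audio, it could in principle respond to speakers whose faces the teacher would miss.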