This report describes our approach to the Audio-Visual Diarization (AVD) task of the Ego4D Challenge 2022. Specifically, we present multiple technical improvements over the official baselines. First, we improve detection of the camera wearer's voice activity by modifying the training scheme of the corresponding model. Second, we find that an off-the-shelf voice activity detection model effectively removes false positives when it is applied solely to the camera wearer's voice activity detections. Lastly, we show that better active speaker detection leads to a better AVD outcome. Our final method obtains a diarization error rate (DER) of 65.9% on the Ego4D test set, significantly outperforming all the baselines. Our submission achieved 1st place in the Ego4D Challenge 2022.
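The second improvement, filtering the camera wearer's voice activity with an off-the-shelf voice activity detector, could be sketched roughly as follows. This is a minimal illustration rather than the submitted system: it assumes both models emit per-frame binary decisions on the same time grid, and the names `suppress_false_positives`, `frames_to_segments`, and `frame_seconds` are hypothetical.

```python
def suppress_false_positives(wearer_frames, vad_frames):
    """Keep a camera-wearer speech frame only if the generic VAD also fires.

    Both inputs are assumed to be equal-length sequences of 0/1 (or bool)
    decisions on the same frame grid; this alignment is an assumption.
    """
    if len(wearer_frames) != len(vad_frames):
        raise ValueError("frame sequences must have the same length")
    return [bool(w) and bool(v) for w, v in zip(wearer_frames, vad_frames)]


def frames_to_segments(frames, frame_seconds):
    """Convert per-frame decisions into (start, end) segments in seconds."""
    segments = []
    start = None
    for i, active in enumerate(frames):
        if active and start is None:
            start = i  # a speech segment opens at this frame
        elif not active and start is not None:
            segments.append((start * frame_seconds, i * frame_seconds))
            start = None
    if start is not None:
        # The sequence ended while speech was still active.
        segments.append((start * frame_seconds, len(frames) * frame_seconds))
    return segments


# Hypothetical frame-level outputs, 0.5 s per frame.
wearer = [1, 1, 1, 0, 1, 1, 0, 0]  # camera-wearer voice activity model
vad = [0, 1, 1, 0, 0, 1, 1, 0]     # off-the-shelf VAD
filtered = suppress_false_positives(wearer, vad)
segments = frames_to_segments(filtered, frame_seconds=0.5)
```

The filter only ever removes frames from the camera wearer's track, so it can reduce false alarms but cannot recover missed speech; the other speakers' tracks are left untouched, in line with the report's statement that the detector is applied solely to the camera wearer.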