In recent years, associations between the faces and voices of celebrities have been established by leveraging large-scale audio-visual information from YouTube. The availability of large-scale audio-visual datasets has been instrumental in developing speaker recognition methods based on standard Convolutional Neural Networks. The aim of this paper is therefore to leverage large-scale audio-visual information to improve speaker recognition. To this end, we propose a two-branch network that learns joint representations of faces and voices in a multimodal system. Features are then extracted from the two-branch network to train a classifier for speaker recognition. We evaluate the proposed framework on VoxCeleb1, a large-scale audio-visual dataset. Our results show that adding facial information improves speaker recognition performance. Moreover, they indicate that there is an overlap between face and voice information.
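The pipeline described in the abstract (two branches mapping faces and voices into a joint space, then features passed to a speaker classifier) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the paper's actual architecture: the class name `TwoBranchEmbedder`, the feature dimensions, the use of a single untrained random linear layer per branch, and the fusion by concatenation are all hypothetical, and no training loss is shown.

```python
import math
import random


def make_linear(in_dim, out_dim, rng):
    """Random weight matrix standing in for a trained branch (hypothetical)."""
    scale = 1.0 / math.sqrt(in_dim)
    return [[rng.uniform(-scale, scale) for _ in range(in_dim)]
            for _ in range(out_dim)]


def project(weights, x):
    """Apply a linear map: one dot product per output unit."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def l2_normalize(v, eps=1e-12):
    """Scale a vector to (approximately) unit Euclidean length."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / (norm + eps) for c in v]


class TwoBranchEmbedder:
    """Illustrative two-branch model: one branch per modality.

    Both branches project into a space of the same size (joint_dim),
    so face and voice embeddings become directly comparable.
    """

    def __init__(self, face_dim, voice_dim, joint_dim, seed=0):
        rng = random.Random(seed)
        self.face_w = make_linear(face_dim, joint_dim, rng)
        self.voice_w = make_linear(voice_dim, joint_dim, rng)

    def embed_face(self, face_feat):
        return l2_normalize(project(self.face_w, face_feat))

    def embed_voice(self, voice_feat):
        return l2_normalize(project(self.voice_w, voice_feat))

    def joint_features(self, face_feat, voice_feat):
        """Concatenate both embeddings (an assumed fusion choice) so that
        a downstream speaker classifier can be trained on the result."""
        return self.embed_face(face_feat) + self.embed_voice(voice_feat)


# Toy inputs standing in for precomputed face and voice descriptors.
model = TwoBranchEmbedder(face_dim=8, voice_dim=6, joint_dim=4)
face = [0.1 * i for i in range(1, 9)]
voice = [0.2 * i for i in range(1, 7)]
features = model.joint_features(face, voice)
print(len(features))  # 8: joint_dim values from each branch
```

In practice each branch would be a trained network and `features` would be fed, together with speaker labels, to any standard classifier. Concatenation is only one possible way to combine the two modalities.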