Multimodal learning integrates information from multiple modalities to improve learning and comprehension. We compare three modality fusion strategies for person identification and verification using two modalities: voice and face. In this paper, a one-dimensional convolutional neural network is employed to extract x-vectors from voice, while the pre-trained VGGFace2 network with transfer learning handles the face modality. In addition, the gammatonegram is used as a speech representation in combination with the pre-trained Darknet19 network. The proposed systems are evaluated with K-fold cross-validation on the 118 speakers of the VoxCeleb2 test set. Single-modality systems and the three proposed multimodal strategies are compared under identical conditions. Results show that feature-level fusion of gammatonegram and facial features achieves the highest identification performance, with an accuracy of 98.37%, while concatenating facial features with the x-vector achieves an equal error rate (EER) of 0.62% in the verification task.
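As a concrete illustration of the feature fusion strategy, the minimal sketch below concatenates L2-normalized embeddings from the two modalities into a single joint vector. The embedding dimensions (2048 for the face network, 512 for the x-vector) are illustrative assumptions, not values reported above.

```python
import numpy as np

def fuse_features(face_emb: np.ndarray, voice_xvec: np.ndarray) -> np.ndarray:
    """Feature-level fusion: L2-normalize each modality, then concatenate."""
    face = face_emb / np.linalg.norm(face_emb)
    voice = voice_xvec / np.linalg.norm(voice_xvec)
    return np.concatenate([face, voice])

# Example with assumed dimensions: a 2048-d face embedding and a 512-d x-vector
face_emb = np.random.randn(2048)
voice_xvec = np.random.randn(512)
fused = fuse_features(face_emb, voice_xvec)  # 2560-d joint representation
```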
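The verification result above is reported as an equal error rate (EER), the operating point where the false-accept and false-reject rates coincide. A minimal sketch of computing EER from pair-wise similarity scores and ground-truth labels follows; the function name and interface are hypothetical, not from the paper.

```python
import numpy as np

def compute_eer(scores: np.ndarray, labels: np.ndarray) -> float:
    """Equal error rate: the threshold where false-accept rate == false-reject rate.
    scores: similarity scores, higher means more likely the same person.
    labels: 1 for genuine (same-person) pairs, 0 for impostor pairs."""
    order = np.argsort(scores)[::-1]       # sweep thresholds from high to low
    labels = labels[order]
    tp = np.cumsum(labels)                 # genuine pairs accepted at each cut-off
    fp = np.cumsum(1 - labels)             # impostor pairs accepted at each cut-off
    far = fp / max((1 - labels).sum(), 1)  # false-accept rate
    frr = 1.0 - tp / max(labels.sum(), 1)  # false-reject rate
    idx = np.argmin(np.abs(far - frr))     # point where FAR and FRR cross
    return float((far[idx] + frr[idx]) / 2)

# Toy usage: four genuine and four impostor pairs with synthetic scores
scores = np.array([0.9, 0.8, 0.75, 0.4, 0.55, 0.3, 0.2, 0.1])
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])
print(f"EER: {compute_eer(scores, labels):.2%}")
```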