This paper presents Team Xaiofei's approach to the Face-Voice Association in Multilingual Environments (FAME) challenge at ACM Multimedia 2024. We study how different languages affect face-voice matching and build upon Fusion and Orthogonal Projection (FOP), introducing four key components: a dual-branch structure, dynamic sample-pair weighting, robust data augmentation, and a score polarization strategy. The dual-branch structure serves as an auxiliary mechanism for integrating and supplying more comprehensive information. The dynamic weighting mechanism assigns different weights to different sample pairs to optimize learning. Data augmentation is employed to improve the model's generalization across diverse conditions. Finally, the score polarization strategy, based on age and gender matching confidence, clarifies and accentuates the final matching scores. Our method achieves an equal error rate (EER) of 20.07 on the V2-EH dataset and 21.76 on the V1-EU dataset.
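Since the results are reported as an equal error rate, a minimal sketch of how that metric can be computed may help readers interpret the figures. It is not the challenge's official scoring script. The function name `equal_error_rate` and its arguments are hypothetical, and the sketch assumes that a higher score means a more likely face-voice match (a distance-based system would negate its scores first). It returns a percentage, on the assumption that the reported figures follow that convention.

```python
import numpy as np


def equal_error_rate(scores, labels):
    """Return the EER (in percent) for a set of verification trials.

    scores: one similarity score per face-voice pair (assumed: higher = more likely a match).
    labels: 1/True for a matching pair, 0/False for a non-matching pair.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    if labels.all() or not labels.any():
        raise ValueError("need at least one matching and one non-matching pair")

    pos = scores[labels]    # scores of matching pairs
    neg = scores[~labels]   # scores of non-matching pairs

    # Candidate thresholds: every observed score, plus +inf so that
    # "reject everything" is also considered.
    thresholds = np.append(np.unique(scores), np.inf)

    # A pair is accepted when its score is >= threshold.
    far = np.array([(neg >= t).mean() for t in thresholds])  # false acceptance rate
    frr = np.array([(pos < t).mean() for t in thresholds])   # false rejection rate

    # The EER is the operating point where the two error rates are (closest to) equal.
    i = np.argmin(np.abs(far - frr))
    return 100.0 * (far[i] + frr[i]) / 2.0


if __name__ == "__main__":
    # Toy trials: one matching pair scores below one non-matching pair.
    print(equal_error_rate([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))
```

With a finite trial list the two error curves rarely cross exactly, so the sketch averages them at the closest threshold; interpolating between thresholds is a common alternative and can give slightly different values.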