The innate correlation between a person's face and voice has recently emerged as a compelling area of study, especially in multilingual environments. This paper presents our solution to the Face-Voice Association in Multilingual Environments (FAME) 2024 challenge: a contrastive learning-based chaining-cluster method that enhances face-voice association. The task poses non-trivial challenges: building biometric relations between auditory and visual modality cues, and modelling the prosody interdependence across languages while handling both the intrinsic and extrinsic variability present in the data. To address these, our method employs supervised cross-contrastive (SCC) learning to establish robust associations between voices and faces in multilingual scenarios, followed by a purpose-designed chaining-cluster post-processing step that mitigates the impact of outliers common in unconstrained, in-the-wild data. We also conducted extensive experiments to investigate the impact of language on face-voice association. Evaluated on the FAME public evaluation platform, our method achieved second place, demonstrating the robustness and effectiveness of the proposed approach. Code is available at https://github.com/colaudiolab/FAME24_solution.
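To make the cross-modal objective concrete, the following is a minimal NumPy sketch of a symmetric supervised cross-modal contrastive loss, where face and voice embeddings sharing an identity label are treated as positives in both the face-to-voice and voice-to-face directions. This is an illustrative assumption of how such a loss can be formulated, not the paper's exact SCC implementation; the function name, temperature value, and tensor shapes are hypothetical.

```python
import numpy as np

def cross_modal_supcon_loss(face_emb, voice_emb, labels, temperature=0.1):
    """Symmetric supervised contrastive loss over face/voice embeddings.

    face_emb, voice_emb : (N, D) L2-normalized embeddings, paired by row.
    labels              : (N,) identity labels; all same-identity cross-modal
                          pairs count as positives (hypothetical formulation).
    """
    # Scaled cosine similarity between every face and every voice embedding.
    sim = face_emb @ voice_emb.T / temperature            # (N, N)
    pos_mask = labels[:, None] == labels[None, :]         # same identity -> positive

    def directional(s, mask):
        # Row-wise log-softmax, then average the log-probabilities
        # of that row's positive entries.
        log_prob = s - np.log(np.exp(s).sum(axis=1, keepdims=True))
        return -(log_prob * mask).sum(axis=1) / mask.sum(axis=1)

    # Average the face->voice and voice->face directions.
    loss = 0.5 * (directional(sim, pos_mask) + directional(sim.T, pos_mask.T))
    return loss.mean()
```

With matched face/voice embeddings the loss is low, and it rises when voices are shuffled across identities, which is the behaviour a face-voice association objective needs.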