Face verification systems have seen substantial advances; however, they often lack transparency in their decision-making processes. In this paper, we introduce a Vision-Language Model (VLM) for face verification that not only accurately determines whether two face images depict the same individual but also explicitly explains the rationale behind its decision. Our model is trained with two complementary explanation styles: (1) concise explanations that summarize the key factors driving the decision, and (2) comprehensive explanations detailing the specific differences observed between the images. We adapt a state-of-the-art modeling approach originally designed for audio-based differentiation so that it handles visual inputs effectively. This cross-modal transfer improves both the model's accuracy and its interpretability. The proposed VLM integrates strong feature extraction with advanced reasoning capabilities, enabling it to clearly articulate its verification process. Our approach surpasses baseline methods and existing models. These findings highlight the potential of vision-language models in the face verification setting, contributing to more transparent, reliable, and explainable face verification systems.