Person identification systems often rely on audio, visual, or behavioral cues, but real-world conditions frequently result in missing or degraded modalities. To address this challenge, we propose a multimodal person identification framework that uses gesture as a situational enhancer to supplement traditional modalities such as voice and face. Our model employs a unified hybrid fusion strategy, integrating both feature-level and score-level information to maximize representational richness and decision accuracy. Specifically, it uses multi-task learning to process each modality independently, then combines them through cross-attention and gated fusion mechanisms. Finally, a confidence-weighted strategy dynamically adapts to missing data, allowing a single classification head to maintain strong performance even in unimodal and bimodal scenarios. We evaluate our method on CANDOR, a newly introduced interview-based multimodal dataset, which we benchmark in this work for the first time. Our results demonstrate that the proposed trimodal system achieves 99.51% Top-1 accuracy on the person identification task. In addition, we benchmark our model on the VoxCeleb1 dataset, reaching 99.92% accuracy in the bimodal setting and outperforming conventional approaches. Moreover, we show that our system maintains high accuracy even when one or two modalities are unavailable, making it a robust solution for real-world person recognition applications. The code and data for this work are publicly available.
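To make the fusion pipeline described above concrete, the following is a minimal illustrative sketch, not the authors' released implementation, of how cross-attention, gated fusion, and confidence weighting over face, voice, and gesture embeddings could feed a single classification head. The module name, embedding dimensions, gating design, and the placeholder confidence scores are assumptions for illustration only.

```python
import torch
import torch.nn as nn


class GatedCrossAttentionFusion(nn.Module):
    """Hypothetical trimodal fusion head: cross-attention + gating + confidence weighting."""

    def __init__(self, dim: int = 256, num_heads: int = 4, num_ids: int = 1000):
        super().__init__()
        # One cross-attention block per auxiliary modality: the face embedding
        # attends to the voice and gesture embeddings.
        self.attn_voice = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_gesture = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gate producing mixing weights over the three fused streams.
        self.gate = nn.Sequential(nn.Linear(3 * dim, 3), nn.Softmax(dim=-1))
        self.classifier = nn.Linear(dim, num_ids)  # single shared classification head

    def forward(self, face, voice, gesture, conf):
        # face/voice/gesture: (B, 1, dim) per-modality embeddings
        # conf: (B, 3) confidences in [0, 1]; a missing modality gets 0,
        # zeroing its contribution before gating.
        v_att, _ = self.attn_voice(face, voice, voice)
        g_att, _ = self.attn_gesture(face, gesture, gesture)
        feats = torch.stack([face, v_att, g_att], dim=2).squeeze(1)  # (B, 3, dim)
        feats = feats * conf.unsqueeze(-1)        # confidence weighting
        weights = self.gate(feats.flatten(1))     # (B, 3) gating weights
        fused = (weights.unsqueeze(-1) * feats).sum(dim=1)
        return self.classifier(fused)


# Usage with dummy inputs; the gesture stream is marked missing via its confidence.
model = GatedCrossAttentionFusion()
face = torch.randn(2, 1, 256)
voice = torch.randn(2, 1, 256)
gesture = torch.zeros(2, 1, 256)
conf = torch.tensor([[1.0, 1.0, 0.0], [1.0, 0.9, 0.0]])
logits = model(face, voice, gesture, conf)  # (2, 1000) identity logits
```

The actual paper combines feature-level and score-level fusion; this sketch only illustrates the feature-level, confidence-weighted gating step.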