Adaptive Multimodal Person Recognition: A Robust Framework for Handling Missing Modalities

Person identification systems often rely on audio, visual, or behavioral cues, but in real-world conditions one or more modalities are frequently missing or degraded. To address this challenge, we propose a multimodal person identification framework that combines upper-body motion, face, and voice. Experimental results show that body motion outperforms traditional modalities such as face and voice in within-session evaluations, and that it serves as a complementary cue that improves performance in multi-session scenarios. Our model employs a unified hybrid fusion strategy that combines feature-level and score-level information to improve both representational richness and decision accuracy. Specifically, it uses multi-task learning to process each modality independently, followed by cross-attention and gated fusion mechanisms that exploit both unimodal information and cross-modal interactions. Finally, a confidence-weighted strategy and a mistake-correction mechanism dynamically adapt to missing data, so that a single classification head continues to perform well in unimodal and bimodal scenarios. We evaluate our method on CANDOR, a newly introduced interview-based multimodal dataset, which we benchmark here for the first time. The proposed trimodal system achieves 99.51% Top-1 accuracy on person identification. We also evaluate our model on VoxCeleb1, a widely used benchmark, where it reaches 99.92% accuracy in bimodal mode, outperforming conventional approaches. Moreover, the system maintains high accuracy even when one or two modalities are unavailable, making it a robust solution for real-world person recognition applications. The code and data for this work are publicly available.
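
To make the idea of confidence-weighted score-level fusion under missing modalities concrete, the following is a minimal illustrative sketch, not the implementation used in this work. It assumes that each available modality produces a vector of class logits, that a missing modality is represented by `None`, and that a modality's confidence is its maximum softmax probability. The function names `softmax` and `fuse_scores` and the confidence definition are hypothetical choices made for illustration only.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(modality_logits):
    """Fuse per-modality class scores, skipping missing modalities.

    modality_logits maps a modality name (e.g. "face") to a list of class
    logits, or to None when that modality is unavailable.
    Assumption: confidence = maximum softmax probability of the modality.
    """
    # Keep only the modalities that are actually present.
    available = {
        name: softmax(logits)
        for name, logits in modality_logits.items()
        if logits is not None
    }
    if not available:
        raise ValueError("at least one modality must be present")

    sizes = {len(probs) for probs in available.values()}
    if len(sizes) != 1:
        raise ValueError("all modalities must score the same set of classes")
    n_classes = sizes.pop()

    # Normalize confidences over the available modalities so weights sum to 1.
    confidences = {name: max(probs) for name, probs in available.items()}
    total_conf = sum(confidences.values())

    fused = [0.0] * n_classes
    for name, probs in available.items():
        weight = confidences[name] / total_conf
        for i, p in enumerate(probs):
            fused[i] += weight * p
    return fused


# Usage: voice is missing, so only face and motion contribute.
scores = fuse_scores({
    "face": [2.0, 0.5, 0.1],
    "voice": None,
    "motion": [0.2, 3.0, 0.1],
})
predicted_identity = max(range(len(scores)), key=scores.__getitem__)
```

Because the weights are renormalized over whichever modalities are present, the same function covers trimodal, bimodal, and unimodal inputs; with a single modality it reduces to that modality's softmax output.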
