Human sensing, which employs diverse sensors and advanced deep learning techniques to accurately capture and interpret human body information, has significantly impacted fields such as public security and robotics. However, current human sensing primarily depends on modalities such as cameras and LiDAR, each with its own strengths and limitations. Furthermore, existing multi-modal fusion solutions are typically designed for fixed modality combinations and require extensive retraining whenever modalities are added or removed for different scenarios. In this paper, we propose X-Fi, a modality-invariant foundation model that addresses this issue. X-Fi enables the independent or combined use of sensor modalities without additional training: a transformer structure accommodates variable input sizes, and a novel "X-fusion" mechanism preserves modality-specific features during multimodal integration. This design not only enhances adaptability but also facilitates the learning of complementary features across modalities. Extensive experiments on the MM-Fi and XRF55 datasets, covering six distinct modalities, demonstrate that X-Fi achieves state-of-the-art performance on human pose estimation (HPE) and human activity recognition (HAR) tasks. These findings indicate that the proposed model can efficiently support a wide range of human sensing applications, ultimately contributing to the evolution of scalable, multimodal sensing technologies.
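The abstract's central claim — that any subset of modalities can be fused without retraining — rests on treating each modality as a variable-length token sequence and fusing via attention. The sketch below illustrates that general idea in plain NumPy: all tokens are concatenated and self-attended, then each modality cross-attends to the fused sequence to retain modality-specific detail. Every function name, shape, and the fusion layout here are illustrative assumptions for exposition, not the paper's actual X-Fi implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # scaled dot-product attention: (n, d) x (m, d) -> (n, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def x_fusion(modality_tokens):
    """Fuse a variable set of modality token sequences (illustrative).

    modality_tokens: dict mapping modality name -> (n_i, d) array.
    Any subset of modalities can be supplied, because fusion operates
    on the concatenated token sequence rather than a fixed layout.
    """
    tokens = np.concatenate(list(modality_tokens.values()), axis=0)
    # self-attention over all tokens yields a shared multimodal sequence
    fused = attention(tokens, tokens, tokens)
    # cross-attention: each modality's tokens query the fused sequence,
    # preserving modality-specific features (the "X-fusion" intuition)
    return {name: attention(toks, fused, fused)
            for name, toks in modality_tokens.items()}
```

Because the fused sequence length simply tracks whichever modalities are present, dropping or adding a modality changes only the concatenation, not any learned layout — the property the abstract highlights as avoiding retraining.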