Human sensing, which employs various sensors and advanced deep learning techniques to accurately capture and interpret human body information, has significantly impacted fields such as public security and robotics. However, current human sensing primarily depends on modalities such as cameras and LiDAR, each of which has its own strengths and limitations. Furthermore, existing multimodal fusion solutions are typically designed for fixed modality combinations and require extensive retraining whenever modalities are added or removed for different scenarios. In this paper, we propose X-Fi, a modality-invariant foundation model, to address this issue. X-Fi enables sensor modalities to be used independently or in any combination without additional training, by employing a transformer structure to accommodate variable input sizes and a novel "X-fusion" mechanism to preserve modality-specific features during multimodal integration. This approach not only enhances adaptability but also facilitates the learning of complementary features across modalities. Extensive experiments on the MM-Fi and XRF55 datasets, spanning six distinct modalities, demonstrate that X-Fi achieves state-of-the-art performance on human pose estimation (HPE) and human activity recognition (HAR) tasks. These findings indicate that the proposed model can efficiently support a wide range of human sensing applications, ultimately contributing to the evolution of scalable, multimodal sensing technologies.
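The core idea above can be illustrated with a minimal sketch: each available modality is projected into a shared token space, a transformer-style self-attention step fuses the variable-length set of modality tokens, and a residual connection re-injects the modality-specific tokens so they are not washed out by fusion. All names, dimensions, and the use of random (rather than learned) projections here are illustrative assumptions, not the actual X-Fi implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, W_q, W_k, W_v):
    # tokens: (n_tokens, d) -- one token per available modality.
    Q, K, V = tokens @ W_q, tokens @ W_k, tokens @ W_v
    scores = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return scores @ V


def x_fusion_sketch(modality_feats, d=16, seed=0):
    """Fuse a variable set of modality features into one embedding.

    modality_feats: dict mapping modality name -> 1-D feature vector;
    any subset of modalities, each with its own dimensionality.
    (Hypothetical helper for illustration; weights are random, whereas
    a real model would learn them.)
    """
    rng = np.random.default_rng(seed)
    tokens = []
    for name, feat in sorted(modality_feats.items()):
        # Per-modality projection into the shared token dimension d,
        # so the transformer sees a uniform token format.
        proj = rng.standard_normal((feat.shape[0], d)) / np.sqrt(feat.shape[0])
        tokens.append(feat @ proj)
    tokens = np.stack(tokens)  # (n_modalities, d)

    W_q, W_k, W_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    fused = self_attention(tokens, W_q, W_k, W_v)
    # Residual re-injection of modality-specific tokens: the fusion output
    # keeps each modality's own features alongside the shared context.
    fused = fused + tokens
    return fused.mean(axis=0)  # single fused embedding, shape (d,)
```

Because attention operates over however many tokens are present, the same fusion structure handles one modality or several without any architectural change, which is the property the abstract attributes to the transformer-based design.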