Sensing technology is widely used for comprehending the physical world, with numerous modalities explored in past decades. While there has been considerable work on multi-modality learning, prior approaches all require that data from all modalities be paired. How to leverage partially paired multi-modality data remains an open problem. To tackle this challenge, we introduce the Babel framework, encompassing the neural network architecture, data preparation and processing, as well as the training strategies. Babel serves as a scalable pre-trained multi-modal sensing neural network, currently aligning six sensing modalities, namely Wi-Fi, mmWave, IMU, LiDAR, video, and depth. To overcome the scarcity of complete paired data, the key idea of Babel is to transform the N-modality alignment into a series of two-modality alignments by devising an expandable network architecture. This concept is realized via a series of novel techniques, including the pre-trained modality tower, which capitalizes on available single-modal networks, and the adaptive training strategy, which balances the contribution of the newly incorporated modality against the previously established modality alignment. Evaluation demonstrates Babel's outstanding performance on eight human activity recognition datasets, compared to various baselines, e.g., the top multi-modal sensing framework, single-modal sensing networks, and multi-modal large language models. Babel not only effectively fuses multiple available modalities (up to 22% accuracy increase), but also enhances the performance of individual modalities (12% average accuracy improvement). Case studies also highlight exciting application scenarios empowered by Babel, including cross-modality retrieval (i.e., sensing imaging) and bridging LLMs for sensing comprehension.
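The core idea above, chaining two-modality alignments instead of requiring fully paired N-modality data, can be sketched with a toy contrastive alignment step. This is a minimal illustration under assumed details, not Babel's actual implementation: each "tower" is stood in for by a random linear projection, and a symmetric InfoNCE-style loss aligns one paired batch (here, hypothetically, IMU paired with video) in a shared embedding space.

```python
import numpy as np

rng = np.random.default_rng(0)

def tower(x, w):
    """Project features and L2-normalize (stand-in for a pre-trained modality tower)."""
    z = x @ w
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def info_nce(za, zb, temperature=0.07):
    """Symmetric contrastive loss; positives lie on the diagonal of the similarity matrix."""
    logits = za @ zb.T / temperature
    labels = np.arange(len(za))
    def ce(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_p[labels, labels].mean()
    return 0.5 * (ce(logits) + ce(logits.T))

# One two-modality alignment step: a newly added tower (e.g. IMU) is trained
# against an already-aligned tower (e.g. video), which would stay frozen so the
# previously established alignment is preserved.
B, d_imu, d_vid, d_emb = 8, 6, 10, 4
w_imu = rng.normal(size=(d_imu, d_emb))   # new tower (would be trainable)
w_vid = rng.normal(size=(d_vid, d_emb))   # established tower (would be frozen)

x_imu = rng.normal(size=(B, d_imu))       # paired IMU/video batch
x_vid = rng.normal(size=(B, d_vid))
loss = info_nce(tower(x_imu, w_imu), tower(x_vid, w_vid))
print(f"contrastive alignment loss: {loss:.4f}")
```

Repeating such steps for each available paired modality combination (Wi-Fi with video, mmWave with depth, etc.) expands the shared space one modality at a time, which is why fully paired six-modality data is never required.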