Recent advances in language models have demonstrated a remarkable ability to bridge visual information and semantic language understanding. This raises a new question: can language models connect textual semantics with IoT sensory signals to perform recognition tasks such as Human Activity Recognition (HAR)? If so, an intelligent HAR system with human-like cognition could be built, one capable of adapting to new environments and unseen categories. This paper explores the feasibility of this idea with IoT-sEnsors-language alignmEnt pre-Training (TENT), an approach that jointly aligns textual embeddings with IoT sensor signals, including camera video, LiDAR, and mmWave. Through IoT-language contrastive learning, we derive a unified semantic feature space that aligns multi-modal features with language embeddings, so that each IoT sample corresponds to the specific words that describe it. To strengthen the connection between textual categories and their IoT data, we propose supplementary descriptions and learnable prompts that bring more semantic information into the joint feature space. TENT can not only recognize actions that it has seen but also "guess" an unseen action by finding the closest textual words in the feature space. We demonstrate that TENT achieves state-of-the-art performance on zero-shot HAR tasks across different modalities, outperforming the best vision-language models by over 12%.
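To make the alignment idea concrete, the following is a minimal NumPy sketch of contrastive alignment and nearest-text zero-shot prediction. It is not the TENT implementation: the symmetric InfoNCE form, the temperature of 0.07, and the function names `contrastive_loss` and `zero_shot_predict` are assumptions chosen for illustration, and the sensor and text encoders are assumed to have already produced fixed-size feature vectors.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_loss(sensor_feats, text_feats, temperature=0.07):
    # Assumed symmetric InfoNCE objective: row i of sensor_feats and row i of
    # text_feats form a matched pair; every other row in the batch is a negative.
    s = l2_normalize(sensor_feats)
    t = l2_normalize(text_feats)
    logits = s @ t.T / temperature
    idx = np.arange(len(s))
    loss_sensor_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_sensor = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_sensor_to_text + loss_text_to_sensor)


def zero_shot_predict(sensor_feats, label_text_feats):
    # Assign each sensor sample to the label whose text embedding is closest
    # in cosine similarity; the labels may be classes never seen in training.
    sims = l2_normalize(sensor_feats) @ l2_normalize(label_text_feats).T
    return sims.argmax(axis=1)
```

In this sketch, an unseen action is "guessed" simply by embedding its textual name or description and adding that row to `label_text_feats`, with no retraining of the sensor encoder.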