COMODO: Cross-Modal Video-to-IMU Distillation for Efficient Egocentric Human Activity Recognition

Building intelligent, human-centered wearable systems for continuous activity understanding involves a fundamental trade-off. Egocentric video-based models capture rich semantic information and have demonstrated strong performance in human activity recognition (HAR), but their high power consumption, privacy concerns, and dependence on lighting limit their feasibility for continuous on-device recognition. In contrast, inertial measurement unit (IMU) sensors offer an energy-efficient, privacy-preserving alternative, yet they lack large-scale annotated datasets, which leads to weaker generalization. To bridge this gap, we propose COMODO, a cross-modal self-supervised distillation framework that transfers semantic knowledge from video to IMU without requiring labels. COMODO uses a pretrained, frozen video encoder to construct a dynamic instance queue, which is then used to align the feature distributions of video and IMU embeddings. This enables the IMU encoder to inherit rich semantic structure from video while retaining its efficiency for real-world applications. Experiments on multiple egocentric HAR datasets show that COMODO consistently improves downstream performance, matching or surpassing fully supervised models, and demonstrates strong cross-dataset generalization. Owing to its simplicity and flexibility, COMODO is compatible with diverse pretrained video and time-series models, offering the potential to leverage more powerful teacher and student foundation models in future ubiquitous computing research. The code is available at https://github.com/cruiseresearchgroup/COMODO.
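As a rough illustration of the queue-based alignment described above, the following Python sketch shows one plausible reading of the idea. It is not the authors' implementation: the abstract does not specify the loss, so the cross-entropy between similarity distributions, the temperature values, and all function names here are assumptions made for illustration. It also assumes that video and IMU embeddings have already been projected to the same dimension.

```python
import math
from collections import deque


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def softmax(scores, temperature):
    """Numerically stable softmax over temperature-scaled scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def similarity_distribution(embedding, queue, temperature):
    """Softmax over cosine similarities between one embedding and the queue."""
    query = l2_normalize(embedding)
    sims = [sum(a * b for a, b in zip(query, key)) for key in queue]
    return softmax(sims, temperature)


def distillation_loss(video_emb, imu_emb, queue, t_teacher=0.05, t_student=0.1):
    """Cross-entropy between teacher (video) and student (IMU) distributions.

    The temperatures are hypothetical defaults, not values from the paper.
    """
    p_teacher = similarity_distribution(video_emb, queue, t_teacher)
    p_student = similarity_distribution(imu_emb, queue, t_student)
    return -sum(pt * math.log(ps + 1e-12) for pt, ps in zip(p_teacher, p_student))


def update_queue(queue, video_emb):
    """Append the newest frozen-teacher embedding; deque(maxlen=K) evicts the oldest."""
    queue.append(l2_normalize(video_emb))


# Toy usage: a queue of three teacher embeddings in a 2-D feature space.
queue = deque(maxlen=3)
for emb in ([1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]):
    update_queue(queue, emb)

video = [1.0, 0.0]
aligned_imu = [1.0, 0.0]      # student agrees with the teacher
misaligned_imu = [0.0, 1.0]   # student points elsewhere

loss_aligned = distillation_loss(video, aligned_imu, queue, 0.1, 0.1)
loss_misaligned = distillation_loss(video, misaligned_imu, queue, 0.1, 0.1)
```

In this toy setup the loss is smaller when the IMU embedding relates to the queued video embeddings the same way the video embedding does, which is the sense in which the student can inherit the teacher's feature structure without labels. In a real training loop only the IMU encoder would receive gradients, since the video encoder is frozen.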
