To unlock the potential of diverse sensors, we investigate a method to transfer knowledge between modalities using the structure of a unified multimodal representation space for Human Action Recognition (HAR). We formalize and explore an understudied cross-modal transfer setting we term Unsupervised Modality Adaptation (UMA), in which the modality used at test time is not used in supervised training, i.e., no labeled instances of the test modality are available during training. We develop three methods to perform UMA: Student-Teacher (ST), Contrastive Alignment (CA), and Cross-modal Transfer Through Time (C3T). Our extensive experiments on various camera+IMU datasets compare these methods to each other in the UMA setting, and to their empirical upper bound in the supervised setting. The results indicate that C3T is the most robust and highest performing, outperforming the other methods by a margin of at least 8%, and approaching supervised-setting performance even in the presence of temporal noise. C3T introduces a novel mechanism for aligning signals across time-varying latent vectors extracted from the receptive field of temporal convolutions. Our findings suggest that C3T has significant potential for developing generalizable models for time-series sensor data, opening new avenues for multimodal learning in various applications.
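To make the time-varying alignment idea concrete, below is a minimal sketch, not the paper's actual C3T implementation: two temporal-convolution encoders map synchronized camera and IMU streams to sequences of latent vectors, and an InfoNCE-style contrastive loss pulls together latents from the same sample at the same time step. All module names, feature dimensions, and hyperparameters here are illustrative assumptions.

```python
# A minimal sketch of per-time-step cross-modal alignment (assumed
# design; the abstract does not specify the C3T architecture).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalEncoder(nn.Module):
    """1D temporal convolutions producing a sequence of latent vectors."""

    def __init__(self, in_dim: int, latent_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_dim, latent_dim, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(latent_dim, latent_dim, kernel_size=5, padding=2),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, in_dim) -> (batch, time, latent_dim), unit-norm
        z = self.net(x.transpose(1, 2)).transpose(1, 2)
        return F.normalize(z, dim=-1)


def timestep_contrastive_loss(z_a, z_b, temperature: float = 0.07):
    """InfoNCE-style loss aligning two modalities at each time step:
    latents from the same (sample, time step) pair are positives;
    other samples at that time step serve as negatives."""
    b, t, _ = z_a.shape
    loss = z_a.new_zeros(())
    for step in range(t):
        logits = z_a[:, step] @ z_b[:, step].T / temperature  # (b, b)
        targets = torch.arange(b)  # matched pairs lie on the diagonal
        loss = loss + F.cross_entropy(logits, targets)
    return loss / t


# Usage: align latent sequences from synchronized camera and IMU clips.
cam_enc = TemporalEncoder(in_dim=512)  # e.g. per-frame visual features
imu_enc = TemporalEncoder(in_dim=6)    # e.g. 3-axis accel + 3-axis gyro
cam = torch.randn(8, 50, 512)          # (batch, time, features)
imu = torch.randn(8, 50, 6)
loss = timestep_contrastive_loss(cam_enc(cam), imu_enc(imu))
loss.backward()
```

In a sketch like this, the label-free alignment step is what enables UMA: after aligning the two encoders on unlabeled paired data, a classifier trained on labeled data from one modality can be applied to the other's latents at test time.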