We address a novel cross-domain few-shot learning (CD-FSL) task with multimodal input and unlabeled target data for egocentric action recognition. This paper simultaneously tackles two critical challenges of egocentric action recognition in CD-FSL settings: (1) the extreme domain gap in egocentric videos (e.g., daily life vs. industrial domain) and (2) the computational cost for real-world applications. We propose MM-CDFSL, a domain-adaptive and computationally efficient approach designed to enhance adaptability to the target domain and improve inference speed. To address the first challenge, we distill multimodal knowledge from teacher models into a student RGB model. Each teacher model is trained independently on source and target data for its respective modality. Leveraging only unlabeled target data during multimodal distillation enhances the student model's adaptability to the target domain. To address the second challenge, we introduce ensemble masked inference, a technique that reduces the number of input tokens through masking, with ensemble prediction mitigating the performance degradation caused by masking. Our approach outperforms state-of-the-art CD-FSL approaches by a substantial margin on multiple egocentric datasets, improving by an average of 6.12/6.10 points for 1-shot/5-shot settings while achieving $2.2$ times faster inference speed. Project page: https://masashi-hatano.github.io/MM-CDFSL/
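The idea behind ensemble masked inference can be illustrated with a minimal sketch. It is not the MM-CDFSL implementation: the uniform random masking strategy, the function names (`random_keep_indices`, `ensemble_masked_inference`), the default `mask_ratio` and `num_views`, and the `toy_model` stand-in classifier are all assumptions made for illustration. The sketch only shows the general pattern of running a model on several reduced token sets and averaging the resulting class probabilities.

```python
import math
import random
from typing import Callable, List, Sequence

Token = Sequence[float]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_keep_indices(num_tokens: int, mask_ratio: float,
                        rng: random.Random) -> List[int]:
    """Pick the indices of the tokens that stay visible (assumed: uniform random masking)."""
    num_keep = max(1, int(num_tokens * (1.0 - mask_ratio)))
    return sorted(rng.sample(range(num_tokens), num_keep))


def ensemble_masked_inference(tokens: Sequence[Token],
                              model: Callable[[List[Token]], List[float]],
                              mask_ratio: float = 0.5,
                              num_views: int = 2,
                              seed: int = 0) -> List[float]:
    """Average class probabilities over several independently masked views."""
    rng = random.Random(seed)
    avg: List[float] = []
    for _ in range(num_views):
        keep = random_keep_indices(len(tokens), mask_ratio, rng)
        # The model only sees the visible tokens, so each pass is cheaper.
        probs = softmax(model([tokens[i] for i in keep]))
        if not avg:
            avg = [0.0] * len(probs)
        avg = [a + p / num_views for a, p in zip(avg, probs)]
    return avg


def toy_model(visible_tokens: List[Token]) -> List[float]:
    """Hypothetical stand-in classifier: mean-pool tokens and use the result as logits."""
    dim = len(visible_tokens[0])
    return [sum(t[d] for t in visible_tokens) / len(visible_tokens)
            for d in range(dim)]


if __name__ == "__main__":
    tokens = [[float(i), 1.0, 0.0] for i in range(8)]
    print(ensemble_masked_inference(tokens, toy_model, mask_ratio=0.5, num_views=3))
```

In this sketch each forward pass sees only a fraction of the tokens, and averaging over views is meant to offset the information lost to masking. Whether this yields a net speedup depends on how the model's cost scales with the number of tokens and on the number of views used.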