Multimodal video understanding is crucial for analyzing egocentric videos, where integrating multiple sensory signals significantly enhances action recognition and moment localization. In practice, however, modalities are often incomplete because of privacy concerns, efficiency demands, or hardware malfunctions. We study the impact of missing modalities on egocentric action recognition, particularly in transformer-based models. We introduce the Missing Modality Token (MMT), a novel concept for maintaining performance when modalities are absent, and show that it is effective on the Ego4D, Epic-Kitchens, and Epic-Sounds datasets. When half of the test set is modality-incomplete, our method reduces the performance drop from $\sim 30\%$ to only $\sim 10\%$. Through extensive experiments, we demonstrate that MMT adapts to different training scenarios and handles missing modalities better than current methods. Our work contributes a comprehensive analysis and an innovative approach, opening avenues for more resilient multimodal systems in real-world settings.
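The abstract does not say how the Missing Modality Token is implemented, so the following is only a minimal sketch of one plausible reading: when a modality is absent, its slot in the transformer input is filled with a placeholder token instead of being dropped. Every name here (`build_input_sequence`, `placeholder_tokens`, `tokens_per_modality`) is hypothetical, and the placeholder values are fixed constants for illustration; in a real model they would presumably be learnable parameters trained with the network.

```python
from typing import Dict, List, Optional

Token = List[float]  # one embedding vector


def build_input_sequence(
    sample: Dict[str, Optional[List[Token]]],
    placeholder_tokens: Dict[str, Token],
    tokens_per_modality: Dict[str, int],
    modality_order: List[str],
) -> List[Token]:
    """Concatenate per-modality tokens, substituting a placeholder
    token for any modality whose entry is None (i.e. missing)."""
    sequence: List[Token] = []
    for name in modality_order:
        tokens = sample.get(name)
        if tokens is None:
            # Assumption: the missing modality keeps its usual token count,
            # with each position filled by a copy of the placeholder token.
            tokens = [
                list(placeholder_tokens[name])
                for _ in range(tokens_per_modality[name])
            ]
        sequence.extend(tokens)
    return sequence


# Toy setup: 2-dimensional embeddings, 3 video tokens and 2 audio tokens.
order = ["video", "audio"]
placeholders = {"video": [0.1, 0.1], "audio": [-0.1, -0.1]}
lengths = {"video": 3, "audio": 2}

complete = {"video": [[1.0, 2.0]] * 3, "audio": [[3.0, 4.0]] * 2}
audio_missing = {"video": [[1.0, 2.0]] * 3, "audio": None}

seq_complete = build_input_sequence(complete, placeholders, lengths, order)
seq_missing = build_input_sequence(audio_missing, placeholders, lengths, order)
```

In this sketch the sequence length stays the same whether or not audio is present, so the downstream model always sees the same input layout. Whether the actual method keeps the token count fixed or uses a single token per missing modality is not stated in the abstract.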