Multimodal egocentric activity recognition integrates visual and inertial cues for robust first-person behavior understanding. However, deploying such systems in open-world environments requires detecting novel activities while continuously learning from non-stationary data streams. Existing methods rely on the main fused logits for novelty scoring, without fully exploiting the complementary evidence available from individual modalities. Because these logits are often dominated by RGB, cues from other modalities, particularly IMU, remain underutilized, and this imbalance worsens as catastrophic forgetting accumulates. To address this, we propose MAND, a modality-aware framework for multimodal egocentric open-world continual learning. At inference, Modality-aware Adaptive Scoring (MoAS) adaptively adjusts modality contributions using sample-wise reliability and refines novelty scoring with deviation and disagreement penalties. During training, Modality-aware Representation Stabilization Training (MoRST) preserves the discriminative capacity of each modality across tasks through modality-specific heads and modality-wise logit distillation. Experiments on a public multimodal egocentric benchmark show that MAND consistently improves novel activity detection and known-class accuracy while substantially reducing FPR95, indicating more reliable open-world recognition. The source code is available at \href{https://github.com/HyeJeongIm/MAND}{github.com/HyeJeongIm/MAND}.
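To make the inference-time idea concrete, the following is a minimal sketch of reliability-weighted novelty scoring with a disagreement penalty. It is not the MAND implementation: the abstract does not define how reliability, deviation, or disagreement are computed, so this sketch assumes the maximum softmax probability as the sample-wise reliability and the Jensen-Shannon divergence between modality predictions as the disagreement term, and it omits the deviation penalty altogether. The names `novelty_score` and `disagreement_weight` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q, eps=1e-12):
    """KL divergence KL(p || q), with a small epsilon for stability."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions."""
    mid = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)


def novelty_score(modality_logits, disagreement_weight=1.0):
    """Return a novelty score for one sample (higher = more likely novel).

    modality_logits maps a modality name (e.g. "rgb", "imu") to that
    modality head's logits over the known classes.
    """
    probs = {name: softmax(l) for name, l in modality_logits.items()}

    # ASSUMPTION: reliability is the max softmax probability per modality.
    reliability = {name: max(p) for name, p in probs.items()}
    total = sum(reliability.values())
    weights = {name: r / total for name, r in reliability.items()}

    # Reliability-weighted confidence that the sample is a known class.
    confidence = sum(weights[n] * reliability[n] for n in probs)

    # ASSUMPTION: disagreement is the mean pairwise JS divergence.
    names = list(probs)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    disagreement = 0.0
    if pairs:
        disagreement = sum(
            js_divergence(probs[a], probs[b]) for a, b in pairs
        ) / len(pairs)

    return (1.0 - confidence) + disagreement_weight * disagreement


# Both modalities favor class 0: low novelty score.
agree = novelty_score({"rgb": [4.0, 0.0, 0.0], "imu": [3.0, 0.0, 0.0]})
# Modalities favor different classes: the disagreement term raises the score.
disagree = novelty_score({"rgb": [4.0, 0.0, 0.0], "imu": [0.0, 3.0, 0.0]})
print(agree < disagree)
```

Under these assumptions, a sample on which RGB and IMU confidently predict different classes receives a higher novelty score than one on which they agree. Scoring from the fused logits alone would not see that signal when one modality dominates the fusion.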