Multimodal multiview learning seeks to integrate information from diverse sources to enhance task performance. Existing approaches often struggle with flexible view configurations, including arbitrary view combinations, varying numbers of views, and heterogeneous modalities. Focusing on human activity recognition, we propose RALIS, a model that combines multiview contrastive learning with a mixture-of-experts module to support arbitrary view availability during both training and inference. Instead of reconstructing missing views, RALIS applies an adjusted center contrastive loss for self-supervised representation learning and view alignment, mitigating the impact of missing views on multiview fusion. This loss formulation allows view weights to be integrated to account for view quality. It also reduces computational complexity from $O(V^2)$ to $O(V)$, where $V$ is the number of views. To address residual discrepancies not captured by contrastive learning, we employ a mixture-of-experts module with a specialized load-balancing strategy, tasked with adapting to arbitrary view combinations. We analyze the geometric relationship among the model's components and show how they complement one another in the latent space. RALIS is validated on four datasets spanning inertial and human pose modalities, with the number of views ranging from three to nine, demonstrating both its performance and its flexibility.
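The $O(V^2)\to O(V)$ reduction comes from contrasting each view against a single (optionally weighted) center embedding rather than against every other view. The following is a minimal illustrative sketch of this idea, not the paper's exact formulation; the function name, the InfoNCE-style batch objective, and the temperature parameter `tau` are assumptions introduced for illustration.

```python
import numpy as np

def center_contrastive_loss(views, weights=None, tau=0.1):
    """Illustrative sketch (assumed formulation): contrast each
    available view's embedding against the weighted center of all
    views, giving O(V) contrastive terms instead of the O(V^2)
    terms of all-pairs multiview contrastive losses.

    views:   (V, B, D) array of embeddings; V views, batch B, dim D.
    weights: optional (V,) view-quality weights.
    """
    V, B, D = views.shape
    if weights is None:
        weights = np.ones(V)
    w = weights / weights.sum()

    # L2-normalize embeddings and form the per-sample weighted center.
    z = views / np.linalg.norm(views, axis=-1, keepdims=True)
    center = np.tensordot(w, z, axes=1)                        # (B, D)
    center = center / np.linalg.norm(center, axis=-1, keepdims=True)

    # InfoNCE between each view and the center: the positive is the
    # matching sample's center, negatives are the other B-1 centers.
    loss = 0.0
    for v in range(V):                                         # O(V) loop
        logits = z[v] @ center.T / tau                         # (B, B)
        logits -= logits.max(axis=1, keepdims=True)            # stability
        log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        loss += -w[v] * np.mean(np.diag(log_prob))
    return loss
```

Under this scheme, a missing view simply drops out of the sum (or receives zero weight), so the center and the loss remain well defined for any subset of available views.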