Integrating and processing information from various sources or modalities is critical for obtaining a comprehensive and accurate perception of the real world. Drawing inspiration from neuroscience, we develop the Information-Theoretic Hierarchical Perception (ITHP) model, which builds on the information bottleneck concept. Unlike most traditional fusion models, which take all modalities as input, our model designates a prime modality as the input, while the remaining modalities act as detectors in the information pathway. The model constructs an effective and compact information flow by balancing two objectives: minimizing the mutual information between the latent state and the input modal state, and maximizing the mutual information between the latent state and the remaining modal states. This yields compact latent state representations that retain relevant information while minimizing redundancy, thereby substantially enhancing the performance of downstream tasks. Experimental evaluations on both the MUStARD and CMU-MOSI datasets demonstrate that our model consistently distills crucial information in multimodal learning scenarios, outperforming state-of-the-art baselines.
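
The trade-off described in the abstract can be sketched for a single level of the hierarchy, assuming the standard information bottleneck Lagrangian. The symbols are notation introduced here for illustration and do not appear in the abstract: $X_0$ is the prime (input) modality, $X_1$ is one detector modality, $B_0$ is the latent state, and $\beta$ is an assumed trade-off weight.

```latex
\min_{p(B_0 \mid X_0)} \; I(X_0; B_0) \;-\; \beta \, I(B_0; X_1), \qquad \beta > 0
```

Under this reading, a larger $\beta$ favors retaining information about the detector modality, while a smaller $\beta$ favors stronger compression of the input.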