Integrating and processing information from multiple sources or modalities is critical for obtaining a comprehensive and accurate perception of the real world. Drawing inspiration from neuroscience, we develop the Information-Theoretic Hierarchical Perception (ITHP) model, which builds on the information bottleneck concept. Unlike most traditional fusion models, which take all modalities as input, our model designates the prime modality as the input, while the remaining modalities act as detectors along the information pathway. The model constructs an effective and compact information flow by balancing two objectives: minimizing the mutual information between the latent state and the input modal state, and maximizing the mutual information between the latent states and the remaining modal states. This yields compact latent state representations that retain relevant information while minimizing redundancy, thereby substantially improving performance on downstream tasks. Experimental evaluations on the MUStARD and CMU-MOSI datasets demonstrate that our model consistently distills crucial information in multimodal learning scenarios, outperforming state-of-the-art benchmarks.
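For a single stage of the pathway, the balance between the two mutual-information terms can be sketched in the standard information bottleneck form. The symbols here are assumed for illustration and do not appear in the abstract: $X_0$ denotes the prime (input) modality, $X_1$ a detector modality, $B$ the latent state, and $\beta > 0$ a trade-off weight. This is a minimal sketch of the general principle, not necessarily the exact objective used by ITHP.

```latex
% Single-stage information bottleneck Lagrangian (illustrative notation):
%   X_0 : prime (input) modality      X_1 : detector modality
%   B   : latent state                \beta : trade-off weight (assumed)
\begin{equation}
  \min_{p(b \mid x_0)} \; \mathcal{L}
  \;=\; I(X_0; B) \;-\; \beta \, I(B; X_1),
  \qquad \beta > 0 .
\end{equation}
```

The first term penalizes information that $B$ retains about the input, which encourages compression; the second rewards information that $B$ carries about the detector modality, which preserves relevance. A larger $\beta$ favors relevance over compactness.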