Fusion models, which combine information from different sources, are widely used for multimodal tasks. However, they are limited in their ability to align data distributions across modalities, which can lead to inconsistencies and make robust representations difficult to learn. Alignment models address this issue directly, but they often require training from scratch on large datasets to achieve good results, which is costly in both resources and time. To overcome these limitations, we propose Context-Based Multimodal Fusion (CBMF), a model that combines modality fusion with data distribution alignment. In CBMF, each modality is represented by a specific context vector that is fused with the embedding of that modality. This makes it possible to use large pre-trained models with frozen weights, reducing computational and training data requirements. In addition, the network learns to differentiate the embeddings of different modalities through fusion with the context, and it aligns data distributions using a contrastive, self-supervised approach. CBMF thus offers an effective and economical solution for complex multimodal tasks.
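The mechanism described in the abstract (a per-modality context vector fused with a frozen-encoder embedding, followed by contrastive alignment) can be illustrated with a minimal sketch. Several things here are assumptions rather than details taken from the paper: fusion is modeled as element-wise addition, the contrastive objective is a symmetric InfoNCE-style loss, and the names `fuse`, `info_nce`, and `cbmf_style_loss` are hypothetical. The embeddings stand in for outputs of frozen pre-trained encoders, and no gradient update is shown; in a real setup the context vectors (and any fusion network) would be the trainable parts.

```python
import math
import random


def fuse(embedding, context):
    """Fuse one embedding with its modality context vector.

    Assumption: fusion is element-wise addition; the abstract does not
    specify the fusion operator.
    """
    return [e + c for e, c in zip(embedding, context)]


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.1):
    """Mean cross-entropy where anchors[i] should match positives[i].

    Assumption: an InfoNCE-style loss stands in for the unspecified
    "contrastive approach" mentioned in the abstract.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    loss = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of anchor i with every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]
    return loss / len(a)


def cbmf_style_loss(emb_a, emb_b, ctx_a, ctx_b, temperature=0.1):
    """Symmetric contrastive loss over context-fused embeddings of two modalities."""
    fused_a = [fuse(e, ctx_a) for e in emb_a]
    fused_b = [fuse(e, ctx_b) for e in emb_b]
    return 0.5 * (info_nce(fused_a, fused_b, temperature)
                  + info_nce(fused_b, fused_a, temperature))


if __name__ == "__main__":
    random.seed(0)
    dim, batch = 8, 4
    # Stand-ins for frozen-encoder outputs of two paired modalities.
    image_emb = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(batch)]
    text_emb = [[x + random.gauss(0, 0.1) for x in row] for row in image_emb]
    # One context vector per modality; these would be learned in practice.
    ctx_image = [random.gauss(0, 0.1) for _ in range(dim)]
    ctx_text = [random.gauss(0, 0.1) for _ in range(dim)]
    print(cbmf_style_loss(image_emb, text_emb, ctx_image, ctx_text))
```

In this sketch the encoders never appear in the loss computation as trainable components, which mirrors the idea that only lightweight context parameters need training while the large pre-trained models stay frozen.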