Fusion models, which combine information from different sources, are widely used for multimodal tasks. However, they struggle to align data distributions across modalities, which can lead to inconsistencies and make it difficult to learn robust representations. Alignment models address this issue directly, but they often require training from scratch on large datasets to achieve optimal results, which is costly in both resources and time. To overcome these limitations, we propose Context-Based Multimodal Fusion (CBMF), a model that combines modality fusion with data distribution alignment. In CBMF, each modality is represented by a specific context vector that is fused with the embedding of that modality. This makes it possible to use large pre-trained models that remain frozen, reducing computational and training data requirements. In addition, the network learns to differentiate the embeddings of different modalities through fusion with the context, and it aligns data distributions using a contrastive, self-supervised approach. CBMF thus offers an effective and economical solution for complex multimodal tasks.
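To make the idea concrete, the following is a minimal sketch of the two ingredients named in the abstract: fusing a per-modality context vector with an embedding from a frozen encoder, and aligning the fused embeddings with a contrastive loss. The abstract does not specify the fusion operator or the exact loss, so the element-wise addition in `fuse`, the InfoNCE-style objective in `info_nce`, the temperature of 0.1, and all names and numeric values here are assumptions chosen for illustration, not the authors' implementation.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def fuse(embedding, context):
    """Assumed fusion operator: add the modality's context vector, then normalize.

    The embedding is treated as the output of a frozen pre-trained encoder;
    only the context vector would be trained.
    """
    return l2_normalize([e + c for e, c in zip(embedding, context)])


def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE-style contrastive loss (an assumed choice of contrastive objective).

    anchors[i] and positives[i] form a matched pair; every positives[j] with
    j != i acts as a negative for anchors[i].
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [
            sum(a * p for a, p in zip(anchor, positive)) / temperature
            for positive in positives
        ]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


# Hypothetical stand-ins for frozen-encoder outputs on two paired samples.
image_embeddings = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2]]
text_embeddings = [[0.8, 0.2, 0.1], [0.0, 0.9, 0.3]]

# One context vector per modality; in training these would be learned parameters.
image_context = [0.05, 0.0, -0.05]
text_context = [-0.05, 0.0, 0.05]

fused_images = [fuse(e, image_context) for e in image_embeddings]
fused_texts = [fuse(e, text_context) for e in text_embeddings]

loss = info_nce(fused_images, fused_texts)
```

In a real training loop, the gradient of `loss` would update only the context vectors (and any small fusion layers), while the encoders that produced the embeddings stay frozen. That is where the reduced compute and data requirements described above would come from.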