With the success of self-supervised learning, multimodal foundation models driven by vision and language (VL) pre-training have rapidly adapted to a wide range of downstream tasks. State-of-the-art methods achieve impressive performance by pre-training on large-scale datasets. However, bridging the semantic gap between the two modalities remains a non-negligible challenge for VL tasks. In this work, we propose a computationally efficient framework for multimodal alignment that introduces a novel visual semantic module to further improve performance on VL tasks. Specifically, we propose a flexible model, namely Artificial-Spiking Hierarchical Networks (ASH-Nets), which combines the complementary advantages of artificial neural networks (ANNs) and spiking neural networks (SNNs) to enrich visual semantic representations. In particular, a visual concrete encoder and a semantic abstract encoder are constructed to learn continuous and discrete latent variables, respectively, enhancing the flexibility of semantic encoding. Considering the spatio-temporal properties of SNN modeling, we introduce a contrastive learning method to optimize the inputs of similar samples. This improves the computational efficiency of the hierarchical network, while the augmentation of hard samples benefits the learning of visual representations. Furthermore, we propose the Spiking to Text Uni-Alignment Learning (STUA) pre-training method, which relies only on text features to enhance the encoding of abstract semantics. We validate the approach on multiple well-established downstream VL tasks. Experiments show that the proposed ASH-Nets achieve competitive results.
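To illustrate the distinction the abstract draws between continuous (ANN) and discrete (SNN) representations, the following is a minimal sketch and not the authors' implementation. It assumes a standard leaky integrate-and-fire (LIF) neuron with a hard reset; the abstract does not specify the neuron model, and the function names `relu` and `lif_spikes` as well as the `threshold` and `decay` parameters are hypothetical choices made for this illustration.

```python
from typing import List, Sequence


def relu(x: float) -> float:
    """Continuous ANN-style activation: the output is a real value."""
    return max(0.0, x)


def lif_spikes(
    currents: Sequence[float],
    threshold: float = 1.0,  # assumed firing threshold (illustrative)
    decay: float = 0.5,      # assumed membrane leak factor (illustrative)
) -> List[int]:
    """Leaky integrate-and-fire neuron: the output is a binary spike train.

    At each time step the membrane potential leaks by `decay`, integrates
    the input current, and emits a spike (1) when it reaches `threshold`,
    after which it is reset to zero.
    """
    v = 0.0
    spikes: List[int] = []
    for i in currents:
        v = decay * v + i  # leak, then integrate the input
        if v >= threshold:
            spikes.append(1)
            v = 0.0  # hard reset after firing
        else:
            spikes.append(0)
    return spikes


if __name__ == "__main__":
    inputs = [0.75, 0.75, 0.75, 0.75]
    print([relu(x) for x in inputs])  # continuous values
    print(lif_spikes(inputs))         # discrete spikes over time
```

Under these assumptions, the same input sequence yields real-valued activations from `relu` but a sequence of 0/1 events from `lif_spikes`, which is the sense in which a spiking encoder produces discrete, time-dependent codes.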