Training multimodal models requires a large amount of labeled data, and active learning (AL) aims to reduce labeling costs. Most AL methods are warm-start approaches: they rely on sufficient labeled data to train a well-calibrated model that can assess the uncertainty and diversity of unlabeled data. However, when assembling a dataset, labeled data are often scarce at the outset, leading to a cold-start problem. Moreover, few AL methods address multimodal data, leaving a research gap in this field. We address these issues with a two-stage method for Multi-Modal Cold-Start Active Learning (MMCSAL). First, when only cross-modal pairing information is used as the self-supervision signal, we observe a modality gap: a significant distance between the centroids of the representations of different modalities. Because our selection criterion computes both uni-modal and cross-modal distances, this gap distorts the data selection process. We therefore introduce uni-modal prototypes to bridge the modality gap. Second, conventional AL methods often falter in multimodal scenarios because they overlook the alignment between modalities. We therefore enhance cross-modal alignment through regularization, improving the quality of the multimodal data pairs selected in AL. Finally, experiments on three multimodal datasets demonstrate MMCSAL's efficacy in selecting multimodal data pairs.
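The modality gap described above is simply the distance between the centroids of the two modalities' representation clouds. A minimal sketch of how it could be measured, using synthetic embeddings in place of real encoder outputs (the array names and offsets are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x):
    # Project each row onto the unit hypersphere, as is typical for
    # contrastively trained two-tower encoders.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Synthetic stand-ins for image and text embeddings; the opposite mean
# offsets mimic the two modalities occupying separate regions of the
# shared representation space.
image_emb = l2_normalize(rng.normal(loc=0.5, size=(1000, 128)))
text_emb = l2_normalize(rng.normal(loc=-0.5, size=(1000, 128)))

# Modality gap: Euclidean distance between the two modality centroids.
gap = np.linalg.norm(image_emb.mean(axis=0) - text_emb.mean(axis=0))
print(f"modality gap: {gap:.3f}")
```

A gap near zero would indicate well-mixed modalities; a large value signals that uni-modal and cross-modal distances live on different scales, which is why they cannot be compared directly during selection.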