CoVLR: Coordinating Cross-Modal Consistency and Intra-Modal Structure for Vision-Language Retrieval

Comment from arXiv: I apologize for an operational mistake on my part, which resulted in the absence of a revised version of the manuscript. I am also concerned that the submission process for this paper may lead to conflicts. I therefore kindly request the withdrawal of the manuscript.

Current vision-language retrieval aims to perform cross-modal instance search, and its core idea is to learn consistent vision-language representations. Although cross-modal retrieval performance has improved greatly with the development of deep models, we find that traditional hard consistency may destroy the original relationships among single-modal instances, degrading single-modal retrieval performance. We experimentally observe that vision-language divergence may give rise to strong and weak modalities, and that hard cross-modal consistency cannot guarantee that the relationships among strong-modality instances are unaffected by the weak modality; as a result, those relationships are perturbed even though consistent representations are learned. To this end, we propose a novel and direct Coordinated Vision-Language Retrieval method (dubbed CoVLR), which aims to study and alleviate the desynchrony between the cross-modal alignment and single-modal cluster-preserving tasks. CoVLR addresses this challenge with a meta-optimization-based strategy in which the cross-modal consistency objective and the intra-modal relation-preserving objective serve as the meta-train and meta-test tasks, respectively, so that both tasks are optimized in a coordinated way. Consequently, we can simultaneously ensure cross-modal consistency and intra-modal structure. Experiments on different datasets validate that CoVLR improves single-modal retrieval accuracy while preserving cross-modal retrieval capacity compared with the baselines.
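
One way to read the meta-optimization strategy described above is as a look-ahead update: take a trial gradient step on the cross-modal consistency objective (meta-train), evaluate the intra-modal relation-preserving objective at the resulting weights (meta-test), and update the original weights with both signals. The sketch below illustrates that reading only and is not the authors' implementation. The linear projections, the squared-distance consistency loss, the Gram-matrix relation loss, the first-order approximation of the meta-gradient, and all names and hyperparameters (`coordinated_step`, `alpha`, `beta`, `lr`) are assumptions made for illustration.

```python
import numpy as np

def cross_modal_loss(Wv, Wt, V, T):
    """Mean squared distance between paired image/text embeddings.

    Returns the loss and its gradients with respect to Wv and Wt.
    """
    n = V.shape[0]
    D = V @ Wv - T @ Wt                      # per-pair embedding gap
    loss = np.sum(D ** 2) / n
    return loss, (2.0 / n) * V.T @ D, (-2.0 / n) * T.T @ D

def intra_modal_loss(W, X):
    """Squared gap between the Gram matrix of the embeddings and the Gram
    matrix of the raw features (an assumed stand-in for 'relation preserving').

    Returns the loss and its gradient with respect to W.
    """
    n = X.shape[0]
    Z = X @ W
    R = Z @ Z.T - X @ X.T                    # change in pairwise similarities
    loss = np.sum(R ** 2) / n ** 2
    grad = (4.0 / n ** 2) * X.T @ (R @ Z)
    return loss, grad

def coordinated_step(Wv, Wt, V, T, alpha=0.05, beta=1.0, lr=0.05):
    """One coordinated update (first-order approximation, an assumption)."""
    # Meta-train: trial step on the cross-modal consistency objective.
    l_cm, g_v, g_t = cross_modal_loss(Wv, Wt, V, T)
    Wv_fast, Wt_fast = Wv - alpha * g_v, Wt - alpha * g_t

    # Meta-test: intra-modal structure evaluated at the looked-ahead weights.
    l_v, h_v = intra_modal_loss(Wv_fast, V)
    l_t, h_t = intra_modal_loss(Wt_fast, T)

    # First-order meta-gradient: the Jacobian of the trial step is ignored.
    Wv_new = Wv - lr * (g_v + beta * h_v)
    Wt_new = Wt - lr * (g_t + beta * h_t)
    return Wv_new, Wt_new, l_cm, l_v + l_t

# Toy usage on random, row-normalized features (synthetic data, not a benchmark).
rng = np.random.default_rng(0)
V = rng.normal(size=(8, 6))
V /= np.linalg.norm(V, axis=1, keepdims=True)
T = rng.normal(size=(8, 5))
T /= np.linalg.norm(T, axis=1, keepdims=True)
Wv = 0.1 * rng.normal(size=(6, 4))
Wt = 0.1 * rng.normal(size=(5, 4))

for step in range(200):
    Wv, Wt, l_cm, l_im = coordinated_step(Wv, Wt, V, T)
print(f"cross-modal loss {l_cm:.4f}, intra-modal loss {l_im:.4f}")
```

A full second-order variant would differentiate through the trial step instead of ignoring it. The first-order form is used here only to keep the gradients short enough to write by hand.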
