Advances in visual-language contrastive learning allow many downstream applications to be carried out efficiently and accurately simply by taking the dot product between image and text representations. CLIP, one of the most representative recent approaches, has been widely adopted because of its effectiveness. CLIP is trained with an InfoNCE loss that accounts for both positive and negative samples, which helps it learn a more robust representation space. This paper shows, however, that the common downstream practice of taking a dot product is only a zeroth-order approximation of the optimization goal, so information is lost at test time. Intuitively, since the model has been optimized with the InfoNCE loss, test-time procedures should be aligned with it. The question is how to recover any information about negative samples during inference. We propose Distribution Normalization (DN), which approximates the mean representation of a batch of test samples and uses that mean to stand in for the negative samples in the InfoNCE loss. DN requires no retraining or fine-tuning and can be applied directly at inference time. Extensive experiments on a wide variety of downstream tasks show a clear advantage of DN over the dot product.
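The idea of replacing the plain dot product with a mean-adjusted score can be illustrated with a minimal NumPy sketch. The abstract only states that DN uses the mean representation of a batch of test samples, so several details here are assumptions made for illustration: the mean is subtracted from each embedding before the dot product, the scaling factor `lam` defaults to 0.5, and the function names (`dn_scores`, `dot_product_scores`, `l2_normalize`) are hypothetical. Random vectors stand in for CLIP image and text embeddings.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale each row to unit length, as is typical for CLIP-style embeddings."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def dot_product_scores(image_emb, text_emb):
    """Baseline similarity: plain dot product between every image and every text."""
    return image_emb @ text_emb.T


def dn_scores(image_emb, text_emb, lam=0.5):
    """Mean-adjusted similarity (illustrative sketch, not a reference implementation).

    Assumption: the batch means are subtracted, scaled by `lam`, from each
    embedding before the dot product. Both the subtraction form and the
    default `lam=0.5` are assumptions not stated in the abstract.
    """
    mu_img = image_emb.mean(axis=0, keepdims=True)  # mean image representation
    mu_txt = text_emb.mean(axis=0, keepdims=True)   # mean text representation
    return (image_emb - lam * mu_img) @ (text_emb - lam * mu_txt).T


# Random stand-ins for embeddings of 8 images and 5 captions (dimension 16).
rng = np.random.default_rng(0)
image_emb = l2_normalize(rng.normal(size=(8, 16)))
text_emb = l2_normalize(rng.normal(size=(5, 16)))

baseline = dot_product_scores(image_emb, text_emb)  # shape (8, 5)
normalized = dn_scores(image_emb, text_emb)         # shape (8, 5)
```

In this sketch the means are computed from the same batch that is being scored; in practice they could equally be estimated once from a separate batch of test samples and reused. Setting `lam=0` recovers the plain dot product, which makes the baseline a special case of the adjusted score.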