To train image-caption retrieval (ICR) methods, contrastive loss functions are a common choice of optimization objective. Unfortunately, contrastive ICR methods are vulnerable to predictive feature suppression. Predictive features are features that correctly indicate the similarity between a query and a candidate item. However, when multiple predictive features are present during training, encoder models tend to suppress redundant predictive features, since these features are not needed to learn to discriminate between positive and negative pairs. Although some predictive features are redundant during training, they might be relevant during evaluation. We introduce an approach to reduce predictive feature suppression for resource-constrained ICR methods: latent target decoding (LTD). We add a decoder to the contrastive ICR framework to reconstruct the input caption in the latent space of a general-purpose sentence encoder, which prevents the image and caption encoders from suppressing predictive features. We implement the LTD objective as an optimization constraint, which ensures that the reconstruction loss stays below a bound value while primarily optimizing for the contrastive loss. Importantly, LTD does not depend on additional training data or expensive (hard) negative mining strategies. Our experiments show that, unlike reconstructing the input caption in the input space, LTD reduces predictive feature suppression, as measured by higher recall@k, r-precision, and nDCG scores than a contrastive ICR baseline. Moreover, we show that LTD should be implemented as an optimization constraint instead of a dual optimization objective. Finally, we show that LTD can be used with different contrastive learning losses and a wide variety of resource-constrained ICR methods.
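The constrained formulation described above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation. The abstract does not say how the constraint is enforced, so the sketch assumes a Lagrange multiplier that is raised while the reconstruction loss exceeds the bound and lowered (never below zero) otherwise. It also assumes an InfoNCE-style contrastive loss and a cosine-distance reconstruction loss. All function names, the `temperature`, the `bound`, and the `step_size` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(image_embs, caption_embs, temperature=0.1):
    """Image-to-caption InfoNCE loss (assumed form of the contrastive loss).

    The i-th caption is the positive for the i-th image; all other captions
    in the batch act as negatives.
    """
    total = 0.0
    for i, img in enumerate(image_embs):
        logits = [cosine(img, cap) / temperature for cap in caption_embs]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_norm - logits[i]
    return total / len(image_embs)


def reconstruction_loss(decoded, targets):
    """Mean cosine distance between decoder outputs and latent targets.

    `targets` stands in for the caption embeddings produced by a frozen
    general-purpose sentence encoder (assumed choice of distance).
    """
    return sum(1.0 - cosine(d, t) for d, t in zip(decoded, targets)) / len(decoded)


def constrained_loss(contrastive, reconstruction, multiplier, bound):
    """Lagrangian: contrastive loss plus the weighted constraint violation."""
    return contrastive + multiplier * (reconstruction - bound)


def update_multiplier(multiplier, reconstruction, bound, step_size=0.1):
    """Gradient-ascent step on the multiplier, clipped to stay non-negative."""
    return max(0.0, multiplier + step_size * (reconstruction - bound))
```

In a training loop under these assumptions, the encoders and decoder would be updated to minimize `constrained_loss`, and `update_multiplier` would be called after each step. While the reconstruction loss is below the bound, the multiplier shrinks toward zero and the contrastive loss dominates, which matches the idea of primarily optimizing the contrastive objective. A dual-objective variant would instead use a fixed weight in place of the adaptive multiplier.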