Cosine similarity is a common choice for measuring the distance between feature representations in contrastive visual-textual alignment learning. Empirically, however, a learnable softmax temperature parameter is required when learning from large-scale noisy training data. In this work, we first discuss the role of the softmax temperature from the perspective of the embedding space's topological properties. We argue that the softmax temperature is the key mechanism for contrastive learning on noisy training data: it acts as a scaling factor on the distance range (e.g. [-1, 1] for cosine similarity), and its learned value indicates the level of noise in the training data. We then propose an alternative design of the topology for embedding alignment. We make use of multiple class tokens in the transformer architecture and map the feature representations onto an oblique manifold endowed with the negative inner product as the distance function. With this configuration, we improve the zero-shot classification performance of baseline CLIP models pre-trained on large-scale datasets by an average of 6.1\%.
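The distance function described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and it rests on several assumptions. A point on the oblique manifold is represented as `m` unit-norm sub-vectors, obtained here by splitting one embedding into `m` chunks and standing in for the outputs of multiple class tokens. The loss is assumed to be a symmetric InfoNCE-style objective. The names `to_oblique`, `neg_inner_product`, and `contrastive_loss` are hypothetical.

```python
import numpy as np


def to_oblique(x, m):
    """Split each d-dim row into m chunks and L2-normalise every chunk.

    Assumption: each chunk plays the role of one class-token output.
    Returns an array of shape (batch, m, d // m) with unit-norm chunks.
    """
    b, d = x.shape
    assert d % m == 0, "embedding size must be divisible by m"
    chunks = x.reshape(b, m, d // m)
    norms = np.linalg.norm(chunks, axis=-1, keepdims=True)
    return chunks / np.maximum(norms, 1e-12)


def neg_inner_product(u, v):
    """Pairwise distance: negative sum of the m per-chunk cosine similarities.

    The values lie in [-m, m]; with m = 1 this is the negative cosine similarity.
    """
    return -np.einsum("imk,jmk->ij", u, v)


def contrastive_loss(img, txt, m, temperature=1.0):
    """Symmetric InfoNCE-style loss over negative-inner-product distances."""
    u, v = to_oblique(img, m), to_oblique(txt, m)
    # Similarity logits are the negated distances, scaled by the temperature.
    logits = -neg_inner_product(u, v) / temperature

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # stabilise the softmax
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_p))  # matched pairs sit on the diagonal

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

With `m = 1` the distance range is [-1, 1], so a small temperature is needed to stretch the logits. With larger `m` the range widens to [-m, m] without any temperature, which is one way to read the abstract's claim that the temperature acts as a scaling factor on the distance range.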