Contrastive learning (CL) has emerged as a powerful method for training feature-extraction models on unlabeled data. Recent studies suggest that appending a linear projection head to the backbone significantly enhances model performance. In this work, we investigate using a transformer as the projection head within the CL framework, aiming to exploit the transformer's capacity to capture long-range dependencies across embeddings and further improve performance. Our contributions are fourfold. First, we introduce a novel application of transformers in the projection-head role for contrastive learning, the first endeavor of its kind. Second, our experiments reveal a compelling "Deep Fusion" phenomenon, in which the attention mechanism progressively captures the correct relational dependencies among samples of the same class in deeper layers. Third, we provide a theoretical framework that explains and supports this "Deep Fusion" behavior. Finally, we demonstrate experimentally that our model outperforms the existing approach of using a feed-forward projection head.
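To make the setup concrete, the sketch below shows one plausible way a transformer projection head could replace the usual linear head in a SimCLR-style pipeline with an NT-Xent loss. This is a minimal illustration under our own assumptions, not the paper's implementation: the class names, dimensions, depth, and the choice to treat the batch of backbone embeddings as the attention sequence (so attention operates across samples) are all illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformerProjectionHead(nn.Module):
    """Hypothetical sketch: a projection head that self-attends across the
    batch of backbone embeddings, so each projected feature can depend on
    the other samples in the batch (capturing cross-sample dependencies)."""
    def __init__(self, dim=512, proj_dim=128, depth=3, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=2 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.out = nn.Linear(dim, proj_dim)

    def forward(self, h):                  # h: (B, dim) backbone embeddings
        z = self.encoder(h.unsqueeze(0))   # treat the batch as one sequence: (1, B, dim)
        return self.out(z.squeeze(0))      # (B, proj_dim)

def nt_xent(z1, z2, tau=0.5):
    """Standard NT-Xent (SimCLR) contrastive loss over two augmented views."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2B, d), unit-norm
    sim = z @ z.t() / tau                               # pairwise cosine similarities
    n = z.size(0)
    sim.fill_diagonal_(float('-inf'))                   # mask self-similarity
    targets = torch.arange(n, device=z.device)
    targets = (targets + n // 2) % n                    # positive = same image, other view
    return F.cross_entropy(sim, targets)

# Usage sketch: h1, h2 are backbone outputs for two augmented views.
# head = TransformerProjectionHead()
# loss = nt_xent(head(h1), head(h2))
```

The key design choice in this sketch is that attention runs over the batch dimension rather than over tokens within one input; under that assumption, deeper layers can mix information among embeddings of same-class samples, which is the mechanism the "Deep Fusion" observation points to.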