We present Contextualized Local Visual Embeddings (CLoVE), a self-supervised convolution-based method that learns representations suited for dense prediction tasks. Unlike current methods, CLoVE optimizes a single loss function that operates at the level of contextualized local embeddings learned from the output feature maps of convolutional neural network (CNN) encoders. To learn contextualized embeddings, CLoVE introduces a normalized multi-head self-attention layer that combines local features from different parts of an image based on their similarity. We extensively benchmark CLoVE's pre-trained representations on multiple datasets. CLoVE reaches state-of-the-art performance for CNN-based architectures in four dense prediction downstream tasks: object detection, instance segmentation, keypoint detection, and dense pose estimation.
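To make the idea of combining local features by similarity concrete, the following is a minimal sketch and not the CLoVE implementation. It assumes that "normalized" means L2-normalizing each head's slice of a local feature so that dot products become cosine similarities, that no learned query, key, or value projections are used, and that a softmax temperature controls how sharply similar locations are weighted. The names `contextualize`, `num_heads`, and `temperature` are hypothetical.

```python
import math


def l2_normalize(v, eps=1e-12):
    """Scale a vector to (approximately) unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def contextualize(local_feats, num_heads=2, temperature=0.1):
    """Mix local embeddings by similarity (illustrative sketch only).

    local_feats: list of N local embeddings, each a list of C floats,
                 e.g. the H*W positions of a CNN feature map, flattened.
    Returns N contextualized embeddings, each of length C.
    """
    dim = len(local_feats[0])
    assert dim % num_heads == 0, "feature dim must split evenly across heads"
    head_dim = dim // num_heads
    out = [[] for _ in local_feats]

    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        # Assumption: each head sees an L2-normalized slice of the feature.
        heads = [l2_normalize(f[lo:hi]) for f in local_feats]
        for i, query in enumerate(heads):
            # Cosine similarity between this location and every location.
            sims = [sum(a * b for a, b in zip(query, key)) / temperature
                    for key in heads]
            weights = softmax(sims)
            # Weighted average of all locations' (normalized) features.
            mixed = [sum(w * v[d] for w, v in zip(weights, heads))
                     for d in range(head_dim)]
            out[i].extend(mixed)
    return out


# Three local features: the first two are similar, the third differs.
feats = [[1.0, 0.0, 0.0, 1.0],
         [0.9, 0.1, 0.1, 0.9],
         [0.0, 1.0, 1.0, 0.0]]
ctx = contextualize(feats)
```

In this toy example the first two locations attend mostly to each other, while the third attends mostly to itself, so each contextualized embedding stays close to the features of the locations it resembles.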