Deep neural networks have reached human-level performance on many computer vision tasks. However, the objectives used to train these networks enforce only that similar images are embedded at similar locations in the representation space, and do not directly constrain the global structure of the resulting space. Here, we explore the impact of supervising this global structure by linearly aligning it with human similarity judgments. We find that a naive approach leads to large changes in local representational structure that harm downstream performance. Thus, we propose a novel method that aligns the global structure of representations while preserving their local structure. This global-local transform considerably improves accuracy across a variety of few-shot learning and anomaly detection tasks. Our results indicate that human visual representations are globally organized in a way that facilitates learning from few examples, and that incorporating this global structure into neural network representations improves performance on downstream tasks.