Deep neural networks have reached human-level performance on many computer vision tasks. However, the objectives used to train these networks enforce only that similar images are embedded at similar locations in the representation space, and do not directly constrain the global structure of the resulting space. Here, we explore the impact of supervising this global structure by linearly aligning it with human similarity judgments. We find that a naive approach leads to large changes in local representational structure that harm downstream performance. Thus, we propose a novel method that aligns the global structure of representations while preserving their local structure. This global-local transform considerably improves accuracy across a variety of few-shot learning and anomaly detection tasks. Our results indicate that human visual representations are globally organized in a way that facilitates learning from few examples, and incorporating this global structure into neural network representations improves performance on downstream tasks.