Diffused Redundancy in Pre-trained Representations

Representations learned by pre-training a neural network on a large dataset are increasingly and successfully used for a variety of downstream tasks. In this work, we take a closer look at how features are encoded in such pre-trained representations. We find that learned representations in a given layer exhibit a degree of diffuse redundancy, i.e., any randomly chosen subset of neurons in the layer that is larger than a threshold size shares a large degree of similarity with the full layer and performs comparably to the whole layer on a variety of downstream tasks. For example, a linear probe trained on $20\%$ of randomly picked neurons from the penultimate layer of a ResNet50 pre-trained on ImageNet1k achieves an accuracy within $5\%$ of a linear probe trained on the full layer of neurons for downstream CIFAR10 classification. We conduct experiments on different neural architectures (including CNNs and Transformers) pre-trained on both ImageNet1k and ImageNet21k and evaluate a variety of downstream tasks taken from the VTAB benchmark. We find that the loss and dataset used during pre-training largely govern the degree of diffuse redundancy, and that the "critical mass" of neurons needed often depends on the downstream task, suggesting that there is a task-inherent redundancy-performance Pareto frontier. Our findings shed light on the nature of representations learned by pre-trained deep neural networks and suggest that entire layers might not be necessary to perform many downstream tasks. We investigate the potential for exploiting this redundancy to achieve efficient generalization on downstream tasks and also caution against certain possible unintended consequences. Our code is available at \url{https://github.com/nvedant07/diffused-redundancy}.
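
The random-subset probing comparison described above can be illustrated with a minimal sketch. This is not the code from the linked repository: the features are synthetic stand-ins for a frozen layer (a few latent factors spread across many neurons), the probe is a closed-form ridge regression onto one-hot labels rather than whatever probe the authors trained, and all function names and sizes are hypothetical choices made for illustration.

```python
import numpy as np


def make_synthetic_features(n_samples, n_neurons, n_latent, n_classes, rng):
    """Hypothetical stand-in for a frozen layer: few latent factors, many neurons."""
    latents = rng.normal(size=(n_samples, n_latent))
    mixing = rng.normal(size=(n_latent, n_neurons))
    # Every neuron is a noisy linear mix of all latents, so information is diffuse.
    features = latents @ mixing + 0.1 * rng.normal(size=(n_samples, n_neurons))
    class_weights = rng.normal(size=(n_latent, n_classes))
    labels = np.argmax(latents @ class_weights, axis=1)
    return features, labels


def fit_linear_probe(x, y, n_classes, ridge=1e-3):
    """Closed-form ridge regression onto one-hot targets (a simple linear probe)."""
    x_aug = np.hstack([x, np.ones((x.shape[0], 1))])  # append a bias column
    targets = np.eye(n_classes)[y]
    gram = x_aug.T @ x_aug + ridge * np.eye(x_aug.shape[1])
    return np.linalg.solve(gram, x_aug.T @ targets)


def probe_accuracy(weights, x, y):
    """Fraction of samples whose highest-scoring class matches the label."""
    x_aug = np.hstack([x, np.ones((x.shape[0], 1))])
    return float(np.mean(np.argmax(x_aug @ weights, axis=1) == y))


def random_subset_accuracy(x_train, y_train, x_test, y_test, fraction, n_classes, rng):
    """Train and evaluate a probe on a uniformly random subset of neurons."""
    n_neurons = x_train.shape[1]
    k = max(1, int(round(fraction * n_neurons)))
    idx = rng.choice(n_neurons, size=k, replace=False)
    weights = fit_linear_probe(x_train[:, idx], y_train, n_classes)
    return probe_accuracy(weights, x_test[:, idx], y_test)


rng = np.random.default_rng(0)
n_classes = 10
features, labels = make_synthetic_features(2000, 512, 16, n_classes, rng)
x_train, x_test = features[:1500], features[1500:]
y_train, y_test = labels[:1500], labels[1500:]

# Compare probes trained on 5%, 20%, and 100% of the neurons.
results = {
    fraction: random_subset_accuracy(
        x_train, y_train, x_test, y_test, fraction, n_classes, rng
    )
    for fraction in (0.05, 0.2, 1.0)
}
for fraction, accuracy in results.items():
    print(f"fraction of neurons = {fraction:.2f}, test accuracy = {accuracy:.3f}")
```

To try this on real representations, one would replace `make_synthetic_features` with penultimate-layer activations extracted from a pre-trained model and the corresponding downstream labels; the numbers printed by the synthetic version say nothing about the results reported in the paper.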
