Input space reconstruction is an attractive representation learning paradigm. Although reconstruction and generation are interpretable, we identify a misalignment between learning by reconstruction and learning for perception. We show that the former allocates a model's capacity toward a subspace of the data that explains the observed variance, a subspace whose features are uninformative for the latter. For example, the supervised TinyImagenet task with images projected onto the top subspace explaining 90\% of the pixel variance can be solved with 45\% test accuracy. Using the bottom subspace instead, which accounts for only 20\% of the pixel variance, reaches 55\% test accuracy. Because the features useful for perception are learned last, long training times are needed, e.g., with Masked Autoencoders. Learning by denoising is a popular strategy to alleviate that misalignment. We prove that while some noise strategies such as masking are indeed beneficial, others such as additive Gaussian noise are not. Yet, even in the case of masking, we find that the benefits vary as a function of the mask's shape and ratio and of the dataset considered. While tuning the noise strategy without knowledge of the perception task seems challenging, we provide first clues on how to detect whether a noise strategy is never beneficial regardless of the perception task.
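The subspace experiment mentioned in the abstract can be illustrated with a minimal sketch. It assumes NumPy and uses synthetic data in place of TinyImagenet; the function name `pca_subspace_projections` and the choice of a complementary top/bottom split are hypothetical and are not taken from the paper, whose exact subspace construction is not given here. The sketch only shows how data might be projected onto the leading principal directions that reach a target share of variance, and onto the remaining directions.

```python
import numpy as np


def pca_subspace_projections(X, top_fraction=0.9):
    """Project rows of X onto a top and a bottom PCA subspace.

    The top subspace is spanned by the fewest leading principal directions
    whose cumulative explained variance reaches `top_fraction`; the bottom
    subspace is spanned by the remaining directions (an assumption of this
    sketch, not necessarily the split used in the paper).
    """
    Xc = X - X.mean(axis=0, keepdims=True)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S**2 / np.sum(S**2)
    # Smallest k such that the first k directions explain >= top_fraction.
    k = int(np.searchsorted(np.cumsum(explained), top_fraction)) + 1
    k = min(k, len(S))
    V_top, V_bottom = Vt[:k], Vt[k:]
    X_top = Xc @ V_top.T @ V_top
    X_bottom = Xc @ V_bottom.T @ V_bottom
    return X_top, X_bottom, float(explained[:k].sum())


# Synthetic stand-in for flattened images: anisotropic Gaussian features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 64)) * np.linspace(5.0, 0.1, 64)
X_top, X_bottom, top_share = pca_subspace_projections(X, top_fraction=0.9)
```

In an experiment of the kind the abstract describes, a classifier would then be trained separately on `X_top` and on `X_bottom` and the two test accuracies compared; that step is omitted here because it depends on the dataset and model.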