Deep neural networks with short residual connections have demonstrated remarkable success across domains, but increasing depth often introduces computational redundancy without corresponding improvements in representation quality. In this work, we introduce Auto-Compressing Networks (ACNs), an architectural variant in which additive long feedforward connections from each layer to the output replace traditional short residual connections. ACNs exhibit a unique property we coin "auto-compression": the ability of a network to organically compress information during training with gradient descent, through architectural design alone. Through auto-compression, information is dynamically "pushed" into early layers during training, enhancing their representational quality and revealing potential redundancy in deeper ones. We show theoretically that this property emerges from the layer-wise training patterns present in ACNs, where layers are dynamically utilized during training based on task requirements. We also find that ACNs exhibit enhanced noise robustness compared to residual networks, superior performance in low-data settings, improved transfer learning capabilities, and mitigated catastrophic forgetting, suggesting that they learn representations that generalize better despite using fewer parameters. Our results demonstrate up to an 18% reduction in catastrophic forgetting and 30-80% architectural compression while maintaining accuracy across vision transformers, MLP-Mixers, and BERT architectures. Furthermore, we demonstrate that coupling ACNs with traditional pruning techniques enables significantly better sparsity-performance trade-offs compared to conventional architectures. These findings establish ACNs as a practical approach to developing efficient neural architectures that automatically adapt their computational footprint to task complexity while learning robust representations.
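The wiring described above can be illustrated with a short PyTorch-style sketch. This is not the authors' implementation: the class names (`ResidualStack`, `AutoCompressingStack`), the use of simple MLP blocks, and the exact reading of "additive long feedforward connections" (each layer's output summed directly into the network output, with layers chained without short skips) are assumptions made for illustration only.

```python
# Minimal sketch (not the paper's code) contrasting a conventional residual
# stack with an ACN-style stack as described in the abstract.

import torch
import torch.nn as nn


class ResidualStack(nn.Module):
    """Conventional stack: h_l = h_{l-1} + f_l(h_{l-1}), output is h_L."""

    def __init__(self, layers):
        super().__init__()
        self.layers = nn.ModuleList(layers)

    def forward(self, x):
        h = x
        for f in self.layers:
            h = h + f(h)  # short residual connection between consecutive layers
        return h


class AutoCompressingStack(nn.Module):
    """ACN-style stack (one plausible reading): every layer's output is added
    into the network output via a long feedforward connection, and the short
    per-layer residual connections are removed."""

    def __init__(self, layers):
        super().__init__()
        self.layers = nn.ModuleList(layers)

    def forward(self, x):
        h = x
        out = torch.zeros_like(x)
        for f in self.layers:
            h = f(h)          # layers are chained without short skip connections
            out = out + h     # additive long connection from this layer to the output
        return out


# Hypothetical usage with toy MLP blocks (dimensions chosen arbitrarily):
if __name__ == "__main__":
    blocks = [nn.Sequential(nn.Linear(64, 64), nn.GELU()) for _ in range(6)]
    acn = AutoCompressingStack(blocks)
    print(acn(torch.randn(8, 64)).shape)  # torch.Size([8, 64])
```

Under this reading, dropping a deep layer only removes one additive term from the output sum, which is consistent with the abstract's claim that redundant deeper layers can be pruned with little loss in accuracy.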