While it is well-established that the weight matrices and feature manifolds of deep neural networks exhibit a low Intrinsic Dimension (ID), current state-of-the-art models still rely on massive high-dimensional widths. This redundancy is not required for representation, but is strictly necessary to solve the non-convex optimization search problem: finding a global minimum remains intractable for compact networks. In this work, we propose a constructive approach to bypass this optimization bottleneck. By decoupling the solution geometry from the ambient search space, we empirically demonstrate across ResNet-50, ViT, and BERT that the classification head can be compressed by a factor of up to 16 with negligible performance degradation. This motivates Subspace-Native Distillation as a novel paradigm: by defining the distillation target directly in this constructed subspace, we provide a stable geometric coordinate system for student models, potentially allowing them to circumvent the high-dimensional search problem entirely and realize the vision of Train Big, Deploy Small.
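To make the idea more concrete, the following is a minimal PyTorch sketch of one plausible instantiation, assuming the solution subspace is taken from the top-r right singular vectors of a trained classification head; the function names (compress_head, subspace_distill_loss), the SVD-based construction, and the rank r=128 in the usage comment are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

def compress_head(head: nn.Linear, r: int):
    """Factorize a trained linear head into a fixed rank-r subspace projection.

    Sketch under our assumptions: the subspace basis P comes from the top-r
    right singular vectors of the head's weight matrix, so the factorized head
    x -> W_r (P^T x) + b reproduces the original logits up to a rank-r error.
    """
    W = head.weight.detach()                       # (num_classes, d)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    P = Vh[:r].T                                   # (d, r) basis of the constructed subspace
    W_r = U[:, :r] * S[:r]                         # (num_classes, r) head in subspace coordinates

    proj = nn.Linear(W.shape[1], r, bias=False)    # x -> P^T x
    proj.weight.data.copy_(P.T)
    new_head = nn.Linear(r, W.shape[0], bias=head.bias is not None)
    new_head.weight.data.copy_(W_r)
    if head.bias is not None:
        new_head.bias.data.copy_(head.bias.detach())
    return nn.Sequential(proj, new_head), P

def subspace_distill_loss(student_feats_r, teacher_feats, P):
    """Subspace-native distillation target: teacher features expressed in the
    fixed r-dimensional subspace coordinates; the student regresses onto them
    directly instead of searching the full d-dimensional feature space."""
    target = teacher_feats.detach() @ P            # (batch, r)
    return nn.functional.mse_loss(student_feats_r, target)

# Example (illustrative): compress a ResNet-50 head (d=2048, 1000 classes)
# to r=128, i.e. a 16x reduction in head width.
# compact_head, P = compress_head(model.fc, r=128)
```

The key design choice in this sketch is that P is frozen after construction: the student is given fixed geometric coordinates to match, rather than having to rediscover a solution subspace inside the high-dimensional ambient space.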