In modern parametric model training, full-batch gradient descent (and its variants) becomes progressively biased toward the exact realization of the training data; this drives the systematic ``generalization gap'', in which the train error becomes an unreliable proxy for the test error. Existing approaches either argue, via intricate analyses, that this gap is benign, or sacrifice data to a held-out validation set. In contrast, we introduce decoupled descent (DD), a novel theory-based training algorithm that satisfies a train-test identity -- enforcing that the train error asymptotically tracks the test error for stylized Gaussian mixture models. Within this regime, leveraging approximate message passing theory, DD iteratively cancels the biases induced by data reuse, rigorously demonstrating the feasibility of zero-cost validation and $100\%$ data utilization. Moreover, DD is governed by a low-dimensional state-evolution recursion, rendering the dynamics of the algorithm transparent and tractable. We validate DD on XOR classification, where it outperforms GD; we further evaluate it on noisy MNIST and on non-linear probing of CIFAR-10, showing that even when our stylized assumptions are relaxed, DD narrows the generalization gap relative to GD.