Many modern applications of deep learning involve training a neural network via a one-step prediction loss (e.g., $L^2$ regression, cross-entropy), but deploy the network by rolling it out along its own predictions. Key examples include autoregressive language modeling, flow-based generative modeling, and robot policy learning. It is well-documented that these settings induce a phenomenon we call test-time feedback (TTF): a mismatch between the training/validation loss and downstream metrics of interest, such as task success rate and generation quality, that grows with task length. While data curation, architecture, and objective design have been proposed to combat train-test shift in TTF settings, this paper proposes optimization as a new design axis for mitigating error accumulation. Specifically, we introduce a new optimization paradigm called double-preconditioning (DoPr), tailored to the challenges of TTF. DoPr combines gradient-wise preconditioning, as in Adam and Muon, with activation-wise preconditioning (AP), as in KFAC. We show that adding AP yields a drop-in intervention that increases downstream model performance across a range of TTF settings. Interestingly, these gains in test-time performance are not consistently accompanied by improvements in validation loss, opening new questions about how to properly evaluate models trained with one-step supervised objectives.
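To make the notion of test-time feedback concrete, the following is a minimal sketch on a toy problem that is not taken from the paper: a scalar linear system $x_{t+1} = a\,x_t$ and a "model" whose coefficient is slightly off. The function names, the system, and the numerical values are all illustrative assumptions. The sketch compares the one-step error measured on ground-truth states (the analogue of a validation loss) with the error after the model is rolled out on its own predictions.

```python
# Toy illustration (assumed setup, not the paper's experiments):
# true dynamics x_{t+1} = a_true * x_t, learned model x_{t+1} = a_model * x_t.

def one_step_error(a_true, a_model, x0, horizon):
    """Mean absolute one-step error, always evaluated on ground-truth states."""
    x, total = x0, 0.0
    for _ in range(horizon):
        total += abs(a_model * x - a_true * x)
        x = a_true * x  # the next input is the true state, as in teacher forcing
    return total / horizon


def rollout_error(a_true, a_model, x0, horizon):
    """Absolute final-step error when the model consumes its own predictions."""
    x_true, x_model = x0, x0
    for _ in range(horizon):
        x_true = a_true * x_true
        x_model = a_model * x_model  # feedback: prediction becomes next input
    return abs(x_model - x_true)


if __name__ == "__main__":
    A_TRUE, A_MODEL, X0 = 1.0, 1.01, 1.0  # hypothetical values
    for horizon in (10, 50, 100):
        print(
            horizon,
            one_step_error(A_TRUE, A_MODEL, X0, horizon),
            rollout_error(A_TRUE, A_MODEL, X0, horizon),
        )
```

In this toy setting the one-step error stays fixed at roughly the coefficient error regardless of horizon, while the rollout error compounds with the number of steps, which is the kind of gap between validation loss and downstream behavior that the abstract describes.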