We investigate the fundamental optimization question of minimizing a target function $f$ whose gradients are expensive to compute or have limited availability, given access to some auxiliary side function $h$ whose gradients are cheap or more available. This formulation captures many settings of practical relevance, such as i) re-using batches in SGD, ii) transfer learning, iii) federated learning, and iv) training with compressed models/dropout. We propose two generic new algorithms that apply in all these settings, and we prove that one can benefit from this framework under a Hessian similarity assumption between the target function and the side information. A benefit is obtained when this similarity measure is small; we also show a potential benefit from stochasticity when the auxiliary noise is correlated with that of the target function.
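For concreteness, a minimal sketch of the setting follows, assuming the standard $\delta$-Hessian-similarity condition from the distributed-optimization literature and an illustrative SVRG-style bias-corrected step; both are assumptions of this sketch, not necessarily the exact condition or algorithms proposed in the paper.

% Assumed formalization of Hessian similarity: for some small \delta \ge 0
% and all x,
\[
  \bigl\| \nabla^2 f(x) - \nabla^2 h(x) \bigr\| \le \delta .
\]
% Illustrative bias-corrected step using cheap gradients of h, anchored at
% a point x_0 where \nabla f was computed once:
\[
  x_{t+1} = x_t - \gamma \bigl( \nabla h(x_t) - \nabla h(x_0) + \nabla f(x_0) \bigr).
\]
% Under the condition above, the bias of this surrogate gradient relative
% to \nabla f(x_t) is at most \delta \, \| x_t - x_0 \|, so a small \delta
% lets the side gradients stand in for the expensive target gradients.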