We investigate the fundamental optimization question of minimizing a target function $f$ whose gradients are expensive to compute or have limited availability, given access to an auxiliary side function $h$ whose gradients are cheap or more readily available. This formulation captures many settings of practical relevance, such as i) re-using batches in SGD, ii) transfer learning, iii) federated learning, and iv) training with compressed models/dropout. We propose two generic new algorithms that apply in all these settings and prove that we can benefit from this framework using only an assumption on the Hessian similarity between the target and side information. A benefit is obtained when this similarity measure is small; we also show a potential benefit from stochasticity when the auxiliary noise is correlated with that of the target function.
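To make the setting concrete, the following is a minimal sketch of one generic way to use a cheap auxiliary gradient; it is an illustration only and is not claimed to be either of the two algorithms proposed here. Everything in it is an assumption chosen for the example: the quadratic $f$ and $h$, the matrices `A` and `H` (whose difference plays the role of a small Hessian dissimilarity), the function names `aux_gd` and `plain_gd`, and the step size. Each outer round spends one gradient of $f$ to form a correction term, then takes several cheap steps using only gradients of $h$.

```python
import numpy as np

# Hypothetical toy problem: f and h are quadratics with similar Hessians
# (they differ by 0.1 * I) but different linear terms, hence different minimizers.
A = np.diag([1.0, 10.0])        # Hessian of the target f
b = np.array([1.0, -2.0])
H = A + 0.1 * np.eye(2)         # Hessian of the auxiliary h
c = np.array([0.5, 1.0])


def grad_f(x):
    """Gradient of f(x) = 0.5 x^T A x - b^T x (treated as expensive)."""
    return A @ x - b


def grad_h(x):
    """Gradient of h(x) = 0.5 x^T H x - c^T x (treated as cheap)."""
    return H @ x - c


def aux_gd(x0, rounds, inner_steps, lr):
    """Bias-corrected auxiliary steps: one grad_f call per outer round."""
    x = x0.copy()
    for _ in range(rounds):
        # Correction that makes the surrogate gradient equal grad_f at x.
        correction = grad_f(x) - grad_h(x)
        y = x.copy()
        for _ in range(inner_steps):
            y = y - lr * (grad_h(y) + correction)  # uses only cheap gradients
        x = y
    return x


def plain_gd(x0, steps, lr):
    """Baseline: gradient descent on f, one grad_f call per step."""
    x = x0.copy()
    for _ in range(steps):
        x = x - lr * grad_f(x)
    return x


x_star = np.linalg.solve(A, b)  # exact minimizer of f
x0 = np.zeros(2)

# Both methods are given the same budget of 10 expensive gradient calls.
err_aux = np.linalg.norm(aux_gd(x0, rounds=10, inner_steps=20, lr=0.09) - x_star)
err_gd = np.linalg.norm(plain_gd(x0, steps=10, lr=0.09) - x_star)
print("error with auxiliary steps:", err_aux)
print("error with plain GD:      ", err_gd)
```

In this toy example the correction term cancels the mismatch between the two linear terms, so the remaining error per round is governed by how close the two Hessians are. This mirrors, informally, the claim that a benefit is obtained when the similarity measure is small.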