Data mixing, the strategic reweighting of training domains, is a critical component in training robust machine learning models. This problem is naturally formulated as a bilevel optimization task, where the outer loop optimizes domain weights to minimize validation loss, and the inner loop optimizes model parameters to minimize the weighted training loss. Classical bilevel optimization relies on hypergradients, which theoretically require the inner optimization to reach convergence. However, due to computational constraints, state-of-the-art methods use a finite, often small, number of inner update steps before updating the weights. The theoretical implications of this approximation are not well understood. In this work, we rigorously analyze the convergence behavior of data mixing with a finite number of inner steps $T$. We prove that the "greedy" practical approach of using $T=1$ can fail even in a simple quadratic example. Under a fixed parameter update budget $N$ and assuming the per-domain losses are strongly convex, we show that the optimal $T$ scales as $\Theta(\log N)$ (resp., $\Theta((N \log N)^{1/2})$) for the data mixing problem with access to full (resp., stochastic) gradients. We complement our theoretical results with proof-of-concept experiments.
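
To make the setup concrete, here is a minimal rendering of the bilevel formulation described above; the symbols $w$, $\theta$, $L_i$, $L_{\mathrm{val}}$, $\eta$, and the number of domains $k$ are illustrative notation introduced here, not necessarily the paper's. The outer problem chooses domain weights on the simplex, and the inner problem fits the parameters to the $w$-weighted training loss:

$$\min_{w \in \Delta^{k-1}} \; L_{\mathrm{val}}\bigl(\theta^{\star}(w)\bigr) \quad \text{subject to} \quad \theta^{\star}(w) \in \operatorname*{arg\,min}_{\theta} \; \sum_{i=1}^{k} w_i\, L_i(\theta),$$

where $\Delta^{k-1}$ denotes the probability simplex over the $k$ training domains. In practice, the inner $\arg\min$ is approximated by $T$ gradient steps between consecutive updates of $w$,

$$\theta_{t+1} = \theta_t - \eta\, \nabla_\theta \sum_{i=1}^{k} w_i\, L_i(\theta_t), \qquad t = 0, \dots, T-1,$$

so a total budget of $N$ parameter updates allows roughly $N/T$ outer updates of the weights; the question analyzed in this work is how $T$ should scale with $N$ under this budget.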