Multi-objective optimization (MOO) is receiving growing attention in fields such as multi-task learning. Recent work provides effective algorithms with theoretical analysis, but these results rely on the standard $L$-smooth or bounded-gradient assumptions, which typically do not hold for neural networks such as recurrent neural networks (RNNs) and transformers. In this paper, we study a more general and realistic class of $\ell$-smooth loss functions, where $\ell$ is a general non-decreasing function of the gradient norm. We develop two novel single-loop algorithms for $\ell$-smooth MOO problems, Generalized Smooth Multi-objective Gradient descent (GSMGrad) and its stochastic variant, Stochastic Generalized Smooth Multi-objective Gradient descent (SGSMGrad), both of which approximate the conflict-avoidant (CA) direction that maximizes the minimum improvement among objectives. We provide a comprehensive convergence analysis of both algorithms and show that they converge to an $\epsilon$-accurate Pareto stationary point with a guaranteed $\epsilon$-level average CA distance (i.e., the gap between the update direction and the CA direction) over all iterations, requiring a total of $\mathcal{O}(\epsilon^{-2})$ and $\mathcal{O}(\epsilon^{-4})$ samples in the deterministic and stochastic settings, respectively. Our algorithms can also guarantee a tighter $\epsilon$-level CA distance in each iteration by using more samples. Moreover, we propose a practical variant of GSMGrad, named GSMGrad-FA, that uses only constant-level time and space while achieving the same performance guarantee as GSMGrad. Our experiments validate our theory and demonstrate the effectiveness of the proposed methods.
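To make the notion of a CA direction concrete, the following minimal sketch computes, for two objectives only, the minimum-norm point in the convex hull of the two gradients. This is one common formalization of a direction that maximizes the minimum improvement among objectives (as in MGDA-style methods); that the paper uses exactly this form is an assumption here, and the function name `min_norm_direction` is hypothetical. The sketch does not reproduce the GSMGrad or SGSMGrad updates themselves.

```python
import numpy as np

def min_norm_direction(g1, g2):
    """Return (d, w): the min-norm point d = w*g1 + (1-w)*g2 with w in [0, 1].

    Illustrative only: a two-objective closed form, assumed (not confirmed)
    to match the CA direction described in the abstract.
    """
    g1 = np.asarray(g1, dtype=float)
    g2 = np.asarray(g2, dtype=float)
    diff = g1 - g2
    denom = diff @ diff
    if denom == 0.0:
        # Identical gradients: any weight gives the same direction.
        return g1.copy(), 0.5
    # Minimizer of ||g2 + w*(g1 - g2)||^2 over the reals, clipped to [0, 1].
    w = float(np.clip(((g2 - g1) @ g2) / denom, 0.0, 1.0))
    return w * g1 + (1.0 - w) * g2, w

# Two partially conflicting gradients (negative inner product).
g1 = np.array([1.0, 0.0])
g2 = np.array([-0.5, 1.0])
d, w = min_norm_direction(g1, g2)

# Stepping along -d does not increase either objective to first order,
# because d has a non-negative inner product with both gradients.
print("weight on g1:", w)
print("direction d:", d)
print("inner products:", g1 @ d, g2 @ d)
```

For the min-norm point, each inner product $\langle g_i, d \rangle$ is at least $\|d\|^2$, so a small step along $-d$ decreases every objective to first order whenever $d \neq 0$; $d = 0$ corresponds to Pareto stationarity.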