Multi-objective optimization (MOO) has received increasing attention in fields such as multi-task learning. Recent works provide effective algorithms with theoretical analysis, but their guarantees rely on the standard $L$-smoothness or bounded-gradient assumptions, which typically fail to hold for neural networks such as recurrent neural networks (RNNs) and transformers. In this paper, we study a more general and realistic class of $\ell$-smooth loss functions, where $\ell$ is a non-decreasing function of the gradient norm. We develop two novel single-loop algorithms for $\ell$-smooth MOO problems: Generalized Smooth Multi-objective Gradient descent (GSMGrad) and its stochastic variant, Stochastic Generalized Smooth Multi-objective Gradient descent (SGSMGrad). Both approximate the conflict-avoidant (CA) direction, which maximizes the minimum improvement across objectives. We provide a comprehensive convergence analysis of both algorithms and show that they converge to an $\epsilon$-accurate Pareto stationary point with a guaranteed $\epsilon$-level average CA distance (i.e., the gap between the updating direction and the CA direction) over all iterations, requiring $\mathcal{O}(\epsilon^{-2})$ and $\mathcal{O}(\epsilon^{-4})$ samples in total for the deterministic and stochastic settings, respectively. Our algorithms can also guarantee a tighter $\epsilon$-level CA distance in each iteration at the cost of additional samples. Moreover, we propose a practical variant of GSMGrad, named GSMGrad-FA, that uses only constant-level time and space while achieving the same performance guarantees as GSMGrad. Our experiments validate our theory and demonstrate the effectiveness of the proposed methods.
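The abstract uses $\ell$-smoothness without a formal statement; as a point of reference, the standard formalization in the generalized-smoothness literature (assumed here to match the paper's definition, which may differ in its exact form) requires each twice-differentiable objective $f_i$ to satisfy
\[
\big\|\nabla^2 f_i(x)\big\| \;\le\; \ell\big(\|\nabla f_i(x)\|\big) \quad \text{for all } x,
\]
where $\ell$ is non-decreasing. Taking $\ell(u) = L$ recovers classical $L$-smoothness, while $\ell(u) = L_0 + L_1 u$ recovers the $(L_0, L_1)$-smoothness condition that has been observed empirically to better describe the loss landscapes of RNNs and transformers.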
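For readers unfamiliar with the CA direction: in the MGDA-style formulation common in this literature (stated here as an assumed formalization, since the abstract does not define it), the CA direction at a point $x$ for objectives $f_1, \dots, f_K$ solves
\[
d^*(x) \;=\; \operatorname*{arg\,max}_{d \in \mathbb{R}^n} \; \min_{i \in [K]} \; \big\langle -\nabla f_i(x),\, d \big\rangle \;-\; \tfrac{1}{2}\,\|d\|^2,
\]
whose solution is the min-norm convex combination of the negative gradients, $d^*(x) = -\sum_{i=1}^{K} \lambda_i^* \nabla f_i(x)$ with $\lambda^* \in \operatorname*{arg\,min}_{\lambda \in \Delta^K} \big\|\sum_{i=1}^{K} \lambda_i \nabla f_i(x)\big\|^2$, where $\Delta^K$ denotes the probability simplex. The $\epsilon$-level CA distance in the guarantees above then bounds the gap $\|d_t - d^*(x_t)\|$ between the algorithm's update direction $d_t$ and this ideal direction.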