Multi-objective optimization (MOO) has received increasing attention in fields such as multi-task learning. Recent works provide effective algorithms with theoretical analysis, but their guarantees rely on the standard $L$-smoothness or bounded-gradient assumptions, which typically fail to hold for neural networks such as recurrent neural networks (RNNs) and transformers. In this paper, we study a more general and realistic class of $\ell$-smooth loss functions, where $\ell$ is a non-decreasing function of the gradient norm. We develop two novel single-loop algorithms for $\ell$-smooth MOO problems: Generalized Smooth Multi-objective Gradient descent (GSMGrad) and its stochastic variant, Stochastic Generalized Smooth Multi-objective Gradient descent (SGSMGrad). Both approximate the conflict-avoidant (CA) direction, which maximizes the minimum improvement across objectives. We provide a comprehensive convergence analysis of both algorithms and show that they converge to an $\epsilon$-accurate Pareto stationary point with a guaranteed $\epsilon$-level average CA distance (i.e., the gap between the updating direction and the CA direction) over all iterations, requiring $\mathcal{O}(\epsilon^{-2})$ and $\mathcal{O}(\epsilon^{-4})$ samples in total for the deterministic and stochastic settings, respectively. Our algorithms can also guarantee a tighter $\epsilon$-level CA distance in each iteration at the cost of additional samples. Moreover, we propose a practical variant of GSMGrad, named GSMGrad-FA, that uses only constant-level time and space while achieving the same performance guarantees as GSMGrad. Our experiments validate our theory and demonstrate the effectiveness of the proposed methods.
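The abstract uses $\ell$-smoothness without a formal statement; as a point of reference, the standard formalization in the generalized-smoothness literature (assumed here to match the paper's definition, which may differ in its exact form) requires each twice-differentiable objective $f_i$ to satisfy
\[
\big\|\nabla^2 f_i(x)\big\| \;\le\; \ell\big(\|\nabla f_i(x)\|\big) \quad \text{for all } x,
\]
where $\ell$ is non-decreasing. Taking $\ell(u) = L$ recovers classical $L$-smoothness, while $\ell(u) = L_0 + L_1 u$ recovers the $(L_0, L_1)$-smoothness condition that has been observed empirically to better describe the loss landscapes of RNNs and transformers.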
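For readers unfamiliar with the CA direction: in the MGDA-style formulation common in this literature (stated here as an assumed formalization, since the abstract does not define it), the CA direction at a point $x$ for objectives $f_1, \dots, f_K$ solves
\[
d^*(x) \;=\; \operatorname*{arg\,max}_{d \in \mathbb{R}^n} \; \min_{i \in [K]} \; \big\langle -\nabla f_i(x),\, d \big\rangle \;-\; \tfrac{1}{2}\,\|d\|^2,
\]
whose solution is the min-norm convex combination of the negative gradients, $d^*(x) = -\sum_{i=1}^{K} \lambda_i^* \nabla f_i(x)$ with $\lambda^* \in \operatorname*{arg\,min}_{\lambda \in \Delta^K} \big\|\sum_{i=1}^{K} \lambda_i \nabla f_i(x)\big\|^2$, where $\Delta^K$ denotes the probability simplex. The $\epsilon$-level CA distance in the guarantees above then bounds the gap $\|d_t - d^*(x_t)\|$ between the algorithm's update direction $d_t$ and this ideal direction.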