When optimizing machine learning models, gradient computations are challenging or even infeasible in many scenarios. Furthermore, in reinforcement learning (RL), preference-based RL, which relies only on comparisons between options, has wide applications, including reinforcement learning with human feedback in large language models. In this paper, we systematically study the optimization of a smooth function $f\colon\mathbb{R}^n\to\mathbb{R}$, assuming only an oracle that compares the function values at two points and reports which is larger. When $f$ is convex, we give two algorithms that use $\tilde{O}(n/\epsilon)$ and $\tilde{O}(n^{2})$ comparison queries, respectively, to find an $\epsilon$-optimal solution. When $f$ is nonconvex, our algorithm uses $\tilde{O}(n/\epsilon^2)$ comparison queries to find an $\epsilon$-approximate stationary point. All of these results match the best-known zeroth-order algorithms with function evaluation queries in their dependence on $n$, thus suggesting that \emph{comparisons are all you need for optimizing smooth functions using derivative-free methods}. In addition, we give an algorithm for escaping saddle points and reaching an $\epsilon$-second-order stationary point of a nonconvex $f$, using $\tilde{O}(n^{1.5}/\epsilon^{2.5})$ comparison queries.
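To make the query model concrete, the following is a minimal Python sketch of a comparison oracle together with a naive random-direction search that uses nothing but that oracle. It is an illustration under stated assumptions and is not one of the algorithms of this paper: the names `make_comparison_oracle` and `comparison_search`, the step-halving rule, and the default parameters are all hypothetical choices, and no query-complexity guarantee is claimed for it.

```python
import numpy as np


def make_comparison_oracle(f):
    """Wrap f so callers learn only which of two points has the smaller value."""
    def oracle(x, y):
        # True iff f(x) < f(y); the function values themselves are never exposed.
        return f(x) < f(y)
    return oracle


def comparison_search(oracle, x0, step=0.5, iters=2000, seed=0):
    """Naive random-direction search driven purely by comparison queries.

    Hypothetical illustration, not an algorithm from the paper.
    Returns the final iterate and the number of comparison queries used.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    queries = 0
    for _ in range(iters):
        # Draw a uniformly random unit direction.
        u = rng.standard_normal(x.shape)
        u /= np.linalg.norm(u)
        moved = False
        # Try a step along +u, then along -u; keep the first strict improvement.
        for candidate in (x + step * u, x - step * u):
            queries += 1
            if oracle(candidate, x):
                x = candidate
                moved = True
                break
        if not moved:
            # Neither direction improved: assume the step is too long and halve it.
            step *= 0.5
    return x, queries


# Example use on a smooth convex test function (an assumed toy objective).
f = lambda z: float(np.sum(z ** 2))
x_final, n_queries = comparison_search(make_comparison_oracle(f), np.ones(5))
```

Because the search moves only when the oracle reports a strict improvement, the objective value never increases along the iterates, and each iteration costs at most two comparison queries.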