The Frank-Wolfe method has become increasingly useful in statistical and machine learning applications because of the structure-inducing properties of its iterates, especially in settings where linear minimization over the feasible set is computationally cheaper than projection. In Empirical Risk Minimization, one of the fundamental optimization problems in statistical and machine learning, the computational cost of Frank-Wolfe methods typically grows linearly in the number of data observations $n$, in stark contrast to typical stochastic projection methods. To reduce this dependence on $n$, we exploit the second-order smoothness of typical smooth loss functions (least squares loss and logistic loss, for example) and propose amending the Frank-Wolfe method with Taylor series-approximated gradients, with variants for both deterministic and stochastic settings. Compared with current state-of-the-art methods in the regime where the optimality tolerance $\varepsilon$ is sufficiently small, our methods simultaneously reduce the dependence on large $n$ and obtain the optimal convergence rates of Frank-Wolfe methods, in both the convex and non-convex settings. We also propose a novel adaptive step-size approach with computational guarantees. Finally, we present computational experiments showing that our methods achieve significant speed-ups over existing methods on real-world datasets for both convex and non-convex binary classification problems.
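For orientation, the baseline that the abstract refers to can be sketched as follows. This is a minimal illustration of the standard Frank-Wolfe iteration, not the Taylor-approximated variant proposed in the paper. It assumes a least squares loss, an $\ell_1$-ball feasible set, and the classical step size $2/(k+2)$; all function names (`lmo_l1_ball`, `least_squares_grad`, `objective`, `frank_wolfe`) are hypothetical and chosen only for this example. It shows the two points made above: the linear minimization step touches a single coordinate, so the iterates stay sparse, while each full gradient costs $O(nd)$ and therefore scales linearly in $n$.

```python
import random


def lmo_l1_ball(grad, radius):
    """Linear minimization oracle: argmin of <grad, s> over ||s||_1 <= radius.

    The minimizer is a signed vertex of the l1 ball, so it has at most
    one nonzero entry. This is what keeps Frank-Wolfe iterates sparse.
    """
    j = max(range(len(grad)), key=lambda i: abs(grad[i]))
    s = [0.0] * len(grad)
    if grad[j] != 0.0:
        s[j] = -radius if grad[j] > 0 else radius
    return s


def least_squares_grad(X, y, w):
    """Full gradient of f(w) = (1/2n) * sum_i (x_i^T w - y_i)^2.

    Every observation is visited once, so the cost is O(n * d).
    """
    n, d = len(X), len(w)
    g = [0.0] * d
    for xi, yi in zip(X, y):
        residual = sum(a * b for a, b in zip(xi, w)) - yi
        for j in range(d):
            g[j] += residual * xi[j] / n
    return g


def objective(X, y, w):
    """Least squares empirical risk f(w) = (1/2n) * sum_i (x_i^T w - y_i)^2."""
    n = len(X)
    total = 0.0
    for xi, yi in zip(X, y):
        total += (sum(a * b for a, b in zip(xi, w)) - yi) ** 2
    return total / (2 * n)


def frank_wolfe(X, y, radius, num_iters):
    """Standard Frank-Wolfe with the classical step size 2 / (k + 2).

    Assumption for this sketch: exact full gradients are used at every
    iteration; no Taylor approximation or adaptive step size is applied.
    """
    w = [0.0] * len(X[0])
    for k in range(num_iters):
        g = least_squares_grad(X, y, w)
        s = lmo_l1_ball(g, radius)
        gamma = 2.0 / (k + 2)
        # Convex combination keeps the iterate inside the l1 ball.
        w = [(1 - gamma) * wi + gamma * si for wi, si in zip(w, s)]
    return w


if __name__ == "__main__":
    # Synthetic, noise-free data generated only for this demonstration.
    random.seed(0)
    n, d = 50, 5
    w_true = [1.0, -0.5, 0.0, 0.0, 0.0]
    X = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    y = [sum(a * b for a, b in zip(xi, w_true)) for xi in X]
    w_hat = frank_wolfe(X, y, radius=2.0, num_iters=500)
    print("objective at zero:", objective(X, y, [0.0] * d))
    print("objective at w_hat:", objective(X, y, w_hat))
```

In this sketch the per-iteration cost is dominated by `least_squares_grad`, which is the $n$-dependent step that the abstract says the proposed methods aim to reduce.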