We present new large-scale algorithms for fitting a subgradient regularized multivariate convex regression function to $n$ samples in $d$ dimensions, a key problem in shape constrained nonparametric regression with applications in statistics, engineering and the applied sciences. The infinite-dimensional learning task can be expressed via a convex quadratic program (QP) with $O(nd)$ decision variables and $O(n^2)$ constraints. While instances with $n$ in the low thousands can be addressed with current algorithms within reasonable runtimes, solving larger problems (e.g., $n\approx 10^4$ or $10^5$) is computationally challenging. To this end, we present an active-set-type algorithm on the dual QP. For computational scalability, we allow for approximate optimization of the reduced sub-problems and propose randomized augmentation rules for expanding the active set. We derive novel computational guarantees for our algorithms. We demonstrate that our framework can approximately solve instances of the subgradient regularized convex regression problem with $n=10^5$ and $d=10$ within minutes, and that it shows strong computational performance compared to earlier approaches.
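For orientation, the QP mentioned above can be written in one common form as follows. This is a sketch under assumptions, not necessarily the exact formulation used in the paper: the data $(x_i, y_i)$, the fitted values $\phi_i$, the subgradients $\xi_i$, the regularization parameter $\rho > 0$ and the $1/n$ scaling are notation introduced here for illustration.

$$
\min_{\phi_1,\dots,\phi_n \in \mathbb{R},\ \xi_1,\dots,\xi_n \in \mathbb{R}^d}
\ \frac{1}{n}\sum_{i=1}^{n} (y_i - \phi_i)^2 + \frac{\rho}{n}\sum_{i=1}^{n} \|\xi_i\|_2^2
\quad \text{s.t.} \quad
\phi_j - \phi_i \ \ge\ \langle \xi_i,\ x_j - x_i \rangle \quad \text{for all } i \neq j .
$$

Under this assumed form there are $n(d+1) = O(nd)$ decision variables and $n(n-1) = O(n^2)$ constraints, which matches the counts stated in the abstract. The dual QP has one multiplier per constraint, so an active set can be read as a subset of the index pairs $(i,j)$.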