Omnipredictors for Regression and the Approximate Rank of Convex Functions

Consider the supervised learning setting where the goal is to learn to predict labels $\mathbf y$ given points $\mathbf x$ from a distribution. An \textit{omnipredictor} for a class $\mathcal L$ of loss functions and a class $\mathcal C$ of hypotheses is a predictor whose predictions incur less expected loss than the best hypothesis in $\mathcal C$ for every loss in $\mathcal L$. Since the work of [GKR+21] that introduced the notion, there has been a large body of work in the setting of binary labels where $\mathbf y \in \{0, 1\}$, but much less is known about the regression setting where $\mathbf y \in [0,1]$ can be continuous. Our main conceptual contribution is the notion of \textit{sufficient statistics} for loss minimization over a family of loss functions: a set of statistics about a distribution such that knowing them allows one to take actions that minimize the expected loss for any loss in the family. The notion of sufficient statistics relates directly to the approximate rank of the family of loss functions. Our key technical contribution is a bound of $O(1/\epsilon^{2/3})$ on the $\epsilon$-approximate rank of convex, Lipschitz functions on the interval $[0,1]$, which we show is tight up to a factor of $\mathrm{polylog}(1/\epsilon)$. This yields improved runtimes for learning omnipredictors for the class of all convex, Lipschitz loss functions under weak learnability assumptions about the class $\mathcal C$. We also give efficient omnipredictors when the loss families have low-degree polynomial approximations, or arise from generalized linear models (GLMs). This translation from sufficient statistics to faster omnipredictors is made possible by lifting the technique of loss outcome indistinguishability introduced by [GKH+23] for Boolean labels to the regression setting.
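As a rough illustration of the sufficient-statistics idea in the simplest case, consider losses that are exactly polynomials of degree at most $d$ in $\mathbf y$: the expected loss of any action then depends on the distribution only through the moments $\mathbb E[\mathbf y^k]$ for $k \le d$. The Python sketch below checks this numerically. It is not the algorithm of the paper; the function names, the choice of squared and quartic losses, the Beta-distributed labels, and the grid of actions are all assumptions made for illustration.

```python
import numpy as np

def empirical_moments(y, degree):
    """Return [E[y^0], E[y^1], ..., E[y^degree]] estimated from samples y."""
    return np.array([np.mean(y ** k) for k in range(degree + 1)])

def squared_loss_coeffs(t):
    """Coefficients of y^0..y^4 in (y - t)^2, one row per action in t."""
    zeros = np.zeros_like(t)
    return np.stack([t**2, -2 * t, np.ones_like(t), zeros, zeros], axis=1)

def quartic_loss_coeffs(t):
    """Coefficients of y^0..y^4 in (y - t)^4, one row per action in t."""
    return np.stack([t**4, -4 * t**3, 6 * t**2, -4 * t, np.ones_like(t)], axis=1)

def expected_loss_from_moments(coeffs, moments):
    """E[loss(y, t)] for every action t, computed from the moments of y alone."""
    return coeffs @ moments

# Illustrative data: labels in [0, 1] drawn from an assumed Beta distribution.
rng = np.random.default_rng(0)
y = rng.beta(2.0, 5.0, size=10_000)
actions = np.linspace(0.0, 1.0, 101)

# Five numbers summarize the sample for every loss of degree <= 4 in y.
m = empirical_moments(y, degree=4)

losses = [
    ("squared", squared_loss_coeffs, lambda y, t: (y - t) ** 2),
    ("quartic", quartic_loss_coeffs, lambda y, t: (y - t) ** 4),
]
for name, coeff_fn, loss_fn in losses:
    via_moments = expected_loss_from_moments(coeff_fn(actions), m)
    direct = np.array([np.mean(loss_fn(y, t)) for t in actions])
    best = actions[np.argmin(via_moments)]
    gap = np.max(np.abs(via_moments - direct))
    print(f"{name}: best action {best:.2f}, max discrepancy {gap:.2e}")
```

In this toy setting the same five moments serve both losses, which is the sense in which they act as sufficient statistics for the family. Losses that are only approximately polynomial would incur an additional approximation error that this sketch does not model.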
