PAC learning, dating back to Valiant '84 and Vapnik and Chervonenkis '64, '74, is a classic model for studying supervised learning. In the agnostic setting, we have access to a hypothesis set $\mathcal{H}$ and a training set of labeled samples $(x_1,y_1),\dots,(x_n,y_n) \in \mathcal{X} \times \{-1,1\}$ drawn i.i.d. from an unknown distribution $\mathcal{D}$. The goal is to produce a classifier $h : \mathcal{X} \to \{-1,1\}$ that is competitive with the hypothesis $h^\star_{\mathcal{D}} \in \mathcal{H}$ having the least probability of mispredicting the label $y$ of a new sample $(x,y)\sim \mathcal{D}$. Empirical Risk Minimization (ERM) is a natural learning algorithm, where one simply outputs the hypothesis from $\mathcal{H}$ making the fewest mistakes on the training data. This simple algorithm is known to achieve an error that is optimal in terms of the VC-dimension of $\mathcal{H}$ and the number of samples $n$. In this work, we revisit agnostic PAC learning and first show that ERM is in fact sub-optimal if we treat the performance of the best hypothesis, denoted $\tau:=\Pr_{\mathcal{D}}[h^\star_{\mathcal{D}}(x) \neq y]$, as a parameter. Concretely, we show that ERM, and any other proper learning algorithm, is sub-optimal by a $\sqrt{\ln(1/\tau)}$ factor. We then complement this lower bound with the first learning algorithm achieving an optimal error for nearly the full range of $\tau$. Our algorithm introduces several new ideas that we hope may find further applications in learning theory.
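The ERM rule described above can be sketched in a few lines; this is a minimal illustration over a finite hypothesis class, assuming hypotheses are callables mapping $\mathcal{X}$ to $\{-1,1\}$ (the helper names `empirical_error` and `erm` are ours, not from the paper).

```python
# Minimal sketch of Empirical Risk Minimization (ERM) over a finite
# hypothesis class. Hypotheses are assumed to be callables X -> {-1, +1};
# all names here are illustrative, not from the paper.

def empirical_error(h, samples):
    """Fraction of labeled samples (x, y) that h mispredicts."""
    return sum(1 for x, y in samples if h(x) != y) / len(samples)

def erm(hypotheses, samples):
    """Return the hypothesis making the fewest mistakes on the training set."""
    return min(hypotheses, key=lambda h: empirical_error(h, samples))

# Toy example: threshold classifiers on the real line.
hypotheses = [lambda x, t=t: 1 if x >= t else -1 for t in [0.0, 0.5, 1.0]]
samples = [(0.2, -1), (0.4, -1), (0.6, 1), (0.9, 1)]
best = erm(hypotheses, samples)
print(empirical_error(best, samples))  # the threshold 0.5 makes no mistakes
```

Note that ERM is a *proper* learner: its output always belongs to $\mathcal{H}$, which is exactly the restriction the paper's lower bound exploits.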