Regularization and Optimal Multiclass Learning

The quintessential learning algorithm of empirical risk minimization (ERM) is known to fail in various settings for which uniform convergence does not characterize learning. It is therefore unsurprising that the practice of machine learning is rife with considerably richer algorithmic techniques for successfully controlling model capacity. Nevertheless, no such technique or principle has broken away from the pack to characterize optimal learning in these more general settings. The purpose of this work is to characterize the role of regularization in perhaps the simplest setting for which ERM fails: multiclass learning with arbitrary label sets. Using one-inclusion graphs (OIGs), we exhibit optimal learning algorithms that dovetail with tried-and-true algorithmic principles: Occam's Razor as embodied by structural risk minimization (SRM), the principle of maximum entropy, and Bayesian reasoning. Most notably, we introduce an optimal learner which relaxes structural risk minimization on two dimensions: it allows the regularization function to be "local" to datapoints, and uses an unsupervised learning stage to learn this regularizer at the outset. We justify these relaxations by showing that they are necessary: removing either dimension fails to yield a near-optimal learner. We also extract from OIGs a combinatorial sequence we term the Hall complexity, which is the first to characterize a problem's transductive error rate exactly. Lastly, we introduce a generalization of OIGs and the transductive learning setting to the agnostic case, where we show that optimal orientations of Hamming graphs -- judged using nodes' outdegrees minus a system of node-dependent credits -- characterize optimal learners exactly. We demonstrate that an agnostic version of the Hall complexity again characterizes error rates exactly, and exhibit an optimal learner using maximum entropy programs.
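As a small illustration of the objects mentioned above, the following sketch builds a one-inclusion graph for a finite set of label vectors and finds, by brute force, an orientation minimizing the maximum outdegree. It assumes the standard realizable-case construction, in which label vectors that differ in only one coordinate share a hyperedge and a node's outdegree counts the incident hyperedges not oriented toward it. The function names `one_inclusion_graph` and `min_max_outdegree` are hypothetical, and the exhaustive search is only an illustration for tiny instances, not the learners constructed in this work.

```python
from collections import defaultdict
from itertools import product


def one_inclusion_graph(behaviors):
    """Return the hyperedges of the one-inclusion graph.

    `behaviors` is a list of equal-length label tuples (the class restricted
    to n unlabeled points). Two behaviors share a hyperedge in direction i
    when they agree on every coordinate except possibly i.
    """
    edges = defaultdict(list)
    for h in behaviors:
        for i in range(len(h)):
            # Key: the direction i plus the labels on all other coordinates.
            edges[(i, h[:i] + h[i + 1:])].append(h)
    # Singleton hyperedges leave no ambiguity about the held-out label,
    # so only hyperedges with at least two nodes are kept.
    return [nodes for nodes in edges.values() if len(nodes) >= 2]


def min_max_outdegree(behaviors):
    """Brute-force the smallest achievable maximum outdegree.

    An orientation picks one node per hyperedge; every other node on that
    hyperedge gains one unit of outdegree. Exponential time: toy inputs only.
    """
    edges = one_inclusion_graph(behaviors)
    best = None
    for choice in product(*edges):  # choice[k] is the node edge k points to
        outdeg = {h: 0 for h in behaviors}
        for nodes, target in zip(edges, choice):
            for h in nodes:
                if h != target:
                    outdeg[h] += 1
        worst = max(outdeg.values())
        if best is None or worst < best:
            best = worst
    return best


if __name__ == "__main__":
    # Three behaviors on n = 2 points over binary labels.
    toy = [(0, 0), (0, 1), (1, 0)]
    n = len(toy[0])
    print(min_max_outdegree(toy) / n)  # worst-case transductive error of the best orientation
```

Under these assumptions, dividing the returned outdegree by the number of points n gives the worst-case transductive error of the corresponding orientation on that set of points.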
