We study online learning in the presence of Massart noise. Instead of assuming that the online adversary chooses an arbitrary sequence of labels, we assume that the context $\mathbf{x}$ is selected adversarially, but the label $y$ presented to the learner disagrees with the ground-truth label of $\mathbf{x}$ with an unknown probability of at most $\eta$. We study the fundamental class of $\gamma$-margin linear classifiers and present a computationally efficient algorithm that achieves a mistake bound of $\eta T + o(T)$. Our mistake bound is qualitatively tight for efficient algorithms: it is known that, even in the offline setting, achieving classification error better than $\eta$ requires super-polynomial time in the Statistical Query (SQ) model. We extend our online learning model to a $k$-arm contextual bandit setting where the rewards, instead of satisfying commonly used realizability assumptions, are consistent (in expectation) with some linear ranking function with weight vector $\mathbf{w}^\ast$. Given a list of contexts $\mathbf{x}_1,\ldots,\mathbf{x}_k$, if $\mathbf{w}^\ast \cdot \mathbf{x}_i > \mathbf{w}^\ast \cdot \mathbf{x}_j$, the expected reward of action $i$ must be larger than that of action $j$ by at least $\Delta$. We use our Massart online learner to design an efficient bandit algorithm that obtains expected reward at least $(1-1/k)\,\Delta T - o(T)$ greater than that of choosing a random action at every round.
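To make the noise model concrete, the following is a minimal Python sketch of how labels could be generated under Massart noise for a margin linear classifier. It illustrates only the label-generation model, not the learning algorithm of this work. The names `massart_label` and `simulate`, the target vector `w_star`, the four fixed contexts, and the uniformly drawn per-round flip rate are all assumptions made for illustration; in the model itself the contexts are chosen adversarially and the flip rate is any unknown value at most $\eta$.

```python
import random


def sign(v):
    """Sign convention used here: +1 for v >= 0, otherwise -1."""
    return 1 if v >= 0 else -1


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def massart_label(w_star, x, eta_x, rng):
    """Return the ground-truth label of x, flipped with probability eta_x."""
    true_label = sign(dot(w_star, x))
    flipped = rng.random() < eta_x
    return -true_label if flipped else true_label


def simulate(T, eta, seed=0):
    """Count rounds in which the observed label disagrees with the truth."""
    rng = random.Random(seed)
    w_star = (1.0, 0.0)  # hypothetical unit-norm target vector
    # Hypothetical unit-norm contexts with |w_star . x| >= 0.6 (margin 0.6).
    contexts = [(1.0, 0.0), (-1.0, 0.0), (0.6, 0.8), (-0.6, -0.8)]
    disagreements = 0
    for _ in range(T):
        # Assumption: contexts are sampled at random for simplicity;
        # the model allows them to be chosen adversarially.
        x = rng.choice(contexts)
        # Assumption: the per-round flip rate is drawn uniformly from
        # [0, eta]; the model only requires that it is at most eta.
        eta_x = rng.uniform(0.0, eta)
        y = massart_label(w_star, x, eta_x, rng)
        if y != sign(dot(w_star, x)):
            disagreements += 1
    return disagreements
```

Under these assumptions, even a learner that always predicts the ground-truth label disagrees with the observed labels on at most roughly $\eta T$ rounds in expectation, which is the leading term of the mistake bound stated above.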