For some hypothesis classes and input distributions, active agnostic learning needs exponentially fewer samples than passive learning; for other classes and distributions, it offers little to no improvement. The most popular algorithms for agnostic active learning express their performance in terms of a parameter called the disagreement coefficient, but it is known that these algorithms are inefficient on some inputs. We take a different approach to agnostic active learning, obtaining an algorithm that is competitive with the optimal algorithm for any binary hypothesis class $H$ and distribution $D_X$ over $X$. In particular, if any algorithm can use $m^*$ queries to get $O(\eta)$ error, then our algorithm uses $O(m^* \log |H|)$ queries to get $O(\eta)$ error. Our algorithm follows the splitting-based approach of Dasgupta [2004], which gets a similar result for the realizable ($\eta = 0$) setting. We also show that it is NP-hard to do better than our algorithm's $O(\log |H|)$ overhead in general.
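To make the splitting idea concrete, the following is a minimal Python sketch of greedy splitting in the realizable ($\eta = 0$) setting only. It is not the agnostic algorithm described above, and it is not taken from Dasgupta [2004]; it only illustrates the general principle of querying the point that most evenly splits the hypotheses still consistent with the labels seen so far. The finite class of threshold functions, the unlabeled pool, and every name in the code (`make_threshold`, `greedy_split_learn`, `oracle`) are assumptions chosen for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Hypothesis = Callable[[float], int]


def make_threshold(t: float) -> Hypothesis:
    """Hypothetical threshold classifier: predicts 1 iff x >= t."""
    return lambda x: 1 if x >= t else 0


def greedy_split_learn(
    hypotheses: Sequence[Hypothesis],
    pool: Sequence[float],
    oracle: Callable[[float], int],
) -> Tuple[List[int], int]:
    """Greedy splitting sketch for the realizable case (no label noise).

    Repeatedly queries the pool point whose predicted labels split the
    current version space most evenly, then discards every hypothesis
    that disagrees with the returned label. Returns the indices of the
    surviving hypotheses and the number of label queries used.
    """
    version_space = list(range(len(hypotheses)))
    queries = 0
    while True:
        best_x, best_score = None, 0
        for x in pool:
            ones = sum(hypotheses[i](x) for i in version_space)
            # Number of hypotheses eliminated in the worse of the two outcomes.
            score = min(ones, len(version_space) - ones)
            if score > best_score:
                best_x, best_score = x, score
        if best_x is None:
            # No pool point distinguishes the remaining hypotheses.
            break
        label = oracle(best_x)
        queries += 1
        version_space = [i for i in version_space if hypotheses[i](best_x) == label]
    return version_space, queries


# Illustrative setup: 17 thresholds over a pool of 16 points.
pool = [float(x) for x in range(16)]
hypotheses = [make_threshold(float(t)) for t in range(17)]
target_index = 11  # assumed ground-truth hypothesis (realizable setting)

remaining, num_queries = greedy_split_learn(hypotheses, pool, hypotheses[target_index])
print(remaining, num_queries)
```

For thresholds this reduces to binary search, so the number of queries grows logarithmically in $|H|$, whereas a passive learner may need labels for most of the pool. With label noise, the true best hypothesis can be discarded by a single bad label, which is why the agnostic setting needs a different treatment from this hard-elimination sketch.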