For some hypothesis classes and input distributions, agnostic active learning needs exponentially fewer samples than passive learning; for other classes and distributions, it offers little to no improvement. The most popular algorithms for agnostic active learning express their performance in terms of a parameter called the disagreement coefficient, but it is known that these algorithms are inefficient on some inputs. We take a different approach to agnostic active learning, obtaining an algorithm that is competitive with the optimal algorithm for any binary hypothesis class $H$ and distribution $D_X$ over $X$. In particular, if any algorithm can use $m^*$ queries to get $O(\eta)$ error, then our algorithm uses $O(m^* \log |H|)$ queries to get $O(\eta)$ error. Our algorithm is in the vein of the splitting-based approach of Dasgupta [2004], which obtains a similar result for the realizable ($\eta = 0$) setting. We also show that it is NP-hard to do better than our algorithm's $O(\log |H|)$ overhead in general.
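To give a feel for the splitting idea the abstract refers to, here is a minimal Python sketch of a greedy splitting learner in the realizable ($\eta = 0$) setting only. It is not the agnostic algorithm of this paper, and it simplifies Dasgupta's approach: it assumes a finite hypothesis class, a finite unlabeled pool in place of $D_X$, and a noiseless label oracle. The names `greedy_split_learner`, `hypotheses`, `pool`, and `oracle` are hypothetical and chosen for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Hypothesis = Callable[[int], int]  # maps a point to a label in {0, 1}


def greedy_split_learner(
    hypotheses: Sequence[Hypothesis],
    pool: Sequence[int],
    oracle: Callable[[int], int],
) -> Tuple[List[int], int]:
    """Realizable-case sketch (assumption: the oracle agrees with some hypothesis).

    Repeatedly queries the pool point that splits the surviving hypotheses
    most evenly, then discards every hypothesis that disagrees with the label.
    Returns the indices of the surviving hypotheses and the number of queries.
    """
    version_space = list(range(len(hypotheses)))
    queries = 0
    while True:
        best_x, best_score = None, 0
        for x in pool:
            positives = sum(1 for i in version_space if hypotheses[i](x) == 1)
            # Worst-case number of hypotheses eliminated by querying x.
            score = min(positives, len(version_space) - positives)
            if score > best_score:
                best_x, best_score = x, score
        if best_x is None:
            # No pool point distinguishes the survivors: nothing left to learn.
            return version_space, queries
        label = oracle(best_x)
        queries += 1
        version_space = [i for i in version_space if hypotheses[i](best_x) == label]


# Illustrative usage: threshold classifiers h_t(x) = 1 iff x >= t on a pool of 16 points.
thresholds = list(range(17))
hypotheses = [lambda x, t=t: int(x >= t) for t in thresholds]
pool = list(range(16))
target = 11  # hypothetical ground-truth threshold
survivors, n_queries = greedy_split_learner(hypotheses, pool, lambda x: int(x >= target))
print([thresholds[i] for i in survivors], n_queries)
```

On thresholds the greedy rule behaves like binary search, so the number of label queries grows logarithmically in the number of hypotheses, whereas labeling the whole pool would cost one query per point. In the agnostic setting a single noisy label cannot safely eliminate hypotheses, which is why this sketch does not carry over directly.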