For some hypothesis classes and input distributions, active agnostic learning needs exponentially fewer samples than passive learning; for other classes and distributions, it offers little to no improvement. The most popular algorithms for agnostic active learning express their performance in terms of a parameter called the disagreement coefficient, but it is known that these algorithms are inefficient on some inputs. We take a different approach to agnostic active learning, getting an algorithm that is competitive with the optimal algorithm for any binary hypothesis class $H$ and distribution $D_X$ over $X$. In particular, if any algorithm can use $m^*$ queries to get $O(\eta)$ error, then our algorithm uses $O(m^* \log |H|)$ queries to get $O(\eta)$ error. Our algorithm lies in the vein of the splitting-based approach of Dasgupta [2004], which gets a similar result for the realizable ($\eta = 0$) setting. We also show that it is NP-hard to do better than our algorithm's $O(\log |H|)$ overhead in general.
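For intuition about the splitting-based approach mentioned above, the following is a minimal sketch of greedy splitting in the realizable ($\eta = 0$) setting only. It is not the agnostic algorithm of this work. It assumes a finite hypothesis class, a finite pool of unlabeled points standing in for $D_X$, and a noiseless label oracle. The names `greedy_split_learn` and `demo_thresholds` and the threshold-classifier example are hypothetical and chosen purely for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Hypothesis = Callable[[int], int]  # maps a point to a label in {0, 1}


def greedy_split_learn(
    hypotheses: Sequence[Hypothesis],
    pool: Sequence[int],
    oracle: Callable[[int], int],
) -> Tuple[List[int], int]:
    """Greedy splitting in the realizable setting (illustrative assumption).

    Repeatedly queries the pool point whose label splits the surviving
    hypotheses most evenly, then discards the hypotheses that disagree
    with the returned label. Returns the indices of the surviving
    hypotheses and the number of label queries made.
    """
    version_space = list(range(len(hypotheses)))
    queries = 0
    while True:
        best_x = None
        best_worst = len(version_space)  # a useful query must beat this
        for x in pool:
            pos = sum(1 for i in version_space if hypotheses[i](x) == 1)
            # Size of the larger side: the worst-case number of survivors.
            worst = max(pos, len(version_space) - pos)
            if worst < best_worst:
                best_x, best_worst = x, worst
        if best_x is None:
            # No pool point distinguishes the remaining hypotheses.
            break
        y = oracle(best_x)
        queries += 1
        version_space = [i for i in version_space if hypotheses[i](best_x) == y]
    return version_space, queries


def demo_thresholds(true_t: int = 11, n: int = 16) -> Tuple[List[int], int]:
    """Hypothetical example: threshold classifiers h_t(x) = 1 iff x >= t."""
    hypotheses = [(lambda x, t=t: int(x >= t)) for t in range(n + 1)]
    pool = list(range(n))
    return greedy_split_learn(hypotheses, pool, lambda x: int(x >= true_t))


if __name__ == "__main__":
    survivors, num_queries = demo_thresholds()
    print("surviving hypothesis indices:", survivors)
    print("label queries used:", num_queries)
```

On threshold classifiers this reduces to binary search, which is the standard example of active learning needing exponentially fewer labels than passive learning. In the agnostic setting a single noisy label can no longer be used to eliminate hypotheses outright, which is why this sketch does not carry over directly.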