Active Learning with Simple Questions

We consider an active learning setting where a learner is presented with a pool S of n unlabeled examples belonging to a domain X and asks queries to find the underlying labeling that agrees with a target concept h^* \in H. In contrast to traditional active learning, which queries a single example for its label, we study more general region queries that allow the learner to pick a subset of the domain T \subset X and a target label y and ask a labeler whether h^*(x) = y for every example in the set T \cap S. These more powerful queries allow us to bypass the limitations of traditional active learning and learn with significantly fewer rounds of interaction, but they can lead to a significantly more complex query language. Our main contribution is quantifying the trade-off between the number of queries and the complexity of the query language used by the learner. We measure the complexity of the region queries via the VC dimension of the family of regions. We show that given any hypothesis class H with VC dimension d, one can design a region query family Q with VC dimension O(d) such that for every set of n examples S \subset X and every h^* \in H, a learner can submit O(d log n) queries from Q to a labeler and perfectly label S. We show a matching lower bound by designing a hypothesis class H with VC dimension d and a dataset S \subset X of size n such that any learning algorithm using any query class with VC dimension less than O(d) must make poly(n) queries to label S perfectly. Finally, we focus on well-studied hypothesis classes, including unions of intervals, high-dimensional boxes, and d-dimensional halfspaces, and obtain stronger results. In particular, we design learning algorithms that (i) are computationally efficient and (ii) work even when the queries are answered based not on the learner's pool of examples S but on some unknown superset L of S.
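To make the query model concrete, here is a minimal sketch for the simplest case: a single interval on the real line. It is an illustration under stated assumptions, not the algorithm from the paper. The names `make_labeler` and `label_pool` are hypothetical, the regions T are closed intervals [lo, hi], and the pool is assumed to contain distinct points. The learner uses one region query to check whether every point is negative, then two binary searches to locate the first and last positive points, for at most 1 + 2⌈log2 n⌉ queries in total.

```python
from typing import Callable, Dict, List, Tuple

# A region query asks: "is h*(x) == y for every pool point x in [lo, hi]?"
RegionQuery = Callable[[float, float, int], bool]


def make_labeler(pool: List[float], a: float, b: float) -> Tuple[RegionQuery, Dict[str, int]]:
    """Hypothetical labeler for the target h*(x) = 1 iff a <= x <= b.

    Returns the query function and a counter of queries asked so far.
    """
    counter = {"queries": 0}

    def region_query(lo: float, hi: float, y: int) -> bool:
        counter["queries"] += 1
        return all((1 if a <= x <= b else 0) == y for x in pool if lo <= x <= hi)

    return region_query, counter


def label_pool(pool: List[float], region_query: RegionQuery) -> Dict[float, int]:
    """Label every pool point, assuming the target is a single interval."""
    xs = sorted(pool)
    n = len(xs)
    # One query settles the case where no pool point is positive.
    if n == 0 or region_query(xs[0], xs[-1], 0):
        return {x: 0 for x in xs}

    # Smallest index i such that the prefix xs[0..i] is NOT all negative.
    lo, hi = 0, n - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if region_query(xs[0], xs[mid], 0):
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Largest index j such that xs[first..j] is all positive.
    lo, hi = first, n - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if region_query(xs[first], xs[mid], 1):
            lo = mid
        else:
            hi = mid - 1
    last = lo

    return {x: 1 if first <= i <= last else 0 for i, x in enumerate(xs)}


if __name__ == "__main__":
    pool = [float(i) for i in range(100)]
    query, counter = make_labeler(pool, a=41.5, b=58.0)
    labels = label_pool(pool, query)
    print(sum(labels.values()), "positive points found with", counter["queries"], "queries")
```

With single-example label queries, a learner may need to examine a constant fraction of the pool before it finds any positive point when the target interval is short. The all-negative region query on a prefix is what makes the binary search possible.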
