In pool-based active learning, the learner is given an unlabeled data set and aims to efficiently learn the unknown hypothesis by querying the labels of the data points. This can be formulated as the classical Optimal Decision Tree (ODT) problem: given a set of tests, a set of hypotheses, and an outcome for each test-hypothesis pair, the objective is to find a low-cost testing procedure (i.e., a decision tree) that identifies the true hypothesis. This optimization problem has been extensively studied under the assumption that each test produces a deterministic outcome. However, in many applications, for example clinical trials, the outcomes may be uncertain, which renders techniques from the deterministic setting inapplicable. In this work, we study a fundamental variant of the ODT problem in which some test outcomes are noisy, even in the more general case where the noise is persistent, i.e., repeating a test yields the same noisy output. Our approximation algorithms provide guarantees that are nearly best possible and hold even when the number of noisy outcomes per test or per hypothesis is large, with performance degrading continuously in this number. We numerically evaluated our algorithms on identifying toxic chemicals and learning linear classifiers, and observed that their costs are very close to the information-theoretic minimum.
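To make the deterministic ODT setting concrete, the following is a minimal sketch (not the paper's algorithm) of the classic greedy strategy, sometimes called generalized binary search: given hypotheses, tests, and an outcome table, it repeatedly queries the test that splits the remaining version space most evenly. All names here (`greedy_identify`, `outcome`, etc.) are illustrative assumptions, and the sketch ignores test costs and noise.

```python
# Illustrative sketch of the deterministic ODT setting: hypotheses H,
# tests T, and an outcome table outcome[t][h]. The greedy rule picks the
# test whose largest outcome class over the current version space is
# smallest, queries it, and prunes inconsistent hypotheses. This is a
# textbook baseline, not the algorithm proposed in this work.
from collections import defaultdict

def greedy_identify(hypotheses, tests, outcome, true_h):
    """Identify true_h by adaptive queries; returns (hypothesis, num_queries)."""
    version_space = set(hypotheses)
    remaining = list(tests)
    queries = 0
    while len(version_space) > 1 and remaining:
        def worst_class(t):
            # Size of the largest group of hypotheses sharing an outcome on t.
            groups = defaultdict(int)
            for h in version_space:
                groups[outcome[t][h]] += 1
            return max(groups.values())
        t = min(remaining, key=worst_class)  # most balanced split
        remaining.remove(t)
        o = outcome[t][true_h]  # deterministic: querying t reveals its outcome
        version_space = {h for h in version_space if outcome[t][h] == o}
        queries += 1
    return version_space.pop(), queries

# Four hypotheses distinguished by two binary tests (each test reveals one bit).
outcome = {
    "t0": {"h0": 0, "h1": 0, "h2": 1, "h3": 1},
    "t1": {"h0": 0, "h1": 1, "h2": 0, "h3": 1},
}
h, q = greedy_identify(["h0", "h1", "h2", "h3"], ["t0", "t1"], outcome, "h3")
# → ("h3", 2): both tests are needed to pin down one of four hypotheses.
```

Under persistent noise, the pruning step above is exactly what breaks: a single noisy outcome would eliminate the true hypothesis permanently, which is why the deterministic ideas do not carry over.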