In deep active learning, selecting multiple examples to label at each step is essential for efficiency, particularly on large datasets. Existing Bayesian solutions to this problem, such as BatchBALD, struggle to select large batches because computing the mutual information of joint random variables has exponential complexity. We therefore present the Large BatchBALD algorithm, a well-grounded approximation to BatchBALD that aims to achieve comparable quality at lower computational cost. We provide a complexity analysis of the algorithm, showing a reduction in computation time, especially for large batches. Furthermore, we present an extensive set of experimental results on image and text data, covering both toy datasets and larger ones such as CIFAR-100.
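To make the complexity issue concrete, the following minimal sketch computes the standard per-example BALD score and the exact joint mutual information that BatchBALD targets. It is an illustration of the baseline quantities only, not the Large BatchBALD approximation, which is not specified here. The function names and the `(K, N, C)` layout (K posterior samples, N pool examples, C classes) are assumptions made for this example.

```python
import itertools
import numpy as np


def entropy(p, axis=-1):
    """Shannon entropy in nats, treating 0 * log 0 as 0."""
    logp = np.log(np.where(p > 0, p, 1.0))
    return -(p * logp).sum(axis=axis)


def bald_scores(probs):
    """Per-example mutual information I(y; w).

    probs: array of shape (K, N, C) holding class probabilities
    from K posterior samples (assumed layout).
    """
    marginal = probs.mean(axis=0)  # (N, C)
    return entropy(marginal) - entropy(probs).mean(axis=0)  # (N,)


def joint_mutual_information(probs, batch):
    """Exact I(y_1, ..., y_b; w) for the pool indices in `batch`.

    Enumerates all C**b label configurations, which is the
    exponential cost that limits large batches.
    """
    C = probs.shape[2]
    joint = np.empty(C ** len(batch))
    configs = itertools.product(range(C), repeat=len(batch))
    for j, labels in enumerate(configs):
        # Given the parameters, labels of different points are independent,
        # so the per-sample joint probability is a product over the batch.
        per_sample = np.prod(
            [probs[:, i, y] for i, y in zip(batch, labels)], axis=0
        )  # (K,)
        joint[j] = per_sample.mean()
    conditional = entropy(probs[:, batch, :]).mean(axis=0).sum()
    return entropy(joint) - conditional
```

With C = 100 classes, as in CIFAR-100, a batch of only 5 examples already has 100^5 = 10^10 label configurations, so exact enumeration of this kind is practical only for very small batches.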