Deep learning (DL) algorithms rely on massive amounts of labeled data. Semi-supervised learning (SSL) and active learning (AL) aim to reduce this label complexity by leveraging unlabeled data or by carefully choosing which labels to acquire, respectively. In this work, our main goal is to design an AL algorithm, but we first argue that the way AL algorithms are evaluated should change. Although unlabeled data is readily available in pool-based AL, AL algorithms are usually evaluated by measuring the increase in supervised learning (SL) performance at consecutive acquisition steps. Because this measure conflates the gain from newly acquired instances with the gain from their newly acquired labels, we propose to instead evaluate the label efficiency of AL algorithms by measuring the increase in SSL performance at consecutive acquisition steps. After surveying tools that can be used to this end, we propose our neural pre-conditioning (NPC) algorithm, inspired by a Neural Tangent Kernel (NTK) analysis. Our algorithm incorporates the classifier's uncertainty on unlabeled data and penalizes redundant samples within candidate batches in order to efficiently acquire a diverse set of informative labels. Furthermore, we prove that NPC improves downstream training in the large-width regime in a manner previously observed to correlate with generalization. Comparisons with other AL algorithms show that a state-of-the-art SSL algorithm coupled with NPC can achieve high performance using very few labeled examples.
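To make the idea of combining uncertainty with a redundancy penalty concrete, the following is a minimal, generic sketch of greedy batch acquisition. It is not the NPC algorithm, whose NTK-based construction is not specified in this abstract. The function names `predictive_entropy` and `select_batch`, the `penalty` weight, the use of predictive entropy as the uncertainty score, and the use of cosine similarity between feature vectors as the redundancy measure are all assumptions made for illustration.

```python
import numpy as np


def predictive_entropy(probs):
    """Entropy of each row of class probabilities (higher = more uncertain)."""
    p = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    return -np.sum(p * np.log(p), axis=1)


def select_batch(probs, feats, batch_size, penalty=1.0):
    """Greedily pick unlabeled points that are uncertain and mutually dissimilar.

    Hypothetical illustration only, not the NPC acquisition rule.
    probs: (n, k) predicted class probabilities on the unlabeled pool.
    feats: (n, d) feature vectors used to measure redundancy.
    """
    scores = predictive_entropy(probs)
    # Normalize features so that dot products are cosine similarities.
    norms = np.maximum(np.linalg.norm(feats, axis=1, keepdims=True), 1e-12)
    unit = feats / norms

    n = len(scores)
    taken = np.zeros(n, dtype=bool)
    # Largest similarity of each point to anything already in the batch.
    # Starting from zero means negative similarity earns no bonus.
    max_sim = np.zeros(n)
    selected = []
    for _ in range(min(batch_size, n)):
        adjusted = scores - penalty * max_sim
        adjusted[taken] = -np.inf  # never pick the same point twice
        pick = int(np.argmax(adjusted))
        selected.append(pick)
        taken[pick] = True
        max_sim = np.maximum(max_sim, unit @ unit[pick])
    return selected
```

With `penalty=0.0` the sketch reduces to picking the most uncertain points in order of entropy, so near-duplicate points can fill the batch. A positive `penalty` trades some uncertainty for diversity within the batch.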