Labeling data is one of the most costly processes in machine learning pipelines. Active learning is a standard approach to alleviating this problem. Pool-based active learning first builds a pool of unlabelled data and then iteratively selects data to be labeled, so that the total number of required labels is minimized while model performance stays high. Many effective criteria for choosing data from the pool have been proposed in the literature. However, how to build the pool is less explored. Specifically, most methods assume that a task-specific pool is given for free. In this paper, we argue that such a task-specific pool is not always available and propose using the vast amount of unlabelled data on the Web as the pool to which active learning is applied. Because the pool is extremely large, it is likely to contain relevant data for many tasks, so we do not need to explicitly design and build a pool for each task. The challenge is that the size of the pool makes it impossible to compute the acquisition scores of all data exhaustively. We propose an efficient method, Seafaring, that retrieves data that are informative for active learning from the Web using a user-side information retrieval algorithm. In the experiments, we use the online Flickr environment as the pool for active learning. This pool contains more than ten billion images and is several orders of magnitude larger than the pools used in the existing active learning literature. We confirm that our method performs better than existing approaches that use a small unlabelled pool.
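To make the selection step of pool-based active learning concrete, the following is a minimal sketch of exhaustive acquisition scoring over a small pool. It is not the Seafaring method. It assumes predictive entropy as the acquisition score, which is one common choice and not necessarily the one used in the paper, and the names `entropy`, `select_for_labeling`, and `pool_probs` are hypothetical, as are the probability values.

```python
import math
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Predictive entropy of one class distribution; higher means less certain."""
    # Terms with zero probability contribute nothing, so they are skipped.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_for_labeling(pool_probs: Sequence[Sequence[float]], k: int) -> List[int]:
    """Score every item in the pool and return the indices of the k most uncertain."""
    scores = [entropy(p) for p in pool_probs]  # one score per pool item
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Hypothetical model outputs (class probabilities) for four unlabelled items.
pool_probs = [
    [0.99, 0.01],
    [0.50, 0.50],
    [0.80, 0.20],
    [0.60, 0.40],
]

chosen = select_for_labeling(pool_probs, k=2)
```

Here `chosen` holds the indices of the two items whose predictions are closest to uniform. The loop scores every item in the pool, so its cost grows linearly with the pool size; this is the exhaustive computation that becomes impractical for a pool of more than ten billion images.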