In many machine learning applications, labeling datasets is an arduous and time-consuming task. Although research has shown that semi-supervised learning techniques can achieve high accuracy with very few labels in computer vision, little attention has been given to how the images in a dataset should be selected for labeling. In this paper, we propose a novel approach, based on well-established self-supervised learning, clustering, and manifold learning techniques, that addresses the challenge of selecting an informative image subset to label in the first instance, known as the cold-start or unsupervised selective labeling problem. We test our approach on several publicly available datasets, namely CIFAR10, Imagenette, DeepWeeds, and EuroSAT, and observe improved performance with both supervised and semi-supervised learning strategies when our label selection strategy is used instead of random sampling. For the datasets considered, we also obtain superior performance with a much simpler approach than other methods in the literature.
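To make the cold-start selection idea concrete, the following is a minimal sketch and not the method of the paper. The abstract does not name the specific algorithms, so everything here is an assumption made for illustration: the embeddings are taken as already produced by some self-supervised encoder and reduced to a low dimension by a manifold learning step, both of which are omitted; plain k-means with a deterministic farthest-first initialisation stands in for the clustering; and the point closest to each cluster centroid is chosen for labeling. The names `farthest_first_init`, `kmeans`, and `select_for_labeling` are hypothetical.

```python
import math


def farthest_first_init(points, k):
    """Deterministic seeding: start at points[0], then repeatedly add the
    point farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means on a list of equal-length float tuples."""
    centroids = farthest_first_init(points, k)
    assign = None
    for _ in range(iters):
        # Assignment step: index of the nearest centroid for each point.
        new_assign = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assign == assign:
            break  # assignments stopped changing, so centroids are stable
        assign = new_assign
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assign


def select_for_labeling(embeddings, budget):
    """Return sorted indices of one representative per cluster, using
    `budget` clusters (assumed selection rule: nearest to the centroid)."""
    centroids, assign = kmeans(embeddings, budget)
    chosen = []
    for c, centre in enumerate(centroids):
        members = [i for i, a in enumerate(assign) if a == c]
        if members:
            chosen.append(min(members, key=lambda i: math.dist(embeddings[i], centre)))
    return sorted(chosen)


# Toy stand-in for reduced embeddings: three well-separated groups of five
# points, where the last point of each group sits exactly at the group centre.
centres = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
offsets = [(-0.5, -0.5), (-0.5, 0.5), (0.5, -0.5), (0.5, 0.5), (0.0, 0.0)]
embeddings = [(cx + dx, cy + dy) for cx, cy in centres for dx, dy in offsets]

print(select_for_labeling(embeddings, budget=3))  # → [4, 9, 14]
```

With a labeling budget of three, the sketch picks the central point of each group, so every group is represented once; random sampling of three points would give no such guarantee. In practice one would replace the toy data with real encoder outputs and would likely use a library clustering routine.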