Machine learning models have shown increased accuracy in classification tasks when the training process incorporates human perceptual information. However, a challenge in training human-guided models is the cost of collecting human salience annotations for images. Collecting annotation data for all images in a large training set can be prohibitively expensive. In this work, we use "teacher" models (trained on a small amount of human-annotated data) to annotate additional data using the teacher models' saliency maps. "Student" models are then trained on the resulting larger set of annotated training data. This approach makes it possible to supplement a limited number of human-supplied annotations with an arbitrarily large number of model-generated image annotations. We compare the accuracy achieved by our teacher-student training paradigm with (1) training using all available human salience annotations, and (2) training using all available training data without human salience annotations. We use synthetic face detection and fake iris detection as two challenging example problems, and report results across four model architectures (DenseNet, ResNet, Xception, and Inception) and two saliency estimation methods (CAM and RISE). Our teacher-student training paradigm yields models that significantly exceed the performance of both baselines, demonstrating that the approach can usefully leverage a small number of human annotations to generate salience maps for an arbitrary amount of additional training data.
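To make the annotation step concrete, the following is a minimal sketch of how a teacher's saliency map could stand in for a human annotation. It is not the authors' implementation. It assumes the standard CAM formulation (a class-weighted sum of the last convolutional feature maps) and uses plain Python lists to stay dependency-free. The function names, the min-max rescaling, and the rule that human annotations take priority over teacher-generated ones are all illustrative assumptions rather than details given in the abstract.

```python
def class_activation_map(feature_maps, class_weights):
    """Standard CAM: sum over channels of weight_k * feature_map_k.

    feature_maps: list of K two-dimensional maps (H x W nested lists).
    class_weights: K classifier weights for the class of interest.
    """
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(feature_maps, class_weights):
        for y in range(height):
            for x in range(width):
                cam[y][x] += weight * fmap[y][x]
    return cam


def to_salience_annotation(cam):
    """Min-max rescale a map to [0, 1] (an assumed normalization)."""
    values = [v for row in cam for v in row]
    lo, hi = min(values), max(values)
    if hi == lo:
        # A flat map carries no spatial information.
        return [[0.0 for _ in row] for row in cam]
    return [[(v - lo) / (hi - lo) for v in row] for row in cam]


def build_student_annotations(human_maps, teacher_maps):
    """Merge annotations keyed by image id.

    Assumption: where both exist, the human annotation is kept.
    """
    merged = dict(teacher_maps)
    merged.update(human_maps)
    return merged


# Toy example: two 2x2 feature maps from a hypothetical teacher model.
features = [[[1.0, 0.0], [0.0, 0.0]],
            [[0.0, 2.0], [0.0, 0.0]]]
weights = [0.5, 1.0]
teacher_map = to_salience_annotation(class_activation_map(features, weights))

human = {"img_001": [[1.0, 0.0], [0.0, 0.0]]}   # hypothetical human mask
generated = {"img_002": teacher_map}            # teacher-generated mask
student_annotations = build_student_annotations(human, generated)
```

In this sketch the student would be trained on `student_annotations`, which covers both the human-annotated image and the image annotated only by the teacher. How the salience maps enter the student's training loss is not specified in the abstract and is left out here.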