Extracting Human Attention through Crowdsourced Patch Labeling

In image classification, dataset bias is a significant problem. When a dataset contains only specific types of images, the classifier begins to rely on shortcuts: simplistic and erroneous decision rules. The result is high performance on the training dataset but inferior results on new, varied images, because the classifier generalizes poorly. For example, if the images labeled "mustache" show only men, the model may inadvertently learn to classify images by gender rather than by the presence of a mustache. One way to mitigate such bias is to direct the model's attention toward the target object's location, usually annotated with bounding boxes or polygons. However, collecting such annotations requires substantial time and human effort. We therefore propose a novel patch-labeling method that combines AI assistance with crowdsourcing to capture human attention from images, offering a viable way to mitigate bias. Our method consists of two steps. First, we extract the approximate location of the target using a pre-trained saliency detection model, with human verification for accuracy. Then, we determine the human-attentive area by iteratively dividing the image into smaller patches and asking crowd workers whether each patch can be classified as the target object. We demonstrate the effectiveness of our method in mitigating bias through improved classification accuracy and a more refined model focus. In addition, crowdsourced experiments show that our method collects human annotations up to 3.4 times faster than annotating object locations with polygons, significantly reducing the human effort required. We conclude by discussing the advantages of our method in a crowdsourcing context, focusing mainly on human error and accessibility.
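
The second step, iterative patch division, can be illustrated with a minimal sketch. The abstract does not specify how patches are split or when splitting stops, so everything here is an assumption made for illustration: a 2x2 (quadtree-style) split, a minimum patch size as the stopping rule, and the names `split_into_quadrants`, `refine_patches`, and `contains_target`. The `contains_target` callback stands in for the crowd workers' yes/no judgment on a patch and is simulated with a simple overlap test; the starting box would correspond to the approximate location obtained in the first step.

```python
from typing import Callable, List, Tuple

# A patch is (x0, y0, x1, y1) with half-open pixel ranges.
Box = Tuple[int, int, int, int]


def split_into_quadrants(box: Box) -> List[Box]:
    """Split a patch into four sub-patches (assumed 2x2 split)."""
    x0, y0, x1, y1 = box
    xm, ym = (x0 + x1) // 2, (y0 + y1) // 2
    return [(x0, y0, xm, ym), (xm, y0, x1, ym),
            (x0, ym, xm, y1), (xm, ym, x1, y1)]


def refine_patches(start: Box,
                   contains_target: Callable[[Box], bool],
                   min_size: int) -> List[Box]:
    """Repeatedly subdivide patches that are judged to contain the target.

    Patches judged negative are discarded; positive patches are split
    further until their shorter side is at most `min_size` pixels.
    """
    kept: List[Box] = []
    queue: List[Box] = [start]
    while queue:
        box = queue.pop()
        if not contains_target(box):
            continue  # negative judgment: drop this patch
        x0, y0, x1, y1 = box
        if min(x1 - x0, y1 - y0) <= min_size:
            kept.append(box)  # small enough: keep as attended area
        else:
            queue.extend(split_into_quadrants(box))
    return sorted(kept)


def overlaps(a: Box, b: Box) -> bool:
    """True if two half-open boxes share at least one pixel."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


# Simulated judgment: a hypothetical target occupying a 16x16 square.
TARGET: Box = (16, 16, 32, 32)


def simulated_worker(box: Box) -> bool:
    return overlaps(box, TARGET)


attended = refine_patches((0, 0, 64, 64), simulated_worker, min_size=16)
print(attended)
```

In a real deployment the callback would aggregate several workers' answers (for example by majority vote) rather than use a geometric test; that aggregation is outside what the abstract describes.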
