Consensus and Subjectivity of Skin Tone Annotation for ML Fairness

Understanding different human attributes and how they affect model behavior may become a standard need for all model creation and usage, from traditional computer vision tasks to the newest multimodal generative AI systems. In computer vision specifically, we have relied on datasets augmented with perceived attribute signals (e.g., gender presentation, skin tone, and age) and on benchmarks enabled by these datasets. Typically, labels for these tasks come from human annotators. However, annotating attribute signals, especially skin tone, is a difficult and subjective task. Perceived skin tone is affected both by technical factors, such as lighting conditions, and by social factors that shape an annotator's lived experience. This paper examines the subjectivity of skin tone annotation through a series of annotation experiments using the Monk Skin Tone (MST) scale, a small pool of professional photographers, and a much larger pool of trained crowdsourced annotators. Along with this study, we release the Monk Skin Tone Examples (MST-E) dataset, containing 1515 images and 31 videos spread across the full MST scale. MST-E is designed to help train human annotators to annotate MST effectively. Our study shows that annotators can reliably annotate skin tone in a way that aligns with an expert in the MST scale, even under challenging environmental conditions. We also find evidence that annotators from different geographic regions rely on different mental models of MST categories, resulting in annotations that vary systematically across regions. Given this, we advise practitioners to use a diverse set of annotators and a higher replication count for each image when annotating skin tone for fairness research.

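The recommendation to replicate annotations across a diverse annotator pool can be illustrated with a minimal sketch. The abstract does not say how replicated ratings are combined, so the median used here is an assumption, as are the record layout, the region labels, the integer 1 to 10 rating encoding, and the function names `aggregate_by_image` and `median_by_image_and_region`. The toy ratings are invented for illustration and are not drawn from MST-E.

```python
from collections import defaultdict
from statistics import median

# Hypothetical records: (image_id, annotator_region, mst_rating).
# Ratings are assumed to be integers on the 10-point MST scale.
annotations = [
    ("img_001", "region_a", 4),
    ("img_001", "region_a", 5),
    ("img_001", "region_b", 6),
    ("img_001", "region_b", 6),
    ("img_002", "region_a", 8),
    ("img_002", "region_b", 9),
    ("img_002", "region_b", 9),
]


def aggregate_by_image(records):
    """Combine all replicated ratings for each image into one median value."""
    by_image = defaultdict(list)
    for image_id, _region, rating in records:
        by_image[image_id].append(rating)
    return {image_id: median(ratings) for image_id, ratings in by_image.items()}


def median_by_image_and_region(records):
    """Median rating per (image, region), to expose systematic regional gaps."""
    grouped = defaultdict(list)
    for image_id, region, rating in records:
        grouped[(image_id, region)].append(rating)
    return {key: median(ratings) for key, ratings in grouped.items()}


if __name__ == "__main__":
    print(aggregate_by_image(annotations))
    print(median_by_image_and_region(annotations))
```

Comparing the per-region medians for the same image is one simple way to see whether annotator groups disagree systematically; with only one region represented, such a gap would be invisible in the aggregated label.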