Towards Reliable Assessments of Demographic Disparities in Multi-Label Image Classifiers

Disaggregated performance metrics across demographic groups are a hallmark of fairness assessments in computer vision. These metrics have successfully incentivized performance improvements on person-centric tasks such as face analysis and are used to understand the risks of modern models. However, the vulnerabilities of these measurements for more complex computer vision tasks have received little discussion. In this paper, we consider multi-label image classification and, specifically, object categorization tasks. First, we highlight design choices and trade-offs for measurement that involve more nuance than discussed in prior computer vision literature. These challenges relate to the necessary scale of data, the definition of groups for images, the choice of metric, and dataset imbalances. Next, through two case studies using modern vision models, we demonstrate that naive implementations of these assessments are brittle. We identify several design choices that look like mere implementation details but significantly impact the conclusions of assessments, in terms of both the magnitude and the direction (which group the classifiers work best on) of disparities. Based on ablation studies, we propose recommendations to increase the reliability of these assessments. Finally, through a qualitative analysis, we find that concepts with large disparities tend to have varying definitions and representations between groups, with inconsistencies across datasets and annotators. While this result suggests avenues for mitigation through more consistent data collection, it also highlights that ambiguous label definitions remain a challenge when performing model assessments. As vision models expand and become more ubiquitous, it is increasingly important that our disparity assessments accurately reflect the true performance of models.
