Biases in machine learning models are a critical issue at deployment, but diagnosing them in an explainable manner is challenging. To address this, we introduce the bias-to-text (B2T) framework, which uses language interpretation to identify and mitigate biases in vision models such as image classifiers and text-to-image generative models. Describing visual biases in language gives them an explainable form, which enables both the discovery of novel biases and effective model debiasing. To this end, we analyze common keywords in the captions of mispredicted or generated images. Because the captions themselves may be biased, we propose novel score functions that compare the similarity between each bias keyword and those images. Additionally, we present strategies to debias zero-shot classifiers and text-to-image diffusion models using the bias keywords from the B2T framework. We demonstrate the effectiveness of our framework on various image classification and generation tasks. For classifiers, we discover a new spurious correlation between the keywords "(sports) player" and "female" in Kaggle Face, and debiasing improves the worst-group accuracy on Waterbirds by 11% over the baseline. For generative models, we detect and effectively prevent unfair (e.g., gender-biased) and unsafe (e.g., "naked") image generation.
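As a rough illustration of the keyword analysis described above, the following minimal sketch counts common words in the captions of mispredicted images and scores each candidate keyword by how much more similar it is to mispredicted images than to correctly predicted ones. Everything named here is an assumption made for illustration: the helper names `common_keywords` and `keyword_score`, the stopword list, the difference-of-means form of the score, and the toy `tag_similarity` function, which stands in for an image-text similarity model. It is not the score function defined in the paper.

```python
from collections import Counter

# Hypothetical stopword list; a real pipeline would use a proper keyword extractor.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "and", "with", "is", "at"}


def common_keywords(captions, top_k=5):
    """Return the top_k words that appear in the most captions."""
    counts = Counter()
    for caption in captions:
        # Count each word once per caption so long captions do not dominate.
        words = {w.strip(".,").lower() for w in caption.split()}
        counts.update(w for w in words if w and w not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_k)]


def keyword_score(keyword, wrong_images, correct_images, similarity):
    """Assumed score: mean similarity to mispredicted images minus
    mean similarity to correctly predicted images."""
    def mean_similarity(images):
        return sum(similarity(image, keyword) for image in images) / len(images)

    return mean_similarity(wrong_images) - mean_similarity(correct_images)


def tag_similarity(image_tags, keyword):
    """Toy stand-in for an image-text similarity model: images are tag sets."""
    return 1.0 if keyword in image_tags else 0.0


# Toy data: captions and tag sets for mispredicted and correct images.
wrong_captions = ["a bird in a bamboo forest", "a bird standing in the forest"]
wrong_images = [{"bird", "forest"}, {"bird", "forest"}]
correct_images = [{"bird", "water"}, {"bird", "forest"}]

for kw in common_keywords(wrong_captions, top_k=2):
    print(kw, keyword_score(kw, wrong_images, correct_images, tag_similarity))
```

In this toy setting "bird" is frequent in the captions but is equally similar to both groups of images, so its score is zero, while "forest" scores higher because it is associated mainly with the mispredicted images. This is the intuition behind scoring keywords against the images rather than trusting caption frequency alone.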