Vision-Language Models (VLMs) are known to inherit and amplify societal biases from their web-scale training data, with Indian populations being particularly misrepresented. Existing fairness-aware datasets have significantly improved demographic balance across global race and gender groups, yet they continue to treat Indians as a single monolithic category. This oversimplification ignores the vast intra-national diversity across the 28 states and 8 Union Territories of India and leads to representational and geographical bias. To address this limitation, we present IndicFairFace, a novel and balanced face dataset comprising 14,400 images representing the geographical diversity of India. Images were sourced ethically from Wikimedia Commons and open-license web repositories and are uniformly balanced across states and gender. Using IndicFairFace, we quantify intra-national geographical bias in prominent CLIP-based VLMs and reduce it with a post-hoc Iterative Nullspace Projection (INLP) debiasing approach. We also show that the adopted debiasing approach does not adversely impact the existing embedding space: the average drop in retrieval accuracy on benchmark datasets is less than 1.5 percent. Our work establishes IndicFairFace as the first benchmark for studying geographical bias in VLMs in the Indian context.
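The debiasing step named above, Iterative Nullspace Projection, admits a compact sketch: repeatedly fit a linear probe that predicts the protected attribute (here, geography or gender) from the embeddings, then project the embeddings onto the probe's nullspace so that direction is removed. The sketch below is a generic NumPy illustration under our own assumptions (a least-squares probe, hypothetical function names), not the paper's implementation:

```python
import numpy as np


def nullspace_projection(W):
    """Orthogonal projection onto the nullspace of the rows of W (k x d)."""
    U, s, _ = np.linalg.svd(W.T, full_matrices=False)  # basis of row space
    rank = int(np.sum(s > 1e-10))
    B = U[:, :rank]                                    # d x rank
    return np.eye(W.shape[1]) - B @ B.T


def inlp_debias(X, y, n_iters=3):
    """INLP sketch: X is (n, d) embeddings, y a scalar protected attribute.

    Each round fits a least-squares linear probe for y on the current
    embeddings, then composes a projection that zeroes out the probe
    direction. Returns the composed projection P and debiased X @ P.T.
    """
    d = X.shape[1]
    P = np.eye(d)
    Xp = X.copy()
    for _ in range(n_iters):
        # Linear probe: minimum-norm least-squares fit of centered y.
        w, *_ = np.linalg.lstsq(Xp, y - y.mean(), rcond=None)
        P = nullspace_projection(w.reshape(1, -1)) @ P
        Xp = X @ P.T
    return P, Xp
```

In practice the probe would be a trained classifier (e.g. logistic regression over CLIP image embeddings) and the loop would stop once probe accuracy falls to chance; the least-squares probe here only keeps the sketch dependency-free.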