Data-driven algorithms are only as good as the data they work with, yet data sets, especially social data, often fail to represent minorities adequately. Representation bias in data can arise for various reasons, ranging from historical discrimination to selection and sampling biases in data acquisition and preparation methods. Given that "bias in, bias out", one cannot expect AI-based solutions to have equitable outcomes in societal applications without addressing issues such as representation bias. While fairness in machine learning models has been studied extensively, including in several review papers, bias in the data itself has received less attention. This paper reviews the literature on identifying and resolving representation bias as a feature of a data set, independent of how the data is consumed later. The scope of this survey is limited to structured (tabular) and unstructured (e.g., image, text, graph) data. It presents taxonomies that categorize the studied techniques along multiple design dimensions and provides a side-by-side comparison of their properties. There is still a long way to go to fully address representation bias in data. The authors hope that this survey motivates researchers to take on these challenges by drawing on existing work within their respective domains.