Despite extensive efforts to create fairer machine learning (ML) datasets, the practical aspects of dataset curation remain poorly understood. Drawing on interviews with 30 ML dataset curators, we present a comprehensive taxonomy of the challenges and trade-offs encountered throughout the dataset curation lifecycle. Our findings highlight overarching issues in the broader fairness landscape that affect data curation. We conclude with recommendations for systemic changes that would better support fair dataset curation practices.