Current research on bias in machine learning often focuses on fairness while overlooking the roots or causes of bias. However, bias was originally defined as a "systematic error," often caused by humans at different stages of the research process. This article aims to bridge the gap between the earlier literature on bias in research and current work on bias in machine learning by providing a taxonomy of potential sources of bias and error in data and models. The paper focuses on bias in machine learning pipelines. The survey analyzes over forty potential sources of bias in the machine learning (ML) pipeline, providing clear examples for each. By understanding the sources and consequences of bias in machine learning, better methods can be developed for detecting and mitigating it, leading to fairer, more transparent, and more accurate ML models.