Machine learning (ML) based approaches are increasingly used in applications with societal impact. Training ML models often requires vast amounts of labeled data, and crowdsourcing is a dominant paradigm for obtaining labels from multiple workers. Crowd workers sometimes provide unreliable labels; to address this, truth discovery (TD) algorithms such as majority voting are applied to determine consensus labels from conflicting worker responses. However, these consensus labels may still be biased with respect to sensitive attributes such as gender, race, or political affiliation. Even when sensitive attributes are not involved, the labels can be biased because workers differ in their perspectives on subjective aspects such as toxicity. In this paper, we conduct a systematic study of the bias and fairness of TD algorithms. Our findings on two existing crowd-labeled datasets reveal that a non-trivial proportion of workers provide biased results and that simple approaches to TD are sub-optimal. Our study also demonstrates that popular TD algorithms are not a panacea. Additionally, we quantify the impact of these unfair workers on downstream ML tasks and show that conventional methods for achieving fairness and correcting label biases are ineffective in this setting. We end the paper with a plea for the design of novel bias-aware truth discovery algorithms that can ameliorate these issues.
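For readers unfamiliar with truth discovery, the following is a minimal sketch of majority voting, the simplest TD algorithm named above. It is illustrative only and is not taken from the paper: the function name `majority_vote`, the `(item_id, worker_id, label)` tuple layout, the example responses, and the tie-breaking rule (smallest label in sort order) are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def majority_vote(responses):
    """Return one consensus label per item from (item_id, worker_id, label) tuples.

    Hypothetical helper for illustration; ties are broken by picking the
    smallest label in sort order, which is an arbitrary assumption.
    """
    votes = defaultdict(Counter)
    for item_id, _worker_id, label in responses:
        votes[item_id][label] += 1

    consensus = {}
    for item_id, counts in votes.items():
        top = max(counts.values())
        # Collect every label that reached the top count, then break ties.
        consensus[item_id] = min(lbl for lbl, c in counts.items() if c == top)
    return consensus


# Made-up responses from three workers on two items.
responses = [
    ("post1", "w1", "toxic"), ("post1", "w2", "toxic"), ("post1", "w3", "ok"),
    ("post2", "w1", "ok"), ("post2", "w2", "toxic"), ("post2", "w3", "ok"),
]
consensus = majority_vote(responses)
```

Note that this procedure weights every worker equally, so if a majority of workers share the same bias on an item, the consensus label inherits it.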