Are Bias Mitigation Techniques for Deep Learning Effective?

A critical problem in deep learning is that systems learn inappropriate biases, resulting in their inability to perform well on minority groups. This has led to the creation of multiple algorithms that endeavor to mitigate bias. However, it is not clear how effective these methods are. This is because study protocols differ among papers, systems are tested on datasets that fail to test many forms of bias, and systems have access to hidden knowledge or are tuned specifically to the test set. To address this, we introduce an improved evaluation protocol, sensible metrics, and a new dataset, which enables us to ask and answer critical questions about bias mitigation algorithms. We evaluate seven state-of-the-art algorithms using the same network architecture and hyperparameter selection policy across three benchmark datasets. We introduce a new dataset called Biased MNIST that enables assessment of robustness to multiple bias sources. We use Biased MNIST and a visual question answering (VQA) benchmark to assess robustness to hidden biases. Rather than only tuning to the test set distribution, we study robustness across different tuning distributions, which is critical because for many applications the test distribution may not be known during development. We find that algorithms exploit hidden biases, are unable to scale to multiple forms of bias, and are highly sensitive to the choice of tuning set. Based on our findings, we implore the community to adopt more rigorous assessment of future bias mitigation methods. All data, code, and results are publicly available at: https://github.com/erobic/bias-mitigators.
