Are Bias Mitigation Techniques for Deep Learning Effective?

A critical problem in deep learning is that systems learn inappropriate biases, resulting in their inability to perform well on minority groups. This has led to the creation of multiple algorithms that endeavor to mitigate bias. However, it is not clear how effective these methods are. This is because study protocols differ among papers, systems are tested on datasets that fail to test many forms of bias, and systems have access to hidden knowledge or are tuned specifically to the test set. To address this, we introduce an improved evaluation protocol, sensible metrics, and a new dataset, which enables us to ask and answer critical questions about bias mitigation algorithms. We evaluate seven state-of-the-art algorithms using the same network architecture and hyperparameter selection policy across three benchmark datasets. We introduce a new dataset called Biased MNIST that enables assessment of robustness to multiple bias sources. We use Biased MNIST and a visual question answering (VQA) benchmark to assess robustness to hidden biases. Rather than only tuning to the test set distribution, we study robustness across different tuning distributions, which is critical because for many applications the test distribution may not be known during development. We find that algorithms exploit hidden biases, are unable to scale to multiple forms of bias, and are highly sensitive to the choice of tuning set. Based on our findings, we implore the community to adopt more rigorous assessment of future bias mitigation methods. All data, code, and results are publicly available at: https://github.com/erobic/bias-mitigators.
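The abstract does not say which metrics the protocol uses. As an illustration only, the sketch below shows why a group-aware metric matters when minority groups are small: overall accuracy can look healthy while one group fails completely. It assumes group labels are available at evaluation time. The function names `overall_accuracy` and `mean_per_group_accuracy` and the toy data are hypothetical and are not taken from the linked repository.

```python
from collections import defaultdict


def overall_accuracy(preds, labels):
    """Fraction of samples predicted correctly, ignoring group membership."""
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels)


def mean_per_group_accuracy(preds, labels, groups):
    """Average of per-group accuracies, so each group counts equally.

    Returns (mean accuracy over groups, dict of accuracy per group).
    """
    hits = defaultdict(int)
    counts = defaultdict(int)
    for p, y, g in zip(preds, labels, groups):
        counts[g] += 1
        hits[g] += int(p == y)
    per_group = {g: hits[g] / counts[g] for g in counts}
    return sum(per_group.values()) / len(per_group), per_group


# Hypothetical toy data: a model that always predicts class 0.
# It is right on all 8 majority samples and wrong on both minority samples.
labels = [0] * 8 + [1] * 2
preds = [0] * 10
groups = ["majority"] * 8 + ["minority"] * 2

print(overall_accuracy(preds, labels))  # 0.8
mean_acc, per_group = mean_per_group_accuracy(preds, labels, groups)
print(mean_acc, per_group)  # mean_acc is 0.5; the minority group scores 0.0
```

In this toy case the overall accuracy of 0.8 hides a minority-group accuracy of 0.0, which the group-averaged score of 0.5 exposes.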
