UnbiasedNets: A Dataset Diversification Framework for Robustness Bias Alleviation in Neural Networks

The performance of trained neural network (NN) models, in terms of testing accuracy, has improved remarkably over the past several years, especially with the advent of deep learning. However, even the most accurate NNs can be biased toward a specific output class because of the inherent bias in the available training datasets, and this bias may propagate to real-world implementations. This paper deals with robustness bias, i.e., the bias exhibited by a trained NN that has significantly greater robustness to noise for a certain output class than for the remaining output classes. The bias is shown to result from imbalanced datasets, i.e., datasets in which not all output classes are equally represented. To address this, we propose the UnbiasedNets framework, which leverages K-means clustering and the NN's noise tolerance to diversify the given training dataset, even when that dataset is relatively small. This generates balanced datasets and reduces the bias within the datasets themselves. To the best of our knowledge, this is the first framework catering to the robustness bias problem in NNs. We use real-world datasets to demonstrate the efficacy of UnbiasedNets for data diversification, for both binary and multi-label classifiers. The results are compared to well-known tools aimed at generating balanced datasets, and illustrate that existing works have limited success in addressing robustness bias. In contrast, UnbiasedNets provides a notable improvement over existing works, and in some cases significantly reduces the robustness bias, as observed by comparing NNs trained on the diversified and original datasets.
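
To make the notion of robustness bias concrete, the following is a minimal sketch of one way to estimate per-class robustness to noise empirically. It is not the UnbiasedNets procedure: the function names (`per_class_robustness`, `toy_predict`), the uniform bounded noise model, the Monte Carlo sampling, and the one-dimensional threshold classifier standing in for a trained NN are all assumptions made for illustration.

```python
import numpy as np


def per_class_robustness(predict, X, y, noise_bound, n_trials=200, seed=0):
    """Estimate, per class, the fraction of noisy inputs that keep their label.

    predict     : maps an (n, d) array to an (n,) array of integer labels
                  (hypothetical stand-in for a trained NN).
    noise_bound : noise is drawn uniformly from [-noise_bound, +noise_bound]
                  (an assumed noise model).
    """
    rng = np.random.default_rng(seed)
    scores = {}
    for label in np.unique(y):
        Xc = X[y == label]
        # Only consider inputs that are classified correctly without noise.
        Xc = Xc[predict(Xc) == label]
        if len(Xc) == 0:
            scores[int(label)] = 0.0
            continue
        kept = 0
        for _ in range(n_trials):
            noise = rng.uniform(-noise_bound, noise_bound, size=Xc.shape)
            kept += np.sum(predict(Xc + noise) == label)
        scores[int(label)] = kept / (n_trials * len(Xc))
    return scores


def toy_predict(X):
    # Toy classifier: the decision threshold sits close to the class-0 inputs,
    # so class 0 is expected to tolerate less noise than class 1.
    return (X[:, 0] > 0.2).astype(int)


X = np.array([[0.0], [0.1], [0.9], [1.0]])
y = np.array([0, 0, 1, 1])

scores = per_class_robustness(toy_predict, X, y, noise_bound=0.3)
# A large gap between the most and least robust class indicates robustness bias.
bias_gap = max(scores.values()) - min(scores.values())
print(scores, bias_gap)
```

In this toy setup the class-1 inputs lie far from the threshold and keep their label under every sampled perturbation, while some perturbed class-0 inputs cross it; the resulting gap in scores is the kind of disparity the abstract refers to as robustness bias.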
