Reliable and robust evaluation methods are a necessary first step towards developing machine learning models that are themselves robust and reliable. Unfortunately, the evaluation protocols typically used to assess classifiers fail to evaluate performance comprehensively, as they tend to rely on limited types of test data and ignore others. For example, using the standard test data fails to evaluate the predictions the classifier makes for samples from classes it was not trained on. On the other hand, testing with data containing samples from unknown classes fails to evaluate how well the classifier can predict the labels of known classes. This article advocates benchmarking performance using a wide range of different types of data, and using a single metric that can be applied to all such data types to produce a consistent evaluation of performance. Using such a benchmark, it is found that current deep neural networks, including those trained with methods that are believed to produce state-of-the-art robustness, are extremely vulnerable to making mistakes on certain types of data. This means that such models will be unreliable in real-world scenarios where they may encounter data from many different domains, and that they are insecure, as they can easily be fooled into making the wrong decisions. It is hoped that these results will motivate the wider adoption of more comprehensive testing methods that will, in turn, lead to the development of more robust machine learning methods in the future. Code is available at: \url{https://codeberg.org/mwspratling/RobustnessEvaluation}