Modern artificial intelligence (AI) applications require large quantities of training and test data. This need raises critical challenges concerning not only the availability of such data but also its quality. For example, incomplete, erroneous, or inappropriate training data can lead to unreliable models that ultimately produce poor decisions. Trustworthy AI applications require training and test data that are of high quality along many dimensions, such as accuracy, completeness, and consistency. We empirically explore the relationship between six data quality dimensions and the performance of 19 popular machine learning algorithms covering the tasks of classification, regression, and clustering, with the goal of explaining their performance in terms of data quality. Our experiments distinguish three scenarios according to which steps of the AI pipeline receive polluted data: polluted training data, polluted test data, or both. We conclude the paper with an extensive discussion of our observations.
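The three pollution scenarios can be illustrated with a minimal sketch. It is not the experimental setup of the paper: the synthetic one-dimensional data, the nearest-centroid classifier, the label-flipping polluter, and all function names (`make_data`, `pollute_labels`, `fit_centroids`, `accuracy`, `run_scenarios`) are assumptions chosen only to keep the example self-contained.

```python
import random


def make_data(n, rng):
    """Synthetic two-class data in 1-D: class 0 near -2, class 1 near +2."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = rng.gauss(-2.0 if label == 0 else 2.0, 1.0)
        rows.append((x, label))
    return rows


def pollute_labels(rows, rate, rng):
    """Hypothetical polluter: flip each label with probability `rate`."""
    return [(x, 1 - y) if rng.random() < rate else (x, y) for x, y in rows]


def fit_centroids(rows):
    """Stand-in model: a nearest-centroid classifier (one mean per class)."""
    means = {}
    for label in (0, 1):
        xs = [x for x, y in rows if y == label]
        means[label] = sum(xs) / len(xs)
    return means


def accuracy(means, rows):
    """Share of rows whose nearest class centroid matches the given label."""
    hits = sum(
        1 for x, y in rows
        if min(means, key=lambda c: abs(x - means[c])) == y
    )
    return hits / len(rows)


def run_scenarios(rate=0.3, seed=0):
    """Evaluate a clean baseline and the three pollution scenarios."""
    rng = random.Random(seed)
    train, test = make_data(500, rng), make_data(500, rng)
    bad_train = pollute_labels(train, rate, rng)
    bad_test = pollute_labels(test, rate, rng)
    return {
        "clean": accuracy(fit_centroids(train), test),
        "polluted_train": accuracy(fit_centroids(bad_train), test),
        "polluted_test": accuracy(fit_centroids(train), bad_test),
        "polluted_both": accuracy(fit_centroids(bad_train), bad_test),
    }
```

Calling `run_scenarios()` returns one accuracy value per scenario, so the effect of polluting the training data, the test data, or both can be compared against the clean baseline. Label flipping touches only one quality dimension; other dimensions, such as completeness, would need their own polluters.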