Artificial Intelligence (AI) applications depend critically on data. Poor-quality data produces inaccurate and ineffective AI models, which may lead to incorrect or unsafe use. Evaluating data readiness is therefore a crucial step in improving the quality and appropriateness of the data used for AI. Although R&D efforts have gone into improving data quality, standardized metrics for evaluating data readiness for AI training are still evolving. In this study, we comprehensively survey the metrics used to verify data readiness for AI training. The survey examines more than 140 papers drawn from the ACM Digital Library, IEEE Xplore, journals and publishers such as Nature, Springer, and ScienceDirect, and online articles published by prominent AI experts. The survey aims to propose a taxonomy of data readiness for AI (DRAI) metrics for structured and unstructured datasets. We anticipate that this taxonomy will lead to new standards for DRAI metrics that can be used to enhance the quality, accuracy, and fairness of AI training and inference.