"Garbage In Garbage Out" is a universally agreed quote by computer scientists from various domains, including Artificial Intelligence (AI). As data is the fuel for AI, models trained on low-quality, biased data are often ineffective. Computer scientists who use AI invest a considerable amount of time and effort in preparing the data for AI. However, there are no standard methods or frameworks for assessing the "readiness" of data for AI. To provide a quantifiable assessment of the readiness of data for AI processes, we define parameters of AI data readiness and introduce AIDRIN (AI Data Readiness Inspector). AIDRIN is a framework covering a broad range of readiness dimensions available in the literature that aid in evaluating the readiness of data quantitatively and qualitatively. AIDRIN uses metrics in traditional data quality assessment such as completeness, outliers, and duplicates for data evaluation. Furthermore, AIDRIN uses metrics specific to assess data for AI, such as feature importance, feature correlations, class imbalance, fairness, privacy, and FAIR (Findability, Accessibility, Interoperability, and Reusability) principle compliance. AIDRIN provides visualizations and reports to assist data scientists in further investigating the readiness of data. The AIDRIN framework enhances the efficiency of the machine learning pipeline to make informed decisions on data readiness for AI applications.
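To make a few of the metrics named in the abstract concrete, the following is a minimal sketch of how completeness, duplicates, outliers, and class imbalance might be quantified. It is not AIDRIN's implementation or API: the function names, the row-as-dictionary data layout, the 1.5 × IQR outlier rule, and the max/min imbalance ratio are all assumptions chosen for illustration.

```python
from collections import Counter
from statistics import quantiles

# Hypothetical helpers for illustration only; these are NOT AIDRIN functions.

def completeness(rows, columns):
    """Fraction of non-missing (non-None) values in each column."""
    n = len(rows)
    return {c: sum(r.get(c) is not None for r in rows) / n for c in columns}

def duplicate_fraction(rows, columns):
    """Fraction of rows that exactly repeat an earlier row."""
    seen = Counter(tuple(r.get(c) for c in columns) for r in rows)
    repeats = sum(count - 1 for count in seen.values())
    return repeats / len(rows)

def iqr_outlier_fraction(values):
    """Fraction of values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (assumed rule)."""
    present = [v for v in values if v is not None]
    q1, _, q3 = quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return sum(v < low or v > high for v in present) / len(present)

def imbalance_ratio(labels):
    """Most common class count divided by least common (1.0 means balanced)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Toy dataset: one missing income value and one exact duplicate row.
rows = [
    {"age": 25, "income": 40, "label": "a"},
    {"age": 30, "income": None, "label": "a"},
    {"age": 25, "income": 40, "label": "a"},
    {"age": 41, "income": 52, "label": "b"},
]
columns = ["age", "income", "label"]

print(completeness(rows, columns))
print(duplicate_fraction(rows, columns))
print(iqr_outlier_fraction([10, 11, 12, 13, 14, 15, 16, 100]))
print(imbalance_ratio([r["label"] for r in rows]))
```

Each function returns a single number (or one number per column), which is the sense in which such checks give a quantifiable view of readiness. The AI-specific dimensions the abstract lists, such as fairness, privacy, and FAIR compliance, need more context than a toy table provides and are not sketched here.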