Data quality assessment has become a prominent component in the successful execution of complex data-driven artificial intelligence (AI) software systems. In practice, real-world applications generate huge volumes of data at high speed. These data streams require analysis and preprocessing before they are permanently stored or used in a learning task. Significant attention has therefore been paid to the systematic management and construction of high-quality datasets. Nevertheless, voluminous, high-velocity data streams are usually managed manually (i.e., offline), which is impractical in production environments. To address this challenge, DataOps has emerged to automate the life cycle of data processes using DevOps principles. However, determining data quality on a fitness scale remains a complex task within the DataOps framework. This paper presents a novel Data Quality Scoring Operations (DQSOps) framework that yields a quality score for production data in DataOps workflows. The framework incorporates two scoring approaches: an ML prediction-based approach that predicts the data quality score, and a standard-based approach that periodically produces ground-truth scores by assessing several data quality dimensions. We deploy the DQSOps framework in a real-world industrial use case. The results show that DQSOps achieves a significant computational speedup over the conventional approach to data quality scoring while maintaining high prediction performance.
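To make the two scoring paths concrete, the following is a minimal sketch under stated assumptions, not the implementation described in the paper. The choice of dimensions (completeness and uniqueness), the unweighted mean used to combine them, the `period` parameter, and the `predict` callable standing in for a trained ML model are all hypothetical.

```python
from statistics import mean


def completeness(records, fields):
    """Fraction of expected cells that are present and not None."""
    total = len(records) * len(fields)
    if total == 0:
        return 1.0
    filled = sum(1 for r in records for f in fields if r.get(f) is not None)
    return filled / total


def uniqueness(records, fields):
    """Fraction of records that are distinct over the given fields."""
    if not records:
        return 1.0
    distinct = {tuple(r.get(f) for f in fields) for r in records}
    return len(distinct) / len(records)


def standard_score(records, fields):
    """Standard-based score: unweighted mean of the dimension scores.

    Assumption: the abstract does not say which dimensions are used or
    how they are combined; two dimensions and a plain mean are used here.
    """
    return mean([completeness(records, fields), uniqueness(records, fields)])


def score_stream(windows, fields, predict, period=5):
    """Score each data window in order.

    Every `period`-th window (starting with the first) takes the slower
    standard-based path; all other windows use the `predict` callable,
    a hypothetical stand-in for a trained ML model.
    """
    scores = []
    for i, window in enumerate(windows):
        if i % period == 0:
            scores.append(("standard", standard_score(window, fields)))
        else:
            scores.append(("predicted", predict(window)))
    return scores
```

In this sketch the periodic standard-based scores could also serve as fresh labels for retraining the predictor, although how the paper uses them beyond "ground truth" is not stated in the abstract.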