Data pipelines are an integral part of many modern data-driven systems. Despite their importance, however, they are often unreliable and deliver poor-quality data. A critical step toward improving this situation is a solid understanding of the aspects that contribute to the quality of data pipelines. This article therefore first introduces a taxonomy of 41 factors that influence the ability of data pipelines to provide quality data. The taxonomy is based on a multivocal literature review and validated by eight interviews with experts from the data engineering domain. Data, infrastructure, life cycle management, development & deployment, and processing were found to be the main influencing themes. Second, by mining GitHub projects and Stack Overflow posts, we investigate the root causes of data-related issues, their location in data pipelines, and the main topics of data pipeline processing issues that developers face. We found that data-related issues are primarily caused by incorrect data types (33%) and occur mainly in the data cleaning stage of pipelines (35%). Data integration and ingestion were the topics developers asked about most often, accounting for nearly half (47%) of all questions. Compatibility issues emerged as a separate problem area, in addition to issues corresponding to the usual data pipeline processing areas (i.e., data loading, ingestion, integration, cleaning, and transformation). These findings suggest that future research should analyze compatibility and data type issues in more depth and assist developers with data integration and ingestion tasks. The proposed taxonomy is valuable to practitioners in the context of quality assurance activities and fosters future research into data pipeline quality.