A variety of pipeline tools are currently available for data engineering. Data scientists can use these tools to resolve data wrangling issues and to carry out data engineering tasks, from data ingestion through data preparation to the use of the data as input for machine learning (ML). Some of these tools have essential built-in components or can be combined with other tools to perform the desired data engineering operations. Although some tools are wholly or partly commercial, several open-source tools are available to perform expert-level data engineering tasks. This survey examines the broad categories of pipeline tools, with examples, based on their design and data engineering intentions. These categories are Extract Transform Load/Extract Load Transform (ETL/ELT); pipelines for Data Integration, Ingestion, and Transformation; Data Pipeline Orchestration and Workflow Management; and Machine Learning Pipelines. The survey also outlines how tools in each category are used, with examples, and closes with a discussion of case studies that illustrate the use of pipeline tools for data engineering. The case studies present first-user application experiences with sample data, some complexities of the applied pipelines, and a summary of approaches to using these tools to prepare data for machine learning.