A Survey of Pipeline Tools for Data Engineering

A variety of pipeline tools are currently available for data engineering. Data scientists can use these tools to resolve data wrangling issues and to carry out data engineering tasks, from data ingestion through data preparation to the use of the data as input for machine learning (ML). Some of these tools have essential built-in components or can be combined with other tools to perform the desired data engineering operations. While some tools are wholly or partly commercial, several open-source tools are available to perform expert-level data engineering tasks. This survey examines the broad categories of pipeline tools, with examples, based on their design and data engineering intentions. These categories are Extract Transform Load/Extract Load Transform (ETL/ELT); pipelines for Data Integration, Ingestion, and Transformation; Data Pipeline Orchestration and Workflow Management; and Machine Learning Pipelines. The survey also outlines, with examples, how the tools in each category are used. Finally, it presents a discussion with case studies that illustrate the use of pipeline tools for data engineering. The case studies present first-user application experiences with sample data, some complexities of the applied pipelines, and a summary of approaches to using these tools to prepare data for machine learning.

