AI systems cannot exist without data. Now that AI models (data science and AI) have matured and are readily available to apply in practice, most organizations struggle with the data infrastructure needed to do so. There is a growing need for data engineers who know how to prepare data for AI systems or who can set up enterprise-wide data architectures for analytical projects. Until now, however, the data engineering part of AI engineering has received little attention compared with the modeling part. In this paper we aim to change this by performing a mapping study on data engineering for AI systems, i.e., AI data engineering. We found 25 relevant papers published between January 2019 and June 2023 that explain AI data engineering activities. We identify which life cycle phases are covered, which technical solutions or architectures are proposed, and which lessons learned are presented. We end with an overall discussion of the papers and their implications for practitioners and researchers. This paper provides an overview of the body of knowledge on data engineering for AI, which is useful for practitioners to identify solutions and best practices and for researchers to identify gaps.