Data processing is one of the fundamental steps in machine learning pipelines for ensuring data quality. Most applications adopt the user-defined function (UDF) design pattern for data processing in databases. Although the UDF design pattern offers flexibility, reusability, and scalability, the increasing demands of machine learning pipelines expose three shortcomings of this pattern: it is not low-code, not dependency-free, and not knowledge-aware. To address these challenges, we propose a new design pattern in which large language models (LLMs) work as generic data operators (LLM-GDOs) for reliable data cleansing, transformation, and modeling, drawing on their human-compatible performance. In the LLM-GDO design pattern, user-defined prompts (UDPs) represent the data processing logic, rather than implementations in a specific programming language. LLMs can be centrally maintained, so users do not have to manage dependencies at run-time. Fine-tuning LLMs with domain-specific data can enhance performance on domain-specific tasks, which makes data processing knowledge-aware. We illustrate these advantages with examples from different data processing tasks. Furthermore, we summarize the challenges and opportunities introduced by LLMs to provide a complete view of this design pattern and to encourage further discussion.
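To make the contrast between a UDF and a UDP concrete, the following is a minimal sketch under stated assumptions, not the implementation described in this work. The names `LLMClient`, `make_gdo`, and `fake_llm` are hypothetical, and `fake_llm` is an offline stand-in for a centrally hosted model, so the example runs without any model dependency.

```python
from datetime import datetime
from typing import Callable

# Hypothetical interface: any callable that maps a prompt to a completion.
LLMClient = Callable[[str], str]


def udf_normalize_date(raw: str) -> str:
    """Conventional UDF: the processing logic is hand-coded in Python."""
    return datetime.strptime(raw, "%m/%d/%Y").strftime("%Y-%m-%d")


def make_gdo(llm: LLMClient, udp: str) -> Callable[[str], str]:
    """Build a row-level operator whose logic is a user-defined prompt (UDP)."""

    def operator(value: str) -> str:
        # The UDP carries the logic; the row value is appended as input.
        prompt = f"{udp}\n\nInput: {value}\nOutput:"
        return llm(prompt).strip()

    return operator


def fake_llm(prompt: str) -> str:
    """Assumed stand-in for a real model: parses the input and reuses the UDF."""
    value = prompt.split("Input: ", 1)[1].split("\n", 1)[0]
    return udf_normalize_date(value)


# The same task expressed as a prompt instead of code.
normalize = make_gdo(fake_llm, "Convert the date to ISO 8601 (YYYY-MM-DD).")
rows = ["01/02/2023", "12/31/1999"]
cleaned = [normalize(r) for r in rows]
```

In this sketch, changing the task only means changing the prompt string passed to `make_gdo`; with a real model behind `LLMClient`, the caller would not ship task-specific code or manage its dependencies, although the output would then need validation.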