CleanAgent: Automating Data Standardization with LLM-based Agents

Data standardization is a crucial part of the data science life cycle. While tools like Pandas offer robust functionality, their complexity and the manual effort required to customize code for diverse column types pose significant challenges. Although large language models (LLMs) like ChatGPT have shown promise in automating this process through natural language understanding and code generation, using them still demands expert-level programming knowledge and continuous interaction for prompt refinement. To address these challenges, our key idea is a Python library with declarative, unified APIs for standardizing column types, which reduces LLM code generation to concise API calls. We first propose Dataprep.Clean, a component of the Dataprep library that significantly reduces complexity by enabling the standardization of a specific column type with a single line of code. We then introduce the CleanAgent framework, which integrates Dataprep.Clean with LLM-based agents to automate the data standardization process. With CleanAgent, data scientists need only provide their requirements once, allowing for a hands-free, automatic standardization process.
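To make the idea of a declarative, unified API concrete, the following is a minimal sketch using only the Python standard library. It is not Dataprep.Clean's implementation or API: the names `clean_column`, `STANDARDIZERS`, `_std_email`, and `_std_date`, the list-of-dicts data layout, and the accepted date formats are all assumptions made for illustration. The point it shows is that, once each column type has a standardizer behind a single entry point, the caller (or an LLM generating code) only needs one short call per column.

```python
import re
from datetime import datetime

# Hypothetical per-type standardizers (illustrative only, not Dataprep.Clean code).
def _std_email(value):
    # Trim whitespace, lowercase, and reject strings that do not look like an email.
    value = value.strip().lower()
    return value if re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", value) else None

def _std_date(value):
    # Try a few assumed input formats and emit ISO 8601 (YYYY-MM-DD).
    for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%Y/%m/%d"):
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None  # unparseable values become None

# One registry maps a declared column type to its standardizer.
STANDARDIZERS = {"email": _std_email, "date": _std_date}

def clean_column(rows, column, column_type):
    """Return new rows with a '<column>_clean' field holding the standardized value."""
    standardize = STANDARDIZERS[column_type]
    return [{**row, f"{column}_clean": standardize(row[column])} for row in rows]

rows = [
    {"email": " Alice@Example.COM ", "joined": "03/02/2021"},
    {"email": "not-an-email", "joined": "2021/03/05"},
]

# One declarative call per column: state the column and its type, nothing else.
rows = clean_column(rows, "email", "email")
rows = clean_column(rows, "joined", "date")
```

Under this design, an agent that has inferred each column's type only has to emit a single call per column instead of bespoke parsing code, which is the simplification the abstract describes.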

