SEED: Domain-Specific Data Curation With Large Language Models

Data curation tasks that prepare data for analytics are critical for turning data into actionable insights. However, because application requirements differ across domains, generic off-the-shelf tools are typically insufficient. As a result, data scientists often have to develop domain-specific solutions tailored to both the dataset and the task, e.g., by writing domain-specific code or training machine learning models on a sufficient number of annotated examples. This process is notoriously difficult and time-consuming. We present SEED, an LLM-as-compiler approach that automatically generates domain-specific data curation solutions via Large Language Models (LLMs). Once the user describes a task, input data, and expected output, the SEED compiler produces an executable pipeline composed of LLM-generated code, small models, and data access modules. SEED uses these generated modules to process most of the data records and dynamically decides when the LLM should step in to process individual records directly, possibly using the data access modules to retrieve relevant information from the data sources to assist the LLM in solving the task. To validate this new approach, we conducted experiments on 9 datasets spanning 5 data curation tasks. The results show that SEED generates domain-specific solutions that significantly outperform their generic counterparts, often approaching the performance of manually curated solutions that use thousands of labeled training examples. Moreover, compared with solutions that use the LLM on every data record, SEED achieves state-of-the-art or comparable few-shot performance while significantly reducing the number of LLM calls.
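The division of labor described in the abstract, where cheap generated modules handle most records and the LLM handles the rest, can be illustrated with a minimal Python sketch. This is not SEED's actual implementation: the names `run_cascade`, `rule_module`, `small_model_stub`, and `llm_stub` are hypothetical, the modules are hand-written stand-ins for what a compiler would generate, and the confidence threshold is an assumed routing rule, since the abstract only says that SEED "dynamically decides" when the LLM steps in.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

# Assumption: each cheap module returns (answer, confidence), or None to abstain.
Prediction = Tuple[str, float]
Module = Callable[[dict], Optional[Prediction]]


@dataclass
class CascadeResult:
    answer: str
    source: str  # name of the stage that produced the answer


def run_cascade(
    record: dict,
    cheap_modules: Sequence[Tuple[str, Module]],
    llm_fallback: Callable[[dict], str],
    threshold: float = 0.8,  # assumed routing rule, not taken from the paper
) -> CascadeResult:
    """Try cheap modules in order; call the LLM only if none is confident."""
    for name, module in cheap_modules:
        prediction = module(record)
        if prediction is not None and prediction[1] >= threshold:
            return CascadeResult(prediction[0], name)
    return CascadeResult(llm_fallback(record), "llm")


def rule_module(record: dict) -> Optional[Prediction]:
    """Stand-in for generated code: impute a country from a city name."""
    table = {"paris": "France", "tokyo": "Japan"}
    city = record.get("city", "").strip().lower()
    return (table[city], 1.0) if city in table else None


def small_model_stub(record: dict) -> Optional[Prediction]:
    """Stand-in for a small model that is unsure about every record."""
    return ("Unknown", 0.3)


llm_calls = []  # records that were escalated to the (stubbed) LLM


def llm_stub(record: dict) -> str:
    """Stand-in for a real LLM call; it only logs that it was invoked."""
    llm_calls.append(record)
    return "needs-llm"


modules = [("generated_code", rule_module), ("small_model", small_model_stub)]
records = [{"city": "Paris"}, {"city": "tokyo "}, {"city": "Springfield"}]
results = [run_cascade(r, modules, llm_stub) for r in records]

for rec, res in zip(records, results):
    print(rec["city"].strip(), "->", res.answer, f"(via {res.source})")
print("LLM calls:", len(llm_calls), "of", len(records))
```

In this toy run only the record that no cheap module answers confidently reaches the stubbed LLM, which is the kind of saving in LLM calls the abstract refers to.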
