Data curation tasks that prepare data for analytics are critical for turning data into actionable insights. However, due to the diverse requirements of applications in different domains, generic off-the-shelf tools are typically insufficient. As a result, data scientists often have to develop domain-specific solutions tailored to both the dataset and the task, e.g., writing domain-specific code or training machine learning models on a sufficient number of annotated examples. This process is notoriously difficult and time-consuming. We present SEED, an LLM-as-compiler approach that automatically generates domain-specific data curation solutions via Large Language Models (LLMs). Once the user describes a task, input data, and expected output, the SEED compiler produces a hybrid pipeline that combines LLM querying with more cost-effective alternatives, such as vector-based caching, LLM-generated code, and small models trained on LLM-annotated data. SEED features an optimizer that automatically selects from these four LLM-assisted modules and forms a hybrid execution pipeline that best fits the task at hand. To validate this approach, we conducted experiments on $9$ datasets spanning $5$ data curation tasks. In comparison to solutions that invoke the LLM on every data record, SEED achieves state-of-the-art or comparable few-shot performance while significantly reducing the number of LLM calls.
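The routing idea behind the hybrid pipeline can be sketched in a few lines: cheap modules are tried first and the full LLM is only invoked as a last resort. This is a minimal illustrative sketch, not SEED's actual implementation; all names (`HybridPipeline`, `generated_fn`, `small_model`, the exact-match cache standing in for a vector-similarity cache) are hypothetical.

```python
from typing import Callable, Optional, Tuple

class HybridPipeline:
    """Routes each record through cheap modules first, falling back to the LLM.

    Illustrative sketch: the exact-match cache stands in for SEED's
    vector-based caching, and the module interfaces are assumptions.
    """

    def __init__(self, llm: Callable[[str], str]):
        self.llm = llm
        self.cache: dict = {}                  # 1) cached answers from earlier calls
        self.generated_fn: Optional[Callable[[str], Optional[str]]] = None
        #    2) LLM-generated code; returns None when it cannot handle a record
        self.small_model: Optional[Callable[[str], Tuple[str, float]]] = None
        #    3) small model trained on LLM annotations; returns (label, confidence)

    def run(self, record: str, confidence_threshold: float = 0.9) -> str:
        if record in self.cache:               # cache hit: no LLM call at all
            return self.cache[record]
        if self.generated_fn is not None:      # generated code may abstain
            result = self.generated_fn(record)
            if result is not None:
                self.cache[record] = result
                return result
        if self.small_model is not None:       # accept only confident predictions
            label, confidence = self.small_model(record)
            if confidence >= confidence_threshold:
                self.cache[record] = label
                return label
        answer = self.llm(record)              # 4) last resort: query the LLM
        self.cache[record] = answer
        return answer
```

In this sketch, each resolved record also populates the cache, so repeated records never trigger a second LLM call; an optimizer along the lines described above would additionally decide which of the cheaper modules to instantiate for a given task.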