Existing approaches to automatic data transformation fall short of the requirements of many real-world scenarios, such as the building sector. First, they offer no convenient interface through which domain experts can supply domain knowledge. Second, they incur significant overhead for collecting training data. Third, their accuracy degrades under complicated schema changes. To bridge this gap, we present a novel approach that leverages the unique capabilities of large language models (LLMs) in coding, complex reasoning, and zero-shot learning to generate SQL code that transforms source datasets into target datasets. We demonstrate the viability of this approach by designing an LLM-based framework, termed SQLMorpher, which comprises a prompt generator that integrates the initial prompt with optional domain knowledge and historical patterns stored in external databases. It also implements an iterative prompt optimization mechanism that automatically improves the prompt based on flaw detection. The key contributions of this work include (1) pioneering an end-to-end LLM-based solution for data transformation, (2) developing a benchmark dataset of 105 real-world building energy data transformation problems, and (3) conducting an extensive empirical evaluation in which our approach achieved 96% accuracy across all 105 problems. SQLMorpher demonstrates the effectiveness of LLMs on complex, domain-specific challenges, highlighting their potential to drive sustainable solutions.
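The interplay between the prompt generator and the iterative prompt optimization can be illustrated with a minimal sketch. It is not SQLMorpher's implementation: the function names (`build_prompt`, `transform`, `fake_llm`), the table layouts, and the prompt wording are all hypothetical, the LLM is replaced by a stub, and "flaw detection" is assumed here to mean nothing more than catching SQL execution errors in SQLite.

```python
import sqlite3
from typing import Callable, Optional


def build_prompt(source_schema: str, target_schema: str,
                 domain_knowledge: str = "", error_feedback: str = "") -> str:
    """Assemble a prompt from the schemas plus optional hints (hypothetical format)."""
    parts = [
        "Write one SQL statement that inserts the source rows into the target table.",
        f"Source schema: {source_schema}",
        f"Target schema: {target_schema}",
    ]
    if domain_knowledge:
        parts.append(f"Domain knowledge: {domain_knowledge}")
    if error_feedback:
        parts.append(f"The previous attempt failed with: {error_feedback}")
    return "\n".join(parts)


def transform(conn: sqlite3.Connection, llm: Callable[[str], str],
              source_schema: str, target_schema: str,
              domain_knowledge: str = "", max_rounds: int = 3) -> Optional[str]:
    """Ask the LLM for SQL, run it, and retry with the error text on failure."""
    feedback = ""
    for _ in range(max_rounds):
        prompt = build_prompt(source_schema, target_schema, domain_knowledge, feedback)
        sql = llm(prompt)
        try:
            conn.execute(sql)
            conn.commit()
            return sql  # the statement ran without error
        except sqlite3.Error as exc:
            conn.rollback()
            feedback = str(exc)  # assumed flaw signal: the database error message
    return None


def fake_llm(prompt: str) -> str:
    """Stub standing in for a real LLM: wrong column first, fixed after feedback."""
    if "previous attempt failed" in prompt:
        return "INSERT INTO target (ts, kwh) SELECT time, energy_wh / 1000.0 FROM source"
    return "INSERT INTO target (ts, kwh) SELECT timestamp, energy_wh / 1000.0 FROM source"


# Toy source and target tables (hypothetical building energy readings).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (time TEXT, energy_wh REAL)")
conn.execute("CREATE TABLE target (ts TEXT, kwh REAL)")
conn.execute("INSERT INTO source VALUES ('2024-01-01T00:00', 1500.0)")
conn.commit()

final_sql = transform(
    conn, fake_llm,
    source_schema="source(time TEXT, energy_wh REAL)",
    target_schema="target(ts TEXT, kwh REAL)",
    domain_knowledge="energy_wh is in watt-hours; kwh is in kilowatt-hours",
)
rows = conn.execute("SELECT ts, kwh FROM target").fetchall()
```

In this toy run the first generated statement references a column that does not exist, the database error is fed back into the next prompt, and the second statement succeeds. A fuller flaw detector would presumably also compare the transformed rows against the expected target data rather than rely on execution errors alone.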