Writing programs that manipulate data correctly is difficult, because the underlying programming languages and APIs are hard to learn for the many users who are not skilled programmers. Large language models (LLMs) show remarkable potential for generating code from natural language. In the data manipulation domain, however, the natural language (NL) description of the intended task is accompanied by the dataset on which the task is to be performed, the "data context". Existing approaches use the data context only in a limited way, by adding relevant information from the input data to the prompts sent to the LLM. In this work, we use the available input data to execute the candidate programs generated by the LLMs and gather their outputs. We introduce semantic reranking, a technique that reranks the programs generated by LLMs based on three signals derived from the program outputs: (a) semantic filtering and well-formedness-based score tuning: do the programs generate well-formed outputs at all? (b) semantic interleaving: how do the outputs of different candidates compare to each other? and (c) output-based score tuning: how do the outputs compare to the outputs predicted for the same task? We provide theoretical justification for semantic interleaving. We also introduce temperature mixing, in which we combine samples generated by LLMs at both high and low temperatures. We extensively evaluate our approach in three domains, namely databases (SQL), data science (Pandas), and business intelligence (Excel's Power Query M), on a variety of new and existing benchmarks. We observe substantial gains across domains, with improvements of up to 45% in top-1 accuracy and 34% in top-3 accuracy.
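To make the execution-based signals concrete, the following is a minimal sketch of signals (a) and (b), not the authors' implementation. It assumes candidates are plain Python callables with a hypothetical LLM `score` (higher is better), treats a crash, `None`, or an empty collection as an ill-formed output, and reads semantic interleaving as a round-robin over groups of candidates that produce identical outputs, so that the top-k list covers distinct behaviors. All names (`Candidate`, `is_well_formed`, `semantic_rerank`) are illustrative.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class Candidate:
    name: str
    program: Callable[[Any], Any]  # hypothetical stand-in for LLM-generated code
    score: float                   # hypothetical LLM score, higher is better


def run(candidate: Candidate, data: Any) -> Tuple[bool, Any]:
    """Execute a candidate on the data context; report whether it crashed."""
    try:
        return True, candidate.program(data)
    except Exception:
        return False, None


def is_well_formed(output: Any) -> bool:
    """Assumed well-formedness check: reject None and empty collections."""
    if output is None:
        return False
    if isinstance(output, (list, tuple, dict, set, str)) and len(output) == 0:
        return False
    return True


def semantic_rerank(candidates: List[Candidate], data: Any) -> List[Candidate]:
    # Visit candidates from best to worst score, so each output group is
    # created by its best-scoring member and groups stay in that order.
    groups: Dict[str, List[Candidate]] = {}
    for cand in sorted(candidates, key=lambda c: c.score, reverse=True):
        ok, output = run(cand, data)
        if not ok or not is_well_formed(output):
            continue  # semantic filtering: drop crashing or ill-formed programs
        groups.setdefault(repr(output), []).append(cand)

    # Interleaving (assumed round-robin): take one candidate per distinct
    # output before taking a second candidate from any group.
    ranked: List[Candidate] = []
    queues = list(groups.values())
    while queues:
        for queue in queues:
            ranked.append(queue.pop(0))
        queues = [q for q in queues if q]
    return ranked


data = [3, 1, 2, 3]
candidates = [
    Candidate("out_of_range", lambda d: d[10], -0.05),              # crashes
    Candidate("dedupe_sort", lambda d: sorted(set(d)), -0.10),
    Candidate("empty_filter", lambda d: [x for x in d if x > 5], -0.12),
    Candidate("dedupe_sort_alt", lambda d: sorted({*d}), -0.15),
    Candidate("plain_sort", lambda d: sorted(d), -0.20),
]

ranked_names = [c.name for c in semantic_rerank(candidates, data)]
print(ranked_names)  # ['dedupe_sort', 'plain_sort', 'dedupe_sort_alt']
```

Ranking by score alone would place `dedupe_sort_alt` second even though it returns the same output as `dedupe_sort`; the interleaved order promotes `plain_sort`, the only candidate with a different output, into the top 2. Output-based score tuning (signal (c)) and temperature mixing are not covered by this sketch.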