Currently, in-context learning with large language models (LLMs) has become the mainstream approach in text-to-SQL research. Previous works have discussed how to select demonstrations related to the user question from a human-labeled demonstration pool. However, human labeling suffers from insufficient diversity and high labeling overhead. Therefore, in this paper, we discuss how to measure and improve the diversity of demonstrations for text-to-SQL. We present a metric to measure the diversity of demonstrations and show experimentally that existing labeled data is insufficiently diverse. Based on this finding, we propose fusing iteratively for demonstrations (Fused), which builds a high-diversity demonstration pool through human-free, multiple-iteration synthesis, improving diversity and lowering labeling cost. Our method achieves an average improvement of 3.2% and 5.0% with and without human labeling, respectively, on several mainstream datasets, which demonstrates the effectiveness of Fused.