Understanding a textual description and generating code from it appears to be an established capability of instruction-following Large Language Models (LLMs) in zero-shot scenarios. However, there is a serious possibility that this translation ability is influenced by the model having seen the target textual descriptions and the related code during training. This effect is known as Data Contamination. In this study, we investigate the impact of Data Contamination on the performance of GPT-3.5 in Text-to-SQL code-generation tasks. To this end, we introduce a novel method to detect Data Contamination in GPTs and examine GPT-3.5's Text-to-SQL performance using the well-known Spider dataset and our new, unfamiliar dataset, Termite. Furthermore, we analyze GPT-3.5's efficacy on databases with modified information via an adversarial table disconnection (ATD) approach, which complicates Text-to-SQL tasks by removing structural pieces of information from the database. Our results indicate a significant performance drop in GPT-3.5 on the unfamiliar Termite dataset, even with ATD modifications, highlighting the effect of Data Contamination on LLMs in Text-to-SQL translation tasks.