Existing context-dependent text-to-SQL models are constrained by the limited scale of annotated data, which stems from the complexity of labeling. Data augmentation is commonly used to address this problem, but the data generated by current augmentation methods often lack diversity. In this paper, we introduce ConDA, which generates interactive questions and the corresponding SQL results. We design an SQL dialogue state that enhances data diversity through state transitions. To ensure data quality, we also present a filtering method that uses a grounding model to identify and remove low-quality questions that mismatch the state information. Experimental results on the SParC and CoSQL datasets show that ConDA improves the baseline model by an average of $3.3\%$ on complex questions. Moreover, our analysis of the augmented data reveals that the data generated by ConDA are of high quality in terms of SQL template hardness and types, number of turns, and question consistency.