In this paper, we introduce EHR-SeqSQL, a novel sequential text-to-SQL dataset for Electronic Health Record (EHR) databases. EHR-SeqSQL is designed to address three critical yet underexplored aspects of text-to-SQL parsing: interactivity, compositionality, and efficiency. To the best of our knowledge, EHR-SeqSQL is both the largest medical text-to-SQL benchmark and the first to include sequential and contextual questions. We provide a data split and a new test set designed to assess compositional generalization ability. Our experiments demonstrate that a multi-turn approach outperforms a single-turn approach in learning compositionality. Additionally, our dataset integrates specially crafted tokens into SQL queries to improve execution efficiency. With EHR-SeqSQL, we aim to bridge the gap between practical needs and academic research in the text-to-SQL domain. EHR-SeqSQL is available at https://github.com/seonhee99/EHR-SeqSQL.