Recent advances in Large Language Models (LLMs) have enabled text-to-SQL models that allow clinicians to query structured data stored in Electronic Health Records (EHRs) using natural language. However, deploying these models in EHR question answering (QA) systems within safety-critical clinical environments remains challenging: incorrect SQL queries, whether caused by model errors or problematic user inputs, can undermine clinical decision-making and jeopardize patient care. While prior work has mainly focused on improving SQL generation accuracy or filtering questions before execution, there is no unified benchmark for evaluating independent post-hoc verification mechanisms (i.e., components that inspect and validate generated SQL before execution), which are crucial for safe deployment. To fill this gap, we introduce SCARE, a benchmark for evaluating methods that act as a post-hoc safety layer in EHR QA systems. SCARE evaluates the joint task of (1) classifying question answerability (i.e., determining whether a question is answerable, ambiguous, or unanswerable) and (2) verifying or correcting candidate SQL queries. The benchmark comprises 4,200 triples of questions, candidate SQL queries, and expected model outputs, grounded in the MIMIC-III, MIMIC-IV, and eICU databases. It covers a diverse set of questions paired with candidate SQL queries generated by seven different text-to-SQL models, ensuring a realistic and challenging evaluation. Using SCARE, we benchmark a range of approaches, from two-stage methods to agentic frameworks. Our experiments reveal a critical trade-off between question classification and SQL error correction, highlighting key challenges and outlining directions for future research.
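To make the joint task concrete, the sketch below shows what a benchmark instance and a verifier's expected output could look like. All names here (ScareInstance, verify, gold_sql, the example question and SQL) are illustrative assumptions, not the benchmark's released schema; the SQL assumes a MIMIC-III-style prescriptions table, where dates are shifted into future years for de-identification.

```python
# A minimal sketch of one SCARE-style evaluation instance and the joint task
# a post-hoc verifier must solve. Field and function names are hypothetical.
from dataclasses import dataclass
from typing import Literal, Optional

@dataclass
class ScareInstance:
    question: str       # clinician's natural-language question
    candidate_sql: str  # SQL produced by one of the text-to-SQL models
    # Gold answerability class for subtask (1):
    label: Literal["answerable", "ambiguous", "unanswerable"]
    # Expected SQL for subtask (2); None when the question should not be executed:
    gold_sql: Optional[str]

def verify(instance: ScareInstance) -> tuple[str, Optional[str]]:
    """Joint task: return (answerability class, verified-or-corrected SQL).

    Returning None as the SQL means the verifier abstains from execution,
    e.g., for ambiguous or unanswerable questions.
    """
    raise NotImplementedError  # each benchmarked method implements this

# Hypothetical instance: the candidate SQL is missing the year filter,
# so the expected output keeps the "answerable" label but corrects the query.
example = ScareInstance(
    question="How many patients were prescribed aspirin in 2105?",
    candidate_sql=(
        "SELECT COUNT(DISTINCT subject_id) FROM prescriptions "
        "WHERE drug = 'Aspirin'"
    ),
    label="answerable",
    gold_sql=(
        "SELECT COUNT(DISTINCT subject_id) FROM prescriptions "
        "WHERE drug = 'Aspirin' AND strftime('%Y', startdate) = '2105'"
    ),
)
```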