Extracting insights from Electronic Health Record (EHR) databases often requires SQL expertise, which creates a barrier for healthcare decision-making and research. Using Large Language Models (LLMs) to translate natural language questions into SQL via Retrieval-Augmented Generation (RAG) is a promising approach, but adapting it to the medical domain is non-trivial. Standard RAG relies on single-step retrieval from a static pool of examples, which struggles with the variability and noise of medical terminology and jargon. This often leads to anti-patterns such as expanding the pool of task demonstrations to improve coverage, which in turn introduces noise and scalability problems. To address these issues, we introduce CBR-to-SQL, a framework inspired by Case-Based Reasoning (CBR). It represents question-SQL pairs as reusable, abstract case templates and uses a two-stage retrieval process that first captures the logical structure of the question and then resolves the relevant entities. Evaluated on MIMICSQL, CBR-to-SQL achieves state-of-the-art logical form accuracy and competitive execution accuracy. More importantly, it demonstrates higher sample efficiency and robustness than standard RAG approaches, particularly under data scarcity and retrieval perturbations.
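To make the two-stage idea concrete, the following is a minimal toy sketch, not the CBR-to-SQL implementation. It assumes that entity mentions are double-quoted in the question, that token-overlap (Jaccard) similarity stands in for whatever retriever the framework actually uses, and that fuzzy string matching stands in for entity resolution. The case base, the table and column names, and all function names are hypothetical and chosen only for illustration.

```python
import re
from difflib import get_close_matches

# Hypothetical case base: abstract question templates paired with SQL templates.
# Table and column names are illustrative, not taken from the MIMICSQL schema.
CASES = [
    ("how many patients were diagnosed with [VALUE]",
     "SELECT COUNT(DISTINCT subject_id) FROM diagnoses "
     "WHERE short_title = '[VALUE]'"),
    ("what is the age of patient [VALUE]",
     "SELECT age FROM demographic WHERE name = '[VALUE]'"),
]

# Hypothetical vocabulary of values stored in the database.
DIAGNOSES = ["hypertension", "sepsis", "pneumonia"]


def mask_entities(question):
    """Abstract a question into a template by masking double-quoted mentions."""
    entities = re.findall(r'"([^"]+)"', question)
    template = re.sub(r'"[^"]+"', "[VALUE]", question.lower())
    return template.rstrip("?").strip(), entities


def jaccard(a, b):
    """Token-overlap similarity between two whitespace-tokenized strings."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)


def retrieve_case(template):
    """Stage 1: pick the stored case whose template is structurally closest."""
    return max(CASES, key=lambda case: jaccard(template, case[0]))


def resolve_entity(mention, vocabulary):
    """Stage 2: map a noisy mention to the closest known database value."""
    matches = get_close_matches(mention.lower(), vocabulary, n=1, cutoff=0.6)
    # Fall back to the raw mention when nothing in the vocabulary is close.
    return matches[0] if matches else mention


def answer(question, vocabulary):
    """Retrieve a SQL template, then fill its slots with resolved entities."""
    template, entities = mask_entities(question)
    _, sql = retrieve_case(template)
    for mention in entities:
        sql = sql.replace("[VALUE]", resolve_entity(mention, vocabulary), 1)
    return sql


if __name__ == "__main__":
    # The misspelled mention is resolved against the vocabulary in stage 2.
    print(answer('How many patients were diagnosed with "hypertenson"?', DIAGNOSES))
```

Because the stored cases are abstract templates, one case in this sketch covers every question with the same logical structure, and noisy terminology is handled separately in the entity-resolution step rather than by adding more demonstrations to the pool.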