Deploying large language models for clinical Text-to-SQL requires distinguishing two qualitatively different causes of output diversity: (i) input ambiguity, which should trigger clarification, and (ii) model instability, which should trigger human review. We propose CLUES, a framework that models Text-to-SQL as a two-stage process (interpretations → answers) and decomposes semantic uncertainty into an ambiguity score and an instability score. The instability score is computed via the Schur complement of a bipartite semantic graph matrix. Across AmbigQA/SituatedQA (gold interpretations) and a clinical Text-to-SQL benchmark (known interpretations), CLUES improves failure prediction over the state-of-the-art Kernel Language Entropy. In deployment settings, it remains competitive while providing a diagnostic decomposition that a single score cannot offer. The resulting uncertainty regimes map to targeted interventions: query refinement for ambiguity and model improvement for instability. The high-ambiguity/high-instability regime contains 51% of errors while covering only 25% of queries (roughly a twofold enrichment), enabling efficient triage.
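To make the Schur-complement step concrete, the following is a minimal sketch and not the CLUES implementation. It assumes the bipartite semantic graph is given as a weight matrix `W` linking interpretations (rows) to answer clusters (columns), and that the "graph matrix" is the graph Laplacian; both are assumptions, since the abstract does not specify the matrix or how the instability score is read off the result. The weights and the helper `schur_complement` are hypothetical.

```python
import numpy as np

def schur_complement(M, k):
    """Schur complement of the trailing block D in M = [[A, B], [C, D]],
    where A is the leading k x k block: returns A - B D^{-1} C."""
    A, B = M[:k, :k], M[:k, k:]
    C, D = M[k:, :k], M[k:, k:]
    # Solve D X = C instead of forming the inverse of D explicitly.
    return A - B @ np.linalg.solve(D, C)

# Hypothetical weights: 2 interpretations (rows) x 3 answer clusters (columns).
W = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.2, 0.8]])
m, n = W.shape

# Assumed graph matrix: Laplacian of the bipartite graph, with interpretation
# nodes first and answer nodes second.
L = np.block([[np.diag(W.sum(axis=1)), -W],
              [-W.T, np.diag(W.sum(axis=0))]])

# Eliminating the answer nodes leaves an m x m matrix over interpretations only.
S = schur_complement(L, m)
print(S)
```

Under these assumptions, `S` is again a graph Laplacian (symmetric, rows summing to zero), now describing how strongly interpretations are coupled through shared answers. Every answer cluster needs nonzero total weight so that the eliminated block is invertible.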