In LLM-based text-to-SQL systems, unanswerable and underspecified user queries can lead the model to produce not only incorrect text but also executable programs that yield misleading results or violate safety constraints, which poses a major barrier to safe deployment. Existing refusal strategies for such queries either rely on output-level instruction following, which is brittle because of model hallucinations, or estimate output uncertainty, which adds complexity and overhead. To address this challenge, we formalize safe refusal in text-to-SQL systems as an answerability-gating problem and propose LatentRefusal, a latent-signal refusal mechanism that predicts query answerability from the intermediate hidden activations of a large language model. We introduce the Tri-Residual Gated Encoder, a lightweight probing architecture that suppresses schema noise and amplifies the sparse, localized cues of question-schema mismatch that indicate unanswerability. Extensive empirical evaluations across diverse ambiguous and unanswerable settings, together with ablation studies and interpretability analyses, demonstrate the effectiveness of the proposed approach and show that LatentRefusal provides an attachable and efficient safety layer for text-to-SQL systems. Across four benchmarks, LatentRefusal improves average F1 to 88.5% on both backbones while adding approximately 2 ms of probe overhead.
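The answerability-gating idea can be illustrated with a minimal sketch. It is not the Tri-Residual Gated Encoder: it swaps in a plain logistic-regression probe over a single pooled hidden-state vector, and the names `LinearProbe`, `gated_generate`, the 0.5 threshold, and the toy weights are all hypothetical choices made for illustration. The point it shows is only the control flow: the probe scores the hidden state first, and the SQL generator is called only when the question is judged answerable.

```python
import math
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class LinearProbe:
    """Hypothetical stand-in for the probe: logistic regression over a pooled hidden state."""
    weights: Sequence[float]
    bias: float = 0.0

    def answerable_probability(self, hidden: Sequence[float]) -> float:
        """Return the probe's estimate that the question is answerable."""
        if len(hidden) != len(self.weights):
            raise ValueError("hidden state and probe weights differ in size")
        logit = sum(w * h for w, h in zip(self.weights, hidden)) + self.bias
        # Numerically stable sigmoid: never exponentiates a large positive value.
        if logit >= 0:
            return 1.0 / (1.0 + math.exp(-logit))
        e = math.exp(logit)
        return e / (1.0 + e)


def gated_generate(
    hidden: Sequence[float],
    probe: LinearProbe,
    generate_sql: Callable[[], str],
    threshold: float = 0.5,  # assumed decision threshold, not taken from the paper
) -> Optional[str]:
    """Return SQL only if the probe judges the question answerable; otherwise refuse (None)."""
    if probe.answerable_probability(hidden) < threshold:
        return None  # refuse before any SQL is decoded or executed
    return generate_sql()


if __name__ == "__main__":
    # Toy two-dimensional "hidden states"; real activations would come from an LLM layer.
    probe = LinearProbe(weights=[2.0, -1.0])
    print(gated_generate([1.5, 0.5], probe, lambda: "SELECT name FROM users"))
    print(gated_generate([-1.0, 1.0], probe, lambda: "SELECT name FROM users"))
```

Because the gate runs before generation, a refused query incurs only the cost of the probe's forward pass, which is the property the abstract describes as an attachable safety layer.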