Few-shot example retrieval is the dominant paradigm for grounding large language models (LLMs) in domain-specific text-to-SQL systems. However, the quality of the annotated example bank directly governs system accuracy, and expert annotation is prohibitively expensive. We formalize the active selection of these examples as a constrained experimental design problem over the intrinsic, low-dimensional manifold of semantic query embeddings. Unlike standard active learning, our setting poses three challenges: annotation reliability that varies from query to query (heteroscedasticity), strict diversity requirements across semantic topics (partition matroid constraints), and a true covariance structure of the embedding space that is unknown (kernel misspecification). To address these, we propose a stratified greedy algorithm that maximizes a heteroscedastic mutual information objective. We prove that this objective remains submodular and approximately monotonic on the intrinsic manifold, which yields a constant-factor approximation guarantee. We further establish a spectral bound showing that this guarantee degrades gracefully, rather than catastrophically, as the assumed surrogate kernel diverges from the true data-generating process. Empirical results show that the proposed strategy significantly reduces labeling effort while maintaining high text-to-SQL retrieval accuracy.
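To make the selection procedure concrete, the following is a minimal sketch of a stratified greedy selector, not the paper's implementation. It assumes a simplified surrogate objective, the Gaussian-process information gain $\tfrac{1}{2}\log\det(I + \Sigma_S^{-1/2} K_{SS} \Sigma_S^{-1/2})$ with per-query noise variances, which may differ from the exact heteroscedastic mutual information objective used in the paper. The RBF kernel, the function names (`rbf_kernel`, `info_gain`, `stratified_greedy`), and the per-topic `capacity` dictionary are all illustrative assumptions.

```python
import numpy as np


def rbf_kernel(X, lengthscale=1.0):
    """Surrogate RBF kernel on query embeddings (an assumed choice of kernel)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-0.5 * np.maximum(d2, 0.0) / lengthscale ** 2)


def info_gain(K, noise_var, S):
    """0.5 * logdet(I + D^{-1/2} K_SS D^{-1/2}), where D holds per-item noise variances."""
    if len(S) == 0:
        return 0.0
    idx = np.asarray(S)
    scale = 1.0 / np.sqrt(noise_var[idx])
    M = np.eye(len(idx)) + scale[:, None] * K[np.ix_(idx, idx)] * scale[None, :]
    _, logdet = np.linalg.slogdet(M)
    return 0.5 * logdet


def stratified_greedy(K, noise_var, groups, capacity, budget):
    """Greedy selection under a partition matroid: at most capacity[g] items per topic g."""
    selected = []
    used = {g: 0 for g in capacity}
    current = 0.0
    while len(selected) < budget:
        best, best_gain = None, -np.inf
        for i in range(K.shape[0]):
            g = int(groups[i])
            # Skip items already chosen or whose topic quota is exhausted.
            if i in selected or used[g] >= capacity[g]:
                continue
            gain = info_gain(K, noise_var, selected + [i]) - current
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:  # every remaining item is blocked by a topic quota
            break
        selected.append(best)
        used[int(groups[best])] += 1
        current += best_gain
    return selected
```

In this sketch a noisy query (large `noise_var[i]`) contributes less information and is therefore selected later, while the per-topic quota forces coverage across semantic topics. The objective is recomputed from scratch for every candidate to keep the code short; a practical version would update a Cholesky factor incrementally.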