For industrial-scale text-to-SQL, supplying the entire database schema to Large Language Models (LLMs) is impractical because of context window limits and irrelevant noise. Schema linking, which filters the schema down to a relevant subset, is therefore critical. However, existing methods incur prohibitive costs, struggle to balance recall against noise, and scale poorly to large databases. We present \textbf{AutoLink}, an autonomous agent framework that reformulates schema linking as an iterative, agent-driven process. Guided by an LLM, AutoLink dynamically explores and expands the linked schema subset, progressively identifying the necessary schema components without ever taking the full database schema as input. In our experiments, AutoLink achieves state-of-the-art strict schema linking recall of \textbf{97.4\%} on Bird-Dev and \textbf{91.2\%} on Spider-2.0-Lite, together with competitive execution accuracy (EX): \textbf{68.7\%} on Bird-Dev (better than CHESS) and \textbf{34.9\%} on Spider-2.0-Lite (ranking 2nd on the official leaderboard). Crucially, AutoLink exhibits \textbf{exceptional scalability}, \textbf{maintaining high recall}, \textbf{efficient token consumption}, and \textbf{robust execution accuracy} on large schemas (e.g., over 3,000 columns) where existing methods degrade severely, making it a highly scalable, high-recall schema-linking solution for industrial text-to-SQL systems.