Realistic text-to-SQL workflows often require joining multiple tables. As a result, accurately retrieving the relevant set of tables becomes a key bottleneck for end-to-end performance. We study an open-book setting where queries must be answered over large, heterogeneous table collections pooled from many sources, without clean scoping signals such as database identifiers. Here, dense retrieval (DR) achieves high recall but returns many distractors, while join-aware alternatives often rely on extra assumptions and/or incur high inference overhead. We propose CORE-T, a scalable, training-free framework that enriches tables with LLM-generated purpose metadata and pre-computes a lightweight table-compatibility cache. At inference time, DR returns top-K candidates; a single LLM call selects a coherent, joinable subset, and a simple additive adjustment step restores strongly compatible tables. Across Bird, Spider, and MMQA, CORE-T improves table-selection F1 by up to 22.7 points while retrieving up to 42% fewer tables, raises multi-table execution accuracy by up to 5.0 points on Bird and 6.9 points on MMQA, and uses 4-5x fewer tokens than LLM-intensive baselines.
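The three-stage inference pipeline described above (dense retrieval of top-K candidates, a single LLM selection call, then an additive compatibility adjustment) can be sketched as follows. This is a minimal illustration, not the authors' implementation: all names (`dense_topk`, `llm_select_subset`, `COMPAT`, the threshold `tau`, and the toy tables) are hypothetical stand-ins, and the retrieval and LLM steps are stubbed.

```python
# Hypothetical sketch of the CORE-T inference flow. All identifiers and
# values here are illustrative assumptions, not the paper's actual API.

# Pre-computed, lightweight table-compatibility cache: pair -> score.
COMPAT = {
    ("orders", "customers"): 0.92,
    ("orders", "products"): 0.88,
    ("customers", "logs"): 0.10,
}

def compat(a: str, b: str) -> float:
    """Symmetric lookup in the compatibility cache (0.0 if unseen)."""
    return COMPAT.get((a, b), COMPAT.get((b, a), 0.0))

def dense_topk(query: str, k: int = 4) -> list[str]:
    """Stand-in for dense retrieval: return top-K candidate tables."""
    return ["orders", "customers", "products", "logs"][:k]

def llm_select_subset(query: str, candidates: list[str]) -> list[str]:
    """Stand-in for the single LLM call that picks a coherent,
    joinable subset from the candidates (here: a fixed toy choice)."""
    return [t for t in candidates if t != "logs"][:2]

def additive_adjustment(selected: list[str],
                        candidates: list[str],
                        tau: float = 0.8) -> list[str]:
    """Restore dropped candidates that are strongly compatible
    (score >= tau) with at least one already-selected table."""
    restored = list(selected)
    for t in candidates:
        if t not in restored and max(compat(t, s) for s in selected) >= tau:
            restored.append(t)
    return restored

def core_t(query: str, k: int = 4) -> list[str]:
    candidates = dense_topk(query, k)        # 1. DR top-K
    selected = llm_select_subset(query, candidates)  # 2. one LLM call
    return additive_adjustment(selected, candidates)  # 3. restore step
```

With the toy cache above, the adjustment step restores `products` (compatibility 0.88 with `orders`) after the selection stub keeps only `orders` and `customers`, while `logs` stays excluded; this mirrors how the additive step recovers strongly joinable tables without re-invoking the LLM.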