Realistic text-to-SQL workflows often require joining multiple tables. As a result, accurately retrieving the relevant set of tables becomes a key bottleneck for end-to-end performance. We study an open-book setting where queries must be answered over large, heterogeneous table collections pooled from many sources, without clean scoping signals such as database identifiers. Here, dense retrieval (DR) achieves high recall but returns many distractors, while join-aware alternatives often rely on extra assumptions, incur high inference overhead, or both. We propose CORE-T, a scalable, training-free framework that enriches tables with LLM-generated purpose metadata and pre-computes a lightweight table-compatibility cache. At inference time, DR returns the top-K candidates, a single LLM call selects a coherent, joinable subset, and a two-step additive adjustment stage restores strongly compatible tables. Across Bird, Spider, MMQA, and Beaver, CORE-T improves over DR by up to 22.7 points in table-selection F1 while returning up to 40% fewer tables, improves multi-table execution accuracy by up to 24.4 points, and uses 1.64x to 4.20x fewer total selection tokens than LLM-intensive baselines.
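The inference-time flow described above (DR top-K, one LLM selection call, then an additive restore step driven by the pre-computed compatibility cache) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the CORE-T implementation: the function names, the `threshold` value, the pairwise-score cache layout, and the stubbed `llm_select` callable are hypothetical, and the two-step adjustment is collapsed into a single simplified pass because the abstract does not spell out the two steps.

```python
from typing import Callable, Dict, FrozenSet, List

# Hypothetical cache layout: unordered table pair -> compatibility score in [0, 1].
CompatCache = Dict[FrozenSet[str], float]


def top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Stand-in for dense retrieval: the k highest-scoring table names."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def restore_compatible(
    candidates: List[str],
    selected: List[str],
    cache: CompatCache,
    threshold: float,
) -> List[str]:
    """Add back candidates whose cached compatibility with any table the
    selector chose meets the (assumed) threshold. Additive only: nothing
    already selected is removed, and restored tables do not chain."""
    restored = list(selected)
    for table in candidates:
        if table in restored:
            continue
        if any(cache.get(frozenset((table, s)), 0.0) >= threshold for s in selected):
            restored.append(table)
    return restored


def select_tables(
    scores: Dict[str, float],
    k: int,
    llm_select: Callable[[List[str]], List[str]],
    cache: CompatCache,
    threshold: float = 0.8,  # illustrative value, not from the paper
) -> List[str]:
    candidates = top_k(scores, k)
    # Single selection call; drop any names the selector invents.
    selected = [t for t in llm_select(candidates) if t in candidates]
    return restore_compatible(candidates, selected, cache, threshold)


# Toy data: retrieval scores, a stubbed selector, and a tiny cache.
scores = {"orders": 0.9, "customers": 0.8, "weather": 0.7,
          "order_items": 0.6, "stations": 0.1}
cache = {
    frozenset(("orders", "order_items")): 0.95,
    frozenset(("weather", "stations")): 0.90,
}
stub_selector = lambda cands: ["orders", "customers"]  # replaces a real LLM call

result = select_tables(scores, k=4, llm_select=stub_selector, cache=cache)
print(result)  # ['orders', 'customers', 'order_items']
```

In this toy run the selector drops `order_items`, and the restore step brings it back because its cached compatibility with `orders` exceeds the threshold. The distractor `weather` stays excluded: its only strong partner, `stations`, was neither retrieved nor selected.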