Existing Text-to-SQL generators require the entire schema to be encoded together with the user text. This is expensive or impractical for large databases with tens of thousands of columns. Standard dense retrieval techniques are inadequate for schema subsetting of a large structured database, where correct retrieval semantics demand that we rank sets of schema elements rather than individual elements. In response, we propose a two-stage process for effective coverage during retrieval. First, we instruct an LLM to hallucinate a minimal DB schema that it deems adequate to answer the query. Second, we use the hallucinated schema to retrieve a subset of the actual schema by composing the results of multiple dense retrievals. Remarkably, hallucination, generally considered a nuisance, turns out to be useful as a bridging mechanism. Since no benchmarks exist for schema subsetting on large databases, we introduce three. Two semi-synthetic datasets are derived from the union of schemas in two well-known datasets, SPIDER and BIRD, resulting in 4502 and 798 schema elements respectively. A real-life benchmark called SocialDB is sourced from an actual large data warehouse comprising 17844 schema elements. We show that our method leads to significantly higher recall than state-of-the-art retrieval-based augmentation methods.
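The two-stage idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the method as implemented in the paper: the hallucinated schema is hard-coded in place of a real LLM call, `embed` is a toy bag-of-tokens stand-in for a dense encoder, the schema element names are hypothetical, and "composing" the retrievals is taken to mean a simple union of the top-k hits for each hallucinated element.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a dense encoder: a bag of lowercase alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_schema_subset(hallucinated, actual, k=1):
    """Run one retrieval per hallucinated element and union the top-k hits.

    The union is an assumed composition rule; it yields a set of real
    schema elements that jointly cover the hallucinated schema.
    """
    selected = set()
    for probe in hallucinated:
        query = embed(probe)
        ranked = sorted(actual, key=lambda e: cosine(query, embed(e)), reverse=True)
        selected.update(ranked[:k])
    return selected


# Hypothetical actual schema, written as table.column names.
actual_schema = [
    "customer.full_name",
    "customer.signup_date",
    "orders.total_amount",
    "orders.customer_id",
    "warehouse.location",
]

# Stage 1 (assumed output): what an LLM might hallucinate for the question
# "How much has each customer spent in total?"
hallucinated_schema = ["customer.name", "orders.amount"]

# Stage 2: map each hallucinated element onto the real schema.
subset = retrieve_schema_subset(hallucinated_schema, actual_schema, k=1)
print(sorted(subset))
```

The hallucinated names do not exist in the real schema, yet each one lands close to a real element, which is the bridging role described above. A larger `k` would trade a bigger retrieved subset for higher recall.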