The Case for Text-to-SQL Friendly Logical Database Design

Logical database design has traditionally optimized schemas (tables, columns, keys, constraints, and views) for correctness, integrity, and human-written application queries. LLM-based Text-to-SQL changes the consumer: the schema is now often read as text by a language model, so design choices that preserve database semantics can still change SQL-generation accuracy. We argue that this creates a new design objective alongside the classical ones: LLM-friendly logical database design, the property that a schema is easy for a language model to map from natural language to correct SQL. We treat this property as the optimization target of this paper. We instantiate the objective with three semantics-preserving schema transformations that repurpose classical schema-design ideas: schema abstraction (+A: logical views that materialize recurring join paths), schema partitioning (+P: workload-aware logical partitions that prune irrelevant context), and schema renaming (+R: descriptive identifiers that improve downstream column linking and predicate construction). The three operators compose, and each preserves the underlying database semantics. When historical question-SQL pairs are available, they guide both partitioning and abstraction; in zero-shot settings, renaming applies directly, and abstraction falls back to an ad-hoc per-question variant. We evaluate the resulting schemas on BIRD-Union and Spider-Union across multiple Text-to-SQL pipelines and language model backbones, and observe gains of up to 4.2% in execution accuracy. The best transformation varies modestly across pipelines and models, but the full +A+P+R combination improves accuracy consistently, and multiple operator combinations are competitive on each pipeline. These results show that LLM-friendly logical design is a practical and underexplored database-side optimization target, complementary to existing Text-to-SQL pipelines.
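To make the three operators concrete, the following is a minimal sketch using Python's built-in `sqlite3` module. It is an illustration under stated assumptions, not the implementation evaluated in the paper: the toy tables `cust` and `ord`, the view `customer_orders`, and the helpers `build_demo` and `schema_prompt` are all hypothetical names chosen for this example. The view combines abstraction (+A) and renaming (+R) over one join path, and `schema_prompt` mimics partitioning (+P) by emitting only the selected schema objects as text.

```python
import sqlite3
from typing import Iterable

# Query against the original, tersely named base tables.
BASE_QUERY = (
    "SELECT c.nm, SUM(o.amt) FROM cust AS c "
    "JOIN ord AS o ON o.cid = c.cid GROUP BY c.nm ORDER BY c.nm"
)
# Equivalent query against the renamed, pre-joined view.
VIEW_QUERY = (
    "SELECT customer_name, SUM(order_amount) FROM customer_orders "
    "GROUP BY customer_name ORDER BY customer_name"
)


def build_demo() -> sqlite3.Connection:
    """Create a toy database (hypothetical schema) plus one +A/+R view."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE cust (cid INTEGER PRIMARY KEY, nm TEXT);
        CREATE TABLE ord (
            oid INTEGER PRIMARY KEY,
            cid INTEGER REFERENCES cust(cid),
            amt REAL
        );
        INSERT INTO cust VALUES (1, 'Ada'), (2, 'Grace');
        INSERT INTO ord VALUES (10, 1, 20.0), (11, 1, 5.0), (12, 2, 7.5);

        -- Abstraction (+A): the view packages the cust-ord join path.
        -- Renaming (+R): the view exposes descriptive column names.
        CREATE VIEW customer_orders AS
        SELECT c.cid AS customer_id, c.nm AS customer_name,
               o.oid AS order_id, o.amt AS order_amount
        FROM cust AS c JOIN ord AS o ON o.cid = c.cid;
        """
    )
    return conn


def schema_prompt(conn: sqlite3.Connection, names: Iterable[str]) -> str:
    """Partitioning (+P), simplified: return DDL text for selected objects only."""
    wanted = sorted(set(names))
    placeholders = ",".join("?" for _ in wanted)
    rows = conn.execute(
        f"SELECT sql FROM sqlite_master WHERE name IN ({placeholders}) ORDER BY name",
        wanted,
    ).fetchall()
    return "\n".join(sql for (sql,) in rows)


if __name__ == "__main__":
    conn = build_demo()
    # Semantics preservation: both formulations return the same rows.
    print(conn.execute(BASE_QUERY).fetchall() == conn.execute(VIEW_QUERY).fetchall())
    # Only the view's definition is shown to the model in this partition.
    print(schema_prompt(conn, {"customer_orders"}))
```

In this sketch the base tables are left untouched, so any query over them still works; the view only adds a second, more readable surface for the model to target. A workload-aware partitioner would choose the object set passed to `schema_prompt` from historical question-SQL pairs, which this example does not attempt.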
