EGREFINE: An Execution-Grounded Optimization Framework for Text-to-SQL Schema Refinement

Text-to-SQL enables non-expert users to query databases in natural language, yet real-world schemas often suffer from ambiguous, abbreviated, or inconsistent naming conventions that degrade model accuracy. Existing approaches treat schemas as fixed and address errors downstream. In this paper, we frame schema refinement as a constrained optimization problem: find a renaming function that maximizes downstream Text-to-SQL execution accuracy while preserving query equivalence through database views. We analyze the computational hardness of this problem, which motivates a column-wise greedy decomposition, and instantiate it as EGRefine: a four-phase pipeline that screens ambiguous columns, generates context-aware candidate names, verifies them through execution-grounded feedback, and materializes the result as non-destructive SQL views. The pipeline carries two structural properties: column-local non-degradation, ensured by the conservative selection rule in the verification phase, and database-level query equivalence, ensured by the view-based materialization phase. Together they make the resulting refinement safe by construction at the column level, with cross-column and prompt-level interactions handled empirically rather than analytically. Across controlled schema-degradation, real-world, and enterprise benchmarks, EGRefine recovers accuracy lost to schema naming noise where applicable and correctly abstains where the underlying task exceeds current Text-to-SQL capabilities, with refined schemas transferring across model families to enable refine-once, serve-many-models deployment. Code and data are publicly available at https://github.com/ai-jiaqian/EGRefine.
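The two structural properties named above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes SQLite as the engine, and the helpers `select_name` and `materialize_view`, the `emp` table, and the `scores` lookup (standing in for measured execution accuracy of a Text-to-SQL model) are all hypothetical names introduced for illustration. The sketch shows a conservative selection rule that keeps the original column name unless a candidate strictly improves the score, followed by a view that exposes the renamed column without altering the base table.

```python
import sqlite3


def select_name(original, candidates, accuracy):
    """Conservative rule (assumed form): keep the original name unless a
    candidate scores strictly higher under the supplied accuracy function."""
    best, best_acc = original, accuracy(original)
    for cand in candidates:
        acc = accuracy(cand)
        if acc > best_acc:
            best, best_acc = cand, acc
    return best


def materialize_view(conn, table, renames, view_name):
    """Expose `table` through a view whose columns are renamed per `renames`.
    The base table is left untouched, so existing queries keep working."""
    cols = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
    select_list = ", ".join(f'"{c}" AS "{renames.get(c, c)}"' for c in cols)
    conn.execute(
        f'CREATE VIEW "{view_name}" AS SELECT {select_list} FROM "{table}"'
    )


# Toy database with an abbreviated column name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (id INTEGER, nm TEXT, sal REAL)")
conn.execute("INSERT INTO emp VALUES (1, 'Ada', 100.0)")

# Hypothetical per-name scores; a real run would measure execution accuracy.
scores = {"sal": 0.5, "salary": 0.8, "s": 0.3}

chosen = select_name("sal", ["salary", "s"], scores.__getitem__)
unchanged = select_name("sal", ["s"], scores.__getitem__)

materialize_view(conn, "emp", {"sal": chosen}, "emp_refined")
rows_view = conn.execute("SELECT id, salary FROM emp_refined").fetchall()
rows_base = conn.execute("SELECT id, sal FROM emp").fetchall()
```

In this toy setting, a query against the view returns the same rows as the corresponding query against the base table, and a column with no strictly better candidate keeps its original name. Cross-column interactions, which the abstract says are handled empirically, are outside what this sketch covers.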
