Benchmarking Text-to-Python against Text-to-SQL: The Impact of Explicit Logic and Ambiguity

While Text-to-SQL remains the dominant approach for database interaction, real-world analytics increasingly require the flexibility of general-purpose programming languages such as Python (often with libraries such as Pandas) to manage file-based data and complex analytical workflows. Despite this growing need, the reliability of Text-to-Python in core data retrieval remains underexplored relative to the mature SQL ecosystem. To address this gap, we introduce BIRD-Python, a benchmark designed for cross-paradigm evaluation. We systematically refined the original dataset to reduce annotation noise and align execution semantics, establishing a consistent baseline for comparison. Our analysis reveals a fundamental paradigmatic divergence: whereas SQL leverages implicit DBMS behaviors through its declarative structure, Python requires explicit procedural logic, making it highly sensitive to underspecified user intent. To mitigate this challenge, we propose the Logic Completion Framework (LCF), which resolves ambiguity by incorporating latent domain knowledge into the generation process. Experimental results show that (1) performance differences primarily stem from missing domain context rather than inherent limitations in code generation, and (2) when these gaps are addressed, Text-to-Python achieves performance parity with Text-to-SQL. These findings establish Python as a viable foundation for analytical agents, provided that systems effectively ground ambiguous natural language inputs in executable logical specifications. Resources are available at https://anonymous.4open.science/r/Bird-Python-43B7/.
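The divergence between implicit DBMS behavior and explicit procedural logic can be illustrated with a small sketch. It is not drawn from BIRD-Python or LCF: the table, the toy data, and the helper name `python_avg` are hypothetical, and SQLite is assumed as the DBMS only because it ships with Python. The sketch shows two decisions that SQL makes silently (skipping NULLs in `AVG`, truncating integer division) and that Python code must state explicitly.

```python
import sqlite3

# Hypothetical toy data: one score is missing (NULL in SQL, None in Python).
rows = [("a", 90), ("b", None), ("c", 70)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?)", rows)

# Declarative: AVG skips NULL rows without being asked to.
sql_avg = conn.execute("SELECT AVG(score) FROM scores").fetchone()[0]


def python_avg(records):
    """Average the non-missing scores; the NULL policy is written out by hand."""
    values = [score for _, score in records if score is not None]
    return sum(values) / len(values) if values else None


# Procedural: the same intent only works once missing values are handled.
py_avg = python_avg(rows)

# Integer division is another implicit rule: SQLite truncates, Python does not.
sql_ratio = conn.execute("SELECT 5 / 2").fetchone()[0]
py_ratio = 5 / 2

print(sql_avg, py_avg, sql_ratio, py_ratio)
```

If a question such as "what is the average score?" does not say how missing values or integer ratios should be treated, the SQL query inherits an answer from the engine, while the Python program has to encode a choice. This is one plausible reading of why underspecified intent affects the two paradigms differently.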
