Evaluating Retrieval-Augmented Generation Variants for Natural Language-Based SQL and API Call Generation

Enterprise systems increasingly require natural language interfaces that can translate user requests into structured operations such as SQL queries and REST API calls. While large language models (LLMs) show promise for code generation [Chen et al., 2021; Huynh and Lin, 2025], their effectiveness in domain-specific enterprise contexts remains underexplored, particularly when retrieval and modification tasks must be handled jointly. This paper presents a comprehensive evaluation of three retrieval-augmented generation (RAG) variants [Lewis et al., 2021], namely standard RAG, Self-RAG [Asai et al., 2024], and CoRAG [Wang et al., 2025], across SQL query generation, REST API call generation, and a combined task requiring dynamic task classification. Using SAP Transactional Banking as a realistic enterprise use case, we construct a novel test dataset covering both modalities and evaluate 18 experimental configurations under database-only, API-only, and hybrid documentation contexts. Results demonstrate that retrieval is essential: without it, exact match accuracy is 0% across all tasks, whereas retrieval yields substantial gains in execution accuracy (up to 79.30%) and component match accuracy (up to 78.86%). Critically, CoRAG proves most robust in hybrid documentation settings, achieving statistically significant improvements in the combined task (10.29% exact match vs. 7.45% for standard RAG), driven primarily by superior SQL generation performance (15.32% vs. 11.56%). Our findings identify retrieval-policy design as a key determinant of the quality of production-grade natural language interfaces, showing that iterative query decomposition outperforms both top-k retrieval and binary relevance filtering under documentation heterogeneity.
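To make the gap between exact match and component match concrete, the following is a minimal Python sketch of how such metrics are commonly computed for SQL. It is an illustration under stated assumptions, not the evaluation code used in this work: the function names, the normalization rules, and the clause list are all hypothetical, and a real evaluator would parse the query rather than split it with regular expressions.

```python
import re

# Hypothetical clause keywords used for the component-level comparison.
CLAUSES = ("select", "from", "where", "group by", "order by")


def normalize(query: str) -> str:
    """Collapse whitespace, drop a trailing semicolon, and lowercase.

    Simplification: lowercasing also alters string literals.
    """
    collapsed = re.sub(r"\s+", " ", query).strip()
    return collapsed.rstrip(";").strip().lower()


def exact_match(predicted: str, reference: str) -> bool:
    """True only if the normalized queries are identical strings."""
    return normalize(predicted) == normalize(reference)


def split_clauses(query: str) -> dict[str, str]:
    """Map each clause keyword to the text that follows it.

    Simplification: keywords inside literals or subqueries are not handled.
    """
    pattern = r"\b(" + "|".join(CLAUSES) + r")\b"
    parts = re.split(pattern, normalize(query))
    # parts alternates: [prefix, keyword, body, keyword, body, ...]
    return {parts[i]: parts[i + 1].strip() for i in range(1, len(parts) - 1, 2)}


def component_match(predicted: str, reference: str) -> float:
    """Fraction of reference clauses reproduced verbatim in the prediction."""
    pred, ref = split_clauses(predicted), split_clauses(reference)
    if not ref:
        return 0.0
    hits = sum(1 for clause, body in ref.items() if pred.get(clause) == body)
    return hits / len(ref)


if __name__ == "__main__":
    reference = "SELECT id FROM accounts WHERE balance > 0"
    predicted = "select id from accounts where balance > 100;"
    print(exact_match(predicted, reference))      # whole-query comparison
    print(component_match(predicted, reference))  # clause-level partial credit
```

In this sketch a prediction that gets the SELECT and FROM clauses right but the WHERE condition wrong fails exact match while still earning partial component credit, which is one way a system can score high on component match and low on exact match at the same time.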
