Plan with Code: Comparing approaches for robust NL to DSL generation

Planning in code is considered a more reliable approach for many orchestration tasks. Code is more tractable than steps generated in natural language, and it makes more complex sequences easier to support by abstracting deterministic logic into functions. It also allows incorrect function names to be caught by parsing checks run on the code. Progress in code generation, however, remains limited to general-purpose languages such as C, C++, and Python. LLMs continue to struggle with Domain Specific Languages (DSLs), where the custom function names that typically make up a plan lead to higher hallucination rates and more syntax errors. Moreover, keeping LLMs up to date with newer function names is difficult. This poses a challenge for scenarios such as task planning over a large number of APIs, since the plan is represented as a DSL with custom API names. In this paper, we focus on workflow automation in the RPA (Robotic Process Automation) domain as a special case of task planning. We present optimizations for using Retrieval Augmented Generation (RAG) with LLMs for DSL generation, along with an ablation study comparing these strategies with a fine-tuned model. Our results show that the fine-tuned model scores best on the code similarity metric. However, with our optimizations, the RAG approach matches that quality for in-domain API names in the test set. Additionally, it offers a significant advantage for out-of-domain or unseen API names, outperforming the fine-tuned model on the similarity metric by 7 points.
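To illustrate the kind of parsing check mentioned above, the following is a minimal sketch, not the method used in this paper. It assumes the generated plan happens to be valid Python-like syntax so that Python's standard `ast` module can parse it; the API catalog `KNOWN_APIS`, the helper `find_unknown_calls`, and the sample plan are all hypothetical names introduced only for illustration.

```python
import ast

# Hypothetical API catalog; these names are illustrative assumptions,
# not API names taken from the paper.
KNOWN_APIS = {"GetEmails", "SendEmail", "CreateFile"}


def find_unknown_calls(plan_source, known_apis):
    """Return the sorted names of called functions missing from known_apis.

    Raises SyntaxError if the plan does not parse, which covers the
    syntax-error case as well.
    """
    tree = ast.parse(plan_source)
    unknown = set()
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if isinstance(func, ast.Name):
            name = func.id        # plain call: SendEmail(...)
        elif isinstance(func, ast.Attribute):
            name = func.attr      # dotted call: Outlook.SendEmail(...)
        else:
            continue              # e.g. a call on a subscript or lambda
        if name not in known_apis:
            unknown.add(name)
    return sorted(unknown)


# A hypothetical generated plan with one function name absent from the catalog.
plan = """
emails = GetEmails(folder="Inbox")
for e in emails:
    SaveAttachment(e)
    SendEmail(to="a@example.com", body=e)
"""

print(find_unknown_calls(plan, KNOWN_APIS))
```

Under these assumptions, any name the check reports is a candidate hallucination that could be rejected or sent back for regeneration. A real DSL would need its own parser in place of `ast`.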
