Despite recent success in large language model (LLM) reasoning, LLMs still struggle with hierarchical multi-step reasoning tasks such as generating complex programs. For these tasks, humans often start with a high-level algorithmic design and implement each part gradually. We introduce Parsel, a framework enabling automatic implementation and validation of complex algorithms with code LLMs. With Parsel, we automatically decompose algorithmic tasks into hierarchical natural language function descriptions and then search over combinations of possible function implementations using tests. We show that Parsel can be used across domains requiring hierarchical reasoning, including program synthesis and robotic planning. We find that, using Parsel, LLMs solve more competition-level problems in the APPS dataset, resulting in pass rates over 75% higher than prior results from directly sampling AlphaCode and Codex, while often using a smaller sample budget. Moreover, with automatically generated tests, we find that Parsel can improve the state-of-the-art pass@1 performance on HumanEval from 67% to 85%. We also find that LLM-generated robotic plans using Parsel are more than twice as likely to be considered accurate as directly generated plans. Lastly, we explore how Parsel addresses LLM limitations and discuss how Parsel may be useful for human programmers. We release our code at https://github.com/ezelikman/parsel.
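The idea of searching over combinations of candidate function implementations using tests can be illustrated with a small sketch. This is not Parsel's actual code or API: the function names (`square`, `sum_of_squares`), the hard-coded candidate strings (standing in for samples from a code LLM), the tests, and the helpers `passes_tests` and `search` are all hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical candidates for two function descriptions in a small hierarchy.
# In a real system these would be sampled from a code LLM; here they are
# hard-coded strings so the example is self-contained.
CANDIDATES = {
    "square": [
        "def square(x):\n    return x + x",  # incorrect candidate
        "def square(x):\n    return x * x",  # correct candidate
    ],
    "sum_of_squares": [
        "def sum_of_squares(xs):\n    return sum(xs)",  # ignores the helper
        "def sum_of_squares(xs):\n    return sum(square(x) for x in xs)",
    ],
}

# Hypothetical tests on the top-level function: (argument, expected result).
TESTS = [([1, 2, 3], 14), ([], 0), ([4], 16)]


def passes_tests(sources, entry_point, tests):
    """Define every function in one shared namespace, then run the tests."""
    namespace = {}
    try:
        for src in sources:
            exec(src, namespace)
        return all(namespace[entry_point](arg) == expected
                   for arg, expected in tests)
    except Exception:
        # A candidate that fails to compile or crashes counts as a failure.
        return False


def search(candidates, entry_point, tests):
    """Try each combination of candidates; return the first that passes."""
    names = list(candidates)
    for combo in product(*(candidates[name] for name in names)):
        if passes_tests(combo, entry_point, tests):
            return dict(zip(names, combo))
    return None


solution = search(CANDIDATES, "sum_of_squares", TESTS)
```

Here `solution` maps each function name to the source of the implementation that took part in the first passing combination. The sketch enumerates combinations exhaustively, which grows exponentially with the number of functions, so it is only meant to convey the shape of the search, not an efficient strategy.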