When writing programs, people can tackle a new complex task by decomposing it into smaller, more familiar subtasks. While it is difficult to measure whether neural program synthesis methods have similar capabilities, we can measure whether they compositionally generalize, that is, whether a model that has been trained on the simpler subtasks is subsequently able to solve more complex tasks. In this paper, we characterize several different forms of compositional generalization that are desirable in program synthesis, forming a meta-benchmark which we use to create generalization tasks for two popular datasets, RobustFill and DeepCoder. We then propose ExeDec, a novel decomposition-based synthesis strategy that predicts execution subgoals to solve problems step by step, informed by the program's execution state at each step. When used with Transformer models trained from scratch, ExeDec achieves better synthesis performance and greatly improved compositional generalization compared to baselines. Finally, we use our benchmarks to demonstrate that LLMs struggle to compositionally generalize when asked to do programming-by-example in a few-shot setting, but an ExeDec-style prompting approach can improve their generalization ability and overall performance.
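To make the decomposition idea concrete, here is a minimal toy sketch of an ExeDec-style loop for programming-by-example on strings. It is not the paper's implementation: the three-op DSL, the task, and the oracle standing in for the learned subgoal-prediction model are all illustrative assumptions. At each step, the loop predicts execution subgoals (intermediate values for every example), searches for a single DSL operation whose execution matches them, and then continues from the new intermediate states.

```python
# Toy sketch of step-by-step synthesis via predicted execution subgoals.
# The DSL, task, and oracle below are hypothetical illustrations,
# not the ExeDec paper's actual models or datasets.

OPS = {
    "upper": str.upper,
    "strip": str.strip,
    "first3": lambda s: s[:3],
}

def synthesize_stepwise(inputs, outputs, predict_subgoal, max_steps=5):
    """Greedily build a program: at each step, predict execution
    subgoals for all examples, search for one DSL op whose execution
    matches them, then recurse from the new intermediate states."""
    state, program = list(inputs), []
    for _ in range(max_steps):
        if state == list(outputs):
            return program
        subgoals = predict_subgoal(state, outputs)
        for name, op in OPS.items():
            if [op(s) for s in state] == subgoals:
                program.append(name)
                state = subgoals
                break
        else:
            return None  # no single op reaches the predicted subgoals
    return program if state == list(outputs) else None

# In ExeDec the subgoal predictor is a learned model; here an oracle
# stands in for it, returning the next intermediate execution state.
def oracle(state, outputs):
    stripped = [s.strip() for s in state]
    return stripped if stripped != state else [s.upper() for s in state]

prog = synthesize_stepwise(["  ab ", " cd"], ["AB", "CD"], oracle)
print(prog)  # prints ['strip', 'upper']
```

The key design choice mirrored here is that search is conditioned on concrete execution states rather than on the partial program text alone, which is what lets each step reduce to a simpler, more familiar subtask.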