The First Prompt Counts the Most! An Evaluation of Large Language Models on Iterative Example-based Code Generation

The capabilities of Large Language Models (LLMs) in code generation, particularly for implementing target functionalities from natural language descriptions, have been extensively studied. As an alternative to natural language, input-output examples (I/O examples) provide an accessible, unambiguous, and flexible way to describe functionalities, but their diversity, sparseness, and incompleteness also pose challenges for understanding and implementing requirements. Generating code from I/O examples (i.e., example-based code generation) therefore offers a new perspective, allowing us to evaluate LLMs' capability to infer target functionalities from limited information and to process requirements in a new form. However, the performance of LLMs in example-based code generation remains largely unexplored. To fill this gap, this paper presents the first comprehensive study on example-based code generation using LLMs. To address the incorrectness caused by the incompleteness of I/O examples, we adopt an iterative evaluation framework and formalize the objective of example-based code generation as two sequential sub-objectives: generating code that conforms to the given examples, and generating code that successfully implements the target functionalities from (iteratively) given examples. We assess six state-of-the-art LLMs using a new benchmark of 168 diverse target functionalities. The results demonstrate that when requirements are described using iterative I/O examples rather than natural language, the LLMs' scores decrease by over 60%, indicating that example-based code generation remains challenging for the evaluated LLMs. More interestingly, the vast majority (even over 95%) of successfully implemented functionalities are achieved in the first round of iterations, suggesting that the LLMs struggle to effectively utilize the iteratively supplemented requirements.
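
The two sequential sub-objectives and the iterative supply of examples can be illustrated with a minimal sketch. This is not the paper's actual framework: the names `conforms`, `find_counterexample`, `iterative_eval`, and `toy_generate` are hypothetical, the model is replaced by a toy stand-in, and the stopping rules (end the run when the code violates a given example, cap the run at `max_rounds`) are assumptions made for illustration only.

```python
from typing import Callable, List, Optional, Tuple

# An I/O example: (tuple of input arguments, expected output).
Example = Tuple[tuple, object]


def conforms(candidate: Callable, examples: List[Example]) -> bool:
    """Sub-objective 1: the code reproduces every example it was given."""
    for args, expected in examples:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            return False
    return True


def find_counterexample(candidate: Callable, reference: Callable,
                        probe_inputs: List[tuple]) -> Optional[Example]:
    """Return the first probe input on which candidate and reference differ."""
    for args in probe_inputs:
        expected = reference(*args)
        try:
            if candidate(*args) != expected:
                return (args, expected)
        except Exception:
            return (args, expected)
    return None


def iterative_eval(generate: Callable[[List[Example]], Callable],
                   reference: Callable,
                   initial_examples: List[Example],
                   probe_inputs: List[tuple],
                   max_rounds: int = 5) -> dict:
    """Sub-objective 2: reach the target functionality from iteratively given examples.

    Assumption: a run ends as soon as the generated code violates a given example.
    """
    examples = list(initial_examples)
    for round_no in range(1, max_rounds + 1):
        candidate = generate(examples)
        if not conforms(candidate, examples):
            return {"solved": False, "round": round_no}
        counterexample = find_counterexample(candidate, reference, probe_inputs)
        if counterexample is None:
            return {"solved": True, "round": round_no}
        # Supplement the requirement with a new I/O example and try again.
        examples.append(counterexample)
    return {"solved": False, "round": max_rounds}


def toy_generate(examples: List[Example]) -> Callable:
    """Hypothetical stand-in for an LLM: guesses identity until it sees a negative input."""
    if all(args[0] >= 0 for args, _ in examples):
        return lambda x: x
    return lambda x: abs(x)


# Target functionality: absolute value, described initially by one example.
result = iterative_eval(toy_generate, abs, [((2,), 2)], [(3,), (-4,), (0,)])
print(result)
```

In this toy run the first candidate conforms to the single given example but fails on a negative probe input, so a second example is supplied before the target is reached. The abstract reports that the evaluated LLMs rarely benefit from such later rounds.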
