AI foundation models can produce a wide range of responses to a single prompt, a capability that is valuable in software engineering for generating diverse code solutions. This benefit, however, comes with a trade-off between diversity and correctness. In software engineering tasks, diversity helps developers explore design spaces and fosters creativity, but the practical value of a solution depends heavily on its correctness. We systematically investigate this trade-off through experiments on HumanEval tasks, varying parameter settings and prompting strategies. We measure the diversity of code solutions using similarity metrics from the code clone community. The study identifies combinations of parameters and strategies that lie on the Pareto front of the diversity-correctness trade-off space, where neither objective can be improved without sacrificing the other. These findings give software engineers guidance on using AI foundation models to generate code solutions that are both diverse and correct.