Thanks to the unprecedented language understanding and generation capabilities of large language models (LLMs), Retrieval-augmented Code Generation (RaCG) has recently been widely adopted by software developers. While this has increased productivity, incorrect code is still frequently provided. In particular, plausible yet incorrect code is sometimes generated for user queries that cannot be answered from the given query and API descriptions alone. This study proposes an answerability evaluation task, which assesses whether a valid answer can be generated from a user's query and the retrieved APIs in RaCG. Additionally, we build a benchmark dataset called Retrieval-augmented Code Generability Evaluation (RaCGEval) to evaluate the performance of models on this task. Experimental results show that the task remains highly challenging, with baseline models achieving a low performance of 46.7%. Furthermore, this study discusses methods that could substantially improve performance.
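To make the answerability task concrete, the sketch below shows one way a baseline model could be queried to judge whether retrieved API descriptions suffice to answer a user query. This is only an illustrative assumption of the setup: the prompt wording, the binary label set ("answerable" / "unanswerable"), and the `generate` callable are hypothetical placeholders, not the paper's actual baseline implementation.

```python
# Illustrative sketch of answerability evaluation in RaCG (assumptions noted above).
from typing import Callable, List


def build_answerability_prompt(query: str, api_descriptions: List[str]) -> str:
    """Compose a prompt asking a model whether the retrieved APIs
    suffice to produce a valid code answer for the user's query."""
    apis = "\n".join(f"- {d}" for d in api_descriptions)
    return (
        "Given the user query and the retrieved API descriptions below, "
        "decide whether a valid code answer can be generated.\n"
        f"Query: {query}\n"
        f"Retrieved APIs:\n{apis}\n"
        "Respond with exactly one word: 'answerable' or 'unanswerable'."
    )


def judge_answerability(
    query: str,
    api_descriptions: List[str],
    generate: Callable[[str], str],  # hypothetical wrapper around an LLM call
) -> str:
    """Return the model's answerability label for one (query, APIs) pair."""
    response = generate(build_answerability_prompt(query, api_descriptions))
    # Check the longer label first so "unanswerable" is not misread as "answerable".
    return "unanswerable" if "unanswerable" in response.lower() else "answerable"
```

Under this framing, the benchmark score reported above corresponds to how often such label predictions agree with human-annotated answerability judgments.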