The integration of Large Language Models (LLMs) into Integrated Development Environments (IDEs) has become a focal point in modern software development. LLMs such as OpenAI GPT-3.5/4 and Code Llama offer the potential to significantly augment developer productivity by serving as intelligent, chat-driven programming assistants. However, using an LLM out of the box is unlikely to be optimal for any given scenario. Rather, each system requires the LLM to be tuned to that system's own set of heuristics to ensure the best performance. In this paper, we introduce the Copilot evaluation harness: a set of data and tools for evaluating LLM-guided IDE interactions, covering various programming scenarios and languages. We propose our metrics as a more robust and information-dense evaluation than previous state-of-the-art evaluation systems. We design and compute both static and execution-based success metrics for scenarios encompassing a wide range of developer tasks, including code generation from natural language (generate), documentation generation from code (doc), test case generation (test), bug-fixing (fix), and workspace understanding and query resolution (workspace). These success metrics are designed to evaluate the performance of LLMs within a given IDE and its respective parameter space. The lessons we draw from evaluating three common LLMs using these metrics can inform the development and validation of future scenarios in LLM-guided IDEs.