GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models

The rapid evolution of software libraries presents a significant challenge for code generation models, which must adapt to frequent version updates while maintaining compatibility with previous versions. Existing code completion benchmarks often overlook this dynamic aspect, and the one that does consider it relies on static code prediction tasks without execution-based evaluation, offering a limited perspective on a model's practical usability. To address this gap, we introduce \textbf{\GitChameleon{}}, a novel, manually curated dataset comprising 116 Python code completion problems, each conditioned on specific library versions and accompanied by executable unit tests. \GitChameleon{} is designed to rigorously assess the ability of modern large language models (LLMs) to generate version-specific code that is not only syntactically correct but also functionally accurate upon execution. Our comprehensive evaluations reveal that state-of-the-art LLMs struggle with this task; for instance, \textbf{GPT-4o} achieves a pass@10 of only 39.9\% (43.7\% when provided with error feedback), highlighting the complexity of the problem and the limitations of current models. By providing an execution-based benchmark that emphasizes the dynamic nature of code libraries, \GitChameleon{} serves as a critical tool for advancing the development of more adaptable and reliable code generation models. To facilitate further exploration of version-conditioned code generation, we make our code repository publicly accessible at \url{https://github.com/NizarIslah/GitChameleon}.
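
The abstract reports pass@10 without stating how it is computed. As a minimal sketch, assuming the widely used unbiased estimator $1 - \binom{n-c}{k} / \binom{n}{k}$ (with $n$ samples per problem, of which $c$ pass the unit tests), the metric could be computed as follows. The function names `pass_at_k` and `mean_pass_at_k` are hypothetical and are not taken from the \GitChameleon{} repository.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem as 1 - C(n - c, k) / C(n, k).

    n: number of generated samples, c: samples passing all unit tests,
    k: evaluation budget. Assumed estimator, not confirmed by the source.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # With fewer than k failing samples, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per benchmark problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, `mean_pass_at_k([(20, 1), (20, 0)], k=10)` averages one problem with a single passing sample out of 20 and one problem with none.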
