Recent progress in large language models (LLMs) has improved code generation, but most evaluations still test isolated, small-scale code (e.g., a single function) under default or unspecified software environments. As a result, it is unclear whether LLMs can reliably generate executable code tailored to a user's specific environment. We present the first systematic study of Environment-Aware Code Generation (EACG), where generated code must be functionally correct and directly executable under arbitrary software configurations. To enable realistic evaluation, we introduce VersiBCB, a benchmark that is multi-package, execution-verified, and deprecation-aware, capturing the complex and evolving environments that prior datasets often overlook. Using VersiBCB, we investigate three complementary adaptation axes (data, parameters, and cache) and develop representative strategies for each. Our results show that current LLMs struggle with environment-specific code generation, while our adaptations improve environment compatibility and executability. These findings highlight key challenges and opportunities for deploying LLMs in practical software engineering workflows.