Coding in a Bubble? Evaluating LLMs in Resolving Context Adaptation Bugs During Code Adaptation

Code adaptation is a fundamental but challenging task in software development, requiring developers to modify existing code for new contexts. A key challenge is resolving Context Adaptation Bugs (CtxBugs), which occur when code that is correct in its original context violates constraints in the target environment. Unlike isolated bugs, CtxBugs cannot be resolved through local fixes; they require cross-context reasoning to identify semantic mismatches. Overlooking them may lead to critical failures in adaptation. Although Large Language Models (LLMs) show great potential in automating code-related tasks, their ability to resolve CtxBugs remains unexplored, which is a significant obstacle to their practical use in code adaptation. To bridge this gap, we propose CtxBugGen, a novel framework for generating CtxBugs to evaluate LLMs. Its core idea is to leverage LLMs' tendency to generate plausible but context-free code when contextual constraints are absent. The framework generates CtxBugs through a four-step process designed to ensure their relevance and validity: (1) Adaptation Task Selection, (2) Task-specific Perturbation, (3) LLM-based Variant Generation, and (4) CtxBugs Identification. Based on the benchmark constructed by CtxBugGen, we conduct an empirical study with four state-of-the-art LLMs. Our results reveal their unsatisfactory performance in CtxBug resolution. The best-performing LLM, Kimi-K2, achieves 55.93% on Pass@1 and resolves just 52.47% of CtxBugs. The presence of CtxBugs degrades LLMs' adaptation performance by up to 30%. Failure analysis indicates that LLMs often overlook CtxBugs and replicate them in their outputs. Our study highlights a critical weakness in LLMs' cross-context reasoning and emphasizes the need for new methods to enhance their context awareness for reliable code adaptation.
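To make the notion of a CtxBug concrete, the following minimal sketch shows code that is correct in its original context but violates a constraint of the target context. It is a hypothetical illustration only: the function names, the dictionary keys (`expires_at`, `expires_at_ms`), and the seconds-versus-milliseconds mismatch are assumptions chosen for clarity and are not taken from the CtxBugGen benchmark.

```python
# Hypothetical illustration of a context adaptation bug (CtxBug).
# None of these names or values come from the paper or its benchmark.


def is_expired_original(entry: dict, now_s: float) -> bool:
    """Original context: `expires_at` is a Unix timestamp in seconds."""
    return now_s >= entry["expires_at"]


def is_expired_naive(entry: dict, now_s: float) -> bool:
    """Target context stores milliseconds, but the comparison is copied as-is.

    The function looks correct in isolation; the bug is only visible
    once the target context's unit convention is taken into account.
    """
    return now_s >= entry["expires_at_ms"]


def is_expired_adapted(entry: dict, now_s: float) -> bool:
    """Adapted version: converts seconds to milliseconds before comparing."""
    return now_s * 1000 >= entry["expires_at_ms"]


now_s = 1_700_000_000.0
# An entry in the target context that expired 1000 seconds before `now_s`.
stale_entry = {"expires_at_ms": 1_699_999_000_000}

print(is_expired_naive(stale_entry, now_s))    # False: the stale entry is kept
print(is_expired_adapted(stale_entry, now_s))  # True: the stale entry is detected
```

No edit confined to the body of `is_expired_naive` can be justified without consulting the target context, which is the sense in which such bugs call for cross-context reasoning rather than a local fix.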