Model See, Model Do? Exposure-Aware Evaluation of Bug-vs-Fix Preference in Code LLMs

Large language models are increasingly used for code generation and debugging, but their outputs can still contain bugs that originate from training data. Whether an LLM prefers correct code or a familiar incorrect version may depend on what it was exposed to during training. We introduce an exposure-aware evaluation framework that quantifies how prior exposure to buggy versus fixed code influences a model's preference. Using the ManySStuBs4J benchmark, we apply Data Portraits for membership testing on the Stack-V2 corpus to estimate whether each buggy and fixed variant was seen during training. We then stratify examples by exposure and compare model preference using code completion as well as multiple likelihood-based scoring metrics. We find that most examples (67%) have neither variant in the training data, and that when only one variant is present, it is more often the fix than the bug. In generation, models reproduce buggy lines far more often than fixes; bug-exposed examples amplify this tendency, while fix-exposed examples show only marginal improvement. In likelihood scoring, the minimum and maximum token-probability metrics consistently prefer the fixed code across all exposure conditions, indicating a stable bias toward correct fixes. In contrast, metrics such as the Gini coefficient reverse their preference when only the buggy variant was seen. Our results indicate that exposure can skew bug-fix evaluations and highlight the risk that LLMs may propagate memorised errors in practice.
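
To make the stratification and scoring steps concrete, the following is a minimal sketch and not the implementation used in this work. It assumes that membership testing has already produced two booleans per example (whether the buggy and the fixed variant were found in the corpus) and that the model's per-token probabilities for each variant are available as plain lists. The function names, the stratum labels, and the particular Gini formula (the standard one over sorted values) are illustrative assumptions.

```python
from typing import Dict, Sequence


def exposure_stratum(bug_seen: bool, fix_seen: bool) -> str:
    """Assign an example to one of four exposure conditions (labels are hypothetical)."""
    if bug_seen and fix_seen:
        return "both"
    if bug_seen:
        return "bug_only"
    if fix_seen:
        return "fix_only"
    return "neither"


def gini(values: Sequence[float]) -> float:
    """Standard Gini coefficient of non-negative values (0 means perfectly even)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the sorted values, with ranks starting at 1.
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2.0 * weighted) / (n * total) - (n + 1) / n


def score_variant(token_probs: Sequence[float]) -> Dict[str, float]:
    """Summarise one code variant's token probabilities with three metrics."""
    return {
        "min_prob": min(token_probs),
        "max_prob": max(token_probs),
        "gini": gini(token_probs),
    }


# Toy, made-up probabilities purely for illustration (not results from the study).
bug_scores = score_variant([0.9, 0.2, 0.8])
fix_scores = score_variant([0.9, 0.6, 0.8])
stratum = exposure_stratum(bug_seen=True, fix_seen=False)
```

Within each stratum, the per-metric scores of the buggy and fixed variants can then be compared. Which direction of a metric counts as "preferring" a variant, particularly for the Gini coefficient, is a design decision that this sketch deliberately leaves open.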
