Large language models (LLMs) increasingly assist software engineering tasks that require reasoning over long code contexts, yet their robustness under varying input conditions remains unclear. We conduct a systematic study of long-context code question answering using controlled ablations that test sensitivity to answer format, distractors, and context scale. Extending the LongCodeBench Python dataset with new COBOL and Java question-answer sets, we evaluate state-of-the-art models under three settings: (i) shuffled multiple-choice options, (ii) open-ended questions, and (iii) needle-in-a-haystack contexts containing both relevant and adversarially irrelevant information. Results show substantial performance drops under both shuffled multiple-choice options and open-ended formulations, as well as brittle behavior in the presence of irrelevant cues. Our findings highlight limitations of current long-context evaluations and provide a broader benchmark for assessing code reasoning in both legacy and modern systems.