Code Correctness Signals in LLM Hidden States: Pre-Generation Probing and Repair Geometry

Large language models encode rich information in their hidden states. This work asks whether code correctness is legible in the hidden states of Qwen3-4B-Instruct-2507, both before the model generates code and as it repairs a failed attempt, using 444 LiveCodeBench tasks. It reports two findings connected by a single confound-control tool, residualization. First, the correctness of the model's first-attempt code is linearly decodable from the prompt-final hidden state, with a leakage-free held-out AUC of 0.931 ± 0.008 across 50 outer splits; the probe layer is selected by nested cross-validation. After the linear effect of prompt length is removed from each hidden-state dimension, the probe still reaches 0.911 ± 0.010, well above a prompt-length baseline of 0.754 ± 0.014. Second, on 236 cleaned cases in which the model attempts to repair a failed first attempt, the hidden-state shift from the failing attempt to its repair carries a statistically detectable contrastive direction, significant against label-shuffled nulls on both a magnitude test and a split-half test. This direction does not survive conditional residualization against repair-context covariates that differ between successful and failed repairs, which marks it as a correlate of repair success driven by the repair context rather than an isolated repair-comprehension feature. The same residualization approach that upholds the pre-generation correctness result thus overturns the repair-direction interpretation. The contribution is as much methodological as empirical: a diagnostic honest enough to report a negative result alongside a positive one.
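As a rough illustration of the residualization step described above, the sketch below regresses each hidden-state dimension on a scalar covariate (prompt length) with ordinary least squares and keeps the residuals. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names `fit_residualizer` and `apply_residualizer`, the synthetic data, and the single train/test split are all hypothetical. The regression is fitted on training rows only and then applied to held-out rows, which is one way to keep the procedure leakage-free.

```python
import numpy as np


def fit_residualizer(covariate, hidden):
    """Fit one OLS line (intercept + slope) per hidden dimension.

    covariate: shape (n,), e.g. prompt length in tokens (assumed covariate).
    hidden:    shape (n, d), one hidden-state vector per example.
    Returns coefficients of shape (2, d).
    """
    design = np.column_stack([np.ones(len(covariate)), covariate])
    coef, *_ = np.linalg.lstsq(design, hidden, rcond=None)
    return coef


def apply_residualizer(coef, covariate, hidden):
    """Subtract the fitted linear effect of the covariate from each dimension."""
    design = np.column_stack([np.ones(len(covariate)), covariate])
    return hidden - design @ coef


# Synthetic stand-in data (hypothetical): hidden states with a length-driven component.
rng = np.random.default_rng(0)
n, d = 200, 8
prompt_len = rng.integers(50, 500, size=n).astype(float)
loading = rng.normal(size=(1, d))
hidden = rng.normal(size=(n, d)) + 0.01 * prompt_len[:, None] * loading

# Fit on the training rows only, then apply the same coefficients to held-out rows.
train_idx, test_idx = np.arange(0, 150), np.arange(150, n)
coef = fit_residualizer(prompt_len[train_idx], hidden[train_idx])
resid_train = apply_residualizer(coef, prompt_len[train_idx], hidden[train_idx])
resid_test = apply_residualizer(coef, prompt_len[test_idx], hidden[test_idx])

# On the training rows, every residual dimension is linearly uncorrelated with length.
max_abs_corr = max(
    abs(np.corrcoef(prompt_len[train_idx], resid_train[:, j])[0, 1]) for j in range(d)
)
```

A probe trained on `resid_train` and scored on `resid_test` can no longer exploit a purely linear dependence on prompt length. Nonlinear dependence on length, or dependence on other covariates, is not removed by this step.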
