From Brewing to Resolution: Tracing the Internal Lifecycle of Code Reasoning in LLMs

Standard accuracy metrics cannot explain why LLMs handle variable tracking yet fail on semantically equivalent loops. We study an internal lifecycle of code reasoning in which a model first brews the answer, making it linearly recoverable many layers before it becomes self-decodable, and then diverges into one of four resolution outcomes: Resolved, Overprocessed, Misresolved, or Unresolved. This lifecycle matters because similar task accuracies can mask fundamentally different failure modes that surface-level evaluation cannot detect. We introduce a dual diagnostic framework that pairs layer-wise linear probing with Context-Stripped Decoding (CSD) and apply it to six code-reasoning task families across 16 models spanning Qwen, Llama, and DeepSeek architectures. All four outcomes carry substantial mass in every task family: the overall Resolved rate is only 41.5%, and multiple tasks fall below 30%. Controlled sweeps over structure, depth, and operators expose task-specific failure bottlenecks; for example, the Resolved rate for Function Call drops from 61.1% to 2.5% as call depth increases from one to three. Across architectures and scales, the brewing scaffold remains stable, with normalized brewing duration between 24% and 42% for all 16 models. The scaffold therefore appears to be a stable empirical regularity across the tested decoder-only Transformer families, whereas resolution success covaries with capability, scale, and training. Code: https://github.com/euyis1019/llm-brewing
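
To make the notion of normalized brewing duration concrete, here is a minimal sketch of one way such a quantity could be computed from per-layer scores. It is an illustration under stated assumptions, not the paper's definition: the function names, the 0.5 threshold, the choice to normalize by total layer count, and the synthetic per-layer accuracies are all hypothetical. The sketch treats "brewing" as the gap between the first layer where a linear probe recovers the answer and the first layer where context-stripped decoding does.

```python
from typing import Optional, Sequence


def first_layer_at_or_above(scores: Sequence[float], threshold: float) -> Optional[int]:
    """Return the first layer index whose score reaches the threshold, else None."""
    for layer, score in enumerate(scores):
        if score >= threshold:
            return layer
    return None


def normalized_brewing_duration(
    probe_acc: Sequence[float],
    csd_acc: Sequence[float],
    threshold: float = 0.5,  # hypothetical cutoff, not taken from the paper
) -> Optional[float]:
    """Gap between probe onset and CSD onset, as a fraction of model depth.

    Assumed definition for illustration only. Returns None when either
    signal never reaches the threshold.
    """
    n_layers = len(probe_acc)
    if n_layers == 0 or len(csd_acc) != n_layers:
        raise ValueError("probe_acc and csd_acc must be non-empty and equal in length")
    probe_onset = first_layer_at_or_above(probe_acc, threshold)
    csd_onset = first_layer_at_or_above(csd_acc, threshold)
    if probe_onset is None or csd_onset is None:
        return None
    # Clamp at zero in case CSD succeeds before the probe does.
    return max(0, csd_onset - probe_onset) / n_layers


# Synthetic per-layer accuracies for an 8-layer toy model (made-up numbers).
probe_acc = [0.1, 0.2, 0.6, 0.8, 0.9, 0.9, 0.9, 0.9]
csd_acc = [0.0, 0.0, 0.1, 0.2, 0.4, 0.7, 0.9, 0.9]

duration = normalized_brewing_duration(probe_acc, csd_acc)
print(duration)
```

In this toy setting the probe crosses the threshold at layer 2 and CSD at layer 5, so the gap is 3 of 8 layers. Real per-layer scores would come from probes trained on hidden states and from the CSD procedure in the linked repository.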
