Large language model (LLM) reasoning is ephemeral: chains of thought vanish with the context window, pruned search branches leave no record, and memory buffers cannot be diffed, merged, or audited. Every other complex software process (code, infrastructure, data, experiments) is version-controlled; reasoning is not. We introduce GitOfThoughts, which stores an agent's reasoning tree as a git repository: every scored thought is a commit, scores are notes, outcomes are tags, and retrieval is "git log" over the agent's own history. This makes reasoning replayable, auditable, and mergeable across agents at near-zero engineering cost. We then ask the harder question: does memory, in any substrate, actually improve accuracy? Across five substrates (none, markdown, vector, graph, git), two benchmarks, two model scales, and pre-registered replications, the answer for novel problems is no. No memory format reliably helps, and a promising early result collapsed under its own pre-registered replication. Memory pays only above what we call the copyability threshold: when the retrieved case is a near-duplicate of the current problem (similarity of roughly 0.8 or higher), accuracy jumps sharply; below the threshold, memory yields no gain. The gain is answer retrieval, not method transfer: a 4.5x larger model doubles the near-duplicate payoff yet still cannot extract a transferable method from a worked example. The only general lever we find is test-time sampling. The case for git-as-substrate is therefore auditability, provenance, and mergeability at accuracy parity. We document a retracted result and a refuted hypothesis to model the evaluation standard we hold ourselves to.
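The mapping described above (thought as commit, score as note, outcome as tag, retrieval as "git log") can be illustrated with a minimal sketch. This is not the GitOfThoughts implementation: the helper names (`record_thought`, `tag_outcome`, `recall`, `demo`), the `score=...` note format, the `outcome-solved` tag name, and the use of empty commits are all assumptions made for illustration. It assumes only that the `git` command-line tool is installed.

```python
import subprocess
import tempfile
from pathlib import Path

# Inline identity flags so the sketch does not depend on a global git config.
GIT = ["git", "-c", "user.name=agent", "-c", "user.email=agent@example.invalid"]


def git(repo: Path, *args: str) -> str:
    """Run one git command inside `repo` and return its trimmed stdout."""
    result = subprocess.run(
        [*GIT, *args], cwd=repo, check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def record_thought(repo: Path, text: str, score: float) -> str:
    """Store one scored thought: the thought is a commit, the score is a note."""
    git(repo, "commit", "--allow-empty", "-m", text)
    sha = git(repo, "rev-parse", "HEAD")
    git(repo, "notes", "add", "-m", f"score={score:.2f}", sha)
    return sha


def tag_outcome(repo: Path, sha: str, outcome: str) -> None:
    """Mark the commit that produced an outcome with a lightweight tag."""
    git(repo, "tag", outcome, sha)


def recall(repo: Path, keyword: str) -> list[str]:
    """Retrieve earlier thoughts whose commit message matches `keyword`."""
    return git(repo, "log", "--format=%s", f"--grep={keyword}").splitlines()


def demo() -> dict:
    """Build a two-thought history in a throwaway repository and query it."""
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        git(repo, "init", "--quiet")
        record_thought(repo, "try factoring the quadratic", 0.35)
        best = record_thought(repo, "apply the quadratic formula", 0.90)
        tag_outcome(repo, best, "outcome-solved")
        return {
            "hits": recall(repo, "quadratic"),
            "note": git(repo, "notes", "show", best),
            "tagged": git(repo, "rev-list", "-n", "1", "outcome-solved") == best,
        }
```

Calling `demo()` returns the matching thoughts newest first, the note attached to the best thought, and a flag confirming that the tag resolves to that commit. Because the history is an ordinary repository, standard git tooling (diff, merge, blame) applies to it unchanged, which is the auditability argument the abstract makes.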