Retrieval-augmented generation (RAG) has proven effective for knowledge-intensive tasks, but it is widely believed to offer limited benefit for reasoning-intensive problems such as math and code generation. We challenge this assumption by showing that the limitation lies not in RAG itself but in the choice of corpus. Instead of retrieving documents, we propose retrieving thinking traces, i.e., the intermediate thinking trajectories generated during problem-solving attempts. We show that thinking traces are already a strong retrieval source, and we further introduce T3, an offline method that transforms them into structured, retrieval-friendly representations to improve their usability. Using these traces as a corpus, a simple retrieve-then-generate pipeline consistently improves reasoning performance across strong models and across benchmarks such as AIME 2025-2026, LiveCodeBench, and GPQA-Diamond, outperforming both non-RAG baselines and retrieval over standard web corpora. For instance, on AIME 2025-2026, RAG with traces generated by Gemini-2-thinking achieves relative gains of +56.3%, +8.6%, and +7.6% for Gemini-2.5-Flash, GPT-OSS-120B, and GPT-5, respectively, even though these target models are more recent than the model that generated the traces. Overall, our results suggest that thinking traces are an effective retrieval corpus for reasoning tasks, and that transforming them into structured, compact, or diagnostic representations unlocks even stronger gains. Code is available at https://github.com/Narabzad/t3.
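To make the retrieve-then-generate idea concrete, the following is a minimal sketch, not the authors' implementation and not the T3 transformation itself. It assumes a tiny hypothetical corpus of thinking traces (`TRACES`), a simple bag-of-words cosine retriever chosen only for illustration, and a hypothetical prompt layout; the call to the generating model is left out.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus: each entry stands in for one stored thinking trace.
TRACES = [
    "To count lattice paths on a grid, choose which steps go right: "
    "use the binomial coefficient C(n+k, k).",
    "For a remainder modulo a prime, reduce the base first and apply "
    "Fermat's little theorem to shrink the exponent.",
    "To find the shortest path in an unweighted graph, run breadth-first "
    "search from the source node.",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, traces, k=1):
    """Return the k traces most similar to the query (assumed lexical retriever)."""
    query_bag = Counter(tokenize(query))
    ranked = sorted(
        traces,
        key=lambda trace: cosine(query_bag, Counter(tokenize(trace))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, traces, k=1):
    """Prepend retrieved traces to the problem (hypothetical prompt layout)."""
    context = "\n".join(f"- {trace}" for trace in retrieve(query, traces, k))
    return f"Related thinking traces:\n{context}\n\nProblem: {query}\nSolution:"


QUERY = "What is the remainder of 7^100 modulo 13?"
prompt = build_prompt(QUERY, TRACES)
print(prompt)
```

In this sketch the resulting prompt would be passed to whichever model answers the problem; a real system would likely replace the lexical retriever with a stronger one and the toy corpus with traces collected offline.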