Algorithmic Primitives and Compositional Geometry of Reasoning in Language Models

How do latent and inference-time computations enable large language models (LLMs) to solve multi-step reasoning problems? We introduce a framework for tracing and steering the algorithmic primitives that underlie model reasoning. Our approach links reasoning traces to internal activations and evaluates algorithmic primitives by injecting them into residual streams and measuring their effect on reasoning steps and task performance. We consider four benchmarks: the Traveling Salesperson Problem (TSP), 3SAT, AIME, and graph navigation. We operationalize primitives by clustering activations and annotating their matched reasoning traces with an automated LLM pipeline. We then apply function vector methods to derive primitive vectors as reusable compositional building blocks of reasoning. Primitive vectors can be combined through addition, subtraction, and scalar operations, revealing a geometric logic in activation space. Cross-task and cross-model evaluations (Phi-4, Phi-4-Reasoning, Llama-3-8B) show both shared and task-specific primitives. Notably, comparing Phi-4 with its reasoning-finetuned variant highlights compositional generalization after finetuning: Phi-4-Reasoning makes more systematic use of verification and path-generation primitives. Injecting the associated primitive vectors into Phi-4 induces behavioral hallmarks associated with Phi-4-Reasoning. Together, these findings demonstrate that reasoning in LLMs may be supported by a compositional geometry of algorithmic primitives, that primitives transfer across tasks and models, and that reasoning finetuning strengthens algorithmic generalization across domains.
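To make the vector arithmetic concrete, the following is a minimal sketch on synthetic data, not the pipeline used in this work. It assumes a simple mean-difference estimate in place of the function vector method, and every name in it (`primitive_vector`, `compose`, `inject`, the toy "verification" and "path-generation" clusters, the dimension `d_model`) is hypothetical and chosen for illustration only.

```python
import numpy as np

def primitive_vector(cluster_acts, all_acts):
    """Assumed stand-in: mean activation of one cluster minus the global mean."""
    return cluster_acts.mean(axis=0) - all_acts.mean(axis=0)

def compose(vectors, weights):
    """Combine primitive vectors by addition, subtraction, and scaling."""
    return sum(w * v for w, v in zip(weights, vectors))

def inject(residual, vector, alpha=1.0):
    """Add a scaled vector to every token position of a residual stream."""
    return residual + alpha * vector

# Synthetic activations: 200 token positions, hypothetical hidden size 16.
rng = np.random.default_rng(0)
d_model = 16
all_acts = rng.normal(size=(200, d_model))

# Toy clusters standing in for annotated primitives (labels are illustrative).
verify_acts = all_acts[:50] + 2.0
path_acts = all_acts[50:100] - 1.0

v_verify = primitive_vector(verify_acts, all_acts)
v_path = primitive_vector(path_acts, all_acts)

# Composition: keep "verification", subtract "path generation".
combo = compose([v_verify, v_path], [1.0, -1.0])

# Injection into a toy residual stream of 5 token positions.
residual = rng.normal(size=(5, d_model))
steered = inject(residual, combo, alpha=0.5)
```

In a real model the injection would happen inside a forward pass at a chosen layer, and the effect would be measured on downstream reasoning steps and task accuracy rather than on the activations themselves.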
