SOLAR: Scalable Optimization of Large-scale Architecture for Reasoning

Large Language Models (LLMs) excel at reasoning but remain constrained by their Chain-of-Thought (CoT) approach, which struggles with complex tasks that require more nuanced topological reasoning. We introduce SOLAR, Scalable Optimization of Large-scale Architecture for Reasoning, a framework that dynamically optimizes various reasoning topologies to improve accuracy and efficiency. Our Topological Annotation Generation (TAG) system automates topological dataset creation and segmentation, improving post-training and evaluation. We also propose Topological-Scaling, a reward-driven framework that aligns training and inference scaling, equipping LLMs with adaptive, task-aware reasoning. SOLAR achieves substantial gains on MATH and GSM8K: +5% accuracy with Topological Tuning, +9% with Topological Reward, and +10.02% with Hybrid Scaling. It also reduces response length by over 5% for complex problems, lowering inference latency. To support the reward system, we train a multi-task Topological Reward Model (M-TRM) that autonomously selects the best reasoning topology and answer in a single pass. This eliminates the need to train and run multiple single-task TRMs (S-TRMs), reducing both training cost and inference latency. M-TRM also outperforms all S-TRMs, improving accuracy by +10% and rank correlation by +9%. To the best of our knowledge, SOLAR sets a new benchmark for scalable, high-precision LLM reasoning while introducing an automated annotation process and a dynamic reasoning topology competition mechanism.

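The abstract describes a reward model that picks the best reasoning topology and answer in a single pass. The following minimal Python sketch illustrates that selection step only, under stated assumptions: the `Candidate` type, the topology labels, the `select_best` helper, and the `toy_score` function are all hypothetical names introduced here for illustration. In particular, `toy_score` is a trivial placeholder and is not the M-TRM, whose architecture and training the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Candidate:
    """One response generated under a given reasoning topology."""
    topology: str   # illustrative label, e.g. "chain", "tree", "graph"
    response: str   # the generated reasoning and answer text


# A scorer maps (question, candidate) to a scalar reward.
# In SOLAR this role is played by a trained reward model; here it is
# just a function signature assumed for the sketch.
Scorer = Callable[[str, Candidate], float]


def select_best(question: str,
                candidates: Sequence[Candidate],
                score: Scorer) -> Candidate:
    """Return the candidate with the highest reward.

    Choosing over candidates from different topologies selects the
    topology and the answer together. Ties go to the earliest candidate.
    """
    if not candidates:
        raise ValueError("at least one candidate is required")
    return max(candidates, key=lambda c: score(question, c))


def toy_score(question: str, candidate: Candidate) -> float:
    """Placeholder reward: prefers shorter responses (NOT a real reward model)."""
    return -float(len(candidate.response))


if __name__ == "__main__":
    question = "What is 12 * 7?"
    candidates = [
        Candidate("chain", "12 * 7 = 12 * 5 + 12 * 2 = 60 + 24 = 84"),
        Candidate("tree", "Branch A: 10 * 7 = 70; Branch B: 2 * 7 = 14; total 84"),
        Candidate("graph", "84"),
    ]
    best = select_best(question, candidates, toy_score)
    print(best.topology, "->", best.response)
```

Swapping `toy_score` for a learned scorer would leave `select_best` unchanged, which is the point of keeping the scorer behind a plain function interface in this sketch.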