Large Language Models (LLMs) excel at reasoning but remain constrained by the Chain-of-Thought (CoT) approach, which struggles with complex tasks that require more nuanced topological reasoning. We introduce SOLAR (Scalable Optimization of Large-scale Architecture for Reasoning), a framework that dynamically optimizes over multiple reasoning topologies to improve accuracy and efficiency. Our Topological Annotation Generation (TAG) system automates the creation and segmentation of topological datasets, improving both post-training and evaluation. We also propose Topological-Scaling, a reward-driven framework that aligns training-time and inference-time scaling, equipping LLMs with adaptive, task-aware reasoning. SOLAR achieves substantial gains on MATH and GSM8K: +5% accuracy with Topological Tuning, +9% with Topological Reward, and +10.02% with Hybrid Scaling. It also reduces response length by over 5% on complex problems, lowering inference latency. To support the reward system, we train a multi-task Topological Reward Model (M-TRM) that autonomously selects the best reasoning topology and answer in a single pass. This eliminates the need to train and run multiple single-task TRMs (S-TRMs), reducing both training cost and inference latency. M-TRM also outperforms all S-TRMs, improving accuracy by +10% and rank correlation by +9%. To the best of our knowledge, SOLAR sets a new benchmark for scalable, high-precision LLM reasoning while introducing an automated annotation process and a dynamic competition mechanism among reasoning topologies.