Large Language Models (LLMs) face a persistent trade-off between inference cost and reasoning capability. While "Oracle" models (e.g., Llama-3-70B) achieve state-of-the-art accuracy, they are prohibitively expensive for high-volume deployment. Smaller models (e.g., 8B parameters) are cost-effective but struggle with complex tasks. In this work, we propose "Pyramid MoA", a hierarchical Mixture-of-Agents architecture that uses a lightweight Router to escalate queries to the Oracle only when necessary. By leveraging semantic agreement and confidence calibration across an ensemble of small models, the Router identifies "hard" problems with high precision. On the GSM8K benchmark, our system reaches 93.0% accuracy, approaching the Oracle baseline (98.0%) while reducing compute costs by 61%. The system adds only a small latency overhead (+0.82 s) and exposes a tunable trade-off between performance and budget.
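To make the escalation mechanism concrete, below is a minimal sketch of agreement-based routing, assuming a simple majority-vote agreement signal; the model names, threshold, and `query_model` helper are illustrative placeholders, not the paper's implementation, and the Router's confidence-calibration component is omitted here.

```python
# Hypothetical sketch of agreement-based escalation; model names, the
# threshold, and query_model are assumptions for illustration only.
from collections import Counter

SMALL_MODELS = ["small-a", "small-b", "small-c"]  # ensemble of cheap models
AGREEMENT_THRESHOLD = 2 / 3                        # tunable cost/accuracy knob


def query_model(model: str, prompt: str) -> str:
    """Placeholder for an actual inference call to the named model."""
    raise NotImplementedError


def route(prompt: str) -> str:
    # 1. Ask every small model in the ensemble for an answer.
    answers = [query_model(m, prompt) for m in SMALL_MODELS]

    # 2. Measure agreement. Exact-match majority vote stands in for the
    #    richer semantic-agreement signal described in the abstract.
    answer, votes = Counter(answers).most_common(1)[0]
    agreement = votes / len(answers)

    # 3. Low agreement marks the query as "hard": escalate to the Oracle.
    if agreement >= AGREEMENT_THRESHOLD:
        return answer                              # cheap path: ensemble agrees
    return query_model("oracle-70b", prompt)       # expensive path: escalate
```

Raising `AGREEMENT_THRESHOLD` escalates more queries (higher accuracy, higher cost), which is one way the performance/budget trade-off described above could be tuned.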