Experts are all you need: A Composable Framework for Large Language Model Inference

Large Language Models (LLMs) have achieved state-of-the-art accuracy on a variety of natural language processing (NLP) tasks. However, this success comes at the cost of larger models, which increase the computational burden. Mixture-of-Experts (MoE) models ease this bottleneck by decoupling model capacity from computation, activating only a subset of parameters, or "experts". These models, however, require joint pretraining of the experts along with the router and do not model multi-step reasoning. In contrast, multi-agent frameworks improve reasoning by decomposing complex problems into modular subtasks, but they rely on sequential "plan-act-observe" loops, which introduce significant latency. Our work, Comp-LLM, addresses these challenges with a composable inference framework that enables cross-expert collaboration via an explicit sub-query dependency graph. Comp-LLM consists of three components: (1) a Sub-query Generator that decomposes an input query, assigns each sub-query to an appropriate expert using embedding similarity, and constructs a dependency graph; (2) a Query Executor that processes nodes in the graph and identifies opportunities for parallelism based on dependencies and resource constraints; and (3) a Response Aggregator that synthesizes intermediate expert responses into a coherent final answer. Across several benchmarks, Comp-LLM achieves up to 11.01% accuracy improvement over monolithic LLMs of similar size, while offering a 1.67x to 3.56x reduction in model size with no significant degradation relative to the largest model in its family. Additionally, Comp-LLM provides a 1.1x to 1.7x latency improvement compared to sequential sub-query processing.
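The three components can be illustrated with a minimal Python sketch. It is not the authors' implementation: every name (`assign_expert`, `execution_waves`, `run_graph`, `aggregate`) is hypothetical, toy vectors stand in for a real embedding model, a `max_workers` cap stands in for the resource constraints, and the level-by-level "wave" scheduling is a simplifying assumption about how independent sub-queries might be run in parallel.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign_expert(query_vec, expert_vecs):
    """Pick the expert whose (assumed) profile embedding is closest to the sub-query."""
    return max(expert_vecs, key=lambda name: cosine(query_vec, expert_vecs[name]))


def execution_waves(deps):
    """Group sub-queries into waves; a node is ready once all its dependencies are done."""
    remaining = {node: set(d) for node, d in deps.items()}
    done, waves = set(), []
    while remaining:
        ready = sorted(n for n, d in remaining.items() if d <= done)
        if not ready:
            raise ValueError("dependency cycle or unknown dependency")
        waves.append(ready)
        done.update(ready)
        for n in ready:
            del remaining[n]
    return waves


def run_graph(deps, run_node, max_workers=2):
    """Run each wave in parallel, capped by max_workers (a stand-in resource limit)."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for wave in execution_waves(deps):
            futures = {
                n: pool.submit(run_node, n, {d: results[d] for d in deps[n]})
                for n in wave
            }
            for n, future in futures.items():
                results[n] = future.result()
    return results


def aggregate(results):
    """Toy aggregator: concatenate intermediate answers in node order."""
    return " | ".join(results[n] for n in sorted(results))


if __name__ == "__main__":
    # q3 depends on q1 and q2, which are independent and can run in parallel.
    deps = {"q1": [], "q2": [], "q3": ["q1", "q2"]}
    print(execution_waves(deps))
    answers = run_graph(deps, lambda n, inputs: f"answer({n}, uses={sorted(inputs)})")
    print(aggregate(answers))
```

In this sketch, `q1` and `q2` fall into the same wave and are submitted together, while `q3` waits for both results. A real system would replace `run_node` with a call to the selected expert model and `aggregate` with an LLM-based synthesis step.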
