Compute Where it Counts: Self Optimizing Language Models

Research on efficient LLM inference has largely focused on reducing the cost of each decoding step (e.g., through quantization, pruning, or sparse attention), typically applying a uniform computation budget to every generated token. In practice, token difficulty varies widely, so static compression can over-compute on easy steps and under-compute on hard ones. We study dynamic budget allocation for autoregressive decoding: learning, from within a single model, how much computation to spend on each token. Self-Optimizing Language Models (SOL) pair a frozen LLM with a lightweight policy network that reads the LLM hidden state and selects a discrete efficiency action at each decode step. Actions can jointly control (i) token-level attention sparsity, (ii) structured activation pruning in the MLP, and (iii) activation quantization bit-width, while leaving the base model weights unchanged. We train the policy with group-relative policy optimization on teacher-forced episodes: the token sequence is fixed, and we sample multiple compute schedules (i.e., "counterfactual" schedules that vary only the efficiency actions along the same token path) and compare their likelihoods under the same supervision. Our reward trades off language-model quality against soft penalties that encourage episode-average budget usage to match a requested target. Across model variants and compute regimes, SOL improves quality at matched budget over both static allocation and a strong random schedule search, offering a complementary axis for inference-efficiency optimization. SOL finds a better quality-efficiency Pareto front in all our experiments and improves MMLU accuracy by up to 7.3% over uniform budget allocation strategies.
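The training signal described above can be illustrated with a minimal sketch. It is not the authors' implementation: the absolute-gap form of the budget penalty, the names `episode_reward`, `group_relative_advantages`, and `penalty_weight`, the per-step costs expressed as fractions of full compute, and all numbers are assumptions chosen for illustration. The sketch scores several compute schedules for the same fixed token sequence and then standardizes the rewards within the group, which is the core idea of a group-relative advantage.

```python
from statistics import mean, pstdev


def episode_reward(token_logprobs, step_costs, target_budget, penalty_weight=1.0):
    """Quality term minus a soft penalty on the episode-average budget gap.

    The absolute-gap penalty is an assumed form; the abstract only says the
    penalty encourages average budget usage to match a requested target.
    """
    quality = mean(token_logprobs)  # average log-likelihood of the fixed tokens
    budget_gap = abs(mean(step_costs) - target_budget)
    return quality - penalty_weight * budget_gap


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of schedules for the same tokens."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three hypothetical schedules for the same 4-token, teacher-forced sequence.
# Each cost is a made-up fraction of full compute spent at that decode step.
schedules = [
    {"logprobs": [-0.2, -1.5, -0.1, -0.9], "costs": [0.5, 0.5, 0.5, 0.5]},    # uniform
    {"logprobs": [-0.2, -1.0, -0.1, -0.7], "costs": [0.25, 1.0, 0.25, 0.5]},  # more compute on hard steps
    {"logprobs": [-0.1, -1.6, -0.1, -1.2], "costs": [1.0, 0.25, 0.5, 0.25]},  # more compute on easy steps
]

rewards = [
    episode_reward(s["logprobs"], s["costs"], target_budget=0.5) for s in schedules
]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.3f} advantage={advantage:+.3f}")
```

All three schedules spend the same average budget, so the budget penalty is zero for each and the advantages differ only through likelihood. In a policy-gradient update, a positive advantage would raise the probability of the efficiency actions in that schedule and a negative one would lower it.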


Related content

[ICML 2026] SOL: Getting Large Models to Spend Compute on the Key Tokens: Self-Optimizing Language Models
