Research on efficient LLM inference has largely focused on reducing the cost of each decoding step (e.g., through quantization, pruning, or sparse attention), typically applying a uniform computation budget to every generated token. In practice, token difficulty varies widely, so static compression can over-compute on easy steps and under-compute on hard ones. We study dynamic budget allocation for autoregressive decoding: learning, within a single model, how much computation to spend on each token. Self-Optimizing Language Models (SOL) pair a frozen LLM with a lightweight policy network that reads the LLM hidden state and selects a discrete efficiency action at each decode step. Actions can jointly control (i) token-level attention sparsity, (ii) structured activation pruning in the MLP, and (iii) activation quantization bit-width, while leaving the base model weights unchanged. We train the policy with group-relative policy optimization on teacher-forced episodes: the token sequence is fixed, and we sample multiple compute schedules (i.e., "counterfactual" schedules that vary only the efficiency actions along the same token path) and compare their likelihoods under the same supervision. Our reward trades off language-model quality against soft penalties that encourage episode-average budget usage to match a requested target. Across model variants and compute regimes, SOL improves quality at matched budget over static allocation and strong random schedule search, offering a complementary axis for inference-efficiency optimization. SOL discovers a better quality-efficiency Pareto front across all our experiments and improves MMLU accuracy by up to 7.3% over uniform budget allocation strategies.
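The training signal described above (several compute schedules scored against the same fixed token sequence, then compared within the group) can be illustrated with a small sketch. This is not the authors' implementation. The function names, the absolute-deviation form of the budget penalty, the `penalty_weight` parameter, and the convention that each step cost is a fraction of full compute are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def schedule_reward(token_logprobs, step_costs, target_budget, penalty_weight=1.0):
    """Quality term minus a soft penalty on episode-average budget mismatch.

    Assumed form: the abstract only says the reward trades off quality
    against soft budget penalties; the exact expression here is hypothetical.
    """
    quality = mean(token_logprobs)   # mean log-likelihood of the fixed tokens
    avg_cost = mean(step_costs)      # episode-average compute usage
    return quality - penalty_weight * abs(avg_cost - target_budget)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of schedules for the same tokens."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group: three hypothetical schedules over the same two teacher-forced
# tokens, each given as (per-token log-probs, per-step compute cost).
group = [
    ([-1.0, -1.0], [0.5, 0.5]),    # on budget, moderate quality
    ([-0.5, -0.5], [1.0, 1.0]),    # better quality, over budget
    ([-2.0, -2.0], [0.25, 0.25]),  # under budget, poor quality
]
rewards = [schedule_reward(lp, cost, target_budget=0.5) for lp, cost in group]
advantages = group_relative_advantages(rewards)
```

Because the token path is identical across the group, differences in reward come only from the efficiency actions, so the normalized advantages rank schedules directly against one another without a separate value network.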