Reasoning-focused LLMs improve answer quality by generating longer reasoning traces, but the additional tokens dramatically increase serving cost, motivating inference optimization. We extend and apply Puzzle, a post-training neural architecture search (NAS) framework, to gpt-oss-120B to produce gpt-oss-puzzle-88B, a deployment-optimized derivative. Our approach combines heterogeneous MoE expert pruning, selective replacement of full-context attention with window attention, FP8 KV-cache quantization with calibrated scales, and post-training reinforcement learning to recover accuracy while keeping generation length low. In per-token terms, gpt-oss-puzzle-88B achieves 1.63× and 1.22× throughput speedups on an 8×H100 node in long-context and short-context settings, respectively, and a 2.82× throughput speedup on a single NVIDIA H100 GPU. However, because token counts vary with reasoning effort and model variant, gains in per-token throughput (tok/s) and latency (ms/token) do not necessarily translate into end-to-end speedups: a 2× throughput gain is erased if traces grow 2×. Conversely, throughput gains can be spent on more reasoning tokens to improve accuracy; we therefore advocate request-level efficiency metrics that normalize throughput by tokens generated and trace an accuracy--speed frontier across reasoning efforts. We show that gpt-oss-puzzle-88B improves over gpt-oss-120B along the entire frontier, delivering up to 1.29× higher request-level efficiency. Across various benchmarks, gpt-oss-puzzle-88B matches or slightly exceeds its parent on suite-average accuracy across reasoning efforts, with retention ranging from 100.8% (high effort) to 108.2% (low effort), showing that post-training architecture search can substantially reduce inference cost without sacrificing quality.
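One plausible form of the request-level metric described above is requests completed per unit time, i.e. per-token throughput divided by the mean number of tokens generated per request. The sketch below (the function name and exact normalization are our illustrative assumptions, not the paper's published formula) shows how a 2× per-token throughput gain is exactly cancelled when traces also grow 2×:

```python
def request_level_efficiency(tokens_per_second, tokens_per_request):
    """Requests served per second: per-token throughput normalized by
    the mean generation length. This is an illustrative definition,
    assuming efficiency = tok/s divided by mean tokens per request."""
    mean_tokens = sum(tokens_per_request) / len(tokens_per_request)
    return tokens_per_second / mean_tokens

# Baseline model: 1000 tok/s, traces of ~500 tokens.
base = request_level_efficiency(1000.0, [480, 500, 520])
# "Faster" model: 2x the per-token throughput, but traces also doubled,
# so the end-to-end request rate is unchanged.
fast = request_level_efficiency(2000.0, [960, 1000, 1040])
print(base, fast)  # both 2.0 requests/second
```

Under this normalization, a model only improves end-to-end efficiency if its per-token speedup outpaces any growth in trace length, which is exactly why the abstract evaluates the accuracy--speed frontier rather than tok/s alone.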