Language models are increasingly adopting smaller architectures optimized for consumer devices. In this setting, inference efficiency is the primary constraint. Meanwhile, vocabulary sizes continue to grow rapidly, making the classification head a critical bottleneck that accounts for up to 60\% of model parameters and 50\% of inference compute. We introduce FlashHead, the first training-free, hardware-friendly, efficient drop-in replacement for the dense classification head. FlashHead builds on principles from information retrieval, reframing the computation at the output head as a retrieval problem rather than dense classification over the full vocabulary. FlashHead introduces four key innovations: (1) a balanced clustering scheme that structures vocabulary partitions into compact, hardware-efficient tensors, (2) an extension of multiprobe retrieval to language model heads, enabling thousands of clusters to be scored in parallel, (3) a novel inference-time sampling mechanism that extends retrieval beyond the top-ranked tokens, enabling probabilistic sampling across the full vocabulary, and (4) selective quantization, enabling effective low-bit computation in the head. Experiments on Llama-3.2, Gemma-3, and Qwen-3 show that FlashHead delivers model-level inference speedups of up to \textbf{1.75x} while maintaining output accuracy compared to the original head. By overcoming the classification head bottleneck, FlashHead sets a new benchmark for efficient inference and removes a key barrier to developing smaller, capable models for consumer hardware.
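To make the retrieval framing concrete, the sketch below illustrates the first two ideas named above in plain NumPy: balanced clustering of the output-embedding rows into equally sized partitions that pack into one dense tensor, and multiprobe scoring that computes exact logits only for tokens in the clusters nearest the hidden state. This is a minimal illustration under stated assumptions, not FlashHead's actual algorithm or API: the names (\texttt{build\_clusters}, \texttt{retrieval\_head}, \texttt{n\_probe}) and the greedy balanced-assignment heuristic are hypothetical.

\begin{verbatim}
# Minimal NumPy sketch of a retrieval-style output head: balanced
# clustering of output embeddings plus multiprobe candidate scoring.
# All names (build_clusters, retrieval_head, n_probe) and the greedy
# balanced assignment are illustrative assumptions, not FlashHead's
# published algorithm or API.
import numpy as np

def build_clusters(W, n_clusters, n_iters=10, seed=0):
    """Partition the rows of W (V x d) into n_clusters equal-size
    clusters, so they pack into one dense (C, size, d) tensor."""
    V, d = W.shape
    assert V % n_clusters == 0, "sketch assumes V divisible by C"
    size = V // n_clusters
    rng = np.random.default_rng(seed)
    centroids = W[rng.choice(V, n_clusters, replace=False)]
    for _ in range(n_iters):
        scores = W @ centroids.T                    # (V, C) affinities
        order = np.argsort(-scores.max(axis=1))     # confident rows first
        counts = np.zeros(n_clusters, dtype=int)
        assign = np.empty(V, dtype=int)
        for i in order:                             # greedy balanced fill
            for c in np.argsort(-scores[i]):
                if counts[c] < size:
                    assign[i], counts[c] = c, counts[c] + 1
                    break
        centroids = np.stack([W[assign == c].mean(axis=0)
                              for c in range(n_clusters)])
    members = np.stack([np.flatnonzero(assign == c)
                        for c in range(n_clusters)])
    return centroids, members                       # (C, d), (C, size)

def retrieval_head(h, W, centroids, members, n_probe=8):
    """Score only the tokens in the n_probe clusters whose centroids
    best match hidden state h, instead of all V tokens."""
    probe = np.argsort(-(centroids @ h))[:n_probe]  # multiprobe: top clusters
    cand = members[probe].ravel()                   # candidate token ids
    logits = W[cand] @ h                            # exact logits on a subset
    return cand, logits

# Toy usage: 32-token vocabulary, 8 clusters of 4, probe 2 clusters.
rng = np.random.default_rng(1)
W = rng.standard_normal((32, 4))
centroids, members = build_clusters(W, n_clusters=8)
cand, logits = retrieval_head(rng.standard_normal(4), W,
                              centroids, members, n_probe=2)
# 8 candidates scored instead of 32; softmax over `logits` to decode.
\end{verbatim}

In a full head, one would softmax the candidate logits for decoding; per the abstract, FlashHead's inference-time sampling mechanism additionally reaches tokens outside the probed clusters so that sampling covers the full vocabulary, a step this sketch omits.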