This technical report introduces AngelSlim, a comprehensive and versatile toolkit for large model compression developed by the Tencent Hunyuan team. By consolidating cutting-edge algorithms, including quantization, speculative decoding, token pruning, and distillation, AngelSlim provides a unified pipeline that streamlines the transition from model compression to industrial-scale deployment. To facilitate efficient acceleration, we integrate state-of-the-art FP8 and INT8 Post-Training Quantization (PTQ) algorithms alongside pioneering research in ultra-low-bit regimes, featuring HY-1.8B-int2 as the first industrially viable 2-bit large model. Beyond quantization, we propose a training-aligned speculative decoding framework that is compatible with multimodal architectures and modern inference engines and achieves 1.8x to 2.0x throughput gains without compromising output correctness. Furthermore, we develop a training-free sparse attention framework that reduces Time-to-First-Token (TTFT) in long-context scenarios; it decouples sparse kernels from model architectures through a hybrid of static patterns and dynamic token selection. For multimodal models, AngelSlim incorporates two specialized pruning strategies: IDPruner, which optimizes vision tokens via Maximal Marginal Relevance, and Samp, which performs adaptive audio token merging and pruning. By integrating these compression strategies down to their low-level implementations, AngelSlim supports both algorithm-focused research and tool-assisted deployment.
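To make the Maximal Marginal Relevance (MMR) idea mentioned for vision token pruning more concrete, the following is a minimal sketch of generic greedy MMR selection over token embeddings. It is not IDPruner's actual formulation, which the abstract does not specify. The function name `mmr_select`, the `relevance` scores, the cosine-similarity measure, and the trade-off weight `lam` are all assumptions made for illustration.

```python
import numpy as np


def mmr_select(features, relevance, k, lam=0.5):
    """Greedy MMR selection of k token indices (illustrative sketch only).

    features:  (n, d) token embeddings (hypothetical input).
    relevance: (n,) per-token importance scores (hypothetical input).
    lam:       weight on relevance; (1 - lam) weights the redundancy penalty.
    """
    features = np.asarray(features, dtype=np.float64)
    relevance = np.asarray(relevance, dtype=np.float64)
    n = features.shape[0]
    k = min(k, n)

    # Pairwise cosine similarity between tokens.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T

    selected = []
    available = np.ones(n, dtype=bool)
    # max_sim[i] = highest similarity of token i to any token selected so far.
    max_sim = np.full(n, -np.inf)

    for _ in range(k):
        # No redundancy penalty applies before the first token is chosen.
        penalty = max_sim if selected else np.zeros(n)
        score = lam * relevance - (1.0 - lam) * penalty
        score[~available] = -np.inf  # never pick the same token twice
        best = int(np.argmax(score))
        selected.append(best)
        available[best] = False
        max_sim = np.maximum(max_sim, sim[best])

    return selected
```

With `lam=1.0` the redundancy penalty vanishes and the selection reduces to plain top-k by relevance. Smaller values favor tokens that are dissimilar to those already kept, which is the property that makes MMR attractive for pruning redundant vision tokens.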