Large language models (LLMs) exhibit exceptional general language processing capabilities, but their memory and compute costs hinder deployment. Ternarization has emerged as a promising compression technique that substantially reduces model size and inference complexity. However, existing methods struggle with heavy-tailed activation distributions and therefore keep activations in high precision, which fundamentally limits end-to-end inference acceleration. To overcome this limitation, we propose TWLA, a post-training quantization (PTQ) framework that achieves 1.58-bit weight compression and 4-bit activation quantization (W1.58A4) while maintaining high accuracy. TWLA comprises three components: (1) the Euclidean-to-Manifold Asymmetric Ternary Quantizer (E2M-ATQ), which minimizes layer-output error under weight ternarization via a two-stage optimization from Euclidean initialization to manifold relocation; (2) Kronecker Orthogonal Tri-Modal Shaping (KOTMS), which applies a Kronecker-structured orthogonal rotation to reshape weights into ternary-friendly tri-modal distributions, while the shared rotation statistically suppresses activation outliers; and (3) Inter-Layer Aware Activation Mixed Precision (ILA-AMP), which explicitly introduces adjacent-layer second-order interaction costs into bit allocation and jointly optimizes for the layer-wise disparity in activation quantization gains induced by the shared orthogonal transform, preventing error cascades triggered by a few weak layers. Extensive experiments demonstrate that TWLA maintains high accuracy under W1.58A4 while delivering significant inference acceleration. The code is available at https://github.com/Kishon-zzx/TWLA.
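To make two of the ideas in the abstract concrete, the following is a minimal NumPy sketch, not the TWLA implementation. It illustrates (a) why ternary weights cost about 1.58 bits each (log2 3), (b) that a Kronecker product of two small orthogonal matrices is itself orthogonal, so rotating activations and weights with the same matrix leaves the layer output unchanged before quantization, and (c) a simple threshold-based ternarizer. The ternarizer uses a generic mean-absolute-value threshold heuristic as a stand-in and is not E2M-ATQ; the helper names `random_orthogonal` and `ternarize`, the factor 0.7, and all shapes are assumptions chosen for illustration.

```python
import numpy as np


def random_orthogonal(n, rng):
    """Draw an n x n orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    # Flipping column signs keeps q orthogonal and removes the QR sign ambiguity.
    return q * np.sign(np.diag(r))


def ternarize(w):
    """Per-output-column ternarization: w is approximated by alpha * t, t in {-1, 0, +1}.

    Stand-in heuristic (assumption): threshold at 0.7 * mean(|w|) per column,
    then set alpha to the mean magnitude of the surviving entries.
    """
    delta = 0.7 * np.mean(np.abs(w), axis=0, keepdims=True)
    t = np.sign(w) * (np.abs(w) > delta)
    kept = np.abs(t)
    alpha = (np.abs(w) * kept).sum(axis=0, keepdims=True) / np.maximum(
        kept.sum(axis=0, keepdims=True), 1
    )
    return t, alpha


rng = np.random.default_rng(0)

# Kronecker-structured rotation: a 32 x 32 orthogonal matrix built from 4 x 4 and 8 x 8 factors.
q = np.kron(random_orthogonal(4, rng), random_orthogonal(8, rng))

x = rng.standard_normal((16, 32))   # toy activations (tokens x hidden)
w = rng.standard_normal((32, 64))   # toy weights (hidden x out)

# Shared rotation: (x q)(q^T w) equals x w in exact arithmetic.
x_rot = x @ q
w_rot = q.T @ w

# Ternarize the rotated weights and rebuild the dequantized approximation.
t, alpha = ternarize(w_rot)
w_hat = t * alpha

# Information content of one ternary symbol.
bits_per_weight = np.log2(3)

# Relative layer-output error introduced by ternarization alone.
rel_err = np.linalg.norm(x_rot @ w_hat - x @ w) / np.linalg.norm(x @ w)
```

The Kronecker structure means only the two small factors need to be stored and applied, rather than a dense 32 x 32 matrix. The random rotation here only demonstrates output invariance; shaping the weight distribution as the abstract describes would require choosing the rotation deliberately.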