TWLA: Achieving Ternary Weights and Low-Bit Activations for LLMs via Post-Training Quantization

Large language models (LLMs) exhibit exceptional general language processing capabilities, but their memory and compute costs hinder deployment. Ternarization has emerged as a promising compression technique, offering significant reductions in model size and inference complexity. However, existing methods struggle with heavy-tailed activation distributions and therefore keep activations in high precision, which fundamentally limits end-to-end inference acceleration. To overcome this limitation, we propose TWLA, a post-training quantization (PTQ) framework that achieves 1.58-bit weight compression and 4-bit activation quantization while maintaining high accuracy. TWLA comprises three components: (1) the Euclidean-to-Manifold Asymmetric Ternary Quantizer (E2M-ATQ), which minimizes layer-output error under weight ternarization via a two-stage optimization from Euclidean initialization to manifold relocation; (2) Kronecker Orthogonal Tri-Modal Shaping (KOTMS), which applies a Kronecker-structured orthogonal rotation to reshape weights into ternary-friendly tri-modal distributions, while the shared rotation statistically suppresses activation outliers; and (3) Inter-Layer Aware Activation Mixed Precision (ILA-AMP), which explicitly introduces adjacent-layer second-order interaction costs into bit allocation and accounts for the layer-wise disparity in activation quantization gains induced by the shared orthogonal transform, preventing cascades triggered by a few weak layers. Extensive experiments demonstrate that TWLA maintains high accuracy under W1.58A4 while delivering significant inference acceleration. The code is available at https://github.com/Kishon-zzx/TWLA.
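
To make the W1.58A4 setting concrete, the following is a minimal sketch and not the TWLA implementation. It illustrates three facts behind the abstract: a ternary weight carries log2(3) ≈ 1.58 bits; a Kronecker product of orthogonal matrices is itself orthogonal; and a shared orthogonal rotation Q leaves the full-precision layer output unchanged, since (XQ)(QᵀW) = XW. Everything named here is an assumption made for illustration: the planar rotations are arbitrary rather than the rotation KOTMS uses, `ternarize` is a simple threshold rule in the style of Ternary Weight Networks standing in for E2M-ATQ, and `quantize_activations` is plain symmetric per-tensor rounding with a fixed 4 bits rather than the ILA-AMP allocation.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def kron(a, b):
    """Kronecker product: block (i, j) of the result is a[i][j] * b."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]


def rotation(theta):
    """2x2 planar rotation, an arbitrary orthogonal factor (illustrative only)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def ternarize(w):
    """Threshold-based ternarization (assumed stand-in, not E2M-ATQ).

    Approximates w by alpha * t with t in {-1, 0, +1}. The threshold is
    0.7 * mean(|w|); alpha is the mean magnitude of the surviving weights.
    """
    mags = [abs(v) for row in w for v in row]
    delta = 0.7 * sum(mags) / len(mags)
    kept = [m for m in mags if m > delta]
    alpha = sum(kept) / len(kept) if kept else 0.0
    t = [[(v > delta) - (v < -delta) for v in row] for row in w]
    return alpha, t


def quantize_activations(x, bits=4):
    """Symmetric per-tensor quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4 bits; integer range is [-8, 7]
    scale = max(abs(v) for row in x for v in row) / qmax
    if scale == 0:
        scale = 1.0
    q = [[max(-qmax - 1, min(qmax, round(v / scale))) for v in row] for row in x]
    return scale, q


# Information content of one ternary weight.
bits_per_weight = math.log2(3)

# A 4x4 orthogonal matrix built as a Kronecker product of two 2x2 rotations.
Q = kron(rotation(0.3), rotation(1.1))

random.seed(0)
X = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]  # activations
W = [[random.gauss(0, 1) for _ in range(5)] for _ in range(4)]  # weights

# Rotating activations by Q and weights by Q^T leaves X @ W unchanged.
Y = matmul(X, W)
Y_rot = matmul(matmul(X, Q), matmul(transpose(Q), W))
max_diff = max(abs(a - b) for ra, rb in zip(Y, Y_rot) for a, b in zip(ra, rb))

# Quantize in the rotated basis: ternary weights, 4-bit activations.
alpha, T = ternarize(matmul(transpose(Q), W))
act_scale, Xq = quantize_activations(matmul(X, Q), bits=4)

print(f"bits per ternary weight: {bits_per_weight:.3f}")
print(f"max |XW - (XQ)(Q^T W)|: {max_diff:.2e}")
```

Because the rotation is exact in full precision, any accuracy loss in such a pipeline comes only from the two quantizers applied in the rotated basis, which is why the choice of rotation matters for both the weight and the activation distributions.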
