As large language models (LLMs) continue to scale, deployment is increasingly bottlenecked by the memory wall, motivating a shift toward extremely low-bit quantization. However, most quantization-aware training (QAT) methods apply hard rounding and the straight-through estimator (STE) from the beginning of training, which prematurely discretizes the optimization landscape and induces a persistent gradient mismatch between latent weights and quantized weights, hindering effective optimization of quantized models. To address this, we propose Hestia, a Hessian-guided differentiable QAT framework for extremely low-bit LLMs, which replaces the rigid step function with a temperature-controlled softmax relaxation to maintain gradient flow early in training while progressively hardening quantization. Furthermore, Hestia leverages a tensor-wise Hessian trace metric as a lightweight curvature signal to drive fine-grained temperature annealing, enabling sensitivity-aware discretization across the model. Evaluations on Llama-3.2 show that Hestia consistently outperforms existing ternary QAT baselines, yielding average zero-shot improvements of 5.39% and 4.34% for the 1B and 3B models, respectively. These results indicate that Hessian-guided relaxation effectively recovers representational capacity, establishing a more robust training path for 1.58-bit LLMs. The code is available at https://github.com/hestia2026/Hestia.
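To make the core idea concrete, the following is a minimal sketch of a temperature-controlled softmax relaxation for ternary quantization, assuming the candidate levels {-1, 0, +1} and a softmax over negative squared distances to each level. The function name and the exact form of the logits are illustrative assumptions, not the paper's implementation; at high temperature the output is a smooth convex combination of levels (so gradients flow), and as the temperature is annealed toward zero it converges to hard nearest-level rounding.

```python
import numpy as np

def soft_ternary(w, tau):
    """Softmax relaxation of ternary quantization (illustrative sketch).

    Each weight is mapped to a convex combination of the levels
    {-1, 0, +1}, weighted by a softmax over the negative squared
    distance to each level, scaled by the temperature `tau`.
    As tau -> 0 the softmax concentrates on the nearest level,
    recovering hard rounding.
    """
    levels = np.array([-1.0, 0.0, 1.0])
    # Negative squared distance to each level; sharper as tau shrinks.
    logits = -((w[..., None] - levels) ** 2) / tau
    # Numerically stable softmax over the level dimension.
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ levels

w = np.array([0.9, -0.1, -0.8])
soft = soft_ternary(w, tau=1.0)    # smooth: values lie between the levels
hard = soft_ternary(w, tau=1e-3)   # near-hard: snaps to {-1, 0, +1}
```

In a sensitivity-aware schedule such as the one the abstract describes, `tau` would not be annealed uniformly: tensors whose estimated Hessian trace indicates flatter curvature could be hardened earlier, while more sensitive tensors retain a higher temperature longer.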