OrpQuant: Geometric Orthogonal Residual Projection for Multiplier-Free Power-of-Two Transformer Quantization

Deploying Large Language Models (LLMs) and Vision Transformers (ViTs) on edge devices is constrained by limited memory and by the timing bottlenecks of dense Multiply-Accumulate (MAC) arrays. In the ultra-low-bit regime, logarithmic Power-of-Two (PoT) quantization offers a hardware-efficient alternative by replacing the multiplications in MAC operations with bit shifts. However, the non-uniform exponential lattice is inherently confined to a \textbf{Low Angular Resolution Regime}, a structural flaw that becomes pronounced below 4 bits and degrades high-dimensional feature manifolds. To address this geometric limitation, we propose Orthogonal Residual Projection (ORP), an algorithm-hardware co-design framework. By formulating quantization as a dual-basis geometric projection, ORP adaptively synthesizes a higher-resolution residual lattice using only shift-and-add operations. Furthermore, ORP's analytical solver offers a practical alternative to computationally intensive gradient-based optimization, reducing full-model calibration time for LLaMA-2-7B to approximately \textbf{15 minutes}. Extensive evaluations demonstrate ORP's applicability across modalities and its hardware efficiency. Under the 3-bit (W3/A16) constraint, ORP achieves a perplexity of 6.10 on LLaMA-2-7B, comparing favorably to conventional MAC-intensive baselines such as AWQ without relying on asymmetric scaling, while maintaining competitive accuracy in 4-bit settings. At the silicon level, standard-cell RTL synthesis at a 28 nm node indicates that ORP effectively mitigates the timing bottlenecks associated with dense multiplier trees.
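
To make the shift-and-add idea in the abstract concrete, the following is a minimal sketch of generic PoT quantization with one additive residual term. It is not ORP's analytical solver or its dual-basis projection, which the abstract does not specify; the function names (`pot_quantize`, `two_term_pot`, `shift_mac`), the exponent ranges, and the greedy residual fit are all assumptions chosen for illustration.

```python
import math

def pot_quantize(w, min_exp=-4, max_exp=0):
    """Map w to sign * 2**e, rounding in the log domain and clamping e.

    Returns (sign, e); sign is 0 when w is exactly zero.
    Hypothetical helper, not taken from the paper.
    """
    if w == 0.0:
        return 0, min_exp
    sign = 1 if w > 0 else -1
    e = round(math.log2(abs(w)))          # nearest exponent in the log domain
    e = max(min_exp, min(max_exp, e))     # clamp to the representable range
    return sign, e

def pot_value(sign, e):
    """Real value represented by a (sign, exponent) pair."""
    return sign * 2.0 ** e

def two_term_pot(w, min_exp=-6, max_exp=0):
    """Greedy two-term approximation: w ~ s1*2**e1 + s2*2**e2.

    The second term quantizes the residual left by the first; it is
    dropped (sign 0) if it would not reduce the error.
    """
    s1, e1 = pot_quantize(w, min_exp, max_exp)
    residual = w - pot_value(s1, e1)
    s2, e2 = pot_quantize(residual, min_exp, max_exp)
    if abs(residual - pot_value(s2, e2)) >= abs(residual):
        s2 = 0
    return (s1, e1), (s2, e2)

def shift_mac(x_int, terms):
    """Compute x * w for an integer activation using only shifts and adds.

    Each term is (sign, e) with e <= 0 here, so multiplying by 2**e is
    an arithmetic right shift by -e bits.
    """
    acc = 0
    for sign, e in terms:
        shifted = x_int << e if e >= 0 else x_int >> -e
        acc += sign * shifted
    return acc

# Single-term PoT: 0.3 -> 2**-2 = 0.25 (error 0.05).
# Two-term PoT:    0.3 -> 2**-2 + 2**-4 = 0.3125 (error 0.0125).
t1, t2 = two_term_pot(0.3)
approx = pot_value(*t1) + pot_value(*t2)
product = shift_mac(64, [t1, t2])   # 64 * 0.3125 = 20, with no multiplier
```

In this toy case the residual term cuts the approximation error of the weight 0.3 from 0.05 to 0.0125 at the cost of one extra shift and one extra add, which is the kind of resolution-versus-hardware trade-off the abstract describes.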
