Deploying Large Language Models (LLMs) and Vision Transformers (ViTs) on edge devices is constrained by limited memory and by the timing bottlenecks of dense Multiply-Accumulate (MAC) arrays. In the ultra-low-bit regime, logarithmic Power-of-Two (PoT) quantization offers a hardware-efficient alternative by replacing the multiplications in MAC operations with bit-shifts. However, the non-uniform exponential lattice suffers from a \textbf{Low Angular Resolution Regime}, a structural limitation that becomes pronounced below 4 bits and degrades high-dimensional feature manifolds. To address this geometric limitation, we propose Orthogonal Residual Projection (ORP), an algorithm-hardware co-design framework. ORP formulates quantization as a dual-basis geometric projection and adaptively synthesizes a higher-resolution residual lattice using only shift-and-add operations. Its analytical solver is a practical alternative to computationally intensive gradient-based optimization, reducing full-model calibration time for LLaMA-2-7B to approximately \textbf{15 minutes}. Evaluations demonstrate ORP's applicability across modalities and its hardware efficiency. Under the 3-bit (W3/A16) constraint, ORP achieves a perplexity of 6.10 on LLaMA-2-7B, which compares favorably with conventional MAC-intensive baselines such as AWQ and does not rely on asymmetric scaling, while accuracy remains competitive in 4-bit settings. At the silicon level, standard-cell RTL synthesis at a 28\,nm node indicates that ORP mitigates the timing bottlenecks associated with dense multiplier trees.
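To make the shift-and-add idea concrete, the following is a minimal scalar sketch and not the ORP solver itself. It assumes a greedy fit: each weight is first rounded to the nearest signed power of two, and a second power-of-two term is then fitted to the residual. The function names, the exponent ranges, and `FRAC_BITS` are hypothetical choices made for illustration; the analytical dual-basis projection described above is not reproduced here.

```python
# Illustrative sketch only: a greedy two-term power-of-two (PoT) code per weight.
# This is an assumed stand-in, not the ORP analytical solver.

FRAC_BITS = 6  # hypothetical fixed-point scale; must be >= -res_min_exp


def nearest_pot(value, min_exp, max_exp):
    """Return (sign, exp) of the signed power of two closest to value, or None for zero."""
    if value == 0:
        return None
    sign = 1 if value > 0 else -1
    exp = min(range(min_exp, max_exp + 1), key=lambda e: abs(abs(value) - 2.0 ** e))
    return sign, exp


def two_term_pot(w, min_exp=-3, max_exp=0, res_min_exp=-6):
    """Coarse PoT term plus a PoT term fitted to the residual (greedy, not optimal)."""
    first = nearest_pot(w, min_exp, max_exp)
    if first is None:
        return []
    terms = [first]
    residual = w - first[0] * 2.0 ** first[1]
    second = nearest_pot(residual, res_min_exp, max_exp)
    # Keep the residual term only if it actually reduces the error.
    if second is not None and abs(residual - second[0] * 2.0 ** second[1]) < abs(residual):
        terms.append(second)
    return terms


def terms_value(terms):
    """Real value represented by a list of (sign, exp) terms."""
    return sum(sign * 2.0 ** exp for sign, exp in terms)


def shift_add_dot(x_ints, weight_terms):
    """Dot product of integer activations with PoT-coded weights using only shifts and adds."""
    acc = 0
    for x, terms in zip(x_ints, weight_terms):
        for sign, exp in terms:
            shifted = x << (FRAC_BITS + exp)  # x * 2**exp, scaled by 2**FRAC_BITS
            acc += shifted if sign > 0 else -shifted
    return acc / (1 << FRAC_BITS)  # undo the fixed-point scale once at the end


if __name__ == "__main__":
    weights = [0.3, 0.7, -0.6]
    coded = [two_term_pot(w) for w in weights]
    for w, terms in zip(weights, coded):
        print(w, terms, terms_value(terms))
    print(shift_add_dot([4, -2, 8], coded))
```

In this sketch the weight 0.3 is coded as 2^-2 + 2^-4 = 0.3125, whereas a single PoT term would give 0.25. The sketch only shows how a residual term refines a scalar value without any multiplier; it says nothing about the angular-resolution analysis or the accuracy results reported above.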