Harmonia: Algorithm-Hardware Co-Design for Memory- and Compute-Efficient BFP-based LLM Inference

Large Language Models (LLMs) are powerful but incur high memory and computation costs. Quantization is an effective solution, with INT weights and FP activations being widely adopted to preserve accuracy. Prior works further reduce FP overhead by using block floating point (BFP) activations in linear layers, but fail to extend BFP to attention layers due to severe accuracy degradation, limiting overall efficiency. To address this challenge, we propose Harmonia, an algorithm-hardware co-design framework that enables all-layer BFP activations with a configurable hardware architecture. First, we systematically explore BFP configurations to achieve a better trade-off between accuracy and activation compression across all layers. Second, to reduce KV-cache storage and computation in attention layers, we introduce an asymmetric bit-allocation strategy combined with a hybrid offline-online outlier smoothing technique. This allows aggressive KV-cache compression from FP16 to 4-bit-mantissa BFP with only 0.3% average accuracy loss. Third, to fully exploit all-layer BFP activations, we design dedicated hardware components, including a reconfigurable PE supporting mixed data formats (BFP-INT and BFP-BFP), a real-time FP16-to-BFP converter, and a tiling-aware dataflow to reduce memory traffic. We evaluate Harmonia on GEMM operations in both linear and attention layers across eight widely used LLMs. Compared with prior works, Harmonia achieves on average 3.84x (up to 5.05x) higher area efficiency, 2.03x (up to 3.90x) better energy efficiency, and 3.08x (up to 4.62x) speedup.
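For readers unfamiliar with BFP, the following NumPy sketch illustrates the general idea behind the format: every block of values shares one exponent, and each value keeps only a short integer mantissa, so a dot product reduces to integer multiply-accumulate plus one exponent addition per block. This is a generic illustration, not Harmonia's converter or PE. The function names, the block size of 32, the round-to-nearest rule, and the sign-plus-magnitude mantissa layout are all assumptions made for the example.

```python
import numpy as np


def bfp_quantize(x, block_size=32, mantissa_bits=4):
    """Quantize a 1-D array to a generic block floating point layout.

    Returns (mant, exp): integer mantissas of shape (num_blocks, block_size)
    and one shared exponent per block, shape (num_blocks, 1).
    Assumption: mantissa_bits counts magnitude bits; the sign is kept separately.
    """
    x = np.asarray(x, dtype=np.float32)
    assert x.size % block_size == 0, "pad the input to a multiple of block_size"
    blocks = x.reshape(-1, block_size)

    # Shared exponent: taken from the largest magnitude in each block.
    # frexp writes max_abs = f * 2**exp with f in [0.5, 1); zero gives exp = 0.
    max_abs = np.max(np.abs(blocks), axis=1, keepdims=True)
    _, exp = np.frexp(max_abs)

    # One mantissa step equals 2**(exp - mantissa_bits).
    step = np.ldexp(np.float32(1.0), exp - mantissa_bits)
    qmax = 2 ** mantissa_bits - 1
    mant = np.clip(np.rint(blocks / step), -qmax, qmax).astype(np.int32)
    return mant, exp.astype(np.int32)


def bfp_dequantize(mant, exp, mantissa_bits=4):
    """Reconstruct float32 values from mantissas and shared exponents."""
    return np.ldexp(mant.astype(np.float32), exp - mantissa_bits).ravel()


def bfp_dot(mant_a, exp_a, mant_b, exp_b, mantissa_bits=4):
    """Dot product of two BFP vectors quantized with the same block size.

    Inside a block only integers are multiplied and accumulated; the two
    shared exponents are added once per block to scale the partial sum.
    """
    acc = np.sum(mant_a.astype(np.int64) * mant_b.astype(np.int64), axis=1)
    block_exp = (exp_a + exp_b).ravel() - 2 * mantissa_bits
    return float(np.sum(np.ldexp(acc.astype(np.float64), block_exp)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.standard_normal(64).astype(np.float32)
    b = rng.standard_normal(64).astype(np.float32)
    qa, ea = bfp_quantize(a)
    qb, eb = bfp_quantize(b)
    print("float dot:", float(a @ b))
    print("BFP dot:  ", bfp_dot(qa, ea, qb, eb))
```

Because the shared exponent follows the largest value in a block, a single outlier coarsens the quantization step for every other element in that block. This is the general reason outlier handling matters when mantissas are as short as 4 bits.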
