ViM-Q: Scalable Algorithm-Hardware Co-Design for Vision Mamba Model Inference on FPGA

Source: arXiv. Accepted to the IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM 2026). Code: https://github.com/shengzhelyu65/ViM-Q-FCCM-2026

Vision Mamba (ViM) models offer a compelling efficiency advantage over Transformers by leveraging the linear complexity of State Space Models (SSMs), yet efficiently deploying them on FPGAs remains challenging. Linear layers struggle with dynamic activation outliers that render static quantization ineffective, while uniform quantization fails to capture the weight distribution at low bit-widths. Furthermore, while associative scan accelerates SSMs on GPUs, its memory access patterns are misaligned with the streaming dataflow required by FPGAs. To address these challenges, we present ViM-Q, a scalable algorithm-hardware co-design for end-to-end ViM inference on the edge. We introduce a hardware-aware quantization scheme combining dynamic per-token activation quantization and per-channel smoothing to mitigate outliers, alongside a custom 4-bit per-block Additive Power-of-Two (APoT) weight quantization. The models are deployed on a runtime-parameterizable FPGA accelerator featuring a linear engine employing a Lookup-Table (LUT) unit to replace multiplications with shift-add operations, and a fine-grained pipelined SSM engine that parallelizes the state dimension while preserving sequential recurrence. Crucially, the hardware supports runtime configuration, adapting to diverse dimensions and input resolutions across the ViM family. Implemented on an AMD ZCU102 FPGA, ViM-Q achieves an average 4.96x speedup and 59.8x energy efficiency gain over a quantized NVIDIA RTX 3090 GPU baseline for low-batch inference on ViM-tiny. This co-design shows a viable path for deploying ViM models on resource-constrained edge devices.
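
To make the mechanisms named in the abstract concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It illustrates (1) dynamic per-token activation quantization with a per-channel smoothing vector, (2) a 4-bit weight format whose magnitudes are sums of at most two powers of two, so each product reduces to shifts and adds, and (3) an SSM recurrence that stays sequential over tokens while the work over the state dimension is independent. The codebook `APOT_SHIFTS`, the choice of one block per weight row, the smoothing values, and all function names are assumptions made for illustration; the paper's actual APoT levels, block size, and engine structure may differ.

```python
# Illustrative sketch only: the codebook, block size, and function names are
# assumptions, not taken from the ViM-Q paper or repository.

# Hypothetical 3-bit magnitude codebook (plus one sign bit = 4 bits per weight).
# Each level is a sum of at most two powers of two, stored as shift amounts.
APOT_SHIFTS = [(), (0,), (1,), (2,), (3,), (3, 0), (3, 1), (3, 2)]
APOT_LEVELS = [sum(1 << s for s in shifts) for shifts in APOT_SHIFTS]  # 0,1,2,4,8,9,10,12


def quantize_token(x, smooth):
    """Divide each channel by its smoothing factor, then quantize the token
    to the int8 range with a scale computed at run time from this token alone."""
    smoothed = [v / s for v, s in zip(x, smooth)]
    scale = max(abs(v) for v in smoothed) / 127 or 1.0
    return [round(v / scale) for v in smoothed], scale


def quantize_weight_block(w, smooth):
    """Fold the smoothing factors into the weights, then map each weight to a
    sign and the nearest codebook level under one scale shared by the block."""
    folded = [v * s for v, s in zip(w, smooth)]
    scale = max(abs(v) for v in folded) / APOT_LEVELS[-1] or 1.0
    codes = []
    for v in folded:
        target = abs(v) / scale
        idx = min(range(len(APOT_LEVELS)), key=lambda i: abs(target - APOT_LEVELS[i]))
        codes.append((v < 0, idx))
    return codes, scale


def shift_add(x_int, code):
    """Multiply an integer activation by an APoT weight using shifts and adds only."""
    negative, idx = code
    acc = sum(x_int << s for s in APOT_SHIFTS[idx])
    return -acc if negative else acc


def linear_dot(x, w, smooth):
    """One output element of a linear layer for one token and one weight block."""
    xq, x_scale = quantize_token(x, smooth)
    codes, w_scale = quantize_weight_block(w, smooth)
    acc = sum(shift_add(xi, c) for xi, c in zip(xq, codes))  # integer accumulation
    return acc * x_scale * w_scale  # one dequantization per block


def ssm_scan(a_bar, b_bar, c, x):
    """Recurrence h_t = a_bar_t * h_{t-1} + b_bar_t * x_t, y_t = c_t . h_t for
    one channel. The loop over t is serial; the per-state updates inside each
    step do not depend on each other and could run in parallel lanes."""
    n_state = len(a_bar[0])
    h = [0.0] * n_state
    ys = []
    for t in range(len(x)):
        h = [a_bar[t][n] * h[n] + b_bar[t][n] * x[t] for n in range(n_state)]
        ys.append(sum(c[t][n] * h[n] for n in range(n_state)))
    return ys


if __name__ == "__main__":
    x = [0.5, -1.0, 4.0, 0.25]      # one token; channel 2 acts as an outlier
    w = [0.6, -0.45, 0.05, 0.2]     # one weight block
    smooth = [1.0, 1.0, 2.0, 1.0]   # hypothetical per-channel smoothing factors
    print(linear_dot(x, w, smooth), sum(a * b for a, b in zip(x, w)))
    print(ssm_scan([[0.5, 0.25]] * 3, [[1.0, 1.0]] * 3, [[1.0, 2.0]] * 3, [1.0, 0.0, 0.0]))
```

In this sketch the smoothing vector shrinks the outlier channel before the per-token scale is computed and is folded into the weights, so the product is unchanged in exact arithmetic. The skip connection and the discretization step of a full Mamba block are omitted for brevity.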
