Masked diffusion enables region-specific image synthesis but suffers from computational redundancy: the entire image is processed at every timestep even though only the masked region requires generation. To address this, we introduce MASQ, a hardware-software co-designed accelerator for masked diffusion. Our approach performs stage-wise MXINT8/4/2 precision assignment that dynamically reflects spatial and semantic importance, complemented by timestep-aware scheduling and optimized non-matrix operations. MASQ features a block-wise multi-precision compute engine and a mask management unit that execute this approach efficiently. It achieves up to 16.06x and 5.39x speedups, and 4.18x and 4.93x energy-efficiency gains, over A100 and Orin NX, respectively, while preserving generation quality.
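To make "block-wise MXINT8/4/2 precision assignment" concrete, the following is a minimal sketch of MX-style block quantization with a mask-dependent bit width. Each block of values shares one power-of-two scale, and its elements are stored as signed integers of the chosen width. Several things here are assumptions rather than MASQ's design. The block size of 32 is borrowed from the OCP Microscaling (MX) convention. The exponent rule is a simplification: the OCP spec defines only MXINT8, so MXINT4 and MXINT2 are modeled here by narrowing the element width. The `assign_bits` thresholds are a hypothetical policy based on mask coverage alone, whereas the abstract states that MASQ also weighs semantic importance and the timestep.

```python
import math
from typing import List, Tuple

BLOCK = 32  # assumed block size, following the OCP MX convention


def mx_quantize(block: List[float], bits: int) -> Tuple[int, List[int]]:
    """Quantize one block to a shared power-of-two exponent plus signed `bits`-bit ints."""
    qmax = 2 ** (bits - 1) - 1  # symmetric signed range: 127, 7, or 1
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0, [0] * len(block)
    # Simplified rule: place the largest magnitude in [2^(bits-2), 2^(bits-1)).
    shared_exp = math.floor(math.log2(amax)) - (bits - 2)
    scale = 2.0 ** shared_exp
    # Round to the nearest integer and clamp into the signed range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in block]
    return shared_exp, q


def mx_dequantize(shared_exp: int, q: List[int]) -> List[float]:
    """Recover approximate values by multiplying by the shared scale."""
    scale = 2.0 ** shared_exp
    return [v * scale for v in q]


def assign_bits(mask_block: List[bool]) -> int:
    """Hypothetical policy: blocks covering more of the mask get more bits."""
    frac = sum(mask_block) / len(mask_block)
    if frac >= 0.5:
        return 8  # mostly inside the region being generated
    if frac > 0.0:
        return 4  # boundary block
    return 2  # background block


def quantize_tensor(x: List[float], mask: List[bool]) -> Tuple[List[float], List[int]]:
    """Fake-quantize a flat tensor block by block; returns values and per-block bits."""
    out: List[float] = []
    bits_used: List[int] = []
    for start in range(0, len(x), BLOCK):
        blk = x[start:start + BLOCK]
        bits = assign_bits(mask[start:start + BLOCK])
        e, q = mx_quantize(blk, bits)
        out.extend(mx_dequantize(e, q))
        bits_used.append(bits)
    return out, bits_used
```

With a 96-element input whose first block is fully masked, second block partly masked, and third block unmasked, `quantize_tensor` assigns 8, 4, and 2 bits, respectively. Because the exponent is chosen from the block maximum, every element's reconstruction error stays within one step of the shared scale, so low-precision blocks degrade gracefully instead of overflowing.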